Abstract
Why do people know as much as they do on the basis of the scant information they receive? This Plato's problem has received a variety of answers. Latent Semantic Analysis (LSA) uses the linear-algebra technique of singular value decomposition to show that reduction of dimensionality helps reveal latent semantic relations. Two examples illustrate this. The first analyzes the titles of nine articles covering two topics, human-computer interaction and mathematical graph theory: two words that were originally unrelated become highly correlated (.90) after processing. The second analyzes the relationships among errors made by Chinese learners of English: after dimension reduction, the developmental trends in spelling errors, misuse of words, and syntactic construction among learners at five proficiency levels are better explained. LSA has a wide range of applications in text processing.
The “Plato's problem” -- how do people know as much as they do with as little information as they get? -- also known as “the poverty of the stimulus”, “negative evidence”, or “the logical problem of language acquisition”, has aroused the interest of many philosophers, psychologists, linguists, and computational scientists. Nativism is the answer provided by Chomsky, but psychologists like MacWhinney and computational linguists like Sampson offer different explanations. Quine calls the problem “the scandal of induction”, whereas Shepard maintains that a general theory of generalization and similarity is as necessary to psychology as Newton's laws are to physics. However, accepting the hereditary nature of the language propensity does not solve the general problem of generalization and similarity -- the problem of categorization. Many models have been suggested to find a mechanism by which a set of stimuli, words, or concepts come to be treated as similar. They attempt to postulate constraints that can narrow the solution space of the problem that is to be solved by induction. Latent semantic analysis (LSA), put forth by Landauer et al., is “a high-dimensional linear associative model that embodies no human knowledge beyond its general learning mechanism, to analyze a large corpus of natural text and generate a representation that captures the similarity of words and text passages.” The model employs a statistical technique of linear algebra known as singular value decomposition (SVD). The input to LSA is a matrix A whose rows represent unitary event types and whose columns represent the contexts in which instances of those event types appear. SVD decomposes this matrix into the product of three matrices, A = U w V^T, and reduction of dimensionality is carried out when the original matrix is reconstructed from only the largest singular values. To illustrate the power of reduction of dimensionality, two examples are given.
In the example given by Landauer, the text input is the titles of nine technical articles, five about human-computer interaction and four about mathematical graph theory. LSA shows how, in the two-dimensionally reconstructed matrix, two words that were totally uncorrelated in the original are quite strongly correlated (r = .90) in the reconstructed approximation. The other example is the use of SVD in a preliminary study of the relationships among errors made by Chinese learners of English. Reduction of dimensionality offers a better explanation of the developmental trends in spelling errors, misuse of words, and syntactic construction among five different types of learners. LSA has a wide range of applications in text processing.
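The dimension-reduction step described above can be sketched in a few lines of code. This is a minimal illustration, not Landauer's actual data: the toy term-by-context count matrix, the word labels, and the choice of k = 2 are assumptions made for the example.

```python
import numpy as np

# Toy term-by-context count matrix A: rows are words (event types),
# columns are contexts in which the words appear. Illustrative only.
A = np.array([
    [1, 1, 0, 0, 0],   # "human"
    [1, 0, 1, 0, 0],   # "computer"
    [0, 1, 1, 0, 0],   # "interface"
    [0, 0, 0, 1, 1],   # "graph"
    [0, 0, 1, 1, 0],   # "tree"
], dtype=float)

# Singular value decomposition: A = U @ diag(w) @ Vt.
U, w, Vt = np.linalg.svd(A, full_matrices=False)

# Reduction of dimensionality: keep only the k largest singular values
# and reconstruct a rank-k approximation of the original matrix.
k = 2
A_k = U[:, :k] @ np.diag(w[:k]) @ Vt[:k, :]

# Word-word similarity before and after reduction: correlate the row
# vectors of two words across contexts.
r_before = np.corrcoef(A[0], A[2])[0, 1]
r_after = np.corrcoef(A_k[0], A_k[2])[0, 1]
print(r_before, r_after)
```

In the reconstructed matrix, each word's row vector is smoothed by the latent dimensions it shares with other words, which is how two words that never co-occur can nevertheless end up correlated.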
Source
《现代外语》
CSSCI
Peking University Core Journals
2003, No. 1, pp. 76-84 (9 pages)
Modern Foreign Languages
Keywords
Plato's problem
similarity
induction
latent semantic analysis
singular value decomposition
Plato's problem, similarity, induction, latent semantic analysis, singular value decomposition