Abstract
Traditional text classification relies on the surface features of a text. The most common approach is based on the vector space model (VSM): texts are represented as vectors, and the category of a document is determined by comparing similarities. To overcome the term-independence assumption of the VSM, this paper proposes a text classification model based on latent semantic indexing (LSI). By statistically analyzing a large text collection, the model reveals the contextual-usage meaning of words; by applying singular value decomposition (SVD), it effectively reduces the dimensionality of the vector space and removes the influence of synonymy and polysemy, thereby improving the accuracy of text classification.
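The pipeline the abstract describes (term-document matrix, truncated SVD, similarity comparison in the reduced space) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the toy corpus, the use of raw term counts without weighting, the choice k = 2, and the nearest-neighbor decision rule by cosine similarity are all assumptions made for illustration.

```python
import numpy as np

def build_term_doc_matrix(docs, vocab):
    """Raw term-frequency matrix: rows are terms, columns are documents."""
    index = {t: i for i, t in enumerate(vocab)}
    A = np.zeros((len(vocab), len(docs)))
    for j, doc in enumerate(docs):
        for tok in doc.split():
            if tok in index:          # out-of-vocabulary tokens are ignored
                A[index[tok], j] += 1.0
    return A

def fit_lsi(A, k):
    """Rank-k truncated SVD, A ~ U_k diag(s_k) V_k^T.
    Returns U_k, s_k, and the document coordinates (rows of V_k)."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return U[:, :k], s[:k], Vt[:k, :].T

def fold_in(q, U_k, s_k):
    """Project a new term vector into the latent space: q^T U_k diag(s_k)^-1."""
    return (q @ U_k) / s_k

def cosine(a, b):
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / denom) if denom > 0 else 0.0

def classify(text, vocab, U_k, s_k, doc_coords, labels):
    """Assign the label of the most similar training document (assumed rule)."""
    q = build_term_doc_matrix([text], vocab)[:, 0]
    q_hat = fold_in(q, U_k, s_k)
    sims = [cosine(q_hat, d) for d in doc_coords]
    return labels[int(np.argmax(sims))]

# Hypothetical toy corpus: "car" and "automobile" act as synonyms.
train_docs = [
    "car engine wheel",
    "automobile engine wheel",
    "car automobile road",
    "bank money loan",
    "bank loan interest",
]
train_labels = ["auto", "auto", "auto", "finance", "finance"]
vocab = sorted({tok for doc in train_docs for tok in doc.split()})

A = build_term_doc_matrix(train_docs, vocab)
U_k, s_k, doc_coords = fit_lsi(A, k=2)

print(classify("automobile road", vocab, U_k, s_k, doc_coords, train_labels))
print(classify("money interest", vocab, U_k, s_k, doc_coords, train_labels))
```

In the reduced space a query can match a training document with which it shares no terms, which is the synonymy effect the abstract refers to. A real system would normally add term weighting and choose k empirically.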
Source
Computer and Information Technology (《电脑与信息技术》)
2006, No. 5, pp. 32-34, 38 (4 pages)
Keywords
latent semantic indexing (LSI)
text classification
singular value decomposition