Abstract
The Latent Semantic Indexing (LSI) model can partly resolve polysemy and synonymy and can filter out some of the noise in raw documents. However, LSI discards some features that contribute strongly to classification, because the singular values (eigenvalues) associated with them are small. To address this problem, this paper proposes a text classification model that extends LSI. While retaining as much document information as possible, the model additionally takes the class information of the training documents into account, so it represents the latent semantic structure of the original document space better than the LSI model does.
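The effect described in the abstract can be illustrated with a toy sketch. The data, the function names, and the choice of a first partial least squares (PLS) direction proportional to X^T y are all assumptions made for illustration (the keywords mention partial least squares, but the abstract does not spell out the paper's formulation). The matrix X holds four documents over two centered term weights: term 0 varies a lot but is unrelated to the class, while term 1 varies little but determines the class.

```python
import math

def matvec(X, v):
    """Compute X v for a row-major matrix X (documents x terms)."""
    return [sum(x * w for x, w in zip(row, v)) for row in X]

def matvec_t(X, u):
    """Compute X^T u."""
    return [sum(X[i][j] * u[i] for i in range(len(X))) for j in range(len(X[0]))]

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def top_lsi_direction(X, iters=50):
    """Leading right singular vector of X via power iteration on X^T X.
    A rank-1 LSI projection keeps only this direction."""
    v = normalize([1.0] * len(X[0]))
    for _ in range(iters):
        v = normalize(matvec_t(X, matvec(X, v)))
    return v

def first_pls_direction(X, y):
    """Assumed first PLS weight vector: the direction of maximal
    covariance between the term weights and the class labels."""
    return normalize(matvec_t(X, y))

# Toy data (hypothetical): rows are documents, columns are centered term weights.
X = [[5.0, 1.0], [-5.0, 1.0], [5.0, -1.0], [-5.0, -1.0]]
y = [1.0, 1.0, -1.0, -1.0]  # class labels of the four documents

lsi_scores = matvec(X, top_lsi_direction(X))
pls_scores = matvec(X, first_pls_direction(X, y))

print([round(s, 3) for s in lsi_scores])  # [5.0, -5.0, 5.0, -5.0]
print([round(s, 3) for s in pls_scores])  # [1.0, 1.0, -1.0, -1.0]
```

In this sketch the one-dimensional LSI projection mixes the two classes, because the discriminative term belongs to the smaller singular value and is dropped. The label-aware direction keeps it, and the projected scores separate the classes by sign.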
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journal (北大核心)
2004, Issue z1, pp. 99-102 (4 pages)
Journal of South China University of Technology (Natural Science Edition)
Keywords
text classification
latent semantic indexing
partial least squares