Abstract
Based on an analysis of the global and local latent semantic indexing (LSI) models, a new difference LSI model is proposed that incorporates class information into the terms. Medical web pages serve as the test collection: the text of each page is extracted and represented with both the global model and the difference model. SVD and SLSI are used to reduce the dimensionality of the feature space, an SVM classifier is applied to the feature vectors of the test collection, and classification accuracy and macro-averaged F1 are computed. The experiments show that, under either dimensionality-reduction technique, the difference model yields clearly higher accuracy and macro-averaged F1 than the global model. However, combining the difference model with SLSI does not yield any further improvement in accuracy or macro-averaged F1.
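The two evaluation measures named in the abstract can be illustrated with a short sketch. It uses the standard definitions of accuracy and macro-averaged F1 (the unweighted mean of per-class F1 scores). The abstract does not give the paper's exact formulas, so this is an assumption, and the class labels and predictions below are hypothetical, not data from the paper.

```python
from collections import Counter


def accuracy(y_true, y_pred):
    """Fraction of documents whose predicted class equals the true class."""
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (standard definition, assumed here)."""
    classes = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1   # correct prediction for class t
        else:
            fp[p] += 1   # p was predicted but is wrong
            fn[t] += 1   # t was the true class but was missed
    scores = []
    for c in classes:
        # F1 = 2*TP / (2*TP + FP + FN); guard against an empty class
        denom = 2 * tp[c] + fp[c] + fn[c]
        scores.append(2 * tp[c] / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Hypothetical labels for six test documents (illustration only).
y_true = ["cardiology", "cardiology", "oncology", "oncology", "neurology", "neurology"]
y_pred = ["cardiology", "oncology", "oncology", "oncology", "neurology", "cardiology"]

print(accuracy(y_true, y_pred))
print(macro_f1(y_true, y_pred))
```

Because macro-averaging weights every class equally, a small class that is classified poorly lowers the score as much as a large one, which is one reason it is commonly reported alongside accuracy.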
Source
《烟台大学学报(自然科学与工程版)》
CAS
2008, No. 2, pp. 125-129 (5 pages)
Journal of Yantai University(Natural Science and Engineering Edition)
Funding
National Natural Science Foundation of China (Grant No. 60772028)
Natural Science Foundation of Shandong Province (Grant No. Y2006G22)
Keywords
latent semantic indexing
difference model
text categorization
SVM algorithm