Abstract
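To address the problem that feature extraction ignores the semantics of feature terms, an improved feature extraction algorithm based on the Latent Dirichlet Allocation (LDA) model is proposed. Based on the latent topic distribution of a document, the algorithm represents the document as a mixture, in specific proportions, of latent topics and the word distributions under those topics: a document is generated by selecting a topic with a certain probability and then selecting a word from that topic with a certain probability. In addition, because LDA treats all feature terms "equally", Gaussian weighting is applied to the LDA model. Experimental results show that, compared with the TF-IDF algorithm and the information gain method, the algorithm extracts more effective features and improves classification accuracy.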
A feature extraction algorithm based on an improved Latent Dirichlet Allocation (LDA) model is proposed to tackle the problem that the term frequency-inverse document frequency (TF-IDF) method ignores feature semantics in feature extraction. Based on the underlying topic distribution of a document, the proposed algorithm converts the text into a collection of latent topics and topic words distributed in certain proportions, and performs feature extraction according to the key semantics. In addition, because LDA treats all feature items as "equal", Gaussian weighting is applied to the model. Experimental results show that, compared with TF-IDF and the information gain method, the proposed algorithm extracts more effective features and improves classification accuracy.
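The two-step generative view that the abstract describes (select a topic with some probability, then select a word from that topic with some probability) can be illustrated with a minimal sketch. The function name `generate_document`, the topic labels, the vocabulary, and all probabilities below are hypothetical and chosen only for illustration; they are not taken from the paper. The sketch also fixes the distributions by hand, whereas a full LDA model draws them from Dirichlet priors, and it does not attempt to reproduce the paper's Gaussian weighting, whose form the abstract does not specify.

```python
import random

def generate_document(doc_topic, topic_word, length, seed=None):
    """Sample `length` words using the two-step generative story of a topic model."""
    rng = random.Random(seed)
    topics = list(doc_topic)
    topic_weights = [doc_topic[t] for t in topics]
    words = []
    for _ in range(length):
        # Step 1: pick a topic according to the document's topic distribution.
        topic = rng.choices(topics, weights=topic_weights, k=1)[0]
        # Step 2: pick a word according to that topic's word distribution.
        vocab = list(topic_word[topic])
        word_weights = [topic_word[topic][w] for w in vocab]
        words.append(rng.choices(vocab, weights=word_weights, k=1)[0])
    return words

# Hypothetical toy distributions, invented for illustration only.
doc_topic = {"sports": 0.7, "finance": 0.3}
topic_word = {
    "sports": {"match": 0.5, "team": 0.3, "score": 0.2},
    "finance": {"stock": 0.6, "bank": 0.4},
}

doc = generate_document(doc_topic, topic_word, length=8, seed=0)
print(doc)
```

Feature extraction runs this story in reverse: given observed documents, the topic and word distributions are inferred, and the inferred topic proportions can then serve as semantically informed features for a classifier such as an SVM.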
Source
Journal of Xi’an University of Posts and Telecommunications
2015, No. 6, pp. 79-81, 120 (4 pages)
Funding
Xi'an Science and Technology Plan Project (CXY1437(8))
Keywords
text classification, feature extraction, latent Dirichlet allocation (LDA), support vector machine (SVM)