
Research on text classification improvement method based on mutual information (Cited by: 1)
Abstract: The traditional Latent Dirichlet Allocation (LDA) topic model does not account for the influence of word frequency on topic classification, so the topic distribution is skewed toward high-frequency words. To comprehensively account for the correlation between word frequency and topic, this paper exploits the ability of mutual information to express correlation between variables and, on that basis, builds an improved feature selection method. An evaluation function scores the weight of each feature word, and this weighting is used to improve the classification stage of the LDA algorithm, strengthening the role of the feature words that contribute most to topic classification. Classification experiments on a news corpus demonstrate the effectiveness of the method and show that classification accuracy is also improved.
Source: Microcomputer & Its Applications (《微型机与应用》), 2017, No. 19, pp. 19-22 (4 pages)
Keywords: topic model; word frequency; mutual information; feature selection
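
The core idea in the abstract (score each candidate feature word by its mutual information with the class label, then use those scores as weights so that informative words, rather than merely frequent ones, contribute more to classification) can be sketched as below. This is a minimal illustration under assumed details: the toy corpus, the binarized term-occurrence model, and the normalized-MI weighting are assumptions made for illustration, not the paper's exact evaluation function or its modified LDA procedure.

```python
# Minimal, illustrative sketch of mutual-information-based feature selection and
# feature-word weighting for text classification. Corpus and weighting scheme are
# hypothetical; they are not the paper's exact method.
import numpy as np
from collections import Counter

def mutual_information(term_doc: np.ndarray, labels: np.ndarray) -> np.ndarray:
    """Mutual information between binary term occurrence and the class label.

    term_doc: (n_docs, n_terms) matrix of raw term counts.
    labels:   (n_docs,) integer class labels.
    Returns one MI score per term.
    """
    occurs = (term_doc > 0).astype(float)            # binarize: term present / absent
    n_docs, n_terms = occurs.shape
    p_t = occurs.mean(axis=0)                        # P(term present)
    mi = np.zeros(n_terms)
    for c in np.unique(labels):
        in_c = labels == c
        p_c = in_c.mean()                            # P(class = c)
        p_tc = occurs[in_c].sum(axis=0) / n_docs     # P(term present, class = c)
        # Sum contributions for (term present, c) and (term absent, c).
        for joint, marg in ((p_tc, p_t), (p_c - p_tc, 1.0 - p_t)):
            with np.errstate(divide="ignore", invalid="ignore"):
                contrib = joint * np.log(joint / (marg * p_c))
            mi += np.where(joint > 0, contrib, 0.0)
    return mi

# Toy corpus: four documents, two classes (0 = sports, 1 = finance).
docs = ["goal match team", "match team win", "stock market price", "market price rise"]
labels = np.array([0, 0, 1, 1])
vocab = sorted({w for d in docs for w in d.split()})
index = {w: i for i, w in enumerate(vocab)}
X = np.zeros((len(docs), len(vocab)))
for d_i, d in enumerate(docs):
    for w, cnt in Counter(d.split()).items():
        X[d_i, index[w]] = cnt

mi = mutual_information(X, labels)
top_k = 4
selected = np.argsort(mi)[::-1][:top_k]              # keep the k most informative terms
weights = mi[selected] / mi[selected].sum()          # evaluation scores as normalized weights
X_weighted = X[:, selected] * weights                # up-weight informative feature words
print([vocab[i] for i in selected], np.round(weights, 3))
```

In the paper's setting, such weights would then feed the LDA classification stage so that high-MI feature words carry more influence than raw high-frequency words; the snippet above only illustrates the feature-selection and weighting side of that idea.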

