Abstract
The traditional Latent Dirichlet Allocation (LDA) topic model does not account for the effect of word frequency on topic classification, so its topic distributions skew toward high-frequency words. To account for both word frequency and correlation with topics, this paper exploits the ability of mutual information to express the correlation between variables and adapts it into a feature selection method. An evaluation function scores the weight of each feature word, and these weights are used to modify the classification process of the LDA algorithm so that feature words contributing more to topic classification carry greater influence. Classification experiments on a news corpus demonstrate the effectiveness of the method and show an improvement in classification accuracy.
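The abstract does not give the paper's modified mutual information formula or its evaluation function, so neither is reproduced here. As a rough illustration of the starting point only, the following minimal sketch computes the standard mutual information between a term's presence in a document and the document's class label, then keeps the highest-scoring terms as features. The function names `mutual_information` and `select_top_k`, and the assumption that documents arrive as token lists with one label each, are hypothetical and chosen for this example.

```python
import math
from collections import Counter


def mutual_information(docs, labels):
    """Return {term: I(term presence; class)} in nats.

    Assumption: `docs` is a list of token lists and `labels` holds one
    class label per document. This is the textbook definition, not the
    modified measure proposed in the paper.
    """
    n = len(docs)
    class_counts = Counter(labels)  # label -> number of documents
    term_docs = Counter()           # term -> number of documents containing it
    term_class = Counter()          # (term, label) -> documents in label containing term

    for tokens, label in zip(docs, labels):
        for term in set(tokens):    # count presence once per document
            term_docs[term] += 1
            term_class[(term, label)] += 1

    scores = {}
    for term, n_t in term_docs.items():
        mi = 0.0
        for label, n_c in class_counts.items():
            n_tc = term_class[(term, label)]
            # Sum over the two cells for this class: term present, term absent.
            for joint, marginal in ((n_tc, n_t), (n_c - n_tc, n - n_t)):
                if joint > 0:       # empty cells contribute nothing
                    mi += (joint / n) * math.log(joint * n / (marginal * n_c))
        scores[term] = mi
    return scores


def select_top_k(scores, k):
    """Return the k highest-scoring terms, ties broken alphabetically."""
    return sorted(scores, key=lambda t: (-scores[t], t))[:k]
```

A term that appears only in one class scores high, while a term spread evenly across classes scores near zero regardless of how frequent it is, which is the property that makes mutual information a candidate for counteracting the bias toward high-frequency words.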
Source
《微型机与应用》
2017, Issue 19, pp. 19-22 (4 pages)
Microcomputer & Its Applications
Keywords
topic model
word frequency
mutual information
feature selection