News topic mining method based on a weighted latent Dirichlet allocation model
Cited by: 14
Abstract: To address the low accuracy and poor topic interpretability of traditional news topic mining, a method based on a weighted Latent Dirichlet Allocation (LDA) model is proposed that exploits the structural conventions of news reports. First, word weights are improved from several angles and combined into a composite weight, which extends the process by which the LDA model generates feature words so that more expressive words are obtained. Second, the Category Distinguish Word (CDW) method is applied to optimize the word order of the modeling results, which removes topic ambiguity and noise and improves topic interpretability. Finally, based on the mathematical properties of the model's topic probability distributions, topics are quantified in terms of the contribution of documents to each topic and the topic weight probability, in order to identify hot topics. Simulation experiments show that, compared with the traditional LDA model, the improved method lowers the false negative rate and the false positive rate by an average of 1.43% and 0.16% respectively, and lowers the minimum normalized cost by an average of 2.68%, which supports the feasibility and effectiveness of the method.
Source: Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2014, No. 5, pp. 1354-1359 (6 pages)
Keywords: news report; topic mining; weighted Latent Dirichlet Allocation (LDA) model; Category Distinguish Word (CDW); word order optimization
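
The false negative rate, false positive rate and minimum cost figure (最小标准代价) quoted in the abstract resemble the detection-cost measures commonly used in topic detection and tracking (TDT) evaluation. The abstract does not say how the cost was computed, so the following is only a minimal sketch of the usual TDT-style normalized detection cost. The function names and the default parameters (`c_miss=1.0`, `c_fa=0.1`, `p_target=0.02`) are assumptions taken from common TDT practice, not from the paper.

```python
from typing import Iterable, Tuple


def normalized_detection_cost(
    p_miss: float,
    p_fa: float,
    c_miss: float = 1.0,     # assumed cost of a miss (common TDT setting)
    c_fa: float = 0.1,       # assumed cost of a false alarm (common TDT setting)
    p_target: float = 0.02,  # assumed prior probability of an on-topic story
) -> float:
    """Return the detection cost normalized by the best trivial system."""
    # Expected cost of the system at this operating point.
    raw = c_miss * p_miss * p_target + c_fa * p_fa * (1.0 - p_target)
    # Cost of the cheaper trivial system: always answer "no" or always answer "yes".
    baseline = min(c_miss * p_target, c_fa * (1.0 - p_target))
    return raw / baseline


def min_normalized_cost(points: Iterable[Tuple[float, float]]) -> float:
    """Return the smallest normalized cost over (p_miss, p_fa) operating points."""
    return min(normalized_detection_cost(p_miss, p_fa) for p_miss, p_fa in points)


if __name__ == "__main__":
    # Hypothetical operating points, for illustration only (not data from the paper).
    operating_points = [(0.20, 0.01), (0.10, 0.05)]
    print(min_normalized_cost(operating_points))
```

Under these assumed settings a system that misses every on-topic story and raises no false alarms scores exactly 1, so values below 1 indicate an improvement over the trivial baseline.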