Abstract
Hot news topics on the internet often contain several subtopics that are difficult to tell apart. To address this problem, this paper proposes a subtopic division method based on the Latent Dirichlet Allocation (LDA) model. First, news documents are modeled with LDA, and a standard Bayesian method is used to determine the optimal number of topics so that the model fits the documents as well as possible. Second, because documents from different subtopics are highly similar in text, a correlation analysis of topic feature words is introduced, and an improved Kullback-Leibler (KL) distance is used to compute the similarity between news documents; this effectively distinguishes reports that have similar content but a different topical focus. Finally, the documents are clustered with the single-pass incremental clustering algorithm to obtain the subtopics. Experiments verify the effectiveness of the improved similarity calculation, and the results show that the method effectively improves the accuracy of subtopic division for hot news topics compared with the baseline method.
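The last two steps of the pipeline (a KL-based distance between per-document topic distributions, followed by single-pass incremental clustering) can be illustrated with a minimal sketch. The abstract does not give the improved KL formula or the feature-word correlation weighting, so the code below substitutes the ordinary symmetrised KL divergence as a stand-in. The function names, the `threshold` value, and the toy topic distributions are all assumptions made for illustration and are not taken from the paper.

```python
import math


def symmetric_kl(p, q, eps=1e-12):
    """Symmetrised KL divergence between two discrete distributions.

    Stand-in for the paper's improved KL distance, which the abstract
    does not specify. `eps` guards against log(0).
    """
    kl_pq = sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q))
    kl_qp = sum(qi * math.log((qi + eps) / (pi + eps)) for pi, qi in zip(p, q))
    return 0.5 * (kl_pq + kl_qp)


def single_pass(docs, threshold):
    """Single-pass incremental clustering over topic distributions.

    Each document is compared with the current cluster centroids in
    arrival order. It joins the closest cluster if the distance is at
    most `threshold` (a hypothetical tuning parameter); otherwise it
    starts a new cluster.
    """
    centroids = []  # running mean topic distribution of each cluster
    sizes = []      # number of documents in each cluster
    labels = []     # cluster index assigned to each document
    for theta in docs:
        best, best_dist = None, float("inf")
        for k, centroid in enumerate(centroids):
            dist = symmetric_kl(theta, centroid)
            if dist < best_dist:
                best, best_dist = k, dist
        if best is not None and best_dist <= threshold:
            n = sizes[best]
            # Update the centroid as the running mean of its members.
            centroids[best] = [
                (c * n + t) / (n + 1) for c, t in zip(centroids[best], theta)
            ]
            sizes[best] = n + 1
            labels.append(best)
        else:
            centroids.append(list(theta))
            sizes.append(1)
            labels.append(len(centroids) - 1)
    return labels


# Toy document-topic distributions (made-up values, not from the paper).
toy_docs = [
    [0.80, 0.10, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.10, 0.80],
    [0.10, 0.20, 0.70],
]
toy_labels = single_pass(toy_docs, threshold=0.1)
```

With these toy inputs the first two documents fall into one cluster and the last two into another. In practice the rows of `toy_docs` would come from a fitted LDA model, and the result depends on both the threshold and the order in which documents arrive, which is a known property of single-pass clustering.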
Source
Journal of Chinese Computer Systems (《小型微型计算机系统》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2013, Issue 4, pp. 732-737 (6 pages)
Funding
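Supported by the National Natural Science Foundation of China (60873247)
Supported by the Natural Science Foundation of Shandong Province (ZR2009GZ007)
Supported by the Science and Technology Project of the Shandong Provincial Department of Education (J09LG52)
Supported by the Shandong Province High-Tech Independent Innovation Special Project (2008ZZ28)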