Subtopic Division in News Topics Based on Latent Dirichlet Allocation (cited by: 18)
Abstract: Within a hot news topic on the internet, it is often hard to tell apart the several subtopics that the topic contains. To address this, the paper proposes a subtopic division method based on the Latent Dirichlet Allocation (LDA) model. First, news documents are modeled with LDA, and a standard Bayesian method is used to determine the optimal number of topics so that the model fits the documents as well as possible. Second, because documents from different subtopics tend to be highly similar in text, a correlation analysis of topic feature words is introduced, and an improved Kullback-Leibler (KL) distance formula is used to compute the similarity between news documents; this separates reports that have similar content but a different topical focus. Finally, the documents are clustered with the single-pass incremental clustering algorithm to obtain the subtopics. Experiments verify the effectiveness of the improved similarity calculation, and the results show that the method improves the accuracy of subtopic division for hot news topics compared with the baseline method.
Source: Journal of Chinese Computer Systems (《小型微型计算机系统》), indexed in CSCD and the Peking University Core Journals list, 2013, Issue 4, pp. 732-737 (6 pages)
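Funding: National Natural Science Foundation of China (60873247); Natural Science Foundation of Shandong Province (ZR2009GZ007); Science and Technology Project of the Shandong Provincial Department of Education (J09LG52); Shandong Province High-Tech Independent Innovation Special Project (2008ZZ28)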
Keywords: latent Dirichlet allocation (LDA); subtopic division; topic feature words; Kullback-Leibler distance; similarity calculation
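
The last two stages described in the abstract (KL-based document similarity followed by single-pass incremental clustering) can be illustrated with a minimal sketch. It assumes each document has already been reduced to a topic distribution by an LDA model. The abstract does not give the paper's improved KL formula or its feature-word correlation term, so the sketch substitutes a plain symmetric KL distance; the mapping `1 / (1 + distance)`, the running-mean centroid update, the threshold value, and all function names are illustrative assumptions rather than the authors' method.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    # eps guards against log(0) when a topic has zero probability in q.
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


def symmetric_kl(p, q):
    """Symmetrised KL distance (plain version, not the paper's improved one)."""
    return 0.5 * (kl_divergence(p, q) + kl_divergence(q, p))


def similarity(p, q):
    """Assumed mapping from distance to a similarity score in (0, 1]."""
    return 1.0 / (1.0 + symmetric_kl(p, q))


def single_pass(doc_topics, threshold):
    """Single-pass incremental clustering over document-topic distributions.

    Each document is compared once, in arrival order, with the centroids of
    the existing clusters. It joins the most similar cluster if the similarity
    reaches the threshold; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"centroid": [...], "members": [doc indices]}
    for idx, theta in enumerate(doc_topics):
        best, best_sim = None, -1.0
        for cluster in clusters:
            sim = similarity(theta, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            n = len(best["members"])
            # Running mean keeps the centroid a valid probability distribution.
            best["centroid"] = [(c * n + t) / (n + 1)
                                for c, t in zip(best["centroid"], theta)]
            best["members"].append(idx)
        else:
            clusters.append({"centroid": list(theta), "members": [idx]})
    return [cluster["members"] for cluster in clusters]


# Hypothetical topic distributions for four news documents over three topics.
docs = [
    [0.80, 0.10, 0.10],
    [0.75, 0.15, 0.10],
    [0.10, 0.80, 0.10],
    [0.12, 0.78, 0.10],
]
subtopics = single_pass(docs, threshold=0.8)
print(subtopics)
```

Because single-pass clustering reads each document only once, its result depends on the arrival order and on the threshold, so both would need tuning on real data.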
