Fast topic detection and evaluation towards massive short texts (cited by: 3)
Abstract: Traditional topic detection research mainly targets long texts and news datasets. Large-scale short texts are sparse, unstructured, and noisy, so traditional methods struggle to detect topics in them effectively. This paper proposes a fast topic detection method, CW-WGN, that combines word co-occurrence with a weighted GN algorithm. It describes the CW-WGN process in detail and gives the concrete algorithms. Real short-text data were collected from Sina Weibo and from news website headlines to build a baseline test dataset, and extensive comparative experiments were run with LDA and K-means as baselines. The results show that CW-WGN finds over 20% more correct topics than LDA and K-means, and that the topics it finds have higher purity. CW-WGN also has the shortest running time, so it can detect topics effectively in real, large-scale short-text data.
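The abstract names the two ingredients of CW-WGN (a word co-occurrence network and a weighted GN community detection step) but not their details. The sketch below is one possible reading, not the authors' algorithm. It assumes that GN refers to Girvan-Newman edge removal, that edge weights are document-level co-occurrence counts, that "weighted" means dividing unweighted edge betweenness by the edge weight, and that splitting stops at a target number of topics. The function names (`cooccurrence_graph`, `edge_betweenness`, `components`, `weighted_gn`) and the toy documents are hypothetical.

```python
from collections import defaultdict, deque
from itertools import combinations


def cooccurrence_graph(docs):
    """Nodes are words; edge weight = number of documents containing both words."""
    weights = defaultdict(int)
    for tokens in docs:
        for a, b in combinations(sorted(set(tokens)), 2):
            weights[(a, b)] += 1
    graph = defaultdict(dict)
    for (a, b), w in weights.items():
        graph[a][b] = w
        graph[b][a] = w
    return dict(graph)


def edge_betweenness(graph):
    """Unweighted shortest-path edge betweenness (Brandes-style accumulation).
    Each node pair is counted from both endpoints, which does not change the ranking."""
    score = defaultdict(float)
    for s in graph:
        dist = {s: 0}
        sigma = defaultdict(float)  # number of shortest paths from s
        sigma[s] = 1.0
        preds = defaultdict(list)
        order = []
        queue = deque([s])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph[v]:
                if w not in dist:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        delta = defaultdict(float)
        for w in reversed(order):
            for v in preds[w]:
                c = sigma[v] / sigma[w] * (1.0 + delta[w])
                score[(v, w) if v < w else (w, v)] += c
                delta[v] += c
    return score


def components(graph):
    """Connected components, each returned as a set of words."""
    seen, result = set(), []
    for start in graph:
        if start in seen:
            continue
        comp, stack = set(), [start]
        while stack:
            v = stack.pop()
            if v not in comp:
                comp.add(v)
                stack.extend(graph[v].keys() - comp)
        seen |= comp
        result.append(comp)
    return result


def weighted_gn(graph, n_topics):
    """Remove the edge with the highest betweenness/weight ratio until the
    graph has n_topics components (assumed weighting and stopping rule)."""
    g = {v: dict(nbrs) for v, nbrs in graph.items()}
    while len(components(g)) < n_topics:
        scores = edge_betweenness(g)
        if not scores:
            break  # no edges left to remove
        a, b = max(scores, key=lambda e: (scores[e] / g[e[0]][e[1]], e))
        del g[a][b]
        del g[b][a]
    return components(g)


# Toy, pre-tokenized short texts: two topics joined by one noisy post.
docs = [
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "rescue", "sichuan"],
    ["earthquake", "sichuan"],
    ["iphone", "apple", "launch"],
    ["iphone", "apple", "launch"],
    ["apple", "launch"],
    ["rescue", "launch"],
]
topics = weighted_gn(cooccurrence_graph(docs), n_topics=2)
print(topics)
```

Dividing betweenness by the co-occurrence count makes rarely co-occurring word pairs the first edges to be cut, which matches the intuition that noisy links between topics are weak. Real short texts would also need word segmentation and stop-word filtering before this step, which the sketch leaves out.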
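Source: Application Research of Computers (《计算机应用研究》), CSCD, Peking University Core Journal, 2015, No. 3, pp. 717-722, 739 (7 pages)
Funding: National Natural Science Foundation of China (61170112); Ministry of Education Humanities and Social Sciences Research Youth Fund (13YJC860006); Beijing Municipal Universities Science and Technology and Graduate Education Innovation Project (PXM2012_014213_000037)
Keywords: short text; topic detection; word co-occurrence; community detection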

