Abstract
Traditional topic detection research focuses mainly on long texts and news datasets. Massive short-text collections are sparse, unstructured, and noisy, so traditional methods struggle to detect topics in them effectively. This paper presents a fast topic detection method, CW-WGN, that combines word co-occurrence with a weighted GN algorithm. It describes the CW-WGN process in detail and gives the concrete algorithms. Real short-text data were collected from Sina Weibo and from the headlines of news websites to build a base test dataset, and extensive experiments were run with LDA and K-means as comparison methods. The results show that CW-WGN finds over 20% more correct topics than LDA and K-means, and that the purity of the detected topics is also higher. In addition, CW-WGN has the shortest running time, so it can efficiently detect topics in real, large-scale short-text data.
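The word co-occurrence step named in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `cooccurrence_weights`, the `min_count` threshold, and the toy tokenized texts are all assumptions made for illustration, and the weighted GN community detection stage is not shown. The sketch only counts how many short texts contain each word pair, which yields edge weights for a word graph.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_weights(docs, min_count=1):
    """Count how many short texts contain each unordered word pair.

    docs is a list of token lists. The result maps (word_a, word_b),
    with word_a < word_b, to the number of texts containing both.
    min_count is a hypothetical pruning threshold, not taken from the paper.
    """
    counts = Counter()
    for tokens in docs:
        # Distinct words only, so a pair counts at most once per text.
        for a, b in combinations(sorted(set(tokens)), 2):
            counts[(a, b)] += 1
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy, already-tokenized short texts (illustrative data only).
docs = [
    ["earthquake", "rescue", "team"],
    ["earthquake", "rescue", "donation"],
    ["football", "final", "goal"],
    ["football", "goal"],
]

edges = cooccurrence_weights(docs, min_count=2)
for (a, b), weight in sorted(edges.items()):
    print(a, b, weight)
```

In a full pipeline, the weighted edges would feed a community detection step such as a weighted variant of the Girvan-Newman algorithm, with each resulting word community read as a candidate topic.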
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2015, No. 3, pp. 717-722, 739 (7 pages)
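Funding
National Natural Science Foundation of China (61170112)
Ministry of Education Humanities and Social Sciences Research Youth Fund (13YJC860006)
Beijing Municipal Universities Science and Technology and Graduate Education Innovation Project (PXM2012_014213_000037)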
Keywords
short text
topic detection
word co-occurrence
community detection