Abstract
Given the massive volume of rapidly updated news on the Internet, automatically discovering sensitive topics quickly and effectively, and then tracking them over time, has become an active research problem. With an online public opinion analysis system as the application background, this paper addresses the system's sensitive topic detection process. After reviewing related work on text clustering, we improve the Single-pass algorithm, which is widely used in the topic detection and tracking (TDT) field, and propose an incremental text clustering algorithm based on SimHash. Experiments on news text crawled in the real application show that the proposed algorithm is markedly more efficient at clustering than the original Single-pass algorithm. In practical use, the algorithm meets the expected requirement of real-time topic detection and has high practical value.
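The abstract does not give the details of the algorithm, so the following is only a minimal sketch of the general idea of combining SimHash fingerprints with Single-pass clustering, not the authors' method. Everything specific in it is an assumption made for illustration: the 64-bit fingerprint, MD5 as the token hash, unit token weights, the `max_distance` threshold of 16, and the choice to compare each new document only against the fingerprint of a cluster's first member.

```python
import hashlib
import re

HASH_BITS = 64  # assumed fingerprint width


def simhash(text: str) -> int:
    """Compute a 64-bit SimHash fingerprint using unit token weights (assumption)."""
    counts = [0] * HASH_BITS
    for token in re.findall(r"\w+", text.lower()):
        digest = hashlib.md5(token.encode("utf-8")).digest()
        token_hash = int.from_bytes(digest[:8], "big")
        # Each token votes +1 or -1 on every bit position.
        for bit in range(HASH_BITS):
            counts[bit] += 1 if (token_hash >> bit) & 1 else -1
    fingerprint = 0
    for bit in range(HASH_BITS):
        if counts[bit] > 0:
            fingerprint |= 1 << bit
    return fingerprint


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two fingerprints."""
    return bin(a ^ b).count("1")


def single_pass(docs, max_distance=16):
    """Assign each document, in arrival order, to the nearest existing
    cluster within max_distance bits, or open a new cluster otherwise."""
    clusters = []  # each cluster: {"seed": fingerprint, "members": [doc indices]}
    for idx, doc in enumerate(docs):
        fp = simhash(doc)
        best, best_dist = None, max_distance + 1
        for cluster in clusters:
            dist = hamming(fp, cluster["seed"])
            if dist < best_dist:
                best, best_dist = cluster, dist
        if best is None:
            clusters.append({"seed": fp, "members": [idx]})
        else:
            best["members"].append(idx)
    return clusters
```

In this sketch the comparison between a document and a cluster is a single XOR and bit count on two integers, which is the kind of saving over vector-space similarity that a SimHash-based variant would aim for. Chinese news text would additionally need a word segmentation step before hashing, which is omitted here.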
Source
《信息网络安全》
2015, Issue 9, pp. 170-174 (5 pages)
Netinfo Security