
Single-Pass Short Text Clustering Based on Context Similarity Matrix (cited by: 6)
Abstract: Online social networks have become an important channel and carrier for information exchange, forming a virtual society that interacts with and influences the real world. Many network events spread rapidly through social networks and can become focal points of public opinion within a short time; negative events can threaten national security and social stability and trigger a series of social problems. Mining the hot-topic information contained in social networks is therefore of great significance for both public opinion supervision and public opinion early warning. Text clustering is an important method for mining hot-topic information. However, when traditional long-text clustering algorithms are applied to massive volumes of short texts, their accuracy drops and their complexity grows sharply, which leads to excessive running time; existing short-text clustering algorithms also suffer from low accuracy and long running times. Based on text keywords, this paper proposes an association model that combines context with a similarity matrix to determine the relevance between the current text and the previous text. In addition, the keyword weights of each text are adjusted according to the association model to further reduce noise. Finally, a distributed short-text clustering algorithm is implemented on the Hadoop platform. Comparative experiments against the K-MEANS, SP-NN, and SP-WC algorithms verify that the proposed algorithm performs better in topic-mining speed, accuracy, and recall.
Authors: HUANG Jian-yi; LI Jian-jiang; WANG Zheng; FANG Ming-zhe (School of Computer and Communication Engineering, University of Science and Technology Beijing, Beijing 100083, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2019, Issue 4, pp. 50-56 (7 pages)
Funding: National Key Research and Development Program of China (2017YFB0803302); central basic operating funds (06116104)
Keywords: Online social network; Short text sequence; Text clustering; Distributed processing
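
The abstract names Single-Pass clustering as the base technique but does not give the procedure. The sketch below shows only the generic, textbook Single-Pass scheme (assign each incoming text to the most similar existing cluster if the similarity reaches a threshold, otherwise open a new cluster). It is not the authors' method: the context association model, the similarity matrix, the keyword re-weighting, and the Hadoop distribution are all omitted. The function names `cosine` and `single_pass`, the raw term-count weighting, and the threshold value are assumptions made for illustration.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two sparse term-weight mappings."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def single_pass(docs, threshold=0.3):
    """Generic Single-Pass clustering over tokenized texts.

    docs: list of keyword lists, processed once in arrival order.
    threshold: hypothetical similarity cutoff (not taken from the paper).
    Returns a list of clusters, each with a centroid and member indices.
    """
    clusters = []
    for i, tokens in enumerate(docs):
        vec = Counter(tokens)  # raw counts stand in for keyword weights
        best, best_sim = None, 0.0
        for cluster in clusters:
            sim = cosine(vec, cluster["centroid"])
            if sim > best_sim:
                best, best_sim = cluster, sim
        if best is not None and best_sim >= threshold:
            best["centroid"].update(vec)  # accumulate term counts
            best["members"].append(i)
        else:
            clusters.append({"centroid": Counter(vec), "members": [i]})
    return clusters


docs = [
    ["earthquake", "rescue"],
    ["earthquake", "rescue", "team"],
    ["football", "match"],
    ["football", "final", "match"],
]
clusters = single_pass(docs, threshold=0.5)
print([c["members"] for c in clusters])
```

Because each text is compared only against existing cluster centroids and is never revisited, the result depends on arrival order and on the threshold; this is the kind of sensitivity that keyword re-weighting, as described in the abstract, would aim to reduce.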
