Topic Extraction in News Comments Based on Improved K-means Clustering Algorithm
Cited by: 15
Abstract: News comments reflect the public's views on news events, so extracting their topics has high intelligence-analysis value for users, businesses, and government. When a K-means-based topic mining algorithm is applied to news comments under Euclidean distance, selecting the initial centers by the maximum distance method causes the comments to collapse into one large cluster. To address this, the paper first adds synonym substitution and automatic construction of a domain dictionary to the preprocessing stage, which mitigates data sparsity and high dimensionality. Second, it proposes an improved K-means algorithm: initial centers are selected by the maximum distance method after long comments are hidden, which solves the problem that most initial centers are outliers, and the value of K is determined from the inflection point of the variance, which removes the need to set the number of clusters in advance. Experiments show that selecting the initial centers with BW weights and then clustering with the newly proposed BW-DF weights gives the best results. Finally, the clustering performance of the improved algorithm is compared with that of the original algorithm; the results show that the improved algorithm achieves higher accuracy and extracts news comment topics effectively.
Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), CSSCI, Peking University Core Journal, 2016, No. 1, pp. 55-65 (11 pages)
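Funding: One of the research outputs of the National Natural Science Foundation of China project (71171153), "Research on the Knowledge-Sharing Activity Model and Service Support System of the 24-Hour Knowledge Factory"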
Keywords: online news comments; improved K-means clustering algorithm; topic extraction; synonym substitution; word-segmentation domain dictionary
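The abstract names two ideas that a small example can make concrete: seeding K-means by the maximum distance method after long comments are hidden, and choosing K from an inflection point of the variance. The sketch below is a minimal illustration under stated assumptions, not the paper's implementation. The function names, the `max_len` threshold, the choice of the first seed, and the second-difference elbow rule are all hypothetical, and the BW and BW-DF weights are not modeled because the abstract does not define them; comments are assumed to be already vectorized.

```python
def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def max_distance_init(vectors, lengths, k, max_len):
    """Pick k seeds by farthest-first search over comments with length <= max_len."""
    candidates = [i for i, n in enumerate(lengths) if n <= max_len]
    if len(candidates) < k:
        raise ValueError("not enough short comments to seed k clusters")
    # Assumption: start from the candidate farthest from the candidates' mean.
    mean = [sum(col) / len(candidates) for col in zip(*(vectors[i] for i in candidates))]
    chosen = [max(candidates, key=lambda i: sq_dist(vectors[i], mean))]
    while len(chosen) < k:
        # Next seed: the candidate whose nearest chosen seed is farthest away.
        rest = [i for i in candidates if i not in chosen]
        chosen.append(max(rest, key=lambda i: min(sq_dist(vectors[i], vectors[c]) for c in chosen)))
    return [list(vectors[i]) for i in chosen]


def kmeans(vectors, centers, iters=100):
    """Plain Lloyd iterations; returns (labels, centers, within-cluster sum of squares)."""
    centers = [list(c) for c in centers]
    labels = None
    for _ in range(iters):
        new_labels = [min(range(len(centers)), key=lambda j: sq_dist(v, centers[j]))
                      for v in vectors]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(len(centers)):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:  # keep the old center if a cluster becomes empty
                centers[j] = [sum(col) / len(members) for col in zip(*members)]
    sse = sum(sq_dist(v, centers[lab]) for v, lab in zip(vectors, labels))
    return labels, centers, sse


def choose_k(vectors, lengths, max_len, k_max):
    """Assumed elbow rule: the k with the largest second difference of the SSE curve."""
    if k_max < 3:
        raise ValueError("k_max must be at least 3 to locate an inflection point")
    sse = {}
    for k in range(1, k_max + 1):
        seeds = max_distance_init(vectors, lengths, k, max_len)
        sse[k] = kmeans(vectors, seeds)[2]
    return max(range(2, k_max), key=lambda k: sse[k - 1] - 2 * sse[k] + sse[k + 1])


# Toy data (made up): three tight groups of short comments plus one long outlier.
vectors = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10],
           [0, 10], [1, 10], [0, 11], [30, 30]]
lengths = [20] * 9 + [400]

seeds = max_distance_init(vectors, lengths, k=3, max_len=100)
labels, centers, sse = kmeans(vectors, seeds)
# Elbow search on the nine short comments only, to keep the toy curve simple.
best_k = choose_k(vectors[:9], lengths[:9], max_len=100, k_max=5)
```

On this toy data, hiding the long comment keeps the outlier out of the seed set, whereas raising `max_len` above 400 lets the maximum distance method pick it as a seed, which is the failure mode the abstract describes. The hidden comment is still assigned to a cluster during the K-means iterations; only the seeding step ignores it.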