Abstract
News comments reflect the public's views on news events, and extracting the topics of those comments has high intelligence-analysis value for users, businesses, and government. When a K-means-based topic mining algorithm is applied to news comments under Euclidean distance, choosing the initial centers with the maximum distance method causes the comments to collapse into one large cluster. To address this problem, the paper first adds synonym substitution and automatic construction of a domain dictionary to the preprocessing stage, which mitigates data sparsity and high dimensionality. Second, it proposes an improved K-means algorithm: initial centers are chosen by the maximum distance method after long comments are hidden, which solves the problem that most initial centers are outliers, and the value of K is determined from the inflection point of the variance, which removes the need to set the number of clusters in advance. Experiments show that selecting the initial centers with BW weights and then clustering with the newly proposed BW-DF weights gives the best results. Finally, the clustering performance of the improved algorithm is compared with that of the original algorithm; the results show that the improved algorithm achieves high accuracy and extracts news-comment topics effectively.
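As a rough illustration of the initial-center step described in the abstract, the following is a minimal sketch of maximum-distance seeding in which long comments are excluded from the candidate pool. The abstract does not say how long comments are hidden, so the length threshold `max_len`, the function name `max_distance_seeds`, and the choice to start from the two candidates farthest apart are all assumptions made for illustration. The BW and BW-DF weighting schemes are not modeled here.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_seeds(vectors, lengths, k, max_len):
    """Return indices of k initial centers chosen by the maximum distance rule.

    Assumption (not from the paper): a comment is "hidden" when its length
    exceeds max_len, so it cannot be picked as an initial center.
    """
    # Only comments at or below the length threshold are candidates.
    candidates = [i for i, n in enumerate(lengths) if n <= max_len]
    if k < 2 or len(candidates) < k:
        raise ValueError("need k >= 2 and at least k candidate comments")

    # Start with the two candidates that are farthest apart.
    first, second = max(
        combinations(candidates, 2),
        key=lambda p: euclidean(vectors[p[0]], vectors[p[1]]),
    )
    seeds = [first, second]

    # Repeatedly add the candidate whose nearest seed is farthest away.
    while len(seeds) < k:
        remaining = [i for i in candidates if i not in seeds]
        nxt = max(
            remaining,
            key=lambda i: min(euclidean(vectors[i], vectors[s]) for s in seeds),
        )
        seeds.append(nxt)
    return seeds
```

With a single very long comment far from the rest of the data, setting `max_len` below its length keeps it out of the initial centers, whereas plain maximum-distance seeding would select it first.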
Source
《情报学报》
CSSCI
Peking University Core Journal (北大核心)
2016, No. 1, pp. 55-65 (11 pages)
Journal of the China Society for Scientific and Technical Information
Funding
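One of the research outputs of the National Natural Science Foundation of China project (71171153), "Research on the Knowledge-Sharing Activity Model and Service Support System for the 24-Hour Knowledge Factory"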