
A News Extraction Method Based on Information Entropy of Word Clustering (Cited by: 1)
Abstract: The rapid development of the Internet has brought convenience to the public but has also generated a large amount of redundant information. Using natural language processing techniques to extract new-topic articles and curb the spread of fake news within new topics can provide effective support for public opinion monitoring. This paper proposes a news extraction method based on the information entropy of word clusters and conducts experiments on a news corpus related to "One Belt, One Road". Relevant reports are collected by web crawler; the corpus is segmented with the Pkuseg tool and then given a series of preprocessing steps, such as removing stop words and background words, after which Word2vec word vectors are trained. Word-frequency statistics are used to select historically high-frequency words for K-means clustering, and the resulting word clusters are treated as a random variable to compute the information entropy of the current article. If an article's entropy exceeds a set threshold, it is a new-topic article that warrants attention. The results show that with the threshold set to 0.65, the accuracy of news extraction reaches 84%.
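The entropy step described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: `word_to_cluster` stands in for the output of the K-means step (a word-to-cluster-ID map), the tokens are assumed to be Pkuseg output after preprocessing, and plain Shannon entropy in bits is used, since the paper's exact formulation and normalization are not given here.

```python
import math
from collections import Counter

def article_entropy(tokens, word_to_cluster):
    """Shannon entropy (bits) of an article over word clusters.

    Each token is mapped to its cluster ID; the cluster-frequency
    distribution is then treated as the random variable. Tokens
    outside the clustered vocabulary are ignored.
    """
    cluster_counts = Counter(word_to_cluster[w] for w in tokens
                             if w in word_to_cluster)
    total = sum(cluster_counts.values())
    if total == 0:
        return 0.0
    probs = (c / total for c in cluster_counts.values())
    return -sum(p * math.log2(p) for p in probs)

def is_new_topic(tokens, word_to_cluster, threshold=0.65):
    """Flag an article as a new topic when its entropy exceeds the threshold."""
    return article_entropy(tokens, word_to_cluster) > threshold
```

An article whose tokens fall into a single familiar cluster yields entropy 0, while tokens spread evenly across clusters maximize it, which matches the intuition that new-topic articles draw vocabulary from outside the established high-frequency clusters.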
Authors: NIU Wei-nong, WU Lin, YU Shui-yuan (Key Laboratory of Convergent Media and Intelligent Technology of Ministry of Education, Communication University of China, Beijing 100024, China)
Source: Software Guide (《软件导刊》), 2020, Issue 1, pp. 36-40 (5 pages)
Funding: Communication University of China Young Faculty Science and Engineering Planning Project (3132018XNG1834)
Keywords: news extraction; new topic; word vector; clustering; information entropy