Abstract
Cross-language news topic discovery automatically groups news articles that report the same event in different languages. Because texts in different languages are hard to represent in a single feature space, mining their shared topics is difficult. However, reports of the same event share the same news elements regardless of the language they are written in, and the links among these elements reflect how the reports are related. For Chinese-Vietnamese news, this paper therefore proposes a bilingual news topic discovery method based on document graph clustering. First, news elements are extracted from the Chinese and Vietnamese texts, and the relatedness between texts is computed from the similarity of their elements; this yields a Chinese-Vietnamese bilingual document graph model and a similarity matrix over the news texts. Second, exploiting how similarity propagates between texts in the graph model, a random walk algorithm is used to adjust the similarity matrix. Finally, the affinity propagation algorithm is used to cluster the texts into topics. Experimental results show that the proposed method is effective.
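The first two steps of the pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that news elements from both languages have already been mapped to shared identifiers (the strings below are hypothetical), uses Jaccard overlap as a stand-in for the element similarity measure, and uses a random walk with restart as one plausible form of the random-walk adjustment; the abstract does not specify either formula.

```python
def jaccard(a, b):
    """Jaccard overlap between two sets of news elements (assumed measure)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def similarity_matrix(docs):
    """Pairwise document similarity computed from shared news elements."""
    n = len(docs)
    return [[1.0 if i == j else jaccard(docs[i], docs[j]) for j in range(n)]
            for i in range(n)]


def row_normalize(matrix):
    """Turn a non-negative matrix into a row-stochastic transition matrix."""
    out = []
    for row in matrix:
        total = sum(row)
        out.append([v / total for v in row] if total else list(row))
    return out


def matmul(a, b):
    """Plain-Python matrix product of two lists of lists."""
    inner, cols = len(b), len(b[0])
    return [[sum(row[t] * b[t][j] for t in range(inner)) for j in range(cols)]
            for row in a]


def random_walk_adjust(sim, restart=0.3, steps=20):
    """Random walk with restart: R <- (1 - c) * P @ R + c * I (assumed form)."""
    n = len(sim)
    p = row_normalize(sim)
    identity = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    r = identity
    for _ in range(steps):
        pr = matmul(p, r)
        r = [[(1 - restart) * pr[i][j] + restart * identity[i][j]
              for j in range(n)] for i in range(n)]
    return r


# Hypothetical element sets; each set stands for one news text (zh or vi).
docs = [
    {"person:A", "date:D1"},            # Chinese report, event 1
    {"person:A", "place:Hanoi"},        # Vietnamese report, event 1
    {"place:Hanoi", "org:B"},           # Chinese report, event 1
    {"org:X", "place:Kunming"},         # Vietnamese report, event 2
    {"org:X", "place:Kunming", "date:D2"},  # Chinese report, event 2
]

sim = similarity_matrix(docs)
adjusted = random_walk_adjust(sim)
```

In this toy data, texts 0 and 2 share no element directly, so their raw similarity is zero, but the walk gives them a positive adjusted score through text 1, while texts from the unrelated second event stay at zero. The adjusted matrix could then be passed to an affinity propagation implementation for the final clustering step, for example scikit-learn's `AffinityPropagation` with `affinity="precomputed"`, possibly after symmetrizing it.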
Authors
王禹森
余正涛
高盛祥
周超
洪旭东
Wang Yusen, Yu Zhengtao, Gao Shengxiang, Zhou Chao, Hong Xudong (School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming, 650500, China)
Source
《数据采集与处理》
CSCD
PKU Core Journal
2018, No. 3, pp. 530-537 (8 pages)
Journal of Data Acquisition and Processing
Funding
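National Natural Science Foundation of China (61472168, 61175068, 61672271)
Key Project of the Natural Science Foundation of Yunnan Province (2013FA130)
Science and Technology Innovation Talent Fund of Yunnan Province (2014HE001)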
Keywords
Chinese-Vietnamese bilingual
event elements
topic detection
graph clustering