Abstract
Bilingual topic analysis and discovery is an active research area both in China and abroad, but little work targets specific types of text. For Sino-Vietnamese bilingual news texts, this paper builds on a Sino-Vietnamese text similarity calculation method based on bilingual topic-distribution words and proposes news-element features specific to news text: titles, keywords, and entities. These news features are fused into the text similarity calculation to construct a bilingual text similarity matrix, and an adaptive K-means algorithm then clusters the Sino-Vietnamese bilingual news texts in order to analyze bilingual news topics. Experimental results show that the proposed method achieves higher accuracy, recall, and F-measure than both the calculation method that considers only news text similarity and the K-means clustering method.
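The pipeline in the abstract (fuse per-feature similarities, build a similarity matrix, cluster without fixing the number of clusters in advance) can be illustrated with a minimal sketch. Everything specific here is an assumption, since the abstract gives no formulas: the `WEIGHTS` values, the Jaccard overlap used as a stand-in for the topic-word similarity, the `threshold` parameter, and the threshold-seeded, medoid-style loop that replaces the paper's adaptive K-means. The sketch also assumes that Chinese and Vietnamese tokens have already been mapped into one shared vocabulary.

```python
# Illustrative sketch only. The weights, the Jaccard stand-in, and the
# threshold-seeded clustering are assumptions, not the paper's method.

def jaccard(a, b):
    """Set-overlap similarity in [0, 1]; stands in for any feature similarity."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)

# Hypothetical weights for fusing body, title, keyword, and entity similarity.
WEIGHTS = {"body": 0.4, "title": 0.3, "keywords": 0.2, "entities": 0.1}

def fused_similarity(doc_a, doc_b, weights=WEIGHTS):
    """Weighted sum of the per-feature similarities of two documents."""
    return sum(w * jaccard(doc_a[f], doc_b[f]) for f, w in weights.items())

def similarity_matrix(docs):
    """Symmetric matrix of fused similarities, with 1.0 on the diagonal."""
    n = len(docs)
    return [[1.0 if i == j else fused_similarity(docs[i], docs[j])
             for j in range(n)] for i in range(n)]

def adaptive_cluster(sim, threshold=0.3, max_iter=20):
    """Cluster on a similarity matrix; the number of clusters is not preset."""
    n = len(sim)
    # Seeding: a document becomes a new center when it is not similar
    # enough to any existing center, which determines k from the data.
    centers = []
    for i in range(n):
        if all(sim[i][c] < threshold for c in centers):
            centers.append(i)
    labels = [0] * n
    for _ in range(max_iter):
        # Assignment step: attach each document to its most similar center.
        labels = [max(range(len(centers)), key=lambda k: sim[i][centers[k]])
                  for i in range(n)]
        # Update step: the new center is the member most similar to its cluster.
        new_centers = []
        for k in range(len(centers)):
            members = [i for i in range(n) if labels[i] == k]
            if not members:  # keep the old center if a cluster emptied out
                new_centers.append(centers[k])
                continue
            new_centers.append(
                max(members, key=lambda m: sum(sim[m][j] for j in members)))
        if new_centers == centers:
            break
        centers = new_centers
    return labels

# Toy documents; tokens are assumed to be in a shared cross-lingual vocabulary.
docs = [
    {"body": ["flood", "rain", "river"], "title": ["flood"],
     "keywords": ["flood", "rescue"], "entities": ["Hanoi"]},
    {"body": ["flood", "river", "dam"], "title": ["flood", "dam"],
     "keywords": ["flood"], "entities": ["Hanoi"]},
    {"body": ["trade", "export", "rice"], "title": ["trade"],
     "keywords": ["trade", "rice"], "entities": ["Kunming"]},
    {"body": ["trade", "rice", "tariff"], "title": ["rice", "trade"],
     "keywords": ["trade"], "entities": ["Kunming"]},
]
sim = similarity_matrix(docs)
labels = adaptive_cluster(sim)
print(labels)
```

On these toy documents the two flood stories and the two trade stories end up in separate clusters. Lowering `threshold` yields fewer seeds and therefore fewer clusters, which is the knob a real system would need to tune.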
Source
Computer Engineering (《计算机工程》)
CAS
CSCD
PKU Core
2016, Issue 9, pp. 186-191 (6 pages)
Funding
National Natural Science Foundation of China (61462055, 61472168, 61262041)
Key Project of the Natural Science Foundation of Yunnan Province (2013FA130)
Keywords
bilingual news topic analysis
Sino-Vietnamese bilingual
text similarity
topic
adaptive clustering