Abstract
The original Word Mover's Distance (WMD) algorithm computes document distance from word2vec word vectors and word-frequency feature vectors. It ignores the context in which a word's meaning arises and cannot fully extract the semantic information carried by words. This paper therefore proposes a text similarity calculation method based on joint word and sentence representations. The method uses trained word vectors and sentence vectors to construct feature weighting coefficients and modifies the WMD formula accordingly. It then selects a certain proportion of keywords and computes the word-sentence transfer cost from their word vectors and the sentence vectors, which yields the text similarity of two documents. Three sets of comparative experiments show that the proposed method outperforms both other text similarity calculation methods and the original WMD algorithm.
Authors
徐鑫鑫
刘彦隆
宋明
XU Xin-xin; LIU Yan-long; SONG Ming (School of Information and Computer, Taiyuan University of Technology, Jinzhong 030600, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2019, Issue 10, pp. 2072-2076 (5 pages)
Journal of Chinese Computer Systems
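Funding
Supported by the National Natural Science Foundation of China (60772101)
Supported by a Taiyuan University of Technology project (900203011843)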
Keywords
text similarity
word embedding
sentence embedding
WMD distance
enhanced weighting coefficient