Abstract
To further improve the accuracy of text similarity computation, this paper proposes, within the framework of the system similarity function, a word-embedding-based document similarity function, WDS (Word Documents Similarity), together with its optimized algorithm FWDS (Fast Word Documents Similarity). WDS treats the set of word embeddings corresponding to a document's word set as a system and each word's embedding as an element of that system, so the similarity of two documents is the similarity of their two embedding sets. In the actual computation, the two vector sets are aligned using the first set as the standard, and several parameters of the similar elements (MOPs) and the non-similar elements are computed at the same time. Experimental results show that, as document length increases, WDS achieves a higher hit rate than the WMD and WJ algorithms.
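The alignment idea in the abstract can be illustrated with a minimal sketch. It is not the paper's WDS formula, which the abstract does not give: the greedy matching by cosine similarity, the `threshold` parameter, the Jaccard-style normalization, and the names `cosine`, `align`, and `set_similarity` are all assumptions made for illustration, and the weights mentioned in the keywords are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def align(first, second, threshold=0.5):
    """Greedily align `second` to `first`, using `first` as the standard.

    Each vector in `first` is paired with its most similar, still unmatched
    vector in `second`, provided the similarity reaches `threshold`
    (a hypothetical cut-off). Returns (index_in_first, index_in_second, sim).
    """
    used = set()
    pairs = []
    for i, u in enumerate(first):
        best_j, best_s = None, threshold
        for j, v in enumerate(second):
            if j in used:
                continue
            s = cosine(u, v)
            if s >= best_s:
                best_j, best_s = j, s
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j, best_s))
    return pairs


def set_similarity(first, second, threshold=0.5):
    """Illustrative set similarity: summed similarity of matched pairs
    divided by the number of matched pairs plus unmatched elements."""
    if not first or not second:
        return 0.0
    pairs = align(first, second, threshold)
    matched = len(pairs)
    unmatched = len(first) + len(second) - 2 * matched
    total = sum(s for _, _, s in pairs)
    return total / (matched + unmatched)
```

Under these assumptions, two identical vector sets score 1.0 and two sets with no pair above the threshold score 0.0; unmatched elements on either side lower the score.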
Authors
王路琪
龙军
袁鑫攀
WANG Lu-qi; LONG Jun; YUAN Xin-pan (School of Software, Central South University, Changsha 410075, China; School of Computer and Communication, Hunan University of Technology, Zhuzhou, Hunan 412000, China)
Source
Computer Science (《计算机科学》)
CSCD
PKU Core
2018, Issue B11, pp. 113-116 (4 pages)
Funding
National Natural Science Foundation of China (61402165, S1651002)
Key Research and Development Program of Hunan Province (2016JC2018)
Keywords
Document similarity
Word embedding
System similarity function
MOP
Weight