Abstract
To address the inability of cosine similarity over a bag-of-words model to reflect differences in term order, this paper proposes a document similarity measure based on edit distance. First, the problems of the tf-idf bag-of-words model and of the cosine similarity calculation are analyzed. Second, the Jaccard coefficient and edit distance are used to describe the order differences among the words in the common substrings of two strings, and a word-order-sensitive similarity measure is proposed. Finally, the effectiveness of the algorithm is verified on experimental data: the F1 scores at Top1 and Top3 improve by 0.0825 and 0.1126, respectively, over the original cosine similarity method, which indicates that the proposed method can effectively improve the performance of information retrieval systems and has good application value.
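The abstract does not give the paper's formula, so the following is only a minimal sketch of the general idea, under stated assumptions: documents are token lists, the order score is one minus the normalized edit distance between the two sequences of shared terms, and the final score is the product of that order score and the Jaccard coefficient. The function names and the way the two quantities are combined are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Bag-of-words cosine similarity on raw term counts (order-insensitive)."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca)
    na = math.sqrt(sum(v * v for v in ca.values()))
    nb = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (na * nb) if na and nb else 0.0


def edit_distance(s, t):
    """Levenshtein distance between two sequences, computed row by row."""
    prev = list(range(len(t) + 1))
    for i, x in enumerate(s, 1):
        curr = [i]
        for j, y in enumerate(t, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (x != y)))  # substitution or match
        prev = curr
    return prev[-1]


def jaccard(a, b):
    """Jaccard coefficient of the two term sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def order_sensitive_similarity(a, b):
    """Assumed combination: Jaccard overlap scaled by an order score
    derived from the edit distance between the shared-term sequences."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    seq_a = [t for t in a if t in common]
    seq_b = [t for t in b if t in common]
    longest = max(len(seq_a), len(seq_b))
    order_score = 1.0 - edit_distance(seq_a, seq_b) / longest
    return jaccard(a, b) * order_score


doc_a = ["dog", "bites", "man"]
doc_b = ["man", "bites", "dog"]
print(cosine_similarity(doc_a, doc_b))           # same term counts, so cosine cannot tell them apart
print(order_sensitive_similarity(doc_a, doc_b))  # lower, because the shared terms appear in a different order
```

In this toy pair both documents contain the same terms, so the bag-of-words cosine treats them as identical, while the order-aware score drops because two of the three shared terms have swapped positions.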
Authors
张雷
崔荣一
ZHANG Lei;CUI Rongyi(College of Engineering,Yanbian University,Yanji 133002,China)
Source
《延边大学学报(自然科学版)》
CAS
2020, No. 2, pp. 140-144 (5 pages)
Journal of Yanbian University(Natural Science Edition)
Keywords
text similarity
bag-of-words model
edit distance
word order