
A word order sensitive similarity measure based on edit distance

Cited by: 5
Abstract: Cosine similarity over a bag-of-words model cannot reflect differences in the order of terms. To address this shortcoming, this paper proposes a document similarity measure based on edit distance. First, the problems of the tf-idf-based bag-of-words model and of the cosine similarity calculation are analyzed. Second, the Jaccard coefficient and the edit distance are used to describe the order difference between the words in the common substrings of two strings, and a word order sensitive similarity calculation method is proposed. Finally, experimental data are used to verify the effectiveness of the algorithm. The results show that, compared with the original cosine similarity method, the F1 score of the proposed method improves by 0.0825 on Top1 and by 0.1126 on Top3, indicating that the method can effectively improve the performance of information retrieval systems and has good application value.
Authors: ZHANG Lei (张雷); CUI Rongyi (崔荣一) (College of Engineering, Yanbian University, Yanji 133002, China)
Affiliation: College of Engineering, Yanbian University
Source: Journal of Yanbian University (Natural Science Edition) 《延边大学学报(自然科学版)》, indexed in CAS, 2020, No. 2, pp. 140-144 (5 pages)
Keywords: text similarity; bag-of-words model; edit distance; word order
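The abstract names the ingredients of the method (Jaccard coefficient, edit distance over the shared words, comparison against cosine similarity) but not the exact formula. The following is a minimal Python sketch of how such ingredients might be combined, not the authors' algorithm. The function `order_sensitive_similarity`, the multiplicative combination, the normalization by the longer sequence, and the use of raw term counts instead of tf-idf weights are all assumptions made for illustration.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of raw term-count vectors; ignores word order."""
    ca, cb = Counter(a), Counter(b)
    dot = sum(ca[t] * cb[t] for t in ca.keys() & cb.keys())
    norm_a = math.sqrt(sum(v * v for v in ca.values()))
    norm_b = math.sqrt(sum(v * v for v in cb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def jaccard(a, b):
    """Jaccard coefficient of the two token sets."""
    sa, sb = set(a), set(b)
    union = sa | sb
    return len(sa & sb) / len(union) if union else 0.0


def edit_distance(a, b):
    """Word-level Levenshtein distance between two token sequences."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute
        prev = curr
    return prev[-1]


def order_sensitive_similarity(a, b):
    """Hypothetical combination (an assumption, not the paper's formula):
    Jaccard overlap scaled by how well the shared words agree in order."""
    common = set(a) & set(b)
    seq_a = [t for t in a if t in common]
    seq_b = [t for t in b if t in common]
    longest = max(len(seq_a), len(seq_b))
    if longest == 0:
        return 0.0
    order_agreement = 1.0 - edit_distance(seq_a, seq_b) / longest
    return jaccard(a, b) * order_agreement


d1 = "the dog bites the man".split()
d2 = "the man bites the dog".split()
print(cosine_similarity(d1, d2), order_sensitive_similarity(d1, d2))
```

For the two example sentences, which contain the same words in a different order, the cosine similarity is 1 (up to floating-point rounding), while this sketch yields 0.6 because two word-level substitutions are needed across five shared tokens.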
