
A Novel Approach for Text Similarity Computing Based on Random n-Grams

Cited by: 8
Abstract: Text similarity computing is widely used in text applications such as plagiarism detection, automatic question answering, and text clustering. However, traditional methods are often tied to a particular language and spend considerable time analyzing documents and extracting their feature items. To address these shortcomings, this paper proposes a similarity algorithm for long texts based on random n-Grams (Random n-Gram, denoted R-Gram). The algorithm is language independent and combines the fine-grained detection of short n-Grams with the efficient detection of long n-Grams. Experimental results show that the R-Gram-based algorithm is fast, simple to operate, and flexible in how its precision can be tuned, which makes it well suited to similarity computing for long texts.
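Source: Journal of the China Society for Scientific and Technical Information (《情报学报》), indexed in CSSCI and the Peking University Core Journals list, 2013, No. 7, pp. 716-723 (8 pages)
Funding: National Natural Science Foundation of China (61172084); Zhejiang Provincial Natural Science Foundation (Y1100137); Yueqing City Science and Technology Project (2011R003)
Keywords: text similarity, evaluation function, set, n-Gram, R-Gram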
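
The abstract does not say how R-Grams are generated or how the evaluation function is defined, so the following is only an illustrative sketch of the general idea, not the authors' algorithm. It assumes that each text is cut into character n-grams whose lengths are drawn at random from a range, and that the two resulting sets are compared with the Jaccard coefficient. The names `random_ngrams`, `jaccard`, and `rgram_similarity`, the parameters `n_min`, `n_max`, and `seed`, and the choice of Jaccard are all hypothetical.

```python
import random


def random_ngrams(text, n_min=2, n_max=6, seed=0):
    """Collect character n-grams whose lengths are drawn at random.

    Assumption (not from the paper): one n-gram starts at every character
    position, and its length is sampled uniformly from [n_min, n_max].
    Working on raw characters keeps the sketch language independent.
    """
    rng = random.Random(seed)  # fixed seed so both texts use the same length sequence
    grams = set()
    for start in range(len(text)):
        n = rng.randint(n_min, n_max)
        gram = text[start:start + n]
        if len(gram) == n:  # drop truncated grams at the end of the text
            grams.add(gram)
    return grams


def jaccard(a, b):
    """Jaccard coefficient of two sets; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def rgram_similarity(text_a, text_b, n_min=2, n_max=6, seed=0):
    """Similarity in [0, 1] between two texts under the assumptions above."""
    grams_a = random_ngrams(text_a, n_min, n_max, seed)
    grams_b = random_ngrams(text_b, n_min, n_max, seed)
    return jaccard(grams_a, grams_b)


if __name__ == "__main__":
    doc_a = "text similarity is widely used in plagiarism detection"
    doc_b = "text similarity is widely used in text clustering"
    print(rgram_similarity(doc_a, doc_b))
```

In this sketch, raising `n_min` and `n_max` produces longer grams and coarser matching, while lowering them gives finer-grained matching, which loosely mirrors the trade-off between short and long n-Grams that the abstract describes. Because the lengths depend on position, identical passages that appear at different offsets in the two texts will only partly match; a real implementation would need to handle that.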