
A Measure of Sentence Similarity Based on N-grams and the Vector Space Model (original title: 基于N-gram和向量空间模型的语句相似度研究)

Cited by: 14
Abstract: Measures of sentence similarity are widely used in information retrieval, automated scoring in language testing, and machine translation evaluation. Previous studies have tended to focus either on linguistic form or on meaning; few have examined sentence similarity with the two combined. This study uses the N-gram method from natural language processing together with the Vector Space Model to measure sentence similarity in terms of both form and meaning. The similarity scores computed by the algorithm agree closely with the judgments of Chinese and foreign raters, with overall correlation coefficients of .928 and .925 respectively, indicating that the proposed algorithm provides a reliable measure of sentence similarity.
Source: Modern Foreign Languages (《现代外语》; indexed in CSSCI and the Peking University Core Journals list), 2007, No. 4, pp. 405-413 (9 pages)
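Funding: Part of the output of the National Social Science Fund of China project "Translation Research and Translation Teaching Platform Based on a Large Bilingual Parallel Corpus" (project no. 05BYY013); also supported by the "China Foreign Language Education Fund" of the China Foreign Language Education Center, Beijing Foreign Studies University.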
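The abstract names two ingredients, N-grams and the Vector Space Model, without giving the algorithm itself. As a rough illustration only, the sketch below shows one common way such measures are computed: a Dice coefficient over word n-grams and a cosine over raw term-frequency vectors. The function names, the whitespace tokenization, the choice of Dice, and the weighted average in `combined_similarity` are all assumptions made for this example; they are not taken from the paper and may differ from its method.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return the contiguous n-grams (as tuples) of a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def ngram_overlap(a, b, n=2):
    """Dice coefficient over n-gram multisets (assumed measure, not the paper's)."""
    ca, cb = Counter(ngrams(a, n)), Counter(ngrams(b, n))
    total = sum(ca.values()) + sum(cb.values())
    if total == 0:
        return 0.0
    shared = sum((ca & cb).values())  # multiset intersection
    return 2 * shared / total


def cosine_similarity(a, b):
    """Cosine between raw term-frequency vectors of two token lists."""
    va, vb = Counter(a), Counter(b)
    dot = sum(va[t] * vb[t] for t in va)
    sq_a = sum(v * v for v in va.values())
    sq_b = sum(v * v for v in vb.values())
    norm = math.sqrt(sq_a * sq_b)
    return dot / norm if norm else 0.0


def combined_similarity(s1, s2, n=2, weight=0.5):
    """Hypothetical combination: weighted average of the two scores."""
    a, b = s1.lower().split(), s2.lower().split()
    return weight * ngram_overlap(a, b, n) + (1 - weight) * cosine_similarity(a, b)
```

Calling `combined_similarity("the cat sat on the mat", "the cat lay on the mat")` returns a score between 0 and 1. The n-gram term is sensitive to word order, while the cosine term ignores it, which is one plausible reason to pair the two; how the paper actually weights or combines them is not stated in the abstract.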
