Abstract
This paper classifies the kinds of similarity found among Chinese sentences that need to be deduplicated in a parallel corpus. Sentence similarity is computed from a global similarity factor and a local similarity factor. Borrowing the match-and-skip idea of the KMP algorithm, a KMP-like algorithm for Chinese string matching is proposed and verified experimentally. The results show that the algorithm is effective and can remove similar sentences from a parallel corpus: in open testing, recall reaches 94% and deduplication precision reaches 84%. The algorithm can compare sentences of any length, so it has a wide range of application.
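For readers unfamiliar with the skipping idea the abstract refers to, the following is a minimal sketch of the classic KMP search applied to Chinese strings. It is not the paper's KMP-like variant, whose details are not given in this record; the function names `build_failure` and `kmp_find` are illustrative assumptions. Python strings are sequences of Unicode code points, so each Chinese character is compared as a single unit.

```python
def build_failure(pattern: str) -> list:
    """failure[i] is the length of the longest proper prefix of
    pattern[:i + 1] that is also a suffix of it."""
    failure = [0] * len(pattern)
    k = 0  # length of the currently matched prefix
    for i in range(1, len(pattern)):
        # On a mismatch, fall back to the next shorter matching prefix.
        while k > 0 and pattern[i] != pattern[k]:
            k = failure[k - 1]
        if pattern[i] == pattern[k]:
            k += 1
        failure[i] = k
    return failure


def kmp_find(text: str, pattern: str) -> int:
    """Return the index of the first occurrence of pattern in text, or -1."""
    if not pattern:
        return 0
    failure = build_failure(pattern)
    k = 0  # number of pattern characters matched so far
    for i, ch in enumerate(text):
        # Skip ahead using the failure table instead of re-scanning the text.
        while k > 0 and ch != pattern[k]:
            k = failure[k - 1]
        if ch == pattern[k]:
            k += 1
        if k == len(pattern):
            return i - k + 1
    return -1


# Illustrative usage (the example strings are hypothetical, not from the paper).
position = kmp_find("平行语料库中的相似句子", "相似句子")
```

The failure table lets the search resume from a partially matched prefix after a mismatch, so each text character is examined only a bounded number of times. How the paper adapts this to similarity scoring rather than exact matching is not stated in this record.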
Source
Journal of Guangxi Academy of Sciences (《广西科学院学报》)
2009, No. 4, pp. 248-250, 256 (4 pages)
Funding
Supported by the 宁市 Talent Small Highland Fund Project (No. 2007007)
Keywords
duplicate removal
similar sentence
parallel corpus
KMP-like algorithm