Abstract
To address the inefficiency of the Levenshtein distance algorithm on long texts and in large-scale matching, we propose an early termination strategy for the algorithm. First, from the intrinsic relationships among the elements of the Levenshtein distance matrix, we derive a recurrence relation. Based on this relation, we then propose an early termination strategy that can determine in advance whether two texts satisfy a predefined similarity threshold. Duplicate-detection experiments on question banks from several academic subjects show that the early termination strategy significantly reduces computation time.
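The abstract does not state the recurrence relation itself, so the following is only a minimal sketch of one well-known form of threshold-based early termination, not the authors' method. It relies on the fact that the minimum of each row of the Levenshtein distance matrix never decreases from one row to the next, so the computation can stop as soon as a whole row exceeds the allowed distance. The function name `levenshtein_within` and the parameter `max_dist` are hypothetical, and the similarity threshold is assumed to have been converted into a maximum edit distance by the caller.

```python
def levenshtein_within(a: str, b: str, max_dist: int) -> bool:
    """Return True if the Levenshtein distance between a and b is <= max_dist.

    Illustrative sketch only; max_dist is an assumed stand-in for the
    paper's similarity threshold.
    """
    # The length difference is a lower bound on the edit distance.
    if abs(len(a) - len(b)) > max_dist:
        return False

    # prev holds the previous row of the distance matrix.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        # Row minima are non-decreasing, so if every entry in this row
        # already exceeds max_dist, the final distance must exceed it too.
        if min(curr) > max_dist:
            return False
        prev = curr
    return prev[-1] <= max_dist
```

For instance, `levenshtein_within("kitten", "sitting", 3)` accepts the pair, while a budget of 2 rejects it. The worst-case cost is still quadratic; the saving comes only from abandoning clearly dissimilar pairs before the matrix is complete.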
Authors
张衡
陈良育
ZHANG Heng; CHEN Liang-yu (Shanghai Key Laboratory of Trustworthy Computing, East China Normal University, Shanghai 200062, China)
Source
《华东师范大学学报(自然科学版)》
CAS
CSCD
PKU Core (北大核心)
2018, No. 5, pp. 154-163 (10 pages)
Journal of East China Normal University (Natural Science)
Funding
National Natural Science Foundation of China (11471209)