Abstract
For strings that mix Chinese and Western characters, each Chinese character is treated as a unit equivalent to one Western character when computing edit distance. From the perspective of Chinese input methods, the paper also proposes computing edit distance over the Pinyin codes and the Wubi codes of Chinese characters, and then gives an algorithm that fuses the three edit distances into a single string similarity. Simulation results show that the method improves the recall of approximately duplicate record detection while also achieving relatively high precision.
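The abstract does not give the fusion formula, so the following is only a minimal sketch of the idea, not the paper's algorithm. It assumes a standard Levenshtein distance, a normalization by the longer string's length, and an equal-weight average of the three similarities. The names `PINYIN`, `WUBI`, `encode`, and `fused_similarity` are hypothetical, and the two code tables are toy placeholders rather than authoritative input-method tables.

```python
from typing import Dict, Sequence, Tuple


def edit_distance(a: Sequence, b: Sequence) -> int:
    """Levenshtein distance; every element counts as one unit,
    so a Chinese character and a Western character weigh the same."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: Sequence, b: Sequence) -> float:
    """Assumed normalization: 1 - distance / length of the longer string."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest


# Toy placeholder tables (illustrative values only, not a real code table).
PINYIN: Dict[str, str] = {"张": "zhang", "章": "zhang", "三": "san"}
WUBI: Dict[str, str] = {"张": "xt", "章": "ujj", "三": "dg"}


def encode(s: str, table: Dict[str, str]) -> str:
    """Replace each character found in the table by its code;
    characters without an entry (e.g. Western ones) pass through."""
    return "".join(table.get(ch, ch) for ch in s)


def fused_similarity(a: str, b: str,
                     weights: Tuple[float, float, float] = (1 / 3, 1 / 3, 1 / 3)) -> float:
    """Assumed fusion: weighted average of character-level,
    Pinyin-level, and Wubi-level similarities."""
    sims = (
        similarity(a, b),
        similarity(encode(a, PINYIN), encode(b, PINYIN)),
        similarity(encode(a, WUBI), encode(b, WUBI)),
    )
    return sum(w * s for w, s in zip(weights, sims))
```

In this sketch, two names that differ only by a homophone (such as 张三 and 章三) score higher under the fused measure than under character-level edit distance alone, which is the kind of recall gain the abstract describes. The weights would need tuning against labeled duplicate records.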
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2010, No. 12, pp. 4523-4525 (3 pages)
Funding
China Postdoctoral Science Foundation (20090461425)
Jiangsu Planned Projects for Postdoctoral Research Funds (0901014B)
Keywords
data cleaning
approximately duplicate records
string matching
string similarity
edit distance