Abstract
This paper analyzes the basic sorted-neighborhood method (SNM), points out its shortcomings, and proposes an improved version of it. A variable-step flexible window dynamically changes the size of the detection window, which avoids missed matches and reduces unnecessary comparisons. A dynamic rank adjustment method adjusts the rank of each field according to record similarity and then converts the field ranks into weights with the rank-based weights method, addressing the subjectivity and inaccuracy of fixed, manually assigned field weights. Tests on data from a real information system verify the effectiveness and advantages of the proposed method. Both techniques also apply to most approximately duplicate record detection methods based on sort-merge, improving their efficiency and accuracy.
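For orientation, the baseline that the paper improves on can be sketched as follows. This is a minimal illustration of the basic SNM only (sort by a key, then compare records inside a fixed-size sliding window), not the paper's variable-step window or rank-based weighting. The function names `similarity` and `basic_snm`, the use of `difflib.SequenceMatcher` as the similarity measure, and the sample records are assumptions made for this example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Stand-in string similarity in [0, 1] (assumption: the paper's
    field-weighted record similarity is not reproduced here)."""
    return SequenceMatcher(None, a, b).ratio()


def basic_snm(records, key, window=3, threshold=0.8):
    """Return index pairs (i, j), i < j, of records judged approximate duplicates."""
    # Step 1: sort record indices by the sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: slide a fixed-size window over the sorted list and compare
    # each record only with the (window - 1) records that precede it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return pairs


# Hypothetical sample data for illustration only.
records = ["John Smith", "Jon Smith", "Mary Jones", "Mary Jonas", "Zed Young"]
duplicates = basic_snm(records, key=str.lower, window=2, threshold=0.8)
```

With a fixed `window`, duplicates that sort more than `window - 1` positions apart are never compared, while dissimilar neighbors are always compared. These are the missed matches and unnecessary comparisons that the abstract's variable-step window is meant to reduce.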
Source
《计算机应用研究》
CSCD
Peking University Core Journal
2013, No. 9, pp. 2736-2739 (4 pages)
Application Research of Computers
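Funding
China Postdoctoral Science Foundation Special Funded Project (201003797)
Jiangsu Province Postdoctoral Research Funding Program (0901014B)
PLA University of Science and Technology Pre-research Foundation Project (20110604)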
Keywords
data cleaning
approximately duplicate records
variable-step flexible window
dynamic rank adjustment
basic sorted-neighborhood method (SNM)