Improved SNM Algorithm Based on Fuzzy Comprehensive Evaluation and Length Filtering (cited by: 1)
Abstract: Improving the data quality of a database requires cleaning approximately duplicated records, and the basic sorted-neighborhood method (SNM) is one of the commonly used cleaning algorithms. To address the excessive subjectivity of attribute-weight calculation during duplicate detection, this paper proposes determining attribute weights through fuzzy comprehensive evaluation by multiple users, which assesses the importance of each attribute more objectively. Building on these weights, the algorithm computes the length ratio of two records and uses it to exclude record pairs that cannot be approximate duplicates, which reduces the number of comparisons and improves detection efficiency. Experimental results show that the improved algorithm achieves better recall, precision, and time efficiency.
Authors: GUO Wenlong (郭文龙), DONG Jianhuai (董建怀) (College of Electronics and Information Science, Fujian Jiangxia University, Fuzhou 350108, China)
Source: Journal of Wuhan Institute of Technology (《武汉工程大学学报》), CAS, 2017, No. 4, pp. 403-408 (6 pages)
Funding: Natural Science Foundation of Fujian Province (2015J01653); Fujian Jiangxia University Young Research Talent Cultivation Fund (JXZ2014011)
Keywords: approximately duplicated records; fuzzy comprehensive evaluation; attribute; length filtering; SNM algorithm
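
The abstract describes the pipeline only in words, so the following is a minimal Python sketch of the general idea, not the authors' implementation. Several parts are assumptions made purely for illustration: plain averaging of several users' ratings stands in for the fuzzy comprehensive evaluation, the weighted length ratio is taken as a weighted mean of per-attribute min/max length ratios, `difflib.SequenceMatcher` stands in for the field-similarity measure, and all function names, the window size, and both thresholds are hypothetical.

```python
from difflib import SequenceMatcher


def aggregate_weights(ratings):
    """Average each attribute's score over all users, then normalize to sum 1.

    Assumption: simple averaging stands in for the paper's fuzzy
    comprehensive evaluation, whose exact formulas the abstract omits.
    """
    n_attrs = len(ratings[0])
    means = [sum(r[i] for r in ratings) / len(ratings) for i in range(n_attrs)]
    total = sum(means)
    return [m / total for m in means]


def weighted_length_ratio(a, b, weights):
    """Weighted mean of per-attribute min/max length ratios (assumed form)."""
    ratio = 0.0
    for x, y, w in zip(a, b, weights):
        longest = max(len(x), len(y))
        # Two empty fields count as a perfect length match.
        ratio += w * (min(len(x), len(y)) / longest if longest else 1.0)
    return ratio


def weighted_similarity(a, b, weights):
    """Weighted sum of per-attribute string similarities."""
    return sum(
        w * SequenceMatcher(None, x, y).ratio() for x, y, w in zip(a, b, weights)
    )


def snm_duplicates(records, weights, key, window=3,
                   len_threshold=0.6, sim_threshold=0.85):
    """Sorted-neighborhood pass with a length-ratio pre-filter.

    Returns (duplicate index pairs, number of full similarity comparisons).
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs, comparisons = [], 0
    for pos, i in enumerate(order):
        # Compare each record only with the next window - 1 sorted neighbors.
        for j in order[pos + 1:pos + window]:
            # Cheap filter: skip pairs whose lengths differ too much.
            if weighted_length_ratio(records[i], records[j], weights) < len_threshold:
                continue
            comparisons += 1
            if weighted_similarity(records[i], records[j], weights) >= sim_threshold:
                pairs.append((min(i, j), max(i, j)))
    return pairs, comparisons
```

Calling `snm_duplicates(records, aggregate_weights(ratings), key=lambda r: r[0].lower())` on a list of string tuples returns the detected pairs together with the number of full comparisons actually performed, which makes the saving from the length filter directly observable.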
  • Related literature: references 12; second-level references 166; co-citing literature 77; co-cited literature 8; citing literature 1