Abstract
Database integration produces large numbers of approximately duplicate records, and field matching algorithms are among the main methods for detecting and cleaning them. To address the excessive subjectivity of the grading method for determining attribute weights, an improved detection method based on two-stage fuzzy evaluation is proposed. First, the attributes are evaluated with the grading method, and some unimportant, low-grade attributes are removed. Second, a further fuzzy evaluation is performed on the remaining attributes, and the attribute weights are obtained by averaging the resulting grades. Finally, the data set is partitioned into groups, and approximately duplicate records are detected within each group in parallel. Theoretical analysis and experimental results show that the method not only improves running efficiency but also further improves the precision and recall of duplicate detection.
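The pipeline summarised in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not give the grade scale, the weight formula, the grouping key, or the field similarity measure, so everything below is an assumption made for illustration. Grade 1 is taken to mean "most important", an attribute's weight is assumed proportional to `max_grade + 1 - mean_grade`, fields are compared with `difflib.SequenceMatcher`, and the function names (`select_attributes`, `attribute_weights`, `record_similarity`, `detect_duplicates`) are hypothetical. The groups are processed serially here; because they are independent, they could also be handed to separate workers.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def select_attributes(first_grades, cutoff):
    """First pass: keep attributes graded at most `cutoff` (assumed: 1 = most important)."""
    return [attr for attr, grade in first_grades.items() if grade <= cutoff]


def attribute_weights(second_grades, max_grade):
    """Second pass: average each attribute's grades and normalise into weights.

    Assumed formula: score = max_grade + 1 - mean grade, so better (lower)
    grades give larger weights; the weights sum to 1.
    """
    scores = {}
    for attr, grades in second_grades.items():
        mean_grade = sum(grades) / len(grades)
        scores[attr] = max_grade + 1 - mean_grade
    total = sum(scores.values())
    return {attr: score / total for attr, score in scores.items()}


def record_similarity(rec_a, rec_b, weights):
    """Weighted sum of per-field string similarities (each in [0, 1])."""
    return sum(
        weight * SequenceMatcher(None, rec_a[attr], rec_b[attr]).ratio()
        for attr, weight in weights.items()
    )


def detect_duplicates(records, weights, group_key, threshold):
    """Group records by `group_key`, then compare pairs only within each group."""
    groups = defaultdict(list)
    for index, record in enumerate(records):
        groups[group_key(record)].append(index)

    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    # Hypothetical data: three attributes, three evaluators in the second pass.
    kept = select_attributes({"name": 1, "city": 2, "note": 4}, cutoff=2)
    weights = attribute_weights({"name": [1, 1, 2], "city": [2, 3, 2]}, max_grade=3)
    records = [
        {"name": "John Smith", "city": "Boston", "note": "a"},
        {"name": "Jon Smith", "city": "Boston", "note": "b"},
        {"name": "Jane Doe", "city": "Denver", "note": "c"},
    ]
    print(kept, weights)
    print(detect_duplicates(records, weights, lambda r: r["city"][0], threshold=0.85))
```

Dropping low-grade attributes before the second pass reduces the number of field comparisons per record pair, and grouping reduces the number of pairs compared, which is consistent with the efficiency gain the abstract claims.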
Source
《江苏师范大学学报(自然科学版)》
CAS
2016, No. 1, pp. 39-42 (4 pages)
Journal of Jiangsu Normal University:Natural Science Edition
Funding
Science and Technology Project of the Education Department of Fujian Province (JB14129)
Keywords
approximately duplicate records
attribute
grade
weight
detection
fuzzy evaluation