Abstract
In detecting and identifying approximately duplicate records in large data sets, detection precision is low and detection cost is high because the data sources are heterogeneous and each record is described by too many feature attributes. To address these problems, an optimal feature selection method based on grouped fuzzy clustering is proposed. The method first processes the attributes of the records within each group, which effectively reduces the attribute dimensionality and yields representative records for each group; it then applies a similarity comparison computation to detect approximately duplicate records within each group. Theoretical analysis and experiments show that the method achieves relatively high identification precision and detection efficiency, and that it handles the identification of approximately duplicate records in large data sets well.
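The within-group comparison step described in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: the abstract does not give the similarity formula, the grouping rule, or the fuzzy clustering procedure, so everything below is an assumption made for illustration. The attribute weights stand in for the outcome of feature selection, `difflib.SequenceMatcher` stands in for the similarity measure, and the names `field_similarity`, `record_similarity`, `find_duplicates`, the grouping by `city`, and the threshold `0.85` are all hypothetical.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def field_similarity(a: str, b: str) -> float:
    """String similarity in [0, 1]; difflib is a stand-in, not the paper's measure."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def record_similarity(r1: dict, r2: dict, weights: dict) -> float:
    """Weighted average of per-attribute similarities over the selected attributes."""
    total = sum(weights.values())
    return sum(w * field_similarity(r1[k], r2[k]) for k, w in weights.items()) / total


def find_duplicates(records, group_key, weights, threshold=0.85):
    """Compare records only within the same group; return index pairs at or above threshold."""
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[group_key(rec)].append(idx)
    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if record_similarity(records[i], records[j], weights) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical data: only "name" and "phone" are kept as selected attributes.
records = [
    {"name": "Zhang Wei", "city": "Beijing", "phone": "010-1234567"},
    {"name": "Zhang Wei", "city": "Beijing", "phone": "010-1234568"},
    {"name": "Li Na", "city": "Beijing", "phone": "010-7654321"},
    {"name": "Zhang Wei", "city": "Shanghai", "phone": "021-1234567"},
]
weights = {"name": 0.6, "phone": 0.4}  # assumed weights, not taken from the paper
duplicates = find_duplicates(records, lambda r: r["city"], weights)
```

Restricting comparisons to records in the same group is what keeps the cost down: the number of pairwise comparisons grows with the square of the group size rather than of the whole data set, and dropping low-value attributes shortens each comparison.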
Source
《传感器与微系统》
CSCD
PKU Core Journal (北大核心)
2011, No. 2, pp. 37-40 (4 pages)
Transducer and Microsystem Technologies
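Funding
National Key Technology R&D Program of China (2008BAC35B05)
Teachers' Scientific Research Fund of the China Earthquake Administration (20090105, 20090301, 20090101)
Natural Science Research Program of the Education Department of Hebei Province (Z2009407)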
Keywords
optimal feature selection (特征优选)
approximately duplicate records (相似重复记录)
fuzzy clustering (模糊聚类)
similarity (相似度)