Detection of approximately duplicated records based on entropy feature selection grouping clustering

Cited by: 4
Abstract: Existing methods for detecting approximately duplicated records cannot handle large data volumes effectively. To address this, an algorithm based on entropy feature selection grouping clustering (FSGC) is proposed. The method constructs an entropy metric from the similarity between objects, uses it to evaluate the importance of each attribute in the original data set, and selects a key attribute set. The data are then partitioned by the key attributes into disjoint small data sets, and approximately duplicated records are detected within each small data set using the density-based spatial clustering of applications with noise (DBSCAN) algorithm. Theoretical analysis and experimental results show that the method achieves high identification precision and detection efficiency, and that it can effectively identify approximately duplicated records in massive data sets.
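The three stages named in the abstract (entropy-based attribute ranking, partitioning by a key attribute, DBSCAN within each partition) can be sketched roughly as follows. This is a minimal illustration and not the paper's implementation: the abstract does not give the entropy formula, so the pairwise form `-(s ln s + (1 - s) ln(1 - s))` is an assumption, as are the `difflib` string similarity, the "lowest single-attribute entropy" ranking rule, partitioning on exact key values, and the `eps` and `min_pts` defaults. All function names and the `SAMPLE` records are hypothetical.

```python
import math
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations

NOISE = -1

def field_sim(a, b):
    """String similarity in [0, 1] (assumed measure: difflib ratio)."""
    return SequenceMatcher(None, a, b).ratio()

def record_sim(r1, r2, attrs):
    """Mean field similarity over the chosen attribute indices."""
    return sum(field_sim(r1[k], r2[k]) for k in attrs) / len(attrs)

def entropy(records, attrs):
    """Assumed metric: sum over record pairs of -(s ln s + (1-s) ln(1-s))."""
    total = 0.0
    for r1, r2 in combinations(records, 2):
        s = record_sim(r1, r2, attrs)
        if 0.0 < s < 1.0:  # pairs with s = 0 or s = 1 contribute nothing
            total -= s * math.log(s) + (1 - s) * math.log(1 - s)
    return total

def rank_attributes(records, n_attrs):
    """Attribute indices, lowest single-attribute entropy first (assumed rule)."""
    return sorted(range(n_attrs), key=lambda k: entropy(records, [k]))

def dbscan(points, dist, eps, min_pts):
    """Plain DBSCAN; returns one label per point, NOISE for outliers."""
    labels = [None] * len(points)

    def neighbours(i):
        return [j for j in range(len(points)) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(len(points)):
        if labels[i] is not None:
            continue
        seeds = neighbours(i)  # includes i itself
        if len(seeds) < min_pts:
            labels[i] = NOISE
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reachable from a core point becomes a border point
            if labels[j] is not None:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels

def detect_duplicates(records, eps=0.2, min_pts=2):
    """Partition on the top-ranked attribute, then cluster each partition."""
    attrs = list(range(len(records[0])))
    key = rank_attributes(records, len(attrs))[0]
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)  # assumption: partition on exact key value

    def dist(a, b):
        return 1.0 - record_sim(a, b, attrs)

    duplicates = []
    for group in groups.values():
        clusters = defaultdict(list)
        for rec, label in zip(group, dbscan(group, dist, eps, min_pts)):
            if label != NOISE:
                clusters[label].append(rec)
        duplicates.extend(clusters.values())
    return duplicates

# Hypothetical toy records: (name, city)
SAMPLE = [
    ("John Smith", "Boston"),
    ("Jon Smith", "Boston"),
    ("Mary Jones", "Denver"),
    ("Mary Jones", "Denver"),
    ("Alan Turing", "Boston"),
]
print(detect_duplicates(SAMPLE))
```

On this toy input the city column has the lower entropy, so records are partitioned by city, and the two Smith records and the two Jones records come back as duplicate clusters while the remaining record is treated as noise. Partitioning first means the pairwise distance computation in DBSCAN runs only within each small group, which is where the efficiency gain described in the abstract would come from.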
Source: Transducer and Microsystem Technologies (《传感器与微系统》), CSCD, PKU Core, 2011, No. 11, pp. 135-137, 141 (4 pages)
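Funding: National Natural Science Foundation of China (60964001); Guangxi Natural Science Foundation (09910192); Director's Fund of the Guangxi Information and Communication Laboratory (01902)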
Keywords: approximately duplicated records; entropy; feature selection grouping clustering (FSGC)
