An Optimal Feature Selection Method for Detecting Approximately Duplicate Records (cited by: 1)
Abstract: When approximately duplicate records are detected and identified in large data sets, the data sources are heterogeneous and each record is described by too many feature attributes, so detection precision is low and the cost of detection is high. To address these problems, an optimal feature selection method based on grouped fuzzy clustering is proposed. The method first processes the attributes of the grouped records, which effectively reduces the dimensionality of the record attributes and yields representative records for each group. It then detects approximately duplicate records within each group using a similarity comparison computation. Theoretical analysis and experiments show that the method achieves high identification precision and detection efficiency, and that it handles the identification of approximately duplicate records in large data sets well.
Source: Transducer and Microsystem Technologies (《传感器与微系统》), indexed in CSCD and the Peking University Core Journals list, 2011, No. 2, pp. 37-40 (4 pages)
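Funding: National Key Technology R&D Program of China (2008BAC35B05); Teachers' Research Fund of the China Earthquake Administration (20090105, 20090301, 20090101); Natural Science Research Program of the Hebei Provincial Department of Education (Z2009407)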
Keywords: optimal feature selection; approximately duplicate records; fuzzy clustering; similarity
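The abstract describes a two-stage idea: restrict comparison to groups of records and a reduced set of attributes, then compare similarity within each group. The record gives no algorithmic details, so the following is only a minimal sketch of that general pattern, not the paper's method. It swaps in a plain grouping key for the fuzzy clustering step and Python's `difflib.SequenceMatcher` for the similarity measure. The function names, the sample records, the chosen fields, and the 0.85 threshold are all assumptions made for illustration.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def record_similarity(a, b, fields):
    """Mean string similarity (0.0 to 1.0) over the selected fields only."""
    scores = [
        SequenceMatcher(None, a[f].lower(), b[f].lower()).ratio()
        for f in fields
    ]
    return sum(scores) / len(scores)


def find_duplicates(records, group_key, fields, threshold=0.85):
    """Group records by group_key, then compare pairs only within a group.

    This is a stand-in for the grouped approach in the abstract: a simple
    key replaces fuzzy clustering, and `fields` plays the role of the
    reduced attribute set. Both are assumptions for this sketch.
    """
    groups = defaultdict(list)
    for idx, rec in enumerate(records):
        groups[group_key(rec)].append(idx)

    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if record_similarity(records[i], records[j], fields) >= threshold:
                pairs.append((i, j))
    return pairs


# Hypothetical sample data; record 1 repeats record 0 with an extra space.
records = [
    {"name": "Zhang Wei", "city": "Beijing", "phone": "010-5550101"},
    {"name": "Zhang  Wei", "city": "Beijing", "phone": "010-5550199"},
    {"name": "Li Na", "city": "Beijing", "phone": "010-5550102"},
    {"name": "Zhang Wei", "city": "Shanghai", "phone": "021-5550101"},
]

duplicates = find_duplicates(
    records, group_key=lambda r: r["city"], fields=["name"]
)
print(duplicates)  # [(0, 1)]
```

Grouping first means the pairwise comparison runs only inside each group instead of across the whole data set, which is where the cost saving in this kind of approach comes from. The trade-off is that duplicates placed in different groups, such as records 0 and 3 here, are never compared.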
