
An Online Data Cleaning Method (cited by: 2)
Abstract: A new method for online data cleaning is presented. First, each record string in a reference table known to be clean is mapped to a point in a high-dimensional metric space under the Manhattan distance. Next, the points are partitioned by clustering, and the points in each partition are indexed with a B+ tree, which turns a query in the high-dimensional space into a range query in one-dimensional space. For each incoming tuple of the input table, a branch-and-bound search over the index finds its K nearest neighbors (KNN) in the reference table, which identifies the K reference records that best match it. Theoretical analysis and experiments indicate that this is an effective approach to online data cleaning.
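The pipeline in the abstract (embed, partition, index by distance, bounded KNN search) can be sketched roughly as follows. This is an illustrative sketch under several assumptions, not the authors' implementation. The abstract does not say how strings are mapped to points, so `embed` uses hashed character-bigram counts as a hypothetical stand-in. The clustering is a crude farthest-first choice of centers. A sorted Python list searched with `bisect` stands in for the B+ tree. The one-dimensional key (partition number times a stride, plus the distance to the partition center) follows the general distance-indexing idea. All names (`embed`, `DistanceIndex`, `n_parts`, `stride`) are invented for this example.

```python
import bisect


def embed(s, dims=16):
    """Hypothetical mapping of a record string to a point: hashed bigram counts."""
    v = [0] * dims
    t = s.lower()
    for a, b in zip(t, t[1:]):
        v[(ord(a) * 31 + ord(b)) % dims] += 1
    return tuple(v)


def manhattan(p, q):
    return sum(abs(x - y) for x, y in zip(p, q))


class DistanceIndex:
    """Sorted 1-D keys standing in for a B+ tree (assumes a non-empty reference table)."""

    def __init__(self, records, n_parts=2):
        self.records = list(records)
        self.points = [embed(r) for r in self.records]
        # Crude partitioning: pick centers by farthest-first traversal.
        self.centers = [self.points[0]]
        while len(self.centers) < min(n_parts, len(set(self.points))):
            self.centers.append(max(
                self.points,
                key=lambda p: min(manhattan(p, c) for c in self.centers)))
        # Assign each point to its nearest center and track partition radii.
        self.radius = [0] * len(self.centers)
        entries = []
        for idx, p in enumerate(self.points):
            d, i = min((manhattan(p, c), i) for i, c in enumerate(self.centers))
            self.radius[i] = max(self.radius[i], d)
            entries.append((i, d, idx))
        # The stride exceeds every radius, so partitions own disjoint key ranges.
        self.stride = max(self.radius) + 1
        self.index = sorted((i * self.stride + d, idx) for i, d, idx in entries)
        self.keys = [key for key, _ in self.index]

    def knn(self, query, k=1):
        q = embed(query)
        dq = [manhattan(q, c) for c in self.centers]
        # Once r reaches this value, every partition is scanned completely.
        r_limit = max(d + rad for d, rad in zip(dq, self.radius))
        r = 1
        while True:
            cand = set()
            for i, d in enumerate(dq):
                if d - r > self.radius[i]:
                    continue  # bound: the query ball cannot reach this partition
                lo = i * self.stride + max(0, d - r)
                hi = i * self.stride + min(self.radius[i], d + r)
                a = bisect.bisect_left(self.keys, lo)
                b = bisect.bisect_right(self.keys, hi)
                cand.update(idx for _, idx in self.index[a:b])
            best = sorted((manhattan(q, self.points[j]), j) for j in cand)[:k]
            # By the triangle inequality, every unscanned point is farther than r.
            if (len(best) == k and best[-1][0] <= r) or r >= r_limit:
                return [(self.records[j], d) for d, j in best]
            r *= 2  # widen the one-dimensional range query and retry
```

For example, `DistanceIndex(reference_records).knn("Jon Smith, 12 Main St", k=2)` returns the two reference records closest to the incoming string together with their Manhattan distances. For brevity the sketch rescans the widened range on each round; a real B+ tree implementation would extend the previous scan instead.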
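Source: Journal of Applied Sciences (《应用科学学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2005, No. 3, pp. 292-296 (5 pages)
Funding: Jiangsu Province Tenth Five-Year Plan High-Tech Funded Project (BG2001013)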
