
Research on Eliminating Duplicate Records Based on SNM Improved Algorithm
Cited by: 9
Abstract: High-quality data is the most important factor in building a data warehouse; low-quality data can adversely affect decision making. Approximately duplicate records originating from different data sources are one of the main data quality problems in data warehouse construction, and eliminating as many of them as possible before the source data enters the warehouse can greatly improve data quality. To this end, the authors compare existing algorithms for eliminating approximately duplicate records, improve the SNM algorithm, and experimentally compare the traditional SNM method with the improved SNM algorithm. The experimental results show that the improved SNM algorithm has a clear advantage in eliminating approximately duplicate records.
Source: Journal of Chongqing University of Technology: Natural Science (CAS), 2016, No. 4, pp. 91-96 (6 pages)
Funding: National Natural Science Foundation of China (71473185)
Keywords: SNM algorithm; SNM improved algorithm; approximately duplicate records elimination
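This record does not describe the improved algorithm itself, so the following is only a minimal sketch of the traditional sorted-neighborhood method (SNM) that the abstract uses as its baseline: sort the records by a key, slide a fixed-size window over the sorted list, and compare each record only with its neighbors inside the window. The function names, the `difflib`-based similarity measure, the window size, the threshold, and the sample records are all assumptions made for illustration; none of them come from the paper.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity ratio in [0, 1] between two strings (assumed measure)."""
    return SequenceMatcher(None, a, b).ratio()


def snm_duplicates(records, key, window=3, threshold=0.85):
    """One pass of the traditional sorted-neighborhood method.

    records   : list of strings, one per record
    key       : function mapping a record to its sort key (hypothetical choice)
    window    : fixed window size w (assumed value)
    threshold : similarity at or above which a pair is flagged (assumed value)
    Returns a sorted list of (i, j) index pairs into `records`, with i < j.
    """
    # Step 1: sort record indices by key so likely duplicates become neighbors.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: compare each record with the (window - 1) records preceding it.
    for pos, i in enumerate(order):
        for j in order[max(0, pos - window + 1):pos]:
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data: two pairs of near-duplicate records.
records = [
    "John Smith, 12 Oak Street, Boston",
    "Mary Jones, 4 Elm Road, Denver",
    "Jon Smith, 12 Oak Street, Boston",
    "Mary Jones, 4 Elm Rd, Denver",
]
duplicate_pairs = snm_duplicates(records, key=lambda r: r.lower())
print(duplicate_pairs)
```

With a fixed window, a duplicate pair is missed whenever sorting places the two records more than `window - 1` positions apart, and the result depends heavily on the choice of sort key. These are the usual limitations that improved SNM variants aim to address.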

