
Method for detecting approximately duplicate database records in big data environment

Cited by: 6
Abstract: Approximately duplicate records in big data environments distort the results of statistical analysis and therefore need to be filtered out. This paper reviews current research on approximate duplicate record detection and, on that basis, proposes attribute weighting: each attribute is assigned a weight, and records are sorted and grouped according to those weights. Because the values of some fields correspond one to one and therefore receive the same weight, the paper introduces the concept of synonymous attributes (the "synonymous property"); excluding some synonymous attributes from the original dataset shrinks it and makes duplicate detection more efficient. A method for deciding whether two records are approximate duplicates is then given. To cope with the challenge that large datasets pose for duplicate detection, the dataset is split into a number of smaller datasets and processed with MapReduce: it is grouped by the values of the higher-weight attributes and divided into map tasks that are processed separately. Experimental results show that the method effectively improves the efficiency of approximate duplicate record detection.
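Author: Yin Xiuye (殷秀叶)
Source: Journal of Wuhan Institute of Technology (《武汉工程大学学报》), CAS, 2014, No. 9, pp. 66-69 (4 pages)
Funding: National Natural Science Foundation of China, Young Scientists Fund (61103143); Zhoukou Normal University Youth Research Fund (zknuc0215)
Keywords: approximately duplicated records; big data; MapReduce; synonymous property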
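The pipeline outlined in the abstract (drop synonymous attributes, group records by the value of the highest-weight attribute, then compare records within each group) can be sketched roughly as below. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not specify the similarity measure, the weights, or the decision threshold, so `difflib.SequenceMatcher`, the weight values, the 0.85 threshold, and all function and field names here are hypothetical. The grouping step stands in for the map phase and the per-group comparison for the work done by each task; no actual MapReduce framework is used.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def is_one_to_one(records, a, b):
    """True if every value of attribute a pairs with exactly one value of b, and vice versa."""
    forward, backward = {}, {}
    for r in records:
        if forward.setdefault(r[a], r[b]) != r[b]:
            return False
        if backward.setdefault(r[b], r[a]) != r[a]:
            return False
    return True


def find_synonymous(records, attrs):
    """Return attributes that are one-to-one with an earlier attribute in attrs."""
    redundant = set()
    for a, b in combinations(attrs, 2):
        if a not in redundant and b not in redundant and is_one_to_one(records, a, b):
            redundant.add(b)
    return redundant


def similarity(r1, r2, weights):
    """Weighted average of per-attribute string similarity (assumed measure)."""
    score = sum(
        w * SequenceMatcher(None, str(r1[a]).lower(), str(r2[a]).lower()).ratio()
        for a, w in weights.items()
    )
    return score / sum(weights.values())


def detect_duplicates(records, weights, threshold=0.85):
    # Order attributes by weight and drop synonymous ones to shrink the comparison.
    attrs = sorted(weights, key=weights.get, reverse=True)
    redundant = find_synonymous(records, attrs)
    kept = {a: weights[a] for a in attrs if a not in redundant}

    # "Map" step: group record indices by the normalized value of the heaviest attribute.
    block_attr = max(kept, key=kept.get)
    groups = defaultdict(list)
    for idx, r in enumerate(records):
        groups[str(r[block_attr]).strip().lower()].append(idx)

    # Per-group step: compare only records that share a group.
    pairs = []
    for members in groups.values():
        for i, j in combinations(members, 2):
            if similarity(records[i], records[j], kept) >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


# Hypothetical sample data: "zip" is one-to-one with "city", so it is dropped.
records = [
    {"surname": "Smith", "given": "John", "city": "Boston", "zip": "02108"},
    {"surname": "smith", "given": "Jon", "city": "Boston", "zip": "02108"},
    {"surname": "Smith", "given": "Mary", "city": "Chicago", "zip": "60601"},
    {"surname": "Lee", "given": "Anna", "city": "Boston", "zip": "02108"},
    {"surname": "Lee", "given": "Anna", "city": "Chicago", "zip": "60601"},
]
weights = {"surname": 0.4, "given": 0.3, "city": 0.2, "zip": 0.1}

print(detect_duplicates(records, weights))  # [(0, 1)]
```

One consequence of grouping on a single attribute is that records whose highest-weight attribute differs beyond simple case or whitespace normalization land in different groups and are never compared; a real deployment would need to choose the grouping key with that trade-off in mind.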

