XML Data Duplicate Detection Based on Hadoop Platform
Abstract: XML data is increasingly used for information exchange and integration, and its data quality has drawn growing attention. Entity resolution (ER) is a key technique for solving the problems caused by poor data quality. To overcome the shortcomings of existing methods and detect duplicate objects efficiently on massive XML data sets, this paper proposes an ER-based duplicate detection algorithm for XML documents on the Hadoop platform. The algorithm treats all tag nodes uniformly as attributes and uses entities to describe attributes. By comparing attributes, it quickly finds all entity objects that are identical on some attributes, and it uses the Hadoop framework's strength in processing massive data to run the comparison in parallel. Experiments show that the method performs well in scalability, flexibility, and efficiency.
Authors: Li Zhenxing (李振兴), Liu Bo (刘波)
Source: Computer Systems & Applications (《计算机系统应用》), 2013, No. 11, pp. 195-199 (5 pages)
Keywords: XML; data quality; duplicate detection; Hadoop; distributed
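
The abstract describes the approach only at a high level, so the following is a minimal sketch of the general idea rather than the authors' algorithm. It assumes that each child of the document root is an entity with an `id` attribute, that every non-empty leaf tag node counts as an attribute, and that two entities become candidate duplicates when they agree on at least one attribute value. The map and reduce steps are simulated in memory. No Hadoop API is used, and the function names and sample data are hypothetical.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict
from itertools import combinations

# Hypothetical sample document: each <book> is treated as one entity.
SAMPLE = """<library>
  <book id="b1"><title>Data Cleaning</title><year>2004</year></book>
  <book id="b2"><title>Data Cleaning</title><year>2005</year></book>
  <book id="b3"><title>XML Basics</title><year>2004</year></book>
</library>"""


def map_entity(entity):
    """Map step: emit ((tag, value), entity_id) for every non-empty leaf tag node."""
    entity_id = entity.get("id")
    for node in entity.iter():
        text = (node.text or "").strip()
        if node is not entity and len(node) == 0 and text:
            yield (node.tag, text.lower()), entity_id


def reduce_attribute(key, entity_ids):
    """Reduce step: each pair of entities sharing this attribute value is a candidate."""
    for a, b in combinations(sorted(set(entity_ids)), 2):
        yield (a, b), key[0]


def find_candidates(xml_text):
    root = ET.fromstring(xml_text)
    groups = defaultdict(list)  # stands in for the shuffle phase
    for entity in root:
        for key, entity_id in map_entity(entity):
            groups[key].append(entity_id)
    shared = defaultdict(set)  # entity pair -> tags on which the pair agrees
    for key, ids in groups.items():
        for pair, tag in reduce_attribute(key, ids):
            shared[pair].add(tag)
    return dict(shared)


candidates = find_candidates(SAMPLE)
for pair, tags in sorted(candidates.items()):
    print(pair, sorted(tags))
```

Grouping by (tag, value) means only entities that share a key are ever compared, and that grouping is what a MapReduce shuffle would distribute across nodes. A real deployment would also need a similarity threshold over the shared attributes, which the abstract does not specify.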