
Research on Duplicate Entity Identification Methods in Multi-Web-Data-Source Environments (多Web数据源环境下的重复实体识别方法研究)  Cited by: 3

A Duplicate Web Entity Identification Approach Based on Iterative Training
Abstract: The large number of data sources accessible on the Web makes it much easier for people to obtain useful information. A necessary step in integrating Web data sources is to accurately identify duplicate Web entities, which are represented in different forms across sources. Existing work on duplicate entity identification operates mainly between two data sources. Because Web data sources are so numerous, these methods cannot be applied to duplicate entity identification across multiple Web data sources. To address this problem, the paper proposes an iterative-training-based method for identifying duplicate Web entities, which works across multiple Web data sources using a relatively small training set. Extensive experiments on multiple Web data sources in two different domains, books and computer products, demonstrate the effectiveness of the proposed method.
Authors: LIU Wei (刘伟), XIAO Jianguo (肖建国)
Source: Journal of Frontiers of Computer Science and Technology (《计算机科学与探索》), CSCD, 2010, No. 7, pp. 599-607 (9 pages)
Funding: National Natural Science Foundation of China No. 60875033; China Postdoctoral Science Foundation No. 20080440256, 200902014
Keywords: Web entity; duplicate entity identification; Web data integration; iterative training
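
The abstract does not describe the algorithm itself, so the following is only a minimal sketch of what iterative training over several sources could look like, not the authors' method. It assumes a generic self-training loop: a similarity threshold is fitted on a small labeled seed set, confidently classified cross-source pairs are added to the training set, and the threshold is refitted. All names (`similarity`, `cross_source_pairs`, `fit_threshold`, `iterative_training`), the `margin` parameter, and the sample records are hypothetical.

```python
from difflib import SequenceMatcher
from itertools import combinations, product


def similarity(a, b):
    """Case-insensitive string similarity in [0, 1]."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def cross_source_pairs(sources):
    """Yield every pair of records taken from two different sources."""
    for left, right in combinations(sources, 2):
        yield from product(left, right)


def fit_threshold(labeled):
    """Midpoint between mean similarity of duplicate and non-duplicate pairs.

    Assumes the labeled set holds at least one pair of each class.
    """
    pos = [similarity(a, b) for a, b, dup in labeled if dup]
    neg = [similarity(a, b) for a, b, dup in labeled if not dup]
    return (sum(pos) / len(pos) + sum(neg) / len(neg)) / 2


def iterative_training(seed, unlabeled, margin=0.2, max_rounds=5):
    """Self-training loop (an assumed stand-in for the paper's procedure).

    Each round labels the pairs whose similarity lies at least `margin`
    away from the current threshold, adds them to the training set, and
    refits the threshold. Stops when no pair is confident enough.
    """
    labeled = list(seed)
    pool = list(unlabeled)
    threshold = fit_threshold(labeled)
    for _ in range(max_rounds):
        confident, rest = [], []
        for a, b in pool:
            s = similarity(a, b)
            if abs(s - threshold) >= margin:
                confident.append((a, b, s > threshold))
            else:
                rest.append((a, b))
        if not confident:
            break
        labeled.extend(confident)
        pool = rest
        threshold = fit_threshold(labeled)
    return threshold, labeled, pool


# Hypothetical seed set: one duplicate pair and one non-duplicate pair.
SEED = [
    ("Database Systems", "database systems", True),
    ("Database Systems", "Cooking for Two", False),
]

# Hypothetical book titles from three Web data sources.
SOURCES = [
    ["Introduction to Algorithms"],
    ["introduction to algorithms", "Moby Dick"],
    ["Intro. to Algorithms (3rd ed.)"],
]

threshold, labeled, undecided = iterative_training(
    SEED, cross_source_pairs(SOURCES)
)
print("threshold:", round(threshold, 3))
for a, b, dup in labeled[len(SEED):]:
    print("duplicate" if dup else "distinct", "|", a, "|", b)
print("undecided pairs:", undecided)
```

Generating candidate pairs with `cross_source_pairs` is what makes the sketch multi-source rather than pairwise between two fixed sources; in practice the number of pairs grows quickly with the number of sources, so some form of candidate pruning would likely be needed.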