
Active transfer learning of matching query results across multiple sources (cited by: 2)

Abstract: Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Algorithmic approaches have been developed, and for most tasks supervised learning offers superior performance. However, the prohibitive cost of labeling training data remains a major obstacle to detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear with respect to the number of sources, their random sampling strategy does not adequately handle the ordinary sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework that jointly selects fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the learned classifiers can be both label-economical and flexible, even for imbalanced or quality-diverse sources. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
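Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2015, Issue 4, pp. 595-607 (13 pages). Chinese journal title: 中国计算机科学前沿(英文版) (Frontiers of Computer Science in China, English edition).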
Keywords: entity resolution, active learning, transfer learning, convex optimization
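
The intuition stated in the abstract, labeling the most informative samples jointly across sources while sharing knowledge between them, can be illustrated with a minimal sketch. This is not the paper's algorithm: it uses plain uncertainty sampling, and the names `match_probability`, `select_queries`, `shared_w`, `source_ws`, and `bias`, as well as the toy similarity features, are assumptions made up for illustration. A shared weight vector stands in for the transferred common structure, and a per-source offset stands in for source-specific adaptation.

```python
import math

def match_probability(features, shared_w, source_w, bias):
    """Logistic match score for one candidate record pair.

    shared_w is a hypothetical stand-in for knowledge common to all
    sources; source_w is a hypothetical per-source correction.
    """
    z = bias + sum(f * (s + d) for f, s, d in zip(features, shared_w, source_w))
    return 1.0 / (1.0 + math.exp(-z))

def select_queries(pools, shared_w, source_ws, bias, budget):
    """Pick the `budget` most uncertain pairs across ALL sources at once.

    Uncertainty is the distance of the match probability from 0.5;
    the smallest distances are queried first.
    """
    scored = []
    for source, pairs in pools.items():
        for idx, features in enumerate(pairs):
            p = match_probability(features, shared_w, source_ws[source], bias)
            scored.append((abs(p - 0.5), source, idx))
    scored.sort()
    return [(source, idx) for _, source, idx in scored[:budget]]

# Toy unlabeled pools: each pair is a vector of similarity features.
pools = {
    "source_a": [[0.9, 0.8], [0.5, 0.5], [0.1, 0.0]],
    "source_b": [[0.6, 0.4], [0.95, 0.9]],
}
shared_w = [2.0, 2.0]
source_ws = {"source_a": [0.0, 0.0], "source_b": [0.5, -0.5]}

queries = select_queries(pools, shared_w, source_ws, bias=-2.0, budget=2)
print(queries)
```

Because the candidates from every source are ranked in one list, the labeling budget goes to whichever source currently holds the most ambiguous pairs, rather than being split evenly or sampled at random.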
