Active transfer learning of matching query results across multiple sources (cited by: 2)

Abstract: Entity resolution (ER) is the problem of identifying and grouping different manifestations of the same real-world object. Algorithmic approaches have been developed, and most tasks achieve superior performance under supervised learning. However, the prohibitive cost of labeling training data is still a huge obstacle to detecting duplicate query records from online sources. Furthermore, the unique combinations of noisy data with missing elements make ER tasks more challenging. To address this, transfer learning has been adopted to adaptively share learned common structures of similarity scoring problems between multiple sources. Although such techniques reduce the labeling cost so that it is linear with respect to the number of sources, their random sampling strategy does not adequately handle the common sample imbalance problem. In this paper, we present a novel multi-source active transfer learning framework to jointly select fewer data instances from all sources to train classifiers with constant precision/recall. The intuition behind our approach is to actively label the most informative samples while adaptively transferring collective knowledge between sources. In this way, the classifiers that are learned can be both label-economical and flexible, even for imbalanced sources or sources of diverse quality. We compare our method with state-of-the-art approaches on real-world datasets. Our experimental results demonstrate that our active transfer learning algorithm can achieve impressive performance with far fewer labeled samples for record matching with numerous and varied sources.
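The idea of labeling only the most informative samples, chosen jointly across sources, can be illustrated with a minimal sketch. This is not the paper's algorithm: it assumes plain uncertainty sampling (pick the candidate pairs whose match score is closest to 0.5) and uses a token-level Jaccard similarity as a stand-in for a learned similarity score. The names `jaccard` and `select_queries` are hypothetical and introduced only for illustration.

```python
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity; a stand-in for a learned match score."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def select_queries(
    pools: Dict[str, List[Pair]],
    score: Callable[[str, str], float],
    budget: int,
) -> List[Tuple[str, Pair]]:
    """Return the `budget` record pairs, pooled over all sources,
    whose score is closest to 0.5 (assumed uncertainty criterion)."""
    candidates = [
        (abs(score(a, b) - 0.5), source, (a, b))
        for source, pairs in pools.items()
        for a, b in pairs
    ]
    # Stable sort on the uncertainty distance only; ties keep pool order.
    candidates.sort(key=lambda c: c[0])
    return [(source, pair) for _, source, pair in candidates[:budget]]
```

Because the candidates from every source are ranked in a single pool, a source with many ambiguous pairs would receive more of the labeling budget than a source whose pairs are clearly matches or non-matches. A real system would replace `jaccard` with the current classifier's match probability and retrain after each batch of labels.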

Authors: Jie XIN, Zhiming CUI, Pengpeng ZHAO, Tianxu HE

Affiliation: The Institute of Intelligent Information Processing and Application Provincial Key Laboratory for Computer Information Processing Technology

Source: Frontiers of Computer Science (SCIE, EI, CSCD), 2015, Issue 4, pp. 595-607, 13 pages in total. Chinese journal title: 中国计算机科学前沿（英文版）

Keywords: entity resolution, active learning, transfer learning, convex optimization

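Classification: TP183 [Automation and Computer Technology: Control Theory and Control Engineering]; TP311.134 [Automation and Computer Technology: Computer Software and Theory]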
