Deep Web Entity Matching Method Based on Twice-Merging

Abstract: To address the limited accuracy and matching efficiency of the Weighted Edge Pruning (WEP) method, a Deep Web entity matching method based on twice-merging is proposed, which introduces the concepts of self-matching and merging. First, the attribute values of each object are extracted and the objects are regrouped by attribute value, so that objects sharing an attribute value are gathered together and the data set is divided into blocks effectively. Second, the matching degree between objects within the same block is calculated and used for pruning, self-matching detection, and merging, which yields preliminary clusters. Finally, starting from these preliminary clusters, further matching relationships are discovered by using the messages passed between objects within a cluster together with the objects' attribute similarity values, which triggers a new round of cluster merging and updating. Experimental results show that, compared with the WEP method, the proposed method uses self-matching detection to distinguish matching relationships automatically and to choose an appropriate matching strategy, so the merging process is refined step by step and matching accuracy improves; in addition, blocking and pruning effectively shrink the matching space and improve the running efficiency of the system.
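The three stages named in the abstract (blocking by attribute value, in-block pruning and merging, then a second merge over the preliminary clusters) can be illustrated with a minimal sketch. This is not the paper's algorithm: the abstract does not define the self-matching test or the message-passing rule, so the sketch assumes Jaccard similarity over attribute values as the matching degree and substitutes a plain average cross-cluster similarity for the second pass. All function names, thresholds, and sample records are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def build_blocks(records):
    """Group record ids by every attribute value they contain."""
    blocks = defaultdict(set)
    for rid, attrs in records.items():
        for value in attrs.values():
            blocks[value].add(rid)
    # A block holding a single record yields no candidate pairs.
    return {v: ids for v, ids in blocks.items() if len(ids) > 1}


def similarity(a, b):
    """Jaccard similarity over attribute values (an assumed measure)."""
    va, vb = set(a.values()), set(b.values())
    return len(va & vb) / len(va | vb)


def find(parent, x):
    """Union-find lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def first_merge(records, prune_at):
    """Pass 1: compare only inside blocks, prune weak pairs, merge the rest."""
    parent = {rid: rid for rid in records}
    for ids in build_blocks(records).values():
        for a, b in combinations(sorted(ids), 2):
            if similarity(records[a], records[b]) >= prune_at:
                parent[find(parent, a)] = find(parent, b)
    clusters = defaultdict(set)
    for rid in records:
        clusters[find(parent, rid)].add(rid)
    return list(clusters.values())


def second_merge(records, clusters, merge_at):
    """Pass 2 (stand-in rule): merge clusters with high average cross similarity."""
    clusters = [set(c) for c in clusters]
    merged = True
    while merged:
        merged = False
        for i, j in combinations(range(len(clusters)), 2):
            pairs = [(a, b) for a in clusters[i] for b in clusters[j]]
            avg = sum(similarity(records[a], records[b]) for a, b in pairs) / len(pairs)
            if avg >= merge_at:
                clusters[i] |= clusters.pop(j)  # j > i, so index i stays valid
                merged = True
                break
    return clusters


# Hypothetical sample records, used only to exercise the sketch.
records = {
    "r1": {"title": "data mining", "author": "han", "year": "2006", "publisher": "mk"},
    "r2": {"title": "data mining", "author": "han", "year": "2006", "publisher": "morgan kaufmann"},
    "r3": {"title": "data mining 2e", "author": "han", "year": "2011", "publisher": "mk"},
    "r4": {"title": "deep learning", "author": "goodfellow", "year": "2016", "publisher": "mit"},
}
preliminary = first_merge(records, prune_at=0.5)
final = second_merge(records, preliminary, merge_at=0.2)
```

With these sample records the first pass groups only `r1` and `r2`; the second pass then absorbs `r3`, whose similarity to each member falls below the pruning threshold but whose average similarity to the cluster clears the looser merge threshold. Using two thresholds this way is a design choice of the sketch, not something the abstract states.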

Author: CHEN Lijun

Affiliation: Institute of Network Communication, Zhejiang Yuexiu University of Foreign Languages

Source: Journal of Computer Applications (计算机应用), indexed in CSCD and the Peking University Core Journals list, 2016, No. 8, pp. 2139-2143 (5 pages)

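Funding: National Education Information Technology Research Project (136241401); Research Project of Zhejiang Yuexiu University of Foreign Languages (N201375)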

Keywords: twice-merging; Deep Web; entity matching; cluster; similarity value

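Classification: TP391 [Automation and Computer Technology: Computer Application Technology]; TP311 [Automation and Computer Technology: Computer Software and Theory]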
