An Entity Resolution Approach Based on Attributes Weights (基于属性权重的实体解析技术)

Cited by: 5

Abstract: Entity resolution is the process of identifying and merging tuples that refer to the same real-world entity, whether they come from one data source or from several. Identifying such records efficiently and accurately has long been a goal of researchers. Most rule-based matching algorithms use every attribute as a matching attribute and give all attributes the same weight, which fails to reflect the importance of key attributes. This paper addresses the accuracy of entity resolution over relational database sources: it uses information gain and probabilistic statistics to compute a weight for each attribute that represents the attribute's importance within a record, with the aim of improving resolution accuracy. On this basis, it applies a top-k algorithm to select the best set of classifying attributes, which reduces the number of matching attributes and so speeds up entity resolution.
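The abstract names three ingredients: information-gain attribute weights, top-k attribute selection, and weighted matching. The following is a minimal sketch of how these could fit together, not the authors' actual algorithm, which the abstract does not spell out. It assumes a small set of labeled record pairs in which each attribute is reduced to a 0/1 agreement flag; the attribute names (`name`, `email`, `city`), the helper function names, and the sample data are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(values, labels):
    """Reduction in label entropy obtained by splitting on `values`."""
    total = len(labels)
    groups = {}
    for value, label in zip(values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def attribute_weights(pairs, labels, attributes):
    """Normalize per-attribute information gain so the weights sum to 1."""
    gains = {a: information_gain([p[a] for p in pairs], labels)
             for a in attributes}
    total = sum(gains.values())
    if total == 0:
        # No attribute is informative: fall back to equal weights.
        return {a: 1.0 / len(attributes) for a in attributes}
    return {a: g / total for a, g in gains.items()}


def top_k_attributes(weights, k):
    """Return the k attributes with the largest weights, best first."""
    return sorted(weights, key=weights.get, reverse=True)[:k]


def weighted_similarity(pair, weights, selected):
    """Weighted share of agreeing attributes, using only `selected`."""
    norm = sum(weights[a] for a in selected)
    if norm == 0:
        return 0.0
    return sum(weights[a] * pair[a] for a in selected) / norm


# Hypothetical training data: 1 = the two records agree on the attribute.
pairs = [
    {"name": 1, "email": 1, "city": 1},
    {"name": 1, "email": 1, "city": 0},
    {"name": 1, "email": 0, "city": 1},
    {"name": 0, "email": 0, "city": 1},
    {"name": 0, "email": 0, "city": 0},
    {"name": 0, "email": 1, "city": 1},
]
labels = [1, 1, 0, 0, 0, 1]  # 1 = same entity, 0 = different entities

weights = attribute_weights(pairs, labels, ["name", "email", "city"])
selected = top_k_attributes(weights, k=2)
score = weighted_similarity({"name": 0, "email": 1, "city": 0},
                            weights, selected)
```

In this toy data `email` agreement coincides with the match label, so it receives most of the weight, while `city` carries essentially no information and is dropped by the top-2 selection. A real system would compare `score` against a threshold chosen on held-out data.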

Authors: 甄灵敏 (Zhen Lingmin), 杨晓春 (Yang Xiaochun), 王斌 (Wang Bin), Ahmed A Hussein

Affiliation: College of Information Science and Engineering, Northeastern University (东北大学)

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2013, Issue S1, pp. 281-289 (9 pages)

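Funding: National Natural Science Foundation of China (61272178, 61173031); National Natural Science Foundation of China joint research fund for overseas and Hong Kong/Macao scholars (61129002); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2011004211028); Fundamental Research Funds for the Central Universities (N120504001, N110404015)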

Keywords: entity resolution; attribute weight; information gain; entity identification; top-k

Classification: TP311.13 [Automation and Computer Technology: Computer Software and Theory]
