An Entity Resolution Approach Based on Attribute Weights

Cited by: 5
Abstract: Entity resolution is the process of identifying and merging tuples that refer to the same real-world entity, whether they come from a single data source or from different data sources. Identifying such records both efficiently and accurately has long been a goal of researchers. Most rule-based matching algorithms use every attribute as a matching attribute and give all attributes the same weight, which fails to reflect the importance of key attributes. This paper addresses the accuracy of entity resolution over relational database sources. It uses information gain and probabilistic statistics to compute a weight for each attribute, representing that attribute's importance within a record, in order to improve resolution accuracy. On this basis, a top-k algorithm selects the best set of classification attributes, which reduces the number of matching attributes and thereby speeds up entity resolution.
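Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, and the Peking University Core Journals list), 2013, Issue S1, pp. 281-289 (9 pages)
Funding: National Natural Science Foundation of China (61272178, 61173031); NSFC Joint Research Fund for Overseas Chinese Scholars and Scholars in Hong Kong and Macao (61129002); Specialized Research Fund for the Doctoral Program of Higher Education, Ministry of Education (2011004211028); Fundamental Research Funds for the Central Universities (N120504001, N110404015)
Keywords: entity resolution; attribute weight; information gain; entity identification; top-k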
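
The abstract names three ingredients: attribute weights derived from information gain, a top-k selection of the most discriminative attributes, and weighted matching over the reduced attribute set. The paper's exact formulas are not given on this page, so the following is only a minimal sketch under stated assumptions: each candidate record pair is reduced to a per-attribute agree/disagree flag, a small labeled sample of matching and non-matching pairs is available, and weights are simply the normalized information gains. The function names, the toy attributes (`name`, `email`, `city`), and the data are all hypothetical, and the "probabilistic statistics" component mentioned in the abstract is not reproduced.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())


def information_gain(agreements, labels):
    """Drop in label entropy after splitting pairs by attribute agreement."""
    total = len(labels)
    remainder = 0.0
    for value in set(agreements):
        subset = [lab for agr, lab in zip(agreements, labels) if agr == value]
        remainder += len(subset) / total * entropy(subset)
    # Clamp tiny negative values caused by floating-point rounding.
    return max(0.0, entropy(labels) - remainder)


def attribute_weights(pairs, labels):
    """Normalize per-attribute information gain so the weights sum to 1."""
    attrs = list(pairs[0].keys())
    gains = {a: information_gain([p[a] for p in pairs], labels) for a in attrs}
    total = sum(gains.values())
    if total == 0:
        # No attribute is informative: fall back to equal weights.
        return {a: 1.0 / len(attrs) for a in attrs}
    return {a: g / total for a, g in gains.items()}


def top_k(weights, k):
    """Keep the k highest-weighted attributes and renormalize their weights."""
    chosen = sorted(weights, key=weights.get, reverse=True)[:k]
    total = sum(weights[a] for a in chosen)
    return {a: weights[a] / total for a in chosen}


def weighted_score(agreement, weights):
    """Sum the weights of the selected attributes on which a pair agrees."""
    return sum(w for a, w in weights.items() if agreement[a])


# Hypothetical labeled sample: True means the two records agree on the attribute.
pairs = [
    {"name": True,  "email": True,  "city": True},
    {"name": True,  "email": True,  "city": False},
    {"name": False, "email": True,  "city": True},
    {"name": True,  "email": False, "city": True},
    {"name": False, "email": False, "city": True},
    {"name": False, "email": False, "city": False},
]
labels = [True, True, True, False, False, False]  # True = same entity

weights = attribute_weights(pairs, labels)
selected = top_k(weights, k=2)
print(weights)
print(selected)
```

In this toy sample, agreement on `email` separates matches from non-matches perfectly, so it receives most of the weight, while `city` carries no information and is dropped by the top-k step. Scoring a pair against only the selected attributes is what reduces the per-pair comparison cost.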
