
An Entity Resolution Algorithm Based on Attribute Salience
Cited by: 1
Abstract: Entity resolution (ER) is an important step in data integration and data cleaning. In domain-specific data cleaning and integration, different attributes of an entity usually differ in their discriminating power, and computing and exploiting that power can make record similarity more accurate. Existing ER methods compute record similarity with string-based similarity algorithms or machine-learning-based algorithms, but they do not account for the differing importance of attributes. This paper therefore draws on the ideas of the SimRank and PageRank algorithms and combines them with attribute salience estimated by random sampling to propose an attribute-salience-based algorithm for computing record similarity. First, a weighted bipartite graph of attributes and record pairs is constructed to represent the relationship between attributes and record pairs. Second, an iterative algorithm for computing record similarity is derived by combining attribute salience with a graph-based similarity algorithm. Finally, a record graph is constructed to represent the matching probability between record pairs (the weight w(ri, rj) in the bipartite graph), and an improved random walk algorithm is used to estimate the probability that each record pair matches. The matching probabilities are then fed back into the weighted bipartite graph to correct the weights w(ri, rj) used by the similarity algorithm, and the process repeats until convergence. An experimental evaluation on a real estate data set shows that the proposed entity resolution algorithm based on attribute salience is more accurate than mainstream methods.
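The abstract gives no formulas, so the following is only a minimal sketch of the central idea: weighting per-attribute similarity by how well each attribute discriminates between records. The salience proxy (distinct-value ratio on a random sample), the string measure (difflib's ratio), the 0.8 threshold, the function names, and the toy real estate records are all assumptions made for illustration, not the paper's definitions. The bipartite graph, the SimRank-style iteration, and the random-walk feedback loop are left out.

```python
import random
from difflib import SequenceMatcher
from itertools import combinations


def attribute_salience(records, attrs, sample_size=None, seed=0):
    """Assumed salience proxy: the distinct-value ratio of each attribute on a
    random sample of records, normalized so the weights sum to 1."""
    rng = random.Random(seed)
    k = len(records) if sample_size is None else min(sample_size, len(records))
    sample = rng.sample(records, k)
    raw = {a: len({r[a] for r in sample}) / len(sample) for a in attrs}
    total = sum(raw.values())
    return {a: v / total for a, v in raw.items()}


def record_similarity(r1, r2, salience):
    """Salience-weighted sum of per-attribute string similarities, in [0, 1]."""
    return sum(
        w * SequenceMatcher(None, r1[a], r2[a]).ratio()
        for a, w in salience.items()
    )


def find_matches(records, salience, threshold=0.8):
    """Return index pairs (i, j) whose weighted similarity reaches the threshold."""
    return [
        (i, j)
        for i, j in combinations(range(len(records)), 2)
        if record_similarity(records[i], records[j], salience) >= threshold
    ]


# Hypothetical toy records; not taken from the paper's data set.
records = [
    {"name": "Sunny Garden", "district": "Hunnan", "developer": "Vanke"},
    {"name": "Sunny Gardens", "district": "Hunnan", "developer": "Vanke"},
    {"name": "River View", "district": "Hunnan", "developer": "Poly"},
]
attrs = ["name", "district", "developer"]

salience = attribute_salience(records, attrs)
matches = find_matches(records, salience)
print(salience)
print(matches)
```

In this toy data every record shares the same district, so that attribute receives the smallest weight, while the name, which differs on every record, receives the largest. A fuller implementation along the lines of the abstract would feed the estimated match probabilities back into the weights and iterate until they stop changing.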
Affiliation: Shenyang Jianzhu University
Source: Hans Journal of Data Mining (《数据挖掘》), 2021, No. 2, pp. 27-37 (11 pages)
