
A Record Linkage Algorithm Based on Entity Evolution (cited by: 1)

A temporal record matching based on entity evolution
Abstract: Entity resolution, also known as record linkage, determines whether two different records in one or more data sources describe the same entity. In data integration, it underpins operations such as data cleaning, deduplication, and similarity joins, and it is widely applied in census processing, citation matching, Web search, data cleaning, and plagiarism detection. In the real world, however, an entity's attributes change over time: two records with different attribute values do not necessarily refer to different entities, and two records with identical attribute values do not necessarily refer to the same entity. Temporal record linkage is the task of matching timestamped records that describe the same entity. Existing approaches rely on temporal models to capture entity evolution, but these models yield low matching accuracy when predicting how entities evolve, and the associated clustering has high computational complexity. This paper proposes a finer-grained model for capturing entity evolution together with a new two-stage fast clustering algorithm. Experiments on three real-world datasets show that the proposed temporal model captures entity evolution in finer detail and that the proposed clustering algorithm groups records describing the same entity faster and more accurately, improving both the accuracy and the efficiency of resolution.
Source: Journal of Nanjing University (Natural Science), indexed in CAS, CSCD, and the Peking University Core Journals list; 2017, No. 6, pp. 991-1003 (13 pages)
Funding: National Natural Science Foundation of China (61472070, 61672142)
Keywords: entity evolution, record linkage, temporal model, clustering algorithm
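
The abstract's central observation, that a mismatch in attribute values between two timestamped records is weaker evidence the further apart in time they are, can be illustrated with a small sketch. This is not the paper's temporal model or clustering algorithm, neither of which the abstract specifies. It is a generic time-decay weighting, and the `Record` fields, the `half_life` parameter, and the function name `decayed_match_score` are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """A timestamped record (hypothetical schema, for illustration only)."""
    name: str
    affiliation: str
    year: int


def decayed_match_score(a: Record, b: Record, half_life: float = 5.0) -> float:
    """Return a match score in [0, 1] for two timestamped records.

    Assumption: `name` is treated as stable, while `affiliation` may
    change over time. The weight given to the affiliation comparison
    halves every `half_life` years of time gap, so a mismatch between
    records far apart in time counts less against a match.
    """
    gap = abs(a.year - b.year)
    weight = 0.5 ** (gap / half_life)  # decays from 1 toward 0 as gap grows
    name_sim = 1.0 if a.name == b.name else 0.0
    aff_sim = 1.0 if a.affiliation == b.affiliation else 0.0
    # Weighted average of the two attribute comparisons.
    return (name_sim + weight * aff_sim) / (1.0 + weight)


if __name__ == "__main__":
    r2005_a = Record("Wei Wang", "University A", 2005)
    r2005_b = Record("Wei Wang", "University B", 2005)
    r2015_b = Record("Wei Wang", "University B", 2015)
    print(decayed_match_score(r2005_a, r2005_b))  # same year, different affiliation
    print(decayed_match_score(r2005_a, r2015_b))  # ten years apart, different affiliation
```

With the default half-life of five years, the score for the same affiliation mismatch rises from 0.5 when the records share a year to 0.8 when they are ten years apart. The sketch covers only the disagreement side of the observation. The converse case, where identical values on records far apart in time are weaker evidence of a match, would need a separate decay on agreement.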