
An Outlier-Detection Based Approach for Automatic Entity Matching (cited by: 10)
Abstract: Entity matching, also known as record matching, is a key technique in data integration and data cleaning. Typical applications include matching commercial products across different websites and matching bibliographic records between the DBLP (Digital Bibliography & Library Project) and Scholar digital libraries. Data quality defects that are widespread in real data, such as erroneous values, missing values, and diverse representations of the same value, make entity matching very challenging. Popular entity matching algorithms fall into three categories: rule-based, probabilistic, and learning-based. In e-commerce data, descriptions of the same product may vary greatly. For matching tasks with this much representational diversity, concise and effective matching rules usually do not exist, and training an accurate classification model is also difficult. To address this issue, this paper proposes an automatic entity matching approach based on outlier detection, denoted ODetec. First, ODetec computes the similarity of each record pair on the matching attributes and maps the pair to a point in a feature space. Next, it estimates the outlier distance of each pair in that feature space. Finally, it ranks the pairs by outlier distance and extracts the candidates that satisfy the matching constraints. In addition, ODetec uses principal component analysis to transform multiple correlated matching features into mutually orthogonal principal components. This removes the restriction of the Fellegi-Sunter model that attributes be conditionally independent, and thereby yields better matching quality and broader applicability. Extensive experiments on real datasets confirm the effectiveness of the ODetec approach.
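The three steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not define the similarity measures, the outlier distance, or the matching constraints, so the code below assumes token-level Jaccard similarity, exactly two matching features (so that PCA has a closed form), an outlier distance equal to the variance-scaled distance from the centroid in principal-component space, a fixed threshold of 2.0, and a one-to-one matching constraint. All function names are hypothetical.

```python
import math

def jaccard(a, b):
    """Token-level Jaccard similarity; one possible per-attribute measure."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta or not tb:
        return 0.0  # assumption: a missing value contributes no evidence
    return len(ta & tb) / len(ta | tb)

def pair_features(left, right, attrs):
    """Step 1: map each (left, right) record pair to a point of similarities."""
    return {(i, j): [jaccard(l[a], r[a]) for a in attrs]
            for i, l in enumerate(left) for j, r in enumerate(right)}

def principal_components_2d(points):
    """Rotate centred 2-D points onto their orthogonal principal axes."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    a = sum((p[0] - mx) ** 2 for p in points) / n           # var(x)
    c = sum((p[1] - my) ** 2 for p in points) / n           # var(y)
    b = sum((p[0] - mx) * (p[1] - my) for p in points) / n  # cov(x, y)
    theta = 0.5 * math.atan2(2 * b, a - c)  # angle of the first principal axis
    cos_t, sin_t = math.cos(theta), math.sin(theta)
    gap = math.hypot((a - c) / 2, b)
    variances = ((a + c) / 2 + gap, (a + c) / 2 - gap)  # covariance eigenvalues
    scores = [((p[0] - mx) * cos_t + (p[1] - my) * sin_t,
               -(p[0] - mx) * sin_t + (p[1] - my) * cos_t) for p in points]
    return scores, variances

def outlier_distances(points, eps=1e-12):
    """Step 2 (assumed definition): variance-scaled distance from the centroid."""
    scores, variances = principal_components_2d(points)
    # Components with (numerically) zero variance are skipped.
    return [math.sqrt(sum(s * s / v for s, v in zip(sc, variances) if v > eps))
            for sc in scores]

def extract_matches(features, threshold=2.0):
    """Step 3: accept outlying pairs greedily under a one-to-one constraint."""
    keys = list(features)
    points = [features[k] for k in keys]
    dists = outlier_distances(points)
    mean_total = sum(sum(p) for p in points) / len(points)
    used_left, used_right, matches = set(), set(), []
    for d, (i, j) in sorted(zip(dists, keys), reverse=True):
        if d < threshold:
            break  # remaining pairs are even closer to the centroid
        # Keep only outliers on the high-similarity side of the centroid,
        # and only if neither record has been matched already.
        if sum(features[(i, j)]) <= mean_total or i in used_left or j in used_right:
            continue
        used_left.add(i)
        used_right.add(j)
        matches.append((i, j))
    return matches
```

The sketch relies on the observation that most record pairs are non-matches and cluster at low similarity, so true matches appear as outliers on the high-similarity side. Scaling each principal component by its variance makes the distance equivalent to a Mahalanobis distance, which is one way to handle correlated features without assuming they are independent. A real system would need more than two features and a data-driven threshold.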
Authors: FAN Feng-Feng (樊峰峰), LI Zhan-Huai (李战怀), CHEN Qun (陈群), LIU Hai-Long (刘海龙); Department of Computer Science, Northwestern Polytechnical University, Xi'an 710072
Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2017, Issue 10, pp. 2197-2211 (15 pages)
Funding: National Key Basic Research Program of China (973 Program, No. 2012CB316203); National Natural Science Foundation of China (Nos. 61332006, 61472321, 61502390)
Keywords: data integration; entity matching; data quality; outlier detection; principal component analysis