Abstract
The continuing convergence of the Internet, the Internet of Things, and cloud computing has raised the level of digitization across industries, but it has also led to data fragmentation. Fragmented data is characterized by large volume, heterogeneity, privacy sensitivity, interdependence, and low quality, all of which reduce data usability. As a result, it is often difficult to extract accurate and complete information from such data for analytical tasks. To use the data more effectively, entity matching, fusion, and disambiguation become particularly important. This paper surveys entity matching algorithms for heterogeneous networks and reviews entity similarity measures and data preprocessing techniques. For large datasets in particular, it outlines research progress on scalable entity matching methods. It then surveys existing entity matching algorithms in two groups: those based on supervised learning and those based on unsupervised learning. The paper concludes with research progress on entity matching and topics for future research.
Authors
李娜
金冈增
周晓旭
郑建兵
高明
LI Na; JIN Gang-zeng; ZHOU Xiao-xu; ZHENG Jian-bing; GAO Ming (School of Data Science and Engineering, East China Normal University, Shanghai 200062, China)
Source
《华东师范大学学报(自然科学版)》
CAS
CSCD
PKU Core (Peking University Core Journals)
2018, No. 5, pp. 41-55 (15 pages)
Journal of East China Normal University(Natural Science)
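Funding
National Key Research and Development Program of China (2016YFB1000905)
National Natural Science Foundation of China and Guangdong Province Joint Key Project (U1401256)
National Natural Science Foundation of China (61672234, 61502236, 61472321)
Shanghai Agriculture Applied Technology Development Program (T20170303)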
Keywords
data fusion
entity matching
record linkage
entity resolution