Abstract
This paper investigates an incremental record grouping method based on similarity transfer that is suited to large data sets. It first analyzes how the similarity between records can be derived step by step, then describes how to construct reference groups from a sorting key, how to update the reference groups incrementally through transferred similarity, and how to implement these incremental updates with a union-find structure. Finally, experiments verify the feasibility and efficiency of the proposed method. The experimental results show that, compared with traditional methods, the proposed method improves both grouping quality and grouping efficiency. The paper does not analyze in detail the data quality problems in the attribute values themselves, nor does it design an algorithm for generating sorting keys. The proposed method helps address the problem of missed record matches in data cleaning and in information integration and management. Because its algorithms are designed purely from a data-processing perspective, it also offers good scalability, reusability, and domain independence.
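The role of union-find in incremental grouping can be illustrated with a minimal sketch. It is not the paper's algorithm: the class `UnionFind`, the helper `group_records`, and the record identifiers `r1` to `r4` are hypothetical names chosen for illustration, and the similar pairs are assumed to come from some earlier comparison step that the abstract does not specify. The sketch only shows how similarity can propagate transitively, so that two records end up in the same group without ever being compared directly, and how a later pair merges existing groups without regrouping from scratch.

```python
class UnionFind:
    """Disjoint-set structure over hashable record identifiers (illustrative)."""

    def __init__(self):
        self.parent = {}

    def find(self, x):
        # Register unseen records as their own singleton group.
        self.parent.setdefault(x, x)
        root = x
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[x] != root:
            next_x = self.parent[x]
            self.parent[x] = root
            x = next_x
        return root

    def union(self, a, b):
        # Merge the groups containing a and b (no-op if already merged).
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


def group_records(uf, record_ids):
    """Return the current groups as sorted lists, for easy inspection."""
    groups = {}
    for rid in record_ids:
        groups.setdefault(uf.find(rid), []).append(rid)
    return sorted(sorted(members) for members in groups.values())


# Assumed input: pairs already judged similar by some comparison step.
uf = UnionFind()
uf.union("r1", "r2")
uf.union("r2", "r3")   # r1 and r3 now share a group without a direct comparison
uf.find("r4")          # r4 arrives later as its own group
before = group_records(uf, ["r1", "r2", "r3", "r4"])

uf.union("r3", "r4")   # one new similar pair merges the two groups in place
after = group_records(uf, ["r1", "r2", "r3", "r4"])
```

In this sketch each newly discovered similar pair costs only one `union` call, which is the property that makes a union-find structure a natural fit for incremental updates.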
Author
GAO Guangshang (高广尚)
Research Center for Modern Enterprise Management, Guilin University of Technology, Guilin 541004, China; School of Management, Guilin University of Technology, Guilin 541004, China
Source
Systems Engineering-Theory & Practice (《系统工程理论与实践》)
EI
CSSCI
CSCD
PKU Core Journal
2019, No. 5, pp. 1287-1297 (11 pages)
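Funding
National Natural Science Foundation of China (71761008)
Fund of the Key Research Base of Humanities and Social Sciences in Guangxi Universities (16YB010)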
Keywords
sorting key
transferred similarity
union-find
entity resolution
data quality