Abstract
This paper studies how to construct canonical records in rapidly evolving data sets so that the front end and back end of Web applications can compare and analyze those data sets efficiently. It first analyzes strategies for matching and merging similar records, namely the inter-record similarity strategy, the inter-record merging strategy, and the combined similarity-and-merging strategy. It then gives a basic definition of a canonical record and analyzes the prerequisites for a best canonical record. Next, it discusses a method for constructing canonical records over evolving data: the method first uses a match function and a merge function to generate canonical records on static data and then, when evolving data arrives, updates the previously generated set of canonical records based on the operations that occur and the evolved records. Finally, the method is validated through experiments and data analysis. The experimental results show that the proposed method yields higher generation quality than traditional methods on static data and has good incremental update performance on evolving data, which supports its overall feasibility and efficiency in evolving-data settings. The method helps address the efficient construction of canonical records in multi-source evolving data environments, and it is stable and general: because it works with any type of similarity measure function, it is applicable to many practical domains.
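The two-phase idea in the abstract (generate canonical records on static data with a match function and a merge function, then update them as new records arrive) can be illustrated with a minimal sketch. This is not the paper's algorithm: the token-level Jaccard match, the "keep the longer value" merge rule, the `name` attribute, the 0.5 threshold, and every function name below are assumptions chosen only for illustration. Since the abstract states that the method works with any similarity function, the match and merge functions are passed in as parameters.

```python
from typing import Callable, Dict, Iterable, List

Record = Dict[str, str]


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two strings."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def match(r1: Record, r2: Record, threshold: float = 0.5) -> bool:
    """Hypothetical match function: compares only the 'name' attribute."""
    return jaccard(r1.get("name", ""), r2.get("name", "")) >= threshold


def merge(r1: Record, r2: Record) -> Record:
    """Hypothetical merge function: keeps the longer value per attribute."""
    merged: Record = {}
    for key in sorted(set(r1) | set(r2)):
        v1, v2 = r1.get(key, ""), r2.get(key, "")
        merged[key] = v1 if len(v1) >= len(v2) else v2
    return merged


def update(
    canonical: List[Record],
    new_record: Record,
    match_fn: Callable[[Record, Record], bool] = match,
    merge_fn: Callable[[Record, Record], Record] = merge,
) -> None:
    """Incremental step: fold one arriving record into the canonical set."""
    for i, rep in enumerate(canonical):
        if match_fn(rep, new_record):
            # Only the matching canonical record is touched.
            canonical[i] = merge_fn(rep, new_record)
            return
    # No match: the record starts a new canonical record.
    canonical.append(dict(new_record))


def build(
    records: Iterable[Record],
    match_fn: Callable[[Record, Record], bool] = match,
    merge_fn: Callable[[Record, Record], Record] = merge,
) -> List[Record]:
    """Static step: build canonical records from an initial batch."""
    canonical: List[Record] = []
    for record in records:
        update(canonical, record, match_fn, merge_fn)
    return canonical


if __name__ == "__main__":
    static_data = [
        {"name": "John Smith", "city": "NY"},
        {"name": "John A Smith", "city": "New York"},
        {"name": "Mary Jones", "city": ""},
    ]
    canon = build(static_data)
    update(canon, {"name": "Mary Jones", "city": "Boston"})  # evolving data
    for rec in canon:
        print(rec)
```

In this sketch an arriving record changes at most one existing canonical record, so the static batch is never reprocessed. That is the general motivation for incremental updating; the paper's actual update rules may differ.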
Author
高广尚
GAO Guang-shang (School of Management, Guilin University of Technology, Guilin 541004, China)
Source
Systems Engineering (《系统工程》)
Peking University Core Journal (北大核心)
2022, No. 3, pp. 137-148 (12 pages)
Funding
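National Natural Science Foundation of China (71761008)
Guangxi Science and Technology Program (桂科AD19245122)
Guilin University of Technology Research Start-up Fund (GUTQDJJ2016020)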
Keywords
Approximate Duplicate Records
Evolving Data
Canonical Records
Entity Resolution