Abstract
As data volumes grow and data sets become increasingly interrelated and overlapping, data fusion is needed to maximize the value of data. Because the fusion process is complex, a backtracking mechanism is necessary to explain it clearly. Although data provenance has been studied extensively, most work targets queries and workflows, and little addresses data fusion. This paper studies provenance for data fusion and proposes a method that supports multi-granularity data provenance. First, the data fusion process is abstracted: semantic graphs of schemas, entities, and attributes are constructed with the entity as the core, giving the fusion process a semantic representation, and an optimized storage model for provenance information is proposed. Second, based on the semantic graphs, provenance query algorithms at the entity level and the attribute level are proposed, together with corresponding query optimization strategies. Finally, experiments demonstrate the effectiveness of the proposed data provenance method.
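The idea of tracing a fused record back to its origins at two granularities can be illustrated with a small sketch. This is not the authors' semantic graph, storage model, or query algorithms, none of which are specified in the abstract. It is a minimal toy, assuming that each fusion step records which entities and which attributes a fused entity was derived from. The class `ProvenanceGraph`, its methods, and the identifiers such as `s1:e1` are all hypothetical names chosen for illustration.

```python
from collections import defaultdict


class ProvenanceGraph:
    """Toy provenance store (hypothetical): edges point from a fused item
    to the items it was derived from."""

    def __init__(self):
        # entity id -> set of entity ids it was fused from
        self.entity_sources = defaultdict(set)
        # (entity id, attribute) -> set of (source entity id, source attribute)
        self.attr_sources = defaultdict(set)

    def record_fusion(self, fused, sources, attr_map):
        """Record one fusion step.

        attr_map maps each attribute of the fused entity to the list of
        (source entity id, source attribute) pairs it was derived from.
        """
        self.entity_sources[fused].update(sources)
        for attr, origins in attr_map.items():
            self.attr_sources[(fused, attr)].update(origins)

    def trace_entity(self, entity):
        """Entity-level query: original entities behind `entity`."""
        return self._trace(entity, self.entity_sources)

    def trace_attribute(self, entity, attr):
        """Attribute-level query: original (entity, attribute) pairs."""
        return self._trace((entity, attr), self.attr_sources)

    @staticmethod
    def _trace(start, edges):
        # Depth-first walk backwards; nodes with no recorded sources are origins.
        roots, seen, stack = set(), {start}, [start]
        while stack:
            node = stack.pop()
            parents = edges.get(node)
            if not parents:
                roots.add(node)
                continue
            for parent in parents:
                if parent not in seen:
                    seen.add(parent)
                    stack.append(parent)
        return roots


# Two fusion steps: e3 is fused from two source records, then e5 from e3 and a third.
g = ProvenanceGraph()
g.record_fusion(
    "e3", ["s1:e1", "s2:e2"],
    {"name": [("s1:e1", "name")], "phone": [("s2:e2", "tel")]},
)
g.record_fusion(
    "e5", ["e3", "s3:e4"],
    {"name": [("e3", "name")], "address": [("s3:e4", "addr")]},
)

print(g.trace_entity("e5"))
print(g.trace_attribute("e5", "name"))
```

In this toy, the entity-level query on `e5` reaches all three source records, while the attribute-level query on its `name` reaches only the one source attribute that actually contributed the value, which is the practical difference between the two granularities.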
Authors
杨斐斐
沈思妤
申德荣
聂铁铮
寇月
YANG Fei-fei; SHEN Si-yu; SHEN De-rong; NIE Tie-zheng; KOU Yue (College of Computer Science and Engineering, Northeastern University, Shenyang 110169, China)
Source
《计算机科学》
CSCD
Peking University Core Journal
2022, Issue 5, pp. 120-128 (9 pages)
Computer Science
Funding
National Natural Science Foundation of China (62072084, 62072086)
National Key Research and Development Program of China (2018YFB1003404)
Keywords
Data provenance
Data fusion
Multi-granularity