Abstract
Data items reference one another, and in data development some data are typically referenced heavily by other data. Defects in such high-popularity data can severely affect the results produced by the entire big data platform, so predicting which data will become highly referenced, and protecting it accordingly, is crucial. To meet the need for tiered data governance based on data citation popularity, a data citation popularity prediction method based on data lineage is proposed. First, the data lineage of the data system is constructed to capture the reference relationships between data nodes. Then, the temporal and structural features of the data lineage are extracted, and a Graph Convolutional Network (GCN) is used to learn the features of the data lineage graph. Finally, a hierarchical readout method based on the propagation trend of the data lineage is proposed to read out the graph features and predict data citation popularity. Experimental results on ZJZY-SL, a dataset from the China Tobacco Zhejiang marketing system, and on HEP-PH, a citation dataset of High Energy Physics Phenomenology papers, show that, compared with methods such as DeepCCP (an end-to-end deep learning neural network for paper citation count prediction), the proposed method improves recognition accuracy by 7.64 and 2.88 percentage points, respectively, and the average F1 score by 4.7 and 4.34 percentage points, respectively. The proposed method can fully exploit the data lineage features available in the early stage of a data node being referenced and predict the node's future citation popularity.
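The pipeline outlined in the abstract (lineage graph, GCN feature learning, layered readout) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the toy graph, the two node features, the identity weight matrix, the symmetric treatment of reference edges, and the helper names `gcn_layer`, `hop_levels` and `layered_readout` are all hypothetical. The readout simply mean-pools hidden features per hop level from the target node, which is one possible reading of "hierarchical readout along the propagation trend".

```python
import math
from collections import deque


def gcn_layer(adj, feats, weight):
    """One standard GCN step: ReLU(D^-1/2 (A + I) D^-1/2 X W)."""
    n = len(adj)
    # Add self-loops so each node keeps its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)] for i in range(n)]
    deg = [sum(row) for row in a_hat]
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)] for i in range(n)]
    in_dim, out_dim = len(weight), len(weight[0])
    # Aggregate neighbour features, then apply the linear map and ReLU.
    agg = [[sum(norm[i][k] * feats[k][f] for k in range(n)) for f in range(in_dim)]
           for i in range(n)]
    return [[max(0.0, sum(agg[i][f] * weight[f][h] for f in range(in_dim)))
             for h in range(out_dim)] for i in range(n)]


def hop_levels(edges, root):
    """Group nodes by hop distance from root, following (source, referrer) edges."""
    children = {}
    for src, dst in edges:
        children.setdefault(src, []).append(dst)
    depth = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for nxt in children.get(node, []):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    levels = [[] for _ in range(max(depth.values()) + 1)]
    for node, d in depth.items():
        levels[d].append(node)
    return levels


def layered_readout(hidden, levels):
    """Mean-pool hidden features within each hop level and concatenate the results."""
    dim = len(hidden[0])
    vec = []
    for nodes in levels:
        vec.extend(sum(hidden[i][f] for i in nodes) / len(nodes) for f in range(dim))
    return vec


# Toy lineage (hypothetical): nodes 1 and 2 reference node 0, node 3 references node 1.
edges = [(0, 1), (0, 2), (1, 3)]
n = 4
adj = [[0.0] * n for _ in range(n)]
for s, d in edges:
    adj[s][d] = adj[d][s] = 1.0  # assumption: edges treated as undirected for the GCN

# Assumed features per node: [constant, time step at which the node appeared].
feats = [[1.0, 0.0], [1.0, 1.0], [1.0, 1.0], [1.0, 2.0]]
weight = [[1.0, 0.0], [0.0, 1.0]]  # identity weights, a stand-in for learned parameters

hidden = gcn_layer(adj, feats, weight)
levels = hop_levels(edges, root=0)
readout = layered_readout(hidden, levels)
```

In a full model, the `readout` vector would be fed to a classifier that outputs a popularity class for node 0. That step is omitted here because the abstract does not describe it.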
Authors
JIN Yong
GAO Yanghua
PAN Xiaohua
SHEN Shijing
ZHU Xinzhou
Affiliations: Information Center, China Tobacco Zhejiang Industrial Company Limited, Hangzhou, Zhejiang 310007, China; Binjiang Institute of Zhejiang University, Hangzhou, Zhejiang 310053, China; School of Software Technology, Zhejiang University, Hangzhou, Zhejiang 310013, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals
2023, Issue S01, pp. 119-125 (7 pages)
Funding
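Science and Technology Project of the Zhejiang University-China Tobacco Zhejiang Joint Laboratory ZJZY2021E006 (ZD-ZJZY20211001)
Key Research and Development Project of China National Tobacco Corporation (110202102030)
Science and Technology Project of China Tobacco Zhejiang Industrial Co., Ltd. (ZJZY2021E006)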
Keywords
data lineage
Graph Convolutional Network (GCN)
data citation popularity
propagation trend
data governance