Abstract
Distributed data streams have become the main form in which modern data-driven applications produce data. Although the data at each local node are stored independently, they are correlated with one another, so the key problem in distributed online learning is how to share local-node data efficiently in order to build a global learner. To address this problem, this paper proposes a data sharing solution for distributed online learning that comprises a semi-supervised clustering method based on exponential loss and a data sharing method based on covariance matrices and mean vectors, and proves a lower bound on the probability that the cumulative absolute error of the reconstructed data set is smaller than a given absolute error bound. Experiments show that the proposed method keeps the volume of data shared between nodes at a low level while ensuring that the learner trained on the reconstructed data generalizes well.
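The abstract does not spell out how a data set is rebuilt from a covariance matrix and a mean vector. As a rough illustration only, the sketch below assumes that a node shares just its sample mean, sample covariance, and point count, and that the receiving node redraws the same number of points from a Gaussian with those statistics. The Gaussian assumption and the helper names `summarize`, `cholesky`, and `rebuild` are hypothetical and are not taken from the paper.

```python
import math
import random

def summarize(points):
    """Return (mean vector, sample covariance matrix, count) for a list of rows."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    cov = [[sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / (n - 1)
            for j in range(d)] for i in range(d)]
    return mean, cov, n

def cholesky(a, jitter=1e-12):
    """Lower-triangular factor low with low * low^T approximately equal to a."""
    d = len(a)
    low = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = a[i][j] - sum(low[i][k] * low[j][k] for k in range(j))
            if i == j:
                # Small jitter keeps the diagonal positive for near-singular input.
                low[i][j] = math.sqrt(max(s, 0.0) + jitter)
            else:
                low[i][j] = s / low[j][j]
    return low

def rebuild(mean, cov, n, rng):
    """Draw n points from N(mean, cov); an assumed stand-in for reconstruction."""
    low = cholesky(cov)
    d = len(mean)
    rebuilt = []
    for _ in range(n):
        z = [rng.gauss(0.0, 1.0) for _ in range(d)]
        rebuilt.append([mean[i] + sum(low[i][k] * z[k] for k in range(i + 1))
                        for i in range(d)])
    return rebuilt

# A local node holds 500 two-dimensional points but transmits only the summary.
rng = random.Random(0)
local = [[rng.gauss(2.0, 1.0), rng.gauss(-1.0, 0.5)] for _ in range(500)]
mean, cov, n = summarize(local)
remote_copy = rebuild(mean, cov, n, rng)
```

Under these assumptions, the transmitted payload for d-dimensional data is d + d(d + 1)/2 numbers plus a count, regardless of how many points the node holds, which is one way the shared data volume could stay low.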
Authors
张宇
刘威
邵良杉
ZHANG Yu; LIU Wei; SHAO Liang-shan (College of Science, Liaoning Technical University, Fuxin 123000, China; Research Centre in Management Science, Liaoning Technical University, Huludao 125105, China)
Source
《控制与决策》
EI
CSCD
PKU Core Journal
2021, No. 8, pp. 1871-1880 (10 pages)
Control and Decision
Funding
Project of the Education Department of Liaoning Province (LJ2019QL016)
National Natural Science Foundation of China (71771111).
Keywords
distributed data stream
global learner
online learning
data sharing
semi-supervised clustering
data set reconstruction