Abstract
Concept drift is a common problem in mining dynamic streaming data. However, false concept drift, caused by noise mixed into the stream or by training samples that are too small, produces effects similar to those of true concept drift, namely unstable fluctuations in the model's online test performance. The two are therefore easily confused, which leads to false alarms of concept drift. To address this confusion between true and false concept drift in streaming data, a concept drift detection method based on online performance test (CDPT) is proposed. CDPT evenly divides the most recently acquired data into groups and performs online learning on each sub-dataset separately. It records the classification accuracy vector obtained by training and testing on each sub-dataset, computes the accuracy drop between adjacent learning time units, and obtains the effective fluctuation points according to a threshold on the decline in test accuracy. The effective fluctuation points of the different groups are then integrated by cross checking, which eliminates the detection interference caused by model instability when training samples are too small during online learning, and the consistent fluctuation points are obtained according to the consistency of the accuracy fluctuations. Finally, by tracking the classification accuracy of online learning, CDPT obtains the change in test accuracy at reference points in the neighborhood of each consistent fluctuation point, and compares the magnitude of the accuracy decline and the convergence behavior at those reference points to detect the true concept drift points among the consistent fluctuation points. Experimental results show that the method effectively identifies true concept drift occurring during online learning on streaming data, avoids the negative impact of small training samples or noise in the stream on the detection results, and improves the generalization performance of the model.
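The pipeline described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The abstract does not give the exact rules, so the following are assumptions made here for illustration: the function names, the voting rule `min_agree` used for cross checking, and the neighborhood test that treats a drop as true drift only when mean accuracy stays lower over the following `window` time units.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def effective_fluctuation_points(acc: Sequence[float],
                                 drop_threshold: float) -> List[int]:
    """Time units t where accuracy fell by more than drop_threshold since t-1."""
    return [t for t in range(1, len(acc))
            if acc[t - 1] - acc[t] > drop_threshold]


def consistent_fluctuation_points(group_accs: Sequence[Sequence[float]],
                                  drop_threshold: float,
                                  min_agree: int) -> List[int]:
    """Cross-check groups: keep points flagged by at least min_agree groups.

    The voting rule is an assumption; the abstract only says that points
    are integrated across groups by cross checking.
    """
    votes = Counter()
    for acc in group_accs:
        votes.update(effective_fluctuation_points(acc, drop_threshold))
    return sorted(t for t, n in votes.items() if n >= min_agree)


def is_true_drift(mean_acc: Sequence[float], t: int,
                  window: int, drop_threshold: float) -> bool:
    """Assumed neighborhood test: compare mean accuracy over the `window`
    time units before t with the `window` time units starting at t.
    A sustained drop counts as drift; a dip that recovers does not."""
    before = list(mean_acc[max(0, t - window):t])
    after = list(mean_acc[t:t + window])
    if not before or len(after) < window:
        return False  # not enough reference points to decide
    return sum(before) / len(before) - sum(after) / len(after) > drop_threshold


def detect_drift_points(group_accs: Sequence[Sequence[float]],
                        drop_threshold: float = 0.1,
                        min_agree: int = 2,
                        window: int = 3) -> Tuple[List[int], List[int]]:
    """Return (consistent fluctuation points, assumed true drift points)."""
    consistent = consistent_fluctuation_points(group_accs, drop_threshold,
                                               min_agree)
    # Average the per-group accuracy vectors time unit by time unit.
    mean_acc = [sum(col) / len(col) for col in zip(*group_accs)]
    drift = [t for t in consistent
             if is_true_drift(mean_acc, t, window, drop_threshold)]
    return consistent, drift
```

In this sketch, a dip that appears in only one group is filtered out at the cross-checking step, and a dip shared by all groups that recovers within the window is filtered out at the neighborhood step. The thresholds and window length are placeholders to be tuned per data stream.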
Authors
郭虎升
张爱娟
王文剑
GUO Hu-Sheng; ZHANG Ai-Juan; WANG Wen-Jian (School of Computer and Information Technology, Shanxi University, Taiyuan 030006, China; Key Laboratory of Computational Intelligence and Chinese Information Processing of Ministry of Education (Shanxi University), Taiyuan 030006, China)
Source
《软件学报》
EI
CSCD
Peking University Core Journals (北大核心)
2020, Issue 4, pp. 932-947 (16 pages)
Journal of Software
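Funding
National Natural Science Foundation of China (61503229, 61673249, U1805263)
Natural Science Foundation of Shanxi Province (201901D111033)
Key Research and Development Program of Shanxi Province (International Cooperation) (201903D421050)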
Keywords
streaming data
concept drift
cross checking
effective fluctuation point
consistent fluctuation point
concept drift point