
Two-Stage Adaptive Ensemble Learning Method for Different Types of Concept Drift
Abstract In the era of big data, streaming data is emerging in large volumes. Concept drift, the most typical and difficult problem in streaming data mining, has received increasing attention. Ensemble learning is a common approach to handling concept drift in streaming data. However, after a drift occurs, learning models often cannot respond promptly to the change in data distribution and cannot effectively handle different types of concept drift, which degrades their generalization performance. To address this problem, we propose a two-stage adaptive ensemble learning method for different types of concept drift (TAEL). The method first determines the concept drift type by detecting the drift span. Then, according to the drift type, a "filtering-expansion" two-stage sample processing mechanism dynamically selects an appropriate sample processing strategy. Specifically, in the filtering stage, a different non-critical sample filter is created for each drift type to extract key samples from historical sample blocks, which brings the historical data distribution closer to the latest data distribution and improves the effectiveness of the base learners. In the expansion stage, a block-priority sampling method is proposed. It sets an appropriate sampling scale for each drift type and assigns each historical key sample a sampling priority according to the proportion of its class in the current sample block. The sampling priority then determines the sampling probability, and a subset of key samples is drawn from the historical key sample blocks according to this probability to expand the current sample block. This alleviates class imbalance after sample expansion, resolves the underfitting of the current base learner, and enhances its stability. Experimental results show that the proposed method responds promptly to different types of concept drift, accelerates the convergence of online ensemble models after drift occurs, and improves the overall generalization performance of the model.
Authors Guo Husheng; Zhang Yang; Wang Wenjian (School of Computer and Information Technology, Shanxi University, Taiyuan 030006; Key Laboratory of Computational Intelligence and Chinese Information Processing (Shanxi University), Ministry of Education, Taiyuan 030006)
Source Journal of Computer Research and Development (indexed in EI, CSCD, and the Peking University Core Journals list), 2024, No. 7, pp. 1799-1811, 13 pages
Funding National Natural Science Foundation of China (62276157, U21A20513, 62076154, 61503229); Key Research and Development Program of Shanxi Province (202202020101003).
Keywords streaming data; concept drift; ensemble learning; drift type; filtering stage; expansion stage
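
The expansion stage described in the abstract can be illustrated with a minimal Python sketch. The abstract does not give the priority formula, the sampling scales, or the names of the drift types, so everything specific here is an assumption made for illustration: the function name `block_priority_sample`, the `SCALE_BY_DRIFT` table and its values, and the rule that a class with a smaller share of the current block receives a higher priority. It is not the authors' implementation.

```python
import random
from collections import Counter

# Hypothetical sampling scales per drift type, as a fraction of the current
# block size. The abstract does not state the actual values or type names.
SCALE_BY_DRIFT = {"abrupt": 0.2, "gradual": 0.5}

# Small floor so that every key sample keeps a non-zero sampling weight.
EPS = 1e-6


def block_priority_sample(current_labels, key_samples, drift_type, seed=0):
    """Draw historical key samples to expand the current sample block.

    current_labels: class labels of the samples in the current block.
    key_samples: list of (features, label) pairs kept by the filtering stage.
    Returns (chosen_samples, sampling_probabilities).
    """
    if not current_labels:
        raise ValueError("current block must not be empty")

    rng = random.Random(seed)
    counts = Counter(current_labels)
    total = len(current_labels)

    # Assumed priority rule: the smaller a class's share of the current
    # block, the higher the priority of historical samples from that class.
    priorities = [1.0 - counts.get(label, 0) / total + EPS
                  for _, label in key_samples]
    weight_sum = sum(priorities)
    probabilities = [p / weight_sum for p in priorities]

    # The number of samples to draw depends on the drift type.
    n_draw = min(len(key_samples), round(SCALE_BY_DRIFT[drift_type] * total))

    # Weighted sampling without replacement: draw one index at a time and
    # remove it from the pool.
    pool = list(range(len(key_samples)))
    weights = list(probabilities)
    chosen = []
    for _ in range(n_draw):
        pos = rng.choices(range(len(pool)), weights=weights, k=1)[0]
        chosen.append(key_samples[pool.pop(pos)])
        weights.pop(pos)
    return chosen, probabilities
```

Under these assumptions, a current block with eight samples of class 0 and two of class 1 gives historical class-1 key samples a higher sampling probability than class-0 ones, so the expanded block is less skewed toward the majority class than uniform sampling would leave it.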