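Abstract
The main problems in data stream mining are concept drift and noise contamination. Current data stream mining algorithms cannot handle noise in data streams effectively, whereas an ideal learning algorithm should be both sensitive to concept drift and robust to noise. This paper explores how to use a clustering method to distinguish noisy instances from hard-to-learn instances in a data stream, and proposes a corresponding concept drift detection method. On this basis, RobustBoosting, an ensemble classifier algorithm based on boosting, is designed. Experiments on synthetic and real data sets show that, even with class noise as high as 40%, the algorithm maintains higher classification accuracy than the AdaptiveBoosting algorithm [1] and converges to the new target concept faster.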
Existing algorithms cannot strike a good balance between robustness to data noise and sensitivity to concept drift. We propose an algorithm, RobustBoosting, that we believe strikes this balance better than existing algorithms. The full paper explains the algorithm in detail; this abstract only lists its four topics with brief remarks. The first topic is distinguishing data noise from hard-to-learn samples. The second is separating hard-to-learn samples from data noise with a density-based clustering method. The separation is not absolute but probabilistic: it yields two groups, one consisting mostly of hard-to-learn samples with an insignificant amount of data noise, and another that is the reverse. The third topic is detecting concept drift, for which we derive three equations. The fourth is the design of the RobustBoosting algorithm. We compared RobustBoosting with the AdaptiveBoosting algorithm [1] on both synthetic and real-life data sets. The experimental results, given in two figures in the full paper, show preliminarily that the proposed method has a substantial advantage over AdaptiveBoosting in prediction accuracy, and that it converges to target concepts quickly and accurately even when 40% of the samples are noise.
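The abstract does not give the clustering procedure itself, so the following is only a minimal sketch of the general idea of using local density to split misclassified samples into suspected noise and hard-to-learn samples. The function name `split_noise_and_hard`, the parameters `eps` and `min_same`, and the decision rule (a misclassified sample with enough same-label neighbours within radius `eps` is treated as hard-to-learn, otherwise as suspected noise) are assumptions made for illustration and are not taken from the paper.

```python
import math

def split_noise_and_hard(samples, labels, misclassified, eps=1.0, min_same=2):
    """Partition misclassified indices into (suspected_noise, hard_to_learn).

    Hypothetical rule, not the paper's: a misclassified sample that lies in a
    dense region of samples sharing its label is probably hard to learn; an
    isolated one is probably mislabeled (class noise).
    """
    noise, hard = [], []
    for i in misclassified:
        # Count other samples with the same label within distance eps.
        same = sum(
            1
            for j, (x, y) in enumerate(zip(samples, labels))
            if j != i and y == labels[i] and math.dist(samples[i], x) <= eps
        )
        (hard if same >= min_same else noise).append(i)
    return noise, hard

# Toy data: a label-1 cluster near (0, 0), a label-0 cluster near (5, 5),
# and one label-1 sample sitting inside the label-0 cluster.
samples = [(0, 0), (0.2, 0), (0, 0.2), (0.1, 0.1),
           (5, 5), (5.2, 5), (5, 5.2), (5.1, 5.1)]
labels = [1, 1, 1, 1, 0, 0, 0, 1]

# Suppose the current classifier got samples 3 and 7 wrong.
noise, hard = split_noise_and_hard(samples, labels, misclassified=[3, 7])
```

In this toy setting, sample 3 has same-label neighbours nearby and is kept as hard-to-learn, while sample 7 has none and is flagged as suspected noise. As the abstract notes, such a split is probabilistic rather than absolute, so a boosting scheme would more plausibly down-weight the suspected-noise group than discard it outright.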
Source
《西北工业大学学报》
EI
CAS
CSCD
PKU Core
2007, No. 4, pp. 603-607 (5 pages)
Journal of Northwestern Polytechnical University
Funding
Supported by the National Natural Science Foundation of China (60373108)