Abstract
Traditional classification algorithms for data streams with concept drift typically rely on the true labels of test data to detect drift and adjust the classification model as needed. Obtaining true labels, however, costs considerable human and material resources, and the continuous arrival of high-speed data streams makes this approach impractical in real-world applications. To address this problem, a concept drift detection algorithm based on a small number of class labels is proposed. The method exploits the fact that the fast KNNModel algorithm classifies instances using model clusters: without knowing the labels of the data to be classified, it judges whether concept drift has occurred by testing whether the number of instances in the current data block that are not covered by any model cluster has increased significantly, at a given significance level, relative to the previous data block. When drift is detected, domain experts are asked to label only the few instances not covered by the model clusters, and these instances are used to update the model, which addresses both drift detection and model self-updating. Experimental results show that the method classifies data streams quickly while adapting to concept drift, and achieves classification accuracy close to or higher than that of traditional data stream classification algorithms.
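The label-free drift check described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the abstract does not name the statistical test, so a one-sided two-proportion z-test is assumed here, and a model cluster is assumed to be a ball given by a hypothetical `(center, radius, label)` tuple with Euclidean distance. The names `is_covered`, `uncovered_count`, and `drift_detected` are illustrative.

```python
import math
from typing import List, Sequence, Tuple

# Hypothetical representation of a model cluster: (center, radius, class label).
Cluster = Tuple[Sequence[float], float, str]


def is_covered(x: Sequence[float], clusters: List[Cluster]) -> bool:
    """Return True if x lies within the radius of at least one cluster."""
    return any(math.dist(x, center) <= radius for center, radius, _ in clusters)


def uncovered_count(block: List[Sequence[float]], clusters: List[Cluster]) -> int:
    """Count instances in a data block that no model cluster covers."""
    return sum(1 for x in block if not is_covered(x, clusters))


def drift_detected(prev_uncovered: int, prev_size: int,
                   curr_uncovered: int, curr_size: int,
                   alpha: float = 0.05) -> bool:
    """Assumed test: one-sided two-proportion z-test on the uncovered rate.

    Signals drift when the uncovered proportion of the current block is
    significantly larger than that of the previous block at level alpha.
    """
    p_prev = prev_uncovered / prev_size
    p_curr = curr_uncovered / curr_size
    pooled = (prev_uncovered + curr_uncovered) / (prev_size + curr_size)
    se = math.sqrt(pooled * (1 - pooled) * (1 / prev_size + 1 / curr_size))
    if se == 0:
        # Both blocks are entirely covered or entirely uncovered: no increase.
        return False
    z = (p_curr - p_prev) / se
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z >= z)
    return p_value < alpha
```

In a stream setting, only the uncovered instances of a block flagged by `drift_detected` would be passed to a domain expert for labeling, which keeps the labeling effort small.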
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal (北大核心)
2012, No. 8, pp. 2176-2181, 2185 (7 pages)
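Funding
National Natural Science Foundation of China (61070062, 61175123)
Major Science and Technology Project for Industry-University Cooperation of Fujian Universities (2010H6007)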