Concept Drift Detection for Data Streams Based on Mixture Model

Original title: 一种基于混合模型的数据流概念漂移检测算法 (cited by 13)
Abstract: Classification of data streams with concept drift has attracted growing attention because of its applications in credit card fraud analysis and other fields. Most existing algorithms assume that the true label of every instance becomes available right after it is classified, and they use these labels both to detect concept drift and to adjust the classification model. This assumption is impractical in real applications, because manually labeling instances that arrive continuously at high speed takes a great deal of time and effort. To address this problem, this paper proposes KnnM-IB, a concept drift detection algorithm based on the KNNModel algorithm and the incremental Bayes algorithm. KnnM-IB retains the high accuracy and speed of KNNModel when classifying instances covered by the model clusters, and uses the incremental Bayes algorithm to classify the hard-to-handle (confused) instances and to update the model, which safeguards classification performance. Concept drift is detected from changes in the size of a variable sliding window together with a small number of the most informative instances, which are selected and labeled through active learning. While the data stream is stable, semi-supervised learning is used to enlarge the set of labeled instances for model updating, which better matches the requirements of practical applications. Experimental results show that, compared with traditional classification algorithms, the proposed method adapts to concept drift, updates the model accordingly, and achieves comparable or better classification accuracy.
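Source: Journal of Computer Research and Development (《计算机研究与发展》; indexed in EI, CSCD, Peking University Core), 2014, No. 4, pp. 731-742 (12 pages)
Funding: National Natural Science Foundation of China (61070062, 61175123); Major Science and Technology Project for University-Industry Cooperation of Fujian Province (2010H6007)
Keywords: concept drift; data stream; classification; active learning; semi-supervised learning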
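
The abstract gives no implementation details for KnnM-IB, so the following is only a minimal Python sketch of the routing idea it describes: instances covered by a labeled model cluster take that cluster's label, and uncovered ("confused") instances fall back to an incrementally updated Bayes classifier. The names `ModelCluster`, `IncrementalGaussianNB`, and `hybrid_predict`, the Gaussian likelihood, the variance smoothing constant, and the toy data are all assumptions made for illustration, not the authors' method. Drift detection, active learning, and the semi-supervised update are not shown.

```python
import math
from collections import defaultdict


class ModelCluster:
    """A labeled hypersphere: center, radius, and the class it represents."""

    def __init__(self, center, radius, label):
        self.center = center
        self.radius = radius
        self.label = label

    def distance(self, x):
        return math.dist(self.center, x)

    def covers(self, x):
        return self.distance(x) <= self.radius


class IncrementalGaussianNB:
    """Gaussian naive Bayes updated one labeled instance at a time (assumed form)."""

    def __init__(self):
        self.count = defaultdict(int)  # label -> number of instances seen
        self.mean = {}                 # label -> per-feature running mean
        self.m2 = {}                   # label -> per-feature sum of squared deviations

    def update(self, x, label):
        if label not in self.mean:
            self.mean[label] = [0.0] * len(x)
            self.m2[label] = [0.0] * len(x)
        self.count[label] += 1
        n = self.count[label]
        for i, v in enumerate(x):
            # Welford's online update of the mean and the variance accumulator
            delta = v - self.mean[label][i]
            self.mean[label][i] += delta / n
            self.m2[label][i] += delta * (v - self.mean[label][i])

    def predict(self, x):
        total = sum(self.count.values())
        best_label, best_score = None, -math.inf
        for label, n in self.count.items():
            score = math.log(n / total)  # log prior
            for i, v in enumerate(x):
                var = self.m2[label][i] / n + 1e-6  # smoothed variance (assumption)
                diff = v - self.mean[label][i]
                score += -0.5 * math.log(2 * math.pi * var) - diff * diff / (2 * var)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def hybrid_predict(x, clusters, nb):
    """Use the nearest covering cluster if one exists; otherwise fall back to Bayes."""
    covering = [c for c in clusters if c.covers(x)]
    if covering:
        return min(covering, key=lambda c: c.distance(x)).label, "cluster"
    return nb.predict(x), "bayes"


# Toy usage with made-up two-dimensional data and labels.
clusters = [
    ModelCluster((0.0, 0.0), 1.0, "normal"),
    ModelCluster((5.0, 5.0), 1.0, "fraud"),
]
nb = IncrementalGaussianNB()
labeled = [((0.1, 0.2), "normal"), ((-0.3, 0.1), "normal"),
           ((5.2, 4.9), "fraud"), ((4.8, 5.3), "fraud")]
for features, label in labeled:
    nb.update(features, label)

covered_result = hybrid_predict((0.2, 0.1), clusters, nb)    # inside the first cluster
uncovered_result = hybrid_predict((3.5, 3.6), clusters, nb)  # outside both clusters
print(covered_result, uncovered_result)
```

The second element of each result records which path produced the label. In a fuller system, a rising share of `"bayes"` answers could serve as one signal that the clusters no longer fit the stream, although how the paper actually uses the sliding window for this is not stated in the abstract.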