Abstract
Privacy protection, data loss, network errors, and similar causes leave much of the data in networks uncertain. In a data stream system, data arrive continuously, so the full data set can never be obtained at once; in addition, the underlying concept of the data often changes over time. To address this, the paper constructs an incremental classification model for data streams that contain uncertain data and hidden concept drift. The model is based on the very fast decision tree (VFDT) algorithm. In the learning stage, it uses the Hoeffding bound to quickly build a decision tree that can handle data uncertainty. In the classification stage, it applies a weighted Bayes classifier at the leaf nodes of the tree to improve classification accuracy on uncertain data. A sliding window and alternate (replacement) trees are used to handle concept drift in the stream. Experiments show that the algorithm achieves high classification accuracy and execution efficiency on both synthetic and real data.
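As a rough illustration of the Hoeffding bound mentioned in the abstract, the sketch below computes the standard bound epsilon = sqrt(R^2 ln(1/delta) / (2n)) and uses it for the usual split test in a very fast decision tree: split when the gap between the two best attributes' gains exceeds epsilon. The function names, the choice of information gain as the split measure, and the sample values are assumptions made for illustration. The sketch does not model the paper's handling of uncertain data, the weighted Bayes leaves, or the alternate trees.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Return epsilon = sqrt(R^2 * ln(1/delta) / (2n)).

    value_range (R): range of the split measure being averaged.
    delta: allowed probability that the true mean lies outside the bound.
    n: number of examples observed at the leaf so far.
    """
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must be in (0, 1)")
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 num_classes: int, delta: float, n: int) -> bool:
    """Hypothetical helper: decide whether a leaf has seen enough examples.

    Assumes information gain as the split measure, whose range is
    log2(num_classes).
    """
    value_range = math.log2(num_classes)
    epsilon = hoeffding_bound(value_range, delta, n)
    return (best_gain - second_gain) > epsilon


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    print(hoeffding_bound(1.0, 1e-7, 1000))
    print(should_split(0.30, 0.15, num_classes=2, delta=1e-7, n=1000))
```

Because epsilon shrinks as n grows, a leaf that cannot yet separate its two best attributes simply waits for more examples, which is what lets the tree be built incrementally from a stream.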
Authors
LU Yan-xia (吕艳霞)
WANG Cui-rong (王翠容)
WANG Cong (王聪)
YUAN Ying (苑迎)
Affiliations: College of Computer Science and Engineering, Northeastern University, Shenyang 110819, China; School of Computer and Communication Engineering, Northeastern University at Qinhuangdao, Northeastern University, Qinhuangdao 066004, Hebei Province, China
Source
Journal of Applied Sciences (《应用科学学报》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue 5, pp. 559-569 (11 pages)
Funding
National Natural Science Foundation of China (No. 61300195)
Natural Science Foundation of Hebei Province (No. F2014501078, No. F2016501079)
Keywords
data uncertainty
data stream
decision tree
classification
concept drift