Abstract
To improve the efficiency of classification mining on data streams with continuous attributes, a fast decision tree classification algorithm based on the red-black tree, called VFDT_RBT, is designed and implemented. The algorithm uses a red-black tree to handle sample insertion more efficiently, so that the time complexity remains O(n log n) even when samples are inserted in sorted order. It uses a stack together with the sorted order of the red-black tree's in-order traversal to reduce the time complexity of selecting the best split threshold. The Hoeffding inequality is used to determine the number of samples needed to fix the split threshold of a continuous attribute. Split attributes are chosen under the principle that a continuous attribute may appear multiple times in the decision tree, which improves classification accuracy. Classification experiments on several data sets show that VFDT_RBT has lower time complexity and higher classification accuracy than the existing VFDTc, and that it is better suited to samples with many attributes.
Source
《南京邮电大学学报(自然科学版)》
Peking University Core Journal (北大核心)
2017, No. 2, pp. 86-90 (5 pages)
Journal of Nanjing University of Posts and Telecommunications:Natural Science Edition
Funding
Supported by the National Natural Science Foundation of China (61302158, 61571238)
Keywords
data streams
red-black tree
continuous attribute
VFDTc
decision tree