Abstract
To address the degradation of decision-tree performance caused by the severe imbalance between abnormal and normal data in anomaly detection, three improvements to the C4.5 decision tree are proposed: C4.5+δ, Uniform Distribution Entropy (UDE), and Improved Distribution Entropy Function (IDEF). First, it is shown by derivation that the attribute selection criterion of C4.5 tends to favor attributes that produce skewed (imbalanced) splits. Second, the reason why skewed splits reduce the detection accuracy for anomalies (the minority class) is analyzed. Third, the attribute selection criterion of C4.5, the information gain ratio, is improved in one of three ways: by introducing a relaxation factor, by introducing uniform distribution entropy, or by substituting the distribution entropy function. Finally, the improved decision trees are evaluated on the WEKA platform with the NSL-KDD dataset. The experimental results show that all three methods improve anomaly detection accuracy. Compared with C4.5, the C4.5+δ, UDE and IDEF algorithms raise the minority-class detection accuracy (sensitivity) on the KDDTest-21 dataset by 3.16, 3.02 and 3.12 percentage points respectively, and all three outperform methods that use Rényi entropy or Tsallis entropy as the splitting criterion. In addition, applying the three improved decision trees to anomaly detection in industrial control systems both increases the recall of anomalies and reduces the false positive rate.
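The bias toward skewed splits mentioned in the abstract can be illustrated with the standard C4.5 gain ratio (information gain divided by split information). The following is a minimal sketch, not the authors' derivation: the sample counts (10 anomalies among 100 records) and the two candidate splits are invented for illustration, and the function names are hypothetical. The C4.5+δ, UDE and IDEF formulas are not given in the abstract, so they are not reproduced here.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def info_gain(partitions):
    """Information gain of a split given as a list of per-branch label lists."""
    parent = [y for part in partitions for y in part]
    n = len(parent)
    children = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - children


def split_info(partitions):
    """Entropy of the branch-size distribution (denominator of the gain ratio)."""
    n = sum(len(p) for p in partitions)
    return -sum(len(p) / n * math.log2(len(p) / n) for p in partitions if p)


def gain_ratio(partitions):
    """Standard C4.5 gain ratio: information gain / split information."""
    si = split_info(partitions)
    return info_gain(partitions) / si if si > 0 else 0.0


# Hypothetical data: "a" = anomaly (minority), "n" = normal (majority).
# Balanced split: two branches of 50 records each.
balanced = [["a"] * 10 + ["n"] * 40, ["n"] * 50]
# Skewed split: a tiny branch of 2 records and a large branch of 98.
skewed = [["a"] * 2, ["a"] * 8 + ["n"] * 90]

for name, split in (("balanced", balanced), ("skewed", skewed)):
    print(f"{name}: gain={info_gain(split):.3f} "
          f"split_info={split_info(split):.3f} "
          f"gain_ratio={gain_ratio(split):.3f}")
```

In this toy case the skewed split separates the classes less well (lower information gain), yet its very small split information inflates the gain ratio, so plain C4.5 would prefer it. This is the kind of behavior the three proposed modifications are meant to counteract.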
Authors
WANG Wei (王伟)
XIE Yaobin (谢耀滨)
YIN Qing (尹青)
State Key Laboratory of Mathematical Engineering and Advanced Computing (Information Engineering University), Zhengzhou, Henan 450000, China
Source
Journal of Computer Applications (《计算机应用》)
CSCD
PKU Core Journal (北大核心)
2019, No. 3, pp. 623-628 (6 pages)
Funding
National Natural Science Foundation of China (61802431)
Keywords
imbalanced data
anomaly detection
decision tree
C4.5
information gain ratio