Abstract
Outlier detection becomes inefficient once a data set grows beyond a certain scale. To address this problem, a dynamic outlier detection algorithm for network big data sets based on the classification and regression tree (CART) decision tree is proposed. First, criteria for abnormal data in the big data set are defined: variance is used to measure data dispersion, a support vector machine is used to build an association rule matrix of abnormal data samples, the range of abnormal data is determined, and a dynamic grid partitioning strategy reduces the computational cost of outlier detection. Then, the CART decision tree method applies a Boolean test at each branch node: the data to be examined are uniformly treated as continuous, the training data set is sorted in ascending order, the maximum information gain is computed, and the decision tree is pruned until no non-leaf node can be replaced, yielding the dynamic outlier detection result. Simulation results show that the proposed algorithm achieves high outlier detection accuracy and short detection time, giving it a clear computational advantage and supporting the reliable use of big data sets.
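The Boolean test at a branch node can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract says only "maximum information gain", and the sketch assumes the gain is measured as the reduction in Gini impurity, a choice suggested by the keyword list. The function names `gini` and `best_boolean_split`, and the labels `"normal"` and `"outlier"`, are hypothetical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels (0.0 for an empty list)."""
    n = len(labels)
    if n == 0:
        return 0.0
    counts = Counter(labels)
    return 1.0 - sum((c / n) ** 2 for c in counts.values())


def best_boolean_split(values, labels):
    """Find the Boolean test `value <= threshold` with the largest gain.

    Assumption: gain = parent Gini impurity minus the size-weighted
    Gini impurity of the two children. Returns (threshold, gain), or
    (None, 0.0) when no split improves on the parent node.
    """
    # Treat the feature as continuous and sort the training data ascending.
    pairs = sorted(zip(values, labels))
    n = len(pairs)
    parent = gini([lab for _, lab in pairs])

    best_threshold, best_gain = None, 0.0
    for i in range(1, n):
        lo, hi = pairs[i - 1][0], pairs[i][0]
        if lo == hi:
            continue  # no threshold can separate equal values
        left = [lab for _, lab in pairs[:i]]
        right = [lab for _, lab in pairs[i:]]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / n
        gain = parent - weighted
        if gain > best_gain:
            # Candidate threshold: midpoint between adjacent sorted values.
            best_threshold, best_gain = (lo + hi) / 2, gain
    return best_threshold, best_gain


if __name__ == "__main__":
    values = [1.0, 1.2, 0.9, 1.1, 9.5, 10.2]
    labels = ["normal"] * 4 + ["outlier"] * 2
    print(best_boolean_split(values, labels))
```

A full tree would apply this search recursively to each child node and then prune. Both steps are omitted here because the abstract does not specify the pruning criterion.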
Authors
FU Li-fang (傅丽芳)
CHEN Zhuo (陈卓)
AO Chang-lin (敖长林)
Affiliations: College of Science, Northeast Agricultural University, Harbin 150030, China; College of Engineering, Northeast Agricultural University, Harbin 150030, China
Source
Journal of Jilin University: Engineering and Technology Edition (《吉林大学学报(工学版)》)
EI
CAS
CSCD
Peking University Core Journals (北大核心)
2023, No. 9, pp. 2620-2625 (6 pages)
Funding
National Natural Science Foundation of China (71874026).
Keywords
classification and regression trees (CART) decision tree
large data sets
outlier detection
data preprocessing
grid partitioning
Gini coefficient