
FS-CRF: Outlier Detection Model Based on Feature Segmentation and Cascaded Random Forest (cited by: 2)
Abstract In the era of big data, attacks and tampering, equipment failures, and deliberate falsification leave many abnormal values hidden in massive data sets. Accurately detecting these outliers is essential for data cleaning. This paper proposes FS-CRF, an outlier detection model based on Feature Segmentation and Cascaded Random Forest. A sliding window and random forests segment the original features at a fine granularity and generate class probability vectors, which are used to train a multi-layer cascade of random forests; the random forest in the last cascade layer votes to decide the final class of each sample. Simulation experiments show that the new method achieves high F1-measure scores in outlier classification tasks on multiple UCI data sets, both high- and low-dimensional. The cascade structure further improves generalization over the classical random forest algorithm. On high-dimensional data sets the proposed method outperforms gradient boosting decision trees (GBDT) and XGBoost, and it has fewer hyperparameters, which makes it easier to tune and gives it better overall performance.
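The pipeline summarized in the abstract (sliding-window feature segmentation, class probability vectors, a cascade of random forests, final-layer vote) can be illustrated with a minimal sketch. This is not the authors' implementation: the names `sliding_windows` and `FSCRFSketch`, the window, stride, and layer settings, the use of out-of-fold probabilities, and the choice to concatenate the segment-level probabilities with each layer's output are all assumptions made for illustration, and synthetic data stands in for the UCI data sets. It assumes NumPy and scikit-learn are installed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import cross_val_predict, train_test_split


def sliding_windows(n_features, window, stride):
    """Return (start, stop) column ranges that cover the feature axis."""
    last = max(n_features - window, 0)
    starts = list(range(0, last + 1, stride))
    if starts[-1] != last:
        starts.append(last)  # make sure the trailing features are covered
    return [(s, min(s + window, n_features)) for s in starts]


class FSCRFSketch:
    """Hypothetical sketch: window-level forests feeding a forest cascade."""

    def __init__(self, window=4, stride=2, n_layers=2, n_estimators=30, seed=0):
        self.window, self.stride = window, stride
        self.n_layers, self.n_estimators, self.seed = n_layers, n_estimators, seed

    def _forest(self, offset):
        return RandomForestClassifier(
            n_estimators=self.n_estimators, random_state=self.seed + offset
        )

    def fit(self, X, y):
        self.windows_ = sliding_windows(X.shape[1], self.window, self.stride)
        self.slice_forests_, probs = [], []
        for i, (a, b) in enumerate(self.windows_):
            rf = self._forest(i)
            # Assumption: out-of-fold probabilities limit label leakage.
            probs.append(
                cross_val_predict(rf, X[:, a:b], y, cv=3, method="predict_proba")
            )
            self.slice_forests_.append(rf.fit(X[:, a:b], y))
        feats = np.hstack(probs)  # concatenated class probability vectors

        self.cascade_, layer_in = [], feats
        for k in range(self.n_layers):
            rf = self._forest(1000 + k)
            is_last = k == self.n_layers - 1
            if not is_last:
                out = cross_val_predict(rf, layer_in, y, cv=3, method="predict_proba")
            self.cascade_.append(rf.fit(layer_in, y))
            if not is_last:
                # Assumption: the next layer sees segment features plus this output.
                layer_in = np.hstack([feats, out])
        return self

    def _segment(self, X):
        return np.hstack(
            [
                rf.predict_proba(X[:, a:b])
                for rf, (a, b) in zip(self.slice_forests_, self.windows_)
            ]
        )

    def predict(self, X):
        feats = self._segment(X)
        layer_in = feats
        for rf in self.cascade_[:-1]:
            layer_in = np.hstack([feats, rf.predict_proba(layer_in)])
        # The forest in the last cascade layer votes on the final class.
        return self.cascade_[-1].predict(layer_in)


# Synthetic, imbalanced data: class 1 plays the role of the outliers.
X, y = make_classification(
    n_samples=300, n_features=12, n_informative=6, weights=[0.9, 0.1], random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)
model = FSCRFSketch().fit(X_train, y_train)
y_pred = model.predict(X_test)
f1 = f1_score(y_test, y_pred, zero_division=0)
print(f"F1 on the held-out split: {f1:.3f}")
```

In this sketch the only structural hyperparameters are the window size, the stride, the number of cascade layers, and the number of trees per forest, which is in the spirit of the abstract's remark that the model has few hyperparameters to tune.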
Authors 刘振鹏 (LIU Zhen-peng); 苏楠 (SU Nan); 秦益文 (QIN Yi-wen); 卢家欢 (LU Jia-huan); 李小菲 (LI Xiao-fei). Affiliations: School of Cyber Security and Computer, Hebei University, Baoding, Hebei 071002, China; Information Technology Center, Hebei University, Baoding, Hebei 071002, China; School of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China
Source 《计算机科学》 (Computer Science), indexed in CSCD and the Peking University Core Journals list, 2020, No. 8, pp. 185-188 (4 pages)
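Funding Natural Science Foundation of Hebei Province (F2019201427); Ministry of Education "Cloud-Data Fusion Science and Education Innovation" Fund (2017A20004).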
Keywords Data cleaning; Fine-grained feature; Cascaded random forest; Ensemble learning; Outlier detection
