Abstract
The SMOTE algorithm is widely used in research on imbalanced data, but noisy samples in the original data set can blur class boundaries and distort the data distribution. This paper proposes the BSL-FSRF algorithm, which combines balanced sampling with feature selection. First, BSL sampling is proposed: minority-class samples are divided into safe, noise, and boundary samples; SMOTE interpolation is applied only to the boundary samples, and Tomek links are then used for data cleaning, so that the data set becomes approximately balanced while the number of noise samples is reduced. Second, the "hypothesis margin" idea is introduced to score each feature dimension; with a suitable threshold, features weakly correlated with the class label are removed, reducing the dimensionality of the data. Finally, a random forest serves as the classifier, and an improved grid search algorithm is used to tune its parameters. Experiments with BSL-FSRF on public data sets show that the algorithm clearly improves both the classification accuracy on minority-class samples and the overall performance of the classifier, while also reducing running time.
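The sampling step described in the abstract can be illustrated with a minimal sketch. The abstract does not state how samples are assigned to the three groups, so the rule below is an assumption borrowed from the standard Borderline-SMOTE convention: a minority sample is noise if all k nearest neighbours belong to the majority class, boundary if at least half do, and safe otherwise. The function names, the default k, and the `minority` label are hypothetical, and the Tomek-link cleaning, feature selection, and random forest stages are not shown.

```python
import math
import random

def k_nearest(idx, points, k):
    """Indices of the k points closest to points[idx] (Euclidean), excluding idx."""
    others = [j for j in range(len(points)) if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]

def partition_minority(X, y, minority=1, k=5):
    """Split minority samples into safe / boundary / noise groups.

    Assumed rule (Borderline-SMOTE style, not confirmed by the abstract):
    count majority-class samples among the k nearest neighbours.
    """
    groups = {"safe": [], "boundary": [], "noise": []}
    for i, label in enumerate(y):
        if label != minority:
            continue
        n_major = sum(1 for j in k_nearest(i, X, k) if y[j] != minority)
        if n_major == k:
            groups["noise"].append(i)       # surrounded by majority samples
        elif 2 * n_major >= k:
            groups["boundary"].append(i)    # at least half are majority
        else:
            groups["safe"].append(i)
    return groups

def smote_on_boundary(X, y, boundary, minority=1, k=5, n_new=1, seed=0):
    """SMOTE interpolation restricted to the boundary samples.

    Each synthetic point lies on the segment between a boundary sample
    and one of its k nearest minority-class neighbours.
    """
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    min_pts = [X[i] for i in min_idx]
    synthetic = []
    for i in boundary:
        pos = min_idx.index(i)
        neighbours = k_nearest(pos, min_pts, k)
        if not neighbours:                  # only one minority sample exists
            continue
        for _ in range(n_new):
            nb = min_pts[rng.choice(neighbours)]
            gap = rng.random()              # interpolation factor in [0, 1)
            synthetic.append([a + gap * (b - a) for a, b in zip(X[i], nb)])
    return synthetic
```

Calling `partition_minority(X, y)` and passing the resulting `groups["boundary"]` to `smote_on_boundary` yields the synthetic minority points; safe and noise samples are left untouched, which is the property the abstract emphasises.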
Authors
张忠林
曹婷婷
ZHANG Zhong-lin; CAO Ting-ting (College of Electronic and Information Engineering, Lanzhou Jiaotong University, Lanzhou 730070, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journal
2020, No. 6, pp. 1327-1333 (7 pages)
Journal of Chinese Computer Systems
Funding
Supported by the National Natural Science Foundation of China (61662043).