Abstract
Over-sampling is one of the effective ways to handle class imbalance in classifier learning. However, existing over-sampling methods tend to generate highly similar samples, which can cause over-fitting. To address this problem, this paper proposes GJ-RSMOTE, an over-sampling method based on the Gaussian mixture model and Jensen-Shannon divergence. The method clusters the minority-class samples with a Gaussian mixture model and computes the number of samples to generate in each cluster from the sparsity of that cluster. To avoid over-fitting, it uses hypersphere interpolation to expand the range of the generated samples. Jensen-Shannon divergence is used to control the final number of generated samples. Experimental results show that GJ-RSMOTE balances the class distribution of the samples and effectively improves the recognition accuracy of classification models.
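As a rough illustration of the Jensen-Shannon divergence named in the abstract, the sketch below computes it for two discrete distributions (for example, histograms of one feature before and after over-sampling). The abstract does not say how GJ-RSMOTE estimates the distributions or what stopping rule it applies, so the function names and the histogram inputs here are assumptions for illustration only, not the authors' implementation.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in nats for discrete distributions."""
    # Terms with p_i == 0 contribute 0 by convention.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two non-negative weight vectors.

    Inputs are normalized to sum to 1. The result is symmetric in p and q
    and lies in [0, ln 2] when the natural logarithm is used.
    """
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    total_p, total_q = sum(p), sum(q)
    if total_p <= 0 or total_q <= 0:
        raise ValueError("p and q must each have positive total weight")
    p = [x / total_p for x in p]
    q = [x / total_q for x in q]
    # Mixture distribution m = (p + q) / 2; m_i > 0 wherever p_i or q_i > 0,
    # so the KL terms below never divide by zero.
    m = [(a + b) / 2 for a, b in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)


# Hypothetical histograms (bin counts) of one feature in two sample sets.
original_hist = [10, 30, 40, 20]
augmented_hist = [12, 28, 38, 22]
print(js_divergence(original_hist, augmented_hist))
```

A value near 0 means the two histograms are close to each other; a value near ln 2 means they barely overlap. How such a value would be turned into a limit on the number of generated samples is left open here, since the abstract gives no threshold.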
Authors
李国和
刘顺欣
张予杰
郑艺峰
洪云峰
周晓明
Li Guohe; Liu Shunxin; Zhang Yujie; Zheng Yifeng; Hong Yunfeng; Zhou Xiaoming (Beijing Key Lab of Petroleum Data Mining, China University of Petroleum-Beijing, Beijing 102249, China; College of Information Science and Engineering, China University of Petroleum-Beijing, Beijing 102249, China; Oil & Gas Development of Talimu Oil Field, Kuerle 841000, Xinjiang, China; China Anti-Infringement and Anti-Counterfeit Innovation Strategic Alliance, Hangzhou 310010, Zhejiang, China; Xiamen Hanying Internet of Things Application Research Institute, Xiamen 361021, Fujian, China)
Source
《计算机应用与软件》
Peking University Core Journal (北大核心)
2022, No. 10, pp. 230-237 (8 pages)
Computer Applications and Software
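Funding
National Natural Science Foundation of China (60473125)
Research Start-up Fund of China University of Petroleum-Beijing at Karamay (RCYJ2016B-03-001)
Natural Science Foundation of Fujian Province (2018J01546, 2019J01748).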