Funding: Supported by the Beijing Municipal Science and Technology Project (No. Z221100007122003).
Abstract: Imbalanced data classification is the task of classifying datasets in which the number of samples differs significantly between classes. This task is prevalent in practical scenarios such as industrial fault diagnosis, network intrusion detection, and cancer detection. In imbalanced classification tasks, the focus is typically on achieving high recognition accuracy for the minority class. However, owing to the challenges presented by imbalanced multi-class datasets, such as the scarcity of samples in minority classes and complex inter-class relationships with overlapping boundaries, existing methods often perform poorly on multi-class imbalanced data classification, particularly in recognizing minority classes with high accuracy. Therefore, this paper proposes a multi-class imbalanced data classification method called CSDSResNet, which is based on a cost-sensitive dual-stream residual network. First, to address the limited number of minority-class samples in imbalanced datasets, a dual-stream residual network backbone is designed to enhance the model's feature extraction capability. Next, considering the complexities arising from imbalanced inter-class sample quantities and imbalanced inter-class overlapping boundaries in multi-class imbalanced datasets, a cost-sensitive loss function is devised. This loss function places more emphasis on the minority class and on challenging classes with high inter-class similarity, thereby improving the model's classification ability. Finally, the effectiveness and generalization of CSDSResNet are evaluated on two datasets: 'DryBeans' and 'Electric Motor Defects'. The experimental results demonstrate that CSDSResNet achieves the best performance on imbalanced datasets, with macro_F1-score values improving by 2.9% and 1.9% on the two datasets, respectively, compared to current state-of-the-art classification methods. Furthermore, it achieves the highest precision in single-class recognition tasks for the minority class.
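To make the two ideas in this abstract concrete, the following is a minimal Python sketch of a generic class-weighted cross-entropy and of the macro-averaged F1 score. It is not the CSDSResNet loss, whose exact form is not given here; the inverse-frequency weighting and all function names are illustrative assumptions.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Assumed weighting scheme: n_samples / (n_classes * class_count).
    Rare classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Mean of -w[y] * log p(y); probs[i] maps each class to its predicted probability."""
    total = 0.0
    for p, y in zip(probs, labels):
        total += -weights[y] * math.log(p[y])
    return total / len(labels)

def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1, so every class counts equally."""
    scores = []
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

Because macro-F1 averages per-class scores without weighting by class size, a gain on a small minority class moves the metric as much as a gain on a large class, which is why it is a common choice for imbalanced multi-class evaluation.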
Funding: Project supported by the National Natural Science Foundation of China (No. 61972261), the Natural Science Foundation of Guangdong Province, China (No. 2023A1515011667), the Key Basic Research Foundation of Shenzhen, China (No. JCYJ20220818100205012), and the Basic Research Foundation of Shenzhen, China (No. JCYJ20210324093609026).
Abstract: The synthetic minority oversampling technique (SMOTE) is a popular algorithm for reducing the impact of class imbalance when building classifiers, and it has received several enhancements over the past 20 years. SMOTE and its variants synthesize a number of minority-class sample points in the original sample space to alleviate the adverse effects of class imbalance. This approach works well in many cases, but problems arise when synthetic sample points are generated in overlapping areas between different classes, which further complicates classifier training. To address this issue, this paper proposes a novel generalization-oriented, rather than imputation-oriented, minority-class sample point generation algorithm named overlapping minimization SMOTE (OM-SMOTE). This algorithm is designed specifically for binary imbalanced classification problems. OM-SMOTE first maps the original sample points into a new sample space by balancing sample encoding and classifier generalization. Then, OM-SMOTE employs a set of sophisticated minority-class sample point imputation rules to generate synthetic sample points that are as far as possible from overlapping areas between classes. Extensive experiments have been conducted on 32 imbalanced datasets to validate the effectiveness of OM-SMOTE. Results show that using OM-SMOTE to generate synthetic minority-class sample points leads to better classifier training performance for the naive Bayes, support vector machine, decision tree, and logistic regression classifiers than the 11 state-of-the-art SMOTE-based imputation algorithms. This demonstrates that OM-SMOTE is a viable approach for supporting the training of high-quality classifiers for imbalanced classification. The implementation of OM-SMOTE is shared publicly on the GitHub platform at https://github.com/luxuan123123/OM-SMOTE/.
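For readers unfamiliar with the baseline that this abstract builds on, the sketch below shows the core interpolation step of plain SMOTE in Python. It is not OM-SMOTE (the space mapping and overlap-avoiding rules are not reproduced); the function name, the default `k`, and the fixed seed are illustrative assumptions.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE sketch: each synthetic point lies on the segment between
    a random minority point and one of its k nearest minority neighbours.
    Requires at least two minority points."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance; sufficient for ranking neighbours.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        neighbours = sorted(
            (j for j in range(len(minority)) if j != i),
            key=lambda j: dist2(base, minority[j]),
        )[:k]
        other = minority[rng.choice(neighbours)]
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic
```

Since every synthetic point is a convex combination of two existing minority points, it can land inside a region occupied by the other class when the classes overlap, which is the failure mode the abstract describes.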
Funding: Supported by the Yunnan Major Scientific and Technological Projects (Grant No. 202302AD080001) and the National Natural Science Foundation, China (No. 52065033).
Abstract: When building a classification model, the scenario in which one class has significantly more samples than the other is called data imbalance. Data imbalance biases the trained classification model toward the majority class (usually defined as the negative class), which may harm accuracy on the minority class (usually defined as the positive class) and thus lead to poor overall model performance. This article proposes a method called MSHR-FCSSVM for imbalanced data classification, which is based on a new hybrid resampling approach (MSHR) and a new fine cost-sensitive support vector machine (CS-SVM) classifier (FCSSVM). MSHR measures the separability of each negative sample through its Silhouette value, calculated using the Mahalanobis distance between samples. On this basis, the so-called pseudo-negative samples are screened out, used to generate new positive samples through linear interpolation (over-sampling step), and finally deleted (under-sampling step). This approach replaces pseudo-negative samples with newly generated positive samples one by one to clear up the inter-class overlap on the borderline without changing the overall size of the dataset. FCSSVM is an improved version of the traditional CS-SVM. It simultaneously considers the influence of both the imbalance in sample numbers and the class distribution on classification, and it finely tunes the class cost weights using the rime-ice (RIME) optimization algorithm, which is based on the physical phenomenon of rime ice, with cross-validation accuracy as the fitness function, so as to adjust the classification borderline accurately. To verify the effectiveness of the proposed method, a series of experiments is carried out on 20 imbalanced datasets, including both mildly and extremely imbalanced datasets. The experimental results show that MSHR-FCSSVM performs better than the comparison methods in most cases, and that both MSHR and FCSSVM play significant roles.
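The screening idea described above can be illustrated with a short Python sketch that computes a per-sample silhouette value in a two-class setting and flags negative samples with low separability. This is only an approximation of the idea: Euclidean distance stands in for the Mahalanobis distance used by MSHR, and the threshold of 0 and all names are assumptions rather than the paper's settings.

```python
import math

def euclidean(a, b):
    # Stand-in metric (assumption); MSHR itself uses Mahalanobis distance.
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def silhouette(i, points, labels, dist=euclidean):
    """Two-class silhouette: (b - a) / max(a, b), where a is the mean distance
    to the sample's own class and b the mean distance to the other class."""
    own = [dist(points[i], points[j]) for j in range(len(points))
           if j != i and labels[j] == labels[i]]
    other = [dist(points[i], points[j]) for j in range(len(points))
             if labels[j] != labels[i]]
    if not own or not other:
        return 0.0
    a = sum(own) / len(own)
    b = sum(other) / len(other)
    return (b - a) / max(a, b) if max(a, b) > 0 else 0.0

def low_separability_negatives(points, labels, negative=0, threshold=0.0):
    """Indices of negative samples whose silhouette falls below the threshold."""
    return [i for i in range(len(points))
            if labels[i] == negative and silhouette(i, points, labels) < threshold]
```

A negative silhouette means the sample is, on average, closer to the opposite class than to its own, which is the intuition behind treating such samples as candidates for replacement.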
Funding: This work was supported in part by the Anhui Provincial Natural Science Foundation (No. 2208085MF168) and the Program for Synergy Innovation in the Anhui Higher Education Institutions of China (Nos. GXXT-2019-025 and GXXT-2022-052).
Abstract: The problem of imbalanced data classification learning has received much attention. Conventional classification algorithms are susceptible to data skew, favoring majority samples and ignoring minority samples. The majority weighted minority oversampling technique (MWMOTE) is an effective approach to this problem; however, it may suffer from inadequate noise filtering and from synthesizing samples identical to the original minority data. To this end, we propose an improved MWMOTE method named joint sample position based noise filtering and mean shift clustering (SPMSC). First, to effectively eliminate the effect of noisy samples, SPMSC uses a new noise filtering mechanism that determines whether a minority sample is noisy based on its position and distribution relative to the majority samples. Because MWMOTE may generate duplicate samples, we then employ the mean shift algorithm to cluster minority samples and reduce duplicate synthetic samples. Finally, data cleaning is performed on the processed data to further eliminate class overlap. Experiments on extensive benchmark datasets demonstrate the effectiveness of SPMSC compared with other sampling methods.
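As a point of reference for the noise-filtering step, here is a minimal Python sketch of a common neighbourhood-based heuristic: a minority sample is flagged as noise when all of its k nearest neighbours belong to the majority class. This is not the position-based mechanism of SPMSC, whose details are not given in the abstract; the function name and the default `k` are assumptions.

```python
def noisy_minority_indices(points, labels, minority=1, k=3):
    """Flag minority samples whose k nearest neighbours are all majority samples."""
    def dist2(a, b):
        # Squared Euclidean distance; sufficient for ranking neighbours.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    noisy = []
    for i, (p, y) in enumerate(zip(points, labels)):
        if y != minority:
            continue
        nearest = sorted(
            (j for j in range(len(points)) if j != i),
            key=lambda j: dist2(p, points[j]),
        )[:k]
        if nearest and all(labels[j] != minority for j in nearest):
            noisy.append(i)
    return noisy
```

Samples flagged this way would typically be excluded before oversampling so that synthetic points are not generated around isolated minority points deep inside the majority region.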
Funding: Supported by the National Key R&D Program of China (Nos. 2022YFB3104103 and 2019QY1406) and the National Natural Science Foundation of China (Nos. 61732022, 61732004, 61672020, and 62072131).
Abstract: Credit Card Fraud Detection (CCFD) is an essential technology for banking institutions to control fraud risks and safeguard their reputation. Class imbalance and insufficient representation of feature data relating to credit card transactions are two prevalent issues in current CCFD research, and both significantly impact the performance of classification models. To address these issues, this research proposes a novel CCFD model based on Multi-feature Fusion and Generative Adversarial Networks (MFGAN). The MFGAN model consists of two modules: a multi-feature fusion module that integrates static and dynamic behavior data of cardholders into a unified high-dimensional feature space, and a balance module based on a generative adversarial network that decreases the class imbalance ratio. The effectiveness of the MFGAN model is validated on two real credit card datasets. The impact of different class balance ratios on the performance of the four resampling models is analyzed, and the contribution of the two modules to the performance of the MFGAN model is investigated via ablation experiments. Experimental results demonstrate that the proposed model outperforms state-of-the-art models in terms of recall, F1, and Area Under the Curve (AUC), which means that the MFGAN model can help banks find more fraudulent transactions and reduce fraud losses.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61976198, the Natural Science Research Key Project for Colleges and Universities of Anhui Province under Grant No. KJ2019A0726, and the High-level Scientific Research Foundation for the Introduction of Talent of Hefei Normal University under Grant No. 2020RCJJ44.
Abstract: The performance of traditional imbalanced classification algorithms degrades when dealing with highly imbalanced data, and handling such data remains a difficult problem. In this paper, the authors propose an ensemble tree classifier for highly imbalanced data classification. The ensemble tree classifier is constructed with a complete binary tree structure. A mathematical model is established based on the features and classification performance of the classifier, and it is proven that the model parameters of the ensemble classifier can be solved by calculation. First, the AdaBoost method is used as the benchmark classifier to construct the tree structure model. Then, the classification cost of the model is calculated, and a quantitative mathematical description of the relationship between the cost and the features of the ensemble tree classifier model is obtained. Next, minimizing the cost of the classification model is formulated as an optimization problem, and the parameters of the ensemble tree classifier are obtained through theoretical derivation. The approach is tested on several highly imbalanced datasets from different fields, with the AUC (area under the curve) and F-measure as evaluation criteria. Compared with traditional imbalanced classification algorithms, the ensemble tree classifier achieves better classification performance.
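The two evaluation criteria named in this abstract can be computed as in the following Python sketch. AUC is written in its rank form (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counting one half), and the F-measure uses the standard F-beta definition; the function names and the convention that label 1 is the positive class are assumptions.

```python
def auc(scores, labels):
    """Area under the ROC curve via pairwise comparison (Mann-Whitney form)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def f_measure(y_true, y_pred, beta=1.0):
    """F-beta score for the positive class (label 1); beta=1 gives F1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0
```

Both metrics ignore true negatives in different ways, which makes them less sensitive than plain accuracy to a dominant majority class.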
Funding: Supported by the National Natural Science Foundation of China (No. 41907048), the Fundamental Research Funds for the Central Universities, CHD (No. 300102260206), and the Shaanxi Academy of Forestry (No. SXLK2023-02-15).
Abstract: Check dams have been widely constructed on the Chinese Loess Plateau and have played an important role in controlling soil loss during the last 70 years. However, large-scale, automatic mapping of check dams and the resulting silted fields is lacking. In this study, we present a novel methodological framework that uses remote sensing and ensemble learning models to extract silted fields and estimate the location of check dams at the pixel level in the Wuding River catchment. The random under-sampling method and 23 features were used to train and validate three ensemble learning models, namely Random Forest, Extreme Gradient Boosting, and EasyEnsemble, based on a large number of samples. The resulting optimal model was then applied to the whole study area to map check dams and silted fields. Our results indicate that the imbalance ratio of the samples has a significant impact on the performance of the models. Validation on the testing set shows that the F1-score for silted fields is higher than 0.75 at the pixel level for all three models. Finally, we produced a map of silted fields and check dams at 10 m spatial resolution using the optimal model, with an accuracy of ca. 90% at the object level. The proposed framework can be used for large-scale, high-precision mapping of check dams and silted fields, which is of great significance for monitoring and managing the dynamics of check dams and for quantitatively evaluating their eco-environmental benefits.
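Random under-sampling, as mentioned in this abstract, can be sketched in a few lines of Python. The `ratio` parameter (majority samples kept per non-majority sample) is an illustrative assumption used to show how different imbalance ratios could be produced; it does not reflect the study's actual settings.

```python
import random

def random_undersample(samples, labels, majority, ratio=1.0, seed=0):
    """Keep all non-majority samples and a random subset of majority samples
    so that majority_count is about ratio * non_majority_count."""
    rng = random.Random(seed)
    maj_idx = [i for i, y in enumerate(labels) if y == majority]
    min_idx = [i for i, y in enumerate(labels) if y != majority]
    keep_n = min(len(maj_idx), int(round(ratio * len(min_idx))))
    # Sort the kept indices so the original sample order is preserved.
    kept = sorted(rng.sample(maj_idx, keep_n) + min_idx)
    return [samples[i] for i in kept], [labels[i] for i in kept]
```

Calling the function with several values of `ratio` and retraining each time is one simple way to examine how the imbalance ratio of the training samples affects model performance.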