For imbalanced datasets, the focus of classification is to identify samples of the minority class, yet current data mining algorithms do not perform well enough on such data. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets: it generates synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. In addition, density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when dealing with samples near the borderline, so we optimize the DBSCAN algorithm to make the clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN divides the minority class samples into three groups: core samples, borderline samples, and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in the core and borderline samples, different over-sampling strategies are applied to each group. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
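The interpolation step that the abstract above attributes to plain SMOTE can be sketched as follows. This is a minimal illustration only, not the DSMOTE procedure: the function name `smote_sketch`, the parameters `n_new`, `k`, and `seed`, and the toy data are all assumptions made for the example.

```python
import math
import random


def smote_sketch(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    (Hypothetical helper for illustration; not the DSMOTE algorithm.)"""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of base, excluding base itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (m - b) for b, m in zip(base, mate)))
    return synthetic


# Toy minority class (assumed data)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why noisy minority samples can pull synthetic points into majority regions; this is the kind of overgeneralization the abstract refers to.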
A new identification method for a linear discrete-time closed-loop system is proposed based on an output over-sampling scheme. When the system outputs are over-sampled, the new output sequences contain more information about the plant structure. The over-sampled plant model is identified using the generalized least squares (GLS) method, and the original plant model is then obtained from its relationship with the over-sampled model. Compared with conventional approaches, the advantage of the new method is that, even if the ordinary identifiability conditions are not satisfied, a closed-loop system can be identified from the over-sampled output without any external test signal. Accuracy analysis shows the relationship between the estimation error and the over-sampling rate. Numerical simulation illustrates its effectiveness.
The β-turn is one of the most important reverse turns because of its role in protein folding. Many computational methods have been studied for predicting β-turns and β-turn types. However, because the datasets are imbalanced, performance is still inadequate. In this study, we propose a novel over-sampling technique, FOST, to deal with the class-imbalance problem. Experimental results on three standard benchmark datasets show that our method is comparable with state-of-the-art methods. In addition, we applied our algorithm to five benchmark datasets from the UCI Machine Learning Repository and achieved significant improvement in G-mean and Sensitivity. This indicates that our method is also effective for imbalanced data other than β-turns and β-turn types.
MicroRNAs (miRNAs) are short (~22 nt) non-coding RNAs that play an indispensable role in gene regulation in many biological processes. Most current computational, comparative, and non-comparative methods classify human precursor microRNA (pre-miRNA) hairpins against both genome pseudo hairpins and other non-coding RNAs (ncRNAs). Although a few approaches have achieved promising results by applying class imbalance learning methods, the issue has not yet been solved completely by existing methods because of the imbalanced class distribution in the datasets. For example, SMOTE is a well-known, general over-sampling method addressing this problem; however, in some cases it fails to improve classification performance or even reduces it. Therefore, we developed a novel over-sampling method named incremental-SMOTE to distinguish human pre-miRNA hairpins from both genome pseudo hairpins and other ncRNAs. Experimental results on pre-miRNA datasets from Batuwita et al. showed that our method achieved better Sensitivity and G-mean than the control (no over-sampling), SMOTE, and several modified successors of SMOTE, including safe-level-SMOTE and borderline-SMOTE. In addition, we applied the novel method to five imbalanced benchmark datasets from the UCI Machine Learning Repository and achieved improvements in Sensitivity and G-mean. These results suggest that our method outperforms SMOTE and several of its successors in various biomedical classification problems, including miRNA classification.
Stroke is a life-threatening disease usually caused by a blockage of blood or insufficient blood flow to the brain. It has a tremendous impact on every aspect of life, since it is the leading global cause of disability and morbidity. Strokes can range from minor to severe (extensive), so early stroke assessment and treatment can enhance survival rates. Manual prediction is extremely time and resource intensive. Automated prediction methods based on modern Information and Communication Technologies (ICTs), particularly those in the Machine Learning (ML) area, are crucial for the early diagnosis and prognosis of stroke. Therefore, this research proposes an ensemble voting model based on three ML algorithms: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM). We apply data preprocessing to handle the outliers and useless instances in the dataset. Furthermore, to address the problem of imbalanced data, we enhance the minority class's representation using the Synthetic Minority Over-Sampling Technique (SMOTE), allowing it to take an active part in the learning process. Results reveal that the suggested model outperforms existing studies and other classifiers, with an accuracy of 0.96, precision of 0.97, recall of 0.97, and F1-score of 0.96. The experiment demonstrates that the proposed ensemble voting model outperforms state-of-the-art and other traditional approaches.
In the class-imbalanced learning scenario, traditional machine learning algorithms that optimize overall accuracy tend to achieve poor classification performance, especially for the minority class, in which we are most interested. Many effective approaches have been proposed to solve this problem. Among them, bagging ensemble methods integrated with under-sampling techniques have demonstrated better performance than some others, including bagging ensemble methods integrated with over-sampling techniques and cost-sensitive methods. Although these under-sampling techniques promote diversity among the generated base classifiers through random partition or sampling of the majority class, they take no measure to ensure the classification performance of each individual classifier, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With our fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean, and AUC all demonstrate its superior performance.
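The random under-sampling bagging baseline that the abstract above contrasts with EUS-Bag can be sketched as below. This is an assumption-laden illustration of the baseline only (each bag keeps all minority samples plus an equally sized random majority subset); it does not implement EUS or its fitness function, and `balanced_bags` is a hypothetical helper name.

```python
import random


def balanced_bags(majority, minority, n_bags, seed=0):
    """Build n_bags balanced training sets by pairing all minority samples
    with a random, equally sized subset of the majority class.
    Assumes len(majority) >= len(minority)."""
    rng = random.Random(seed)
    return [
        rng.sample(majority, len(minority)) + list(minority)
        for _ in range(n_bags)
    ]


# Toy data (assumed): 10 majority samples, 3 minority samples
majority = [("maj", i) for i in range(10)]
minority = [("min", i) for i in range(3)]
bags = balanced_bags(majority, minority, n_bags=4)
```

One base classifier would then be trained per bag. Because the majority subsets are drawn at random, the bags are diverse, but nothing guarantees that any single subset yields an accurate classifier, which is the gap the abstract says EUS-Bag targets.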
Rapid channel variation can induce intercarrier interference in orthogonal frequency-division multiplexing (OFDM) systems. Intercarrier interference significantly increases the difficulty of OFDM channel estimation because too many channel coefficients need to be estimated. In this article, a novel channel estimator is proposed to resolve this problem. The estimator consists of two parts: the channel parameter estimation unit (CPEU), which estimates the number of channel taps and the multipath time delays, and the channel coefficient estimation unit (CCEU), which estimates the channel coefficients using the channel parameters provided by the CPEU. In the CCEU, the over-sampling basis expansion model is used to address the large number of channel coefficients that need to be estimated. Finally, simulation results are given to evaluate the performance of the proposed scheme.
Credit risk assessment is an important risk management task for financial institutions. Machine learning-based approaches have made promising progress in credit risk assessment by treating it as an imbalanced binary classification task. However, few efforts have been made to deal with the class overlap problem that accompanies the imbalance. To this end, this study proposes a Tomek link and genetic algorithm (GA)-based under-sampling framework (TEUS) to address the class imbalance and overlap issues in binary credit classification by eliminating majority class instances while considering multi-perspective factors. TEUS first determines the boundary majority instances with Tomek links, then takes the distance from each majority instance to its nearest boundary as the radius and assigns the density of opposite-class samples within that radius as the overlap potential of the majority instance. Second, TEUS weighs each non-borderline majority instance based on its information contribution in estimating class labels. After partitioning the non-borderline majority instances into subgroups according to overlap potential and information contribution, TEUS applies the GA to select samples from the subgroups and merges them with the minority samples into a new training set. The design of the GA fitness function and the grouping of the non-borderline majority instances not only trade off the multi-perspective characteristics of instances but also help reduce the computational complexity of the sampling optimization search. Numerical experiments on real-world credit data sets demonstrate the effectiveness of the proposed TEUS.
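The Tomek link step mentioned in the abstract above can be sketched as follows, using the standard definition: two samples of opposite classes form a Tomek link when each is the other's nearest neighbour. Only this first step is illustrated; the overlap potential, information weighting, and GA search of TEUS are not reproduced, and the function name and toy data are assumptions.

```python
import math


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, of opposite-class samples that are
    mutual nearest neighbours (standard Tomek link definition)."""

    def nearest(i):
        # Index of the sample closest to points[i], excluding i itself
        return min(
            (j for j in range(len(points)) if j != i),
            key=lambda j: math.dist(points[i], points[j]),
        )

    links = []
    for i in range(len(points)):
        j = nearest(i)
        if i < j and labels[i] != labels[j] and nearest(j) == i:
            links.append((i, j))
    return links


# Toy data (assumed): label 0 = majority, label 1 = minority
points = [(0.0, 0.0), (0.0, 1.0), (2.0, 0.0), (2.3, 0.0), (5.0, 5.0)]
labels = [0, 0, 0, 1, 1]
links = tomek_links(points, labels)
```

Here the majority sample at index 2 and the minority sample at index 3 are mutual nearest neighbours, so they form the only link; the majority member of such a pair is what an under-sampling scheme would treat as a boundary instance.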
Heart disease is considered a significant threat worldwide and substantially increases the mortality rate. Investigators therefore aim to predict the occurrence of heart disease at an earlier stage by designing a better Clinical Decision Support System (CDSS). Generally, a CDSS is used to predict an individual's heart disease and to periodically update the patient's condition. This research proposes a novel heart disease prediction system with a CDSS that includes a clustering model for noise removal, which detects and eliminates outliers. Here, a synthetic over-sampling model is integrated with the clustering concept to balance the training data, and the AdaBoost classifier is used to predict heart disease. Optimization is then performed using the Adam Optimizer (AO) on the publicly available Statlog dataset. This flow is used to construct the model, which is evaluated against various prevailing approaches such as Decision Tree, Random Forest, Logistic Regression, and Naive Bayes. Statistical analysis is performed with the Wilcoxon rank-sum method to obtain the p-value of the model. The observed results show that the proposed model outperforms the various existing approaches and attains efficient prediction accuracy. The model helps physicians make better decisions in complex conditions and diagnose the disease at an earlier stage, and earlier treatment helps reduce the death rate. Simulation is done with MATLAB 2016b, and metrics such as accuracy, precision-recall, F-measure, p-value, and ROC are analyzed to show the significance of the model.
The state-of-the-art approaches for image reconstruction from under-sampled k-space data are based on compressed sensing. They are iterative algorithms that optimize objective functions with spatial and/or temporal constraints. This paper proposes a non-iterative algorithm that estimates the unmeasured data and then reconstructs the image with the efficient filtered backprojection algorithm. The feasibility of the proposed method is demonstrated with a patient magnetic resonance imaging study. The proposed method is also compared with the state-of-the-art iterative compressed-sensing image reconstruction method that uses the total-variation optimization norm.
Traditional classification algorithms do not perform well on imbalanced data sets with small sample sizes. To deal with this problem, a novel method is proposed that changes the class distribution by adding virtual samples generated by the windowed regression over-sampling (WRO) method. The proposed WRO method reflects not only the additive effects but also the multiplicative effects between samples. A comparative study between the proposed method and other over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and borderline over-sampling (BOS), is provided on UCI datasets and a Fourier transform infrared spectroscopy (FTIR) data set. Experimental results show that the WRO method achieves better performance than the other methods.
By analyzing the theory of over-sampling and averaging, it is deduced that, when white noise accompanies the signal, each additional bit of resolution can be gained by increasing the sampling frequency fourfold. Each additional bit increases the signal-to-noise ratio (SNR) by approximately 6 dB.
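The relationship stated in the abstract above can be written out as a short worked derivation. This is a sketch under the usual textbook assumptions, which are not spelled out in the abstract: the noise is white and spread uniformly over the sampled band, the averaging acts as an ideal low-pass filter, and the symbols $f_s$, $f_{os}$, $w$, and $\mathrm{OSR}$ are introduced here for illustration.

```latex
% Over-sampling ratio for w additional bits of resolution
\mathrm{OSR} = \frac{f_{os}}{f_s} = 4^{w}

% In-band white-noise power falls by a factor of OSR after averaging,
% so the SNR improves by
\Delta\mathrm{SNR} = 10\log_{10}(\mathrm{OSR})
                   = 10\,w\,\log_{10} 4
                   \approx 6.02\,w \ \text{dB}

% Ideal N-bit quantizer with a full-scale sine input:
\mathrm{SNR} \approx 6.02\,N + 1.76 \ \text{dB}
% Hence a gain of about 6.02 dB corresponds to one extra bit (w = 1).
```

For example, gaining two extra bits under these assumptions would require a 16-fold sampling frequency and would raise the SNR by roughly 12 dB.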
Background: Bacterial vaginosis is a polymicrobial syndrome in which the homeostasis exerted by the Lactobacillus species that protect the vaginal mucosa has been lost. This study explored the data balancing process with the intention of improving the quality of association rules. The article aimed to balance an unbalanced multiclass dataset to improve association rule creation. Methods: A preconstructed dataset with 201 observations and 58 variables was analyzed. The authors collected the data between August 2016 and October 2018 in Tabasco, Mexico. The study population comprised sexually active women aged 18 to 50 who underwent gynecological inspection at the infectious and metabolic diseases research laboratory at the Universidad Juarez Autonoma de Tabasco. The random-forest algorithm was used to determine the best K value, and the balancing was performed with the synthetic minority over-sampling technique (SMOTE), random over-sampling examples (ROSE), and adaptive synthetic sampling approach for imbalanced learning (ADASYN) algorithms. The Apriori algorithm created the rules; to select rules with statistical significance, the is.redundant(), is.significant(), and is.maximal() functions and the quality metric Fisher's exact test were used. The biological validation was carried out by an expert (bacteriologist). Results: With the ADASYN algorithm at K = 9, the out-of-bag (OOB) error was zero; this was the best K value. In the balancing process, the ADASYN algorithm showed the best performance. From the dataset balanced with ADASYN, the Apriori algorithm created the association rules, and the selection with Fisher's exact test and the biological validation yielded 13 rules. Gram-bacteria Atopobium vaginae, Gardnerella vaginalis, Megasphaera phylotype 1, Mycoplasma hominis, and Ureaplasma parvum were detected by the Apriori algorithm from the balanced dataset. Conclusion: Balancing may improve the creation of association rules to efficiently model the bacteria that cause bacterial vaginosis.
Check dams have been widely constructed in the Chinese Loess Plateau and have played an important role in controlling soil loss during the last 70 years. However, large-scale, automatic mapping of the check dams and the resulting silted fields is lacking. In this study, we present a novel methodological framework to extract silted fields and to estimate the location of check dams at the pixel level in the Wuding River catchment using remote sensing and ensemble learning models. The random under-sampling method and 23 features were used to train and validate three ensemble learning models, namely Random Forest, Extreme Gradient Boosting, and EasyEnsemble, based on a large number of samples. The established optimal model was then applied to the whole study area to map check dams and silted fields. Our results indicate that the imbalance ratio of the samples has a significant impact on the performance of the models. Validation on the testing set shows that the F1-score for silted fields is higher than 0.75 at the pixel level for all three models. Finally, we produced a map of silted fields and check dams at 10 m spatial resolution with the optimal model, with an accuracy of ca. 90% at the object level. The proposed framework can be used for large-scale, high-precision mapping of check dams and silted fields, which is of great significance for monitoring and managing the dynamics of check dams and for quantitatively evaluating their eco-environmental benefits.
Satellite anomaly detection faces problems such as unbalanced sample distribution, few fault samples, and indistinct anomaly characteristics. These problems make it difficult for existing anomaly detection methods to train an accurate classification model and to improve detection accuracy. At the same time, satellite monitoring data are high-dimensional, which makes it difficult to extract effective features. Based on the DTW over-sampling method, this paper over-samples the fault samples in satellite time series and constructs a balanced time series data set. The Fast-DTW method is applied to calculate the distance between different time series, which speeds up the similarity calculation. The K-Nearest Neighbor (KNN) method is applied for classification, and the best classification result is obtained by searching for the optimal hyper-parameter k. The results show that the proposed method achieves high anomaly detection accuracy with short calculation time.
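The distance-plus-KNN idea in the abstract above can be sketched as below. Note the substitution: this uses the classic quadratic-time dynamic-programming DTW, not the Fast-DTW approximation the abstract names, and it omits the DTW over-sampling step. The function names, the absolute-difference local cost, and the toy series are assumptions.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping distance between two 1-D series,
    using |x - y| as the local cost (plain DTW, not Fast-DTW)."""
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[n][m]


def knn_predict(train, query, k=1):
    """Majority vote among the k training series closest to query under DTW."""
    ranked = sorted(train, key=lambda item: dtw_distance(item[0], query))
    votes = [label for _, label in ranked[:k]]
    return max(set(votes), key=votes.count)


# Toy labelled series (assumed data)
train = [
    ([0, 0, 1, 0, 0], "normal"),
    ([0, 0, 0, 1, 0], "normal"),
    ([0, 5, 5, 5, 0], "fault"),
]
prediction = knn_predict(train, [0, 0, 5, 5, 0], k=1)
```

The query is a time-shifted, shortened version of the fault pattern; DTW aligns the two at zero cost, so the nearest neighbour is the fault series even though a point-by-point comparison would show large differences.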
Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. Over-sampling is often used to solve this problem; it generates samples according to the distance between minority data. However, traditional over-sampling may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority samples needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method efficiently improves classification performance compared with other related methods.
Clustering-based undersampling (CUS) and the distance-based near-miss method are widely used in current imbalanced learning algorithms, but these methods have certain drawbacks. In particular, CUS does not consider the influence of the distance factor on the majority instances, and the near-miss method overlooks the inter-class structure within the majority samples. To overcome these drawbacks, this study proposes an undersampling method that combines distance measurement with majority class clustering. Resampling methods are used to develop an ensemble-based imbalanced learning algorithm called the clustering and distance-based imbalance learning model (CDEILM). This algorithm combines distance-based undersampling, feature selection, and ensemble learning. In addition, a cluster size-based resampling (CSBR) method is proposed to preserve the original distribution of the majority class, and a hybrid imbalanced learning framework is constructed by fusing various types of resampling methods. The combination of CDEILM and CSBR can be considered a specific case of this hybrid framework. The experimental results show that the CDEILM and CSBR methods achieve better performance than the benchmark methods, and that the hybrid model provides the best results under most circumstances. Therefore, the proposed model can be used as an alternative imbalanced learning method under specific circumstances, e.g., for credit evaluation problems in financial applications.
Since the overall prediction error of a classifier on imbalanced problems can be misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is a clear gap between the measure by which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space in order to bridge this gap. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method achieves consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
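To make the point in the abstract above concrete, the sketch below computes accuracy, G-mean (the geometric mean of sensitivity and specificity), and the F1 form of the F-measure from predicted labels. It uses the standard definitions; the function name, the choice of F1 rather than a general F-beta, and the toy labels are assumptions, and the code expects both classes to be present in `y_true`.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, g_mean, f1) for a two-class problem.
    Assumes y_true contains at least one positive and one negative sample."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)

    accuracy = (tp + tn) / len(pairs)
    sensitivity = tp / (tp + fn)  # recall on the minority (positive) class
    specificity = tn / (tn + fp)  # recall on the majority (negative) class
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    g_mean = math.sqrt(sensitivity * specificity)
    denom = precision + sensitivity
    f1 = 2 * precision * sensitivity / denom if denom else 0.0
    return accuracy, g_mean, f1


# Toy labels (assumed): 2 minority (1) and 8 majority (0) samples
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
all_majority = [0] * 10  # classifier that always predicts the majority class
acc, g_mean, f1 = imbalance_metrics(y_true, all_majority)
```

A classifier that always predicts the majority class reaches 80% accuracy on this toy set while its G-mean and F1 are both zero, which illustrates why the overall error alone can be misleading on imbalanced problems.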
Funding (DSMOTE study above): supported by the National Key Research and Development Program of China (2018YFB1003700), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the "333" Project of Jiangsu Province (BRA2017228, BRA2017401), and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).
Funding (closed-loop identification study above): Project supported by the National Natural Science Foundation of China (Grant No. 60174030).
Funding: Acknowledgements. We would like to express our gratitude to both the associate editor and the anonymous reviewers for their constructive comments, which greatly improved the quality of our manuscript. This work was supported by the National Natural Science Foundation of China (Grant No. 61501229) and the Fundamental Research Funds for the Central Universities (NS2015091, NS2014067, NJ20160013).
Abstract: In the class imbalanced learning scenario, traditional machine learning algorithms that focus on optimizing overall accuracy tend to achieve poor classification performance, especially for the minority class, which is usually the class of most interest. To solve this problem, many effective approaches have been proposed. Among them, bagging ensemble methods that integrate under-sampling techniques have demonstrated better performance than some others, including bagging ensemble methods integrated with over-sampling techniques, cost-sensitive methods, etc. Although these under-sampling techniques promote diversity among the generated base classifiers through random partitioning or sampling of the majority class, they take no measure to ensure individual classification performance, which limits the achievable ensemble performance. On the other hand, evolutionary under-sampling (EUS), a novel under-sampling technique, has been successfully applied to search for the best majority class subset for training a well-performing nearest neighbor classifier. Inspired by EUS, in this paper we introduce it into the under-sampling bagging framework and propose an EUS-based bagging ensemble method, EUS-Bag, by designing a new fitness function that considers three factors to make EUS better suited to the framework. With our fitness function, EUS-Bag can generate a set of accurate and diverse base classifiers. To verify the effectiveness of EUS-Bag, we conduct a series of comparison experiments on 22 two-class imbalanced classification problems. Experimental results measured using recall, geometric mean and AUC all demonstrate its superior performance.
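The under-sampling bagging framework that the abstract above builds on can be sketched as follows. This is a hedged illustration, not EUS-Bag: the evolutionary search and its three-factor fitness function are replaced here by plain random under-sampling, and all names (`make_balanced_bags`, `predict_1nn`, `predict_ensemble`) and the toy data are assumptions.

```python
import math
import random
from collections import Counter


def make_balanced_bags(majority, minority, n_bags, seed=0):
    """Build n_bags training sets, each pairing all minority samples (label 1)
    with an equally sized random subset of the majority class (label 0)."""
    rng = random.Random(seed)
    bags = []
    for _ in range(n_bags):
        subset = rng.sample(majority, len(minority))
        bags.append([(x, 0) for x in subset] + [(x, 1) for x in minority])
    return bags


def predict_1nn(bag, x):
    """Label of the training sample in `bag` closest to x (1-NN base classifier)."""
    return min(bag, key=lambda item: math.dist(item[0], x))[1]


def predict_ensemble(bags, x):
    """Majority vote over the base classifiers, one per bag."""
    votes = Counter(predict_1nn(bag, x) for bag in bags)
    return votes.most_common(1)[0][0]


# Hypothetical data: 20 majority samples on a line, 3 minority samples off it
majority = [(float(i), 0.0) for i in range(20)]
minority = [(5.0, 5.0), (6.0, 5.0), (5.5, 6.0)]
bags = make_balanced_bags(majority, minority, n_bags=5)
label = predict_ensemble(bags, (5.5, 5.2))
```

Random subsets give the base classifiers diversity but no guarantee of individual accuracy, which is the gap the abstract says EUS-Bag targets by selecting each majority subset with an evolutionary search instead.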
Abstract: Rapid channel variation can induce intercarrier interference in orthogonal frequency-division multiplexing (OFDM) systems. Intercarrier interference significantly increases the difficulty of OFDM channel estimation because too many channel coefficients need to be estimated. In this article, a novel channel estimator is proposed to resolve this problem. The estimator consists of two parts: the channel parameter estimation unit (CPEU), which estimates the number of channel taps and the multipath time delays, and the channel coefficient estimation unit (CCEU), which estimates the channel coefficients using the channel parameters provided by the CPEU. In the CCEU, an over-sampling basis expansion model is used to reduce the large number of channel coefficients that must be estimated. Finally, simulation results are given to evaluate the performance of the proposed scheme.
Funding: Supported by the National Key R&D Program of China under Grant No. 2019YFB1404600, Beijing Natural Science Funds under Grant No. 9162003, and Beijing's "High-grade, Precision and Advanced Discipline Construction (Municipal)-Business Administration" project under Grant No. 19008022065.
Abstract: Credit risk assessment is an important risk management task for financial institutions. Machine learning-based approaches have made promising progress in credit risk assessment by treating it as an imbalanced binary classification task. However, few efforts have been made to deal simultaneously with the class overlap problem that accompanies imbalance. To this end, this study proposes a Tomek link and genetic algorithm (GA)-based under-sampling framework (TEUS) to address the class imbalance and overlap issues in binary credit classification by eliminating majority class instances while considering multi-perspective factors. TEUS first determines boundary majority instances with Tomek links, then takes the distance from each majority instance to its nearest boundary as the radius and assigns the density of opposite-class samples within the radius as the overlap potential of that majority instance. Second, TEUS weighs each non-borderline majority instance based on its information contribution in estimating class labels. After partitioning non-borderline majority instances into subgroups according to overlap potential and information contribution, TEUS applies GA to select samples from the subgroups and merges them with the minority samples into a new training set. The design of the fitness function in GA and the grouping of the non-borderline majority instances not only trade off the multi-perspective characteristics of instances but also help reduce the computational complexity of the sampling optimization search. Numerical experiments on real-world credit data sets demonstrate the effectiveness of the proposed TEUS.
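The first TEUS step, finding boundary majority instances with Tomek links, can be sketched under the usual definition: a Tomek link is a pair of samples from opposite classes that are each other's nearest neighbour. The helper names and toy data below are assumptions for illustration, and the later TEUS steps (overlap potential, information contribution, GA selection) are not covered.

```python
import math


def nearest_index(points, i):
    """Index of the sample closest to points[i], excluding i itself."""
    return min(
        (j for j in range(len(points)) if j != i),
        key=lambda j: math.dist(points[i], points[j]),
    )


def tomek_links(points, labels):
    """Return index pairs (i, j), i < j, that are mutual nearest neighbours
    and carry different class labels."""
    links = []
    for i in range(len(points)):
        j = nearest_index(points, i)
        if i < j and labels[i] != labels[j] and nearest_index(points, j) == i:
            links.append((i, j))
    return links


# Hypothetical data: label 0 = majority class, label 1 = minority class
points = [(0.0, 0.0), (0.0, 1.0), (3.0, 3.0), (3.2, 3.1), (6.0, 6.0), (6.0, 7.0)]
labels = [0, 0, 0, 1, 1, 1]
links = tomek_links(points, labels)  # → [(2, 3)]
boundary_majority = [i for pair in links for i in pair if labels[i] == 0]
```

In this toy set only samples 2 and 3 form a link, so sample 2 is the single boundary majority instance; the brute-force neighbour search is quadratic and would need a spatial index for realistic data sizes.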
Abstract: Worldwide, heart disease is considered a significant threat and substantially increases the mortality rate. Investigators therefore seek to predict the occurrence of heart disease at an earlier stage by designing a better Clinical Decision Support System (CDSS). Generally, a CDSS is used to predict an individual's heart disease and periodically update the condition of the patient. This research proposes a novel heart disease prediction system with a CDSS composed of a clustering model for noise removal to detect and eliminate outliers. Here, the Synthetic Over-sampling prediction model is integrated with the cluster concept to balance the training data, and the AdaBoost classifier model is used to predict heart disease. Then, optimization is achieved using the Adam Optimizer (AO) model with the publicly available Statlog dataset. This flow is used to construct the model, and the evaluation is done against various prevailing approaches such as Decision Tree, Random Forest, Logistic Regression, Naive Bayes and so on. The statistical analysis is done with the Wilcoxon rank-sum method to extract the p-value of the model. The observed results show that the proposed model outperforms the various existing approaches and attains efficient prediction accuracy. This model helps physicians make better decisions in complex conditions and diagnose the disease at an earlier stage. Thus, earlier treatment helps to reduce the death rate. Here, simulation is done with MATLAB 2016b, and metrics such as accuracy, precision-recall, F-measure, p-value, and ROC are analyzed to show the significance of the model.
Funding: Supported by the American Heart Association, No. 18AJML34280074.
Abstract: The state-of-the-art approaches for image reconstruction using under-sampled k-space data are based on compressed sensing. They are iterative algorithms that optimize objective functions with spatial and/or temporal constraints. This paper proposes a non-iterative algorithm to estimate the unmeasured data and then to reconstruct the image with the efficient filtered backprojection algorithm. The feasibility of the proposed method is demonstrated with a patient magnetic resonance imaging study. The proposed method is also compared with the state-of-the-art iterative compressed-sensing image reconstruction method using the total-variation optimization norm.
Abstract: Traditional classification algorithms do not perform well on imbalanced data sets with small sample sizes. To deal with this problem, a novel method is proposed that changes the class distribution by adding virtual samples generated by the windowed regression over-sampling (WRO) method. The proposed WRO method reflects not only the additive effects but also the multiplicative effects between samples. A comparative study between the proposed method and other over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and borderline over-sampling (BOS), is provided on UCI datasets and a Fourier transform infrared spectroscopy (FTIR) data set. Experimental results show that the WRO method can achieve better performance than the other methods.
Abstract: Analysis of the theory of over-sampling and averaging leads to the conclusion that, when white noise accompanies the signal, each additional bit of resolution can be achieved by a fourfold increase in sampling frequency. Each additional bit increases the SNR (signal-to-noise ratio) by approximately 6 dB.
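The arithmetic behind the fourfold rule can be checked with a short sketch. It assumes the standard model in which averaging N samples with uncorrelated white noise lowers the noise power by a factor of N; the function names are illustrative and do not come from the paper.

```python
import math


def oversampling_factor(extra_bits):
    """Oversampling ratio needed to gain `extra_bits` bits of resolution."""
    return 4 ** extra_bits


def snr_gain_db(factor):
    """SNR improvement in dB from averaging `factor` samples, assuming
    uncorrelated white noise (noise power drops by `factor`)."""
    return 10 * math.log10(factor)


one_bit_gain = snr_gain_db(oversampling_factor(1))    # about 6.02 dB
three_bit_factor = oversampling_factor(3)             # → 64
```

Under this model, one extra bit costs 4x the sampling rate and buys about 6.02 dB, so the required rate grows exponentially with the number of extra bits.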
Abstract: Background: Bacterial vaginosis is a polymicrobial syndrome in which the homeostasis maintained by the Lactobacillus species that protect the vaginal mucosa has been lost. This study explored the data balancing process with the intention of improving the quality of association rules. The article aimed to balance an unbalanced multiclass dataset to improve association rule creation. Methods: A preconstructed dataset with 201 observations and 58 variables was analyzed. The authors collected the data between August 2016 and October 2018 in Tabasco, Mexico. The study population comprised sexually active women aged 18 to 50 who underwent gynecological inspection at the infectious and metabolic diseases research laboratory at the Universidad Juarez Autonoma de Tabasco. To determine the best K value, the random-forest algorithm was used, and the balancing was performed with the synthetic minority over-sampling technique (SMOTE), random over-sampling examples (ROSE), and adaptive synthetic sampling approach for imbalanced learning (ADASYN) algorithms. The Apriori algorithm created the rules; to select rules with statistical significance, the is.redundant(), is.significant(), and is.maximal() functions and the Fisher's exact test quality metric were used. The biological validation was carried out by an expert (bacteriologist). Results: With the ADASYN algorithm at K=9, the out-of-bag (OOB) error was zero; this was the best K value. In the balancing process, the ADASYN algorithm showed the best performance. From the dataset balanced with ADASYN, the Apriori algorithm created the association rules, and selection with the Fisher's exact test quality metric together with the biological validation yielded 13 rules. Gram-bacteria Atopobium vaginae, Gardnerella vaginalis, Megasphaera phylotype 1, Mycoplasma hominis and Ureaplasma parvum were detected by the Apriori algorithm from the balanced dataset. Conclusion: Balancing may improve the creation of association rules to efficiently model the bacteria that cause bacterial vaginosis.
Funding: Supported by the National Natural Science Foundation of China (No. 41907048), the Fundamental Research Funds for the Central Universities, CHD (No. 300102260206), and the Shaanxi Academy of Forestry (No. SXLK2023-02-15).
Abstract: Check dams have been widely constructed on the Chinese Loess Plateau and have played an important role in controlling soil loss during the last 70 years. However, large-scale, automatic mapping of check dams and the resulting silted fields is lacking. In this study, we present a novel methodological framework to extract silted fields and to estimate the location of check dams at the pixel level in the Wuding River catchment using remote sensing and ensemble learning models. The random under-sampling method and 23 features were used to train and validate three ensemble learning models, namely Random Forest, Extreme Gradient Boosting and EasyEnsemble, based on a large number of samples. The established optimal model was then applied to the whole study area to map check dams and silted fields. Our results indicate that the imbalance ratio of the samples has a significant impact on the performance of the models. Validation on the testing set shows that the F1-score for silted fields is higher than 0.75 at the pixel level for all three models. Finally, we produced a map of silted fields and check dams at 10 m spatial resolution with the optimal model, with an accuracy of ca. 90% at the object level. The proposed framework can be used for large-scale, high-precision mapping of check dams and silted fields, which is of great significance for monitoring and managing the dynamics of check dams and for the quantitative evaluation of their eco-environmental benefits.
Funding: Co-supported by the National Science and Technology Major Project of China (No. 2019ZX04026001) and the Shanghai Aerospace Science and Technology Innovation Fund, China (No. SAST52016001).
Abstract: Satellite anomaly detection faces several problems, such as unbalanced sample distribution, few fault samples, and inconspicuous anomaly characteristics. These problems make it difficult for existing anomaly detection methods to train an accurate classification model, so the accuracy of anomaly detection is hard to improve. At the same time, satellite monitoring data are high-dimensional, which makes it difficult to extract effective features. Based on the DTW over-sampling method, this paper over-samples the fault samples in satellite time series and constructs a balanced time series data set. The Fast-DTW method is applied to calculate the distance between different time series, which improves the speed of similarity calculation. The KNN (K-Nearest Neighbor) method is applied for classification, and the best classification result is obtained by searching for the optimal hyper-parameter k. The results show that the proposed method achieves high anomaly detection accuracy with short calculation time.
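The classification stage described above (a DTW-based distance combined with KNN) can be sketched as follows. Note the substitution: this uses the exact dynamic-programming DTW rather than the Fast-DTW approximation the abstract names, and the function names, labels, and toy series are assumptions for illustration.

```python
from collections import Counter


def dtw_distance(a, b):
    """Exact dynamic time warping distance between two numeric sequences,
    using absolute difference as the local cost."""
    inf = float("inf")
    cost = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    cost[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            d = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of the three admissible warping moves
            cost[i][j] = d + min(
                cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1]
            )
    return cost[len(a)][len(b)]


def knn_predict(train, query, k=1):
    """Majority label among the k training series closest to `query` under DTW."""
    nearest = sorted(train, key=lambda item: dtw_distance(item[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical telemetry snippets
train = [
    ([0, 0, 1, 2, 1, 0, 0], "normal"),
    ([0, 1, 2, 1, 0, 0, 0], "normal"),
    ([0, 0, 5, 9, 5, 0, 0], "fault"),
]
shifted = dtw_distance([0, 0, 1, 2, 1, 0, 0], [0, 0, 0, 1, 2, 1, 0])
prediction = knn_predict(train, [0, 0, 4, 8, 4, 0, 0])
```

The shifted pair shows why DTW suits time series: a pulse that arrives one step later still aligns perfectly. The exact algorithm costs O(nm) per pair, which is the motivation for an approximation such as Fast-DTW on long series.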
Funding: Partially supported by the Aeronautical Science Foundation of China (No. 201920007001) and the National Natural Science Foundation of China (Nos. U20B2067, 61790552 and 61790554).
Abstract: Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. The over-sampling method is often used to solve this problem. It generates samples according to the distance between minority data. However, the traditional over-sampling method may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority samples needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method can efficiently improve classification performance compared with other related methods.
Abstract: Clustering-based undersampling (CUS) and the distance-based near-miss method are widely used in current imbalanced learning algorithms, but these methods have certain drawbacks. In particular, CUS does not consider the influence of the distance factor on the majority instances, and the near-miss method omits the inter-class(es) within the majority samples. To overcome these drawbacks, this study proposes an undersampling method combining distance measurement and majority class clustering. Resampling methods are used to develop an ensemble-based imbalanced-learning algorithm called the clustering and distance-based imbalance learning model (CDEILM). This algorithm combines distance-based undersampling, feature selection, and ensemble learning. In addition, a cluster size-based resampling (CSBR) method is proposed to preserve the original distribution of the majority class, and a hybrid imbalanced learning framework is constructed by fusing various types of resampling methods. The combination of CDEILM and CSBR can be considered a specific case of this hybrid framework. The experimental results show that the CDEILM and CSBR methods can achieve better performance than the benchmark methods, and that the hybrid model provides the best results under most circumstances. Therefore, the proposed model can be used as an alternative imbalanced learning method under specific circumstances, e.g., for providing a solution to credit evaluation problems in financial applications.
Abstract: Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is clearly a gap between the measure according to which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space to bridge this gap. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
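A small worked example shows why overall accuracy can mislead on imbalanced data while G-mean does not. It uses the common definition of G-mean as the geometric mean of sensitivity and specificity; the function names and the toy predictions are assumptions, and the neural network and genetic algorithm from the abstract are not reproduced.

```python
import math


def accuracy(y_true, y_pred):
    """Fraction of samples predicted correctly."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred):
    """Geometric mean of sensitivity and specificity for binary labels
    (1 = minority/positive). Assumes both classes occur in y_true."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return math.sqrt(sensitivity * specificity)


# Hypothetical test set: 5 positives, 95 negatives
y_true = [1] * 5 + [0] * 95
always_negative = [0] * 100                       # ignores the minority class
balanced = [1, 1, 1, 1, 0] + [1] * 10 + [0] * 85  # finds 4 of 5 positives

acc_trivial, gm_trivial = accuracy(y_true, always_negative), g_mean(y_true, always_negative)
acc_balanced, gm_balanced = accuracy(y_true, balanced), g_mean(y_true, balanced)
```

The trivial classifier scores 0.95 accuracy but a G-mean of 0, whereas the second classifier has lower accuracy (0.89) and a much higher G-mean; a search driven by G-mean would therefore prefer the classifier that actually detects the minority class.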