Funding: Supported by the National Key Research and Development Program of China (2018YFB1003700), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the "333" Project of Jiangsu Province (BRA2017228, BRA2017401), and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).
Abstract: For imbalanced datasets, the main goal of classification is to identify samples of the minority class, yet current data mining algorithms do not perform well enough on such data. The synthetic minority over-sampling technique (SMOTE) is designed specifically for learning from imbalanced datasets: it generates synthetic minority-class examples by interpolating between nearby minority-class examples. However, SMOTE suffers from the overgeneralization problem. Density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when handling samples near the borderline, so we optimize the DBSCAN algorithm to make the clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN divides the minority-class samples into three groups: core samples, borderline samples, and noise samples. The noise samples are then removed so that more effective samples can be synthesized. To make full use of the information in the core and borderline samples, a different over-sampling strategy is applied to each group. Experiments show that DSMOTE achieves better precision, recall, and F-value than SMOTE and Borderline-SMOTE.
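The interpolation step that the abstract above attributes to SMOTE can be illustrated with a minimal sketch. This is plain SMOTE-style interpolation only, not the DSMOTE method (the optimized DBSCAN grouping and the per-group strategies are not reproduced here). The function name `smote_sample`, its parameters, and the toy data are assumptions made for illustration.

```python
import math
import random


def smote_sample(minority, k=3, n_new=5, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours (basic SMOTE idea)."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point, excluding itself
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: math.dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        # The new point lies on the segment between base and other
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Hypothetical 2-D minority-class samples
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sample(minority, k=2, n_new=4)
```

Because every synthetic point is a convex combination of two existing minority points, it stays inside the region spanned by the minority class, which is also why noisy minority samples can spread into majority regions if they are not filtered first.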
Funding: Project supported by the National Natural Science Foundation of China (Grant No. 60174030).
Abstract: A new identification method for linear discrete-time closed-loop systems is proposed, based on an output over-sampling scheme. When the system outputs are over-sampled, the new output sequences contain more information about the plant structure. The over-sampled plant model is identified using the generalized least squares (GLS) method, and the original plant model is then obtained from its relationship with the over-sampled model. Compared with conventional approaches, the advantage of the new method is that a closed-loop system can be identified from the over-sampled output without any external test signal, even when the ordinary identifiability conditions are not satisfied. Accuracy analysis shows the relationship between the estimation error and the over-sampling rate. Numerical simulation illustrates the effectiveness of the method.
Abstract: MicroRNAs (miRNAs) are short (~22 nt) non-coding RNAs that play an indispensable role in gene regulation in many biological processes. Most current computational methods, both comparative and non-comparative, classify human precursor microRNA (pre-miRNA) hairpins against both genome pseudo hairpins and other non-coding RNAs (ncRNAs). Although a few approaches have achieved promising results by applying class imbalance learning methods, existing methods have not yet fully solved the problem caused by the imbalanced class distribution in the datasets. For example, SMOTE is a well-known, general over-sampling method for this problem, but in some cases it fails to improve classification performance or even reduces it. We therefore developed a novel over-sampling method named incremental-SMOTE to distinguish human pre-miRNA hairpins from both genome pseudo hairpins and other ncRNAs. Experimental results on the pre-miRNA datasets from Batuwita et al. showed that our method achieved better Sensitivity and G-mean than the control (no over-sampling), SMOTE, and several modified successors of SMOTE, including safe-level-SMOTE and borderline-SMOTE. We also applied the method to five imbalanced benchmark datasets from the UCI Machine Learning Repository and achieved improvements in Sensitivity and G-mean. These results suggest that our method outperforms SMOTE and several of its successors in various biomedical classification problems, including miRNA classification.
Abstract: The β-turn is one of the most important reverse turns because of its role in protein folding. Many computational methods have been studied for predicting β-turns and β-turn types. However, because the datasets are imbalanced, performance is still inadequate. In this study, we propose a novel over-sampling technique, FOST, to deal with the class-imbalance problem. Experimental results on three standard benchmark datasets show that our method is comparable with state-of-the-art methods. In addition, we applied our algorithm to five benchmark datasets from the UCI Machine Learning Repository and achieved significant improvements in G-mean and Sensitivity. This indicates that our method is also effective for imbalanced data other than β-turns and β-turn types.
Abstract: Worldwide, heart disease is considered a significant threat and substantially increases the mortality rate. Researchers therefore seek to predict the occurrence of heart disease at an earlier stage by designing a better Clinical Decision Support System (CDSS). Generally, a CDSS is used to predict an individual's heart disease and to periodically update the patient's condition. This research proposes a novel heart disease prediction system with a CDSS that includes a clustering model for noise removal, which detects and eliminates outliers. The synthetic over-sampling model is integrated with the clustering concept to balance the training data, and an AdaBoost classifier is used to predict heart disease. Optimization is then performed using the Adam Optimizer (AO) on the publicly available Statlog dataset. This flow is used to construct the model, which is evaluated against various prevailing approaches such as Decision Tree, Random Forest, Logistic Regression, and Naive Bayes. Statistical analysis is carried out with the Wilcoxon rank-sum method to obtain the p-value of the model. The observed results show that the proposed model outperforms the existing approaches and attains high prediction accuracy. The model helps physicians make better decisions in complex conditions and diagnose the disease at an earlier stage, and earlier treatment in turn helps reduce the death rate. Simulation is done with MATLAB 2016b, and metrics such as accuracy, precision-recall, F-measure, p-value, and ROC are analyzed to show the significance of the model.
Abstract: Stroke is a life-threatening disease, usually caused by blockage of blood or insufficient blood flow to the brain. It has a tremendous impact on every aspect of life, since it is the leading global cause of disability and morbidity. Strokes can range from minor to severe (extensive), so early stroke assessment and treatment can improve survival rates. Manual prediction is extremely time and resource intensive. Automated prediction methods based on modern Information and Communication Technologies (ICTs), particularly those in the area of Machine Learning (ML), are crucial for the early diagnosis and prognosis of stroke. This research therefore proposes an ensemble voting model based on three ML algorithms: Random Forest (RF), Extreme Gradient Boosting (XGBoost), and Light Gradient Boosting Machine (LGBM). We apply data preprocessing to handle outliers and useless instances in the dataset. Furthermore, to address the problem of imbalanced data, we enhance the representation of the minority class using the Synthetic Minority Over-Sampling Technique (SMOTE), allowing it to take an active part in the learning process. Results reveal that the proposed model outperforms existing studies and other classifiers, with 0.96 accuracy, 0.97 precision, 0.97 recall, and 0.96 F1-score. The experiment demonstrates that the proposed ensemble voting model outperforms state-of-the-art and other traditional approaches.
Abstract: Traditional classification algorithms do not perform well on imbalanced data sets or on small sample sizes. To deal with this problem, a novel method is proposed that changes the class distribution by adding virtual samples generated with the windowed regression over-sampling (WRO) method. WRO reflects not only the additive effects but also the multiplicative effects between samples. The proposed method is compared with other over-sampling methods, such as the synthetic minority over-sampling technique (SMOTE) and borderline over-sampling (BOS), on UCI datasets and a Fourier transform infrared spectroscopy (FTIR) data set. Experimental results show that the WRO method achieves better performance than the other methods.
Abstract: By analyzing the theory of over-sampling and averaging, it is concluded that, when white noise accompanies the signal, each additional bit of resolution can be achieved by a fourfold increase in sampling frequency. Each additional bit increases the SNR (signal-to-noise ratio) by approximately 6 dB.
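The two figures in the abstract above (fourfold sampling per extra bit, about 6 dB per bit) follow from standard relations that can be checked numerically. The sketch below assumes ideal white noise and uses hypothetical function names. One extra bit of an ideal quantizer corresponds to 20·log10(2) ≈ 6.02 dB, and averaging N samples with uncorrelated noise improves the SNR by 10·log10(N) dB.

```python
import math


def oversampling_factor(extra_bits):
    """Oversampling ratio needed for the given number of extra bits:
    a fourfold increase in sampling frequency per bit."""
    return 4 ** extra_bits


def snr_gain_db(extra_bits):
    """SNR gain of an ideal quantizer for the given number of extra bits."""
    return 20 * math.log10(2) * extra_bits


def averaging_gain_db(factor):
    """SNR gain from averaging `factor` samples corrupted by white noise."""
    return 10 * math.log10(factor)


for bits in (1, 2, 3):
    factor = oversampling_factor(bits)
    print(bits, factor, round(snr_gain_db(bits), 2), round(averaging_gain_db(factor), 2))
```

The last two printed columns agree for every row, which is the consistency the abstract relies on: averaging 4 samples buys the same SNR as one additional bit.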
Abstract: Rapid channel variation can induce intercarrier interference in orthogonal frequency-division multiplexing (OFDM) systems. Intercarrier interference significantly increases the difficulty of OFDM channel estimation because too many channel coefficients need to be estimated. In this article, a novel channel estimator is proposed to resolve this problem. The estimator consists of two parts: the channel parameter estimation unit (CPEU), which estimates the number of channel taps and the multipath time delays, and the channel coefficient estimation unit (CCEU), which estimates the channel coefficients using the channel parameters provided by the CPEU. In the CCEU, the over-sampling basis expansion model is used to address the problem that a large number of channel coefficients need to be estimated. Finally, simulation results are given to evaluate the performance of the proposed scheme.
Abstract: Background: Bacterial vaginosis is a polymicrobial syndrome in which the homeostasis exerted by the Lactobacillus species that protect the vaginal mucosa has been lost. This study explored the data balancing process with the intention of improving the quality of association rules. The article aimed to balance an unbalanced multiclass dataset to improve association rule creation. Methods: A preconstructed dataset with 201 observations and 58 variables was analyzed. The authors collected the data between August 2016 and October 2018 in Tabasco, Mexico. The study population comprised sexually active women aged 18 to 50 who underwent gynecological inspection at the infectious and metabolic diseases research laboratory of the Universidad Juarez Autonoma de Tabasco. The random forest algorithm was used to determine the best K value, and balancing was performed with the synthetic minority over-sampling technique (SMOTE), random over-sampling examples (ROSE), and adaptive synthetic sampling approach for imbalanced learning (ADASYN) algorithms. The Apriori algorithm created the rules. To select statistically significant rules, the is.redundant(), is.significant(), and is.maximal() functions and the quality metric Fisher's exact test were used. Biological validation was carried out by an expert (a bacteriologist). Results: With the ADASYN algorithm at K = 9, the out-of-bag (OOB) error was zero, making this the best K value. In the balancing process, the ADASYN algorithm showed the best performance. From the dataset balanced with ADASYN, the Apriori algorithm created the association rules; after selection with Fisher's exact test and biological validation, 13 rules were reported. Gram-bacteria Atopobium vaginae, Gardnerella vaginalis, Megasphaera phylotype 1, Mycoplasma hominis, and Ureaplasma parvum were detected by the Apriori algorithm from the balanced dataset. Conclusion: Balancing may improve the creation of association rules to efficiently model the bacteria that cause bacterial vaginosis.
Funding: Co-supported by the National Science and Technology Major Project of China (No. 2019ZX04026001) and the Shanghai Aerospace Science and Technology Innovation Fund, China (No. SAST52016001).
Abstract: Satellite anomaly detection faces problems such as unbalanced sample distribution, few fault samples, and unobvious anomaly characteristics. These problems make it difficult for existing anomaly detection methods to train an accurate classification model, so detection accuracy is hard to improve. In addition, satellite monitoring data are high-dimensional, which makes it difficult to extract effective features. Based on the DTW over-sampling method, this paper over-samples the fault samples in satellite time series and constructs an evenly distributed, balanced time-series data set. The Fast-DTW method is applied to calculate the distance between different time series, which speeds up the similarity calculation. The KNN (K-Nearest Neighbor) method is applied for classification, and the best classification result is obtained by searching for the optimal hyper-parameter k. The results show that the proposed method achieves high anomaly detection accuracy with a short computation time.
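The combination of a DTW distance with KNN classification described above can be sketched as follows. This sketch uses the classic quadratic-time dynamic-programming DTW rather than Fast-DTW, and it omits the paper's over-sampling step. The function names, labels, and toy series are assumptions made for illustration.

```python
from collections import Counter


def dtw_distance(a, b):
    """Classic dynamic time warping distance between two numeric sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
    return cost[n][m]


def knn_predict(train, query, k=1):
    """Majority vote among the k training series closest to query under DTW."""
    nearest = sorted(train, key=lambda item: dtw_distance(item[0], query))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


# Hypothetical labelled telemetry segments of different lengths
train = [
    ([0, 0, 0, 0], "normal"),
    ([0, 1, 0, 0], "normal"),
    ([0, 5, 5, 0], "fault"),
]
print(knn_predict(train, [0, 0, 5, 5, 0], k=1))
```

DTW tolerates shifts and stretches in time, so the query matches the fault template even though its spike starts one step later and the series lengths differ. The value of k would be tuned on validation data, as the abstract describes.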
Abstract: Since the overall prediction error of a classifier on imbalanced problems can be misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve classifier performance in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is a clear gap between the measure by which the classifier is evaluated and the way the classifier is trained. This paper investigates the prospect of bridging this gap by explicitly using the appropriate measure itself to search the hypothesis space. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method achieves consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
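The point that overall error can mislead on imbalanced data is easy to see with a small worked example. The sketch below uses the common definition of G-mean as the geometric mean of sensitivity and specificity; the function names and the toy labels are assumptions, and the genetic-algorithm training from the abstract is not reproduced.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fn, tn, fp) for a binary problem."""
    tp = fn = tn = fp = 0
    for t, p in zip(y_true, y_pred):
        if t == positive:
            if p == positive:
                tp += 1
            else:
                fn += 1
        elif p == positive:
            fp += 1
        else:
            tn += 1
    return tp, fn, tn, fp


def accuracy(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    tp, fn, tn, fp = confusion_counts(y_true, y_pred, positive)
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical data: 2 minority (positive) and 8 majority samples
y_true = [1, 1] + [0] * 8
all_negative = [0] * 10           # ignores the minority class entirely
half_recall = [1, 0] + [0] * 8    # finds one of the two minority samples

print(accuracy(y_true, all_negative), g_mean(y_true, all_negative))
print(accuracy(y_true, half_recall), g_mean(y_true, half_recall))
```

A classifier that never predicts the minority class still reaches 80% accuracy here but scores a G-mean of zero, which is why using G-mean directly as the search objective changes what the optimizer rewards.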
Funding: Partially supported by the Aeronautical Science Foundation of China (No. 201920007001) and the National Natural Science Foundation of China (Nos. U20B2067, 61790552, and 61790554).
Abstract: Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. Over-sampling is often used to solve this problem; it generates samples according to the distance between minority data. However, traditional over-sampling may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. The differential evolution algorithm is then employed to automatically determine the number of generated minority data needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method improves classification performance compared with other related methods.