When building a classification model, the scenario in which the samples of one class significantly outnumber those of the other class is called data imbalance. Data imbalance biases the trained classification model toward the majority class (usually defined as the negative class), which may harm accuracy on the minority class (usually defined as the positive class) and in turn lead to poor overall model performance. This article proposes a method called MSHR-FCSSVM for imbalanced data classification, which is based on a new hybrid resampling approach (MSHR) and a new fine cost-sensitive support vector machine (CS-SVM) classifier (FCSSVM). The MSHR measures the separability of each negative sample through its Silhouette value, calculated from the Mahalanobis distance between samples. On this basis, the so-called pseudo-negative samples are screened out, used to generate new positive samples through linear interpolation (over-sampling step), and finally deleted (under-sampling step). This approach replaces pseudo-negative samples with newly generated positive samples one by one to clear up the inter-class overlap on the borderline without changing the overall scale of the dataset. The FCSSVM is an improved version of the traditional CS-SVM. It simultaneously considers the influence of both the imbalance in sample numbers and the class distribution on classification, and it finely tunes the class cost weights using the RIME algorithm, an efficient optimization algorithm based on the physical phenomenon of rime ice, with cross-validation accuracy as the fitness function, to accurately adjust the classification borderline. To verify the effectiveness of the proposed method, a series of experiments is carried out on 20 imbalanced datasets, including both mildly and extremely imbalanced datasets. The experimental results show that the MSHR-FCSSVM method performs better than the comparison methods in most cases, and that both the MSHR and the FCSSVM play significant roles.
Objective: Clinical medical record data associated with hepatitis B-related acute-on-chronic liver failure (HBV-ACLF) generally have small sample sizes and a class imbalance. However, most machine learning models are designed for balanced data and lack interpretability. This study aimed to propose a traditional Chinese medicine (TCM) diagnostic model for HBV-ACLF based on the TCM syndrome differentiation and treatment theory that is clinically interpretable and highly accurate. Methods: We collected medical records from 261 patients diagnosed with HBV-ACLF, covering three syndromes: Yang jaundice (214 cases), Yang-Yin jaundice (41 cases), and Yin jaundice (6 cases). To avoid overfitting of the machine learning model, we excluded the cases of Yin jaundice. After data standardization and cleaning, we obtained 255 relevant medical records of Yang jaundice and Yang-Yin jaundice. To address the class imbalance, we applied oversampling and used five machine learning methods, namely logistic regression (LR), support vector machine (SVM), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost), to construct the syndrome diagnosis models. Precision, F1 score, the area under the receiver operating characteristic (ROC) curve (AUC), and accuracy were used as model evaluation metrics. The model with the best classification performance was selected to extract the diagnostic rules, and its clinical significance was thoroughly analyzed. Furthermore, we proposed a novel multiple-round stable rule extraction (MRSRE) method to obtain a stable rule set of features that exhibits the model's clinical interpretability. Results: The precision of all five machine learning models built on the oversampled, balanced data exceeded 0.90. Among these models, the accuracy of RF classification of syndrome types was 0.92, and the mean F1 scores for the two categories of Yang jaundice and Yang-Yin jaundice were 0.93 and 0.94, respectively. Additionally, the AUC was 0.98. The rules extracted from the RF syndrome differentiation model with the MRSRE method revealed that the common features of Yang jaundice and Yang-Yin jaundice were wiry pulse, yellowing of the urine, skin, and eyes, normal tongue body, healthy sublingual vessel, nausea, oil loathing, and poor appetite. The main features of Yang jaundice were a red tongue body and thickened sublingual vessels, whereas those of Yang-Yin jaundice were a dark tongue body, pale white tongue body, white tongue coating, lack of strength, slippery pulse, light red tongue body, slimy tongue coating, and abdominal distension. This is aligned with the classifications made by TCM experts based on TCM syndrome differentiation and treatment theory. Conclusion: Our model can be used to differentiate HBV-ACLF syndromes, and the approach has the potential to be applied to generate other clinically interpretable, highly accurate models on clinical data characterized by small sample sizes and a class imbalance.
For imbalanced datasets, the focus of classification is to identify samples of the minority class. Current data mining algorithms do not perform well enough on imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets; it generates synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE suffers from the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN with SMOTE and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups: core samples, borderline samples, and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core samples and borderline samples, different strategies are used to over-sample each group. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
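The interpolation step that SMOTE relies on can be illustrated with a minimal sketch. This is plain SMOTE-style interpolation only, not the DSMOTE method of the abstract; the function name `smote_like` and the parameters `n_new`, `k`, and `seed` are illustrative assumptions.

```python
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours.
    Hypothetical helper for illustration; needs at least two minority samples."""
    rng = random.Random(seed)

    def dist2(a, b):
        # Squared Euclidean distance between two points.
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base point (excluding itself).
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist2(base, p),
        )[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbour))
        )
    return synthetic

minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
new_points = smote_like(minority_points, n_new=5)
```

Every synthetic point lies on a line segment between two existing minority samples, which is why plain interpolation can place points in regions that overlap the majority class.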
Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of samples between its classes. In most cases, the performance of machine learning algorithms such as the Support Vector Machine (SVM) is affected when dealing with an imbalanced dataset. The classification accuracy is mostly skewed toward the majority class, and minority-class samples are predicted poorly. In this paper, a hybrid approach is proposed that combines a data pre-processing technique with an SVM algorithm based on improved Simulated Annealing (SA). First, a data pre-processing technique that serves as the resampling strategy for handling imbalanced datasets is proposed. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step that removes redundant and duplicated data. Next, the balanced dataset is used to train an SVM. Since this algorithm requires an iterative process to search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. The improvement introduces a new acceptance criterion for candidate solutions in the SA algorithm to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrate higher classification accuracy with the proposed approach than with the conventional implementation of SVM. An average accuracy of 89.65% for binary classification demonstrates the good performance of the proposed approach.
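The iterative search for a penalty parameter can be sketched with a generic simulated annealing loop. The abstract does not state the improved acceptance criterion, so this sketch assumes the standard Metropolis criterion instead; the objective `toy_validation_error` is a hypothetical stand-in for a cross-validated SVM error, and all names and default values are assumptions.

```python
import math
import random

def simulated_annealing(objective, x0, lo, hi, steps=500, t0=1.0,
                        cooling=0.99, seed=0):
    """Minimise objective(x) on [lo, hi] with standard simulated annealing.
    Uses the classic Metropolis acceptance rule, NOT the paper's new one."""
    rng = random.Random(seed)
    x, fx = x0, objective(x0)
    best_x, best_f = x, fx
    t = t0
    for _ in range(steps):
        # Propose a nearby candidate and clip it to the search interval.
        cand = min(hi, max(lo, x + rng.gauss(0.0, 0.5)))
        fc = objective(cand)
        # Always accept improvements; accept worse moves with a
        # probability that shrinks as the temperature t decreases.
        if fc <= fx or rng.random() < math.exp((fx - fc) / t):
            x, fx = cand, fc
            if fx < best_f:
                best_x, best_f = x, fx
        t *= cooling
    return best_x, best_f

def toy_validation_error(log10_c):
    # Hypothetical error curve with its minimum at C = 10 (log10 C = 1).
    return (log10_c - 1.0) ** 2

best_log_c, best_err = simulated_annealing(toy_validation_error, x0=-2.0,
                                           lo=-3.0, hi=3.0)
```

Searching over log10(C) rather than C itself is a common design choice because useful penalty values typically span several orders of magnitude.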
Prediction of machine failure is challenging because the dataset is often imbalanced, with a low failure rate. The common approach to handling classification with imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or the Synthetic Minority Oversampling Technique (SMOTE). This paper compared the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) 'non-failure' and 528 (3%) 'failure' records. The three independent variables used to predict machine failure were the pressure indicator, flow indicator, and level indicator. On the original dataset, the accuracy of the classifiers was very high, close to 100%, but the sensitivity of all classifiers was close to zero. The performance of the three classifiers was then evaluated on data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM), and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated based on the improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with a radial basis function (RBF) kernel has the highest sensitivity when the data are balanced (50:50) using SMOTE (Sensitivity_test = 0.5686, F_test = 0.6927), compared with Naïve Bayes (Sensitivity_test = 0.4033, F_test = 0.6218) and Logistic Regression (Sensitivity_test = 0.4194, F_test = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves in sensitivity and F-measure as the imbalance ratio increases, but its sensitivity remains below 50%. The classifiers performed better when the data were balanced using SMOTE-SVM than with SMOTE or SMOTE-ENN.
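The gap between accuracy and sensitivity on the original data can be reproduced with a short arithmetic sketch. It uses the class counts quoted in the abstract (19,945 non-failure and 528 failure records) and a hypothetical classifier that always predicts 'non-failure'; the function names are assumptions, and this is not one of the classifiers evaluated in the study.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)

def sensitivity(y_true, y_pred, positive=1):
    """True positive rate: share of actual positives predicted as positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p == positive)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == positive and p != positive)
    return tp / (tp + fn) if (tp + fn) else 0.0

# Class counts taken from the abstract: 1 = failure, 0 = non-failure.
y_true = [1] * 528 + [0] * 19945
# Hypothetical trivial classifier: always predict the majority class.
y_pred = [0] * len(y_true)

majority_accuracy = accuracy(y_true, y_pred)        # about 0.974
majority_sensitivity = sensitivity(y_true, y_pred)  # 0.0
```

A model that never detects a single failure still scores about 97% accuracy here, which is why sensitivity and F-measure are the more informative metrics for this dataset.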
The imbalance of dissolved gas analysis (DGA) data leads to over-fitting, weak generalization, and poor recognition performance in fault diagnosis models based on deep learning. To handle this problem, this paper proposes a novel transformer fault diagnosis method for imbalanced data based on an improved auxiliary classifier generative adversarial network (ACGAN), which both balances the DGA data and supplies accurate diagnosis results. The generator combines one-dimensional convolutional neural networks (1D-CNN) and long short-term memory (LSTM) networks, which can deeply extract features from DGA samples and greatly benefit ACGAN's data balancing and fault diagnosis. The discriminator adopts multilayer perceptron (MLP) networks, which prevents it from losing important features of DGA data, as can happen when the network is too complex and has too many layers. The experimental results suggest that the presented approach can effectively mitigate the adverse effects of DGA data imbalance on deep learning models, enhance fault diagnosis performance, and deliver diagnosis accuracy of up to 99.46%. Furthermore, the comparison results indicate that the fault diagnosis performance of the proposed approach is superior to that of other conventional methods. Therefore, the method presented in this study has excellent and reliable fault diagnosis performance for various unbalanced datasets. In addition, the proposed approach can also address the problems of insufficient and imbalanced fault data in other practical application fields.
Accurate fault diagnosis of heating, ventilation, and air conditioning (HVAC) systems is of significant importance for maintaining normal operation, reducing energy consumption, and minimizing maintenance costs. However, in practical applications, it is challenging to obtain sufficient fault data for HVAC systems, which leads to imbalanced data in which the number of fault samples is much smaller than that of normal samples. Moreover, most existing HVAC fault diagnosis methods rely heavily on balanced training sets to achieve high diagnosis accuracy. To address this issue, a composite neural network fault diagnosis model is proposed that combines SMOTETomek, multi-scale one-dimensional convolutional neural networks (M1DCNN), and a support vector machine (SVM). The method first uses SMOTETomek to augment the minority class samples in the imbalanced dataset, balancing the number of faulty and normal samples. It then employs the M1DCNN model to extract feature information from the augmented dataset. Finally, it replaces the original Softmax classifier with an SVM classifier, thus enhancing the fault diagnosis accuracy. Using the SMOTETomek-M1DCNN-SVM method, we conducted fault diagnosis validation on both the ASHRAE RP-1043 dataset and an experimental dataset with an imbalance ratio of 1:10. The results demonstrate the superiority of this approach, providing a novel and promising solution for intelligent building management, with accuracy and F1 scores of 98.45% and 100% for the RP-1043 dataset and the experimental dataset, respectively.
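SMOTETomek pairs SMOTE oversampling with a cleaning step based on Tomek links. A Tomek link is a pair of samples from different classes that are each other's nearest neighbour. The sketch below shows only the link-detection step on a toy one-dimensional dataset; the function names and data are illustrative assumptions, not the implementation used in the study.

```python
def dist2(a, b):
    """Squared Euclidean distance between two points."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def nearest(i, X):
    """Index of the nearest neighbour of X[i], excluding i itself."""
    return min((j for j in range(len(X)) if j != i),
               key=lambda j: dist2(X[i], X[j]))

def tomek_links(X, y):
    """Return index pairs (i, j), i < j, that form Tomek links:
    mutual nearest neighbours carrying different class labels."""
    links = []
    for i in range(len(X)):
        j = nearest(i, X)
        if y[i] != y[j] and nearest(j, X) == i and i < j:
            links.append((i, j))
    return links

# Toy data: samples 1 and 2 sit close together but belong to different classes.
X = [(0.0,), (1.0,), (1.2,), (5.0,), (6.0,)]
y = [0, 0, 1, 1, 1]
links = tomek_links(X, y)  # -> [(1, 2)]
```

Removing one or both members of each link thins out the overlap region along the class boundary, which is the cleaning effect that complements the synthetic samples.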
The transition towards carbon-neutral power systems has necessitated the optimization of power dispatch in active distribution networks (ADNs) to facilitate the integration of distributed renewable generation. Because network topology and line impedance are unavailable in many distribution networks, physical model-based methods may not be applicable to their operation. To tackle this challenge, some studies have proposed constraint learning, which replicates physical models by training a neural network to evaluate the feasibility of a decision (i.e., whether or not a decision satisfies all critical constraints). To ensure the accuracy of the trained neural network, the training set should contain sufficient feasible and infeasible samples. However, since ADNs mostly operate in a normal state, very few historical samples are infeasible. The historical dataset is therefore highly imbalanced, which poses a significant obstacle to neural network training. To address this issue, we propose an enhanced constraint learning method. First, it leverages constraint learning to train a neural network as a surrogate of the ADN's model. Then, it introduces the Synthetic Minority Oversampling Technique to generate infeasible samples and mitigate the imbalance of the historical dataset. By incorporating both historical and synthetic samples into the training set, we can significantly improve the accuracy of the neural network. Furthermore, we establish a trust region to constrain the solution and thereby enhance its reliability. Simulations confirm the benefits of the proposed method in achieving desirable optimality and feasibility while maintaining low computational complexity.
Multi-label learning is a generalization of supervised single-label learning based on the assumption that each sample in a dataset may belong to more than one class simultaneously. The main objective of this work is to create a novel framework for learning and classifying imbalanced multi-label data. The proposed framework has two phases. In phase 1, the imbalanced distribution of the multi-label dataset is addressed through the proposed Borderline MLSMOTE resampling method. In phase 2, an adaptive weighted l2,1-norm regularized (Elastic-net) multi-label logistic regression is used to predict unseen samples. In contrast to conventional MLSMOTE, the proposed Borderline MLSMOTE resampling method focuses on samples with highly concurrent labels. The minority labels in these samples are called difficult minority labels and are more prone to penalize classification performance. The concurrence measure is treated as the borderline, and labels associated with these samples are regarded as borderline labels at the decision boundary. In phase 2, a novel adaptive l2,1-norm regularized weighted multi-label logistic regression is used to handle the balanced data with differently weighted synthetic samples. Experiments on various benchmark datasets show that the proposed method outperforms existing conventional state-of-the-art multi-label methods and delivers strong predictive performance.
Purpose: This paper aims to improve classification performance on imbalanced data by applying different sampling techniques available in machine learning. Design/methodology/approach: The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to the dataset, the result is biased towards the majority class and ignores the minority class. To avoid this issue, multiple sampling techniques, namely Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN), are applied to balance the dataset. Performance is assessed using the Decision Tree classifier with each of the listed sampling techniques, and the best performance is identified. Findings: This study focuses on comparing the performance metrics of various widely used sampling methods. It is revealed that, compared to other techniques, the Recall is high when ENN is applied. CNN and ADASYN performed equally well on the imbalanced data. Research limitations: The testing was carried out with a limited dataset and needs to be repeated with a larger dataset. Practical implications: This framework will be useful whenever the data is imbalanced in real-world scenarios, which ultimately improves performance. Originality/value: This paper uses the rebalancing framework on the medical appointment no-show dataset to predict no-shows and removes the bias against the minority class.
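The two simplest techniques in the list above, ROS and RUS, can be sketched in a few lines. This is a minimal illustration under assumed names (`random_oversample`, `random_undersample`) and toy data, not the code used in the study.

```python
import random
from collections import Counter

def random_oversample(X, y, seed=0):
    """ROS: duplicate randomly chosen samples of each smaller class
    until every class matches the size of the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out

def random_undersample(X, y, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out

# Toy data: 8 majority samples (label 0) and 2 minority samples (label 1).
X = [(float(i),) for i in range(10)]
y = [0] * 8 + [1] * 2
X_ros, y_ros = random_oversample(X, y)
X_rus, y_rus = random_undersample(X, y)
```

ROS keeps all the original information but repeats minority samples, while RUS discards majority samples; that trade-off is one reason studies compare several techniques rather than assume one is best.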
Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. Over-sampling is often used to solve this problem. It generates samples according to the distance between minority data points. However, traditional over-sampling may change the original data distribution, which harms classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims to improve the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority samples needed to achieve satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method can efficiently improve classification performance compared with other related methods.
Recently, machine learning algorithms have been used in the detection and classification of network attacks. The performance of the algorithms has been evaluated using benchmark network intrusion datasets such as DARPA98, KDD'99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets pose two major challenges: imbalanced data and high-dimensional data. Obtaining high accuracy for all attack types in the dataset allows for high accuracy on imbalanced datasets. On the other hand, a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The number of features in the model, which has been tested on CICIDS2017, is first optimized using genetic algorithms. This optimal feature set is then used to classify network attacks with six well-known classifiers, targeting a high F1-score and G-mean value in minimum time. Afterwards, a multi-layer perceptron-based ensemble learning approach is applied to improve the models' overall performance. The experimental results show that the suggested model is suitable for feature selection as well as for classifying network attacks in an imbalanced dataset, with a high F1-score (0.91) and G-mean (0.99) value. Furthermore, it outperforms the base classifier models and voting procedures.
Data-driven methods are widely considered for fault diagnosis in complex systems. However, in practice, the between-class imbalance caused by limited faulty samples may degrade their classification performance. To address this issue, synthetic minority methods for data enhancement have proven effective in many applications. Generative adversarial networks (GANs), which are capable of automatic feature extraction, can also be adopted to augment the faulty samples. However, the monitoring data of a complex system may include not only continuous signals but also discrete or categorical signals. Since current GAN methods still face challenges in handling such heterogeneous monitoring data, a Mixed Dual Discriminator GAN (denoted M-D2GAN) is proposed in this work. To make the expanded fault samples better aligned with the real situation and to improve the accuracy and robustness of the fault diagnosis model, different types of variables, including floating-point, integer, categorical, and hierarchical variables, are generated in different ways. To account effectively for the class imbalance problem, the GAN model is modified by adding a normal-class discriminator. A practical case study concerning the braking system of a high-speed train is carried out to verify the effectiveness of the proposed framework. Compared to the classic GAN, the proposed framework achieves better results with respect to the F-measure and G-mean metrics.
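The F-measure and G-mean metrics named above are commonly computed from binary confusion-matrix counts. The sketch below uses the standard definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); the counts are made-up illustrative values, not results from the case study.

```python
import math

def f_measure(tp, fp, fn):
    """F1 score: harmonic mean of precision and recall."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)

# Hypothetical counts: 50 faulty samples and 950 normal samples.
tp, fn, fp, tn = 40, 10, 20, 930
f1 = f_measure(tp, fp, fn)
gm = g_mean(tp, fp, fn, tn)
```

Both metrics collapse toward zero when the minority class is ignored, unlike plain accuracy, which is why they are preferred for imbalanced fault data.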
Imbalanced data classification is the task of classifying datasets in which there is a significant disparity in the number of samples between classes. This task is prevalent in practical scenarios such as industrial fault diagnosis, network intrusion detection, and cancer detection. In imbalanced classification tasks, the focus is typically on achieving high recognition accuracy for the minority class. However, because of the challenges presented by imbalanced multi-class datasets, such as the scarcity of samples in minority classes and complex inter-class relationships with overlapping boundaries, existing methods often perform poorly in multi-class imbalanced data classification, particularly in recognizing minority classes with high accuracy. Therefore, this paper proposes a multi-class imbalanced data classification method called CSDSResNet, which is based on a cost-sensitive dual-stream residual network. First, to address the issue of limited minority-class samples in imbalanced datasets, a dual-stream residual network backbone is designed to enhance the model's feature extraction capability. Next, considering the complexities arising from imbalanced inter-class sample quantities and imbalanced inter-class overlapping boundaries in multi-class imbalanced datasets, a unique cost-sensitive loss function is devised. This loss function places more emphasis on the minority class and on the challenging classes with high inter-class similarity, thereby improving the model's classification ability. Finally, the effectiveness and generalization of the proposed CSDSResNet method are evaluated on two datasets: 'DryBeans' and 'Electric Motor Defects'. The experimental results demonstrate that CSDSResNet achieves the best performance on imbalanced datasets, with macro F1-score improvements of 2.9% and 1.9% on the two datasets, respectively, over current state-of-the-art classification methods. Furthermore, it achieves the highest precision in single-class recognition tasks for the minority class.
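The general idea of a cost-sensitive loss can be illustrated with class-weighted cross-entropy. The abstract does not give the form of the CSDSResNet loss, so this sketch assumes a common generic choice, inverse-frequency class weights, and is not the paper's loss function; all names and data are hypothetical.

```python
import math
from collections import Counter

def inverse_frequency_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count),
    so that rarer classes receive larger weights."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {c: n / (k * m) for c, m in counts.items()}

def weighted_cross_entropy(probs, labels, weights):
    """Weighted mean of -log p(true class); probs[i][c] is the predicted
    probability of class c for sample i."""
    total = sum(weights[y] * -math.log(p[y]) for p, y in zip(probs, labels))
    return total / sum(weights[y] for y in labels)

# Toy data: 9 majority samples (class 0) and 1 minority sample (class 1).
labels = [0] * 9 + [1]
weights = inverse_frequency_weights(labels)  # class 1 gets weight 5.0

# A model that is confident on the majority class but wrong on the minority one.
probs = [[0.9, 0.1]] * 9 + [[0.9, 0.1]]
loss_weighted = weighted_cross_entropy(probs, labels, weights)
loss_plain = weighted_cross_entropy(probs, labels, {0: 1.0, 1: 1.0})
```

With the weights applied, the single misclassified minority sample contributes as much to the loss as all nine majority samples together, so the weighted loss is larger than the unweighted one for this model.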
Encrypted traffic classification has become a hot issue in network security research. The class imbalance of traffic samples often degrades the performance of machine learning-based classifiers. Although the Generative Adversarial Network (GAN) method can generate new samples by learning the feature distribution of the original samples, it suffers from unstable training and mode collapse. To this end, a novel data augmentation approach called Graph CWGAN-GP is proposed in this paper. The traffic data is first converted into grayscale images as the input for the proposed model. Then, the minority class data is augmented with the proposed model, which is built by introducing conditional constraints and a new distance metric into a typical GAN. Finally, a classical deep learning model is adopted as the classifier for datasets augmented by the Conditional GAN (CGAN), Wasserstein GAN-Gradient Penalty (WGAN-GP), and Graph CWGAN-GP, respectively. Compared with state-of-the-art GAN methods, Graph CWGAN-GP can not only control the modes of the generated data but also overcome the problem of unstable training and generate more realistic and diverse samples. The experimental results show that the classification precision, recall, and F1-Score of the minority class in the balanced dataset augmented by the proposed method improve by more than 2.37%, 3.39%, and 4.57%, respectively.
Diabetes is one of the world's most common diseases and is caused by persistently high levels of blood sugar. The risk posed by diabetes can be lowered if it is detected at an early stage. Recently, several machine learning models have been developed to predict the presence of diabetes at an early stage. In this paper, we propose an embedded-based machine learning model that combines the split-vote method and instance duplication to leverage an imbalanced dataset called PIMA Indian and improve the prediction of diabetes. The proposed method uses both over-sampling and under-sampling along with model weighting to increase classification performance. Accuracy, Precision, Recall, and F1-Score are used to evaluate the model. The results obtained using K-Nearest Neighbor (kNN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Logistic Regression (LR), and Decision Trees (DT) were 89.32%, 91.44%, 95.78%, 89.3%, 81.76%, and 80.38%, respectively. The SVM model outperforms the other models and is 21.38% higher than existing machine learning-based works.
A common difficulty in building prediction models with real-world environmental datasets is the skewed distribution of classes. There are significantly more samples for day-to-day classes, while rare events such as polluted classes are uncommon. Consequently, the limited availability of minority outcomes lowers the classifier's overall reliability. This study assesses the capability of machine learning (ML) algorithms to handle imbalanced water quality data based on the metrics of precision, recall, and F1 score. It aims to correct accuracy that is misleadingly skewed toward the majority of the data. Hence, the performance of 10 ML algorithms is compared. The classifiers included are AdaBoost, Support Vector Machine, Linear Discriminant Analysis, k-Nearest Neighbors, Naive Bayes, Decision Trees, Random Forest, Extra Trees, Bagging, and the Multilayer Perceptron. This study also uses the Easy Ensemble Classifier, Balanced Bagging, and RUSBoost algorithms to evaluate multi-class imbalanced learning methods. The comparison results revealed that a high-accuracy machine learning model is not always good in recall and sensitivity. This paper's stacked ensemble deep learning (SE-DL) generalization model effectively classifies the water quality index (WQI) based on 23 input variables. The proposed algorithm achieved averages of 95.69%, 94.96%, 92.92%, and 93.88% for accuracy, precision, recall, and F1 score, respectively. In addition, the proposed model is compared against two state-of-the-art classifiers, XGBoost (eXtreme Gradient Boosting) and Light Gradient Boosting Machine, with balanced accuracy and G-mean included as performance metrics. The experiments showed that XGBoost achieves higher balanced accuracy and G-mean. However, the SE-DL model has a better and more balanced performance in the F1 score. The SE-DL model aligns with the goal of this study to ensure the balance between accuracy and completeness for each water quality class. The proposed algorithm is also more efficient, with a lower computational time, than the standard Synthetic Minority Oversampling Technique (SMOTE) approach to imbalanced datasets.
Over the past 10 years, lightning disasters have caused a large number of casualties and considerable economic loss worldwide. Lightning poses a huge threat to various industries. In an attempt to reduce the risk of lightning-caused disasters, many scholars have carried out in-depth research on lightning. However, these studies focus primarily on the lightning itself, and other meteorological elements are ignored. In addition, the methods for assessing the risk of lightning disaster fail to give detailed attention to regional features (lightning disaster risk). This paper proposes a grid-based risk assessment method based on data from multiple sources. First, this paper considers the impact of lightning, population density, the economy, and geographical environment data on the occurrence of lightning disasters. Second, this paper solves the problem of imbalanced lightning disaster data in geographic grid samples based on the K-means clustering algorithm. Third, the method calculates the feature of lightning disaster in each small field with the help of a neural network structure, and the calculation results are then visually reflected in a zoning map by the Jenks natural breaks algorithm. The experimental results show that our method can solve the problem of imbalanced lightning disaster data and offers 81% accuracy in lightning disaster risk assessment.
Diagnosis methods based on machine learning and deep learning are widely used in the field of motor fault diagnosis. However, the data imbalance caused by the high cost of obtaining fault data leads to insufficient generalization performance of the diagnosis method. In response to this problem, a motor fault monitoring system is proposed, which includes a fault diagnosis method (Xgb_LR) based on a fusion model of the optimized gradient boosting decision tree (Xgboost) and logistic regression (LR), and a data augmentation method named data simulation neighborhood interpolation (DSNI). The Xgb_LR method combines the advantages of the two models and adapts well to imbalanced data. In addition, the DSNI method can be used as an auxiliary method of the diagnosis method to reduce the impact of data imbalance by expanding the original data (signal). Simulation experiments verify the effectiveness of the proposed methods.
Imbalanced datasets are common in practical applications, and oversampling methods using fuzzy rules have been shown to enhance the classification performance of imbalanced data by taking into account the relationship between data attributes. However, the creation of fuzzy rules typically depends on expert knowledge, which may not fully leverage the label information in training data and may be subjective. To address this issue, a novel fuzzy rule oversampling approach is developed based on the learning vector quantization (LVQ) algorithm. In this method, the label information of the training data is utilized to determine the antecedent part of If-Then fuzzy rules by dynamically dividing attribute intervals using LVQ. Subsequently, fuzzy rules are generated and adjusted to calculate rule weights. The number of new samples to be synthesized for each rule is then computed, and samples from the minority class are synthesized based on the newly generated fuzzy rules. This results in the establishment of a fuzzy rule oversampling method based on LVQ. To evaluate the effectiveness of this method, comparative experiments are conducted on 12 publicly available imbalanced datasets with five other sampling techniques in combination with the support function machine. The experimental results demonstrate that the proposed method can significantly enhance the classification algorithm across seven performance indicators, including a boost of 2.15% to 12.34% in Accuracy, 6.11% to 27.06% in G-mean, and 4.69% to 18.78% in AUC. These results show that the proposed method is capable of more efficiently improving the classification performance of imbalanced data.
Funding: Supported by the Yunnan Major Scientific and Technological Projects (Grant No. 202302AD080001) and the National Natural Science Foundation, China (No. 52065033).
Abstract: When building a classification model, the scenario where the samples of one class significantly outnumber those of the other class is called data imbalance. Data imbalance causes the trained classification model to favor the majority class (usually defined as the negative class), which may harm the accuracy on the minority class (usually defined as the positive class) and thus lead to poor overall performance of the model. A method called MSHR-FCSSVM for solving imbalanced data classification is proposed in this article, which is based on a new hybrid resampling approach (MSHR) and a new fine cost-sensitive support vector machine (CS-SVM) classifier (FCSSVM). The MSHR measures the separability of each negative sample through its Silhouette value, calculated using the Mahalanobis distance between samples. Based on this value, the so-called pseudo-negative samples are screened out, used to generate new positive samples through linear interpolation (over-sampling step), and finally deleted (under-sampling step). This approach replaces pseudo-negative samples with generated new positive samples one by one to clear up the inter-class overlap on the borderline, without changing the overall scale of the dataset. The FCSSVM is an improved version of the traditional CS-SVM. It simultaneously considers the influence of both the imbalance in sample numbers and the class distribution on classification, and it finely tunes the class cost weights using the rime-ice (RIME) optimization algorithm, which is based on the physical phenomenon of rime ice, with cross-validation accuracy as the fitness function, so as to accurately adjust the classification borderline. To verify the effectiveness of the proposed method, a series of experiments are carried out on 20 imbalanced datasets, including both mildly and extremely imbalanced datasets. The experimental results show that the MSHR-FCSSVM method performs better than the methods used for comparison in most cases, and that both the MSHR and the FCSSVM play significant roles.
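To make the idea of class cost weights in a cost-sensitive classifier more concrete, the sketch below computes the common inverse-frequency heuristic, weight = n_samples / (n_classes × class_count). This is an assumption chosen for illustration: it is not the RIME-based tuning described in the abstract, and the function name and the 90/10 toy split are hypothetical. Such weights might serve as a starting point that an optimizer then refines.

```python
from collections import Counter

def balanced_class_weights(labels):
    """Inverse-frequency class weights: n_samples / (n_classes * class_count).

    Rare classes receive larger weights, so misclassifying them costs more.
    """
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}

# Hypothetical toy data: 90 negative (majority) and 10 positive (minority) samples.
labels = ["negative"] * 90 + ["positive"] * 10
weights = balanced_class_weights(labels)
print(weights)
```

With this split the positive class is weighted nine times as heavily as the negative class, mirroring the 9:1 imbalance ratio.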
Funding: Key Research Project of Hunan Provincial Administration of Traditional Chinese Medicine (A2023048) and Key Research Foundation of Education Bureau of Hunan Province, China (23A0273).
Abstract: Objective: Clinical medical record data associated with hepatitis B-related acute-on-chronic liver failure (HBV-ACLF) generally have small sample sizes and a class imbalance. However, most machine learning models are designed based on balanced data and lack interpretability. This study aimed to propose a traditional Chinese medicine (TCM) diagnostic model for HBV-ACLF based on the TCM syndrome differentiation and treatment theory, which is clinically interpretable and highly accurate. Methods: We collected medical records from 261 patients diagnosed with HBV-ACLF, including three syndromes: Yang jaundice (214 cases), Yang-Yin jaundice (41 cases), and Yin jaundice (6 cases). To avoid overfitting of the machine learning model, we excluded the cases of Yin jaundice. After data standardization and cleaning, we obtained 255 relevant medical records of Yang jaundice and Yang-Yin jaundice. To address the class imbalance issue, we employed the oversampling method and five machine learning methods, including logistic regression (LR), support vector machine (SVM), decision tree (DT), random forest (RF), and extreme gradient boosting (XGBoost), to construct the syndrome diagnosis models. This study used precision, F1 score, the area under the receiver operating characteristic (ROC) curve (AUC), and accuracy as model evaluation metrics. The model with the best classification performance was selected to extract the diagnostic rule, and its clinical significance was thoroughly analyzed. Furthermore, we proposed a novel multiple-round stable rule extraction (MRSRE) method to obtain a stable rule set of features that can exhibit the model’s clinical interpretability. Results: The precision of the five machine learning models built using oversampled balanced data exceeded 0.90. Among these models, the accuracy of RF classification of syndrome types was 0.92, and the mean F1 scores of the two categories of Yang jaundice and Yang-Yin jaundice were 0.93 and 0.94, respectively. Additionally, the AUC was 0.98. The extraction rules of the RF syndrome differentiation model based on the MRSRE method revealed that the common features of Yang jaundice and Yang-Yin jaundice were wiry pulse, yellowing of the urine, skin, and eyes, normal tongue body, healthy sublingual vessel, nausea, oil loathing, and poor appetite. The main features of Yang jaundice were a red tongue body and thickened sublingual vessels, whereas those of Yang-Yin jaundice were a dark tongue body, pale white tongue body, white tongue coating, lack of strength, slippery pulse, light red tongue body, slimy tongue coating, and abdominal distension. This is aligned with the classifications made by TCM experts based on TCM syndrome differentiation and treatment theory. Conclusion: Our model can be utilized for differentiating HBV-ACLF syndromes, and the approach has the potential to be applied to generate other clinically interpretable models with high accuracy on clinical data characterized by small sample sizes and a class imbalance.
Funding: Supported by the National Key Research and Development Program of China (2018YFB1003700), the Scientific and Technological Support Project (Society) of Jiangsu Province (BE2016776), the “333” Project of Jiangsu Province (BRA2017228, BRA2017401), and the Talent Project in Six Fields of Jiangsu Province (2015-JNHB-012).
Abstract: For imbalanced datasets, the focus of classification is to identify samples of the minority class. The performance of current data mining algorithms is not good enough for processing imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets, generating synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE encounters the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) algorithm is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN and SMOTE, and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups, namely core samples, borderline samples, and noise samples, and then the noise samples of the minority class are removed so that more effective samples can be synthesized. In order to make full use of the information in core samples and borderline samples, different strategies are used to over-sample each of the two groups. Experiments show that DSMOTE can achieve better results than SMOTE and Borderline-SMOTE in terms of precision, recall, and F-value.
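The interpolation step that the abstract above attributes to SMOTE can be sketched as follows. This is a minimal illustration of plain SMOTE-style interpolation under stated assumptions (Euclidean distance, a hypothetical function name `smote_like`, hypothetical toy points); it is not the DSMOTE method and omits the DBSCAN grouping.

```python
import math
import random

def smote_like(minority, n_new, k=3, seed=0):
    """Generate n_new synthetic points by interpolating between a randomly
    chosen minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        x = minority[i]
        # k nearest minority neighbours of x (excluding x itself).
        others = [p for j, p in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda p: math.dist(x, p))[:k]
        z = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from x to z
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(x, z)))
    return synthetic

# Hypothetical 2-D minority samples.
minority = [(1.0, 1.0), (2.0, 1.0), (1.0, 2.0), (2.0, 2.0), (5.0, 5.0)]
new_points = smote_like(minority, n_new=5)
print(new_points)
```

Every synthetic point lies on a line segment between two existing minority samples, which is also why interpolation near noisy or borderline samples can blur the class boundary, the overgeneralization issue the abstract refers to.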
Abstract: Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of data samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) is affected when dealing with an imbalanced dataset. The classification accuracy is mostly skewed toward the majority class, and poor results are exhibited in the prediction of minority-class samples. In this paper, a hybrid approach combining a data pre-processing technique and an SVM algorithm based on improved Simulated Annealing (SA) is proposed. First, a data pre-processing technique is proposed that primarily addresses the resampling strategy for handling imbalanced datasets. In this technique, data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step to remove redundant and duplicated data. Next, the balanced dataset is used to train the SVM. Since this algorithm requires an iterative process to search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. In the proposed improvement, a new acceptance criterion for solutions in the SA algorithm is introduced to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrate higher accuracy in the classification tasks using the proposed approach in comparison with the conventional implementation of SVM. An average accuracy of 89.65% for binary classification demonstrates the good performance of the proposed approach.
Funding: Supported under the research grant (PO Number: 920138936) from the Institute of Technology PETRONAS Sdn Bhd, 32610, Bandar Seri Iskandar, Perak, Malaysia.
Abstract: Prediction of machine failure is challenging as the dataset is often imbalanced with a low failure rate. The common approach to handling classification involving imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or Synthetic Minority Oversampling Technique (SMOTE) algorithms. This paper compares the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) ‘non-failure’ and 528 (3%) ‘failure’ records. The three independent variables used to predict machine failure were the pressure indicator, flow indicator, and level indicator. The accuracy of the classifiers is very high and close to 100%, but the sensitivity of all classifiers using the original dataset was close to zero. The performance of the three classifiers was then evaluated for data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM), and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated based on improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with a radial basis function (RBF) kernel has the highest sensitivity when data is balanced (50:50) using SMOTE (Sensitivity_test = 0.5686, F_test = 0.6927) compared to Naïve Bayes (Sensitivity_test = 0.4033, F_test = 0.6218) and Logistic Regression (Sensitivity_test = 0.4194, F_test = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves sensitivity and F-measure as the imbalance ratio increases, but the sensitivity is below 50%. The classifiers performed better when data was balanced using SMOTE-SVM compared to SMOTE and SMOTE-ENN.
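The gap between very high accuracy and near-zero sensitivity reported above can be reproduced with simple arithmetic on the class counts given in the abstract (19,945 non-failure and 528 failure records). The sketch below scores a hypothetical classifier that always predicts ‘non-failure’; the function name and the choice of this trivial baseline are assumptions for illustration and do not represent any of the paper's models.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Accuracy, sensitivity (recall) and F-measure from confusion counts.

    Undefined ratios (zero denominators) are reported as 0.0.
    """
    total = tp + fp + tn + fn
    accuracy = (tp + tn) / total
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    denom = precision + sensitivity
    f_measure = 2 * precision * sensitivity / denom if denom else 0.0
    return accuracy, sensitivity, f_measure

# Trivial baseline: every record is predicted as 'non-failure'.
# All 19,945 non-failures are true negatives; all 528 failures are false negatives.
accuracy, sensitivity, f_measure = confusion_metrics(tp=0, fp=0, tn=19945, fn=528)
print(round(accuracy, 3), sensitivity, f_measure)
```

The baseline reaches an accuracy of about 0.974 while detecting no failures at all, which is why sensitivity and F-measure are the more informative metrics here.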
Funding: The authors gratefully acknowledge the financial support of the National Natural Science Foundation of China (No. 52067021), the Natural Science Foundation of Xinjiang Uygur Autonomous Region (2022D01C35), the Excellent Youth Scientific and Technological Talents Plan of Xinjiang (No. 2019Q012), and the Major Science & Technology Special Project of Xinjiang Uygur Autonomous Region (2022A01002-2).
Abstract: The imbalance of dissolved gas analysis (DGA) data leads to over-fitting, weak generalization, and poor recognition performance for fault diagnosis models based on deep learning. To handle this problem, a novel transformer fault diagnosis method based on an improved auxiliary classifier generative adversarial network (ACGAN) under imbalanced data is proposed in this paper, which meets the requirements of both balancing DGA data and supplying accurate diagnosis results. The generator combines one-dimensional convolutional neural networks (1D-CNN) and long short-term memory (LSTM) networks, which can deeply extract features from DGA samples and greatly benefit ACGAN’s data balancing and fault diagnosis. The discriminator adopts multilayer perceptron (MLP) networks, which prevents the discriminator from losing important features of DGA data when the network is too complex and the number of layers is too large. The experimental results suggest that the presented approach can effectively mitigate the adverse effects of DGA data imbalance on deep learning models, enhance fault diagnosis performance, and supply a diagnosis accuracy of up to 99.46%. Furthermore, the comparison results indicate that the fault diagnosis performance of the proposed approach is superior to that of other conventional methods. Therefore, the method presented in this study has excellent and reliable fault diagnosis performance for various imbalanced datasets. In addition, the proposed approach can also solve the problems of insufficient and imbalanced fault data in other practical application fields.
Funding: The authors of this paper acknowledge the support from the National Natural Science Foundation of China (No. 51975191) and the Funds for Science and Technology Creative Talents of Hubei, China (No. 2023DJC048). This work was also supported by the Xiangyang Hubei University of Technology Industrial Research Institute Funding Program (No. XYYJ2022B01).
Abstract: Accurate fault diagnosis of heating, ventilation, and air conditioning (HVAC) systems is of significant importance for maintaining normal operation, reducing energy consumption, and minimizing maintenance costs. However, in practical applications, it is challenging to obtain sufficient fault data for HVAC systems, leading to imbalanced data, where the number of fault samples is much smaller than that of normal samples. Moreover, most existing HVAC system fault diagnosis methods rely heavily on balanced training sets to achieve high fault diagnosis accuracy. To address this issue, a composite neural network fault diagnosis model is proposed, which combines SMOTETomek, multi-scale one-dimensional convolutional neural networks (M1DCNN), and a support vector machine (SVM). This method first utilizes SMOTETomek to augment the minority class samples in the imbalanced dataset, achieving a balanced number of faulty and normal data. Then, it employs the M1DCNN model to extract feature information from the augmented dataset. Finally, it replaces the original Softmax classifier with an SVM classifier for classification, thus enhancing the fault diagnosis accuracy. Using the SMOTETomek-M1DCNN-SVM method, we conducted fault diagnosis validation on both the ASHRAE RP-1043 dataset and an experimental dataset with an imbalance ratio of 1:10. The results demonstrate the superiority of this approach, providing a novel and promising solution for intelligent building management, with accuracy and F1 scores of 98.45% and 100% for the RP-1043 dataset and experimental dataset, respectively.
Funding: Supported in part by the Science and Technology Development Fund, Macao SAR, China (File no. SKL-IOTSC(UM)-2021-2023, File no. 0003/2020/AKP, and File no. 0011/2021/AGJ).
Abstract: The transition towards carbon-neutral power systems has necessitated optimization of power dispatch in active distribution networks (ADNs) to facilitate the integration of distributed renewable generation. Because network topology and line impedance are unavailable in many distribution networks, physical model-based methods may not be applicable to their operations. To tackle this challenge, some studies have proposed constraint learning, which replicates physical models by training a neural network to evaluate the feasibility of a decision (i.e., whether a decision satisfies all critical constraints or not). To ensure the accuracy of this trained neural network, the training set should contain sufficient feasible and infeasible samples. However, since ADNs are mostly operated in a normal status, only very few historical samples are infeasible. Thus, the historical dataset is highly imbalanced, which poses a significant obstacle to neural network training. To address this issue, we propose an enhanced constraint learning method. First, it leverages constraint learning to train a neural network as a surrogate of the ADN's model. Then, it introduces the Synthetic Minority Oversampling Technique to generate infeasible samples and mitigate the imbalance of the historical dataset. By incorporating historical and synthetic samples into the training set, we can significantly improve the accuracy of the neural network. Furthermore, we establish a trust region to constrain the solution and thereby enhance its reliability. Simulations confirm the benefits of the proposed method in achieving desirable optimality and feasibility while maintaining low computational complexity.
Funding: Partly supported by the Technology Development Program of MSS (No. S3033853) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (No. 2021R1A4A1031509).
Abstract: A generalization of supervised single-label learning based on the assumption that each sample in a dataset may belong to more than one class simultaneously is called multi-label learning. The main objective of this work is to create a novel framework for learning and classifying imbalanced multi-label data. This work proposes a framework of two phases. The imbalanced distribution of the multi-label dataset is addressed through the proposed Borderline MLSMOTE resampling method in phase 1. Later, an adaptive weighted l2,1-norm regularized (Elastic-net) multi-label logistic regression is used to predict unseen samples in phase 2. The proposed Borderline MLSMOTE resampling method focuses on samples with concurrent high labels, in contrast to conventional MLSMOTE. The minority labels in these samples are called difficult minority labels and are more prone to penalize classification performance. The concurrent measure is considered borderline, and labels associated with samples are regarded as borderline labels in the decision boundary. In phase 2, a novel adaptive l2,1-norm regularized weighted multi-label logistic regression is used to handle balanced data with differently weighted synthetic samples. Experiments on various benchmark datasets show that the proposed method outperforms existing conventional state-of-the-art multi-label methods and delivers strong predictive performance.
Abstract: Purpose: This paper aims to improve the classification performance when the data is imbalanced by applying different sampling techniques available in machine learning. Design/methodology/approach: The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to the dataset, they are biased towards the majority class, ignoring the minority class. To avoid this issue, multiple sampling techniques such as Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN) are applied in order to make the dataset balanced. The performance is assessed by the Decision Tree classifier with the listed sampling techniques, and the best performance is identified. Findings: This study focuses on the comparison of the performance metrics of various widely used sampling methods. It is revealed that, compared to other techniques, the Recall is high when ENN is applied; CNN and ADASYN have performed equally well on the imbalanced data. Research limitations: The testing was carried out with a limited dataset and needs to be repeated with a larger dataset. Practical implications: This framework will be useful whenever the data is imbalanced in real-world scenarios, which ultimately improves the performance. Originality/value: This paper uses the rebalancing framework on the medical appointment no-show dataset to predict the no-shows and removes the bias towards the majority class.
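Of the sampling techniques listed above, Random Over Sampling (ROS) and Random Under Sampling (RUS) are the simplest, and a minimal sketch may help clarify what they do. The function names, the toy labels, and the choice to balance every class to the largest (ROS) or smallest (RUS) class size are assumptions for illustration, not the paper's exact configuration.

```python
import random
from collections import Counter

def _group_by_label(samples, labels):
    groups = {}
    for s, y in zip(samples, labels):
        groups.setdefault(y, []).append(s)
    return groups

def random_oversample(samples, labels, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(g) for g in groups.values())
    out_samples, out_labels = [], []
    for y, g in groups.items():
        extra = [rng.choice(g) for _ in range(target - len(g))]
        out_samples.extend(g + extra)
        out_labels.extend([y] * target)
    return out_samples, out_labels

def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class so every class matches the smallest one."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(g) for g in groups.values())
    out_samples, out_labels = [], []
    for y, g in groups.items():
        out_samples.extend(rng.sample(g, target))
        out_labels.extend([y] * target)
    return out_samples, out_labels

# Hypothetical toy data: 6 "show" appointments and 2 "no-show" appointments.
samples = list(range(8))
labels = ["show"] * 6 + ["no-show"] * 2
_, over_labels = random_oversample(samples, labels)
_, under_labels = random_undersample(samples, labels)
print(Counter(over_labels), Counter(under_labels))
```

ROS keeps all original records but repeats minority ones, whereas RUS discards majority records; the latter loses information, which matters when the dataset is already small.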
Funding: Partially supported by the Aeronautical Science Foundation of China (No. 201920007001) and the National Natural Science Foundation of China (Nos. U20B2067, 61790552, and 61790554).
Abstract: Imbalanced data classification is an important research topic in real-world applications, such as fault diagnosis in an aircraft manufacturing system. The over-sampling method is often used to solve this problem. It generates samples according to the distance between minority data. However, the traditional over-sampling method may change the original data distribution, which is harmful to the classification performance. In this paper, we propose a new method called Conditional Self-Attention Generative Adversarial Network with Differential Evolution (CSAGAN-DE) for imbalanced data classification. The new method aims at improving the classification performance on minority data by enhancing the quality of the generated minority data. In CSAGAN-DE, the minority data are fed into the self-attention generative adversarial network to approximate the data distribution and create new data for the minority class. Then, the differential evolution algorithm is employed to automatically determine the number of generated minority data needed to achieve a satisfactory classification performance. Several experiments are conducted to evaluate the performance of the new CSAGAN-DE method. The results show that the new method can efficiently improve the classification performance compared with other related methods.
Abstract: Recently, machine learning algorithms have been used in the detection and classification of network attacks. The performance of the algorithms has been evaluated using benchmark network intrusion datasets such as DARPA98, KDD’99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets have two major challenges: imbalanced data and high-dimensional data. Obtaining high accuracy for all attack types in the dataset allows for high accuracy in imbalanced datasets. On the other hand, having a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The number of features in the model, which has been tested on CICIDS2017, is initially optimized using genetic algorithms. This optimum feature set has been used to classify network attacks with six well-known classifiers, selected for a high F1-score and G-mean value in minimum time. Afterwards, a multi-layer perceptron based ensemble learning approach has been applied to improve the models’ overall performance. The experimental results show that the suggested model is acceptable for feature selection as well as for classifying network attacks in an imbalanced dataset, with a high F1-score (0.91) and G-mean (0.99) value. Furthermore, it has outperformed base classifier models and voting procedures.
Abstract: Data-driven methods are widely considered for fault diagnosis in complex systems. However, in practice, the between-class imbalance due to limited faulty samples may deteriorate their classification performance. To address this issue, synthetic minority methods for enhancing data have proved effective in many applications. Generative adversarial networks (GANs), capable of automatic feature extraction, can also be adopted for augmenting the faulty samples. However, the monitoring data of a complex system may include not only continuous signals but also discrete/categorical signals. Since current GAN methods still have difficulty handling such heterogeneous monitoring data, a Mixed Dual Discriminator GAN (denoted M-D2GAN) is proposed in this work. In order to align the expanded fault samples more closely with the real situation and improve the accuracy and robustness of the fault diagnosis model, different types of variables are generated in different ways, including floating-point, integer, categorical, and hierarchical variables. To address the class imbalance problem effectively, modifications are made to the GAN model, in which a normal class discriminator is added. A practical case study concerning the braking system of a high-speed train is carried out to verify the effectiveness of the proposed framework. Compared to the classic GAN, the proposed framework achieves better results with respect to the F-measure and G-mean metrics.
Funding: Supported by the Beijing Municipal Science and Technology Project (No. Z221100007122003).
Abstract: Imbalanced data classification is the task of classifying datasets where there is a significant disparity in the number of samples between different classes. This task is prevalent in practical scenarios such as industrial fault diagnosis, network intrusion detection, and cancer detection. In imbalanced classification tasks, the focus is typically on achieving high recognition accuracy for the minority class. However, due to the challenges presented by imbalanced multi-class datasets, such as the scarcity of samples in minority classes and complex inter-class relationships with overlapping boundaries, existing methods often do not perform well in multi-class imbalanced data classification tasks, particularly in terms of recognizing minority classes with high accuracy. Therefore, this paper proposes a multi-class imbalanced data classification method called CSDSResNet, which is based on a cost-sensitive dual-stream residual network. First, to address the issue of limited samples in the minority class within imbalanced datasets, a dual-stream residual network backbone structure is designed to enhance the model’s feature extraction capability. Next, considering the complexities arising from imbalanced inter-class sample quantities and imbalanced inter-class overlapping boundaries in multi-class imbalanced datasets, a unique cost-sensitive loss function is devised. This loss function places more emphasis on the minority class and on the challenging classes with high inter-class similarity, thereby improving the model’s classification ability. Finally, the effectiveness and generalization of the proposed method, CSDSResNet, are evaluated on two datasets: ‘DryBeans’ and ‘Electric Motor Defects’. The experimental results demonstrate that CSDSResNet achieves the best performance on imbalanced datasets, with macro_F1-score values improving by 2.9% and 1.9% on the two datasets, respectively, compared to current state-of-the-art classification methods. Furthermore, it achieves the highest precision in single-class recognition tasks for the minority class.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61931004, 62072250) and the Talent Launch Fund of Nanjing University of Information Science and Technology (2020r061).
Abstract: Encrypted traffic classification has become a hot issue in network security research. The class imbalance problem of traffic samples often degrades the performance of machine-learning-based classifiers. Although the Generative Adversarial Network (GAN) method can generate new samples by learning the feature distribution of the original samples, it is confronted with the problems of unstable training and mode collapse. To this end, a novel data augmentation approach called Graph CWGAN-GP is proposed in this paper. The traffic data is first converted into grayscale images as the input for the proposed model. Then, the minority class data is augmented with our proposed model, which is built by introducing conditional constraints and a new distance metric into a typical GAN. Finally, a classical deep learning model is adopted as the classifier to classify datasets augmented by the Conditional GAN (CGAN), Wasserstein GAN-Gradient Penalty (WGAN-GP), and Graph CWGAN-GP, respectively. Compared with state-of-the-art GAN methods, Graph CWGAN-GP can not only control the modes of the data to be generated, but also overcome the problem of unstable training and generate more realistic and diverse samples. The experimental results show that the classification precision, recall, and F1-Score of the minority class in the balanced dataset augmented in this paper have improved by more than 2.37%, 3.39%, and 4.57%, respectively.
Abstract: Diabetes is one of the world's most common diseases and is caused by persistently high blood sugar levels. The risk posed by diabetes can be lowered if the disease is detected at an early stage. Recently, several machine learning models have been developed to predict the presence of diabetes at an early stage. In this paper, we propose an embedded-based machine learning model that combines the split-vote method and instance duplication to make better use of an imbalanced dataset, PIMA Indian, and improve the prediction of diabetes. The proposed method uses both over-sampling and under-sampling, along with model weighting, to improve classification performance. Different measures such as Accuracy, Precision, Recall, and F1-Score are used to evaluate the model. The results we obtained using K-Nearest Neighbor (kNN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Logistic Regression (LR), and Decision Trees (DT) were 89.32%, 91.44%, 95.78%, 89.3%, 81.76%, and 80.38%, respectively. The SVM model performs better than the other models, and its result is 21.38% higher than those of existing machine learning-based works.
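The combination of over-sampling by instance duplication and under-sampling mentioned in the abstract above can be sketched as follows. This is a generic random resampler, not the paper's split-vote method, which the abstract does not specify; the function name `resample_to_size` and the per-class `target` parameter are hypothetical.

```python
import random
from collections import Counter, defaultdict


def resample_to_size(X, y, target, seed=0):
    """Return a dataset with exactly `target` samples per class.

    Classes larger than `target` are randomly under-sampled (without
    replacement); smaller classes keep all their samples and are topped up
    by duplicating randomly chosen instances.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for xi, yi in zip(X, y):
        by_class[yi].append(xi)

    X_out, y_out = [], []
    for label, rows in by_class.items():
        if len(rows) >= target:
            chosen = rng.sample(rows, target)  # under-sampling
        else:
            extra = [rng.choice(rows) for _ in range(target - len(rows))]
            chosen = rows + extra  # over-sampling by duplication
        X_out.extend(chosen)
        y_out.extend([label] * target)
    return X_out, y_out


# Toy data: 8 negative and 2 positive samples, rebalanced to 5 per class.
X = [[i] for i in range(10)]
y = [0] * 8 + [1] * 2
X_bal, y_bal = resample_to_size(X, y, target=5)
class_counts = Counter(y_bal)
```

Resampling should be applied to the training split only, so that duplicated instances do not leak into the evaluation data.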
Funding: Primarily supported by the Ministry of Higher Education through the MRUN Young Researchers Grant Scheme (MY-RGS), MR001-2019, entitled "Climate Change Mitigation: Artificial Intelligence-Based Integrated Environmental System for Mangrove Forest Conservation," received by K.H., S.A.R., H.F.H., M.I.M., and M.M.A.; secondarily funded by the UM-RU Grant, ST065-2021, entitled "Climate Smart Mitigation and Adaptation: Integrated Climate Resilience Strategy for Tropical Marine Ecosystem."
Abstract: A common difficulty in building prediction models with real-world environmental datasets is the skewed distribution of classes. There are significantly more samples for day-to-day classes, while rare events such as polluted classes are uncommon. Consequently, the limited availability of minority outcomes lowers the classifier's overall reliability. This study assesses the capability of machine learning (ML) algorithms to handle imbalanced water quality data, based on the metrics of precision, recall, and F1 score. It aims to counter accuracy figures that are misleadingly skewed toward the majority class. To this end, the performance of 10 ML algorithms is compared: AdaBoost, Support Vector Machine, Linear Discriminant Analysis, k-Nearest Neighbors, Naive Bayes, Decision Trees, Random Forest, Extra Trees, Bagging, and the Multilayer Perceptron. The study also uses the Easy Ensemble Classifier, Balanced Bagging, and RUSBoost algorithms to evaluate multi-class imbalanced learning methods. The comparison shows that a high-accuracy machine learning model is not always good in recall and sensitivity. The paper's stacked ensemble deep learning (SE-DL) generalization model effectively classifies the water quality index (WQI) based on 23 input variables. The proposed algorithm achieved averages of 95.69%, 94.96%, 92.92%, and 93.88% for accuracy, precision, recall, and F1 score, respectively. In addition, the proposed model is compared against two state-of-the-art classifiers, XGBoost (eXtreme Gradient Boosting) and the Light Gradient Boosting Machine, with balanced accuracy and G-mean included as performance metrics. The experiments showed that XGBoost achieves a higher balanced accuracy and G-mean. However, the SE-DL model has a better and more balanced performance in the F1 score. The SE-DL model aligns with the goal of this study, which is to ensure a balance between accuracy and completeness for each water quality class. The proposed algorithm also achieves higher efficiency at a lower computational time than the standard Synthetic Minority Oversampling Technique (SMOTE) approach to imbalanced datasets.
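The abstract above relies on balanced accuracy and G-mean because plain accuracy can look good while a minority class is poorly recognized. The sketch below computes both from per-class recall, using the common definitions (arithmetic and geometric mean of the per-class recalls); the helper names are illustrative and not taken from the paper.

```python
import math


def per_class_recall(y_true, y_pred):
    """Recall for every class that occurs in y_true."""
    recalls = {}
    for c in sorted(set(y_true)):
        support = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / support
    return recalls


def balanced_accuracy(y_true, y_pred):
    """Arithmetic mean of the per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return sum(r.values()) / len(r)


def g_mean(y_true, y_pred):
    """Geometric mean of the per-class recalls."""
    r = per_class_recall(y_true, y_pred)
    return math.prod(r.values()) ** (1 / len(r))


# Toy example: 8 majority samples all correct, 1 of 2 minority samples missed.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 8 + [0, 1]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
bal_acc = balanced_accuracy(y_true, y_pred)
gm = g_mean(y_true, y_pred)
```

On this toy example plain accuracy is 0.9, while balanced accuracy is 0.75 and G-mean is about 0.707, which exposes the missed minority sample that plain accuracy hides.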
Funding: Supported by the National Key R&D Program of China under grant number 2018YFB1003205; by the National Natural Science Foundation of China under grant numbers U1836208, U1536206, U1836110, 61602253, and 61672294; by the Startup Foundation for Introducing Talent of NUIST (1441102001002); by the Jiangsu Basic Research Programs-Natural Science Foundation under grant number BK20181407; by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; by the Postgraduate Research and Innovation Plan Project in Jiangsu Province under grant number KYCX20_0934; and by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China.
Abstract: Over the past 10 years, lightning disasters have caused a large number of casualties and considerable economic loss worldwide. Lightning poses a huge threat to various industries. To reduce the risk of lightning-caused disasters, many scholars have carried out in-depth research on lightning. However, these studies focus primarily on the lightning itself and ignore other meteorological elements. In addition, existing methods for assessing the risk of lightning disasters fail to give detailed attention to regional features (lightning disaster risk). This paper proposes a grid-based risk assessment method based on data from multiple sources. First, it considers the impact of lightning, population density, the economy, and geographical environment data on the occurrence of lightning disasters. Second, it solves the problem of imbalanced lightning disaster data in geographic grid samples using the K-means clustering algorithm. Third, it calculates the lightning disaster features of each small field with a neural network structure, and the results are then visualized in a zoning map using the Jenks natural breaks algorithm. The experimental results show that the method can solve the problem of imbalanced lightning disaster data and offers 81% accuracy in lightning disaster risk assessment.
Funding: Supported by the National Natural Science Foundation of China (No. 61873032).
Abstract: Diagnosis methods based on machine learning and deep learning are widely used in the field of motor fault diagnosis. However, the data imbalance caused by the high cost of obtaining fault data leads to insufficient generalization performance of such diagnosis methods. In response to this problem, a motor fault monitoring system is proposed. It includes a fault diagnosis method (Xgb_LR), based on a fusion model of the optimized gradient boosting decision tree (Xgboost) and logistic regression (LR), and a data augmentation method named data simulation neighborhood interpolation (DSNI). The Xgb_LR method combines the advantages of the two models and adapts well to imbalanced data. In addition, the DSNI method can serve as an auxiliary to the diagnosis method, reducing the impact of data imbalance by expanding the original data (signal). Simulation experiments verify the effectiveness of the proposed methods.
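To give a feel for augmentation by neighborhood interpolation, the sketch below creates synthetic minority samples on the line segment between a sample and its nearest minority neighbour, in the style of SMOTE. It is not the DSNI method itself, whose details the abstract above does not give; the function name, the single-nearest-neighbour choice, and the toy fault features are assumptions.

```python
import math
import random


def interpolate_minority(samples, n_new, seed=0):
    """Create n_new synthetic points between minority samples and their
    nearest minority neighbour (Euclidean distance). Needs >= 2 samples."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(samples)
        # Nearest other sample in the same (minority) class.
        neighbour = min(
            (s for s in samples if s is not base),
            key=lambda s: math.dist(base, s),
        )
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class: three hypothetical fault samples with two features each.
faults = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
new_faults = interpolate_minority(faults, n_new=5)
```

Because every synthetic point lies between two real samples, the augmented data stays inside the region already covered by the minority class rather than extrapolating beyond it.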
Funding: Funded by the National Science Foundation of China (62006068), the Hebei Natural Science Foundation (A2021402008), the Natural Science Foundation of Scientific Research Project of Higher Education in Hebei Province (ZD2020185, QN2020188), and the 333 Talent Supported Project of Hebei Province (C20221026).
Abstract: Imbalanced datasets are common in practical applications, and oversampling methods that use fuzzy rules have been shown to enhance classification performance on imbalanced data by taking into account the relationships between data attributes. However, the creation of fuzzy rules typically depends on expert knowledge, which may be subjective and may not fully exploit the label information in the training data. To address this issue, a novel fuzzy rule oversampling approach is developed based on the learning vector quantization (LVQ) algorithm. In this method, the label information of the training data is used to determine the antecedent part of If-Then fuzzy rules by dynamically dividing attribute intervals using LVQ. Subsequently, fuzzy rules are generated and adjusted to calculate rule weights. The number of new samples to be synthesized for each rule is then computed, and minority-class samples are synthesized based on the newly generated fuzzy rules. The result is a fuzzy rule oversampling method based on LVQ. To evaluate its effectiveness, comparative experiments are conducted on 12 publicly available imbalanced datasets against five other sampling techniques, in combination with the support function machine. The experimental results show that the proposed method significantly improves the classification results across seven performance indicators, including gains of 2.15% to 12.34% in Accuracy, 6.11% to 27.06% in G-mean, and 4.69% to 18.78% in AUC. These results show that the proposed method improves classification performance on imbalanced data more effectively.