The study of machine learning has revealed that it can unlock new applications in a variety of disciplines. Many limitations restrict the expressiveness of these techniques, and researchers are working to overcome them to fully exploit the power of data-driven machine learning (ML) and deep learning (DL) techniques. Data imbalance presents major hurdles for classification and prediction problems in machine learning, restricting data analytics and the acquisition of relevant insights in practically all real-world research domains. In visual learning, network information security, failure prediction, digital marketing, healthcare, and a variety of other domains, raw data suffers from a distribution biased toward one class over the other. This article presents a taxonomy of approaches for handling imbalanced data problems, together with a comparative study of their classification metrics and application areas. We explore very recent trends in techniques for solving class imbalance problems in datasets and discuss their limitations. The article also identifies open challenges for further research on class imbalance.
Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this, the statistical properties of the data change over time, leading to class imbalance and concept drift issues. Both issues cause model performance degradation. Most current work has focused on developing an ensemble strategy that trains a new classifier on the latest data to resolve the issue. These techniques suffer while training the new classifier if the data is imbalanced. Also, the class imbalance ratio may change greatly from one input stream to another, making the problem more complex. The existing solutions proposed for the combined issue of class imbalance and concept drift lack an understanding of how one problem correlates with the other. This work studies the association between concept drift and class imbalance ratio and then demonstrates how changes in class imbalance ratio along with concept drift affect the classifier's performance. We analyzed the effect of both issues on minority and majority classes individually. To do this, we conducted experiments on benchmark datasets using state-of-the-art classifiers especially designed for data stream classification. Precision, recall, F1 score, and geometric mean were used to measure performance. Our findings show that when class imbalance and concept drift occur together, performance can decrease by up to 15%. Our results also show that an increase in the imbalance ratio can cause a 10% to 15% decrease in the precision scores of both minority and majority classes. The study findings may help in designing intelligent and adaptive solutions that can cope with the challenges of non-stationary data streams, such as concept drift and class imbalance.
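The four metrics named in this abstract can be computed per class from a confusion matrix. The following is a minimal sketch, not the authors' evaluation code: the function name `binary_metrics` and the toy labels are assumptions made for illustration, and the geometric mean is taken in its common form, the square root of recall times specificity.

```python
import math

def binary_metrics(y_true, y_pred, positive=1):
    """Precision, recall, F1 and G-mean, treating `positive` as the class of interest."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    # G-mean balances accuracy on the positive and the negative class.
    g_mean = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "g_mean": g_mean}

# Hypothetical toy stream: 2 minority (1) and 8 majority (0) samples.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
minority = binary_metrics(y_true, y_pred, positive=1)
majority = binary_metrics(y_true, y_pred, positive=0)
```

Evaluating each class as the positive class in turn, as above, is one way to report minority and majority performance separately.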
Pneumonia is an acute lung infection that has caused many fatalities globally. Radiologists often employ chest X-rays to identify pneumonia since they are presently the most effective imaging method for this purpose. Computer-aided diagnosis of pneumonia using deep learning techniques is widely used due to its effectiveness and performance. In the proposed method, the Synthetic Minority Oversampling Technique (SMOTE) is used to eliminate the class imbalance in the X-ray dataset. To compensate for the paucity of accessible data, pre-trained transfer learning is used, and an ensemble Convolutional Neural Network (CNN) model is developed. The ensemble model consists of all possible combinations of the MobileNetV2, Visual Geometry Group (VGG16), and DenseNet169 models. MobileNetV2 and DenseNet169 performed well as single classifiers, with an accuracy of 94%, while the ensemble model (MobileNetV2+DenseNet169) achieved an accuracy of 96.9%. Using the data synchronous parallel model in Distributed TensorFlow, the training process accelerated performance by 98.6% and outperformed other conventional approaches.
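The core idea of SMOTE is to create synthetic minority samples by interpolating between a minority sample and one of its nearest minority neighbours. The sketch below illustrates that idea on small 2-D points rather than X-ray images; the function name `smote`, the parameter defaults, and the toy data are assumptions for illustration and do not reproduce the paper's pipeline.

```python
import random

def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points on segments between minority neighbours."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        others = [p for j, p in enumerate(minority) if j != i]
        # k nearest minority neighbours of the chosen base sample.
        neighbours = sorted(others, key=lambda p: sq_dist(base, p))[:k]
        neighbour = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class feature vectors.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
```

Because every synthetic point lies on a segment between two real minority samples, the new points stay inside the region the minority class already occupies.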
The popularity of the Internet of Things (IoT) has enabled a large number of vulnerable devices to connect to the Internet, bringing huge security risks. As a network-level security authentication method, device fingerprinting based on machine learning has attracted considerable attention because it can detect vulnerable devices in complex and heterogeneous access phases. However, flexible and diversified IoT devices with limited resources make device fingerprint authentication harder to run in IoT, because the model network must be retrained to handle incremental features or types. To address this problem, a device fingerprinting mechanism based on a Broad Learning System (BLS) is proposed in this paper. The mechanism first characterizes IoT devices by traffic analysis based on the identifiable differences in their traffic data, and extracts feature parameters of the traffic packets. A hierarchical hybrid sampling method is designed for the preprocessing phase to improve the imbalanced data distribution and reconstruct the fingerprint dataset. The complexity of the dataset is reduced using Principal Component Analysis (PCA), and the device type is identified by training weights using BLS. The experimental results show that the proposed method can achieve state-of-the-art accuracy with less training time than other existing methods.
Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms. In supervised learning, dealing with class imbalance is still considered a challenging research problem. Various machine learning techniques are designed to operate on balanced datasets; therefore, different undersampling, oversampling, and hybrid strategies have been proposed in the state of the art to deal with imbalanced datasets, but highly skewed datasets still pose problems of generalization and noise generation during resampling. To overcome these problems, this paper proposes a majority clustering model for the classification of imbalanced datasets known as MCBC-SMOTE (Majority Clustering for Balanced Classification-SMOTE). The model provides a method to convert a binary classification problem into a multi-class problem. In the proposed algorithm, the number of clusters for the majority class is calculated using the elbow method, and the minority class is oversampled to the average size of the clustered majority classes to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces the problem of noise generation, and successfully removes the imbalances present between and within classes. Evaluations on diverse real datasets showed better classification results than existing state-of-the-art methodologies on several performance metrics.
With the rise of internet facilities, the number of people making online transactions has grown exponentially in recent years, as online transaction systems have eliminated the need to go to the bank physically for every transaction. However, fraud cases have also increased, causing consumers to lose money. Hence, an effective fraud detection system that can detect fraudulent transactions automatically in real time is the need of the hour. Generally, genuine transactions far outnumber fraudulent ones, which leads to the class imbalance problem. In this research work, an online transaction fraud detection system using deep learning is proposed that handles the class imbalance problem by applying algorithm-level methods, which modify the learning of the model to focus more on the minority class, i.e., fraudulent transactions. A novel loss function named Weighted Hard-Reduced Focal Loss (WH-RFL) is proposed, which achieves the maximum fraud detection rate, i.e., True Positive Rate (TPR), at the cost of misclassifying a few genuine transactions, since a high TPR is preferred over a high True Negative Rate (TNR) in a fraud detection system; this is demonstrated using three publicly available imbalanced transactional datasets. Thresholding is also applied to optimize the decision threshold using cross-validation so as to detect the maximum number of frauds, and the experimental results demonstrate that selecting the right thresholding method with deep learning yields better results.
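The abstract does not define WH-RFL, so the sketch below shows the standard class-weighted binary focal loss instead, as an example of an algorithm-level method that shifts learning toward the minority class. The function name `focal_loss` and the defaults `alpha=0.75` and `gamma=2.0` are assumptions for illustration, not values from the paper.

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0, eps=1e-12):
    """Standard focal loss for one sample.

    p: predicted probability of the fraud class (label 1)
    y: true label, 1 for fraud and 0 for genuine
    alpha: weight on the fraud class (assumed value)
    gamma: focusing parameter that down-weights easy samples (assumed value)
    """
    p_t = p if y == 1 else 1.0 - p          # probability given to the true class
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(max(p_t, eps))

# A missed fraud costs more than an equally confident false alarm.
missed_fraud = focal_loss(0.1, 1)
false_alarm = focal_loss(0.9, 0)
```

With `gamma=0` and `alpha=0.5` the expression reduces to half the ordinary cross-entropy, which makes the effect of the two extra terms easy to isolate.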
In recent years, the detection of fake job descriptions has become increasingly necessary because social networking has changed the way people access burgeoning information in the internet age. Identifying fraud in job descriptions can help jobseekers avoid many of the risks of job hunting. However, detecting fake job descriptions runs into the problem of class imbalance, because the number of genuine jobs exceeds the number of fake jobs. This reduces the predictability and performance of traditional machine learning models. We therefore present an efficient framework that uses an oversampling technique, called FJD-OT (Fake Job Description Detection Using Oversampling Techniques), to improve the predictability of detecting fake job descriptions. In the proposed framework, the first module preprocesses the text data with several techniques, including the removal of stop words and the use of a tokenizer. The second module uses a bag of words in combination with the term frequency-inverse document frequency (TF-IDF) approach to extract features from the text data and create the feature dataset. Next, the framework applies k-fold cross-validation, a commonly used technique for testing the effectiveness of machine learning models, which splits the experimental dataset (the Employment Scam Aegean (ESA) dataset in our study) into training and test sets for evaluation. The training set is passed through the third module, an oversampling module in which the SVMSMOTE method is used to balance the data before the classifiers are trained in the last module. The experimental results indicate that the proposed approach significantly improves the predictability of fake job description detection on the ESA dataset based on several popular performance metrics.
The emergence of digital networks and the wide adoption of information on internet platforms have given rise to threats against users' private information. Many intruders actively seek such private data either for sale or for other inappropriate purposes. Similarly, national and international organizations have country-level and company-level private information that could be accessed through different network attacks. Therefore, a Network Intruder Detection System (NIDS) becomes essential for protecting these networks and organizations. In the evolution of NIDS, Artificial Intelligence (AI) assisted tools and methods have been widely adopted to provide effective solutions. However, the development of NIDS still faces challenges at the dataset and machine learning levels, such as large deviations in numeric features, the presence of numerous irrelevant categorical features resulting in reduced cardinality, and class imbalance in multiclass-level data. To address these challenges and offer a unified solution to NIDS development, this study proposes a novel framework that preprocesses datasets and applies a Box-Cox transformation to transform the numeric features and bring them into closer alignment. Cardinality reduction was applied to categorical features through the binning method. Subsequently, the class imbalance in the dataset was addressed using the adaptive synthetic sampling data generation method. Finally, the preprocessed, refined, and oversampled feature set was divided into training and test sets with an 80-20 ratio, and two experiments were conducted. In Experiment 1, binary classification was executed using four machine learning classifiers, with the extra trees classifier achieving the highest accuracy of 97.23% and an AUC of 0.9961. In Experiment 2, multiclass classification was performed, and the extra trees classifier emerged as the most effective, achieving an accuracy of 81.27% and an AUC of 0.97. The results were evaluated based on training, testing, and total time, and a comparative analysis with state-of-the-art studies proved the robustness and significance of the applied methods in developing a timely and precision-efficient solution to NIDS.
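For reference, the Box-Cox transformation mentioned above is a power transform on strictly positive values, commonly used to compress heavily skewed numeric features. The sketch below implements the standard one-parameter definition; the function name `box_cox` and the example feature values are assumptions, and how the paper selects the parameter `lam` is not stated in the abstract.

```python
import math

def box_cox(x, lam):
    """Standard one-parameter Box-Cox transform of a positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires strictly positive input")
    if lam == 0:
        return math.log(x)          # limiting case as lam approaches 0
    return (x ** lam - 1.0) / lam

# Hypothetical skewed feature (e.g. byte counts spanning several orders of magnitude).
raw = [1.0, 10.0, 100.0, 10000.0]
transformed = [box_cox(v, 0.0) for v in raw]
```

With `lam = 0` the transform is the natural logarithm, so values that differ by factors of ten end up evenly spaced.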
Class imbalance is a common characteristic of industrial data that adversely affects industrial data mining because it leads to the biased training of machine learning models. To address this issue, the augmentation of samples in minority classes based on generative adversarial networks (GANs) has been demonstrated as an effective approach. This study proposes a novel GAN-based minority class augmentation approach named classifier-aided minority augmentation generative adversarial network (CMAGAN). In the CMAGAN framework, an outlier elimination strategy is first applied to each class to minimize the negative impacts of outliers. Subsequently, a newly designed boundary-strengthening learning GAN (BSLGAN) is employed to generate additional samples for minority classes. By incorporating a supplementary classifier and innovative training mechanisms, the BSLGAN focuses on learning the distribution of samples near classification boundaries. Consequently, it can fully capture the characteristics of the target class and generate highly realistic samples with clear boundaries. Finally, the new samples are filtered based on the Mahalanobis distance to ensure that they are within the desired distribution. To evaluate the effectiveness of the proposed approach, CMAGAN was used to solve the class imbalance problem in eight real-world fault-prediction applications. The performance of CMAGAN was compared with that of seven other algorithms, including state-of-the-art GAN-based methods, and the results indicated that CMAGAN could provide higher-quality augmented results.
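The final filtering step can be illustrated with a small Mahalanobis-distance check. This is only a sketch under stated assumptions: it handles 2-D features with an explicit 2x2 covariance inverse, the function names are hypothetical, and the cut-off `max_dist=3.0` is an assumed value, since the abstract does not give the threshold CMAGAN uses.

```python
import math

def mahalanobis_2d(point, samples):
    """Mahalanobis distance from `point` to the distribution of 2-D `samples`."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy * sxy
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    dx, dy = point[0] - mx, point[1] - my
    # d^T * inverse(covariance) * d, written out for the 2x2 case.
    return math.sqrt((syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det)

def filter_generated(generated, real, max_dist=3.0):
    """Keep generated samples that lie within max_dist of the real class."""
    return [g for g in generated if mahalanobis_2d(g, real) <= max_dist]

# Hypothetical real minority samples and two generated candidates.
real = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
kept = filter_generated([(0.6, 0.4), (5.0, 5.0)], real)
```

Unlike a plain Euclidean cut-off, the Mahalanobis distance scales each direction by the spread of the real data, so a candidate is judged relative to the shape of the class it is supposed to belong to.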
Lithofacies classification is essential for oil and gas reservoir exploration and development. The traditional method of lithofacies classification is based on "core calibration logging" and the experience of geologists. This approach has strong subjectivity, low efficiency, and high uncertainty. This uncertainty may be one of the key factors affecting the results of 3D modeling of tight sandstone reservoirs. In recent years, deep learning, a cutting-edge artificial intelligence technology, has attracted attention from various fields. However, deep-learning techniques have not been studied sufficiently in the field of lithofacies classification. Therefore, this paper proposes a novel hybrid deep-learning model that combines the efficient feature-extraction ability of convolutional neural networks (CNN) with the excellent ability of long short-term memory networks (LSTM) to describe time-dependent features, and uses it to conduct lithofacies classification experiments. The results of a series of experiments show that the hybrid CNN-LSTM model had an average accuracy of 87.3% and the best classification effect compared with the CNN, the LSTM, and three commonly used machine learning models (support vector machine, random forest, and gradient boosting decision tree). In addition, the borderline synthetic minority oversampling technique (BSMOTE) is introduced to address the class-imbalance issue of the raw data. The results show that balancing the processed data can significantly improve the accuracy of lithofacies classification. Besides, based on the fine lithofacies constraints, the sequential indicator simulation method is used to establish a three-dimensional lithofacies model, which completes the fine description of the spatial distribution of tight sandstone reservoirs in the study area. According to this comprehensive analysis, the proposed CNN-LSTM model, which eliminates class imbalance, can be effectively applied to lithofacies classification and is expected to improve the realism of the geological model for tight sandstone reservoirs.
Machine Learning (ML) techniques have been widely applied in recent traffic classification. However, the problems of discriminator bias and class imbalance decrease the accuracy of ML-based traffic classifiers. In this paper, we propose an accurate and extensible traffic classifier. Specifically, to address the discriminator bias issue, our classifier is built as an optimal cascade of binary sub-classifiers, where each binary sub-classifier is trained independently with the discriminators used for identifying application-specific traffic. Moreover, to balance the training dataset, we apply the SMOTE algorithm to generate artificial training samples for minority classes. We evaluate our classifier on two datasets collected from different network border routers. Compared with previous multi-class traffic classifiers built in a one-time training process, our classifier achieves much higher F-Measure and AUC for each application.
The Internet of Medical Things (IoMT) will come to be of great importance in the mediation of medical disputes, as it is emerging as the core of intelligent medical treatment. First, IoMT can track the entire medical treatment process in order to provide detailed trace data for medical dispute resolution. Second, IoMT can infiltrate the ongoing treatment and provide timely intelligent decision support to medical staff. This information includes recommendations of similar historical cases, guidance for medical treatment, alerts about hired dispute profiteers, etc. The multi-label classification of medical dispute documents (MDDs) plays an important role as a front-end process for intelligent decision support, especially in the recommendation of similar historical cases. However, MDDs usually appear as long texts containing a large amount of redundant information, and there is a serious distribution imbalance in the dataset, which directly leads to weaker classification performance. Accordingly, in this paper, a multi-label classification method based on key sentence extraction is proposed for MDDs. The method is divided into two parts. First, an attention-based hierarchical bi-directional long short-term memory (BiLSTM) model is used to extract key sentences from documents; second, random comprehensive sampling Bagging (RCS-Bagging), an ensemble multi-label classification model, is employed to classify MDDs based on key sentence sets. The use of this approach greatly improves classification performance. Experiments show that the performance of the two models proposed in this paper is remarkably better than that of the baseline methods.
Near-infrared spectroscopy (NIRS) has been widely used in the discrimination (classification) of pharmaceutical drugs. In real applications, however, the class imbalance of the drug samples, i.e., the number of samples of one drug may be much larger than the number of samples of the other drugs, drastically decreases the discrimination performance of the classification models. To address this class imbalance problem, a new computational method, the scaled convex hull (SCH)-based maximum margin classifier, is proposed in this paper. With a suitable selection of the reduction factors of the SCHs generated by the two classes of drug samples, a maximum margin classifier between the SCHs can be constructed that obtains good classification performance. With the parameters involved in the modeling optimized by Cuckoo Search, a satisfactory model is achieved for the classification of the drug. Experiments on spectra samples produced by a pharmaceutical company show that the proposed method is more effective and robust than the existing ones.
This paper presents a review of the ensemble learning models proposed for web services classification, selection, and composition. Web services are an evolutionary research area, and ensemble learning has become a hot spot for assessing the aspects of web services mentioned above. The proposed research aims to review the state-of-the-art approaches applied in the web services area. The literature on the research topic is examined using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) as a research method. The study reveals an increasing trend of using ensemble learning in the chosen papers within the last ten years. Naïve Bayes (NB), Support Vector Machine (SVM), and other classifiers were identified as widely explored in the selected studies. Core analysis of web services classification suggests that the performance aspects of web services can be investigated in future works. This paper also identifies performance metrics widely used in the literature, including accuracy, precision, recall, and F-measure.
Classification of sheep behaviour from a sequence of tri-axial accelerometer data has the potential to enhance sheep management. Sheep behaviour is inherently imbalanced (e.g., more ruminating than walking), resulting in underperforming classification for the minority activities, which hold importance. Existing works have not addressed class imbalance and use traditional machine learning techniques, e.g., Random Forest (RF). We investigated Deep Learning (DL) models appropriate for sequential data, namely Long Short Term Memory (LSTM) and Bidirectional LSTM (BLSTM), trained on imbalanced data. Two data sets were collected in normal grazing conditions using jaw-mounted and ear-mounted sensors. Novel to this study, alongside typical single classes, e.g., walking, data samples were labelled with compound classes, e.g., walking_grazing, depending on the behaviours. The number of steps a sheep performed in the observed 10 s time window was also recorded and incorporated in the models. We designed several multi-class classification studies with imbalance being addressed using synthetic data. DL models achieved superior performance to traditional ML models, especially with augmented data (e.g., 4-Class+Steps: LSTM 88.0%, RF 82.5%). DL methods showed superior generalisability on unseen sheep (i.e., F1-score: BLSTM 0.84, LSTM 0.83, RF 0.65). LSTM, BLSTM and RF achieved sub-millisecond average inference time, making them suitable for real-time applications. The results demonstrate the effectiveness of DL models for sheep behaviour classification in grazing conditions. The results also demonstrate that the DL techniques can generalise across different sheep. The study presents a strong foundation for the development of such models for real-time animal monitoring.
Cloud Computing (CC) is the preference of all information technology (IT) organizations, as it offers pay-per-use and flexible services to its users. However, privacy and security have become the main hindrances to its success because of its distributed and open architecture, which is prone to intruders. An Intrusion Detection System (IDS) is one of the commonly utilized systems for detecting attacks on the cloud. IDS proves to be an effective and promising technique that identifies malicious activities and known threats by observing traffic data in computers, and issues warnings when such threats are identified. The current mainstream IDS are assisted by machine learning (ML) but suffer from low detection rates and demand extensive feature engineering. This article devises an Enhanced Coyote Optimization with Deep Learning based Intrusion Detection System for Cloud Security (ECODL-IDSCS) model. The ECODL-IDSCS model initially addresses the class imbalance problem by using the Adaptive Synthetic (ADASYN) technique. For the detection and classification of intrusions, a long short-term memory (LSTM) model is exploited. In addition, the ECO algorithm is derived to optimally fine-tune the hyperparameters of the LSTM model to enhance its detection efficiency in the cloud environment. When the presented ECODL-IDSCS model is tested on a benchmark dataset, the experimental results show its promising performance over existing IDS models.
Oversampling is the most widely used approach for dealing with class-imbalanced datasets, as seen in the plethora of oversampling methods developed in the last two decades. In this editorial, we discuss the issues with oversampling that stem from the possibility of overfitting and from the generation of synthetic cases that might not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data, as well as a future work perspective.
Imbalance classification techniques have been frequently applied in many machine learning application domains where the number of samples in the majority (or positive) class of a dataset is much larger than that in the minority (or negative) class. Meanwhile, feature selection (FS) is one of the key techniques for high-dimensional classification tasks, as it greatly improves classification performance and computational efficiency. However, most studies of feature selection and imbalance classification are restricted to off-line batch learning, which is not well adapted to some practical scenarios. In this paper, we aim to solve the high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier that involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such an online learner as an optimization problem and use an iterative approach to solve it based on the passive-aggressive (PA) algorithm as well as a truncated gradient (TG) method. We evaluate the performance of the proposed algorithms on several real-world datasets, and our experimental results demonstrate the effectiveness of the proposed algorithms in comparison with the baselines.
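To make the idea of an online learner with a fixed feature budget concrete, the sketch below performs one classic passive-aggressive (PA-I) update and then keeps only the largest-magnitude weights. This is a simplification and not the paper's algorithm: the hard top-`budget` truncation stands in for the truncated gradient method, no imbalance-aware weighting is included, and the function name and parameter values are assumptions.

```python
def pa_truncated_update(w, x, y, C=1.0, budget=2):
    """One PA-I step on sample (x, y) with y in {-1, +1}, then hard truncation.

    C: aggressiveness cap of PA-I (assumed value)
    budget: number of non-zero weights allowed after the step (assumed value)
    """
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)           # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss > 0.0 and norm_sq > 0.0:
        tau = min(C, loss / norm_sq)        # PA-I step size
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    # Keep only the `budget` weights with the largest absolute value.
    order = sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)
    keep = set(order[:budget])
    return [wi if i in keep else 0.0 for i, wi in enumerate(w)]

# Hypothetical 4-feature sample arriving in the stream.
w = pa_truncated_update([0.0, 0.0, 0.0, 0.0], [2.0, 0.1, -1.0, 0.0], y=1)
```

The update is "passive" when the sample is already classified with margin at least 1 (the weights are left unchanged before truncation) and "aggressive" otherwise.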
Recently, Financial Technology (FinTech) has received more attention from financial sectors and researchers seeking effective solutions for any financial institution or firm. Financial crisis prediction (FCP) is an essential topic in the business sector, where it is useful for identifying the financial condition of a financial institution. At the same time, the development of the Internet of Things (IoT) has altered the mode of human interaction with the physical world. The IoT can be combined with the FCP model to examine financial data from users and support the decision-making process. This paper presents a novel multi-objective squirrel search optimization algorithm with stacked autoencoder (MOSSA-SAE) model for FCP in an IoT environment. The MOSSA-SAE model encompasses different subprocesses, namely preprocessing, class imbalance handling, parameter tuning, and classification. Primarily, the MOSSA-SAE model allows IoT devices such as smartphones and laptops to collect the financial details of the users, which are then transmitted to the cloud for further analysis. In addition, the SMOTE technique is employed to handle class imbalance problems. The goal of MOSSA in SMOTE is to determine the oversampling rate and the area of nearest neighbors of SMOTE. Besides, the SAE model is utilized as a classification technique to determine the class label of the financial data. At the same time, MOSSA is applied to appropriately select the 'weights' and 'bias' values of the SAE. An extensive experimental validation process is performed on a benchmark financial dataset, and the results are examined under distinct aspects. The experimental values confirmed the superior performance of the MOSSA-SAE model on the applied dataset.
For classification problems in practice, one of the challenging issues is obtaining enough labeled data for training. Moreover, even if such labeled data has been sufficiently accumulated, most datasets exhibit a long-tailed distribution with heavy class imbalance, which results in a model biased towards the majority class. To alleviate such class imbalance, semi-supervised learning methods using additional unlabeled data have been considered. However, their accuracy is typically much lower than that of supervised learning. In this study, under the assumption that additional unlabeled data is available, we propose iterative semi-supervised learning algorithms that iteratively correct the labeling of the extra unlabeled data based on softmax probabilities. The results show that the proposed algorithms provide accuracy as high as that of supervised learning. To validate the proposed algorithms, we tested two scenarios: with a balanced unlabeled dataset and with an imbalanced unlabeled dataset. Under both scenarios, our proposed semi-supervised learning algorithms provided higher accuracy than previous state-of-the-art methods. Code is available at https://github.com/HeewonChung92/iterative-semi-learning.
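One round of labeling unlabeled data from softmax probabilities can be sketched as below. This is a generic confidence-thresholded pseudo-labeling step, not the algorithm in the linked repository: the function names, the threshold of 0.9, and the example logits are all assumptions for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def pseudo_label(logits_batch, threshold=0.9):
    """Return (sample index, label) for samples whose top probability >= threshold."""
    labelled = []
    for i, logits in enumerate(logits_batch):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        if probs[best] >= threshold:
            labelled.append((i, best))
    return labelled

# Hypothetical model outputs for three unlabeled samples and three classes.
batch = [[4.0, 0.0, 0.0], [0.2, 0.1, 0.0], [0.0, 0.0, 5.0]]
confident = pseudo_label(batch)
```

In an iterative scheme, the confidently labeled samples would be added to the training set, the model retrained, and the remaining samples scored again, so earlier labels can be revised as the model improves.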
Abstract: The study of machine learning has revealed that it can unlock new applications in a variety of disciplines. Many limitations restrict the expressiveness of these methods, and researchers are working to overcome them to fully exploit the power of data-driven machine learning (ML) and deep learning (DL) techniques. Data imbalance presents major hurdles for classification and prediction problems in machine learning, restricting data analytics and the acquisition of relevant insights in practically all real-world research domains. In visual learning, network information security, failure prediction, digital marketing, healthcare, and a variety of other domains, raw data suffers from a biased distribution of one class over the other. This article presents a taxonomy of the approaches for handling imbalanced data problems, together with a comparative study of their classification metrics and application areas. We explore very recent trends in the techniques employed to solve class imbalance problems in datasets and discuss their limitations. The article also identifies open challenges for further research on class imbalance.
Funding: The authors would like to extend their gratitude to Universiti Teknologi PETRONAS (Malaysia) for funding this research through grant number 015LA0-037.
Abstract: Every application in a smart city environment, such as the smart grid, health monitoring, security, and surveillance, generates non-stationary data streams. Because of this, the statistical properties of the data change over time, leading to class imbalance and concept drift. Both issues degrade model performance. Most current work has focused on developing an ensemble strategy that trains a new classifier on the latest data to resolve the issue. These techniques suffer when the data used to train the new classifier is imbalanced. In addition, the class imbalance ratio may change greatly from one input stream to another, making the problem more complex. Existing solutions for the combined issue of class imbalance and concept drift lack an understanding of how one problem correlates with the other. This work studies the association between concept drift and the class imbalance ratio and then demonstrates how changes in the class imbalance ratio, along with concept drift, affect the classifier's performance. We analyzed the effect of both issues on the minority and majority classes individually. To do this, we conducted experiments on benchmark datasets using state-of-the-art classifiers designed specifically for data stream classification. Precision, recall, F1 score, and geometric mean were used to measure performance. Our findings show that when class imbalance and concept drift occur together, performance can decrease by up to 15%. Our results also show that an increase in the imbalance ratio can cause a 10% to 15% decrease in the precision scores of both minority and majority classes. The findings may help in designing intelligent and adaptive solutions that can cope with the challenges of non-stationary data streams, such as concept drift and class imbalance.
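The four metrics named in this abstract can all be derived from the confusion counts of a single class. The following is a minimal sketch, not the authors' evaluation code; the counts are hypothetical and chosen only to illustrate a minority class of 100 samples among 1,000.

```python
import math

def class_metrics(tp, fp, fn, tn):
    """Precision, recall, F1 and geometric mean for one class treated as positive."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    # G-mean balances sensitivity (recall) against specificity
    g_mean = math.sqrt(recall * specificity)
    return {"precision": precision, "recall": recall, "f1": f1, "g_mean": g_mean}

# Hypothetical counts for the minority class
minority = class_metrics(tp=60, fp=20, fn=40, tn=880)
```

Computing the same dictionary with the majority class treated as positive gives the per-class view the abstract describes.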
Abstract: Pneumonia is an acute lung infection that has caused many fatalities globally. Radiologists often employ chest X-rays to identify pneumonia since they are presently the most effective imaging method for this purpose. Computer-aided diagnosis of pneumonia using deep learning techniques is widely used due to its effectiveness and performance. In the proposed method, the Synthetic Minority Oversampling Technique (SMOTE) is used to eliminate the class imbalance in the X-ray dataset. To compensate for the paucity of accessible data, pre-trained transfer learning is used, and an ensemble Convolutional Neural Network (CNN) model is developed. The ensemble model consists of all possible combinations of the MobileNetV2, Visual Geometry Group (VGG16), and DenseNet169 models. MobileNetV2 and DenseNet169 performed well as single classifiers, with an accuracy of 94%, while the ensemble model (MobileNetV2+DenseNet169) achieved an accuracy of 96.9%. Using the data synchronous parallel model in Distributed TensorFlow, the training process accelerated performance by 98.6% and outperformed other conventional approaches.
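SMOTE, which several abstracts in this listing rely on, creates each synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours. The sketch below shows only that core step on small 2-D tuples; the function name, the toy points, and the parameter defaults are assumptions for illustration, and how a given paper applies it (for example to image features) is not stated in the abstract.

```python
import random

def smote(minority, n_new, k=2, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    point and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest minority neighbours of base, excluding base itself
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        gap = rng.random()  # position along the segment, in [0, 1)
        synthetic.append(tuple(a + gap * (b - a) for a, b in zip(base, neighbour)))
    return synthetic

# Hypothetical minority-class feature vectors
points = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(points, n_new=4)
```

Because every synthetic point lies on a segment between two real minority points, it always stays inside the region the minority class already occupies.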
Funding: Supported by the National Key R&D Program of China (2019YFB2102303), the National Natural Science Foundation of China (NSFC61971014, NSFC11675199), and the Young Backbone Teacher Training Program of Henan Colleges and Universities (2021GGJS170).
Abstract: The popularity of the Internet of Things (IoT) has enabled a large number of vulnerable devices to connect to the Internet, bringing huge security risks. As a network-level security authentication method, device fingerprinting based on machine learning has attracted considerable attention because it can detect vulnerable devices in complex and heterogeneous access phases. However, flexible and diversified IoT devices with limited resources increase the difficulty of executing device fingerprint authentication in IoT, because the model network must be retrained to deal with incremental features or types. To address this problem, a device fingerprinting mechanism based on a Broad Learning System (BLS) is proposed in this paper. The mechanism first characterizes IoT devices by traffic analysis, based on the identifiable differences in their traffic data, and extracts feature parameters from the traffic packets. A hierarchical hybrid sampling method is designed for the preprocessing phase to improve the imbalanced data distribution and reconstruct the fingerprint dataset. The complexity of the dataset is reduced using Principal Component Analysis (PCA), and the device type is identified by training weights using BLS. The experimental results show that the proposed method can achieve state-of-the-art accuracy and spend less training time than other existing methods.
Funding: This research was supported by Taif University Researchers Supporting Project number TURSP-2020/254, Taif University, Taif, Saudi Arabia.
Abstract: Datasets with an imbalanced class distribution are difficult to handle with standard classification algorithms. In supervised learning, dealing with class imbalance is still considered a challenging research problem. Various machine learning techniques are designed to operate on balanced datasets; therefore, in the state of the art, different undersampling, oversampling, and hybrid strategies have been proposed to deal with imbalanced datasets, but highly skewed datasets still pose problems of generalization and noise generation during resampling. To overcome these problems, this paper proposes a majority clustering model for the classification of imbalanced datasets known as MCBC-SMOTE (Majority Clustering for Balanced Classification-SMOTE). The model provides a method to convert a binary classification problem into a multi-class problem. In the proposed algorithm, the number of clusters for the majority class is calculated using the elbow method, and the minority class is oversampled to the average size of the clustered majority classes to generate a symmetrical class distribution. The proposed technique is cost-effective, reduces the problem of noise generation, and successfully removes the imbalances present between and within classes. Evaluations on diverse real datasets showed better classification results than existing state-of-the-art methodologies on several performance metrics.
Funding: This research was supported by a Korea Institute for Advancement of Technology (KIAT) grant funded by the Korea Government (MOTIE) (P0012724, The Competency Development Program for Industry Specialist) and by the Soonchunhyang University Research Fund.
Abstract: With the rise of internet facilities, the number of people making online transactions has grown exponentially in recent years, as online transaction systems have eliminated the need to visit a bank physically for every transaction. However, fraud cases have also increased, causing consumers to lose money. Hence, an effective fraud detection system that can detect fraudulent transactions automatically in real time is the need of the hour. Generally, genuine transactions far outnumber fraudulent ones, which leads to the class imbalance problem. In this research work, an online transaction fraud detection system using deep learning is proposed. It handles the class imbalance problem by applying algorithm-level methods that modify the learning of the model to focus more on the minority class, i.e., fraudulent transactions. A novel loss function named Weighted Hard-Reduced Focal Loss (WH-RFL) is proposed, which achieves the maximum fraud detection rate, i.e., True Positive Rate (TPR), at the cost of misclassifying a few genuine transactions, since a high TPR is preferred over a high True Negative Rate (TNR) in a fraud detection system. This is demonstrated using three publicly available imbalanced transactional datasets. In addition, thresholding is applied to optimize the decision threshold using cross-validation so as to detect the maximum number of frauds, and the experimental results demonstrate that selecting the right thresholding method with deep learning yields better results.
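The abstract does not define WH-RFL, so the sketch below uses the standard class-weighted binary focal loss as a stand-in to show how an algorithm-level method down-weights easy examples, followed by a simple threshold sweep that favours TPR. Both function names, the `alpha`, `gamma`, and `min_tnr` defaults, and the TNR floor criterion are assumptions for illustration, not the paper's method.

```python
import math

def focal_loss(p, y, alpha=0.75, gamma=2.0):
    """Standard binary focal loss for one sample.
    p is the predicted fraud probability, y is the true label (0 or 1)."""
    p_t = p if y == 1 else 1.0 - p
    alpha_t = alpha if y == 1 else 1.0 - alpha
    return -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)

def pick_threshold(scores, labels, min_tnr=0.9):
    """Return (threshold, TPR) with the highest TPR whose TNR is at least min_tnr."""
    best = (None, -1.0)
    pos = sum(labels)
    neg = len(labels) - pos
    for t in sorted(set(scores)):
        tpr = sum(s >= t and y == 1 for s, y in zip(scores, labels)) / pos
        tnr = sum(s < t and y == 0 for s, y in zip(scores, labels)) / neg
        if tnr >= min_tnr and tpr > best[1]:
            best = (t, tpr)
    return best

easy = focal_loss(0.95, 1)  # fraud case the model already scores highly
hard = focal_loss(0.30, 1)  # fraud case the model currently misses
```

In a cross-validated setting, a sweep like `pick_threshold` would be run on each validation fold's scores rather than on the training data.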
Abstract: In recent years, the detection of fake job descriptions has become increasingly necessary because social networking has changed the way people access burgeoning information in the internet age. Identifying fraud in job descriptions can help jobseekers avoid many of the risks of job hunting. However, detecting fake job descriptions runs into the problem of class imbalance, since the number of genuine jobs exceeds the number of fake jobs. This reduces the predictability and performance of traditional machine learning models. We therefore present an efficient framework that uses an oversampling technique, called FJD-OT (Fake Job Description Detection Using Oversampling Techniques), to improve the predictability of detecting fake job descriptions. In the first module of the proposed framework, we apply several techniques, including the removal of stop words and the use of a tokenizer, to preprocess the text data. In the second module, we use a bag of words in combination with the term frequency-inverse document frequency (TF-IDF) approach to extract features from the text data and create the feature dataset. Next, our framework applies k-fold cross-validation, a commonly used technique for testing the effectiveness of machine learning models, which splits the experimental dataset (the Employment Scam Aegean (ESA) dataset in our study) into training and test sets for evaluation. The training set is passed through the third module, an oversampling module in which the SVMSMOTE method is used to balance the data before the classifiers are trained in the last module. The experimental results indicate that the proposed approach significantly improves the predictability of fake job description detection on the ESA dataset according to several popular performance metrics.
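The TF-IDF feature extraction mentioned in the second module can be illustrated with the textbook weighting tf × log(N / df). This is a minimal sketch on hypothetical, already tokenised postings; the paper's exact variant (for example smoothing or normalisation) is not given in the abstract, and library implementations commonly use smoothed formulas.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per tokenised document,
    using tf * log(N / df) with tf as relative frequency."""
    n = len(docs)
    # Document frequency: number of documents containing each term
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({t: (c / len(doc)) * math.log(n / df[t]) for t, c in counts.items()})
    return weights

# Hypothetical tokenised job postings with stop words already removed
docs = [["earn", "money", "fast"], ["software", "engineer", "role"], ["earn", "bonus", "role"]]
vectors = tfidf(docs)
```

Terms that appear in only one posting ("money") receive a higher weight than terms shared across postings ("earn"), which is what makes the features discriminative.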
Abstract: The emergence of digital networks and the wide adoption of information on internet platforms have given rise to threats against users' private information. Many intruders actively seek such private data, either for sale or for other inappropriate purposes. Similarly, national and international organizations hold country-level and company-level private information that could be accessed through different network attacks. Therefore, a Network Intruder Detection System (NIDS) becomes essential for protecting these networks and organizations. In the evolution of NIDS, Artificial Intelligence (AI) assisted tools and methods have been widely adopted to provide effective solutions. However, the development of NIDS still faces challenges at the dataset and machine learning levels, such as large deviations in numeric features, the presence of numerous irrelevant categorical features resulting in reduced cardinality, and class imbalance in multiclass-level data. To address these challenges and offer a unified solution for NIDS development, this study proposes a novel framework that preprocesses datasets and applies a Box-Cox transformation to linearly transform the numeric features and bring them into closer alignment. Cardinality reduction was applied to categorical features through the binning method. Subsequently, the class-imbalanced dataset was addressed using the adaptive synthetic sampling data generation method. Finally, the preprocessed, refined, and oversampled feature set was divided into training and test sets with an 80–20 ratio, and two experiments were conducted. In Experiment 1, binary classification was executed using four machine learning classifiers, with the extra trees classifier achieving the highest accuracy of 97.23% and an AUC of 0.9961. In Experiment 2, multiclass classification was performed, and the extra trees classifier emerged as the most effective, achieving an accuracy of 81.27% and an AUC of 0.97. The results were evaluated based on training, testing, and total time, and a comparative analysis with state-of-the-art studies proved the robustness and significance of the applied methods in developing a timely and precision-efficient solution to NIDS.
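The Box-Cox transformation named in this abstract is a power transform for strictly positive values: (x^λ − 1)/λ for λ ≠ 0 and ln x for λ = 0. The sketch below shows the formula only; the choice of λ and the sample values are assumptions, and in practice λ is usually estimated from the data.

```python
import math

def box_cox(x, lam):
    """Box-Cox power transform of a strictly positive value x."""
    if x <= 0:
        raise ValueError("Box-Cox requires x > 0")
    return math.log(x) if lam == 0 else (x ** lam - 1.0) / lam

# Hypothetical numeric feature spanning several orders of magnitude
raw = [1.0, 10.0, 100.0, 1000.0]
transformed = [box_cox(v, 0) for v in raw]
```

With λ = 0 the widely spread raw values become evenly spaced, which is the kind of compression of large deviations the abstract refers to.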
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52375256) and the Natural Science Foundation of Shanghai Municipality (Grant Nos. 21ZR1431500 and 23ZR1431600).
Abstract: Class imbalance is a common characteristic of industrial data that adversely affects industrial data mining because it leads to the biased training of machine learning models. To address this issue, the augmentation of samples in minority classes based on generative adversarial networks (GANs) has been demonstrated as an effective approach. This study proposes a novel GAN-based minority class augmentation approach named classifier-aided minority augmentation generative adversarial network (CMAGAN). In the CMAGAN framework, an outlier elimination strategy is first applied to each class to minimize the negative impacts of outliers. Subsequently, a newly designed boundary-strengthening learning GAN (BSLGAN) is employed to generate additional samples for minority classes. By incorporating a supplementary classifier and innovative training mechanisms, the BSLGAN focuses on learning the distribution of samples near classification boundaries. Consequently, it can fully capture the characteristics of the target class and generate highly realistic samples with clear boundaries. Finally, the new samples are filtered based on the Mahalanobis distance to ensure that they are within the desired distribution. To evaluate the effectiveness of the proposed approach, CMAGAN was used to solve the class imbalance problem in eight real-world fault-prediction applications. The performance of CMAGAN was compared with that of seven other algorithms, including state-of-the-art GAN-based methods, and the results indicated that CMAGAN could provide higher-quality augmented results.
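The final filtering step, which keeps generated samples only if their Mahalanobis distance to the real class distribution is small, can be sketched as follows. This is a 2-D illustration with the 2×2 covariance inverse written out by hand; the function name, the toy samples, and the cut-off of 3.0 are assumptions, not values from the paper.

```python
import math

def mahalanobis_2d(point, samples):
    """Mahalanobis distance from a 2-D point to the distribution estimated from samples."""
    n = len(samples)
    mx = sum(x for x, _ in samples) / n
    my = sum(y for _, y in samples) / n
    # Entries of the sample covariance matrix
    sxx = sum((x - mx) ** 2 for x, _ in samples) / (n - 1)
    syy = sum((y - my) ** 2 for _, y in samples) / (n - 1)
    sxy = sum((x - mx) * (y - my) for x, y in samples) / (n - 1)
    det = sxx * syy - sxy ** 2
    dx, dy = point[0] - mx, point[1] - my
    # d^2 = [dx dy] * inverse(cov) * [dx dy]^T, with the 2x2 inverse expanded
    d2 = (syy * dx * dx - 2 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(d2)

# Hypothetical real minority samples and generator outputs
real = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
generated = [(1.0, 1.2), (9.0, 9.0)]
kept = [g for g in generated if mahalanobis_2d(g, real) <= 3.0]
```

Unlike a plain Euclidean cut-off, this distance accounts for the spread and correlation of the real samples, so a generated point is judged relative to the shape of the class.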
Funding: Supported by the Fundamental Research Funds for the Central Universities (Grant No. 300102278402).
Abstract: Lithofacies classification is essential for oil and gas reservoir exploration and development. The traditional method of lithofacies classification is based on "core calibration logging" and the experience of geologists. This approach has strong subjectivity, low efficiency, and high uncertainty. This uncertainty may be one of the key factors affecting the results of 3D modeling of tight sandstone reservoirs. In recent years, deep learning, a cutting-edge artificial intelligence technology, has attracted attention from various fields. However, deep-learning techniques have not been studied sufficiently in the field of lithofacies classification. Therefore, this paper proposes a novel hybrid deep-learning model, based on the efficient feature-extraction ability of convolutional neural networks (CNN) and the excellent ability of long short-term memory networks (LSTM) to describe time-dependent features, to conduct lithofacies classification experiments. The results of a series of experiments show that the hybrid CNN-LSTM model had an average accuracy of 87.3% and the best classification effect compared to the CNN, the LSTM, and three commonly used machine learning models (support vector machine, random forest, and gradient boosting decision tree). In addition, the borderline synthetic minority oversampling technique (BSMOTE) is introduced to address the class imbalance of the raw data. The results show that balancing the processed data can significantly improve the accuracy of lithofacies classification. Furthermore, based on the fine lithofacies constraints, the sequential indicator simulation method is used to establish a three-dimensional lithofacies model, which completes the fine description of the spatial distribution of tight sandstone reservoirs in the study area. According to this comprehensive analysis, the proposed CNN-LSTM model, which eliminates class imbalance, can be effectively applied to lithofacies classification and is expected to improve the realism of the geological model for tight sandstone reservoirs.
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61402485 and 61303061, and by the Open Fund from HPCL (No. 201513-01).
Abstract: Machine Learning (ML) techniques have been widely applied in recent traffic classification. However, the problems of discriminator bias and class imbalance both decrease the accuracy of ML-based traffic classifiers. In this paper, we propose an accurate and extensible traffic classifier. Specifically, to address the discriminator bias issue, our classifier is built as an optimal cascade of binary sub-classifiers, where each binary sub-classifier is trained independently with the discriminators used for identifying application-specific traffic. Moreover, to balance the training dataset, we apply the SMOTE algorithm to generate artificial training samples for minority classes. We evaluate our classifier on two datasets collected from different network border routers. Compared with previous multi-class traffic classifiers built in a one-time training process, our classifier achieves much higher F-Measure and AUC for each application.
Funding: Supported by the National Key R&D Program of China (2018YFC0830200, Zhang B, www.most.gov.cn), the Fundamental Research Funds for the Central Universities (2242018S30021 and 2242017S30023, Zhou S, www.seu.edu.cn), and the Open Research Fund from the Key Laboratory of Computer Network and Information Integration in Southeast University, Ministry of Education, China (3209012001C3, Zhang B, www.seu.edu.cn).
Abstract: The Internet of Medical Things (IoMT) will come to be of great importance in the mediation of medical disputes, as it is emerging as the core of intelligent medical treatment. First, IoMT can track the entire medical treatment process in order to provide detailed trace data for medical dispute resolution. Second, IoMT can be embedded in ongoing treatment and provide timely intelligent decision support to medical staff. This support includes the recommendation of similar historical cases, guidance for medical treatment, alerts about hired dispute profiteers, etc. The multi-label classification of medical dispute documents (MDDs) plays an important role as a front-end process for intelligent decision support, especially in the recommendation of similar historical cases. However, MDDs usually appear as long texts containing a large amount of redundant information, and there is a serious distribution imbalance in the dataset, which directly leads to weaker classification performance. Accordingly, this paper proposes a multi-label classification method for MDDs based on key sentence extraction. The method is divided into two parts. First, an attention-based hierarchical bi-directional long short-term memory (BiLSTM) model is used to extract key sentences from documents; second, random comprehensive sampling Bagging (RCS-Bagging), an ensemble multi-label classification model, is employed to classify MDDs based on the key sentence sets. This approach greatly improves classification performance. Experiments show that the performance of the two models proposed in this paper is remarkably better than that of the baseline methods.
Funding: Funded by the National Natural Science Foundation of China (Grant Nos. 61105004, 61071136, and 21365008), the Natural Science Foundation of Guangxi (Grant No. 2013GXNSFBA019279), and the Innovation Project of GUET Graduate Education (No. ZYC0725).
Abstract: Near-infrared spectroscopy (NIRS) has been widely used in the discrimination (classification) of pharmaceutical drugs. In real applications, however, the class imbalance of the drug samples, i.e., the number of samples of one drug may be much larger than the number of samples of the other drugs, drastically decreases the discrimination performance of the classification models. To address this class imbalance problem, a new computational method, the scaled convex hull (SCH)-based maximum margin classifier, is proposed in this paper. With a suitable selection of the reduction factors of the SCHs generated by the two classes of drug samples, the maximal margin classifier between the SCHs can be constructed, and it obtains good classification performance. With the parameters involved in the modeling optimized by Cuckoo Search, a satisfactory model is achieved for the classification of the drug. Experiments on spectra samples produced by a pharmaceutical company show that the proposed method is more effective and robust than the existing ones.
Funding: This research was supported by the BK21 FOUR (Fostering Outstanding Universities for Research) program, the Ministry of Education (MOE, Korea), and the National Research Foundation of Korea (NRF).
Abstract: This paper presents a review of the ensemble learning models proposed for web services classification, selection, and composition. Web services are an evolving research area, and ensemble learning has become a hot spot for assessing the aforementioned aspects of web services. The proposed research aims to review the state-of-the-art approaches applied in the web services area. The literature on the research topic is examined using the preferred reporting items for systematic reviews and meta-analyses (PRISMA) as a research method. The study reveals an increasing trend of using ensemble learning in the chosen papers within the last ten years. Naïve Bayes (NB), Support Vector Machine (SVM), and other classifiers were identified as widely explored in the selected studies. A core analysis of web services classification suggests that the performance aspects of web services can be investigated in future work. This paper also identifies the performance metrics widely used in the literature, including accuracy, precision, recall, and F-measure.
Abstract: Classification of sheep behaviour from a sequence of tri-axial accelerometer data has the potential to enhance sheep management. Sheep behaviour is inherently imbalanced (e.g., more ruminating than walking), resulting in underperforming classification for the minority activities, which hold importance. Existing works have not addressed class imbalance and use traditional machine learning techniques, e.g., Random Forest (RF). We investigated Deep Learning (DL) models appropriate for sequential data, namely Long Short-Term Memory (LSTM) and Bidirectional LSTM (BLSTM), trained on imbalanced data. Two data sets were collected in normal grazing conditions using jaw-mounted and ear-mounted sensors. Novel to this study, alongside typical single classes, e.g., walking, data samples were labelled with compound classes, e.g., walking_grazing, depending on the behaviours. The number of steps a sheep performed in the observed 10 s time window was also recorded and incorporated in the models. We designed several multi-class classification studies in which imbalance was addressed using synthetic data. DL models achieved superior performance to traditional ML models, especially with augmented data (e.g., 4-Class+Steps: LSTM 88.0%, RF 82.5%). DL methods showed superior generalisability on unseen sheep (i.e., F1-score: BLSTM 0.84, LSTM 0.83, RF 0.65). LSTM, BLSTM, and RF achieved sub-millisecond average inference time, making them suitable for real-time applications. The results demonstrate the effectiveness of DL models for sheep behaviour classification in grazing conditions. The results also demonstrate that the DL techniques can generalise across different sheep. The study presents a strong foundation for the development of such models for real-time animal monitoring.
Funding: The Deanship of Scientific Research (DSR) at King Abdulaziz University (KAU), Jeddah, Saudi Arabia, has funded this project under grant no. KEP-1-120-42.
Abstract: Cloud Computing (CC) is the preference of all information technology (IT) organizations, as it offers flexible, pay-per-use services to its users. However, privacy and security have become the main hindrances to its adoption because of its distributed and open architecture, which is prone to intruders. An Intrusion Detection System (IDS) is one of the commonly utilized systems for detecting attacks on the cloud. IDS proves to be an effective and promising technique that identifies malicious activities and known threats by observing traffic data in computers and gives warnings when such threats are identified. Current mainstream IDSs are assisted by machine learning (ML) but suffer from low detection rates and demand extensive feature engineering. This article devises an Enhanced Coyote Optimization with Deep Learning based Intrusion Detection System for Cloud Security (ECODL-IDSCS) model. The ECODL-IDSCS model initially addresses the class imbalance problem by using the Adaptive Synthetic (ADASYN) technique. For the detection and classification of intrusions, a long short-term memory (LSTM) model is exploited. In addition, the ECO algorithm is derived to optimally fine-tune the hyperparameters of the LSTM model to enhance its detection efficiency in the cloud environment. When the presented ECODL-IDSCS model is tested on a benchmark dataset, the experimental results show its promising performance over existing IDS models.
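ADASYN differs from plain SMOTE in how it distributes the synthetic samples: minority points whose neighbourhoods contain more majority points receive more of them. The sketch below computes only that allocation, following the usual ADASYN weighting; the function name, the toy points, and the defaults for `k` and `beta` are assumptions for illustration.

```python
def adasyn_allocation(minority, majority, k=3, beta=1.0):
    """Number of synthetic samples to generate per minority point."""
    def sqdist(a, b):
        return sum((u - v) ** 2 for u, v in zip(a, b))

    total = (len(majority) - len(minority)) * beta  # samples needed overall
    labelled = [(p, 0) for p in majority] + [(p, 1) for p in minority]
    ratios = []
    for i, x in enumerate(minority):
        # All other points, nearest first (the point itself is skipped by index)
        others = [(p, lab) for j, (p, lab) in enumerate(labelled) if j != len(majority) + i]
        others.sort(key=lambda item: sqdist(x, item[0]))
        # Share of majority points among the k nearest neighbours
        ratios.append(sum(1 for _, lab in others[:k] if lab == 0) / k)
    norm = sum(ratios)
    if norm == 0:
        return [0] * len(minority)
    return [round(r / norm * total) for r in ratios]

# Hypothetical data: four minority points in a safe cluster, one inside the majority region
minority = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (5.0, 5.0)]
majority = [(5.0, 5.1), (5.1, 5.0), (4.9, 5.0), (5.0, 4.9), (6.0, 6.0), (7.0, 7.0), (8.0, 8.0)]
allocation = adasyn_allocation(minority, majority)
```

Here the whole budget goes to the minority point surrounded by majority samples, since the clustered points have no majority neighbours and are treated as easy to learn.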
Abstract: Oversampling is the most utilized approach for dealing with class-imbalanced datasets, as seen in the plethora of oversampling methods developed in the last two decades. In this editorial, we discuss the issues with oversampling that stem from the possibility of overfitting and from the generation of synthetic cases that might not accurately represent the minority class. These limitations should be considered when using oversampling techniques. We also propose several alternative strategies for dealing with imbalanced data, as well as a perspective on future work.
Funding: This research was supported by the Guangzhou Key Laboratory of Robotics and Intelligent Software under Grant No. 15180007, the Fundamental Research Funds for the Central Universities of China under Grant Nos. D215048w and 2015ZZ029, and the National Natural Science Foundation of China under Grant Nos. 61005061 and 61502177.
Abstract: Imbalance classification techniques have been frequently applied in many machine learning application domains where the number of samples in the majority (or positive) class of a dataset is much larger than that in the minority (or negative) class. Meanwhile, feature selection (FS) is one of the key techniques for high-dimensional classification tasks, greatly improving both classification performance and computational efficiency. However, most studies of feature selection and imbalance classification are restricted to off-line batch learning, which is not well adapted to some practical scenarios. In this paper, we aim to solve the high-dimensional imbalanced classification problem accurately and efficiently with only a small number of active features in an online fashion, and we propose two novel online learning algorithms for this purpose. In our approach, a classifier that involves only a small and fixed number of features is constructed to classify a sequence of imbalanced data received in an online manner. We formulate the construction of such an online learner as an optimization problem and use an iterative approach to solve it, based on the passive-aggressive (PA) algorithm as well as a truncated gradient (TG) method. We evaluate the performance of the proposed algorithms on several real-world datasets, and our experimental results demonstrate the effectiveness of the proposed algorithms in comparison with the baselines.
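The combination of a passive-aggressive update with a fixed feature budget can be sketched as one online step: update the weights only when the margin is violated, then keep just the largest-magnitude weights. This is a generic PA-I step with hard truncation, not the paper's algorithms; the function name and the `budget` and `c` parameters are assumptions, and any cost-sensitive handling of imbalance is omitted.

```python
def pa_truncated_step(w, x, y, budget, c=1.0):
    """One passive-aggressive (PA-I) update followed by truncation
    to the `budget` weights with the largest magnitude."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss on the current sample
    norm_sq = sum(xi * xi for xi in x)
    if loss > 0.0 and norm_sq > 0.0:
        tau = min(c, loss / norm_sq)  # PA-I step size, capped by c
        w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    # Keep only the `budget` largest-magnitude weights; zero the rest
    keep = set(sorted(range(len(w)), key=lambda i: abs(w[i]), reverse=True)[:budget])
    return [wi if i in keep else 0.0 for i, wi in enumerate(w)]

# Hypothetical 4-feature sample with label +1 and a budget of 2 active features
w = [0.0, 0.0, 0.0, 0.0]
w = pa_truncated_step(w, x=[1.0, 0.2, 0.0, 0.1], y=+1, budget=2)
```

The model stays "passive" when a sample is already classified with margin at least 1, so the set of active features changes only when a mistake or near-mistake occurs.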
Abstract: Recently, Financial Technology (FinTech) has received more attention among financial sectors and researchers seeking effective solutions for financial institutions and firms. Financial crisis prediction (FCP) is an essential topic in the business sector, where it is useful for identifying the financial condition of a financial institution. At the same time, the development of the Internet of Things (IoT) has altered the mode of human interaction with the physical world. The IoT can be combined with the FCP model to examine financial data from users and support the decision-making process. This paper presents a novel multi-objective squirrel search optimization algorithm with stacked autoencoder (MOSSA-SAE) model for FCP in an IoT environment. The MOSSA-SAE model encompasses different subprocesses, namely preprocessing, class imbalance handling, parameter tuning, and classification. First, the MOSSA-SAE model allows IoT devices such as smartphones and laptops to collect the financial details of the users, which are then transmitted to the cloud for further analysis. In addition, the SMOTE technique is employed to handle class imbalance problems. The goal of MOSSA in SMOTE is to determine the oversampling rate and the area of nearest neighbors of SMOTE. The SAE model is used as the classification technique to determine the class label of the financial data. At the same time, the MOSSA is applied to appropriately select the weight and bias values of the SAE. An extensive experimental validation process is performed on the benchmark financial dataset, and the results are examined under distinct aspects. The experimental values confirmed the superior performance of the MOSSA-SAE model on the applied dataset.
Funding: This work was supported by the National Research Foundation of Korea (No. 2020R1A2C1014829) and by the Korea Medical Device Development Fund grant, which is funded by the Government of the Republic of Korea (the Ministry of Science and ICT, the Ministry of Trade, Industry and Energy, the Ministry of Health and Welfare, and the Ministry of Food and Drug Safety) (grant KMDF_PR_20200901_0095).
Abstract: For classification problems in practice, one of the challenging issues is obtaining enough labeled data for training. Moreover, even when sufficient labeled data has been accumulated, most datasets exhibit a long-tailed distribution with heavy class imbalance, which results in a model biased towards the majority class. To alleviate such class imbalance, semi-supervised learning methods using additional unlabeled data have been considered. However, their accuracy is typically much lower than that of supervised learning. In this study, under the assumption that additional unlabeled data is available, we propose iterative semi-supervised learning algorithms, which iteratively correct the labeling of the extra unlabeled data based on softmax probabilities. The results show that the proposed algorithms provide accuracy as high as that of supervised learning. To validate the proposed algorithms, we tested two scenarios: with a balanced unlabeled dataset and with an imbalanced unlabeled dataset. Under both scenarios, our proposed semi-supervised learning algorithms provided higher accuracy than previous state-of-the-art methods. Code is available at https://github.com/HeewonChung92/iterative-semi-learning.
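The idea of correcting pseudo-labels from softmax probabilities can be sketched as a single relabelling pass: a label on an unlabeled sample is replaced only when the current model is confident enough. This is an illustrative sketch, not the code in the linked repository; the `relabel` function, the confidence threshold of 0.9, and the example logits are all assumptions.

```python
import math

def softmax(logits):
    """Numerically stable softmax."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def relabel(logits_per_sample, current_labels, threshold=0.9):
    """Replace a pseudo-label only when the top softmax probability reaches the threshold."""
    updated = []
    for logits, label in zip(logits_per_sample, current_labels):
        probs = softmax(logits)
        best = max(range(len(probs)), key=probs.__getitem__)
        updated.append(best if probs[best] >= threshold else label)
    return updated

# Hypothetical model outputs for two unlabeled samples currently labelled 2 and 1
labels = relabel([[4.0, 0.0, 0.0], [0.2, 0.1, 0.0]], current_labels=[2, 1])
```

An iterative scheme would alternate this pass with retraining on the labeled data plus the updated pseudo-labels until the labels stop changing.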