A Stacked Ensemble Deep Learning Approach for Imbalanced Multi-Class Water Quality Index Prediction
1
Authors: Wen Yee Wong, Khairunnisa Hasikin, Anis Salwa Mohd Khairuddin, Sarah Abdul Razak, Hanee Farzana Hizaddin, Mohd Istajib Mokhtar, Muhammad Mokhzaini Azizan. Computers, Materials & Continua (SCIE, EI), 2023, Issue 8, pp. 1361-1384, 24 pages
A common difficulty in building prediction models with real-world environmental datasets is the skewed distribution of classes. There are significantly more samples for day-to-day classes, while rare events such as polluted classes are uncommon. Consequently, the limited availability of minority outcomes lowers the classifier's overall reliability. This study assesses the capability of machine learning (ML) algorithms in tackling imbalanced water quality data based on the metrics of precision, recall, and F1 score, with the aim of correcting accuracy figures that are misleadingly skewed towards the majority class. To this end, the performance of 10 ML algorithms is compared: AdaBoost, Support Vector Machine, Linear Discriminant Analysis, k-Nearest Neighbors, Naive Bayes, Decision Trees, Random Forest, Extra Trees, Bagging, and the Multilayer Perceptron. This study also uses the Easy Ensemble Classifier, Balanced Bagging, and RUSBoost algorithms to evaluate multi-class imbalanced learning methods. The comparison results revealed that a high-accuracy machine learning model is not always good in recall and sensitivity. This paper's stacked ensemble deep learning (SE-DL) generalization model effectively classifies the water quality index (WQI) based on 23 input variables. The proposed algorithm achieved averages of 95.69%, 94.96%, 92.92%, and 93.88% for accuracy, precision, recall, and F1 score, respectively. In addition, the proposed model is compared against two state-of-the-art classifiers, XGBoost (eXtreme Gradient Boosting) and Light Gradient Boosting Machine, with balanced accuracy and G-mean included as performance metrics. In this experimental setup, XGBoost achieved a higher balanced accuracy and G-mean. However, the SE-DL model has a better and more balanced performance in the F1 score. The SE-DL model aligns with the goal of this study to ensure the balance between accuracy and completeness for each water quality class. The proposed algorithm is also more efficient, with a lower computational time, than the standard Synthetic Minority Oversampling Technique (SMOTE) approach to imbalanced datasets.
Keywords: Water quality classification; imbalanced data; SMOTE; stacked ensemble deep learning; sensitivity analysis
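The metrics named in the abstract above (accuracy, precision, recall, F1 score, balanced accuracy, G-mean) can all be derived from a confusion matrix. The following is a minimal sketch under stated assumptions, not the paper's evaluation code: the function names and the toy 3-class matrix are hypothetical, rows are taken to be true classes, macro averaging is assumed, and G-mean is taken as the geometric mean of per-class recalls.

```python
import math

def per_class_recall(confusion):
    """Recall of each class; rows of the matrix are true classes."""
    return [row[i] / sum(row) if sum(row) else 0.0
            for i, row in enumerate(confusion)]

def per_class_precision(confusion):
    """Precision of each class; columns of the matrix are predicted classes."""
    n = len(confusion)
    col_sums = [sum(confusion[r][c] for r in range(n)) for c in range(n)]
    return [confusion[i][i] / col_sums[i] if col_sums[i] else 0.0
            for i in range(n)]

def accuracy(confusion):
    """Fraction of all samples on the diagonal."""
    total = sum(sum(row) for row in confusion)
    return sum(confusion[i][i] for i in range(len(confusion))) / total

def balanced_accuracy(confusion):
    """Arithmetic mean of per-class recalls."""
    recalls = per_class_recall(confusion)
    return sum(recalls) / len(recalls)

def g_mean(confusion):
    """Geometric mean of per-class recalls (assumed multi-class definition)."""
    recalls = per_class_recall(confusion)
    return math.prod(recalls) ** (1.0 / len(recalls))

def macro_f1(confusion):
    """Unweighted mean of per-class F1 scores."""
    scores = []
    for p, r in zip(per_class_precision(confusion), per_class_recall(confusion)):
        scores.append(2 * p * r / (p + r) if (p + r) else 0.0)
    return sum(scores) / len(scores)

# Hypothetical confusion matrix: one large class and two small ones.
toy = [[95, 5, 0],
       [5, 5, 0],
       [0, 5, 5]]
print(accuracy(toy), balanced_accuracy(toy), g_mean(toy), macro_f1(toy))
```

On the toy matrix, plain accuracy is dominated by the large class, while balanced accuracy and G-mean drop because the two small classes are recalled only half of the time.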
An Imbalanced Dataset and Class Overlapping Classification Model for Big Data (cited by: 1)
2
Authors: Mini Prince, P.M.Joe Prathap. Computer Systems Science & Engineering (SCIE, EI), 2023, Issue 2, pp. 1009-1024, 16 pages
Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling Technique (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that their practical applicability is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address this problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is then constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE scales well within the proposed framework and also cuts down the processing time.
Keywords: imbalanced dataset; class overlapping; SMOTE; MapReduce; parallel programming; oversampling
Machine Learning and Synthetic Minority Oversampling Techniques for Imbalanced Data: Improving Machine Failure Prediction
3
Authors: Yap Bee Wah, Azlan Ismail, Nur Niswah Naslina Azid, Jafreezal Jaafar, Izzatdin Abdul Aziz, Mohd Hilmi Hasan, Jasni Mohamad Zain. Computers, Materials & Continua (SCIE, EI), 2023, Issue 6, pp. 4821-4841, 21 pages
Prediction of machine failure is challenging as the dataset is often imbalanced with a low failure rate. The common approach to handle classification involving imbalanced data is to balance the data using a sampling approach such as random undersampling, random oversampling, or Synthetic Minority Oversampling Technique (SMOTE) algorithms. This paper compared the classification performance of three popular classifiers (Logistic Regression, Gaussian Naïve Bayes, and Support Vector Machine) in predicting machine failure in the oil and gas industry. The original machine failure dataset consists of 20,473 hourly records and is imbalanced, with 19,945 (97%) 'non-failure' and 528 (3%) 'failure' records. The three independent variables used to predict machine failure were pressure indicator, flow indicator, and level indicator. The accuracy of the classifiers is very high and close to 100%, but the sensitivity of all classifiers using the original dataset was close to zero. The performance of the three classifiers was then evaluated for data with different imbalance rates (10% to 50%) generated from the original data using SMOTE, SMOTE-Support Vector Machine (SMOTE-SVM) and SMOTE-Edited Nearest Neighbour (SMOTE-ENN). The classifiers were evaluated based on improvement in sensitivity and F-measure. Results showed that the sensitivity of all classifiers increases as the imbalance rate increases. SVM with radial basis function (RBF) kernel has the highest sensitivity when data is balanced (50:50) using SMOTE (Sensitivity(test) = 0.5686, F(test) = 0.6927) compared to Naïve Bayes (Sensitivity(test) = 0.4033, F(test) = 0.6218) and Logistic Regression (Sensitivity(test) = 0.4194, F(test) = 0.621). Overall, the Gaussian Naïve Bayes model consistently improves sensitivity and F-measure as the imbalance ratio increases, but the sensitivity is below 50%. The classifiers performed better when data was balanced using SMOTE-SVM compared to SMOTE and SMOTE-ENN.
Keywords: Machine failure; machine learning; imbalanced data; SMOTE; classification
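The gap between near-100% accuracy and near-zero sensitivity reported above follows directly from the class counts. As an illustrative sketch (not the paper's experiment), the snippet below scores a hypothetical baseline that always predicts 'non-failure', using only the counts quoted in the abstract; the function name is an assumption.

```python
def always_negative_metrics(n_negative, n_positive):
    """Accuracy and sensitivity of a baseline that labels every sample negative."""
    true_pos, false_neg = 0, n_positive   # every failure is missed
    true_neg, false_pos = n_negative, 0   # every non-failure is correct
    total = true_pos + false_neg + true_neg + false_pos
    accuracy = (true_pos + true_neg) / total
    sensitivity = true_pos / (true_pos + false_neg)
    return accuracy, sensitivity

# Counts quoted in the abstract: 19,945 'non-failure' and 528 'failure' records.
acc, sens = always_negative_metrics(19945, 528)
print(f"accuracy={acc:.4f} sensitivity={sens:.4f}")
```

A model can therefore reach about 97% accuracy on this dataset without detecting a single failure, which is why the abstract evaluates sensitivity and F-measure rather than accuracy alone.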
GraphCWGAN-GP: A Novel Data Augmenting Approach for Imbalanced Encrypted Traffic Classification
4
Authors: Jiangtao Zhai, Peng Lin, Yongfu Cui, Lilong Xu, Ming Liu. Computer Modeling in Engineering & Sciences (SCIE, EI), 2023, Issue 8, pp. 2069-2092, 24 pages
Encrypted traffic classification has become a hot issue in network security research. The class imbalance problem of traffic samples often degrades the performance of machine learning based classifiers. Although the Generative Adversarial Network (GAN) method can generate new samples by learning the feature distribution of the original samples, it suffers from unstable training and mode collapse. To this end, a novel data augmenting approach called Graph CWGAN-GP is proposed in this paper. The traffic data is first converted into grayscale images as the input for the proposed model. Then, the minority class data is augmented with our proposed model, which is built by introducing conditional constraints and a new distance metric into a typical GAN. Finally, a classical deep learning model is adopted as a classifier to classify datasets augmented by the Conditional GAN (CGAN), Wasserstein GAN-Gradient Penalty (WGAN-GP) and Graph CWGAN-GP, respectively. Compared with the state-of-the-art GAN methods, Graph CWGAN-GP can not only control the modes of the data to be generated, but also overcome the problem of unstable training and generate more realistic and diverse samples. The experimental results show that the classification precision, recall and F1-Score of the minority class in the balanced dataset augmented in this paper have improved by more than 2.37%, 3.39% and 4.57%, respectively.
Keywords: Generative Adversarial Network; imbalanced traffic data; data augmenting; encrypted traffic classification
Observation points classifier ensemble for high-dimensional imbalanced classification
5
Authors: Yulin He, Xu Li, Philippe Fournier‐Viger, Joshua Zhexue Huang, Mianjie Li, Salman Salloum. CAAI Transactions on Intelligence Technology (SCIE, EI), 2023, Issue 2, pp. 500-517, 18 pages
In this paper, an Observation Points Classifier Ensemble (OPCE) algorithm is proposed to deal with High-Dimensional Imbalanced Classification (HDIC) problems based on data processed using the Multi-Dimensional Scaling (MDS) feature extraction technique. First, the dimensionality of the original imbalanced data is reduced using MDS so that distances between any two different samples are preserved as well as possible. Second, a novel OPCE algorithm is applied to classify imbalanced samples by placing optimised observation points in a low-dimensional data space. Third, optimisation of the observation point mappings is carried out to obtain a reliable assessment of the unknown samples. Exhaustive experiments have been conducted to evaluate the feasibility, rationality, and effectiveness of the proposed OPCE algorithm using seven benchmark HDIC data sets. Experimental results show that (1) the OPCE algorithm can be trained faster on low-dimensional imbalanced data than on high-dimensional data; (2) the OPCE algorithm can correctly identify samples as the number of optimised observation points is increased; and (3) statistical analysis reveals that OPCE yields better HDIC performance on the selected data sets in comparison with eight other HDIC algorithms. This demonstrates that OPCE is a viable algorithm to deal with HDIC problems.
Keywords: classifier ensemble; feature transformation; high-dimensional data classification; imbalanced learning; observation point mechanism
Imbalanced Data Classification Using SVM Based on Improved Simulated Annealing Featuring Synthetic Data Generation and Reduction
6
Authors: Hussein Ibrahim Hussein, Said Amirul Anwar, Muhammad Imran Ahmad. Computers, Materials & Continua (SCIE, EI), 2023, Issue 4, pp. 547-564, 18 pages
Imbalanced data classification is one of the major problems in machine learning. An imbalanced dataset typically has significant differences in the number of data samples between its classes. In most cases, the performance of a machine learning algorithm such as the Support Vector Machine (SVM) is affected when dealing with an imbalanced dataset. The classification accuracy is mostly skewed toward the majority class, and poor results are exhibited in the prediction of minority-class samples. In this paper, a hybrid approach combining a data pre-processing technique and an SVM algorithm based on improved Simulated Annealing (SA) is proposed. First, a data pre-processing technique that serves as the resampling strategy for handling imbalanced datasets is proposed. In this technique, the data are first synthetically generated to equalize the number of samples between classes, followed by a reduction step to remove redundant and duplicated data. Next, the balanced dataset is used to train the SVM. Since this algorithm requires an iterative process to search for the best penalty parameter during training, an improved SA algorithm is proposed for this task. In this improvement, a new acceptance criterion for solutions in the SA algorithm is introduced to enhance the accuracy of the optimization process. Experiments on ten publicly available imbalanced datasets demonstrated higher classification accuracy with the proposed approach than with the conventional implementation of SVM. An average accuracy of 89.65% for binary classification demonstrates the good performance of the proposed approach.
Keywords: imbalanced data; resampling technique; data reduction; support vector machine; simulated annealing
Fault Diagnosis of Power Transformer Based on Improved ACGAN Under Imbalanced Data
7
Authors: Tusongjiang.Kari, Lin Du, Aisikaer.Rouzi, Xiaojing Ma, Zhichao Liu, Bo Li. Computers, Materials & Continua (SCIE, EI), 2023, Issue 5, pp. 4573-4592, 20 pages
The imbalance of dissolved gas analysis (DGA) data leads to over-fitting, weak generalization and poor recognition performance in fault diagnosis models based on deep learning. To handle this problem, a novel transformer fault diagnosis method based on an improved auxiliary classifier generative adversarial network (ACGAN) under imbalanced data is proposed in this paper, which meets the requirements of both balancing DGA data and supplying accurate diagnosis results. The generator combines one-dimensional convolutional neural networks (1D-CNN) and long short-term memory (LSTM) networks, which can deeply extract the features from DGA samples and greatly benefit ACGAN's data balancing and fault diagnosis. The discriminator adopts multilayer perceptron (MLP) networks, which prevents the discriminator from losing important features of DGA data when the network is too complex and the number of layers is too large. The experimental results suggest that the presented approach can effectively mitigate the adverse effects of DGA data imbalance on deep learning models, enhance fault diagnosis performance and deliver a diagnosis accuracy of up to 99.46%. Furthermore, the comparison results indicate that the fault diagnosis performance of the proposed approach is superior to that of other conventional methods. Therefore, the method presented in this study has excellent and reliable fault diagnosis performance for various imbalanced datasets. In addition, the proposed approach can also address insufficient and imbalanced fault data in other practical application fields.
Keywords: Power transformer; dissolved gas analysis; imbalanced data; auxiliary classifier generative adversarial network
An improved bidirectional generative adversarial network model for multivariate estimation of correlated and imbalanced tunnel construction parameters
8
Authors: Yao Xiao, Jia Yu, Guoxin Xu, Dawei Tong, Jiahao Yu, Tuocheng Zeng. Journal of Rock Mechanics and Geotechnical Engineering (SCIE, CSCD), 2023, Issue 7, pp. 1797-1809, 13 pages
Estimation of construction parameters is crucial for optimizing tunnel construction schedules. Due to the influence of routine activities and occasional risk events, these parameters are usually correlated and imbalanced. To solve this issue, an improved bidirectional generative adversarial network (BiGAN) model with a joint discriminator structure and zero-centered gradient penalty (0-GP) is proposed. In this model, to improve the capability of the original BiGAN in learning imbalanced parameters, the joint discriminator separately discriminates the routine activity and risk event durations to balance their influence weights. Then, a self-attention mechanism is embedded so that the discriminator can pay more attention to the imbalanced parameters. Finally, the 0-GP is added to the loss of the discriminator to improve its convergence and stability. A case study of a tunnel in China shows that the improved BiGAN can obtain parameter estimates consistent with the classical Gaussian mixture model, without the need for tedious and complex correlation analysis. The proposed joint discriminator increases the ability of BiGAN to estimate imbalanced construction parameters, and the 0-GP ensures the stability and convergence of the model.
Keywords: Multivariate parameters estimation; Correlated and imbalanced parameters; Bidirectional generative adversarial network (BiGAN); Joint discriminator; Zero-centered gradient penalty (0-GP)
An Embedded Feature Selection Method for Imbalanced Data Classification (cited by: 11)
9
Authors: Haoyue Liu, MengChu Zhou, Qing Liu. IEEE/CAA Journal of Automatica Sinica (EI, CSCD), 2019, Issue 3, pp. 703-715, 13 pages
Imbalanced data is a type of dataset frequently found in real-world applications, e.g., fraud detection and cancer diagnosis. For this type of dataset, improving the accuracy of identifying the minority class is a critically important issue. Feature selection is one method to address this issue. An effective feature selection method can choose a subset of features that favors accurate determination of the minority class. A decision tree is a classifier that can be built using different splitting criteria. Its advantage is the ease of detecting which feature is used as a splitting node. Thus, it is possible to use a decision tree splitting criterion as a feature selection method. In this paper, an embedded feature selection method using our proposed weighted Gini index (WGI) is proposed. Comparison with the Chi2, F-statistic and Gini index feature selection methods shows that F-statistic and Chi2 reach the best performance when only a few features are selected. As the number of selected features increases, our proposed method has the highest probability of achieving the best performance. The area under a receiver operating characteristic curve (ROC AUC) and F-measure are used as evaluation criteria. Experimental results with two datasets show that ROC AUC performance can be high even if only a few features are selected and used, and changes only slightly as more and more features are selected. However, F-measure achieves excellent performance only if 20% or more of the features are chosen. The results are helpful for practitioners to select a proper feature selection method when facing a practical problem.
Keywords: classification and regression tree; feature selection; imbalanced data; weighted Gini index (WGI)
Over-sampling algorithm for imbalanced data classification (cited by: 6)
10
Authors: XU Xiaolong, CHEN Wen, SUN Yanfei. Journal of Systems Engineering and Electronics (SCIE, EI, CSCD), 2019, Issue 6, pp. 1182-1191, 10 pages
For imbalanced datasets, the focus of classification is to identify samples of the minority class. The performance of current data mining algorithms is not good enough for processing imbalanced datasets. The synthetic minority over-sampling technique (SMOTE) is specifically designed for learning from imbalanced datasets, generating synthetic minority class examples by interpolating between nearby minority class examples. However, SMOTE encounters the overgeneralization problem. The density-based spatial clustering of applications with noise (DBSCAN) is not rigorous when dealing with samples near the borderline. We optimize the DBSCAN algorithm for this problem to make clustering more reasonable. This paper integrates the optimized DBSCAN and SMOTE, and proposes a density-based synthetic minority over-sampling technique (DSMOTE). First, the optimized DBSCAN is used to divide the samples of the minority class into three groups: core samples, borderline samples and noise samples. The noise samples of the minority class are then removed so that more effective samples can be synthesized. To make full use of the information in core samples and borderline samples, different strategies are used to over-sample each of the two groups. Experiments show that DSMOTE achieves better results than SMOTE and Borderline-SMOTE in terms of precision, recall and F-value.
Keywords: imbalanced data; density-based spatial clustering of applications with noise (DBSCAN); synthetic minority over-sampling technique (SMOTE); over-sampling
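The interpolation step that the abstract attributes to SMOTE (a synthetic point placed between a minority sample and one of its nearby minority neighbours) can be sketched as follows. This illustrates plain SMOTE-style interpolation only, not the DSMOTE method of the paper; the function name, the parameter k, and the toy points are assumptions.

```python
import random

def smote_like_sample(minority, k=2, rng=None):
    """Create one synthetic point on the segment between a minority sample
    and one of its k nearest minority neighbours."""
    rng = rng or random.Random(0)
    base = rng.choice(minority)

    def squared_distance(point):
        return sum((a - b) ** 2 for a, b in zip(base, point))

    # k nearest neighbours of `base` among the other minority samples
    others = [p for p in minority if p is not base]
    neighbours = sorted(others, key=squared_distance)[:k]
    neighbour = rng.choice(neighbours)

    gap = rng.random()  # position along the segment, in [0, 1)
    return tuple(a + gap * (b - a) for a, b in zip(base, neighbour))

# Hypothetical 2-D minority samples at the corners of the unit square.
minority_points = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)]
rng = random.Random(42)
synthetic = [smote_like_sample(minority_points, k=2, rng=rng) for _ in range(5)]
print(synthetic)
```

Because every synthetic point lies on a segment between two existing minority samples, noisy minority samples can pull new points into majority regions, which is the overgeneralization problem the abstract mentions.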
Using Imbalanced Triangle Synthetic Data for Machine Learning Anomaly Detection (cited by: 4)
11
Authors: Menghua Luo, Ke Wang, Zhiping Cai, Anfeng Liu, Yangyang Li, Chak Fong Cheang. Computers, Materials & Continua (SCIE, EI), 2019, Issue 1, pp. 15-26, 12 pages
The extremely imbalanced data problem is the core issue in anomaly detection. The amount of abnormal data is so small that we cannot get adequate information to analyze it. The mainstream methods focus on taking full advantage of the normal data, treating any data that does not belong to the normal data distribution as an anomaly. From the viewpoint of data science, we concentrate on the abnormal data and generate artificial abnormal samples using machine learning methods. Among such techniques, the Synthetic Minority Over-sampling Technique (SMOTE) and its improved variants are representative milestones, which generate synthetic examples randomly on selected line segments. In our work, we break the limitation of the line segment and propose an Imbalanced Triangle Synthetic Data method. In theory, our method covers a wider range. In experiments with real-world data, our method performs better than SMOTE and its improved variants.
Keywords: anomaly detection; imbalanced data; synthetic data; machine learning
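The abstract does not say how the triangle is chosen or sampled, so the following is only a sketch of one way to move from line segments to triangles: drawing a point uniformly inside the triangle spanned by three minority samples. The function name, the uniform sampling scheme, and the toy vertices are all assumptions rather than the authors' method.

```python
import random

def sample_in_triangle(a, b, c, rng=None):
    """Draw a point uniformly at random inside the triangle with vertices a, b, c."""
    rng = rng or random.Random(0)
    u, v = rng.random(), rng.random()
    if u + v > 1.0:
        # Reflect back into the triangle so the distribution stays uniform.
        u, v = 1.0 - u, 1.0 - v
    return tuple(pa + u * (pb - pa) + v * (pc - pa)
                 for pa, pb, pc in zip(a, b, c))

# Three hypothetical abnormal (minority) samples in 2-D.
p1, p2, p3 = (0.0, 0.0), (1.0, 0.0), (0.0, 1.0)
rng = random.Random(7)
print([sample_in_triangle(p1, p2, p3, rng) for _ in range(3)])
```

Compared with segment interpolation, each triple of samples now contributes a two-dimensional region rather than a one-dimensional segment, which is one reading of the abstract's claim that the method covers a wider range.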
Imbalanced Classification in Diabetics Using Ensembled Machine Learning (cited by: 1)
12
Authors: M.Sandeep Kumar, Mohammad Zubair Khan, Sukumar Rajendran, Ayman Noor, A.Stephen Dass, J.Prabhu. Computers, Materials & Continua (SCIE, EI), 2022, Issue 9, pp. 4397-4409, 13 pages
Diabetes is one of the world's most common diseases and is caused by persistently high levels of blood sugar. The risk of diabetes can be lowered if it is detected at an early stage. Recently, several machine learning models have been developed to predict the presence of diabetes at an early stage. In this paper, we propose an embedded-based machine learning model that combines the split-vote method and instance duplication to leverage an imbalanced dataset called PIMA Indian to improve the prediction of diabetes. The proposed method uses both over-sampling and under-sampling along with model weighting to increase classification performance. Measures such as Accuracy, Precision, Recall, and F1-Score are used to evaluate the model. The results we obtained using K-Nearest Neighbor (kNN), Naïve Bayes (NB), Support Vector Machines (SVM), Random Forest (RF), Logistic Regression (LR), and Decision Trees (DT) were 89.32%, 91.44%, 95.78%, 89.3%, 81.76%, and 80.38%, respectively. The SVM model outperforms the other models, and its result is 21.38% higher than existing machine learning-based works.
Keywords: Diabetics classification; imbalanced data; split-vote; instance duplication
A Rebalancing Framework for Classification of Imbalanced Medical Appointment No-show Data
13
Authors: Ulagapriya Krishnan, Pushpa Sangar. Journal of Data and Information Science (CSCD), 2021, Issue 1, pp. 178-192, 15 pages
Purpose: This paper aims to improve classification performance when the data is imbalanced by applying different sampling techniques available in machine learning. Design/methodology/approach: The medical appointment no-show dataset is imbalanced, and when classification algorithms are applied directly to the dataset, they are biased towards the majority class and ignore the minority class. To avoid this issue, multiple sampling techniques such as Random Over Sampling (ROS), Random Under Sampling (RUS), Synthetic Minority Oversampling TEchnique (SMOTE), ADAptive SYNthetic Sampling (ADASYN), Edited Nearest Neighbor (ENN), and Condensed Nearest Neighbor (CNN) are applied in order to balance the dataset. The performance of the Decision Tree classifier is assessed with each of the listed sampling techniques, and the best performance is identified. Findings: This study focuses on comparing the performance metrics of various widely used sampling methods. It is revealed that, compared to other techniques, the Recall is high when ENN is applied; CNN and ADASYN have performed equally well on the imbalanced data. Research limitations: The testing was carried out with a limited dataset and needs to be repeated with a larger dataset. Practical implications: This framework will be useful whenever the data is imbalanced in real-world scenarios, which ultimately improves the performance. Originality/value: This paper uses the rebalancing framework on a medical appointment no-show dataset to predict the no-shows and removes the bias towards minority class.
Keywords: imbalanced data; sampling methods; machine learning; classification
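Of the sampling techniques listed above, Random Over Sampling (ROS) and Random Under Sampling (RUS) are the simplest to illustrate. The sketch below is a plain-Python illustration with hypothetical function names and toy data, not the framework used in the paper: ROS duplicates randomly chosen samples until every class matches the largest class, and RUS randomly keeps only as many samples per class as the smallest class has.

```python
import random
from collections import Counter

def _group_by_label(samples, labels):
    """Collect samples into a dict keyed by class label."""
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups

def random_oversample(samples, labels, seed=0):
    """ROS: duplicate random samples until each class matches the largest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        extras = [rng.choice(group) for _ in range(target - len(group))]
        balanced.extend((sample, label) for sample in group + extras)
    return balanced

def random_undersample(samples, labels, seed=0):
    """RUS: keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    balanced = []
    for label, group in groups.items():
        balanced.extend((sample, label) for sample in rng.sample(group, target))
    return balanced

# Hypothetical toy data: 8 'show' appointments and 2 'no-show' appointments.
X = list(range(10))
y = ["show"] * 8 + ["no-show"] * 2
print(Counter(label for _, label in random_oversample(X, y)))
print(Counter(label for _, label in random_undersample(X, y)))
```

ROS keeps all original information but repeats minority samples, while RUS discards majority samples; the other listed methods (SMOTE, ADASYN, ENN, CNN) refine these two basic ideas.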
Dealing with Imbalanced Dataset Leveraging Boundary Samples Discovered by Support Vector Data Description
14
Authors: Zhengbo Luo, Hamïd Parvïn, Harish Garg, Sultan Noman Qasem, Kim-Hung Pho, Zulkefli Mansor. Computers, Materials & Continua (SCIE, EI), 2021, Issue 3, pp. 2691-2708, 18 pages
These days, imbalanced datasets, denoted throughout the paper by ID (a dataset that contains some (usually two) classes where one contains a considerably smaller number of samples than the other(s)), emerge in many real-world problems (such as health care or disease diagnosis systems, anomaly detection, fraud detection, and stream-based malware detection systems). These datasets cause problems in the classification process and its application, such as under-training of the minority class(es), over-training of the majority class(es), and bias towards the majority class(es). Therefore, these datasets have drawn the attention of many researchers across fields, and several solutions exist for dealing with this problem. The main aim of this study for dealing with IDs is to resample the borderline samples discovered by Support Vector Data Description (SVDD). There are naturally two kinds of resampling: under-sampling (U-S) and over-sampling (O-S). The main drawback of O-S is that it may cause over-fitting. The main drawback of U-S is that it can cause significant information loss. In this study, to avoid the drawbacks of the sampling techniques, we focus on the samples that may be misclassified. The data points that can be misclassified are considered to be the borderline data points, which lie on the border(s) between the majority class(es) and minority class(es). First, we find the borderline examples using SVDD; then, data resampling is applied to them. Next, the base classifier is trained on the newly created dataset. Finally, we compare the results of our method in terms of Area Under Curve (AUC), F-measure, and G-mean with other state-of-the-art methods. We show that our method achieves better results than the other state-of-the-art methods in our experimental study.
Keywords: imbalanced learning; classification; borderline examples
A Rasterized Lightning Disaster Risk Method for Imbalanced Sets Using Neural
15
Authors: Yan Zhang, Jin Han, Chengsheng Yuan, Shuo Yang, Chuanlong Li, Xingming Sun. Computers, Materials & Continua (SCIE, EI), 2021, Issue 1, pp. 563-574, 12 pages
Over the past 10 years, lightning disasters have caused a large number of casualties and considerable economic loss worldwide. Lightning poses a huge threat to various industries. In an attempt to reduce the risk of lightning-caused disasters, many scholars have carried out in-depth research on lightning. However, these studies focus primarily on the lightning itself, and other meteorological elements are ignored. In addition, the methods for assessing the risk of lightning disaster fail to give detailed attention to regional features (lightning disaster risk). This paper proposes a grid-based risk assessment method based on data from multiple sources. First, this paper considers the impact of lightning, population density, the economy, and geographical environment data on the occurrence of lightning disasters. Second, this paper solves the problem of imbalanced lightning disaster data in geographic grid samples based on the K-means clustering algorithm. Third, the method calculates the lightning disaster feature of each small grid cell with the help of a neural network structure, and the calculation results are then visualized in a zoning map using the Jenks natural breaks algorithm. The experimental results show that our method can solve the problem of imbalanced lightning disaster data and offers 81% accuracy in lightning disaster risk assessment.
Keywords: Lightning disaster; neural network; imbalanced data
An Effective Classifier Model for Imbalanced Network Attack Data
16
Authors: Gürcan Çetin. Computers, Materials & Continua (SCIE, EI), 2022, Issue 12, pp. 4519-4539, 21 pages
Recently, machine learning algorithms have been used in the detection and classification of network attacks. The performance of the algorithms has been evaluated using benchmark network intrusion datasets such as DARPA98, KDD'99, NSL-KDD, UNSW-NB15, and Caida DDoS. However, these datasets pose two major challenges: imbalanced data and high-dimensional data. Obtaining high accuracy for all attack types in the dataset allows for high accuracy in imbalanced datasets. On the other hand, having a large number of features increases the runtime load on the algorithms. A novel model is proposed in this paper to overcome these two concerns. The number of features in the model, which has been tested on CICIDS2017, is initially optimized using genetic algorithms. This optimal feature set has been used to classify network attacks with six well-known classifiers, selected for high f1-score and g-mean values in minimum time. Afterwards, a multi-layer perceptron based ensemble learning approach has been applied to improve the models' overall performance. The experimental results show that the suggested model is acceptable for feature selection as well as for classifying network attacks in an imbalanced dataset, with a high f1-score (0.91) and g-mean (0.99) value. Furthermore, it has outperformed base classifier models and voting procedures.
Keywords: ensemble methods; feature selection; genetic algorithm; multilayer perceptron; network attacks; imbalanced data
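Several abstracts in this listing report F1-score and G-mean rather than plain accuracy. The following is a minimal sketch of how those two metrics can be computed from label sequences. It assumes macro-averaged F1 and a G-mean defined as the geometric mean of per-class recalls; the abstracts do not state which averaging the authors used, and the function names and the toy labels are illustrative only.

```python
import math


def per_class_stats(y_true, y_pred):
    """Return {label: (recall, precision)} for every label present in y_true."""
    stats = {}
    for c in sorted(set(y_true)):
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        recall = tp / (tp + fn) if tp + fn else 0.0
        precision = tp / (tp + fp) if tp + fp else 0.0
        stats[c] = (recall, precision)
    return stats


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores (assumed averaging scheme)."""
    f1s = [2 * r * p / (r + p) if r + p else 0.0
           for r, p in per_class_stats(y_true, y_pred).values()]
    return sum(f1s) / len(f1s)


def g_mean(y_true, y_pred):
    """Geometric mean of per-class recalls; zero if any class is never recalled."""
    recalls = [r for r, _ in per_class_stats(y_true, y_pred).values()]
    return math.prod(recalls) ** (1.0 / len(recalls))


# Toy example: 8 normal flows, 2 attacks, one attack missed.
y_true = ["normal"] * 8 + ["attack"] * 2
y_pred = ["normal"] * 8 + ["normal", "attack"]
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(accuracy, macro_f1(y_true, y_pred), g_mean(y_true, y_pred))
```

In this toy case accuracy is 0.9 even though half of the attacks are missed, while the G-mean drops to about 0.71, which is why imbalanced-data studies tend to report it.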
A Modified Generative Adversarial Network for Fault Diagnosis in High-Speed Train Components with Imbalanced and Heterogeneous Monitoring Data
17
Authors: Chong Wang, Jie Liu, Enrico Zio. Journal of Dynamics, Monitoring and Diagnostics, 2022, No. 2, pp. 84-92 (9 pages)
Data-driven methods are widely considered for fault diagnosis in complex systems. However, in practice, the between-class imbalance due to limited faulty samples may deteriorate their classification performance. To address this issue, synthetic minority methods for enhancing data have proven effective in many applications. Generative adversarial networks (GANs), capable of automatic feature extraction, can also be adopted for augmenting the faulty samples. However, the monitoring data of a complex system may include not only continuous signals but also discrete/categorical signals. Since current GAN methods still have difficulty handling such heterogeneous monitoring data, a Mixed Dual Discriminator GAN (denoted M-D2GAN) is proposed in this work. To bring the expanded fault samples closer to the real situation and to improve the accuracy and robustness of the fault diagnosis model, different types of variables are generated in different ways, including floating-point, integer, categorical, and hierarchical. To account for the class imbalance problem, the GAN model is modified by adding a normal-class discriminator. A practical case study concerning the braking system of a high-speed train is carried out to verify the effectiveness of the proposed framework. Compared to the classic GAN, the proposed framework achieves better results with respect to the F-measure and G-mean metrics.
Keywords: braking system; fault diagnosis; generative adversarial network; heterogeneous data; high-speed train; imbalanced data
Imbalanced Electricity Theft Detection Based on Resampling and Hybrid Ensemble Learning
18
Authors: 游文霞, 梁皓, 杨楠, 李清清, 吴永华, 李文武. 《电网技术》 (Power System Technology; EI, CSCD, PKU Core), 2024, No. 2, pp. 730-739 (10 pages)
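To address the bias in electricity theft detection caused by class imbalance among power users, this paper proposes an imbalanced electricity theft detection model based on resampling and hybrid ensemble learning. First, the optimal number of sampling subsets is determined within the Easy-Ensemble hybrid ensemble learning framework. Then, an adaptive resampling strategy chooses the resampling method for the detection model according to the imbalance degree of the users' electricity consumption dataset and the optimal number of sampling subsets, so that the consumption data become balanced. Finally, theft detection is performed on the resampled, balanced samples using a hybrid ensemble that first applies serial ensembling to reduce bias and then parallel ensembling to reduce variance. Comparative case studies show that, through resampling and hybrid ensembling, the proposed model effectively resolves the bias of traditional ensemble algorithms in imbalanced theft detection, reduces the influence of data imbalance on the ensemble result, and improves theft detection under imbalanced user classes. Its accuracy, F1 score, and G-mean are all excellent across multiple imbalance degrees.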
Keywords: electricity theft detection; imbalanced data; resampling; ensemble learning; Easy-Ensemble framework
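The resampling step described in this abstract can be illustrated with a minimal sketch of the generic EasyEnsemble idea: the majority class is undersampled several times, and each draw is paired with all minority samples to form one balanced training subset. This is not the paper's adaptive strategy, whose details the abstract does not give; `balanced_subsets`, its parameters, and the integer stand-ins for user records are all hypothetical.

```python
import random


def balanced_subsets(majority, minority, n_subsets, seed=0):
    """Build n_subsets balanced training sets.

    Each set contains every minority sample plus an equally sized random
    draw (without replacement) from the majority class.
    """
    rng = random.Random(seed)
    k = min(len(minority), len(majority))
    subsets = []
    for _ in range(n_subsets):
        sampled_majority = rng.sample(majority, k)
        subsets.append(sampled_majority + list(minority))
    return subsets


# Hypothetical data: integers stand in for user consumption records.
normal_users = list(range(100))       # majority class
theft_users = list(range(100, 110))   # minority class

subsets = balanced_subsets(normal_users, theft_users, n_subsets=4)
for i, subset in enumerate(subsets):
    print(i, len(subset))
```

In a full pipeline each subset would train one boosted learner (the serial stage), and the learners' outputs would then be combined (the parallel stage). The number of subsets is fixed by hand here, whereas the abstract says it is determined as part of the method.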
Rail Surface State Recognition Based on an Attention Network and Cost-Sensitive Learning under Imbalanced Data
19
Authors: 于惠钧, 张锦圣, 刘建华, 彭慈兵, 刘丽丽, 龚事引. 《科学技术与工程》 (Science Technology and Engineering; PKU Core), 2024, No. 5, pp. 1972-1979 (8 pages)
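Accurate recognition of the rail surface state provides a key basis for improving train traction and braking performance. Traditional cost-sensitive learning, when applied to imbalanced rail surface state recognition, suffers from two problems: samples of the same class differ in importance, and majority-class accuracy declines. To address these, a rail surface state recognition method based on an attention network and cost-sensitive learning is proposed. First, transfer learning is used to transfer features from a balanced dataset to the imbalanced rail surface state dataset, reducing the impact of misclassified minority-class samples. Second, a convolutional attention mechanism module is introduced into the ResNet18 backbone to strengthen the network's feature learning on target regions and its perception of global feature information, and to adjust and optimize the network weights. Finally, an adaptive weighted balanced loss function is constructed according to the importance of each rail surface state sample, which reduces overfitting of the decision boundary to majority-class hard samples and yields a smoother decision boundary. Experiments on imbalanced data show that, at three imbalance ratios, the accuracy and recall of the proposed method reach 96.00%, 90.67%, and 86.33%, respectively, improvements of 7.00%, 2.34%, and 3.00% over the commonly used Focal method. In addition, the method maintains majority-class recall while improving minority-class recall, and it reduces network training time.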
Keywords: rail surface state recognition; imbalanced data; cost-sensitive learning; attention mechanism
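The abstract compares against "Focal", presumably the focal loss commonly used for imbalanced classification. As a point of reference, here is a minimal sketch of that baseline for a single sample. It is not the paper's adaptive weighted balanced loss, which the abstract does not specify; the default values of `gamma` and `alpha` are conventional choices, not values taken from the paper.

```python
import math


def cross_entropy(p_true):
    """Standard cross-entropy for one sample, given the probability
    the model assigns to the true class."""
    p = min(max(p_true, 1e-12), 1.0)  # clamp to avoid log(0)
    return -math.log(p)


def focal_loss(p_true, gamma=2.0, alpha=1.0):
    """Focal loss for one sample: -alpha * (1 - p)^gamma * log(p).

    The (1 - p)^gamma factor shrinks the loss of well-classified (easy)
    samples so that hard samples dominate the gradient.
    """
    p = min(max(p_true, 1e-12), 1.0)
    return -alpha * (1.0 - p) ** gamma * math.log(p)


for p in (0.9, 0.5, 0.1):
    print(p, cross_entropy(p), focal_loss(p))
```

With `gamma=0` and `alpha=1` the focal loss reduces to plain cross-entropy. Larger `gamma` down-weights easy samples more strongly, and sample-importance weighting schemes such as the one the abstract describes modify this kind of per-sample weighting.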
An Advantageous-Boundary Minority-Class Sample Synthesis Algorithm Based on Weighted Distance
20
Authors: 何田中, 郑艺峰, 胡敏杰. 《闽南师范大学学报(自然科学版)》 (Journal of Minnan Normal University, Natural Science), 2024, No. 1, pp. 54-64 (11 pages)
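An advantageous-boundary minority-class sample synthesis algorithm based on weighted distance (ABWD) is proposed to overcome class imbalance in data. ABWD has the following characteristics: 1) a weighted distance is defined, and each sample's nearest neighbors are selected using that distance; 2) a sample's neighbors determine whether it is a boundary sample of the minority class; 3) for each minority-class boundary sample, the synthesis positions and the number of synthetic samples are chosen so that, after synthesis, the number of minority-class samples among its neighbors is no less than the number of majority-class samples, ensuring that the minority-class sample has an advantageous boundary. Experimental results show that, compared with other typical oversampling algorithms, the algorithm substantially improves minority-class classification performance and achieves good results on three measures: G-mean, F-measure, and recall.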
Keywords: data mining; imbalanced data; oversampling; advantageous boundary; weighted distance
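Steps 1) and 2) of the abstract can be sketched roughly as follows. The abstract does not define the weighted distance or the boundary rule, so this sketch assumes a feature-weighted Euclidean distance and treats a minority sample as a boundary sample when at least one of its k nearest neighbors belongs to the majority class. All function names, weights, and data points are hypothetical.

```python
import math


def weighted_distance(a, b, weights):
    """Feature-weighted Euclidean distance (assumed form, not the paper's)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def k_nearest(index, points, weights, k):
    """Indices of the k nearest neighbours of points[index], excluding itself."""
    others = [j for j in range(len(points)) if j != index]
    others.sort(key=lambda j: weighted_distance(points[index], points[j], weights))
    return others[:k]


def neighbour_counts(index, points, labels, weights, k, minority_label):
    """Return (minority, majority) counts among the k nearest neighbours."""
    neighbours = k_nearest(index, points, weights, k)
    n_min = sum(1 for j in neighbours if labels[j] == minority_label)
    return n_min, len(neighbours) - n_min


def is_boundary_minority(index, points, labels, weights, k, minority_label):
    """Assumed rule: a minority sample with at least one majority neighbour."""
    if labels[index] != minority_label:
        return False
    _, n_maj = neighbour_counts(index, points, labels, weights, k, minority_label)
    return n_maj > 0


points = [(0.0, 0.0), (0.2, 0.0), (1.0, 0.0), (1.2, 0.0), (1.4, 0.0), (3.0, 0.0)]
labels = ["min", "min", "min", "maj", "maj", "maj"]
weights = (1.0, 1.0)

for i in range(3):
    print(i, is_boundary_minority(i, points, labels, weights, 2, "min"))
```

Under these assumptions, the shortfall `n_maj - n_min` for a boundary sample suggests how many synthetic minority neighbors step 3) would need to add before minority samples are no fewer than majority samples in its neighborhood. How ABWD actually places the synthetic samples is not stated in the abstract.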