Journal Articles
10 articles found
1. An Imbalanced Dataset and Class Overlapping Classification Model for Big Data (Cited: 1)
Authors: Mini Prince, P.M. Joe Prathap. Computer Systems Science & Engineering (SCIE, EI), 2023, Issue 2, pp. 1009-1024 (16 pages).
Most modern technologies, such as social media, smart cities, and the internet of things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers become stuck in the local optimum problem. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing the issues of class imbalance. Among all of these techniques, the Synthetic Minority Oversampling TechniquE (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that its practical applicability is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy that distributes the operation of the algorithm among a group of computational nodes to address this problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each map block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE has good scalability within the proposed framework and also reduces processing time.
Keywords: imbalanced dataset, class overlapping, SMOTE, MapReduce, parallel programming, oversampling
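A minimal, single-node sketch of the SMOTE-then-decision-tree step described in this abstract, assuming the third-party imbalanced-learn and scikit-learn packages and a synthetic dataset; the paper's MapReduce distribution of this step across computational nodes is not reproduced here.

```python
# Single-node illustration of one "map block" from the pipeline: balance the
# block with SMOTE, then fit a decision tree on the balanced block.
from collections import Counter

from imblearn.over_sampling import SMOTE
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier

# Synthetic imbalanced data standing in for one data block.
X, y = make_classification(n_samples=5000, weights=[0.95, 0.05], random_state=0)
print("before:", Counter(y))

# Generate synthetic minority-class samples to balance the block.
X_bal, y_bal = SMOTE(random_state=0).fit_resample(X, y)
print("after: ", Counter(y_bal))

# Per the described second stage, a decision tree is trained on each balanced block.
tree = DecisionTreeClassifier(random_state=0).fit(X_bal, y_bal)
```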
2. Data-Driven Decision-Making for Bank Target Marketing Using Supervised Learning Classifiers on Imbalanced Big Data
Authors: Fahim Nasir, Abdulghani Ali Ahmed, Mehmet Sabir Kiraz, Iryna Yevseyeva, Mubarak Saif. Computers, Materials & Continua (SCIE, EI), 2024, Issue 10, pp. 1703-1728 (26 pages).
Integrating machine learning and data mining is crucial for processing big data and extracting valuable insights to enhance decision-making. However, imbalanced target variables within big data present technical challenges that hinder the performance of supervised learning classifiers on key evaluation metrics, limiting their overall effectiveness. This study presents a comprehensive review of both common and recently developed Supervised Learning Classifiers (SLCs) and evaluates their performance in data-driven decision-making. The evaluation uses various metrics, with a particular focus on the Harmonic Mean Score (F-1 score), on an imbalanced real-world bank target marketing dataset. The findings indicate that grid-search random forest and random-search random forest excel in precision and area under the curve, while Extreme Gradient Boosting (XGBoost) outperforms other traditional classifiers in terms of F-1 score. Employing oversampling methods to address the imbalanced data shows significant performance improvement in XGBoost, delivering superior results across all metrics, particularly when using the SMOTE variant known as the BorderlineSMOTE2 technique. The study concludes with several key factors for effectively addressing the challenges of supervised learning with imbalanced datasets: selecting appropriate datasets for training and testing, choosing the right classifiers, employing effective techniques for processing and handling imbalanced datasets, and identifying suitable metrics for performance evaluation. These factors also entail the utilisation of effective exploratory data analysis in conjunction with visualisation techniques to yield insights conducive to data-driven decision-making.
Keywords: big data, machine learning, data mining, data visualization, label encoding, imbalanced dataset, sampling techniques
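A hedged sketch of the oversampling-plus-XGBoost setup highlighted in this abstract, assuming the imbalanced-learn and xgboost packages and using a synthetic dataset in place of the bank target marketing data from the study.

```python
# Oversample the training split with the BorderlineSMOTE2 variant, then train
# XGBoost and report the F1 score on the untouched test split.
from imblearn.over_sampling import BorderlineSMOTE
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

X, y = make_classification(n_samples=10000, weights=[0.9, 0.1], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

# Oversampling is applied only to the training data to avoid test-set leakage.
X_res, y_res = BorderlineSMOTE(kind="borderline-2", random_state=1).fit_resample(X_tr, y_tr)

model = XGBClassifier(n_estimators=200, random_state=1)
model.fit(X_res, y_res)
print("F1:", f1_score(y_te, model.predict(X_te)))
```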
3. Fast Maximum Entropy Machine for Big Imbalanced Datasets
Authors: Feng Yin, Shuqing Lin, Chuxin Piao, Shuguang (Robert) Cui. Journal of Communications and Information Networks, 2018, Issue 3, pp. 20-30 (11 pages).
Driven by the needs of a plethora of machine learning applications, several attempts have been made at improving the performance of classifiers applied to imbalanced datasets. In this paper, we present a fast maximum entropy machine (MEM) combined with a synthetic minority over-sampling technique for handling binary classification problems with high imbalance ratios, large numbers of data samples, and medium/large numbers of features. A random Fourier feature representation of kernel functions and the primal estimated sub-gradient solver for support vector machines (PEGASOS) are applied to speed up the classic MEM. Experiments have been conducted using various real datasets (including two China Mobile datasets and several other standard test datasets) with various configurations. The obtained results demonstrate that the proposed algorithm has extremely low complexity but excellent overall classification performance (in terms of several widely used evaluation metrics) compared to the classic MEM and some other state-of-the-art methods. The proposed algorithm is particularly valuable in big data applications owing to its significantly low computational complexity.
Keywords: binary classification, imbalanced datasets, maximum entropy machine, PEGASOS, random Fourier feature, SMOTE
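A rough sketch of the two speed-up ingredients named in this abstract: a random Fourier feature (RFF) approximation of an RBF kernel, followed by a PEGASOS-style stochastic hinge-loss solver, approximated here with scikit-learn's SGDClassifier. The MEM objective itself is not reproduced, and the data is synthetic.

```python
# Map inputs to random Fourier features, then train a linear hinge-loss model
# with stochastic gradient descent on the randomized features.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(0)
X, y = make_classification(n_samples=20000, n_features=30,
                           weights=[0.95, 0.05], random_state=0)

# RFF map: z(x) = sqrt(2/D) * cos(xW + b) approximates an RBF kernel
# exp(-gamma * ||x - x'||^2) when W ~ N(0, 2*gamma).
D, gamma = 256, 0.1                     # number of random features, kernel width
W = rng.normal(scale=np.sqrt(2 * gamma), size=(X.shape[1], D))
b = rng.uniform(0, 2 * np.pi, size=D)
Z = np.sqrt(2.0 / D) * np.cos(X @ W + b)

# PEGASOS-style training: SGD on the regularized hinge loss.
clf = SGDClassifier(loss="hinge", alpha=1e-4, max_iter=20, random_state=0).fit(Z, y)
print("training accuracy:", clf.score(Z, y))
```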
4. AMDnet: An Academic Misconduct Detection Method for Authors' Behaviors
Authors: Shihao Zhou, Ziyuan Xu, Jin Han, Xingming Sun, Yi Cao. Computers, Materials & Continua (SCIE, EI), 2022, Issue 6, pp. 5995-6009 (15 pages).
In recent years, academic misconduct has been frequently exposed by the media, with serious impacts on the academic community. Current research on academic misconduct focuses mainly on detecting plagiarism in article content through the application of character-based and non-text element detection techniques over the entirety of a manuscript. For the most part, these techniques can only detect cases of textual plagiarism, which means that potential culprits can easily avoid discovery through clever editing and alteration of text content. In this paper, we propose an academic misconduct detection method based on scholars' submission behaviors. The model can effectively capture an author's atypical behavioral approach and operations. As such, it is able to detect various types of misconduct, thereby improving the accuracy of detection when combined with a text content analysis. The model learns by forming a dual network group that processes text features and user behavior features to detect potential academic misconduct. First, the effect of scholars' behavioral features on the model is considered and analyzed. Second, the Synthetic Minority Oversampling Technique (SMOTE) is applied to address the problem of imbalanced samples of positive and negative classes among contributing scholars. Finally, the text features of the papers are combined with the scholars' behavioral data to improve recognition precision. Experimental results on the imbalanced dataset demonstrate that our model performs highly satisfactorily in terms of accuracy and recall.
Keywords: academic misconduct, neural network, imbalanced dataset
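A toy sketch of the "dual network group" idea from this abstract: one branch for text features and one for submission-behavior features, concatenated before a classification head. It assumes TensorFlow/Keras, and all layer sizes and input dimensions are illustrative assumptions rather than the paper's architecture.

```python
# Two-branch network: text features and behavior features are processed
# separately, then merged for a single misconduct-probability output.
import tensorflow as tf

text_in = tf.keras.Input(shape=(300,), name="text_features")
behav_in = tf.keras.Input(shape=(20,), name="behavior_features")

t = tf.keras.layers.Dense(64, activation="relu")(text_in)
b = tf.keras.layers.Dense(16, activation="relu")(behav_in)

merged = tf.keras.layers.Concatenate()([t, b])
out = tf.keras.layers.Dense(1, activation="sigmoid")(merged)  # misconduct probability

model = tf.keras.Model(inputs=[text_in, behav_in], outputs=out)
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.summary()
```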
5. Tackling imbalanced data in cybersecurity with transfer learning: a case with ROP payload detection
Authors: Haizhou Wang, Anoop Singhal, Peng Liu. Cybersecurity (EI, CSCD), 2023, Issue 2, pp. 29-43 (15 pages).
In recent years, deep learning has gained proliferating popularity in the cybersecurity application domain, since, compared to traditional machine learning methods, it usually involves less human effort, produces better results, and provides better generalizability. However, the imbalanced data issue is very common in cybersecurity and can substantially deteriorate the performance of deep learning models. This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity, using return-oriented programming (ROP) payload detection as a case study. We achieved a 0.0290 average false positive rate, 0.9705 average F1 score and 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs, with 0 benign training data samples in the target domain. The performance improvement compared to the baseline is a trade-off between false positive rate and detection rate: using our approach, the total number of false positives is reduced by 23.16%, while the number of detected malicious samples decreases by 0.68%.
Keywords: domain adaptation, return-oriented programming, imbalanced dataset
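For reference, a small helper computing the three metrics quoted in this abstract (false positive rate, F1 score, detection rate) from predicted and true labels; the toy labels below are illustrative, not the paper's data.

```python
# Derive FPR, F1 and detection rate (recall on the malicious class) from a
# binary confusion matrix, with class 1 = malicious (ROP payload).
import numpy as np
from sklearn.metrics import confusion_matrix, f1_score

def rop_metrics(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    fpr = fp / (fp + tn)             # benign samples wrongly flagged
    detection_rate = tp / (tp + fn)  # malicious samples correctly detected
    return fpr, f1_score(y_true, y_pred), detection_rate

y_true = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y_pred = np.array([0, 0, 1, 0, 1, 1, 0, 1])
print(rop_metrics(y_true, y_pred))   # -> (0.25, 0.75, 0.75)
```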
6. Electricity Theft Detection Method Based on Ensemble Learning and Prototype Learning
Authors: Xinwu Sun, Jiaxiang Hu, Zhenyuan Zhang, Di Cao, Qi Huang, Zhe Chen, Weihao Hu. Journal of Modern Power Systems and Clean Energy (SCIE, EI, CSCD), 2024, Issue 1, pp. 213-224 (12 pages).
With the development of advanced metering infrastructure (AMI), large amounts of electricity consumption data can be collected for electricity theft detection. However, the imbalance of electricity consumption data is severe, which makes the training of detection models challenging. This paper therefore proposes an electricity theft detection method based on ensemble learning and prototype learning, which performs well on imbalanced datasets and on abnormal data with different abnormality levels. A convolutional neural network (CNN) and long short-term memory (LSTM) are employed to obtain abstract features from the electricity consumption data. After calculating the means of the abstract features, a prototype per class is obtained, which is used to predict the labels of unknown samples. Meanwhile, by training the network on different balanced subsets of the training set, the prototypes become representative. Compared with mainstream methods including CNN, random forest (RF) and others, the proposed method has been proved to deal effectively with electricity theft detection when abnormal data account for only 2.5% and 1.25% of normal data. The results show that the proposed method outperforms other state-of-the-art methods.
Keywords: electricity theft detection, ensemble learning, prototype learning, imbalanced dataset, deep learning, abnormal level
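A minimal sketch of the prototype-learning step described in this abstract: each class prototype is the mean of its feature vectors, and an unknown sample takes the label of the nearest prototype. The CNN/LSTM feature extraction used in the paper is omitted, and the consumption profiles below are random stand-ins.

```python
# Nearest-prototype classification: fit one mean vector per class, then assign
# each query sample to the class whose prototype is closest in Euclidean distance.
import numpy as np

def fit_prototypes(X, y):
    return {c: X[y == c].mean(axis=0) for c in np.unique(y)}

def predict(prototypes, X):
    classes = list(prototypes)
    dists = np.stack([np.linalg.norm(X - prototypes[c], axis=1) for c in classes], axis=1)
    return np.array(classes)[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
X_normal = rng.normal(0.0, 1.0, size=(200, 48))   # 48 half-hourly readings per day
X_theft = rng.normal(-1.5, 1.0, size=(5, 48))     # rare abnormal consumption profiles
X = np.vstack([X_normal, X_theft])
y = np.array([0] * 200 + [1] * 5)

protos = fit_prototypes(X, y)
print(predict(protos, X_theft[:3]))               # likely [1 1 1]
```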
7. Enhancing Telemarketing Success Using Ensemble-Based Online Machine Learning
Authors: Shahriar Kaisar, Md Mamunur Rashid, Abdullahi Chowdhury, Sakib Shahriar Shafin, Joarder Kamruzzaman, Abebe Diro. Big Data Mining and Analytics (EI, CSCD), 2024, Issue 2, pp. 294-314 (21 pages).
Telemarketing is a well-established marketing approach for offering products and services to prospective customers. The effectiveness of such an approach, however, is highly dependent on the selection of the appropriate consumer base, as reaching uninterested customers induces annoyance and consumes costly enterprise resources in vain while missing interested ones. The introduction of business intelligence and machine learning models can positively influence the decision-making process by predicting the potential customer base, and the existing literature in this direction shows promising results. However, the selection of influential features and the construction of effective learning models for improved performance remain a challenge. Furthermore, from the modelling perspective, the class-imbalanced nature of the training data, where samples with unsuccessful outcomes highly outnumber successful ones, further compounds the problem by creating biased and inaccurate models. Additionally, customer preferences are likely to change over time for various reasons, and/or a fresh group of customers may be targeted for a new product or service, necessitating model retraining, which is not addressed at all in existing works. A major challenge in model retraining is maintaining a balance between stability (retaining older knowledge) and plasticity (being receptive to new information). To address the above issues, this paper proposes an ensemble machine learning model with feature selection and oversampling techniques to identify potential customers more accurately. A novel online learning method is proposed for model retraining when new samples become available over time. This newly introduced method equips the proposed approach to deal with dynamic data, improving its readiness for practical adoption, and is a highly useful addition to the literature. Extensive experiments with real-world data show that the proposed approach achieves excellent results in all cases (e.g., 98.6% accuracy in classifying customers) and outperforms recent competing models in the literature by a considerable margin of 3% on a widely used dataset.
Keywords: machine learning, online learning, oversampling, telemarketing, imbalanced dataset, ensemble model
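A minimal sketch of incremental model updating in the spirit of this abstract, using scikit-learn's partial_fit API on a single SGD classifier; the paper's ensemble, feature selection, and oversampling components are not reproduced, and the streamed chunks are synthetic.

```python
# Update a classifier chunk by chunk as new customer records arrive, instead of
# retraining from scratch each time.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=6000, weights=[0.88, 0.12], random_state=2)
clf = SGDClassifier(loss="log_loss", random_state=2)

classes = np.unique(y)
for start in range(0, len(X), 1000):
    X_chunk, y_chunk = X[start:start + 1000], y[start:start + 1000]
    clf.partial_fit(X_chunk, y_chunk, classes=classes)  # incremental update

print("accuracy on all seen data:", clf.score(X, y))
```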
8. Applicability of deep neural networks for lithofacies classification from conventional well logs: An integrated approach
Authors: Saud Qadir Khan, Farzain Ud Din Kirmani. Petroleum Research (EI), 2024, Issue 3, pp. 393-408 (16 pages).
Parametric understanding for specifying formation characteristics can be gained through conventional approaches. In particular, attributes of reservoir lithology are used for hydrocarbon exploration. Well logging is a conventional approach that can predict lithology efficiently compared to geophysical modeling and petrophysical analysis, owing to its cost effectiveness and suitable interpretation time. However, manual interpretation of lithology from well logging data requires domain expertise and an extended length of time. Therefore, in this study a Deep Neural Network (DNN) has been deployed to automate lithology identification from well logging data, making lithology monitoring more time-effective. The DNN model for predicting formation lithology is optimized through a thorough evaluation of the best parameters and hyperparameters, including the number of neurons, number of layers, optimizer, learning rate, dropout values, and activation functions. The accuracy of the model is examined using different evaluation metrics after dividing the dataset into training, validation and testing subsets. Additionally, an attempt is made to remove interception in formation lithology prediction while addressing the imbalanced nature of the associated dataset in the training process using class weights. It is found that accuracy alone is not a reliable metric for evaluating the lithology classification model: the model with class weights recognizes all the classes but has low accuracy as well as a low F1-score, while the LSTM-based model has high accuracy as well as a high F1-score.
Keywords: lithology identification, deep learning, LSTM, imbalanced dataset, conventional well logs
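A small sketch of the class-weighting idea mentioned in this abstract, assuming scikit-learn for deriving balanced weights and Keras for consuming them; the facies labels, class counts, log curves and the tiny network are illustrative assumptions only.

```python
# Compute "balanced" class weights from imbalanced facies labels and pass them
# to the training loop so minority facies contribute more to the loss.
import numpy as np
import tensorflow as tf
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced facies labels (e.g., 0=shale, 1=sand, 2=limestone) for 1000 depth samples.
y = np.array([0] * 700 + [1] * 250 + [2] * 50)
X = np.random.default_rng(0).normal(size=(len(y), 5))    # 5 well-log curves

weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
class_weight = dict(enumerate(weights))                  # {0: ~0.48, 1: ~1.33, 2: ~6.67}

model = tf.keras.Sequential([
    tf.keras.Input(shape=(5,)),
    tf.keras.layers.Dense(32, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy", metrics=["accuracy"])
model.fit(X, y, epochs=3, class_weight=class_weight, verbose=0)
```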
9. Measure oriented training: a targeted approach to imbalanced classification problems (Cited: 1)
Authors: Bo YUAN, Wenhuang LIU. Frontiers of Computer Science (SCIE, EI, CSCD), 2012, Issue 5, pp. 489-497 (9 pages).
Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques, including sampling and cost-sensitive learning, are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is clearly a gap between the measure according to which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space in order to bridge this gap. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
Keywords: imbalanced datasets, genetic algorithms (GAs), neural networks, G-mean, synthetic minority over-sampling technique (SMOTE)
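A minimal sketch of the G-mean measure used as the training objective in this paper, computed from a binary confusion matrix; the genetic algorithm and neural network around it are not shown, and the example labels are made up.

```python
# G-mean = sqrt(sensitivity * specificity): the fitness a GA would maximize
# when evolving the classifier on an imbalanced problem.
import numpy as np
from sklearn.metrics import confusion_matrix

def g_mean(y_true, y_pred):
    tn, fp, fn, tp = confusion_matrix(y_true, y_pred, labels=[0, 1]).ravel()
    sensitivity = tp / (tp + fn)   # recall on the minority (positive) class
    specificity = tn / (tn + fp)   # recall on the majority (negative) class
    return np.sqrt(sensitivity * specificity)

y_true = np.array([0] * 90 + [1] * 10)
y_pred = np.array([0] * 85 + [1] * 5 + [1] * 7 + [0] * 3)
print("G-mean:", g_mean(y_true, y_pred))   # ~0.81 for this toy split
```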
10. ConfDTree: A Statistical Method for Improving Decision Trees (Cited: 3)
Authors: Gilad Katz, Asaf Shabtai, Lior Rokach, Nir Ofek. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2014, Issue 3, pp. 392-407 (16 pages).
Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets, for which an average improvement of 5%-9% in AUC performance is reported.
Keywords: decision tree, confidence interval, imbalanced dataset
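A generic illustration of the two statistical tools named in this abstract: a normal-approximation confidence interval for a leaf's class proportion, and a two-proportion z-test comparing two candidate routes. This is a textbook sketch, not the ConfDTree implementation, and the counts are made up.

```python
# Confidence interval for a proportion and a two-sided two-proportion z-test,
# the kind of checks used to flag hard-to-classify instances and compare routes.
import numpy as np
from scipy.stats import norm

def proportion_ci(successes, n, alpha=0.05):
    p = successes / n
    half = norm.ppf(1 - alpha / 2) * np.sqrt(p * (1 - p) / n)
    return p - half, p + half

def two_proportion_ztest(s1, n1, s2, n2):
    p1, p2 = s1 / n1, s2 / n2
    pooled = (s1 + s2) / (n1 + n2)
    z = (p1 - p2) / np.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return z, 2 * norm.sf(abs(z))            # z statistic, two-sided p-value

print(proportion_ci(18, 25))                 # CI for a leaf labeling 18 of 25 samples as class A
print(two_proportion_ztest(18, 25, 40, 80))  # does an alternative route differ significantly?
```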