Abstract: Most modern technologies, such as social media, smart cities, and the Internet of Things (IoT), rely on big data. When big data is used in real-world applications, two data challenges arise: class overlap and class imbalance. When dealing with large datasets, most traditional classifiers get stuck in local optima. As a result, it is necessary to look into new methods for dealing with large data collections. Several solutions have been proposed for overcoming this issue. The rapid growth of the available data threatens to limit the usefulness of many traditional methods. Methods such as oversampling and undersampling have shown great promise in addressing class imbalance. Among these techniques, the Synthetic Minority Oversampling TechniquE (SMOTE) has produced the best results by generating synthetic samples for the minority class to create a balanced dataset. The issue is that the practical applicability of these methods is restricted to problems involving tens of thousands of instances or fewer. In this paper, we propose a parallel method using SMOTE and a MapReduce strategy, which distributes the operation of the algorithm among a group of computational nodes to address the aforementioned problem. Our proposed solution is divided into three stages. The first stage splits the data into different blocks using a mapping function, followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algorithm to solve the class imbalance problem. On each map block, a decision tree model is constructed. Finally, the decision tree blocks are combined to create a classification model. We used numerous datasets with up to 4 million instances in our experiments to test the proposed scheme's capabilities. The results indicate that the hybrid SMOTE scales well within the proposed framework and also cuts down the processing time.
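The core idea behind SMOTE, generating a synthetic minority sample by interpolating between a minority point and one of its nearest minority neighbours, can be sketched as follows. This is a minimal illustration of plain SMOTE only, not the hybrid SMOTE or the MapReduce pipeline described in the abstract; the function name `smote`, its parameters, and the toy data are assumptions made for the example.

```python
import random
from math import dist


def smote(minority, n_new, k=3, seed=0):
    """Plain SMOTE sketch (hypothetical helper, not the paper's hybrid variant).

    Each synthetic point lies on the segment between a randomly chosen
    minority sample and one of its k nearest minority neighbours.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the chosen sample (excluding itself)
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(b + gap * (o - b) for b, o in zip(base, other)))
    return synthetic


# Toy minority class with four 2-D samples (illustrative values only)
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote(minority, n_new=6)
print(new_points)
```

Because every synthetic point is a convex combination of two existing minority samples, the new points stay inside the region already occupied by the minority class. In a block-wise setting such as the one the abstract describes, a routine of this kind would presumably be applied to each data block independently before a model is trained on it.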
Funding: The author Feng Yin was funded by the Shenzhen Science and Technology Innovation Council (No. JCYJ20170307155957688) and by the National Natural Science Foundation of China Key Project (No. 61731018). The authors Feng Yin and Shuguang (Robert) Cui were funded by Shenzhen Fundamental Research Funds under Grant (Key Lab) No. ZDSYS201707251409055 and Grant (Peacock) No. KQTD2015033114415450, and by the Guangdong Province "The Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017" - Data Driven Evolution of Future Intelligent Network Team. The associate editor coordinating the review of this paper and approving it for publication was X. Cheng.
Abstract: Driven by the needs of a plethora of machine learning applications, several attempts have been made to improve the performance of classifiers applied to imbalanced datasets. In this paper, we present a fast maximum entropy machine (MEM) combined with a synthetic minority over-sampling technique for handling binary classification problems with high imbalance ratios, large numbers of data samples, and medium to large numbers of features. A random Fourier feature representation of kernel functions and the primal estimated sub-gradient solver for support vector machines (PEGASOS) are applied to speed up the classic MEM. Experiments were conducted on various real datasets (including two China Mobile datasets and several other standard test datasets) with various configurations. The results demonstrate that the proposed algorithm has extremely low complexity yet excellent overall classification performance (in terms of several widely used evaluation metrics) compared with the classic MEM and some other state-of-the-art methods. The proposed algorithm is particularly valuable in big data applications owing to its low computational complexity.
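To illustrate why a random Fourier feature representation speeds up kernel methods, the sketch below approximates a Gaussian (RBF) kernel by an inner product of explicit random features. It is a generic illustration of the technique, not the authors' fast MEM; the kernel choice, the helper names `make_rff` and `rbf_kernel`, and all parameter values are assumptions made for the example.

```python
import math
import random


def make_rff(dim, n_features, gamma, seed=0):
    """Build a random Fourier feature map for k(x, y) = exp(-gamma * ||x - y||^2).

    Hypothetical helper: frequencies are drawn from N(0, 2 * gamma) per
    dimension and phases uniformly from [0, 2 * pi).
    """
    rng = random.Random(seed)
    std = math.sqrt(2.0 * gamma)
    weights = [[rng.gauss(0.0, std) for _ in range(dim)] for _ in range(n_features)]
    offsets = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(n_features)]
    scale = math.sqrt(2.0 / n_features)

    def transform(x):
        # One cosine feature per random frequency/phase pair
        return [
            scale * math.cos(sum(w_i * x_i for w_i, x_i in zip(w, x)) + b)
            for w, b in zip(weights, offsets)
        ]

    return transform


def rbf_kernel(x, y, gamma):
    """Exact Gaussian kernel, used here only as a reference value."""
    return math.exp(-gamma * math.dist(x, y) ** 2)


transform = make_rff(dim=2, n_features=4000, gamma=0.5)
x, y = (0.3, -0.2), (1.0, 0.4)
approx = sum(a * b for a, b in zip(transform(x), transform(y)))
exact = rbf_kernel(x, y, gamma=0.5)
print(f"approximate kernel: {approx:.3f}, exact kernel: {exact:.3f}")
```

Once the data are mapped to such explicit features, a linear solver (for example a sub-gradient method in the style of PEGASOS) can be trained on them directly, which avoids building a full kernel matrix over all sample pairs.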
Funding: This work is supported by the National Key R&D Program of China under grant 2018YFB1003205; by the National Natural Science Foundation of China under grants U1836208 and U1836110; by the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD) fund; and by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology (CICAEET) fund, China.
Abstract: In recent years, academic misconduct has been frequently exposed by the media, with serious impacts on the academic community. Current research on academic misconduct focuses mainly on detecting plagiarism in article content by applying character-based and non-text element detection techniques over the entirety of a manuscript. For the most part, these techniques can only detect textual plagiarism, which means that potential culprits can easily avoid discovery through clever editing and alteration of the text. In this paper, we propose an academic misconduct detection method based on scholars' submission behaviors. The model can effectively capture atypical behavioral patterns and operations of the author. As such, it is able to detect various types of misconduct, thereby improving detection accuracy when combined with text content analysis. The model learns by forming a dual network group that processes text features and user behavior features to detect potential academic misconduct. First, the effect of scholars' behavioral features on the model is considered and analyzed. Second, the Synthetic Minority Oversampling Technique (SMOTE) is applied to address the imbalance between positive and negative classes among contributing scholars. Finally, the text features of the papers are combined with the scholars' behavioral data to improve recognition precision. Experimental results on the imbalanced dataset demonstrate that our model performs highly satisfactorily in terms of accuracy and recall.
Funding: Supported by NSF CNS-2019340, NSF ECCS-2140175, and NIST 60NANB22D144.
Abstract: In recent years, deep learning has gained popularity in the cybersecurity domain because, compared with traditional machine learning methods, it usually involves less human effort, produces better results, and generalizes better. However, imbalanced data is very common in cybersecurity and can substantially degrade the performance of deep learning models. This paper introduces a transfer-learning-based method to tackle the imbalanced data issue in cybersecurity, using return-oriented programming payload detection as a case study. We achieved an average false positive rate of 0.0290, an average F1 score of 0.9705, and an average detection rate of 0.9521 on 3 different target domain programs using 2 different source domain programs, with 0 benign training samples in the target domain. The performance improvement over the baseline is a trade-off between false positive rate and detection rate. Using our approach, the total number of false positives is reduced by 23.16%, and as a trade-off, the number of detected malicious samples decreases by 0.68%.
Funding: Supported by the National Natural Science Foundation of China (No. 52277083).
Abstract: With the development of advanced metering infrastructure (AMI), large amounts of electricity consumption data can be collected for electricity theft detection. However, electricity consumption data are severely imbalanced, which makes training a detection model challenging. This paper proposes an electricity theft detection method based on ensemble learning and prototype learning, which performs well on imbalanced datasets and on abnormal data with different levels of abnormality. A convolutional neural network (CNN) and long short-term memory (LSTM) are employed to extract abstract features from electricity consumption data. The prototype of each class is obtained by averaging the abstract features, and it is then used to predict the labels of unknown samples. Meanwhile, training the network on different balanced subsets of the training set makes the prototypes representative. Compared with mainstream methods including CNN and random forest (RF), the proposed method is shown to deal effectively with electricity theft detection when abnormal data account for only 2.5% and 1.25% of normal data. The results show that the proposed method outperforms other state-of-the-art methods.
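The prototype step described in the abstract (average the features of each class, then label an unknown sample by its nearest prototype) can be sketched in a few lines. This is a minimal sketch under stated assumptions: the hand-written 2-D vectors stand in for the CNN/LSTM features, Euclidean distance is assumed as the similarity measure, and the helper names `class_prototypes` and `predict` are hypothetical.

```python
from math import dist


def class_prototypes(features, labels):
    """Return one prototype per class: the mean of that class's feature vectors."""
    sums, counts = {}, {}
    for vec, label in zip(features, labels):
        acc = sums.setdefault(label, [0.0] * len(vec))
        for i, value in enumerate(vec):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {
        label: tuple(total / counts[label] for total in acc)
        for label, acc in sums.items()
    }


def predict(prototypes, vec):
    """Assign the label of the nearest prototype (Euclidean distance assumed)."""
    return min(prototypes, key=lambda label: dist(prototypes[label], vec))


# Stand-in feature vectors; in the paper these would come from the CNN/LSTM network
features = [(0.1, 0.2), (0.3, 0.0), (0.2, 0.1), (2.0, 1.9)]
labels = ["normal", "normal", "normal", "theft"]

prototypes = class_prototypes(features, labels)
print(predict(prototypes, (1.8, 2.0)))  # closest to the "theft" prototype
print(predict(prototypes, (0.0, 0.3)))  # closest to the "normal" prototype
```

Since each class is summarised by a single mean vector, the decision rule does not depend directly on how many samples each class contributes, which is one reason prototype-based classifiers are attractive for imbalanced data.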
Abstract: Telemarketing is a well-established marketing approach to offering products and services to prospective customers. The effectiveness of such an approach, however, is highly dependent on selecting the appropriate consumer base, as reaching uninterested customers induces annoyance and wastes costly enterprise resources while interested ones are missed. Business intelligence and machine learning models can positively influence the decision-making process by predicting the potential customer base, and the existing literature in this direction shows promising results. However, the selection of influential features and the construction of effective learning models for improved performance remain a challenge. Furthermore, from the modelling perspective, the class imbalance of the training data, where samples with unsuccessful outcomes greatly outnumber successful ones, compounds the problem by producing biased and inaccurate models. Additionally, customer preferences are likely to change over time for various reasons, and a fresh group of customers may be targeted for a new product or service, necessitating model retraining, which is not addressed at all in existing works. A major challenge in model retraining is maintaining a balance between stability (retaining older knowledge) and plasticity (being receptive to new information). To address the above issues, this paper proposes an ensemble machine learning model with feature selection and oversampling techniques to identify potential customers more accurately. A novel online learning method is proposed for model retraining when new samples become available over time. This method equips the proposed approach to deal with dynamic data, improving the readiness of the proposed model for practical adoption. Extensive experiments with real-world data show that the proposed approach achieves excellent results in all cases (e.g., 98.6% accuracy in classifying customers) and outperforms recent competing models in the literature by a considerable margin of 3% on a widely used dataset.
Abstract: Since the overall prediction error of a classifier on imbalanced problems can be potentially misleading and biased, alternative performance measures such as G-mean and F-measure have been widely adopted. Various techniques including sampling and cost-sensitive learning are often employed to improve the performance of classifiers in such situations. However, the training process of classifiers is still largely driven by traditional error-based objective functions. As a result, there is clearly a gap between the measure according to which the classifier is evaluated and how the classifier is trained. This paper investigates the prospect of explicitly using the appropriate measure itself to search the hypothesis space to bridge this gap. In the case studies, a standard three-layer neural network is used as the classifier, which is evolved by genetic algorithms (GAs) with G-mean as the objective function. Experimental results on eight benchmark problems show that the proposed method can achieve consistently favorable outcomes in comparison with a commonly used sampling technique. The effectiveness of multi-objective optimization in handling imbalanced problems is also demonstrated.
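The gap between overall error and measures such as G-mean and F-measure is easy to see on a small confusion matrix. The sketch below uses the standard definitions (G-mean as the geometric mean of sensitivity and specificity, F-measure as the weighted harmonic mean of precision and recall); the function names and the counts are illustrative assumptions, not results from the paper.

```python
from math import sqrt


def g_mean(tp, fn, tn, fp):
    """Geometric mean of sensitivity (recall on positives) and specificity."""
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return sqrt(sensitivity * specificity)


def f_measure(tp, fn, fp, beta=1.0):
    """F-measure; defined as 0.0 here when there are no true positives."""
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def accuracy(tp, fn, tn, fp):
    return (tp + tn) / (tp + fn + tn + fp)


# Illustrative case: 10 positives, 990 negatives, classifier predicts all negative
print(accuracy(tp=0, fn=10, tn=990, fp=0))  # 0.99, looks excellent
print(g_mean(tp=0, fn=10, tn=990, fp=0))    # 0.0, the minority class is ignored

# A classifier that finds 8 of the 10 positives at the cost of 90 false alarms
print(accuracy(tp=8, fn=2, tn=900, fp=90))
print(g_mean(tp=8, fn=2, tn=900, fp=90))
print(f_measure(tp=8, fn=2, fp=90))
```

The all-negative classifier scores 99% accuracy but a G-mean of zero, which is the kind of mismatch that motivates using G-mean directly as the objective during training, as the abstract proposes.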
Abstract: Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced, or multi-class datasets, for which an average improvement of 5%-9% in AUC is reported.
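For readers unfamiliar with the two-proportion test mentioned in the abstract, the sketch below shows the standard pooled z-test for comparing two proportions. How ConfDTree applies the test inside a tree is not specified in the abstract, so the scenario (comparing class proportions observed in two branches) and the function name `two_proportion_z_test` are assumptions made purely for illustration.

```python
from math import erfc, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Pooled two-proportion z-test; returns (z statistic, two-sided p-value)."""
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Two-sided p-value under the standard normal distribution
    p_value = erfc(abs(z) / sqrt(2))
    return z, p_value


# Hypothetical counts: 45 of 50 instances in one branch belong to the
# majority class, versus 30 of 50 in a sibling branch
z, p_value = two_proportion_z_test(45, 50, 30, 50)
print(f"z = {z:.3f}, p = {p_value:.5f}")
```

A small p-value indicates that the two observed proportions are unlikely to come from the same underlying rate, which is the kind of evidence a post-processing step could use when deciding whether an alternative route through the tree is warranted.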