Driven by the need of a plethora of machine learning applications,several attempts have been made at improving the performance of classifiers applied to imbalanced datasets.In this paper,we present a fast maximum entr...Driven by the need of a plethora of machine learning applications,several attempts have been made at improving the performance of classifiers applied to imbalanced datasets.In this paper,we present a fast maximum entropy machine(MEM)combined with a synthetic minority over-sampling technique for handling binary classification problems with high imbalance ratios,large numbers of data samples,and medium/large numbers of features.A random Fourier feature representation of kernel functions and primal estimated sub-gradient solver for support vector machine(PEGASOS)are applied to speed up the classic MEM.Experiments have been conducted using various real datasets(including two China Mobile datasets and several other standard test datasets)with various configurations.The obtained results demonstrate that the proposed algorithm has extremely low complexity but an excellent overall classification performance(in terms of several widely used evaluation metrics)as compared to the classic MEM and some other state-of-the-art methods.The proposed algorithm is particularly valuable in big data applications owing to its significantly low computational complexity.展开更多
Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imba...Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imbalance arises.When dealing with large datasets,most traditional classifiers are stuck in the local optimum problem.As a result,it’s necessary to look into new methods for dealing with large data collections.Several solutions have been proposed for overcoming this issue.The rapid growth of the available data threatens to limit the usefulness of many traditional methods.Methods such as oversampling and undersampling have shown great promises in addressing the issues of class imbalance.Among all of these techniques,Synthetic Minority Oversampling TechniquE(SMOTE)has produced the best results by generating synthetic samples for the minority class in creating a balanced dataset.The issue is that their practical applicability is restricted to problems involving tens of thousands or lower instances of each.In this paper,we have proposed a parallel mode method using SMOTE and MapReduce strategy,this distributes the operation of the algorithm among a group of computational nodes for addressing the aforementioned problem.Our proposed solution has been divided into three stages.Thefirst stage involves the process of splitting the data into different blocks using a mapping function,followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algo-rithm for solving the class imbalanced problem.On each map block,a decision tree model would be constructed.Finally,the decision tree blocks would be com-bined for creating a classification model.We have used numerous datasets with up to 4 million instances in our experiments for testing the proposed scheme’s cap-abilities.As a result,the Hybrid SMOTE appears to have good scalability within the framework proposed,and it also cuts down the processing time.展开更多
In recent years,academic misconduct has been frequently exposed by the media,with serious impacts on the academic community.Current research on academic misconduct focuses mainly on detecting plagiarism in article con...In recent years,academic misconduct has been frequently exposed by the media,with serious impacts on the academic community.Current research on academic misconduct focuses mainly on detecting plagiarism in article content through the application of character-based and non-text element detection techniques over the entirety of a manuscript.For the most part,these techniques can only detect cases of textual plagiarism,which means that potential culprits can easily avoid discovery through clever editing and alterations of text content.In this paper,we propose an academic misconduct detection method based on scholars’submission behaviors.The model can effectively capture the atypical behavioral approach and operation of the author.As such,it is able to detect various types of misconduct,thereby improving the accuracy of detection when combined with a text content analysis.The model learns by forming a dual network group that processes text features and user behavior features to detect potential academic misconduct.First,the effect of scholars’behavioral features on the model are considered and analyzed.Second,the Synthetic Minority Oversampling Technique(SMOTE)is applied to address the problem of imbalanced samples of positive and negative classes among contributing scholars.Finally,the text features of the papers are combined with the scholars’behavioral data to improve recognition precision.Experimental results on the imbalanced dataset demonstrate that our model has a highly satisfactory performance in terms of accuracy and recall.展开更多
In recent years,deep learning gained proliferating popularity in the cybersecurity application domain,since when being compared to traditional machine learning methods,it usually involves less human efforts,produces b...In recent years,deep learning gained proliferating popularity in the cybersecurity application domain,since when being compared to traditional machine learning methods,it usually involves less human efforts,produces better results,and provides better generalizability.However,the imbalanced data issue is very common in cybersecurity,which can substantially deteriorate the performance of the deep learning models.This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity using return-oriented programming payload detection as a case study.We achieved 0.0290 average false positive rate,0.9705 average F1 score and 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs,with 0 benign training data sample in the target domain.The performance improvement compared to the baseline is a trade-off between false positive rate and detection rate.Using our approach,the total number of false positives is reduced by 23.16%,and as a trade-off,the number of detected malicious samples decreases by 0.68%.展开更多
With the development of advanced metering infrastructure(AMI),large amounts of electricity consumption data can be collected for electricity theft detection.However,the imbalance of electricity consumption data is vio...With the development of advanced metering infrastructure(AMI),large amounts of electricity consumption data can be collected for electricity theft detection.However,the imbalance of electricity consumption data is violent,which makes the training of detection model challenging.In this case,this paper proposes an electricity theft detection method based on ensemble learning and prototype learning,which has great performance on imbalanced dataset and abnormal data with different abnormal level.In this paper,convolutional neural network(CNN)and long short-term memory(LSTM)are employed to obtain abstract feature from electricity consumption data.After calculating the means of the abstract feature,the prototype per class is obtained,which is used to predict the labels of unknown samples.In the meanwhile,through training the network by different balanced subsets of training set,the prototype is representative.Compared with some mainstream methods including CNN,random forest(RF)and so on,the proposed method has been proved to effectively deal with the electricity theft detection when abnormal data only account for 2.5%and 1.25%of normal data.The results show that the proposed method outperforms other state-of-the-art methods.展开更多
基金The author Feng Yin was funded by the Shenzhen Science and Technology Innovation Council(No.JCYJ20170307155957688)and by National Natural Science Foundation of China Key Project(No.61731018)The authors Feng Yin and Shuguang(Robert)Cui were funded by Shenzhen Fundamental Research Funds under Grant(Key Lab)No.ZDSYS201707251409055,Grant(Peacock)No.KQTD2015033114415450,and Guangdong province“The Pearl River Talent Recruitment Program Innovative and Entrepreneurial Teams in 2017”-Data Driven Evolution of Future Intelligent Network Team.The associate editor coordinating the review of this paper and approving it for publication was X.Cheng.
文摘Driven by the need of a plethora of machine learning applications,several attempts have been made at improving the performance of classifiers applied to imbalanced datasets.In this paper,we present a fast maximum entropy machine(MEM)combined with a synthetic minority over-sampling technique for handling binary classification problems with high imbalance ratios,large numbers of data samples,and medium/large numbers of features.A random Fourier feature representation of kernel functions and primal estimated sub-gradient solver for support vector machine(PEGASOS)are applied to speed up the classic MEM.Experiments have been conducted using various real datasets(including two China Mobile datasets and several other standard test datasets)with various configurations.The obtained results demonstrate that the proposed algorithm has extremely low complexity but an excellent overall classification performance(in terms of several widely used evaluation metrics)as compared to the classic MEM and some other state-of-the-art methods.The proposed algorithm is particularly valuable in big data applications owing to its significantly low computational complexity.
文摘Most modern technologies,such as social media,smart cities,and the internet of things(IoT),rely on big data.When big data is used in the real-world applications,two data challenges such as class overlap and class imbalance arises.When dealing with large datasets,most traditional classifiers are stuck in the local optimum problem.As a result,it’s necessary to look into new methods for dealing with large data collections.Several solutions have been proposed for overcoming this issue.The rapid growth of the available data threatens to limit the usefulness of many traditional methods.Methods such as oversampling and undersampling have shown great promises in addressing the issues of class imbalance.Among all of these techniques,Synthetic Minority Oversampling TechniquE(SMOTE)has produced the best results by generating synthetic samples for the minority class in creating a balanced dataset.The issue is that their practical applicability is restricted to problems involving tens of thousands or lower instances of each.In this paper,we have proposed a parallel mode method using SMOTE and MapReduce strategy,this distributes the operation of the algorithm among a group of computational nodes for addressing the aforementioned problem.Our proposed solution has been divided into three stages.Thefirst stage involves the process of splitting the data into different blocks using a mapping function,followed by a pre-processing step for each mapping block that employs a hybrid SMOTE algo-rithm for solving the class imbalanced problem.On each map block,a decision tree model would be constructed.Finally,the decision tree blocks would be com-bined for creating a classification model.We have used numerous datasets with up to 4 million instances in our experiments for testing the proposed scheme’s cap-abilities.As a result,the Hybrid SMOTE appears to have good scalability within the framework proposed,and it also cuts down the processing time.
基金This work is supported by the National Key R&D Program of China under grant 2018YFB1003205by the National Natural Science Foundation of China under grants U1836208 and U1836110+1 种基金by the Priority Academic Program Development of Jiangsu Higher Education Institutions(PAPD)fundand by the Collaborative Innovation Center of Atmospheric Environment and Equipment Technology(CICAEET)fund,China.
文摘In recent years,academic misconduct has been frequently exposed by the media,with serious impacts on the academic community.Current research on academic misconduct focuses mainly on detecting plagiarism in article content through the application of character-based and non-text element detection techniques over the entirety of a manuscript.For the most part,these techniques can only detect cases of textual plagiarism,which means that potential culprits can easily avoid discovery through clever editing and alterations of text content.In this paper,we propose an academic misconduct detection method based on scholars’submission behaviors.The model can effectively capture the atypical behavioral approach and operation of the author.As such,it is able to detect various types of misconduct,thereby improving the accuracy of detection when combined with a text content analysis.The model learns by forming a dual network group that processes text features and user behavior features to detect potential academic misconduct.First,the effect of scholars’behavioral features on the model are considered and analyzed.Second,the Synthetic Minority Oversampling Technique(SMOTE)is applied to address the problem of imbalanced samples of positive and negative classes among contributing scholars.Finally,the text features of the papers are combined with the scholars’behavioral data to improve recognition precision.Experimental results on the imbalanced dataset demonstrate that our model has a highly satisfactory performance in terms of accuracy and recall.
基金supported by NSF CNS-2019340,NSF ECCS-2140175,and NIST 60NANB22D144.
文摘In recent years,deep learning gained proliferating popularity in the cybersecurity application domain,since when being compared to traditional machine learning methods,it usually involves less human efforts,produces better results,and provides better generalizability.However,the imbalanced data issue is very common in cybersecurity,which can substantially deteriorate the performance of the deep learning models.This paper introduces a transfer learning based method to tackle the imbalanced data issue in cybersecurity using return-oriented programming payload detection as a case study.We achieved 0.0290 average false positive rate,0.9705 average F1 score and 0.9521 average detection rate on 3 different target domain programs using 2 different source domain programs,with 0 benign training data sample in the target domain.The performance improvement compared to the baseline is a trade-off between false positive rate and detection rate.Using our approach,the total number of false positives is reduced by 23.16%,and as a trade-off,the number of detected malicious samples decreases by 0.68%.
基金supported by National Natural Science Foundation of China(No.52277083).
文摘With the development of advanced metering infrastructure(AMI),large amounts of electricity consumption data can be collected for electricity theft detection.However,the imbalance of electricity consumption data is violent,which makes the training of detection model challenging.In this case,this paper proposes an electricity theft detection method based on ensemble learning and prototype learning,which has great performance on imbalanced dataset and abnormal data with different abnormal level.In this paper,convolutional neural network(CNN)and long short-term memory(LSTM)are employed to obtain abstract feature from electricity consumption data.After calculating the means of the abstract feature,the prototype per class is obtained,which is used to predict the labels of unknown samples.In the meanwhile,through training the network by different balanced subsets of training set,the prototype is representative.Compared with some mainstream methods including CNN,random forest(RF)and so on,the proposed method has been proved to effectively deal with the electricity theft detection when abnormal data only account for 2.5%and 1.25%of normal data.The results show that the proposed method outperforms other state-of-the-art methods.