Machine Learning(ML) techniques have been widely applied in recent traffic classification.However, the problems of both discriminator bias and class imbalance decrease the accuracies of ML based traffic classifier. In...Machine Learning(ML) techniques have been widely applied in recent traffic classification.However, the problems of both discriminator bias and class imbalance decrease the accuracies of ML based traffic classifier. In this paper, we propose an accurate and extensible traffic classifier. Specifically, to address the discriminator bias issue, our classifier is built by making an optimal cascade of binary sub-classifiers, where each binary sub-classifier is trained independently with the discriminators used for identifying application specific traffic. Moreover, to balance a training dataset,we apply SMOTE algorithm in generating artificial training samples for minority classes.We evaluate our classifier on two datasets collected from different network border routers.Compared with the previous multi-class traffic classifiers built in one-time training process,our classifier achieves much higher F-Measure and AUC for each application.展开更多
Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls.It can detect attack types of malicious network communications and computer usage that cannot be detected by idiomatic fire...Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls.It can detect attack types of malicious network communications and computer usage that cannot be detected by idiomatic firewalls.Many intrusion detection methods are processed through machine learning.Previous literature has shown that the performance of an intrusion detection method based on hybrid learning or integration approach is superior to that of single learning technology.However,almost no studies focus on how additional representative and concise features can be extracted to process effective intrusion detection among massive and complicated data.In this paper,a new hybrid learning method is proposed on the basis of features such as density,cluster centers,and nearest neighbors(DCNN).In this algorithm,data is represented by the local density of each sample point and the sum of distances from each sample point to cluster centers and to its nearest neighbor.k-NN classifier is adopted to classify the new feature vectors.Our experiment shows that DCNN,which combines K-means,clustering-based density,and k-NN classifier,is effective in intrusion detection.展开更多
Interact traffic classification is vital to the areas of network operation and management. Traditional classification methods such as port mapping and payload analysis are becoming increasingly difficult as newly emer...Interact traffic classification is vital to the areas of network operation and management. Traditional classification methods such as port mapping and payload analysis are becoming increasingly difficult as newly emerged applications (e. g. Peer-to-Peer) using dynamic port numbers, masquerading techniques and encryption to avoid detection. This paper presents a machine learning (ML) based traffic classifica- tion scheme, which offers solutions to a variety of network activities and provides a platform of performance evaluation for the classifiers. The impact of dataset size, feature selection, number of application types and ML algorithm selection on classification performance is analyzed and demonstrated by the following experiments: (1) The genetic algorithm based feature selection can dramatically reduce the cost without diminishing classification accuracy. (2) The chosen ML algorithms can achieve high classification accuracy. Particularly, REPTree and C4.5 outperform the other ML algorithms when computational complexity and accuracy are both taken into account. (3) Larger dataset and fewer application types would result in better classification accuracy. Finally, early detection with only several initial packets is proposed for real-time network activity and it is proved to be feasible according to the preliminary results.展开更多
To address the issue of finegrained classification of Internet multimedia traffic from a Quality of Service(QoS) perspective with a suitable granularity, this paper defines a new set of QoS classes and presents a modi...To address the issue of finegrained classification of Internet multimedia traffic from a Quality of Service(QoS) perspective with a suitable granularity, this paper defines a new set of QoS classes and presents a modified K-Singular Value Decomposition(K-SVD) method for multimedia identification. After analyzing several instances of typical Internet multimedia traffic captured in a campus network, this paper defines a new set of QoS classes according to the difference in downstream/upstream rates and proposes a modified K-SVD method that can automatically search for underlying structural patterns in the QoS characteristic space. We define bagQoS-words as the set of specific QoS local patterns, which can be expressed by core QoS characteristics. After the dictionary is constructed with an excess quantity of bag-QoSwords, Locality Constrained Feature Coding(LCFC) features of QoS classes are extracted. By associating a set of characteristics with a percentage of error, an objective function is formulated. In accordance with the modified K-SVD, Internet multimedia traffic can be classified into a corresponding QoS class with a linear Support Vector Machines(SVM) clas-sifier. Our experimental results demonstrate the feasibility of the proposed classification method.展开更多
The explosive growth ofmalware variants poses a major threat to information security. Traditional anti-virus systems based on signatures fail to classify unknown malware into their corresponding families and to detect...The explosive growth ofmalware variants poses a major threat to information security. Traditional anti-virus systems based on signatures fail to classify unknown malware into their corresponding families and to detect new kinds of malware pro- grams. Therefore, we propose a machine learning based malware analysis system, which is composed of three modules: data processing, decision making, and new malware detection. The data processing module deals with gray-scale images, Opcode n-gram, and import fimctions, which are employed to extract the features of the malware. The decision-making module uses the features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances, which were collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify the un- known malware with a best accuracy of 98.9%, and successfully detects 86.7% of the new malware.展开更多
Mobile device manufacturers are rapidly producing miscellaneous Android versions worldwide. Simultaneously, cyber criminals are executing malicious actions, such as tracking user activities, stealing personal data, an...Mobile device manufacturers are rapidly producing miscellaneous Android versions worldwide. Simultaneously, cyber criminals are executing malicious actions, such as tracking user activities, stealing personal data, and committing bank fraud. These criminals gain numerous benefits as too many people use Android for their daily routines, including important communications. With this in mind, security practitioners have conducted static and dynamic analyses to identify malware. This study used static analysis because of its overall code coverage, low resource consumption, and rapid processing. However, static analysis requires a minimum number of features to efficiently classify malware. Therefore, we used genetic search(GS), which is a search based on a genetic algorithm(GA), to select the features among 106 strings. To evaluate the best features determined by GS, we used five machine learning classifiers, namely, Na?ve Bayes(NB), functional trees(FT), J48, random forest(RF), and multilayer perceptron(MLP). Among these classifiers, FT gave the highest accuracy(95%) and true positive rate(TPR)(96.7%) with the use of only six features.展开更多
In some image classification tasks, similarities among different categories are different and the samples are usually misclassified as highly similar categories. To distinguish highly similar categories, more specific...In some image classification tasks, similarities among different categories are different and the samples are usually misclassified as highly similar categories. To distinguish highly similar categories, more specific features are required so that the classifier can improve the classification performance. In this paper, we propose a novel two-level hierarchical feature learning framework based on the deep convolutional neural network(CNN), which is simple and effective. First, the deep feature extractors of different levels are trained using the transfer learning method that fine-tunes the pre-trained deep CNN model toward the new target dataset. Second, the general feature extracted from all the categories and the specific feature extracted from highly similar categories are fused into a feature vector. Then the final feature representation is fed into a linear classifier. Finally, experiments using the Caltech-256, Oxford Flower-102, and Tasmania Coral Point Count(CPC) datasets demonstrate that the expression ability of the deep features resulting from two-level hierarchical feature learning is powerful. Our proposed method effectively increases the classification accuracy in comparison with flat multiple classification methods.展开更多
We propose a novel discriminative learning approach for Bayesian pattern classification, called 'constrained maximum margin (CMM)'. We define the margin between two classes as the difference between the minimum de...We propose a novel discriminative learning approach for Bayesian pattern classification, called 'constrained maximum margin (CMM)'. We define the margin between two classes as the difference between the minimum decision value for positive samples and the maximum decision value for negative samples. The learning problem is to maximize the margin under the con- straint that each training pattern is classified correctly. This nonlinear programming problem is solved using the sequential un- constrained minimization technique. We applied the proposed CMM approach to learn Bayesian classifiers based on Gaussian mixture models, and conducted the experiments on 10 UCI datasets. The performance of our approach was compared with those of the expectation-maximization algorithm, the support vector machine, and other state-of-the-art approaches. The experimental results demonstrated the effectiveness of our approach.展开更多
The proliferation of forums and blogs leads to challenges and opportunities for processing large amounts of information. The information shared on various topics often contains opinionated words which are qualitative ...The proliferation of forums and blogs leads to challenges and opportunities for processing large amounts of information. The information shared on various topics often contains opinionated words which are qualitative in nature. These qualitative words need statistical computations to convert them into useful quantitative data. This data should be processed properly since it expresses opinions. Each of these opinion bearing words differs based on the significant meaning it conveys. To process the linguistic meaning of words into data and to enhance opinion mining analysis, we propose a novel weighting scheme, referred to as inferred word weighting(IWW). IWW is computed based on the significance of the word in the document(SWD) and the significance of the word in the expression(SWE) to enhance their performance. The proposed weighting methods give an analytic view and provide appropriate weights to the words compared to existing methods. In addition to the new weighting methods, another type of checking is done on the performance of text classification by including stop-words. Generally, stop-words are removed in text processing. When this new concept of including stop-words is applied to the proposed and existing weighting methods, two facts are observed:(1) Classification performance is enhanced;(2) The outcome difference between inclusion and exclusion of stop-words is smaller in the proposed methods, and larger in existing methods. The inferences provided by these observations are discussed. Experimental results of the benchmark data sets show the potential enhancement in terms of classification accuracy.展开更多
Geotechnical classification is vital for site characterization and geotechnical design.Field tests such as the cone penetration test with pore water pressure measurement(CPTu)are widespread because they represent a fa...Geotechnical classification is vital for site characterization and geotechnical design.Field tests such as the cone penetration test with pore water pressure measurement(CPTu)are widespread because they represent a faster and cheaper alternative for sample recovery and testing.However,classification schemes based on CPTu measurements are fairly generic because they represent a wide variety of soil conditions and,occasionally,they may fail when used in special soil types like sensitive or quick clays.Quick and highly sensitive clay soils in Norway have unique conditions that make them difficult to be identified through general classification charts.Therefore,new approaches to address this task are required.The following study applies machine learning methods such as logistic regression,Naive Bayes,and hidden Markov models to classify quick and highly sensitive clays at two sites in Norway based on normalized CPTu measurements.Results showed a considerable increase in the classification accuracy despite limited training sets.展开更多
In dimensional affect recognition, the machine learning methods, which are used to model and predict affect, are mostly classification and regression. However, the annotation in the dimensional affect space usually ta...In dimensional affect recognition, the machine learning methods, which are used to model and predict affect, are mostly classification and regression. However, the annotation in the dimensional affect space usually takes the form of a continuous real value which has an ordinal property. The aforementioned methods do not focus on taking advantage of this important information. Therefore, we propose an affective rating ranking framework for affect recognition based on face images in the valence and arousal dimensional space. Our approach can appropriately use the ordinal information among affective ratings which are generated by discretizing continuous annotations.Specifically, we first train a series of basic cost-sensitive binary classifiers, each of which uses all samples relabeled according to the comparison results between corresponding ratings and a given rank of a binary classifier. We obtain the final affective ratings by aggregating the outputs of binary classifiers. By comparing the experimental results with the baseline and deep learning based classification and regression methods on the benchmarking database of the AVEC 2015 Challenge and the selected subset of SEMAINE database, we find that our ordinal ranking method is effective in both arousal and valence dimensions.展开更多
基金supported by the National Natural Science Foundation of China under Grant No.61402485National Natural Science Foundation of China under Grant No.61303061supported by the Open fund from HPCL No.201513-01
文摘Machine Learning(ML) techniques have been widely applied in recent traffic classification.However, the problems of both discriminator bias and class imbalance decrease the accuracies of ML based traffic classifier. In this paper, we propose an accurate and extensible traffic classifier. Specifically, to address the discriminator bias issue, our classifier is built by making an optimal cascade of binary sub-classifiers, where each binary sub-classifier is trained independently with the discriminators used for identifying application specific traffic. Moreover, to balance a training dataset,we apply SMOTE algorithm in generating artificial training samples for minority classes.We evaluate our classifier on two datasets collected from different network border routers.Compared with the previous multi-class traffic classifiers built in one-time training process,our classifier achieves much higher F-Measure and AUC for each application.
文摘Intrusion detection aims to detect intrusion behavior and serves as a complement to firewalls.It can detect attack types of malicious network communications and computer usage that cannot be detected by idiomatic firewalls.Many intrusion detection methods are processed through machine learning.Previous literature has shown that the performance of an intrusion detection method based on hybrid learning or integration approach is superior to that of single learning technology.However,almost no studies focus on how additional representative and concise features can be extracted to process effective intrusion detection among massive and complicated data.In this paper,a new hybrid learning method is proposed on the basis of features such as density,cluster centers,and nearest neighbors(DCNN).In this algorithm,data is represented by the local density of each sample point and the sum of distances from each sample point to cluster centers and to its nearest neighbor.k-NN classifier is adopted to classify the new feature vectors.Our experiment shows that DCNN,which combines K-means,clustering-based density,and k-NN classifier,is effective in intrusion detection.
基金Supported by the National High Technology Research and Development Programme of China (No. 2005AA121620, 2006AA01Z232)the Zhejiang Provincial Natural Science Foundation of China (No. Y1080935 )the Research Innovation Program for Graduate Students in Jiangsu Province (No. CX07B_ 110zF)
文摘Interact traffic classification is vital to the areas of network operation and management. Traditional classification methods such as port mapping and payload analysis are becoming increasingly difficult as newly emerged applications (e. g. Peer-to-Peer) using dynamic port numbers, masquerading techniques and encryption to avoid detection. This paper presents a machine learning (ML) based traffic classifica- tion scheme, which offers solutions to a variety of network activities and provides a platform of performance evaluation for the classifiers. The impact of dataset size, feature selection, number of application types and ML algorithm selection on classification performance is analyzed and demonstrated by the following experiments: (1) The genetic algorithm based feature selection can dramatically reduce the cost without diminishing classification accuracy. (2) The chosen ML algorithms can achieve high classification accuracy. Particularly, REPTree and C4.5 outperform the other ML algorithms when computational complexity and accuracy are both taken into account. (3) Larger dataset and fewer application types would result in better classification accuracy. Finally, early detection with only several initial packets is proposed for real-time network activity and it is proved to be feasible according to the preliminary results.
基金supported in part by the National Natural Science Foundation of China (NO. 61401004, 61271233, 60972038)Plan of introduction and cultivation of university leading talents in Anhui (No.gxfxZ D2016013)+3 种基金the Natural Science Foundation of the Higher Education Institutions of Anhui Province, China (No. KJ2010B357)Startup Project of Anhui Normal University Doctor Scientific Research (No.2016XJJ129)the US National Science Foundation under grants CNS1702957 and ACI-1642133the Wireless Engineering Research and Education Center at Auburn University
文摘To address the issue of finegrained classification of Internet multimedia traffic from a Quality of Service(QoS) perspective with a suitable granularity, this paper defines a new set of QoS classes and presents a modified K-Singular Value Decomposition(K-SVD) method for multimedia identification. After analyzing several instances of typical Internet multimedia traffic captured in a campus network, this paper defines a new set of QoS classes according to the difference in downstream/upstream rates and proposes a modified K-SVD method that can automatically search for underlying structural patterns in the QoS characteristic space. We define bagQoS-words as the set of specific QoS local patterns, which can be expressed by core QoS characteristics. After the dictionary is constructed with an excess quantity of bag-QoSwords, Locality Constrained Feature Coding(LCFC) features of QoS classes are extracted. By associating a set of characteristics with a percentage of error, an objective function is formulated. In accordance with the modified K-SVD, Internet multimedia traffic can be classified into a corresponding QoS class with a linear Support Vector Machines(SVM) clas-sifier. Our experimental results demonstrate the feasibility of the proposed classification method.
基金Project supported by the Natiooal Natural Science Foundation of China (No. 61303264) and the National Basic Research Program (973) of China (Nos. 2012CB315906 and 0800065111001)
文摘The explosive growth ofmalware variants poses a major threat to information security. Traditional anti-virus systems based on signatures fail to classify unknown malware into their corresponding families and to detect new kinds of malware pro- grams. Therefore, we propose a machine learning based malware analysis system, which is composed of three modules: data processing, decision making, and new malware detection. The data processing module deals with gray-scale images, Opcode n-gram, and import fimctions, which are employed to extract the features of the malware. The decision-making module uses the features to classify the malware and to identify suspicious malware. Finally, the detection module uses the shared nearest neighbor (SNN) clustering algorithm to discover new malware families. Our approach is evaluated on more than 20 000 malware instances, which were collected by Kingsoft, ESET NOD32, and Anubis. The results show that our system can effectively classify the un- known malware with a best accuracy of 98.9%, and successfully detects 86.7% of the new malware.
基金supported by the Ministry of Science,Technology and Innovation of Malaysia,under the Grant e Science Fund(No.01-01-03-SF0914)
文摘Mobile device manufacturers are rapidly producing miscellaneous Android versions worldwide. Simultaneously, cyber criminals are executing malicious actions, such as tracking user activities, stealing personal data, and committing bank fraud. These criminals gain numerous benefits as too many people use Android for their daily routines, including important communications. With this in mind, security practitioners have conducted static and dynamic analyses to identify malware. This study used static analysis because of its overall code coverage, low resource consumption, and rapid processing. However, static analysis requires a minimum number of features to efficiently classify malware. Therefore, we used genetic search(GS), which is a search based on a genetic algorithm(GA), to select the features among 106 strings. To evaluate the best features determined by GS, we used five machine learning classifiers, namely, Na?ve Bayes(NB), functional trees(FT), J48, random forest(RF), and multilayer perceptron(MLP). Among these classifiers, FT gave the highest accuracy(95%) and true positive rate(TPR)(96.7%) with the use of only six features.
基金Project supported by the National Natural Science Foundation of China(No.61379074)the Zhejiang Provincial Natural Science Foundation of China(Nos.LZ12F02003 and LY15F020035)
文摘In some image classification tasks, similarities among different categories are different and the samples are usually misclassified as highly similar categories. To distinguish highly similar categories, more specific features are required so that the classifier can improve the classification performance. In this paper, we propose a novel two-level hierarchical feature learning framework based on the deep convolutional neural network(CNN), which is simple and effective. First, the deep feature extractors of different levels are trained using the transfer learning method that fine-tunes the pre-trained deep CNN model toward the new target dataset. Second, the general feature extracted from all the categories and the specific feature extracted from highly similar categories are fused into a feature vector. Then the final feature representation is fed into a linear classifier. Finally, experiments using the Caltech-256, Oxford Flower-102, and Tasmania Coral Point Count(CPC) datasets demonstrate that the expression ability of the deep features resulting from two-level hierarchical feature learning is powerful. Our proposed method effectively increases the classification accuracy in comparison with flat multiple classification methods.
基金Project supported by the National Natural Science Foundation of China(Nos.60973059 and 81171407)the Program for New Century Excellent Talents in University,China(No.NCET-10-0044)
文摘We propose a novel discriminative learning approach for Bayesian pattern classification, called 'constrained maximum margin (CMM)'. We define the margin between two classes as the difference between the minimum decision value for positive samples and the maximum decision value for negative samples. The learning problem is to maximize the margin under the con- straint that each training pattern is classified correctly. This nonlinear programming problem is solved using the sequential un- constrained minimization technique. We applied the proposed CMM approach to learn Bayesian classifiers based on Gaussian mixture models, and conducted the experiments on 10 UCI datasets. The performance of our approach was compared with those of the expectation-maximization algorithm, the support vector machine, and other state-of-the-art approaches. The experimental results demonstrated the effectiveness of our approach.
文摘The proliferation of forums and blogs leads to challenges and opportunities for processing large amounts of information. The information shared on various topics often contains opinionated words which are qualitative in nature. These qualitative words need statistical computations to convert them into useful quantitative data. This data should be processed properly since it expresses opinions. Each of these opinion bearing words differs based on the significant meaning it conveys. To process the linguistic meaning of words into data and to enhance opinion mining analysis, we propose a novel weighting scheme, referred to as inferred word weighting(IWW). IWW is computed based on the significance of the word in the document(SWD) and the significance of the word in the expression(SWE) to enhance their performance. The proposed weighting methods give an analytic view and provide appropriate weights to the words compared to existing methods. In addition to the new weighting methods, another type of checking is done on the performance of text classification by including stop-words. Generally, stop-words are removed in text processing. When this new concept of including stop-words is applied to the proposed and existing weighting methods, two facts are observed:(1) Classification performance is enhanced;(2) The outcome difference between inclusion and exclusion of stop-words is smaller in the proposed methods, and larger in existing methods. The inferences provided by these observations are discussed. Experimental results of the benchmark data sets show the potential enhancement in terms of classification accuracy.
基金the CONICYT Programa Formacion de Capital Humano Avanzado/Master Becas Chile(No.2017-73180687)。
文摘Geotechnical classification is vital for site characterization and geotechnical design.Field tests such as the cone penetration test with pore water pressure measurement(CPTu)are widespread because they represent a faster and cheaper alternative for sample recovery and testing.However,classification schemes based on CPTu measurements are fairly generic because they represent a wide variety of soil conditions and,occasionally,they may fail when used in special soil types like sensitive or quick clays.Quick and highly sensitive clay soils in Norway have unique conditions that make them difficult to be identified through general classification charts.Therefore,new approaches to address this task are required.The following study applies machine learning methods such as logistic regression,Naive Bayes,and hidden Markov models to classify quick and highly sensitive clays at two sites in Norway based on normalized CPTu measurements.Results showed a considerable increase in the classification accuracy despite limited training sets.
基金supported by the National Natural Science Foundation of China(Nos.61272211 and 61672267)the Open Project Program of the National Laboratory of Pattern Recognition(No.201700022)+1 种基金the China Postdoctoral Science Foundation(No.2015M570413)and the Innovation Project of Undergraduate Students in Jiangsu University(No.16A235)
文摘In dimensional affect recognition, the machine learning methods, which are used to model and predict affect, are mostly classification and regression. However, the annotation in the dimensional affect space usually takes the form of a continuous real value which has an ordinal property. The aforementioned methods do not focus on taking advantage of this important information. Therefore, we propose an affective rating ranking framework for affect recognition based on face images in the valence and arousal dimensional space. Our approach can appropriately use the ordinal information among affective ratings which are generated by discretizing continuous annotations.Specifically, we first train a series of basic cost-sensitive binary classifiers, each of which uses all samples relabeled according to the comparison results between corresponding ratings and a given rank of a binary classifier. We obtain the final affective ratings by aggregating the outputs of binary classifiers. By comparing the experimental results with the baseline and deep learning based classification and regression methods on the benchmarking database of the AVEC 2015 Challenge and the selected subset of SEMAINE database, we find that our ordinal ranking method is effective in both arousal and valence dimensions.