Abstract
With the number of network users growing rapidly, network environments becoming more complex, and network applications diversifying, identifying the specific applications present in network traffic (such as Google, Facebook, Skype, and MSN) has become an important research direction. The mainstream approach extracts features from network traffic and applies machine learning to identify the application. However, traffic features are numerous and complex, and the classification performance of a feature subset obtained by feature selection often depends heavily on the classifier used afterwards, which indicates that the selected features do not capture the characteristics intrinsic to the applications themselves. This paper proposes a network application classification method based on diversified combined feature selection. Feature-importance screening and recursive feature elimination (RFE) are carried out with diverse feature selection methods and diverse types of classifiers, including tree-based classifiers, which are claimed to be robust to noise. The features chosen by the two procedures are merged by taking their union, and redundant features are then removed using the Pearson correlation coefficient to obtain the final feature set. Experiments on a network dataset with 87 features and 3,577,296 instances show that, compared with typical feature selection methods such as VT, RFE, and L1-regularized logistic regression, the combined method performs better with six mainstream classifiers (KNN, SVM, RF, GBDT, XGBoost, and LightGBM), raising classification accuracy by 0.5%-3.0%, and its performance is largely unaffected by the choice of classifier. This suggests that the selected features represent the characteristics of network applications more objectively. The running time required is also reduced by 20%-90%, which improves the efficiency of real-time monitoring of network applications.
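The merge-and-prune step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes the two selectors (feature-importance screening and RFE) have already run and returned lists of feature names. The feature names, the 0.9 correlation threshold, the rule for which feature of a correlated pair is kept, and the function names `pearson` and `merge_and_prune` are all hypothetical choices made for illustration.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # assumption: a constant column counts as uncorrelated
    return cov / math.sqrt(var_x * var_y)


def merge_and_prune(columns, picked_by_importance, picked_by_rfe, threshold=0.9):
    """Union the two selections, then drop features that are redundant.

    columns: dict mapping feature name -> list of values.
    A feature is dropped when its absolute Pearson correlation with an
    already-kept feature reaches `threshold` (an assumed cut-off).
    """
    merged = sorted(set(picked_by_importance) | set(picked_by_rfe))
    kept = []
    for name in merged:
        if all(abs(pearson(columns[name], columns[k])) < threshold for k in kept):
            kept.append(name)
    return kept


# Toy data with hypothetical feature names; total_bytes is nearly
# proportional to flow_duration, so the two are highly correlated.
columns = {
    "flow_duration": [1.0, 2.0, 3.0, 4.0, 5.0],
    "total_bytes": [2.0, 4.1, 6.0, 8.2, 10.0],
    "pkt_len_mean": [5.0, 1.0, 4.0, 2.0, 3.0],
}
selected = merge_and_prune(
    columns,
    picked_by_importance=["flow_duration", "pkt_len_mean"],
    picked_by_rfe=["total_bytes", "flow_duration"],
)
print(selected)
```

On this toy data the union contains all three features, and `total_bytes` is removed because it is almost perfectly correlated with `flow_duration`. Processing the names in sorted order is only a way of making the result deterministic; the abstract does not say which member of a correlated pair the method retains.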
Authors
JIANG Shengli
ZHANG Wenxiang
ZHANG Junying
School of E-commerce, Luoyang Normal University, Luoyang 471934, China; School of Computer Science and Technology, Xidian University, Xi'an 710071, China
Source
Journal of Liaocheng University: Natural Science Edition
2021, No. 3, pp. 18-27 (10 pages)
Funding
Supported by the National Natural Science Foundation of China (11674352).
Keywords
Feature selection
Web application program classification
Machine learning