Abstract
Machine-learning-based identification of malicious encrypted traffic currently relies mainly on supervised learning, which requires large numbers of labeled samples. In real environments, however, malicious traffic is scarce and labeling it depends on expert knowledge, so labeling is costly. Active learning iteratively selects hard samples for training, which reduces the amount of training data to some extent, but existing hard-sample selection strategies based on committee voting are coarse-grained and select samples of poor quality. To address this problem, a malicious encrypted traffic identification method named CBU is proposed that improves query by committee (QBC). It defines a way to compute the committee's inconsistency on a sample and combines it with a similarity analysis between labeled and unlabeled samples to measure sample uncertainty effectively, so that high-quality hard samples are selected and the labeling and training workload is reduced. Experiments on the industry-standard CTU dataset and a real-world malicious traffic dataset show that, compared with the traditional committee voting strategy, CBU halves the number of samples that must be labeled and reaches 96% identification accuracy using only 15% of the data, which makes it practical.
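The abstract does not give the formulas behind CBU, so the following is only a minimal sketch of the general idea it describes: score each unlabeled sample by committee disagreement and by its similarity to the labeled set, then query the highest-scoring samples. Vote entropy as the disagreement measure, cosine similarity, the weighting parameter `alpha`, and all function names are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def vote_entropy(votes, n_classes):
    """Committee disagreement per sample.

    votes: int array of shape (n_members, n_samples) holding each
    committee member's predicted class label for each sample.
    """
    n_members = votes.shape[0]
    # counts[c, j] = number of members that voted class c for sample j
    counts = np.stack([(votes == c).sum(axis=0) for c in range(n_classes)])
    p = counts / n_members
    # Replace zeros by 1 inside the log so that 0 * log(0) contributes 0.
    safe = np.where(p > 0, p, 1.0)
    return -(p * np.log(safe)).sum(axis=0)

def novelty(unlabeled, labeled):
    """1 - max cosine similarity to any labeled sample (rows must be nonzero)."""
    u = unlabeled / np.linalg.norm(unlabeled, axis=1, keepdims=True)
    l = labeled / np.linalg.norm(labeled, axis=1, keepdims=True)
    return 1.0 - (u @ l.T).max(axis=1)

def select_hard_samples(votes, unlabeled, labeled, n_classes, k, alpha=0.5):
    """Return indices of the k unlabeled samples with the highest score.

    Assumption: the score is a weighted sum of normalized vote entropy
    and novelty with respect to the labeled set; the paper may combine
    the two signals differently.
    """
    disagreement = vote_entropy(votes, n_classes) / np.log(n_classes)
    score = alpha * disagreement + (1.0 - alpha) * novelty(unlabeled, labeled)
    return np.argsort(-score)[:k]
```

In an active-learning loop, the returned indices would be sent to an expert for labeling, moved into the labeled set, and the committee retrained before the next round.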
Authors
张荣华
刘智
罗琴
Zhang Ronghua; Liu Zhi; Luo Qin (School of Computer Science, Southwest Petroleum University, Chengdu 610500, China)
Source
Electronic Measurement Technology (《电子测量技术》)
Indexed in the Peking University Core Journals list (北大核心)
2022, No. 1, pp. 28-34 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (61902328).
Keywords
encrypted traffic
active learning
sample selection
malicious traffic identification