Abstract
To defend against attacks on the Web application layer and to classify malicious requests effectively, this paper studies supervised learning methods in depth. Because request text is short and its features are sparse, we propose a malicious-request classification model that combines a TF-IDF tokenization strategy based on non-repeating multi-N-Gram with logistic regression. Features are extracted from a large number of samples collected from sources such as the Secrepo security data repository, and the model is trained on them. Maximum likelihood estimation serves as the optimization objective, gradient descent is used to obtain the optimal classification model, and the model's reliability is verified on a test set. The experimental results show that, for request content that is short and carries little semantic information, a model built on character-level multi-N-Gram tokenization achieves higher classification accuracy and scores than models built on word-level or single-N-Gram tokenization, with little difference in training time. The final model trained in this way reaches more than 99% accuracy, recall, and F1 on the test set.
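The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. It rests on several assumptions: "non-repeating multi-N-Gram" is read here as taking each distinct character n-gram once per request for n = 1, 2, 3; the IDF smoothing, learning rate, and epoch count are arbitrary choices; and the eight requests are made-up toy strings, not samples from Secrepo. All function names are hypothetical.

```python
import math

def char_ngrams(text, sizes=(1, 2, 3)):
    """Return the set of distinct character n-grams for several n (assumed sizes)."""
    grams = set()
    for n in sizes:
        for i in range(len(text) - n + 1):
            grams.add(text[i:i + n])
    return grams

def fit_idf(docs, sizes=(1, 2, 3)):
    """Smoothed inverse document frequency for every n-gram seen in docs."""
    df = {}
    for doc in docs:
        for g in char_ngrams(doc, sizes):
            df[g] = df.get(g, 0) + 1
    n_docs = len(docs)
    return {g: math.log((1 + n_docs) / (1 + c)) + 1.0 for g, c in df.items()}

def vectorize(text, idf, sizes=(1, 2, 3)):
    """Sparse, L2-normalised TF-IDF vector; each n-gram counts once (tf = 1)."""
    vec = {g: idf[g] for g in char_ngrams(text, sizes) if g in idf}
    norm = math.sqrt(sum(v * v for v in vec.values())) or 1.0
    return {g: v / norm for g, v in vec.items()}

def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def predict_proba(x, w, b):
    """Estimated probability that the request vector x is malicious."""
    return sigmoid(b + sum(w.get(g, 0.0) * v for g, v in x.items()))

def train(vectors, labels, lr=1.0, epochs=500):
    """Batch gradient descent on the mean negative log-likelihood."""
    w, b = {}, 0.0
    n = len(vectors)
    for _ in range(epochs):
        grad_w, grad_b = {}, 0.0
        for x, y in zip(vectors, labels):
            err = predict_proba(x, w, b) - y  # gradient of the loss w.r.t. the logit
            grad_b += err
            for g, v in x.items():
                grad_w[g] = grad_w.get(g, 0.0) + err * v
        b -= lr * grad_b / n
        for g, gv in grad_w.items():
            w[g] = w.get(g, 0.0) - lr * gv / n
    return w, b

# Toy data (hypothetical): 0 = benign, 1 = malicious.
requests = [
    "/index.html", "/images/logo.png", "/about?lang=en", "/search?q=shoes",
    "/item?id=1' OR '1'='1", "/q?s=<script>alert(1)</script>",
    "/../../etc/passwd", "/login?user=admin'--",
]
labels = [0, 0, 0, 0, 1, 1, 1, 1]

idf = fit_idf(requests)
X = [vectorize(r, idf) for r in requests]
w, b = train(X, labels)
preds = [int(predict_proba(x, w, b) >= 0.5) for x in X]
```

Maximizing the likelihood is equivalent to minimizing the negative log-likelihood, which is what the gradient step in `train` does. A new request would be scored with `predict_proba(vectorize(new_request, idf), w, b)`. On a realistic corpus, an evaluation on a held-out test set with accuracy, recall, and F1, as the abstract reports, would replace the training-set check used here.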
Authors
陈春玲
吴凡
余瀚
CHEN Chun-ling; WU Fan; YU Han (School of Computer Science & Technology, Nanjing University of Posts and Telecommunications, Nanjing 210003, China)
Source
《计算机技术与发展》
2019, No. 2, pp. 124-128 (5 pages)
Computer Technology and Development
Funding
National Natural Science Foundation of China (11501302)