Abstract
Aims: To design a new phishing website detection technique that improves detection precision. Methods: A method is proposed that uses BERT (Bidirectional Encoder Representations from Transformers) to extract HTML string embedding features, converting HTML documents into word embedding vectors. A Stacking ensemble learning model combining four classifiers is also proposed; it uses the HTML string embedding features together with selected URL features to detect phishing websites. Results: On a 100,000-scale dataset, precision reached 98.52% and the F1 score reached 98.81%. Compared with using URL features alone, adding the HTML string embedding features raised phishing detection precision by nearly two percentage points. Conclusions: The BERT-based HTML string embedding features proposed in this paper significantly improve phishing website detection.
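The abstract reports precision and F1 but not recall. As a rough illustration only, the sketch below rearranges the standard definition F1 = 2PR / (P + R) to estimate the recall implied by the two reported figures. It assumes both figures come from the same evaluation run, which the abstract does not state; the function names are hypothetical and are not taken from the paper.

```python
def recall_from_precision_f1(precision: float, f1: float) -> float:
    """Solve F1 = 2PR / (P + R) for R, giving R = F1 * P / (2P - F1)."""
    denom = 2 * precision - f1
    if denom <= 0:
        raise ValueError("no valid recall exists for this precision/F1 pair")
    return f1 * precision / denom


def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Standard F1: the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Figures reported in the abstract (assumed to describe the same test run).
precision = 0.9852
f1 = 0.9881

recall = recall_from_precision_f1(precision, f1)
print(f"implied recall: {recall:.4f}")

# Round-trip check: the implied recall reproduces the reported F1.
print(f"recomputed F1:  {f1_from_precision_recall(precision, recall):.4f}")
```

Under that assumption the implied recall is roughly 99.1%. This is a derived estimate, not a figure reported by the authors.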
Authors
胡强
周杭霞
刘倩
HU Qiang; ZHOU Hangxia; LIU Qian (College of Information Engineering, China Jiliang University, Hangzhou 310018, China)
Source
《中国计量大学学报》
2022, No. 1, pp. 49-54 (6 pages)
Journal of China University of Metrology
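Funding
Open Project of the Key Laboratory of Public Security Informatization Application Based on Big Data Architecture, Ministry of Public Security (No. 2021DSJSYS004)
Zhejiang Provincial Basic Public Welfare Research Program (No. LGF18F020017).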