Abstract
Web content security monitoring is an important technical means of maintaining Internet security. To address the difficulty of detecting the large number of sensitive words on Web pages and their complex, diverse variants, this paper adopts a deep learning network model based on BERT-BiLSTM-CRF to recognize sensitive words and their variants. First, the BERT layer vectorizes the text sequence. Second, the vectorized representation is fed into the BiLSTM layer to extract rich features of sensitive words. Finally, the CRF layer further constrains and corrects the output. After training on a labeled entity recognition dataset of sensitive words and their variants, the model recognizes these entities fairly accurately. Experimental results show that the model outperforms other models in precision, recall, and F1 score, and achieves good recognition performance.
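The constraining role of the CRF layer can be illustrated with a minimal sketch. It is not the authors' implementation: the tag set `O`/`B-SEN`/`I-SEN`, the emission scores (standing in for BiLSTM outputs), and the hard transition rule are all assumptions chosen for illustration, whereas a trained CRF would learn its transition scores from data. The sketch uses Viterbi decoding to pick the best tag sequence that respects the BIO scheme, where per-token argmax alone could emit an invalid sequence.

```python
from typing import Dict, List

# Hypothetical BIO tag set for sensitive-word entities (an assumption, not from the paper).
TAGS = ["O", "B-SEN", "I-SEN"]
NEG_INF = float("-inf")


def transition_score(prev: str, curr: str) -> float:
    """Hard BIO constraint: I-SEN may only follow B-SEN or I-SEN."""
    if curr == "I-SEN" and prev == "O":
        return NEG_INF
    return 0.0


def viterbi_decode(emissions: List[Dict[str, float]]) -> List[str]:
    """Return the highest-scoring tag path that satisfies the constraints."""
    if not emissions:
        return []
    # A sequence may not start with I-SEN.
    score = {t: (NEG_INF if t == "I-SEN" else emissions[0][t]) for t in TAGS}
    backpointers: List[Dict[str, str]] = []
    for emit in emissions[1:]:
        new_score: Dict[str, float] = {}
        pointers: Dict[str, str] = {}
        for curr in TAGS:
            # Best previous tag for reaching `curr` at this position.
            best_prev = max(TAGS, key=lambda p: score[p] + transition_score(p, curr))
            new_score[curr] = score[best_prev] + transition_score(best_prev, curr) + emit[curr]
            pointers[curr] = best_prev
        score = new_score
        backpointers.append(pointers)
    # Trace back from the best final tag.
    path = [max(TAGS, key=lambda t: score[t])]
    for pointers in reversed(backpointers):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Made-up emission scores for a three-token sequence.
emissions = [
    {"O": 2.0, "B-SEN": 1.5, "I-SEN": 0.1},
    {"O": 0.2, "B-SEN": 0.3, "I-SEN": 2.0},
    {"O": 1.0, "B-SEN": 0.1, "I-SEN": 0.5},
]
greedy = [max(e, key=e.get) for e in emissions]
print(greedy)                     # ['O', 'I-SEN', 'O'], invalid: I-SEN follows O
print(viterbi_decode(emissions))  # ['B-SEN', 'I-SEN', 'O']
```

In this toy input, per-token argmax starts an entity with `I-SEN`, while the constrained decoder shifts the first token to `B-SEN` so that the whole sequence is well formed.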
Authors
郑贤茹
李柏岩
冯珍妮
刘晓强
ZHENG Xianru; LI Baiyan; FENG Zhenni; LIU Xiaoqiang (College of Computer Science and Technology, Donghua University, Shanghai 201620)
Source
《计算机与数字工程》
2023, No. 7, pp. 1585-1589 (5 pages)
Computer & Digital Engineering
Funding
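Supported by the Shanghai Sailing Program for Young Science and Technology Talents (No. 19YF1402200)
and the Fundamental Research Funds for the Central Universities, Donghua University (No. 2232021D-23).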