Abstract
With the rapid growth of the internet and the falling cost of setting up websites, many sites replace sensitive words with variant words so that their text evades detection based on sensitive-word lexicons. To address this problem, this paper proposes a method for identifying sensitive information on websites that combines a BERT (Bidirectional Encoder Representations from Transformers) model with a variant-word restoration algorithm. The method first restores the variant words in the text to their original forms, then vectorizes the text with BERT and feeds the vectors into a model composed of a BiLSTM (bidirectional long short-term memory) layer and a CNN (convolutional neural network) layer for training, so that both sensitive information and its variant forms can be identified. Experimental results show that variant-word restoration achieves high accuracy and that the text vectors produced by BERT perform well on text classification. Compared with the other models evaluated, the BERT-BiLSTM-CNN model achieves higher accuracy, recall, and F1 score on the website sensitive-information identification task. The proposed model offers a reference for variant-word restoration and sensitive-information identification and has practical application value.
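The restoration step and the reported metrics can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes restoration is a greedy longest-match lookup in a variant table, and the table entries, `restore_variants`, and `classification_metrics` are hypothetical names introduced only for illustration. The BERT, BiLSTM, and CNN stages are omitted.

```python
# Hypothetical variant-to-canonical table (assumption: the paper's actual
# lexicon and restoration algorithm are not given in the abstract).
VARIANT_TABLE = {
    "薇信": "微信",
    "v信": "微信",
    "vx": "微信",
}


def restore_variants(text, table=VARIANT_TABLE):
    """Replace known variants with canonical words, longest match first."""
    keys = sorted(table, key=len, reverse=True)
    out = []
    i = 0
    while i < len(text):
        for key in keys:
            if text.startswith(key, i):
                out.append(table[key])
                i += len(key)
                break
        else:
            # No variant starts at position i: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def classification_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall and F1 for a binary labelling."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    accuracy = correct / len(pairs) if pairs else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```

A plain table lookup cannot tell whether a string is used as a variant or in its ordinary sense, which is one reason a context-aware classifier follows the restoration step in the described pipeline.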
Authors
符泽凡
姚竟发
滕桂法
FU Zefan; YAO Jingfa; TENG Guifa (College of Information Science and Technology, Hebei Agricultural University, Baoding 071001, China; Software Engineering Department, Hebei Software Institute, Baoding 071000, China; Hebei College Intelligent Interconnection Equipment and Multi-modal Big Data Application Technology Research and Development Center, Baoding 071000, China; Hebei Digital Agriculture Industry Technology Research Institute, Shijiazhuang 050021, China; Hebei Key Laboratory of Agricultural Big Data, Baoding 071001, China)
Source
《现代电子技术》
Peking University Core Journal (北大核心)
2024, Issue 23, pp. 105-112 (8 pages)
Modern Electronics Technique