Abstract
Traditional lexicon-based sensitive information detection methods perform poorly when sensitive words appear as variants. To address this problem, a sensitive information detection method combining variant recognition and semantic analysis is proposed. First, variant words occurring in the text are identified, and each variant's original word is found and substituted for it via edit distance and similarity computation. Then, BERT is used as the word embedding model, fusing left and right context to produce deep bidirectional language representations, and a sensitive information classification model is built by combining a convolutional neural network with a bidirectional gated recurrent unit network; these extract the text's local semantic information and contextual information respectively, which are fed into the classifier for sensitive information detection. Finally, the proposed model is compared with other deep learning network models on a real-world dataset, and the results show that the method detects sensitive information more effectively.
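The variant-restoration step described above can be sketched as follows. This is a minimal illustration, not the paper's exact algorithm: the sensitive-word lexicon, the example variant, the similarity threshold, and the choice of plain Levenshtein distance normalized into a similarity score are all assumptions made for the sketch.

```python
def edit_distance(a: str, b: str) -> int:
    # Classic Levenshtein distance via a single-row dynamic programme.
    m, n = len(a), len(b)
    dp = list(range(n + 1))
    for i in range(1, m + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, n + 1):
            cur = dp[j]
            dp[j] = min(dp[j] + 1,            # delete from a
                        dp[j - 1] + 1,        # insert into a
                        prev + (a[i - 1] != b[j - 1]))  # substitute
            prev = cur
    return dp[n]

def similarity(a: str, b: str) -> float:
    # Normalized similarity in [0, 1]; 1.0 means identical strings.
    longest = max(len(a), len(b)) or 1
    return 1.0 - edit_distance(a, b) / longest

def restore_variant(variant: str, lexicon: list[str],
                    threshold: float = 0.5) -> str:
    # Replace a detected variant with the most similar lexicon word;
    # leave it unchanged if nothing in the lexicon is similar enough.
    best = max(lexicon, key=lambda w: similarity(variant, w))
    return best if similarity(variant, best) >= threshold else variant

# Hypothetical lexicon and variant, for illustration only:
print(restore_variant("v1agra", ["viagra", "casino"]))  # prints "viagra"
```

The restored text, with variants mapped back to their original words, is what the BERT-based classifier then consumes.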
Authors
LU Songfeng; ZHENG Zhaozuo; ZHOU Junlong; ZHU Jianxin
(School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China; Shenzhen Huazhong University of Science and Technology Research Institute, Shenzhen 518063, China; School of Computer Science & Technology, Huazhong University of Science and Technology, Wuhan 430074, China)
Source
Journal of Hubei University (Natural Science)
CAS
2023, No. 6, pp. 879-887 (9 pages)
Funding
National Key R&D Program of China (2021YFB2012202)
Hubei Provincial Key R&D Program (2021BAA038, 2021BAA171)
Shenzhen Science and Technology Program, Basic Research Project (JCYJ20210324120002006)
Shenzhen Science and Technology Program, Key Technology Project (JSGG20210802153009028)
Hunan Provincial Department of Education Outstanding Youth Research Project (20B491)
"4310" Clinical Medicine Research Training Program of Hengyang Medical School [衡医发(2021)1-2-7]
Keywords
sensitive information detection
convolutional neural network
pretraining model
bidirectional gated recurrent unit
natural language processing