Abstract
Manually annotating secret information in large volumes of enterprise files is time-consuming and labor-intensive, and the classification criteria are affected by human subjectivity. Automatic classification of enterprise files is therefore an important problem that urgently needs to be solved in enterprise confidentiality management. To this end, a Transformer-based secret information annotation system for power grid enterprise files is proposed, comprising file preprocessing, Chinese word segmentation, word vector construction, and secret information annotation. The proposed model was trained and tested on a data set constructed from internal core commercial secret files and ordinary commercial secret files of State Grid Jilin Electric Power Corporation. The results show that the system achieves an accuracy of 97.79% and a recall of 99.08%. The model recognizes secret information accurately, leaving only a very small amount of secret information unannotated, which effectively prevents the leakage of secret information.
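As a rough illustration of how the two reported figures relate to labeling output, the sketch below computes accuracy and recall from per-token labels. The abstract does not define either metric or the label scheme, so the binary encoding (1 = secret, 0 = not secret), the function name `accuracy_and_recall`, and the toy data are all assumptions made for illustration, not the authors' evaluation code.

```python
from typing import Sequence, Tuple


def accuracy_and_recall(gold: Sequence[int], pred: Sequence[int]) -> Tuple[float, float]:
    """Return (accuracy, recall) for binary per-token labels.

    Assumed encoding (not taken from the paper): 1 = secret, 0 = not secret.
    """
    if len(gold) != len(pred):
        raise ValueError("gold and pred must have the same length")
    if not gold:
        raise ValueError("label sequences must not be empty")

    pairs = list(zip(gold, pred))
    # Accuracy: share of tokens whose predicted label matches the gold label.
    correct = sum(1 for g, p in pairs if g == p)
    # Recall: share of truly secret tokens that the model also marked as secret.
    true_pos = sum(1 for g, p in pairs if g == 1 and p == 1)
    false_neg = sum(1 for g, p in pairs if g == 1 and p == 0)

    accuracy = correct / len(pairs)
    recall = true_pos / (true_pos + false_neg) if (true_pos + false_neg) else 0.0
    return accuracy, recall


# Toy data (hypothetical): 3 secret tokens, one of which is missed.
gold_labels = [1, 1, 0, 0, 1, 0, 0, 0]
pred_labels = [1, 1, 0, 0, 0, 0, 1, 0]
acc, rec = accuracy_and_recall(gold_labels, pred_labels)
print(f"accuracy={acc:.4f} recall={rec:.4f}")
```

Under this reading, a high recall means that few secret tokens go unannotated, which is the property the abstract ties to preventing leakage.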
Authors
董添
李广
杨振宇
张博
于波
王巍
DONG Tian; LI Guang; YANG Zhenyu; ZHANG Bo; YU Bo; WANG Wei (General Committee Office, State Grid Jilin Electric Power Supply Company, Changchun 130021, China)
Source
《吉林大学学报(信息科学版)》
CAS
2021, No. 6, pp. 720-725 (6 pages)
Journal of Jilin University(Information Science Edition)
Funding
Supported by the Science and Technology Fund Project of State Grid Jilin Company (522342210001).
Keywords
secret information annotation
deep learning
Chinese word segmentation
word embedding
enterprise secrets