Abstract
To prevent the leakage of sensitive data and to provide a basis for data access control, a method for identifying sensitive data based on Chinese text content is proposed and implemented. By learning from a library of sensitive texts and a library of documents with known classification, which contains equal numbers of sensitive and non-sensitive texts, the method determines the threshold for sensitive-data identification and judges whether a document of unknown classification is sensitive. The detailed design and implementation of preprocessing, text recognition, and threshold determination are described. Experiments on the education-related texts in the Sogou corpus show that the identification process is simple and practical and achieves a high accuracy rate.
Source
《计算机工程与设计》
CSCD
Peking University Core Journals (北大核心)
2013, Issue 4, pp. 1202-1206 (5 pages)
Computer Engineering and Design
Funding
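National High Technology Research and Development Program of China (863 Program) (2012AA050802)
State Grid Corporation of China Science and Technology Research Team Fund (SG11034)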
Keywords
sensitive data
text recognition
content identification
data leakage prevention
classification algorithm