
Juvenile Case Documents Recognition Method Based on Semi-Supervised Learning (cited by: 2)
Abstract: Case documents are an important part of judicial information disclosure and must be made public after trial. Documents that involve juveniles are highly likely to leak the juveniles' personal privacy information. To apply targeted privacy protection, such documents must first be identified accurately within a large volume of case documents. In practice, real-world datasets lack labeled samples, which makes effective supervised learning difficult. To address both problems, this paper proposes a method for recognizing juvenile-related case documents based on semi-supervised learning. First, the case document corpus is preprocessed, and text features are extracted with Word2Vec and with BERT-wwm-ext, converting the long texts into a format that a classification model can take as input. Next, the classification model is trained with PU (positive-unlabeled) learning, which builds an effective classifier from a large number of unlabeled samples when very few positive samples are available. Then, on top of the model's predictions, active learning is used to obtain keywords and screen the predicted results, further improving prediction quality. On a test set constructed to match the class proportions of the real-world scenario, the proposed method achieves a recall of 98.67% and a precision of 81.02%.
Authors: YANG Shenghao; WU Yueyue; MAO Jiaxin; LIU Yiqun; ZHANG Min; MA Shaoping (Department of Computer Science and Technology // Beijing National Research Center for Information Science and Technology, Tsinghua University, Beijing 100084, China)
Source: Journal of South China University of Technology (Natural Science Edition), 2021, No. 1, pp. 29-38, 46 (11 pages in total). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: National Key R&D Program of China (2018YFC0831700); National Natural Science Foundation of China (61732008, 61532011).
Keywords: text classification; text feature extraction; deep learning; semi-supervised learning
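
The PU learning step in the abstract can be illustrated with a small sketch. The abstract does not say which PU learning variant the authors use, so the code below assumes one common approach (training a "labeled vs. unlabeled" classifier and rescaling its scores by a constant estimated on held-out labeled positives). A single hypothetical numeric feature stands in for the Word2Vec or BERT-wwm-ext text features, and every name and number here is invented for illustration, not taken from the paper.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_logistic(xs, labels, lr=0.5, epochs=2000):
    """Fit g(x) = sigmoid(w * x + b) by batch gradient ascent on the log-likelihood."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = grad_b = 0.0
        for x, s in zip(xs, labels):
            err = s - sigmoid(w * x + b)
            grad_w += err * x
            grad_b += err
        w += lr * grad_w / n
        b += lr * grad_b / n
    return w, b


# Hypothetical 1-D feature per document (a stand-in for a text embedding).
train_pos = [0.90, 0.80, 0.85, 0.95]   # labeled positives used for training
holdout_pos = [0.87, 0.83]             # labeled positives held out to estimate c
unlabeled = [0.10, 0.20, 0.15, 0.05, 0.88, 0.12, 0.92, 0.18]  # true label unknown

# Step 1: train a "labeled vs. unlabeled" classifier, g(x) ~ P(s=1 | x),
# where s=1 means "this document carries a positive label".
xs = train_pos + unlabeled
s = [1] * len(train_pos) + [0] * len(unlabeled)
w, b = fit_logistic(xs, s)


def g(x):
    return sigmoid(w * x + b)


# Step 2: estimate c = P(s=1 | y=1) as the mean score on held-out positives.
c = sum(g(x) for x in holdout_pos) / len(holdout_pos)


def pu_posterior(x):
    """Estimate P(y=1 | x) as g(x) / c, capped at 1."""
    return min(1.0, g(x) / c)


for x in unlabeled:
    print(f"feature={x:.2f}  g={g(x):.3f}  P(y=1|x)~{pu_posterior(x):.3f}")
```

Because some unlabeled documents are in fact positive, the raw score `g(x)` underestimates the probability that a document is positive; dividing by `c` corrects for that under the assumption that labeled positives are a random sample of all positives. In a pipeline like the one the abstract describes, documents with a high corrected score would then go on to the keyword-based screening step.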
