Paragraph Context-Based Text Classification Approach for Large-Scale Judgment Text Structuring

Abstract: With big data and artificial intelligence (AI) elevated to national strategies, applying new technologies in the judicial field has become increasingly important. The Supreme People's Court's push for the deep application of AI in the judicial field also provides an opportunity for related research. The informatization drive led by the Supreme People's Court, together with demands such as judicial openness, has brought a large number of judgment documents online. As an important legal text resource, judgment documents contain a large amount of key trial information and have diverse research and application value. However, they also contain a large amount of unstructured information, which hinders accurate information extraction. Structuring judgment documents is therefore an important prerequisite for research based on them. Given the massive number of judgment documents online, manual processing would consume a great deal of time and effort, while the standardization reform of judgment documents provides a basis for judicial applications of AI. For the judgment structuring task, existing regular-expression matching methods and methods based on text classification models fail to exploit the structural features of paragraph labels in the document context, and their structuring results are poor. To address this problem, we propose a sequence labeling method based on paragraph-level contextual semantic features of judgment documents. By learning the structural information of paragraph labels across complete judgment documents and the relationships between neighboring paragraphs, the method achieves good structuring results. The results show that accuracy, recall, and F1 all improve over the text classification baseline, yielding almost completely accurate classification. In addition, because the core of the method is the use of paragraph-level contextual semantic features, it can be extended to structuring tasks for various types of judgment documents.
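To illustrate why labeling paragraphs as a sequence can outperform classifying each paragraph in isolation, here is a minimal sketch. It is not the authors' model: the label set, the per-paragraph probabilities, and the hand-written transition scores are all hypothetical stand-ins for what a trained model would learn. It assumes only that sections of a judgment appear in a fixed order, and uses Viterbi decoding to pick the best label sequence under that constraint.

```python
import math

# Hypothetical section labels, in the order they are assumed to appear.
LABELS = ["HEADER", "FACTS", "REASONING", "RULING"]


def transition_score(prev, cur):
    """Log-score for moving from label index prev to cur.

    Assumption for this sketch: a paragraph either stays in the same
    section or moves to the next one; anything else is forbidden.
    """
    if cur == prev or cur == prev + 1:
        return math.log(0.5)
    return -math.inf


def viterbi(emissions):
    """Return the highest-scoring label sequence for a whole document.

    emissions[t][k] is the log-score of label k for paragraph t.
    """
    n = len(LABELS)
    score = list(emissions[0])
    back = []  # back[t][k] = best previous label for label k at step t+1
    for em in emissions[1:]:
        new_score, pointers = [], []
        for cur in range(n):
            best_prev = max(
                range(n), key=lambda p: score[p] + transition_score(p, cur)
            )
            new_score.append(
                score[best_prev] + transition_score(best_prev, cur) + em[cur]
            )
            pointers.append(best_prev)
        score = new_score
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(range(n), key=lambda k: score[k])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return [LABELS[k] for k in path]


# Made-up probabilities from a hypothetical per-paragraph classifier.
# Columns follow LABELS; the third paragraph is deliberately ambiguous.
probs = [
    [0.90, 0.05, 0.03, 0.02],
    [0.10, 0.80, 0.05, 0.05],
    [0.05, 0.40, 0.05, 0.50],
    [0.10, 0.10, 0.70, 0.10],
    [0.05, 0.05, 0.05, 0.85],
]
log_emissions = [[math.log(p) for p in row] for row in probs]

# Baseline: label each paragraph on its own, ignoring its neighbors.
independent = [LABELS[max(range(len(row)), key=row.__getitem__)] for row in probs]
# Sequence labeling: decode all paragraphs jointly.
decoded = viterbi(log_emissions)

print("independent:", independent)
print("sequence:   ", decoded)
```

On this toy input the isolated classifier labels the third paragraph RULING, which puts a ruling before the reasoning, while joint decoding assigns it FACTS because that is the only ordering consistent with its neighbors. In a learned model the transition scores would be estimated from annotated documents rather than fixed by hand.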

Authors: Weng Yang; Gu Songyuan; Li Jing; Wang Feng; Li Junliang; Li Xin (College of Mathematics, Sichuan University, Chengdu 610064, China; Law School, Sichuan University, Chengdu 610207, China; Union Big Data Technology Co., Ltd., Chengdu 610041, China)

Affiliations: College of Mathematics, Sichuan University; Law School, Sichuan University; Union Big Data Technology Co., Ltd.

Source: Journal of Tianjin University: Science and Technology (indexed in EI, CSCD, and the Peking University Core Journals list), 2021, No. 4, pp. 418-425 (8 pages)

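Funding: National Key R&D Program of China (2018YFC0830300); Special Research Project of the Law School of Sichuan University on "Studying and Interpreting the Spirit of the Fourth Plenary Session of the 19th CPC Central Committee" (sculaw20190302).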

Keywords: judgment texts; text structuring; pre-training model

Classification code: TK448.21 [Power Engineering and Engineering Thermophysics: Power Machinery and Engineering]