Abstract
Because the extraction of illegal-fact elements from lawsuit cases depends on specialized domain knowledge, an automatic extraction method based on bidirectional encoder representations from transformers (BERT) is proposed. First, domain knowledge is constructed and Google's pre-trained BERT language model is further trained on it, yielding model parameters fitted to lawsuit-case data and pre-trained Chinese character embedding vectors that serve as the model input. This gives context-dependent semantic representations and improves the contextual semantic quality of the embeddings. Second, a recurrent convolutional neural network encodes the text and captures the information that plays a key role in the text classification task, improving the extraction of illegal-fact elements. Finally, focal loss is adopted as the loss function to focus on samples that are hard to distinguish. The illegal-fact elements are extracted by classifying text labels. Experiments show that the method achieves an F1 score of 86.41% on lawsuit-case element extraction, outperforming the other methods compared. Injecting domain knowledge into the model also improves extraction accuracy.
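As an illustration of the focal loss mentioned in the abstract, the following is a minimal sketch and not the authors' implementation. It assumes the binary (per-label) form of focal loss, FL(p_t) = -alpha_t (1 - p_t)^gamma log(p_t); the function name `focal_loss` and the defaults `gamma=2.0` and `alpha=0.25` are illustrative assumptions, since the abstract does not state which variant or hyperparameters were used.

```python
import math


def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-12):
    """Mean binary focal loss over predicted probabilities and 0/1 targets.

    gamma and alpha are assumed values, not taken from the paper.
    """
    if len(probs) != len(targets) or not probs:
        raise ValueError("probs and targets must be non-empty and equal in length")
    total = 0.0
    for p, y in zip(probs, targets):
        # p_t is the probability the model assigns to the true class.
        p_t = p if y == 1 else 1.0 - p
        alpha_t = alpha if y == 1 else 1.0 - alpha
        # Clamp to avoid log(0).
        p_t = min(max(p_t, eps), 1.0)
        # (1 - p_t)^gamma shrinks the loss of well-classified samples.
        total += -alpha_t * (1.0 - p_t) ** gamma * math.log(p_t)
    return total / len(probs)


# A confidently correct sample contributes far less than a poorly classified one.
easy = focal_loss([0.95], [1])
hard = focal_loss([0.30], [1])
```

With `gamma=0` the modulating factor disappears and the expression reduces to alpha-weighted cross-entropy, which is one way to see that the extra term only reweights samples by difficulty.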
Authors
CUI Bin (崔斌)
ZOU Lei (邹蕾)
XU Ming-yue (徐明月)
Beijing Jinghang Institute of Computing and Communication Information Engineering Division, Beijing 100074, China
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2021, Issue 9, pp. 3669-3675 (7 pages)
Funding
National Key Research and Development Program of China (2018YFC0830800).