Abstract
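Named entity recognition (NER) in legal documents is a key task in intelligent justice. Existing sequence labeling models attend only to character information, so in legal-document NER they cannot capture semantic and word-level contextual information and cannot constrain entity boundaries. This paper therefore proposes a judicial NER model that fuses external information and constrains boundaries, called semantic and boundary enhanced named entity recognition (SBENER). The work collects 400,000 legal documents on theft cases. First, a model is pretrained, and the resulting theft-crime word vectors serve as external information fed to the model. Second, an Adapter is designed to fuse the theft-crime information into the character sequence and strengthen semantic features. Finally, a boundary pointer network constrains entity boundaries, addressing the loss of word information and the lack of boundary constraints in sequence labeling models. In experiments on the CAILIE 1.0 and LegalCorpus datasets, SBENER reaches F1 scores of 88.70% and 87.67%, respectively, outperforming the other baseline models. The SBENER model improves named entity recognition in the judicial domain.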
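The idea of constraining entity boundaries with a pointer-style component can be illustrated with a minimal sketch. It assumes the network emits, for every character, a probability of being an entity start and a probability of being an entity end; the function name `decode_spans`, the threshold, and the maximum span length are hypothetical choices made for illustration, not details taken from the paper, which feeds the boundary information into a conditional random field rather than decoding spans directly.

```python
def decode_spans(start_probs, end_probs, threshold=0.5, max_len=10):
    """Enumerate candidate spans whose start and end probabilities pass a threshold.

    start_probs[i] and end_probs[j] are assumed per-character boundary
    probabilities. Each returned tuple is (start, end, score), where the score
    is the product of the two boundary probabilities (an illustrative choice).
    """
    spans = []
    for i, p_start in enumerate(start_probs):
        if p_start < threshold:
            continue  # character i is unlikely to open an entity
        # Only consider ends within max_len characters of the start.
        for j in range(i, min(i + max_len, len(end_probs))):
            if end_probs[j] >= threshold:
                spans.append((i, j, p_start * end_probs[j]))
    return spans
```

Scores of this kind could act as soft constraints on which character ranges a downstream tagger is allowed to label as one entity; how SBENER combines them with the sequence labels is not specified at this level of detail in the abstract.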
[Objective] Named entity recognition (NER), a central task in information extraction, aims to identify named entities of various types in text, including personal names, locations, and organization names. In Chinese NER, deep learning techniques are crucial for character and word representation and for feature extraction, and they have yielded remarkable research results. Common deep learning approaches to NER include sequence labeling, span-based approaches, generative methods, and table-based strategies. Nevertheless, the task suffers from a scarcity of lexical information, which is regarded as a primary obstacle to high-performance Chinese NER systems. Although extensive lexicons that encode rich word boundaries and semantic information have been developed, incorporating this lexical knowledge effectively into Chinese NER remains a considerable challenge. In particular, integrating the semantic information of matched words and their contextual cues into the Chinese character sequence remains difficult, and accurately delimiting named entity boundaries is still a major concern. In intelligent judicial systems, NER in legal documents has attracted significant attention. However, prevailing sequence labeling models rely predominantly on character information, which limits their capacity to capture semantic and lexical context and leaves entity boundary constraints inadequately addressed. To resolve these challenges, this paper introduces a model called semantic and boundary enhanced named entity recognition (SBENER). To enhance the semantic features of legal documents, SBENER integrates external information containing vocabulary pertinent to theft crimes. First, word vectors for theft crime terms are acquired through pretraining. Next, a dictionary tree of the vocabulary is constructed, which makes it possible to identify candidate words for each character. These candidates are then merged into a final external information vector via a bilinear attention mechanism. In addition, a linear gating structure is introduced to mitigate interference from external information with the original text. To overcome the limitations of sequence labeling models in handling entity boundary constraints, this study designs a boundary pointer network within the model to constrain entity boundaries. The character sequence is encoded into hidden-layer representations by a bidirectional long short-term memory network and then decoded to introduce probability constraints for each entity span. Finally, the contextual and boundary information is fed into a conditional random field to obtain the final entity classification results. This design addresses the loss of word information and the lack of boundary constraints in sequence labeling models. Experimental results on the CAILIE 1.0 and LegalCorpus datasets corroborate the effectiveness of the proposed method, which yields F1 scores of 88.70% and 87.67%, respectively, surpassing the other baseline models. Ablation experiments were also conducted to validate the effectiveness of each component. They show that integrating external semantic information related to theft, strengthening entity boundary constraints through pointer networks, and adding gating mechanisms to restrict the fusion of irrelevant information each help the model achieve a higher F1 score. Furthermore, this paper applied dimensionality reduction to the external semantic word vectors and analyzed fusion at different layers experimentally. Single-layer fusion outperformed multilayer fusion, and fusion at intermediate layers yielded better results. These findings underscore the marked improvement in judicial NER achieved by the proposed approach. By fusing external information and reinforcing boundary constraints, the SBENER model effectively improves the recognition of named entities in legal documents. This method contributes to the advancement of intelligent judicial systems.
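The lexicon-fusion steps named in the abstract (dictionary tree lookup, bilinear attention over candidate words, and a gate on the external vector) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names, the scalar sigmoid gate, the residual addition, and the requirement that character and word vectors share one dimension are hypothetical choices, not the authors' implementation.

```python
import math

END = "<end>"  # multi-character marker, so it never collides with a single-character key


def build_trie(words):
    """Build a nested-dict dictionary tree; END marks a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[END] = True
    return root


def match_candidates(text, trie):
    """For each character position, list the lexicon words that cover it."""
    candidates = [[] for _ in text]
    for start in range(len(text)):
        node = trie
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                break  # no lexicon word continues with this character
            if END in node:
                word = text[start:end + 1]
                for pos in range(start, end + 1):
                    candidates[pos].append(word)
    return candidates


def bilinear_attention(char_vec, word_vecs, W):
    """Weight candidate word vectors by softmax(char_vec^T W word_vec).

    W is an assumed learned matrix of shape len(char_vec) x len(word_vec).
    Returns the attention weights and the weighted sum of the word vectors.
    """
    scores = []
    for w in word_vecs:
        Ww = [sum(W[i][j] * w[j] for j in range(len(w))) for i in range(len(char_vec))]
        scores.append(sum(c * x for c, x in zip(char_vec, Ww)))
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(word_vecs[0])
    fused = [sum(a * w[k] for a, w in zip(weights, word_vecs)) for k in range(dim)]
    return weights, fused


def gated_fusion(char_vec, fused, gate_w, gate_b):
    """Scale the external vector by a scalar sigmoid gate, then add it to char_vec.

    gate_w has length len(char_vec) + len(fused); the gate form is an assumption.
    """
    z = sum(g * x for g, x in zip(gate_w, char_vec + fused)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-z))
    return [c + gate * f for c, f in zip(char_vec, fused)]
```

For instance, with the hypothetical lexicon `["被告", "被告人", "盗窃"]`, calling `match_candidates("被告人盗窃", build_trie(lexicon))` returns, for the first character, the two words that cover it. A gate value near zero leaves the character representation almost unchanged, which is one way to limit interference from irrelevant external information.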
Authors
张天宇
孙媛媛
杜文玉
邢铁军
林鸿飞
杨亮
ZHANG Tianyu; SUN Yuanyuan; DU Wenyu; XING Tiejun; LIN Hongfei; YANG Liang (School of Computer Science, Dalian University of Technology, Dalian 116024, China; Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China; Neusoft Corporation, Dalian 116024, China)
Source
《清华大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (Peking University core journal list)
2024, No. 5, pp. 749-759 (11 pages)
Journal of Tsinghua University(Science and Technology)
Funding
National Key Research and Development Program of China (2022YFC3301801)
Fundamental Research Funds for the Central Universities (DUT22ZD205).
Keywords
法律文书
外部法律信息
实体边界
命名实体识别
legal document
external law information
entity boundary
named entity recognition