Judicial named entity recognition enhanced with semantic and boundary information


Abstract: Named entity recognition (NER) in legal documents is a key task in intelligent justice. Existing sequence labeling models attend only to character information, so in legal-document NER they cannot capture semantic information or the context of words, and they place no constraint on entity boundaries. This paper therefore proposes a judicial NER model that fuses external information and constrains entity boundaries: semantic and boundary enhanced named entity recognition (SBENER). The work collects 400,000 legal documents on theft cases. First, a model is pretrained on them, and the resulting theft-domain word vectors serve as external information for the model. Second, an Adapter is designed to fuse the theft-domain information into the character sequence and strengthen its semantic features. Finally, a boundary pointer network constrains entity boundaries, addressing the loss of word information and the lack of boundary constraints in sequence labeling models. In experiments on the CAILIE 1.0 and LegalCorpus datasets, SBENER reaches F1 scores of 88.70% and 87.67%, respectively, outperforming the other baseline models. SBENER thus improves NER in the judicial domain.

[Objective] Named entity recognition (NER), a central task in information extraction, aims to identify named entities of various types in text, including personal names, locations, and organization names. In Chinese NER, deep learning techniques are central to character and word representation and to feature extraction, and they have produced strong results. Common deep learning approaches to NER include sequence labeling, span-based methods, generative methods, and table-based strategies. However, the task suffers from a scarcity of lexical information, which is regarded as a primary obstacle to high-performance Chinese NER systems. Although large lexicons that carry rich word-boundary and semantic information are available, incorporating this lexical knowledge into Chinese NER remains difficult. In particular, it is hard to integrate the semantics of matched words and their contextual cues into the Chinese character sequence, and accurately delimiting entity boundaries is still a concern. In intelligent judicial systems, NER over legal documents has received considerable attention, yet prevailing sequence labeling models rely mainly on character information, which limits their ability to capture semantic and lexical context and leaves entity boundaries unconstrained.

To resolve these problems, this paper introduces a model called semantic and boundary enhanced named entity recognition (SBENER). To enhance the semantic features of legal documents, SBENER integrates external information containing vocabulary related to theft crimes. First, word vectors for theft-crime terms are obtained through pretraining. Next, a vocabulary dictionary tree is constructed so that the candidate words for each character can be identified. These candidates are then combined into a single external information vector through a bilinear attention mechanism. In addition, a linear gating structure limits interference from the external information with the original text. To overcome the inability of sequence labeling models to constrain entity boundaries, the model includes a boundary pointer network: the character sequence is encoded into hidden-layer representations by a bidirectional long short-term memory network and then decoded to produce a probability constraint for each entity span. Finally, the contextual and boundary information is fed into a conditional random field to obtain the entity classification results. This design addresses the loss of word information and the lack of boundary constraints in sequence labeling models.

Experiments on the CAILIE 1.0 and LegalCorpus datasets confirm the effectiveness of the proposed method, which achieves F1 scores of 88.70% and 87.67%, respectively, surpassing the other baseline models. Ablation experiments validate each component: integrating external semantic information related to theft, strengthening entity boundary constraints with the pointer network, and using a gating mechanism to restrict the fusion of irrelevant information all raise the model's F1 score. The paper also applies dimensionality reduction to the external word vectors and analyzes fusion at different layers. Single-layer fusion outperforms multilayer fusion, and fusion at intermediate layers yields better results.

By fusing external information and reinforcing boundary constraints, the SBENER model improves the recognition of named entities in legal documents and contributes to the development of intelligent judicial systems.
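The abstract mentions a vocabulary dictionary tree used to find candidate words for each character. The following is a minimal sketch of that lookup step only, not the authors' implementation. It assumes that a character's candidates are all lexicon words covering that character, as is common in lexicon-enhanced Chinese NER. The names `Trie`, `matches_from`, and `candidate_words`, and the three-word lexicon, are hypothetical.

```python
from collections import defaultdict


class Trie:
    """Minimal dictionary tree over a word list (hypothetical helper)."""

    def __init__(self, words):
        self.root = {}
        for word in words:
            node = self.root
            for ch in word:
                node = node.setdefault(ch, {})
            node["$end"] = True  # marks the end of a lexicon word

    def matches_from(self, text, start):
        """Yield every lexicon word that begins at text[start]."""
        node = self.root
        for end in range(start, len(text)):
            node = node.get(text[end])
            if node is None:
                return
            if "$end" in node:
                yield text[start:end + 1]


def candidate_words(text, trie):
    """Map each character index to the lexicon words that cover it.

    Assumption: "candidate" means any matched word whose span
    includes the character; the paper may define it differently.
    """
    covering = defaultdict(list)
    for start in range(len(text)):
        for word in trie.matches_from(text, start):
            for i in range(start, start + len(word)):
                covering[i].append(word)
    return [covering[i] for i in range(len(text))]


if __name__ == "__main__":
    # Toy lexicon; the real model uses a theft-domain vocabulary.
    trie = Trie(["盗窃", "手机", "苹果手机"])
    for ch, words in zip("盗窃苹果手机", candidate_words("盗窃苹果手机", trie)):
        print(ch, words)
```

In the described model, the word vectors of each character's candidates would then be combined by bilinear attention and passed through the gate; this sketch stops at the matching stage.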

Authors: ZHANG Tianyu; SUN Yuanyuan; DU Wenyu; XING Tiejun; LIN Hongfei; YANG Liang (School of Computer Science, Dalian University of Technology, Dalian 116024, China; Procuratorial Technology and Information Research Center, Supreme People's Procuratorate, Beijing 100726, China; Neusoft Corporation, Dalian 116024, China)

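Affiliations: School of Computer Science, Dalian University of Technology; Procuratorial Technology and Information Research Center, Supreme People's Procuratorate; Neusoft Corporation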

Source: Journal of Tsinghua University (Science and Technology), 2024, No. 5, pp. 749-759 (11 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.

Funding: National Key Research and Development Program of China (2022YFC3301801); Fundamental Research Funds for the Central Universities (DUT22ZD205).

Keywords: legal document; external law information; entity boundary; named entity recognition

Classification code: TP393.1 [Automation and Computer Technology: Computer Application Technology]

