Abstract
Entity-relation extraction is a core task in information extraction, and the entity-relation triples extracted from text are the basis for building large-scale knowledge graphs. The traditional pipeline method decomposes entity-relation extraction into two independent subtasks: named entity recognition and relation extraction. First, a named entity recognizer is built to identify entity boundaries and types in large-scale unstructured text. Then, the recognized entities and types are used as annotations for the data in the relation extraction task. Finally, a relation extractor predicts the relation category between two entities, and the results are combined into structured entity-relation triples. Because the labeled data used for relation extraction come from the preceding named entity recognition step, errors in that step degrade the quality of the relation extraction results, so the pipeline method suffers from error accumulation. In addition, the pipeline method weakens the feature association between the two subtasks, which leads to redundant entities. The two subtasks are trained independently and do not interact, so the text information is not fully used, and this limits the performance of the pipeline method. For the same reason, the pipeline method has difficulty extracting long-range dependencies between entities and rarely matches the performance of joint extraction models. In practical applications, entities often hold multiple relations; because the pipeline method cannot fully use global text information and named entity recognition produces redundant entities, it is also limited when extracting multiple overlapping relations. The pipeline method therefore falls short when a high-accuracy entity-relation extraction model is required.
This paper surveys the development of joint entity-relation extraction. It briefly identifies the common shortcomings of four types of joint models based on feature engineering: integer linear programming, card-pyramid parsing models, probabilistic graphical models, and structured prediction models. Focusing on deep learning-based joint extraction, it then summarizes the mainstream ways of constructing joint models according to recent state-of-the-art results. By modeling idea, these fall into three categories: multi-module/multi-step, multi-module/single-step, and single-module/single-step. Multi-module/multi-step methods include three types: mapping the entity domain to the relation domain, mapping the relation domain to the entity domain, and mapping the head-entity domain to the relation/tail-entity domain. All three divide triple extraction into multiple modules, integrate the modules through shared parameters, and obtain triples by stepwise iteration. This approach improves the performance of joint models and begins to address the problems of the pipeline method. However, each step uses an independent decoding algorithm, which causes decoding errors to accumulate, and the redundant errors of modules integrated through shared parameters affect one another's predictions, which produces cascading redundancy. Multi-module/single-step methods aim to construct an optimal joint decoding algorithm and solve it to obtain the optimal hyperparameters. They use a simple and accurate joint decoding algorithm and strengthen the interaction between submodules, which reduces the impact of stepwise decoding errors and cascading redundancy on model performance. However, the separation of modules still produces redundancy errors, which remains a limitation. Single-module/single-step methods extract triples directly from text, which effectively alleviates the cascading errors and entity redundancy of the multi-module/multi-step and multi-module/single-step methods.
Taking representative joint models from the recent literature as examples, this paper analyzes the modeling idea of each model in detail, examines its advantages and disadvantages, and groups classical models that share a modeling idea in order to illustrate how joint entity-relation extraction models have developed. It compares the performance of representative single-module/single-step models with that of representative multi-module/multi-step and multi-module/single-step models on public benchmark datasets. The comparison shows that the modeling of joint extraction is gradually shifting from the more complex multi-module/multi-step and multi-module/single-step methods to the more efficient single-module/single-step method.
Finally, this paper discusses three research directions for joint entity-relation extraction. First, current mainstream joint models focus on closed-domain extraction, and open-domain extraction has not been studied enough; open-domain joint entity-relation extraction is a problem that future work urgently needs to solve. Second, in practical industrial applications, text corpora contain multiple kinds of information, such as temporal information, but most current joint models extract features only from the context of a single text and therefore ignore temporal information. Incorporating such additional information may further improve the performance of joint models, and this is an important topic for future work. Third, there is little research on cross-text joint entity-relation extraction, which is another future research direction in this field. This paper aims to provide a complete view of research on deep learning-based joint entity-relation extraction that will be helpful to researchers in related fields.
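As a rough illustration of the pipeline method described in the abstract, the sketch below chains a toy named entity recognizer and a toy relation classifier and combines their outputs into triples. It is not taken from the paper: the gazetteer, the rule table, and the function names are hypothetical stand-ins for trained models, chosen only to make the two-stage structure and its error accumulation visible.

```python
from typing import NamedTuple, Optional


class Entity(NamedTuple):
    text: str
    type: str


# Hypothetical toy resources; a real pipeline would use trained models.
GAZETTEER = {"Marie Curie": "PERSON", "Warsaw": "LOCATION"}
RELATION_RULES = {("PERSON", "born in", "LOCATION"): "born_in"}


def recognize_entities(sentence: str, gazetteer: dict) -> list:
    """Stage 1: find entity mentions and types by dictionary lookup."""
    found = [
        (sentence.find(name), Entity(name, etype))
        for name, etype in gazetteer.items()
        if name in sentence
    ]
    # Return entities in order of appearance in the sentence.
    return [entity for _, entity in sorted(found)]


def classify_relation(sentence: str, head: Entity, tail: Entity) -> Optional[str]:
    """Stage 2: label one ordered entity pair using the stage-1 types."""
    start = sentence.find(head.text) + len(head.text)
    end = sentence.find(tail.text)
    if start > end:  # the head must precede the tail in this toy rule set
        return None
    between = sentence[start:end]
    for (head_type, cue, tail_type), label in RELATION_RULES.items():
        if head.type == head_type and tail.type == tail_type and cue in between:
            return label
    return None


def extract_triples(sentence: str, gazetteer: dict = GAZETTEER) -> list:
    """Run both stages in sequence and assemble (head, relation, tail) triples."""
    entities = recognize_entities(sentence, gazetteer)
    triples = []
    for head in entities:
        for tail in entities:
            if head is not tail:
                label = classify_relation(sentence, head, tail)
                if label is not None:
                    triples.append((head.text, label, tail.text))
    return triples
```

Because the second stage only sees what the first stage produced, a single mistyped entity is enough to lose the triple. For example, calling `extract_triples` with a gazetteer that labels "Warsaw" as `ORGANIZATION` returns an empty list for the sentence "Marie Curie was born in Warsaw.", which is the error accumulation that joint models are meant to avoid.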
Authors
张仰森
刘帅康
刘洋
任乐
辛永辉
ZHANG Yang-sen; LIU Shuai-kang; LIU Yang; REN Le; XIN Yong-hui (Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100192, China; Computer Network Emergency Response Technical Team, Coordination Center of China, Beijing 100029, China)
Source
《电子学报》
EI
CAS
CSCD
PKU Core Journals
2023, No. 4, pp. 1093-1116 (24 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 62176023).
Keywords
information extraction
knowledge graph
deep learning
joint extraction of entities and relations
pipeline method