Document-level relation extraction method based on path labels

Abstract: Document-level relation extraction involves highly complex text processing, which makes it difficult to extract entity relations efficiently. To address this problem, a document-level relation extraction method based on path labels was proposed to select the key evidence sentences. Firstly, Path labels were introduced to replace entity sentences as the processed text dataset during data preprocessing. Meanwhile, drawing on the U-Net model from semantic segmentation, the encoding module at the input end was used to capture the context information of document entities, and an image-style U-Net semantic segmentation module was used to capture the global dependencies among entity triples. Finally, a Softmax function was introduced to reduce noise during text extraction. Theoretical analysis and simulation results show that, compared with the graph neural network-based RoBERTa (Robustly optimized Bidirectional Encoder Representations from Transformers) relation extraction algorithm RoBERTa-ATLOP, Path+U-Net improves the development and test F1-scores on the Document-level Relation Extraction Dataset (DocRED) by 1.31 and 0.54 percentage points respectively, and the development and test F1-scores on the Chemical Disease Response (CDR) dataset by 1.32 and 1.19 percentage points respectively. In addition, Path+U-Net keeps the correlation between entities consistent with that in the original dataset, while lowering the cost of extraction on the datasets and raising the accuracy of text extraction. Experimental results show that the proposed path label-based extraction method can effectively improve the efficiency of long-text extraction.
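
The abstract does not spell out how Path labels pick evidence sentences, so the following is only a minimal sketch of one plausible heuristic and not the authors' procedure. It assumes that a path for an entity pair consists of sentences mentioning both entities, or else of sentences linked through a shared bridge entity, with a fallback to the first mention of each entity. The function name `select_path_sentences`, the `mentions` structure, and the example entities are all hypothetical.

```python
def select_path_sentences(mentions, head, tail):
    """Return sorted sentence indices kept as the path for (head, tail).

    mentions: dict mapping an entity name to the set of sentence indices
    that mention it. The three rules below are illustrative assumptions,
    not details taken from the paper.
    """
    head_sents, tail_sents = mentions[head], mentions[tail]

    # Rule 1: sentences in which head and tail co-occur.
    direct = head_sents & tail_sents
    if direct:
        return sorted(direct)

    # Rule 2: two-hop paths through a bridge entity that shares one
    # sentence with the head and another sentence with the tail.
    path = set()
    for bridge, bridge_sents in mentions.items():
        if bridge in (head, tail):
            continue
        with_head = head_sents & bridge_sents
        with_tail = tail_sents & bridge_sents
        if with_head and with_tail:
            path |= with_head | with_tail
    if path:
        return sorted(path)

    # Rule 3: fall back to the first sentence mentioning each entity.
    return sorted({min(head_sents), min(tail_sents)})


# Hypothetical document: entity -> sentences in which it appears.
mentions = {"aspirin": {0, 3}, "ulcer": {2}, "stomach": {0, 2}}
print(select_path_sentences(mentions, "aspirin", "ulcer"))
```

In this toy document the two target entities never share a sentence, so the sketch keeps the two sentences connected through the bridge entity and drops the rest, which is the kind of input reduction the abstract attributes to Path labels.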

Authors: YUAN Quan; XU Yunpeng; TANG Chengliang (School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications, Chongqing 400065, China; Research Center of New Communication Technology Applications, Chongqing University of Posts and Telecommunications, Chongqing 400065, China)

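Affiliations: School of Communication and Information Engineering, Chongqing University of Posts and Telecommunications; Research Center of New Communication Technology Applications, Chongqing University of Posts and Telecommunications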

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 4, pp. 1029-1035 (7 pages)

Keywords: relation extraction; relation classification; distant supervision; attention mechanism; semantic segmentation

Classification number: TP391 [Automation and Computer Technology: Computer Application Technology]