Abstract
Knowledge extraction aims to extract relation triples (head entity, relation, tail entity) from unstructured text. Existing methods fall into two groups: pipeline methods and joint extraction methods. Pipeline methods handle named entity recognition and relation extraction in separate modules; this is flexible, but training is slow. Joint extraction methods use an end-to-end neural network model to perform entity recognition and relation extraction at the same time, which preserves the association between entities and relations and recasts their joint extraction as a sequence labeling problem. On this basis, the main contributions of this paper are as follows: (1) A knowledge extraction method for scientific and technological text based on mixed character-word representations and the Gated Recurrent Unit (GRU), called MBGAB, is proposed; it uses an attention mechanism to extract relations from Chinese scientific and technological resource text. (2) A mixed character-word vector mapping is adopted, which avoids word-boundary segmentation errors as far as possible while still incorporating semantic information. (3) An end-to-end joint extraction model is adopted, in which a bidirectional GRU network combined with a self-attention mechanism captures long-distance semantic information in a sentence, and bias weights are introduced to improve extraction performance.
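To make the "joint extraction as sequence labeling" idea concrete, the following is a minimal sketch of how per-character tags could be decoded back into triples. The tag format (`<B|I|E|S>-<relation>-<1|2>`, with role 1 for the head entity and role 2 for the tail entity), the function name `decode_triples`, the nearest-match pairing rule, and the toy sentence are all assumptions made for illustration; the abstract does not specify the tagging scheme actually used by MBGAB.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]


def decode_triples(chars: List[str], tags: List[str]) -> List[Triple]:
    """Decode (head, relation, tail) triples from per-character tags.

    Assumed (hypothetical) tag format: "O", or "<B|I|E|S>-<relation>-<1|2>",
    where role "1" marks a head entity and role "2" marks a tail entity.
    """
    spans = []      # completed entities as (relation, role, text), in order
    current = None  # entity being built: [relation, role, list of chars]
    for ch, tag in zip(chars, tags):
        if tag == "O":
            current = None
            continue
        pos, rest = tag.split("-", 1)
        rel, role = rest.rsplit("-", 1)
        if pos == "S":    # single-character entity
            spans.append((rel, role, ch))
            current = None
        elif pos == "B":  # start of a multi-character entity
            current = [rel, role, [ch]]
        elif current is not None and current[0] == rel and current[1] == role:
            current[2].append(ch)
            if pos == "E":  # end of the entity
                spans.append((rel, role, "".join(current[2])))
                current = None
        else:
            current = None  # malformed tag sequence: drop the partial entity

    # Pair each head with the nearest unused tail of the same relation.
    triples = []
    used = set()
    for i, (rel, role, text) in enumerate(spans):
        if role != "1":
            continue
        candidates = [
            j for j, (r, ro, _) in enumerate(spans)
            if r == rel and ro == "2" and j not in used
        ]
        if candidates:
            j = min(candidates, key=lambda k: abs(k - i))
            used.add(j)
            triples.append((text, rel, spans[j][2]))
    return triples


# Toy example (invented sentence and relation label, not from the paper).
sentence = list("张三发明了电话")
labels = ["B-INVENT-1", "E-INVENT-1", "O", "O", "O", "B-INVENT-2", "E-INVENT-2"]
print(decode_triples(sentence, labels))
```

Tagging at the character level, as in this sketch, is what lets a model sidestep word-segmentation boundary errors; in the approach the abstract describes, word-level information would be added through the mixed character-word vectors rather than through the tag scheme.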
Authors
OUYANG Suyu (欧阳苏宇)
SHAO Yingxia (邵蓥侠)
DU Junping (杜军平)
LI Ang (李昂)
Beijing Key Laboratory of Intelligent Telecommunication Software and Multimedia, College of Computer Science, Beijing University of Posts and Telecommunications, Beijing, 100082, China
Source
Guangxi Sciences (《广西科学》)
Indexed in: CAS; Peking University Core Journals (北大核心)
2022, Issue 4, pp. 634-641 (8 pages)
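Funding
Supported by the National Key Research and Development Program of China (2018YFB1402600)
and the National Natural Science Foundation of China (61772083, 61877006, 61802028, 62002027).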