Abstract
Aiming at the problem that most current Named Entity Recognition (NER) models only use character-level information encoding and lack extraction of textual hierarchical information, a Chinese NER (CNER) model incorporating Multi-granularity linguistic knowledge and Hierarchical information (CMH) was proposed. First, the text was encoded by a model pre-trained with multi-granularity linguistic knowledge, so that the model could capture both fine-grained and coarse-grained linguistic information of the text and thus better represent the corpus. Second, hierarchical information was extracted with the ON-LSTM (Ordered Neurons Long Short-Term Memory network) model, exploiting the hierarchical structure of the text itself to strengthen the temporal relationships among the encodings. Finally, at the decoding end of the model, the word segmentation information of the text was incorporated and the entity recognition problem was transformed into a table-filling problem, in order to better handle entity overlap and obtain more accurate entity recognition results. Meanwhile, to address the poor transfer ability of current models across different domains, the concept of universal entity recognition was proposed, and a universal NER dataset, MDNER (Multi-Domain NER dataset), was constructed by selecting universal entity types from multiple domains to enhance the generalization ability of the model across domains. To validate the effectiveness of the proposed model, experiments were conducted on the Resume, Weibo, and MSRA datasets; compared with the MECT (Multi-metadata Embedding based Cross-Transformer) model, the F1 scores were improved by 0.94, 4.95, and 1.58 percentage points, respectively. To verify the proposed model's entity recognition performance in multiple domains, experiments were conducted on MDNER, where the F1 score reached 95.29%. The experimental results show that multi-granularity linguistic knowledge pre-training, extraction of the hierarchical structural information of the text, and the efficient pointer decoder are crucial to the performance improvement of the model.
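The abstract's reformulation of entity recognition as table filling with an efficient pointer decoder can be illustrated with a minimal sketch (not the authors' code; the function name, score values, and thresholding rule are illustrative assumptions). Each entity type gets a score table over character spans (start index, end index); every cell above a threshold is decoded as an entity, which is why nested or overlapping spans can be recovered simultaneously:

```python
# Hedged sketch of table-filling span decoding (illustrative, not the paper's
# implementation). For each entity type t, table[i][j] scores the span from
# character i to character j; any cell above the threshold is emitted as an
# entity, so overlapping and nested entities fall out naturally.

def decode_span_table(score_tables, threshold=0.0):
    """score_tables: {entity_type: n x n list of span scores}.
    Returns a list of (entity_type, start, end) triples."""
    entities = []
    for ent_type, table in score_tables.items():
        n = len(table)
        for i in range(n):
            for j in range(i, n):  # only valid spans with start <= end
                if table[i][j] > threshold:
                    entities.append((ent_type, i, j))
    return entities

# Toy example for a 4-character sentence such as "北京大学":
# the full span [0, 3] scores high as ORG, and the nested span [0, 1] as LOC.
scores = {
    "ORG": [[-1, -1, -1, 2.5],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1]],
    "LOC": [[-1, 1.2, -1, -1],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1],
            [-1, -1, -1, -1]],
}
print(decode_span_table(scores))  # → [('ORG', 0, 3), ('LOC', 0, 1)]
```

In a real efficient-pointer decoder the score tables are produced by the encoder (here, the CMH encoding stack), typically via a bilinear interaction between start and end token representations; the decoding step above stays the same.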
Authors
YU Youren; ZHANG Yangsen; JIANG Yuru; HUANG Gaijuan (Institute of Intelligent Information Processing, Beijing Information Science and Technology University, Beijing 100101, China)
Source
Journal of Computer Applications (《计算机应用》)
Indexed in: CSCD; Peking University Core Journals
2024, No. 6, pp. 1706-1712 (7 pages)
Funding
Supported by the National Natural Science Foundation of China (62176023).
Keywords
Named Entity Recognition(NER)
Natural Language Processing(NLP)
knowledge graph construction
efficient pointer
generic entity