Abstract
To build a knowledge graph of the ten ASEAN member states, named entity recognition must be performed on the relevant texts. A neural network model based on a bidirectional GRU-CRF is designed to recognize named entities in Chinese news data from the Chinese embassies in the ten ASEAN member states. Taking pre-trained domain word vectors as input, the bidirectional GRU network extracts semantic features from the vectorized text, and a CRF layer then predicts and outputs the optimal tag sequence. To further improve the results, two hidden layers are added between the bidirectional GRU and the CRF layer. For data preprocessing, a dataset partitioning algorithm is proposed to split the text in a more principled way. On the ASEAN dataset, the model is compared with several hybrid models, and the results show that it achieves better recognition performance on person, location, and organization names.
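The CRF step mentioned in the abstract, which outputs the optimal tag sequence, is commonly implemented with Viterbi decoding. The following is a minimal sketch of that decoding step only, not the authors' implementation. The tag set, the emission scores (which in the described model would come from the bidirectional GRU and the hidden layers) and the transition scores are all made-up illustrative values.

```python
from typing import List, Sequence

# Hypothetical tag set and transition scores, chosen only for illustration.
TAGS = ["O", "B-LOC", "I-LOC"]
TRANSITIONS = [
    [0.5, 0.0, -10.0],  # from O: moving straight to I-LOC is heavily penalized
    [0.0, -1.0, 1.0],   # from B-LOC
    [0.0, -1.0, 0.5],   # from I-LOC
]


def viterbi_decode(emissions: Sequence[Sequence[float]],
                   transitions: Sequence[Sequence[float]]) -> List[int]:
    """Return the highest-scoring sequence of tag indices.

    emissions[t][j]   : score of tag j at position t
    transitions[i][j] : score of moving from tag i to tag j
    """
    if not emissions:
        return []
    num_tags = len(emissions[0])
    # best[j]: best score of any path that ends in tag j at the current position
    best = list(emissions[0])
    backpointers: List[List[int]] = []
    for t in range(1, len(emissions)):
        new_best = []
        pointers = []
        for j in range(num_tags):
            # Choose the previous tag that maximizes the path score into tag j.
            prev = max(range(num_tags),
                       key=lambda i: best[i] + transitions[i][j])
            new_best.append(best[prev] + transitions[prev][j] + emissions[t][j])
            pointers.append(prev)
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final tag to recover the full path.
    last = max(range(num_tags), key=lambda j: best[j])
    path = [last]
    for pointers in reversed(backpointers):
        last = pointers[last]
        path.append(last)
    path.reverse()
    return path


# Toy emission scores for a three-token sentence (illustrative values).
toy_emissions = [
    [0.1, 2.0, 0.0],
    [0.5, 0.0, 1.0],
    [2.0, 0.0, 0.5],
]
print([TAGS[i] for i in viterbi_decode(toy_emissions, TRANSITIONS)])
# prints ['B-LOC', 'I-LOC', 'O']
```

Because the transition scores take part in the search, the decoder can override a locally best tag: with emissions `[[1.0, 0.9, 0.0], [0.0, 0.0, 2.0]]`, a per-token argmax would yield `O, I-LOC`, while the decoder returns `B-LOC, I-LOC`.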
Authors
郑彦斌
夏志超
郭智
黄永忠
刘文芬
ZHENG Yan-bin; XIA Zhi-chao; GUO Zhi; HUANG Yong-zhong; LIU Wen-fen (1. Guangxi Key Laboratory of Cryptography and Information Security, Guilin University of Electronic Technology, Guilin 541004, China; 2. School of Computer Science and Network Security, Dongguan University of Technology, Dongguan 523808, China)
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2018, Issue 35, pp. 162-168 (7 pages)
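Funding
Supported by the National Natural Science Foundation of China (61602125, 61866008, 61862011, 61862012); the Guangxi Natural Science Foundation (2016GXNSFBA380153, 2017GXNSFAA198192, 2018GXNSFAA138116); the Guangxi Key Laboratory of Cryptography and Information Security project (GCIS201625, GCIS201704); and the Innovation Project of Guilin University of Electronic Technology Graduate Education (2018YJCX51).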