Abstract
Military-domain corpora are scarce, which degrades the quality of the domain embedding space and lowers the accuracy of deep neural network models on military named entity recognition. To address this problem, this paper starts from the distributed representation of characters and words and uses a domain adaptation method to bring in useful information from an additional domain when learning military-domain embeddings. First, a domain dictionary is built and combined with the CRF algorithm to perform domain-adaptive word segmentation on the collected general-domain and military-domain corpora, which then serve as the training corpora for the embeddings; word vectors are used as features and concatenated with character vectors to enrich the embedding information and to validate the segmentation. Next, a domain-adaptive transformation is applied to the heterogeneous embedding spaces trained on the general and military domains, producing domain-adaptive embeddings that are fed to the BiLSTM-CRF layer of the base model. Finally, recognition performance is evaluated with CoNLL-2000. Experimental results show that, under the same model, using the domain-adaptive embeddings as input improves precision (P), recall (R), and F1 by 2.17%, 1.04%, and 1.59%, respectively, over using military-domain embeddings trained on a corpus segmented with a general-purpose segmenter.
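The step of concatenating word vectors with character vectors can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not say how characters are aligned with words, so the sketch assumes each character is paired with the vector of the segmented word that contains it, and that unknown characters or words fall back to zero vectors. The function name `concat_char_word_features`, the toy vectors, and the example sentence are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def concat_char_word_features(
    words: List[str],
    char_vecs: Dict[str, Vector],
    word_vecs: Dict[str, Vector],
    char_dim: int,
    word_dim: int,
) -> List[Vector]:
    """Return one feature vector per character: [char vector ; vector of its word]."""
    features: List[Vector] = []
    for word in words:
        # Assumption of this sketch: out-of-vocabulary words map to a zero vector.
        w_vec = word_vecs.get(word, [0.0] * word_dim)
        for ch in word:
            # Same fallback for out-of-vocabulary characters.
            c_vec = char_vecs.get(ch, [0.0] * char_dim)
            # List concatenation builds a new (char_dim + word_dim)-sized vector.
            features.append(c_vec + w_vec)
    return features


# Toy, made-up vectors purely for illustration (not trained embeddings).
char_vecs = {"导": [0.1, 0.2], "弹": [0.3, 0.4], "发": [0.5, 0.6], "射": [0.7, 0.8]}
word_vecs = {"导弹": [1.0, 0.0, 0.0], "发射": [0.0, 1.0, 0.0]}

# Hypothetical output of a domain-adaptive segmenter for the string "导弹发射".
segmented = ["导弹", "发射"]
features = concat_char_word_features(segmented, char_vecs, word_vecs, 2, 3)
```

Under these assumptions, `features` holds one 5-dimensional vector per character, and a sequence of this kind is what a character-level BiLSTM-CRF tagger would take as input. Because each character inherits the vector of its word, a segmentation error changes the features of every character in the affected words, which is one plausible reason the abstract treats the concatenation as a check on segmentation quality.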
Authors
LIU Kai (刘凯)
ZHANG Hong-jun (张宏军)
CHEN Fei-qiong (陈飞琼)
Graduate School, Army Engineering University of PLA, Nanjing 210000, China; College of Command and Control Engineering, Army Engineering University of PLA, Nanjing 210000, China
Source
《计算机科学》
CSCD
Peking University Core Journal (北大核心)
2022, Issue 1, pp. 292-297 (6 pages)
Computer Science
Keywords
Character embedding
Word embedding
Chinese word segmentation
Domain adaptation
Named entity recognition