Abstract
To address the problems of low accuracy and severely insufficient corpora in named entity recognition (NER) for the education domain, a named entity recognition model, WBBAC, which integrates word information and self-attention, is proposed. The model uses the BERT pre-trained language model to enhance the semantic representation of character vectors and introduces word-frequency information into them; the character vectors are concatenated with word vectors as the input of a bidirectional long short-term memory (BiLSTM) network, a self-attention layer then further captures dependencies within the sequence, and finally the optimal label sequence is obtained by CRF decoding. Based on the characteristics of the course text, a Computer Organization Principles dataset was created and annotated. In experiments on the Resume dataset and the Computer Organization Principles dataset, the WBBAC model achieves F1 scores of 95.65% and 73.94%, respectively. The results show that, compared with the baseline models, WBBAC achieves a higher F1 score and effectively alleviates the problem of insufficient annotated data in education-domain NER tasks.
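The input construction the abstract describes can be illustrated with a minimal sketch: for each character, a BERT character vector is concatenated with a matched word vector and a word-frequency feature before entering the BiLSTM. All names, dimensions, and values below are illustrative assumptions, not the paper's code.

```python
# Hedged sketch of WBBAC-style input fusion: per character position, concatenate
# a BERT character vector, a word vector, and a scalar word-frequency feature.
# Dimensions here are toy values; the paper does not specify this exact layout.

def build_fused_inputs(char_vecs, word_vecs, word_freqs):
    """Return one fused feature vector per character position."""
    assert len(char_vecs) == len(word_vecs) == len(word_freqs)
    return [c + w + [f] for c, w, f in zip(char_vecs, word_vecs, word_freqs)]

# Toy example: 2 characters, BERT dim 4, word dim 3, plus 1 frequency scalar.
chars = [[0.1, 0.2, 0.3, 0.4], [0.5, 0.6, 0.7, 0.8]]
words = [[1.0, 1.1, 1.2], [1.3, 1.4, 1.5]]
freqs = [0.02, 0.07]

fused = build_fused_inputs(chars, words, freqs)
# Each fused vector has length 4 + 3 + 1 = 8 and would feed the BiLSTM layer,
# followed by self-attention and CRF decoding in the full model.
```

This only shows the feature concatenation step; the BiLSTM, self-attention, and CRF layers would sit on top of these fused vectors.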
Authors
ZHENG Shoumin;SHEN Yanguang(School of Information and Electrical Engineering,Hebei University of Technology,Handan 056004,China)
Source
Software Guide (《软件导刊》), 2024, No. 9, pp. 105-109 (5 pages)
Keywords
named entity recognition
word information
self-attention mechanism
education domain
BERT