Abstract
Multi-Label Text Classification (MLTC) is one of the important subtasks in the field of Natural Language Processing (NLP). To address the problem of complex correlations among multiple labels, an MLTC method named TLA-BERT was proposed, which fuses Bidirectional Encoder Representations from Transformers (BERT) with label semantic attention. Firstly, the contextual vector representation of the input text was learned by fine-tuning the autoencoding pre-trained model. Secondly, each label was encoded individually using a Long Short-Term Memory (LSTM) neural network. Finally, an attention mechanism was used to explicitly highlight the contribution of the text to each label in order to predict the multi-label sequence. Experimental results show that, compared with the Sequence Generation Model (SGM) algorithm, the proposed method improves the F1 score by 2.8 percentage points and 1.5 percentage points on the Arxiv Academic Paper Dataset (AAPD) and Reuters Corpus Volume I (RCV1)-v2 public datasets respectively.
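The following is a minimal, hypothetical PyTorch sketch of the architecture described in the abstract, not the authors' released code: the class and parameter names (TLABert, bert-base-uncased, label_dim, etc.) are assumptions. It illustrates a fine-tuned BERT encoder for the input text, a BiLSTM encoder over label descriptions (matching the keywords below), and per-label attention over text tokens that yields one logit per label for multi-label prediction.

# Hypothetical sketch only; names and hyperparameters are assumptions.
import torch
import torch.nn as nn
from transformers import BertModel

class TLABert(nn.Module):
    """Text encoded by fine-tuned BERT; each label description encoded by a
    BiLSTM; attention weights the text tokens per label; one logit per label."""

    def __init__(self, num_labels: int, label_vocab_size: int,
                 bert_name: str = "bert-base-uncased", label_dim: int = 768):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)   # contextual text encoder
        hidden = self.bert.config.hidden_size
        self.label_emb = nn.Embedding(label_vocab_size, label_dim)
        # BiLSTM over the tokens of each label description (hidden//2 per direction)
        self.label_lstm = nn.LSTM(label_dim, hidden // 2,
                                  batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, label_token_ids):
        # Text token representations: (batch, seq_len, hidden)
        text = self.bert(input_ids=input_ids,
                         attention_mask=attention_mask).last_hidden_state
        # Label descriptions: (num_labels, label_len) -> last BiLSTM state per label
        label_out, _ = self.label_lstm(self.label_emb(label_token_ids))
        labels = label_out[:, -1, :]                        # (num_labels, hidden)
        # Attention of every label over the text tokens
        scores = torch.einsum("bsh,lh->bls", text, labels)  # (batch, num_labels, seq_len)
        scores = scores.masked_fill(attention_mask[:, None, :] == 0, -1e9)
        alpha = torch.softmax(scores, dim=-1)
        context = torch.einsum("bls,bsh->blh", alpha, text) # label-specific text summary
        return self.classifier(context).squeeze(-1)         # logits: (batch, num_labels)

For training, a binary cross-entropy loss over the per-label logits (e.g. torch.nn.BCEWithLogitsLoss) would be the usual choice for the multi-label objective.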
Authors
LYU Xueqiang; PENG Chen; ZHANG Le; DONG Zhi'an; YOU Xindong (Beijing Key Laboratory of Internet Culture and Digital Dissemination Research (Beijing Information Science and Technology University), Beijing 100101, China)
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal
2022, No. 1, pp. 57-63 (7 pages in total)
Funding
Beijing Natural Science Foundation (4212020)
Open Project of the Key Laboratory of Tibetan Information Processing and Machine Translation of Qinghai Province / Key Laboratory of Tibetan Information Processing, Ministry of Education (2019Z002)
Keywords
multi-label classification
Bidirectional Encoder Representations from Transformers (BERT)
label semantic information
Bidirectional Long Short-Term Memory (BiLSTM) neural network
attention mechanism