Abstract
To reduce the dependence on labeled data and make full use of large amounts of unlabeled data, this paper proposes STAP, a semi-supervised text classification algorithm with data augmentation and similar pseudo-labels. The algorithm uses the EPiDA (easy plug-in data augmentation) framework and self-training to expand a small amount of labeled data. It uses consistency training and similar pseudo-labels to model two relationships: the one between each unlabeled sample and its augmented samples, and the one between similar unlabeled samples with high confidence. Under the constraints of a supervised cross-entropy loss, an unsupervised consistency loss, and an unsupervised pair loss, it improves the quality of the unlabeled data. Experiments on four text classification datasets show that STAP achieves clear improvements over other classical text classification algorithms.
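The abstract names three loss terms but gives no formulas, so the following is only a minimal sketch of how such terms are commonly combined, not the authors' implementation. Everything in it is an assumption made for illustration: the function names, the use of KL divergence for consistency, the Bhattacharyya coefficient as the similarity measure, the thresholds `conf_threshold` and `sim_threshold`, and the weights `w_cons` and `w_pair`.

```python
import math

EPS = 1e-12  # numerical guard for log(0); an illustrative choice


def softmax(logits):
    """Turn raw scores into a probability distribution."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(probs, label):
    """Supervised loss for one labeled sample."""
    return -math.log(probs[label] + EPS)


def kl_divergence(p, q):
    """Consistency term (assumed form): prediction on an unlabeled
    sample (p) versus prediction on its augmented version (q)."""
    return sum(pi * math.log((pi + EPS) / (qi + EPS)) for pi, qi in zip(p, q))


def bhattacharyya(p, q):
    """Similarity between two distributions, in [0, 1] (assumed measure)."""
    return sum(math.sqrt(pi * qi) for pi, qi in zip(p, q))


def pair_loss(p, q, conf_threshold=0.9, sim_threshold=0.8):
    """Pair term (assumed form): only counts when p is confident and
    p, q are already similar; then pulls q further toward p."""
    if max(p) < conf_threshold:
        return 0.0
    sim = bhattacharyya(p, q)
    if sim < sim_threshold:
        return 0.0
    return max(0.0, 1.0 - sim)


def combined_loss(labeled, augmented_pairs, similar_pairs, w_cons=1.0, w_pair=1.0):
    """labeled: list of (logits, label);
    augmented_pairs: list of (logits_original, logits_augmented);
    similar_pairs: list of (logits_a, logits_b) for unlabeled samples."""
    sup = sum(cross_entropy(softmax(l), y) for l, y in labeled) / max(len(labeled), 1)
    cons = sum(kl_divergence(softmax(a), softmax(b))
               for a, b in augmented_pairs) / max(len(augmented_pairs), 1)
    pair = sum(pair_loss(softmax(a), softmax(b))
               for a, b in similar_pairs) / max(len(similar_pairs), 1)
    return sup + w_cons * cons + w_pair * pair
```

In this sketch the pair term is gated twice, by confidence and by similarity, so that low-confidence or dissimilar pseudo-labels contribute nothing. The paper may weight or threshold these terms differently.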
Authors
Sheng Xiaohui (盛晓辉)
Shen Hailong (沈海龙)
School of Science, Northeastern University, Shenyang 110819, China
Source
Application Research of Computers (《计算机应用研究》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2023, No. 4, pp. 1019-1023, 1051 (6 pages)
Keywords
semi-supervised learning
text classification
data augmentation
similar pseudo-label