Abstract
Text classification algorithms based on fully supervised learning require large amounts of labeled data, yet labeling text data is time-consuming, labor-intensive, and difficult. To address this problem, this paper proposes a weakly supervised Chinese short text classification algorithm based on the LOTClass model. First, a small amount of labeled data is used to construct a category seed vocabulary. Second, the category seed vocabulary guides the training of a Chinese pseudo-label generation model, which is then used to generate a large amount of pseudo-labeled data. Finally, the high-quality pseudo-labeled data is used to train a Chinese short text classification model. Experiments on the THUCNews news title dataset and a paper title dataset show that, using only a small amount of labeled data, the algorithm outperforms mainstream semi-supervised classification algorithms and is not inferior to typical fully supervised classification algorithms, providing a fairly good solution for classifying unlabeled data.
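The first two steps of the pipeline (seed vocabulary, then filtered pseudo-labels) can be illustrated with a toy sketch. This is not the paper's method: the abstract states that the pseudo-label generator is built on LOTClass, whereas the sketch below swaps in plain word counting so that it runs without any model. The function names, the scoring rule, the `top_k` and `min_margin` parameters, and the sample titles are all hypothetical, and English whitespace-tokenized titles stand in for Chinese text, which would need word segmentation first.

```python
from collections import Counter

def build_seed_vocab(labeled, top_k=3):
    """Step 1 (toy version): pick top_k class-specific words from a few labeled titles."""
    per_class = {}
    overall = Counter()
    for text, label in labeled:
        words = text.lower().split()
        per_class.setdefault(label, Counter()).update(words)
        overall.update(words)
    seeds = {}
    for label, counts in per_class.items():
        # Assumed scoring rule: in-class count weighted by the share of the
        # word's occurrences that fall in this class; ties break alphabetically.
        ranked = sorted(counts, key=lambda w: (-(counts[w] ** 2) / overall[w], w))
        seeds[label] = set(ranked[:top_k])
    return seeds

def pseudo_label(text, seeds, min_margin=1):
    """Step 2 (toy version): label a title by seed-word hits, or return None if unclear."""
    words = text.lower().split()
    hits = sorted(
        ((sum(w in vocab for w in words), label) for label, vocab in seeds.items()),
        key=lambda pair: (-pair[0], pair[1]),
    )
    best_hits, best_label = hits[0]
    second_hits = hits[1][0] if len(hits) > 1 else 0
    # Keep only confident pseudo-labels: at least one hit and a clear margin.
    if best_hits > 0 and best_hits - second_hits >= min_margin:
        return best_label
    return None

def generate_pseudo_labels(unlabeled, seeds, min_margin=1):
    """Return (text, label) pairs for titles that received a confident pseudo-label."""
    pairs = ((text, pseudo_label(text, seeds, min_margin)) for text in unlabeled)
    return [(text, label) for text, label in pairs if label is not None]

# Hypothetical toy data, not taken from THUCNews or the paper title dataset.
LABELED = [
    ("bank shares rise", "finance"),
    ("bank profits rise", "finance"),
    ("shares market profits", "finance"),
    ("team wins final", "sports"),
    ("team scores goal", "sports"),
    ("final goal wins", "sports"),
]
UNLABELED = [
    "bank profits jump",
    "team reaches cup final",
    "weather stays mild",
    "bank sponsors team",
]

SEEDS = build_seed_vocab(LABELED, top_k=3)
PSEUDO = generate_pseudo_labels(UNLABELED, SEEDS)
print(PSEUDO)
```

In this sketch the ambiguous and unmatched titles are dropped rather than forced into a class, which mirrors the abstract's emphasis on keeping only high-quality pseudo-labels. The retained pairs would then serve as training data for the final classifier in the third step.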
Authors
LIU Shuo (刘硕)
WANG Gengrun (王庚润)
REN Yuyuan (任玉媛)
Information Engineering University, Zhengzhou 450001, China
Source
Journal of Information Engineering University (《信息工程大学学报》)
2021, No. 5, pp. 613-620 (8 pages)
Keywords
weakly supervised learning
Chinese text
short text classification
pre-trained model
seed words