
Weakly Supervised Chinese Short Text Classification Algorithm Based on LOTClass Model
Abstract: Text classification algorithms based on fully supervised learning require large amounts of labeled data, yet labeling text is time-consuming, labor-intensive, and difficult. To address this problem, this paper proposes a weakly supervised Chinese short text classification algorithm based on the LOTClass model. First, a small amount of labeled data is used to construct a category seed vocabulary. Second, the seed vocabulary guides the training of a Chinese pseudo-label generation model, which is then used to generate a large amount of pseudo-labeled data. Finally, the high-quality pseudo-labeled data is used to train a Chinese short text classification model. Experiments on the THUCNews news title dataset and a paper title dataset show that, using only a small amount of labeled data, the algorithm outperforms mainstream semi-supervised classification algorithms and is not inferior to general fully supervised classification algorithms, offering a good solution for classifying unlabeled data.
Authors: LIU Shuo, WANG Gengrun, REN Yuyuan (Information Engineering University, Zhengzhou 450001, China)
Affiliation: Information Engineering University
Source: Journal of Information Engineering University, 2021, No. 5, pp. 613-620 (8 pages)
Keywords: weakly-supervised learning; Chinese text; short text classification; pre-training model; seed words
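
The three steps named in the abstract (seed vocabulary, pseudo-label generation, selection of high-quality pseudo-labels) can be illustrated with a small Python sketch. This is not the paper's method: LOTClass relies on a pre-trained language model, whereas the sketch below substitutes plain word counting so that it stays self-contained. All function names, the `top_k` and `min_conf` parameters, and the toy data are hypothetical, and the input is assumed to be already segmented into tokens.

```python
from collections import Counter, defaultdict


def build_seed_vocab(labeled, top_k=2):
    """Pick the top_k most frequent class-exclusive words per label.

    labeled: list of (tokens, label) pairs from the small labeled set.
    Assumption: frequency-based selection stands in for the paper's procedure.
    """
    per_class = defaultdict(Counter)
    for tokens, label in labeled:
        per_class[label].update(tokens)

    # Count in how many classes each word occurs.
    class_freq = Counter()
    for counts in per_class.values():
        class_freq.update(counts.keys())

    seeds = {}
    for label, counts in per_class.items():
        exclusive = [(w, c) for w, c in counts.items() if class_freq[w] == 1]
        # Sort by descending count, then alphabetically for determinism.
        exclusive.sort(key=lambda wc: (-wc[1], wc[0]))
        seeds[label] = [w for w, _ in exclusive[:top_k]]
    return seeds


def pseudo_label(tokens, seeds):
    """Return (label, confidence) from seed-word hits, or (None, 0.0)."""
    seed_sets = {label: set(words) for label, words in seeds.items()}
    hits = {label: sum(1 for t in tokens if t in s)
            for label, s in seed_sets.items()}
    total = sum(hits.values())
    if total == 0:
        return None, 0.0
    # Ties are broken by label name so the result is deterministic.
    best = max(sorted(hits), key=lambda label: hits[label])
    return best, hits[best] / total


def select_high_quality(unlabeled, seeds, min_conf=0.8):
    """Keep only pseudo-labeled texts whose confidence reaches min_conf."""
    selected = []
    for tokens in unlabeled:
        label, conf = pseudo_label(tokens, seeds)
        if label is not None and conf >= min_conf:
            selected.append((tokens, label))
    return selected


# Toy usage with hypothetical, pre-tokenized titles.
labeled = [
    (["stock", "market", "rises"], "finance"),
    (["stock", "fund", "falls"], "finance"),
    (["team", "wins", "match"], "sports"),
    (["team", "loses", "match"], "sports"),
]
seeds = build_seed_vocab(labeled, top_k=2)
unlabeled = [["stock", "falls", "today"], ["team", "stock"], ["weather"]]
train_set = select_high_quality(unlabeled, seeds)
```

The resulting `train_set` would then feed the final classifier. Ambiguous titles (seed hits split across classes) and titles with no seed hit are discarded, which mirrors the abstract's emphasis on training only on high-quality pseudo-labels.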