Abstract
To address the insufficient semantic representation of character vectors generated by the GloVe and BERT models on small corpora, a fused-vector pre-training model is proposed to improve the accuracy of Chinese short-text classification on small corpora. Taking the public Toutiao (今日头条) news dataset as the experimental object, the GloVe and BERT models are pre-trained on the target domain, and the pre-trained character vectors generated by GloVe and BERT are fused to achieve semantic enhancement and thereby improve short-text classification. The results show that when the corpus contains 500 samples, the accuracy of the fused character vectors is 5 percentage points higher than that of the BERT character vectors and 3 percentage points higher than that of the GloVe character vectors. The dimensionality of word-sense selection remains to be strengthened. The proposed method classifies short-text data from small corpora accurately, which is of significance for subsequent text-mining work.
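The abstract does not specify the fusion operator used to combine the two pre-trained character vectors. A minimal sketch of one common choice, concatenation, is shown below; the vector dimensions (100-d for GloVe, 768-d for BERT-base) and the function name `fuse_char_vectors` are illustrative assumptions, not details from the paper.

```python
import numpy as np

def fuse_char_vectors(glove_vec: np.ndarray, bert_vec: np.ndarray) -> np.ndarray:
    """Fuse a GloVe character vector with a BERT character vector.

    Concatenation is assumed here; the paper does not state the exact
    fusion operator (alternatives include weighted sums or gating).
    """
    return np.concatenate([glove_vec, bert_vec])

# Hypothetical dimensions: 100-d GloVe vector, 768-d BERT-base vector.
glove_vec = np.random.rand(100)
bert_vec = np.random.rand(768)
fused = fuse_char_vectors(glove_vec, bert_vec)
print(fused.shape)  # (868,)
```

The fused vector would then feed a downstream short-text classifier; a richer representation per character is the intended semantic enhancement.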
Authors
Chen Lan; Yang Fan; Zeng Zhen (School of Information, Guizhou University of Finance and Economics, Guiyang 550000)
Source
Modern Computer (《现代计算机》), 2022, Issue 16, pp. 1-8, 15 (9 pages)
Funding
Ministry of Education Industry-University Cooperative Education Project (BZX1902-20): Integrated experimental teaching design for user information behavior analysis based on Jupyter Notebook.