Abstract
Chinese short texts are characterized by few characters, frequent ambiguity, and irregular information, which makes their text features difficult to extract and represent. Most current text classification methods extract local text features with a convolutional neural network that uses a single convolution kernel, and this often leads to poor classification performance because the random initialization of network parameters is inconsistent. To address this, a short text classification model based on multi-convolution-kernel character and word features, MCFCW (Multi-CNN Fusion of Characters and Words), is proposed. First, pre-trained ERNIE and Word2vec models are used to enrich the character and word embedding representations of the text. Then, multi-kernel TextCNN and DPCNN are applied separately to extract semantic information from different perspectives while reducing the influence of random parameter initialization. Finally, the high-level character and word feature vectors extracted by the two channels are concatenated to form the final text classification feature. The model is evaluated on the THUCNews news headline dataset. The results show that it outperforms current mainstream models on three evaluation metrics (precision, recall, and F1 score) and classifies short texts well.
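The two-channel idea in the abstract (several convolution kernels per channel, then concatenation of the pooled features) can be illustrated with a minimal standard-library sketch. It is not the authors' implementation. The random vectors stand in for ERNIE and Word2vec embeddings, both channels use the same shallow convolution with max-over-time pooling (the paper's DPCNN is a deeper network that is not reproduced here), and the names `conv1d_max`, `multi_kernel_features`, `random_kernel`, and `fused_features` are hypothetical.

```python
import random


def conv1d_max(seq, kernel, bias=0.0):
    """Slide one kernel (width x dim) over a sequence (length x dim),
    apply ReLU, and keep the maximum response (max-over-time pooling)."""
    width = len(kernel)
    best = 0.0  # starting at 0 is equivalent to applying ReLU before the max
    for start in range(len(seq) - width + 1):
        response = bias
        for offset in range(width):
            for x, w in zip(seq[start + offset], kernel[offset]):
                response += x * w
        best = max(best, response)
    return best


def multi_kernel_features(seq, kernels):
    """One pooled feature per kernel; kernels may have different widths."""
    return [conv1d_max(seq, kernel) for kernel in kernels]


def random_kernel(rng, width, dim):
    """Hypothetical helper: a randomly initialised kernel of shape width x dim."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(dim)] for _ in range(width)]


def fused_features(char_emb, word_emb, char_kernels, word_kernels):
    """Concatenate the character-channel and word-channel feature vectors."""
    return (multi_kernel_features(char_emb, char_kernels)
            + multi_kernel_features(word_emb, word_kernels))


rng = random.Random(0)
char_dim, word_dim = 8, 6  # assumed toy embedding sizes

# Stand-ins for pre-trained embeddings: 10 characters and 5 words.
char_emb = [[rng.uniform(-1, 1) for _ in range(char_dim)] for _ in range(10)]
word_emb = [[rng.uniform(-1, 1) for _ in range(word_dim)] for _ in range(5)]

# Several kernels per width, so no single random initialisation dominates.
char_kernels = [random_kernel(rng, w, char_dim) for w in (2, 3, 4) for _ in range(2)]
word_kernels = [random_kernel(rng, w, word_dim) for w in (2, 3) for _ in range(2)]

features = fused_features(char_emb, word_emb, char_kernels, word_kernels)
print(len(features))  # 6 character features + 4 word features = 10
```

In a full model the fused vector would feed a classification layer, and the kernels would be learned rather than left at their random initial values.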
Authors
LI Pan (李攀)
WU Yadong (吴亚东)
CHU Qikai (褚琦凯)
ZHANG Guiyu (张贵宇)
FU Chaoshuai (付朝帅)
Affiliations: School of Automation and Information Engineering, Sichuan University of Science and Engineering, Yibin 644000, China; School of Computer Science and Engineering, Sichuan University of Science and Engineering, Yibin 644000, China; Artificial Intelligence Key Laboratory of Sichuan Province, Yibin 644000, China; Big Data Visual Analysis Engineering Technology Laboratory of Sichuan Province, Yibin 644000, China
Source
Journal of Sichuan University of Science & Engineering (Natural Science Edition) 《四川轻化工大学学报(自然科学版)》
CAS
2023, No. 1, pp. 73-83 (11 pages)
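Funding
Sichuan Province Science and Technology Achievement Transfer and Transformation Demonstration Project (2020ZHCG0040)
Sichuan Province Major Science and Technology Special Project (2018GZDZX0045)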
Keywords
Chinese short text classification
ERNIE
Word2vec
multi-convolution-kernel character and word features
convolutional neural network