Abstract
To address data sparsity in text classification and the inability to capture longer-range dependencies between segments, this paper proposes an integrated LC-Transformer XL model. Using the word-topic probability distributions of an LDA topic model, high-frequency keywords are extracted from the text. A CNN extracts local feature vectors, while the relative positional encoding and recurrence mechanism of the Transformer-XL model yield global semantic features. The local and global feature vectors are fused, and a Softmax classifier then produces the classification result. Experiments show that on the THUCNews Chinese text dataset the model reaches an F1 score of 0.9318 and an accuracy of 94.15%, indicating good performance on text classification tasks.
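The fusion-and-classification step described in the abstract can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The local branch is a toy 1-D convolution with ReLU and max-over-time pooling. The global branch is a mean-pooling placeholder that stands in for Transformer-XL and reproduces neither relative positional encoding nor segment recurrence. Fusion by concatenation is an assumption, since the abstract does not say how the vectors are combined. All names, dimensions, and random weights are hypothetical, and the LDA keyword-extraction step is omitted.

```python
import math
import random


def conv_max_pool(tokens, kernels):
    """Local features: 1-D convolution over token windows, ReLU, max over time."""
    feats = []
    for kernel in kernels:  # each kernel is a list of `width` weight vectors
        width = len(kernel)
        best = 0.0  # starting at 0 applies the ReLU floor
        for start in range(len(tokens) - width + 1):
            act = sum(
                w * x
                for offset in range(width)
                for w, x in zip(kernel[offset], tokens[start + offset])
            )
            best = max(best, act)
        feats.append(best)
    return feats


def mean_pool(tokens):
    """Placeholder global feature: average token vector (NOT Transformer-XL)."""
    dim = len(tokens[0])
    return [sum(t[i] for t in tokens) / len(tokens) for i in range(dim)]


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def classify(tokens, kernels, weights, bias):
    """Fuse local and global features, then apply a linear layer and softmax."""
    local_vec = conv_max_pool(tokens, kernels)
    global_vec = mean_pool(tokens)
    fused = local_vec + global_vec  # fusion by concatenation (assumption)
    logits = [
        sum(w * f for w, f in zip(row, fused)) + b
        for row, b in zip(weights, bias)
    ]
    return fused, softmax(logits)


def rand_vec(n):
    """Hypothetical random weights; a real model would learn these."""
    return [random.uniform(-1.0, 1.0) for _ in range(n)]


random.seed(0)
DIM, SEQ_LEN, N_KERNELS, WIDTH, N_CLASSES = 4, 6, 3, 2, 5  # illustrative sizes
tokens = [rand_vec(DIM) for _ in range(SEQ_LEN)]  # stand-in token embeddings
kernels = [[rand_vec(DIM) for _ in range(WIDTH)] for _ in range(N_KERNELS)]
weights = [rand_vec(N_KERNELS + DIM) for _ in range(N_CLASSES)]
bias = [0.0] * N_CLASSES

fused, probs = classify(tokens, kernels, weights, bias)
predicted_class = probs.index(max(probs))
print(len(fused), predicted_class)
```

In a fuller version, the `mean_pool` placeholder would be replaced by the output of a trained Transformer-XL encoder, and the token embeddings would come from the keywords selected by LDA.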
Authors
葛夫勇
雷景生
唐小岚
Ge Fuyong; Lei Jingsheng; Tang Xiaolan (Shanghai University of Electric Power, Shanghai 201300, China)
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal list (北大核心)
2023, No. 6, pp. 118-123, 132 (7 pages in total)
Funding
National Natural Science Foundation of China (61672337).