Abstract
Long news text classification is an important task in natural language processing, but traditional text representation methods suffer from sparse features and insufficient semantic information. In addition, long news texts contain a large amount of redundant information and may touch on other topics, both of which lead to incomplete text feature extraction. To address these problems, this paper proposes a long news text classification model based on an improved TF-IDF algorithm and AGLCNN. The model first improves the TF-IDF algorithm using the inter-class and intra-class distribution of feature items together with their position information, and combines the result with Word2Vec word vectors for text representation. An attention mechanism then highlights keyword information, and its output is fed into a Bi-LSTM to capture contextual features of the text. Next, a CNN is used to highlight the salient features of the news topic. Because long news texts may contain sentences about other topics, a gating mechanism is introduced to fuse the output features of the Bi-LSTM and the CNN into the final text feature representation. Finally, the feature vector is fed into a Softmax layer for news classification. Comparative experiments on the THUCNews dataset and the Sohu News dataset show that the proposed model achieves recall of 0.985 and 0.976 on the two datasets, respectively, outperforming the other classification models compared.
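The abstract names a gating mechanism that fuses the Bi-LSTM and CNN features but does not state its formula. As a rough illustration only, the sketch below assumes one common form of gated fusion: a sigmoid gate computed from the concatenated features, used to take an element-wise convex combination of the two vectors. The function names, the parameter names `w_gate` and `b_gate`, and the gate form itself are assumptions for illustration, not the authors' published definition.

```python
import math
from typing import List, Sequence

Vector = List[float]
Matrix = List[List[float]]


def sigmoid(x: float) -> float:
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def matvec(w: Matrix, v: Sequence[float]) -> Vector:
    """Multiply matrix w (rows x cols) by vector v (length cols)."""
    return [sum(w_ij * v_j for w_ij, v_j in zip(row, v)) for row in w]


def gated_fusion(
    h_lstm: Sequence[float],
    h_cnn: Sequence[float],
    w_gate: Matrix,
    b_gate: Sequence[float],
) -> Vector:
    """Fuse two equal-length feature vectors with a learned gate.

    Assumed form (hypothetical, not taken from the paper):
        g = sigmoid(W_g [h_lstm; h_cnn] + b_g)
        z = g * h_lstm + (1 - g) * h_cnn
    """
    if len(h_lstm) != len(h_cnn):
        raise ValueError("feature vectors must have the same length")
    concat = list(h_lstm) + list(h_cnn)
    pre_activation = matvec(w_gate, concat)
    gate = [sigmoid(p + b) for p, b in zip(pre_activation, b_gate)]
    # Each output element is a convex combination of the two inputs.
    return [g * h + (1.0 - g) * c for g, h, c in zip(gate, h_lstm, h_cnn)]
```

With all-zero gate parameters the gate equals 0.5 in every position, so the output is the element-wise mean of the two inputs. In a trained model the gate parameters would be learned, letting each dimension lean toward the contextual or the convolutional feature.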
Authors
ZHOU Xianxi (周宪溪)
MU Li (牟莉)
School of Computer Science, Xi'an Polytechnic University, Xi'an 710600, China
Source
Computer and Modernization (《计算机与现代化》)
2024, No. 8, pp. 120-126 (7 pages)
Funding
Shaanxi Provincial Science and Technology Plan Project (2019CGXNG-015).
Keywords
text classification
TF-IDF
attention mechanism
convolutional neural network
feature item