Abstract
In the information age, the Internet generates massive amounts of text data that carry considerable commercial and scientific value, and text classification has therefore attracted wide attention. Text classification plays an important role in applications such as information retrieval and is also a key technique in research areas such as natural language processing. Addressing the characteristics of news text and the long training time of deep learning classifiers, this paper proposes a model that combines pooling with attention and applies it to Chinese news text classification. The model first extracts text features with max pooling and average pooling, then uses an attention mechanism to generate weights for the sentence, and classifies on the concatenation of the two results. Experiments were run on the NLPCC2014 news text classification dataset. Accuracy on the first-level categories reached 83.96%, close to the best reported result on that dataset, and the model converged in less time than standard deep learning algorithms.
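The pipeline described in the abstract (pooling features, attention weights, concatenation, classification) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the abstract does not specify the attention form, the layer sizes, or how the pieces are concatenated, so the additive tanh scoring, the parameter names (`W`, `b`, `u`, `W_out`, `b_out`), the random weights, and all dimensions below are assumptions made for the example.

```python
import numpy as np

def softmax(z):
    # Numerically stable softmax over a 1-D vector.
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def pool_attention_features(X, W, b, u):
    """Build a text representation from token embeddings X of shape (T, d).

    Assumed design (not taken from the paper): additive attention with a
    tanh hidden layer, concatenated with max- and average-pooled features.
    """
    max_feat = X.max(axis=0)        # (d,) max pooling over tokens
    avg_feat = X.mean(axis=0)       # (d,) average pooling over tokens
    hidden = np.tanh(X @ W + b)     # (T, k) hidden projection per token
    alpha = softmax(hidden @ u)     # (T,) attention weights, sum to 1
    attn_feat = alpha @ X           # (d,) attention-weighted sum of tokens
    features = np.concatenate([max_feat, avg_feat, attn_feat])  # (3d,)
    return features, alpha

def classify(features, W_out, b_out):
    # Linear layer followed by softmax over the class scores.
    return softmax(W_out @ features + b_out)

# Toy example with random (untrained) parameters; all sizes are hypothetical.
rng = np.random.default_rng(0)
T, d, k, n_classes = 6, 8, 4, 3
X = rng.normal(size=(T, d))
W, b, u = rng.normal(size=(d, k)), np.zeros(k), rng.normal(size=k)
W_out, b_out = rng.normal(size=(n_classes, 3 * d)), np.zeros(n_classes)

features, alpha = pool_attention_features(X, W, b, u)
probs = classify(features, W_out, b_out)
print(features.shape, probs.shape)
```

Because the pooling and attention steps involve no recurrence, every token is processed in one matrix operation, which is one plausible reading of why such a model could converge faster than a standard recurrent or convolutional baseline.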
Authors
陶永才
杨朝阳
石磊
卫琳
TAO Yong-cai; YANG Zhao-yang; SHI Lei; WEI Lin (School of Information Engineering, Zhengzhou University, Zhengzhou 450001, China; School of Software, Zhengzhou University, Zhengzhou 450002, China)
Source
《小型微型计算机系统》
CSCD
Peking University Core Journals (北大核心)
2019, Issue 11, pp. 2393-2397 (5 pages)
Journal of Chinese Computer Systems
Funding
Supported by the Key Scientific Research Project of Higher Education Institutions of Henan Province (16A520027)
Keywords
text classification
attention mechanism
max pooling
machine learning