Abstract
Multi-label text classification has become one of the core tasks in natural language processing. Its goal is to annotate a document with the most relevant labels from a set of candidate labels. Building on text classification, this paper takes the TextCNN neural network as the base classification framework and describes a multi-label classification task carried out on a self-built dataset with an improved TextCNN. Policy texts from cities across China were collected with a web crawler to build a new policy dataset. After the data were preprocessed, a model based on the improved TextCNN was trained to perform multi-label classification. In comparative experiments, the improved TextCNN combined with Baidu Encyclopedia word vectors achieved relatively good classification results on the self-built dataset.
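As a rough illustration of how a TextCNN-style model can produce multiple labels per document, the following sketch applies convolution filters of several widths to a sequence of word vectors, keeps the maximum activation of each filter, and scores every label independently with a sigmoid. It is not the authors' improved TextCNN, whose details the abstract does not give. The weights are random, the embeddings stand in for pretrained word vectors, and the label names, filter widths, and 0.5 threshold are all hypothetical.

```python
import math
import random


def conv_max_pool(embedded, filt, bias):
    """Slide one filter (width x dim) over the token sequence, apply ReLU,
    and keep the largest activation (max-over-time pooling)."""
    width = len(filt)
    best = 0.0  # ReLU output is never below zero; also the result if no window fits
    for start in range(len(embedded) - width + 1):
        total = bias
        for offset in range(width):
            total += sum(w * x for w, x in zip(filt[offset], embedded[start + offset]))
        best = max(best, total)
    return best


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def predict_labels(embedded, filters, out_weights, out_bias, label_names, threshold=0.5):
    """Return per-label probabilities and the labels whose probability
    reaches the threshold. Each label is decided independently, so a
    document can receive zero, one, or several labels."""
    features = [conv_max_pool(embedded, filt, bias) for filt, bias in filters]
    probs = [
        sigmoid(sum(w * x for w, x in zip(row, features)) + b)
        for row, b in zip(out_weights, out_bias)
    ]
    chosen = [name for name, p in zip(label_names, probs) if p >= threshold]
    return probs, chosen


rng = random.Random(0)
dim = 4  # toy embedding size (assumption)


def rand_matrix(rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


embedded = rand_matrix(6, dim)  # stand-in for 6 tokens of pretrained word vectors
filters = [(rand_matrix(width, dim), 0.0) for width in (2, 3, 4)]  # hypothetical widths
label_names = ["label_a", "label_b", "label_c"]  # hypothetical policy categories
out_weights = rand_matrix(len(label_names), len(filters))
out_bias = [0.0] * len(label_names)

probs, chosen = predict_labels(embedded, filters, out_weights, out_bias, label_names)
print(probs, chosen)
```

The per-label sigmoid is what separates this from single-label classification, where a softmax would force the labels to compete for one slot. In training, such a model would typically be fitted with a binary cross-entropy loss summed over the labels.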
Authors
李悦
汤鲲
LI Yue; TANG Kun (Wuhan Research Institute of Posts and Telecommunications, Wuhan 430070, China; Fiber HomeWorld Communication Technology Co., Ltd., Nanjing 210019, China)
Source
《电子设计工程》
2022, No. 12, pp. 43-47 (5 pages)
Electronic Design Engineering