A Study of News Title Classification Fusing Contextual Features and BERT Word Embedding (cited by: 15)

News Title Classification Based on Contextual Features and BERT Word Embedding


Abstract: [Purpose/significance] With the growth of social media, the volume of news has surged and public opinion monitoring has become increasingly important. Identifying public-opinion news efficiently and accurately helps the relevant departments collect, track, and handle information on emergencies in time, reducing the impact of public opinion on society. This paper proposes a news title text classification model that combines BERT, TEXTCNN, and BILSTM and takes word embedding information, text features, and context information into account to improve the accuracy of news title category recognition. [Method/process] News title text vectors generated by BERT are fed into TEXTCNN to extract features; the TEXTCNN output is fed into BILSTM to capture the contextual information of the title; softmax then determines the class. [Result/conclusion] The proposed model, which combines the language-model-based BERT, the word-vector-based TEXTCNN, and the context-based BILSTM, exceeds 0.92 on accuracy, precision, recall, and F1, generalizes well, and outperforms traditional text classification models. [Innovation/limitation] The paper uses BERT for word embedding while also extracting features and capturing contextual semantics, and the model performs well at identifying news categories. However, the model has many parameters and high-dimensional vectors, which places high demands on training hardware, and the data cover only 10 categories; no experiments were run on data with more or finer-grained categories.
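To make the final decision step and the four reported metrics concrete, the following is a minimal pure-Python sketch, not the authors' code. It assumes macro-averaging across classes, which the abstract does not specify, and uses a hypothetical three-class toy example in place of the paper's 10 news categories. The function names `softmax`, `predict`, and `macro_metrics` are illustrative only.

```python
import math
from collections import Counter


def softmax(logits):
    """Numerically stable softmax over one list of class scores."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def predict(logits):
    """Return the index of the most probable class for one title."""
    probs = softmax(logits)
    return max(range(len(probs)), key=probs.__getitem__)


def macro_metrics(y_true, y_pred, num_classes):
    """Accuracy plus macro-averaged precision, recall and F1 (assumed averaging)."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class p, but it was wrong
            fn[t] += 1  # true class t was missed
    precisions, recalls, f1s = [], [], []
    for c in range(num_classes):
        p_den = tp[c] + fp[c]
        r_den = tp[c] + fn[c]
        prec = tp[c] / p_den if p_den else 0.0
        rec = tp[c] / r_den if r_den else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    return {
        "accuracy": sum(tp.values()) / len(y_true),
        "precision": sum(precisions) / num_classes,
        "recall": sum(recalls) / num_classes,
        "f1": sum(f1s) / num_classes,
    }


# Hypothetical classifier outputs (one row of logits per title) and gold labels.
toy_logits = [[2.0, 0.1, 0.3], [0.2, 1.5, 0.1], [0.1, 0.2, 3.0], [1.2, 0.9, 0.1]]
toy_labels = [0, 1, 2, 1]
toy_preds = [predict(row) for row in toy_logits]
toy_scores = macro_metrics(toy_labels, toy_preds, num_classes=3)
print(toy_preds, toy_scores)
```

In the described pipeline the logits would come from the BILSTM output layer; here they are hand-written values so the sketch runs without any model.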

Authors: FAN Hao; HE Hao (School of Information Management, Wuhan University, Wuhan 430072, China)

Affiliation: School of Information Management, Wuhan University

Source: Information Science (情报科学), CSSCI, Peking University Core Journal, 2022, No. 6, pp. 90-97 (8 pages)

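Funding: National Natural Science Foundation of China project "Discovery and Deep Mining of Implicit Collaboration Relationships Based on the Knowledge Graph of Scientific Communities" (72074172); National Demonstration Center for Experimental Library and Information Science Education (Wuhan University).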

Keywords: text classification; news headlines; BERT word embedding; TEXTCNN; BILSTM

Classification code: G254 [Cultural Sciences: Library Science]