Abstract
Author identification plays an important role in public security work and in questioned-document examination. Existing approaches to modeling an author's language style are cumbersome, and their hand-crafted text features do not generalize. To address this, the paper proposes CABLSTM, a model for identifying the authors of Chinese microblog posts that requires no expert feature modeling, and evaluates its accuracy on a public microblog corpus. To extract as many features as possible from short texts, the model integrates an attention mechanism into a CNN and removes the pooling layer; a bidirectional LSTM then captures contextual information, and a softmax layer outputs the identification result. Experimental results show that, on the Chinese microblog author identification task, the model improves accuracy, recall, and F-measure to some extent compared with traditional machine learning algorithms, TextCNN, and LSTM.
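The pipeline described in the abstract (convolution without pooling, attention, bidirectional LSTM, softmax) can be illustrated with a minimal forward-pass sketch in NumPy. This is not the authors' implementation: the abstract does not say how attention is fused into the CNN, so the sketch assumes a simple per-position attention weighting of the convolutional features. All names (`init_params`, `cablstm_forward`), layer sizes, and the random weights are hypothetical.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def conv1d_same(x, w, b):
    """x: (T, d_in), w: (k, d_in, d_out). Returns (T, d_out); no pooling is applied."""
    k = w.shape[0]
    pad = k // 2
    xp = np.pad(x, ((pad, k - 1 - pad), (0, 0)))  # zero-pad so output length stays T
    out = np.stack([np.tensordot(xp[t:t + k], w, axes=([0, 1], [0, 1]))
                    for t in range(x.shape[0])])
    return np.maximum(out + b, 0.0)  # ReLU

def lstm(x, W, U, b):
    """Single-direction LSTM. x: (T, d), W: (d, 4h), U: (h, 4h), b: (4h,). Returns (T, h)."""
    h = np.zeros(U.shape[0])
    c = np.zeros(U.shape[0])
    outs = []
    for x_t in x:
        i, f, g, o = np.split(x_t @ W + h @ U + b, 4)
        c = sigmoid(f) * c + sigmoid(i) * np.tanh(g)
        h = sigmoid(o) * np.tanh(c)
        outs.append(h)
    return np.stack(outs)

def init_params(vocab=50, d_emb=16, d_conv=12, k=3, h=8, n_authors=4, seed=0):
    """Random weights for illustration only; every size here is an assumption."""
    rng = np.random.default_rng(seed)
    r = lambda *s: rng.normal(0.0, 0.1, s)
    return {
        "emb": r(vocab, d_emb),
        "conv_w": r(k, d_emb, d_conv), "conv_b": np.zeros(d_conv),
        "attn_v": r(d_conv),
        "lstm_f": (r(d_conv, 4 * h), r(h, 4 * h), np.zeros(4 * h)),
        "lstm_b": (r(d_conv, 4 * h), r(h, 4 * h), np.zeros(4 * h)),
        "out_w": r(2 * h, n_authors), "out_b": np.zeros(n_authors),
    }

def cablstm_forward(token_ids, p):
    x = p["emb"][token_ids]                           # (T, d_emb) embeddings
    feats = conv1d_same(x, p["conv_w"], p["conv_b"])  # (T, d_conv), length preserved
    attn = softmax(feats @ p["attn_v"])               # (T,) assumed attention weights
    feats = feats * attn[:, None]                     # reweight each position
    fwd = lstm(feats, *p["lstm_f"])                   # left-to-right pass
    bwd = lstm(feats[::-1], *p["lstm_b"])[::-1]       # right-to-left pass, realigned
    rep = np.concatenate([fwd[-1], bwd[0]])           # final state of each direction
    return softmax(rep @ p["out_w"] + p["out_b"])     # distribution over authors

params = init_params()
probs = cablstm_forward(np.array([1, 5, 2, 7, 3]), params)
```

Here `probs` is a probability vector with one entry per candidate author. Skipping pooling keeps one feature vector per token position, which is what allows the bidirectional LSTM to run over the full short text.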
Authors
Xu Xiaolin (徐晓霖)
Cai Manchun (蔡满春)
Lu Tianliang (芦天亮)
Affiliation: School of Information Technology & Network Security, People's Public Security University of China, Beijing 102623, China
Source
Application Research of Computers (《计算机应用研究》)
CSCD
PKU Core Journal (北大核心)
2020, Issue 1, pp. 16-18, 25 (4 pages)
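Funding
National Key R&D Program of China, Key Special Project (2017YFB0802804)
National Natural Science Foundation of China (61602489)
People's Public Security University of China 2018 Fundamental Research Funds, Research Institution Project (2018JKF504)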