Abstract
The Internet is an important medium for advertising, but it is also flooded with low-quality, fraudulent, and illegal advertisements that seriously pollute cyberspace. Effectively screening malicious advertisements is therefore of great significance for building a safe and clean network environment. To meet the need to identify various kinds of illegal and non-compliant Chinese advertising content, we use BERT (bidirectional encoder representations from transformers) and Word2vec to extract character-level and word-level embedding features, respectively. A CNN (convolutional neural network) performs deep extraction on the high-level BERT features, while the word-level feature vectors are fed into a bidirectional LSTM (long short-term memory) network to extract global semantics, and an attention mechanism strengthens these semantic features. The enhanced features are then fused with the BERT character-level features, making full use of the complementary semantic representation strengths of dynamic and static word vectors. On this basis we propose CARES (Chinese advertisement text recognition based on enhanced semantic), a Chinese advertisement recognition model based on enhanced semantics. Experiments on a real-world social chat text dataset show that, compared with text classification models such as convolutional and recurrent neural networks, CARES achieves the best classification performance and identifies advertising content in social chat text more precisely, with a recognition accuracy of 97.73%.
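The two-branch design described in the abstract can be illustrated with a small shape-level sketch. This is not the authors' implementation: random arrays stand in for the BERT character-level outputs and the BiLSTM hidden states, and the fusion by concatenation, the additive attention form, the two-class output, all dimensions, and every function and parameter name below are assumptions made only for illustration.

```python
import numpy as np


def conv1d_max_pool(x, kernels, bias):
    """CNN branch: 1-D convolution over time, ReLU, then max-over-time pooling.

    x: (seq_len, dim); kernels: (n_filters, width, dim); bias: (n_filters,).
    Returns a vector of shape (n_filters,).
    """
    width = kernels.shape[1]
    seq_len = x.shape[0]
    # Sliding windows over the sequence: (seq_len - width + 1, width, dim)
    windows = np.stack([x[i:i + width] for i in range(seq_len - width + 1)])
    # Feature map: (n_windows, n_filters)
    fmap = np.einsum("twd,fwd->tf", windows, kernels) + bias
    return np.maximum(fmap, 0.0).max(axis=0)


def attention_pool(h, w, b, u):
    """Attention branch: score_t = u . tanh(W h_t + b), softmax over time.

    h: (seq_len, hidden). Returns the weighted sum (hidden,) and the weights.
    """
    scores = np.tanh(h @ w + b) @ u      # (seq_len,)
    scores = scores - scores.max()       # numerical stability
    alpha = np.exp(scores) / np.exp(scores).sum()
    return alpha @ h, alpha


def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()


def fuse_and_classify(char_feats, word_states, params):
    """Concatenate both branch outputs (assumed fusion) and classify."""
    cnn_vec = conv1d_max_pool(char_feats, params["kernels"], params["conv_bias"])
    att_vec, alpha = attention_pool(
        word_states, params["att_w"], params["att_b"], params["att_u"]
    )
    fused = np.concatenate([cnn_vec, att_vec])
    probs = softmax(params["out_w"] @ fused + params["out_b"])
    return probs, alpha, fused


rng = np.random.default_rng(0)
# Stand-ins: 20 characters x 768-d BERT outputs, 12 words x 256-d BiLSTM states
char_feats = rng.normal(size=(20, 768))
word_states = rng.normal(size=(12, 256))
params = {
    "kernels": rng.normal(scale=0.01, size=(64, 3, 768)),
    "conv_bias": np.zeros(64),
    "att_w": rng.normal(scale=0.1, size=(256, 128)),
    "att_b": np.zeros(128),
    "att_u": rng.normal(scale=0.1, size=128),
    "out_w": rng.normal(scale=0.1, size=(2, 64 + 256)),
    "out_b": np.zeros(2),
}
probs, alpha, fused = fuse_and_classify(char_feats, word_states, params)
print(fused.shape, alpha.shape, probs)
```

In a trained model the random parameters would be learned, and the stand-in arrays would come from a pre-trained BERT encoder and a BiLSTM over Word2vec embeddings; the sketch only shows how the two feature streams could be pooled and combined.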
Authors
ZHAO Wei (赵伟)
DENG Ye-xun (邓叶勋)
ZHAO Jian-qiang (赵建强)
LI Wen-rui (李文瑞)
HAN Bing (韩冰)
OU Rong-an (欧荣安)
Affiliations: Guangzhou Institute of Criminal Science and Technology, Guangzhou 510030, China; Xiamen Meiya Pico Information Co., Ltd., Xiamen 361008, China; Xidian University, Xi’an 710071, China
Source
Computer Technology and Development (《计算机技术与发展》)
2021, No. 3, pp. 65-69, 110 (6 pages in total)
Funding
Guangzhou Major Science and Technology Research Project (201903007).
Keywords
advertising text classification
semantic enhancement
feature fusion
pre-training
attention mechanism