Abstract
The spread of violent videos has become one of the hidden hazards facing the governance of the online environment, and intelligent recognition of this special category of video is of great significance for maintaining Internet content security. Because the videos are collected from diverse sources, the distribution of violent videos usually exhibits large intra-class variance and small inter-class variance, and common violent video recognition models struggle to adapt to complex and changeable violent scenes. Meanwhile, the term "violence" itself carries highly abstract semantics, and learning a generic semantic representation of violence from limited data is a major difficulty. In response to these problems, we present a novel multimodal violent video recognition model based on semantic embedding learning. The model consists mainly of three parts. (1) Multimodal feature extraction. Considering that videos have multimodal properties, we use three different deep neural networks to extract feature representations of three modalities, i.e., appearance, motion, and audio. (2) Multimodal feature fusion. To obtain a robust, generic video representation, we design a lightweight multimodal feature fusion module, referred to as MEFM (Multimodal Efficient Fusion Module). The module comprises two parts, shared space mapping and multimodal feature interaction, which let the multimodal features interact fully while effectively suppressing interference between the information of different modalities. (3) Semantic embedding learning. To accommodate violence datasets with different data distributions, we propose a multi-task learning method based on semantic embedding. It builds a semantic center of violence by introducing a center loss, and uses a cosine embedding loss to aggregate violent samples toward the center while dispersing non-violent samples, forming a semantically discriminative feature representation. This enhances the generalization ability of the model and reduces the interference of data noise. Experiments on three public datasets, VSD2015, Violent Flows, and RWF-2000, show that the proposed violent video recognition model improves on existing methods by 4.79%, 0.81%, and 1.5%, respectively, achieving competitive results.
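The semantic embedding step described in the abstract can be illustrated with a small sketch. The abstract does not give the loss formulas, so everything below is an assumption: the center loss follows the common half squared Euclidean distance form, the cosine embedding loss follows the usual definition (1 - cos for positive pairs, max(0, cos - margin) for negative pairs), the center loss is applied to violent samples only, and the names `semantic_embedding_loss`, `margin`, and `weight` are hypothetical. It is not the authors' implementation.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    # Guard against division by zero for all-zero vectors.
    return dot / max(norm_u * norm_v, 1e-12)


def center_loss(features, center):
    """Mean of half the squared Euclidean distance to the center (assumed form)."""
    total = 0.0
    for x in features:
        total += 0.5 * sum((f - c) ** 2 for f, c in zip(x, center))
    return total / len(features)


def cosine_embedding_loss(feature, center, is_violent, margin=0.0):
    """Pull violent samples toward the center, push non-violent ones away."""
    sim = cosine_similarity(feature, center)
    if is_violent:
        return 1.0 - sim
    return max(0.0, sim - margin)


def semantic_embedding_loss(features, labels, center, margin=0.0, weight=0.5):
    """Hypothetical combination: cosine term on all samples plus a weighted
    center term on violent samples (label 1). The weighting is an assumption."""
    violent = [x for x, y in zip(features, labels) if y == 1]
    l_center = center_loss(violent, center) if violent else 0.0
    l_cos = sum(
        cosine_embedding_loss(x, center, y == 1, margin)
        for x, y in zip(features, labels)
    ) / len(features)
    return l_cos + weight * l_center


if __name__ == "__main__":
    center = [1.0, 0.0]                    # toy violence semantic center
    feats = [[0.9, 0.1], [0.2, 1.0]]       # toy fused video features
    labels = [1, 0]                        # 1 = violent, 0 = non-violent
    print(semantic_embedding_loss(feats, labels, center))
```

In this toy setting the loss decreases as violent features align with the center and non-violent features rotate away from it, which is the aggregation and dispersion behavior the abstract describes. In a real model the center would be a learned parameter and the features would come from the fusion module.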
Authors
吴晓雨
蒲禹江
王生进
刘子豪
WU Xiao-yu; PU Yu-jiang; WANG Sheng-jin; LIU Zi-hao (School of Information and Communication, Communication University of China, Beijing 100024, China; State Key Laboratory of Media Convergence and Communication, Communication University of China, Beijing 100024, China; Department of Electronic Engineering, Tsinghua University, Beijing 100084, China)
Source
《电子学报》
EI
CAS
CSCD
Peking University Core Journal (北大核心)
2023, No. 11, pp. 3225-3237 (13 pages)
Acta Electronica Sinica
Funding
National Natural Science Foundation of China (No. 61801441).
Keywords
violent video recognition
multimodal feature fusion
semantic embedding
multi-task learning