Sparse Adversarial Examples Attacking on Video Captioning Model (针对视频语义描述模型的稀疏对抗样本攻击)
Abstract: In multimodal deep learning, many studies have shown that image captioning models are vulnerable to adversarial examples, yet the robustness of video captioning models has received little attention. There are two main reasons. First, unlike an image captioning model, a video captioning model takes a stream of images as input rather than a single image, so perturbing every frame of a video is computationally expensive. Second, unlike a video recognition model, a video captioning model outputs not a single word but a more complex semantic description. To address these problems and study the robustness of video captioning models, this paper proposes a sparse adversarial example attack on video captioning models. First, drawing on saliency analysis from image recognition, it proposes a method for evaluating how much each frame of a video contributes to the model output, and selects keyframes to perturb on that basis. Second, it designs an L2-norm-based optimization objective for video captioning models. Experiments on the MSR-VTT dataset show that the method achieves a 96.4% success rate on targeted attacks and reduces the number of queries by more than 45% compared with randomly selecting video frames. These results verify the effectiveness of the method and reveal the vulnerability of video captioning models.
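The two steps named in the abstract (scoring frames by their contribution, then optimizing a perturbation restricted to keyframes under an L2 penalty) can be illustrated with a minimal sketch. The abstract does not say how contribution is measured or how the objective is weighted, so everything here is an assumption for illustration: `score_fn` and `caption_loss` are hypothetical callables standing in for the captioning model, contribution is approximated by occluding one frame at a time, and `lam` is a made-up penalty weight. It is not the authors' implementation.

```python
import numpy as np

def frame_contributions(video, score_fn):
    """Score each frame by how much occluding it changes the model score.

    video: array of shape (T, H, W, C).
    score_fn: hypothetical callable mapping a video to a scalar
    (for example, the model's confidence in a caption).
    """
    base = score_fn(video)
    scores = np.zeros(video.shape[0])
    for t in range(video.shape[0]):
        masked = video.copy()
        masked[t] = 0.0  # occlude frame t (assumed contribution measure)
        scores[t] = abs(base - score_fn(masked))
    return scores

def select_keyframes(scores, k):
    """Return the sorted indices of the k highest-scoring frames."""
    return np.sort(np.argsort(scores)[::-1][:k])

def sparse_objective(delta, keyframes, caption_loss, video, lam=0.1):
    """Caption loss on the perturbed video plus an L2 penalty.

    The perturbation is zeroed on every frame that is not a keyframe,
    which is what makes the attack sparse in time.
    """
    mask = np.zeros_like(video)
    mask[keyframes] = 1.0
    sparse_delta = delta * mask
    # np.linalg.norm with default arguments flattens the array and
    # returns its L2 norm.
    return caption_loss(video + sparse_delta) + lam * np.linalg.norm(sparse_delta)
```

In a query-based setting, each call to `score_fn` or `caption_loss` would count as one model query, which is why perturbing only a few keyframes rather than every frame could reduce the query budget.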
Authors: QIU Jiangxing (邱江兴); TANG Xueming (汤学明); WANG Tianmei (王天美); WANG Chen (王成); CUI Yongquan (崔永泉); LUO Ting (骆婷) (Hubei Key Laboratory of Distributed System Security, Hubei Engineering Research Center on Big Data Security, School of Cyber Science and Engineering, Huazhong University of Science and Technology, Wuhan 430074, China)
Source: Computer Science (《计算机科学》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 12, pp. 330-336 (7 pages)
Keywords: Multi-modal model; Video captioning model; Adversarial example attack; Image saliency (saliency map); Keyframe selection