

Deep Residual Dual Unidirectional DLSTM for Video Event Recognition with Spatial-Temporal Consistency
Abstract  Event recognition in surveillance video has attracted growing interest in recent years. Nevertheless, event recognition in real-world surveillance video still faces great challenges, such as cluttered backgrounds, severe occlusion within the event bounding box, and large intra-class variations combined with small inter-class variations. A pronounced trend is that more and more research focuses on learning deep features from raw data. The two-stream CNN (Convolutional Neural Network) architecture has become a very successful model in video analysis, exploiting appearance features and short-term motion features. In contrast, the Long Short-Term Memory (LSTM) network can learn long-term motion features from the input sequence and is widely used for tasks with a strong time-series character. To combine the advantages of the two types of networks, this paper proposes a deep residual dual unidirectional double LSTM (DRDU-DLSTM) for video event recognition in surveillance video with complex scenes.

First, deep features are extracted from the fine-tuned temporal CNN and spatial CNN. Since fully connected (FC) layers carry more semantic information than convolutional layers and are therefore better suited as inputs to an LSTM network, we extract the FC6 feature of the spatial CNN and the FC7 feature of the temporal CNN, respectively. Second, to reinforce spatial-temporal consistency, the deep features are transformed by a spatial LSTM (SLSTM) and a temporal LSTM (TLSTM), respectively, and concatenated into a unit called double LSTM (DLSTM), which forms the input of the residual network. DLSTM cells increase the number of hidden nodes of the LSTM cells and expand the width of the network. The input features of the spatial CNN and the temporal CNN are deeply intertwined by the DLSTM cells; at the same time, the features are propagated and evolve simultaneously, which strengthens the consistency of the spatial and temporal features. Furthermore, two unidirectional DLSTMs are concatenated into a DU-DLSTM layer. Compared with a shallow bidirectional recurrent network, the deep dual unidirectional network captures global information better. The DU-DLSTM architecture further increases the number of hidden nodes; the wider network enlarges the range of candidate features and strengthens feature coupling. One or more DU-DLSTM layers plus an identity mapping form a residual block, in which the identity shortcut alleviates the vanishing-gradient problem of deep networks. Stacked residual blocks constitute the deep residual architecture. With the residual structure, the LSTM network can reach up to 10 layers, which deepens the recurrent network and greatly improves its optimization.
Finally, to further optimize the recognition results, we design a 2C-softmax objective function based on a two-center loss, which computes the center of the spatial features C_S and the center of the temporal features C_T separately; C_S and C_T are then fused into a single centroid according to a preset weight coefficient. The 2C-softmax objective minimizes intra-class variations while keeping the features of different classes separable. Experiments on the VIRAT 1.0 Ground Dataset and the VIRAT 2.0 Ground Dataset demonstrate that the proposed method has good performance and stability, improving recognition accuracy by 5.1% and 7.3%, respectively, over state-of-the-art methods.
Authors  LI Yong-Gang, WANG Zhao-Hui, WAN Xiao-Yi, DONG Hu-Sheng, GONG Sheng-Rong, LIU Chun-Ping, JI Yi, ZHU Rong (School of Computer Science and Technology, Soochow University, Suzhou, Jiangsu 215006; College of Mathematics, Physics and Information Engineering, Jiaxing University, Jiaxing, Zhejiang 314001; School of Computer Science and Engineering, Changshu Institute of Science and Technology, Changshu, Jiangsu 215500; School of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University, Changchun 130012)
Source  Chinese Journal of Computers (EI; CSCD; Peking University Core), 2018, Issue 12, pp. 2852-2866 (15 pages)
Funding  Supported by the National Natural Science Foundation of China (61773272, 61170124, 61272258, 61301299); the "Cloud-Data Fusion for Science and Education Innovation" Fund of the Science and Technology Development Center, Ministry of Education (2017B03112); the Natural Science Foundation of Jiangsu Province (BK20151260, BK20151254); the Natural Science Foundation of Zhejiang Province (LY15F020039); the Jiangsu Province "Six Talent Peaks" Project (DZXX-027); the Fund of the Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University (93K172016K08); and the Postgraduate Research and Practice Innovation Program of Jiangsu Province (KYCX17_2006).
Keywords  event recognition; spatial-temporal consistency; residual network; long short-term memory (LSTM); dual unidirectional; double long short-term memory (DLSTM); deep feature; surveillance video