
Multi-person interaction action recognition based on spatio-temporal graph convolution

Cited by: 5
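Abstract Objective Multi-person interaction recognition has wide real-world applications. Existing research on human activity analysis mainly focuses on classifying video clips of simple single-person behaviors, while the problem of understanding complex human activities that involve relationships among multiple people has not been adequately solved. Method Based on the characteristics of the two participants' limb movements in multi-person interaction actions, this paper proposes a skeleton-based spatio-temporal modeling method. The spatio-temporal modeling features are fed into a generalized graph convolution for feature learning, and the spectral graph convolution is approximated with high-order fast Chebyshev polynomials. Interaction information between skeletons is also designed, and capturing this additional interaction information improves recognition accuracy. To strengthen the extraction of temporal information, the sliced recurrent neural network (RNN) is applied, as a novel step, to video action recognition to capture dependencies across the whole action sequence. Result The algorithm is evaluated on the UT-Interaction and SBU datasets. On UT-Interaction, it is compared with H-LSTCM (hierarchical long short-term concurrent memory) and other algorithms and improves on the second-best algorithm by 0.7%. On SBU, it improves on GCNConv (semi-supervised classification with graph convolutional networks), RotClips+MTCNN (rotating clips + multi-task convolutional neural network), and SGC (simplifying graph convolutional networks) by 5.2%, 1.03%, and 1.2%, respectively. Fusion experiments on SBU separately verify the effectiveness of the different connections and of the sliced RNN. Conclusion The proposed interaction recognition method that fuses spatio-temporal graph convolution achieves high accuracy on interaction actions and is generally applicable to recognizing behaviors in which objects interact.

Objective The recognition of multi-person interaction behavior has wide applications in real life. Current research on human activity analysis mainly focuses on classifying video clips of single-person behaviors, and the problem of understanding complex human activities that involve relationships between multiple people remains unresolved. In multi-person behavior recognition, the body information is richer and the description of two-person action features is more complex, so recognition methods tend to become complicated and recognition accuracy tends to be low. When the recognition target changes from a single person to multiple people, we need to attend not only to the action information of each person but also to the interaction information between different subjects. At present, this multi-person interaction information cannot be extracted well. To address this problem, we propose a multi-person interaction behavior-recognition algorithm based on skeleton graph convolution. Method The advantage of this method is that it can fully utilize the spatial and temporal dependence information between human joints.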
We design the interaction information between skeletons to discover potential relationships between different individuals and different key points. Capturing this additional interaction information improves the accuracy of action recognition. Considering the characteristics of multi-person interaction behavior, this study proposes a skeleton-based spatio-temporal graph convolution model. In the spatial dimension, we design separate connections for the single-person and multi-person cases. The single-person connections are defined within each frame: in addition to the physical connections between body joints, we add potential correlations between joints that are not physically connected, such as the left and right hands of one person. The interaction connections between two people are also defined within each frame. We use Euclidean distance to measure the correlation between interaction nodes and to determine which joints of the two persons should be connected. Connecting key points of the two persons in this way adds new and necessary interaction connections, which act as a bridge for describing the interaction information of the two persons' actions, while also avoiding noisy connections and keeping the underlying graph sparse. In the time dimension, we segment the action sequence and use every three frames as a processing unit. We connect joints across three adjacent frames, so that more adjacent joints enlarge the receptive field and help the model learn how the action changes over time. This modeling in the time and space dimensions yields a complex action skeleton graph. We use a generalized graph convolution model to extract and summarize the two-person action features, and approximate the spectral graph convolution with high-order fast Chebyshev polynomials to obtain high-level feature maps.
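The spatial construction and the Chebyshev approximation described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The distance threshold `max_dist`, the toy joint count and bone list, the random weights, the polynomial order, and all function names are assumptions made for illustration. The filter uses the commonly used rescaled normalized Laplacian and assumes a largest eigenvalue of about 2.

```python
import numpy as np

def interaction_edges(person_a, person_b, max_dist):
    """Return (i, j) joint pairs of persons A and B closer than max_dist.

    person_a, person_b: arrays of shape (J, D) holding joint coordinates.
    max_dist is a hypothetical threshold; the text only states that
    Euclidean distance is used to choose interaction connections.
    """
    diff = person_a[:, None, :] - person_b[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)          # (J, J) pairwise distances
    return [(int(i), int(j)) for i, j in zip(*np.where(dist < max_dist))]

def build_adjacency(num_joints, bones, cross_edges):
    """Symmetric adjacency for two skeletons stacked as 2 * num_joints nodes."""
    n = 2 * num_joints
    adj = np.zeros((n, n))
    for offset in (0, num_joints):                # intra-person connections
        for i, j in bones:
            adj[offset + i, offset + j] = adj[offset + j, offset + i] = 1.0
    for i, j in cross_edges:                      # inter-person connections
        adj[i, num_joints + j] = adj[num_joints + j, i] = 1.0
    return adj

def chebyshev_conv(x, adj, weights):
    """Chebyshev graph filter: sum over k of T_k(L_tilde) @ x @ weights[k]."""
    n = adj.shape[0]
    deg = adj.sum(axis=1)
    d_inv_sqrt = np.where(deg > 0, 1.0 / np.sqrt(np.maximum(deg, 1e-12)), 0.0)
    lap = np.eye(n) - d_inv_sqrt[:, None] * adj * d_inv_sqrt[None, :]
    lap_tilde = lap - np.eye(n)                   # 2L/lambda_max - I, lambda_max assumed 2
    t_prev, t_curr = x, lap_tilde @ x             # T_0 x and T_1 x
    out = t_prev @ weights[0]
    if len(weights) > 1:
        out = out + t_curr @ weights[1]
    for w in weights[2:]:                         # T_k = 2 L_tilde T_{k-1} - T_{k-2}
        t_prev, t_curr = t_curr, 2.0 * lap_tilde @ t_curr - t_prev
        out = out + t_curr @ w
    return out

# Toy usage with 5 joints per person and 3-D coordinates (all values made up).
rng = np.random.default_rng(0)
J = 5
joints_a = rng.normal(size=(J, 3))
joints_b = rng.normal(size=(J, 3)) + 0.5
bones = [(0, 1), (1, 2), (2, 3), (3, 4)]
adj = build_adjacency(J, bones, interaction_edges(joints_a, joints_b, max_dist=1.5))
x = np.concatenate([joints_a, joints_b])          # node features, shape (10, 3)
weights = [rng.normal(size=(3, 8)) for _ in range(3)]
features = chebyshev_conv(x, adj, weights)        # shape (10, 8)
```

Thresholding the pairwise distances keeps only nearby joint pairs, which is one simple way to obtain the sparse inter-person graph that the text describes. The paper may select the connections differently.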
At the same time, to enhance the extraction of time-domain information, we apply the sliced recurrent neural network (RNN) to video action recognition to strengthen the representation of two-person actions. The input sequence is divided into multiple equal-length subsequences, and a separate RNN extracts features from each subsequence, so all subsequences can be processed at the same time. This overcomes the limitation that a conventional RNN cannot be parallelized. Through information transfer between layers, the local information of the subsequences is integrated in the higher layers of the network, which summarizes information from local to global and lets the network capture dependencies across the entire action sequence. The loss of information at slice boundaries is addressed by taking every three frames as a processing unit. Result This study validates the proposed algorithm on two datasets (UT-Interaction and SBU) and compares it with other advanced interaction-recognition methods. The UT-Interaction dataset contains six classes of actions and the SBU interaction dataset has eight classes of actions. We use 10-fold and 5-fold cross-validation for evaluation. On the UT-Interaction dataset, compared with H-LSTCM (hierarchical long short-term concurrent memory) and other methods, the proposed algorithm improves on the second-best algorithm by 0.7%. On the SBU dataset, compared with GCNConv, RotClips+MTCNN, SGCConv, and other methods, the proposed algorithm improves by 5.2%, 1.03%, and 1.2%, respectively. Fusion experiments are also conducted on the SBU dataset to verify the effectiveness of the various connections and of the sliced RNN. The method effectively extracts additional interaction information and performs well on the recognition of interaction actions.
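As an illustration of the sliced RNN idea described in the Method part, the following is a minimal NumPy sketch under stated assumptions, not the authors' network. The plain tanh cell, the weight shapes, the slice length of 3, the three-layer depth, and the function names are all hypothetical. For brevity one cell is shared by all slices within a layer, whereas the text speaks of a separate RNN per subsequence.

```python
import numpy as np

def rnn_last_state(seq, w_in, w_rec):
    """Run a plain tanh RNN over seq of shape (T, D); return the last hidden state."""
    h = np.zeros(w_rec.shape[0])
    for x_t in seq:
        h = np.tanh(x_t @ w_in + h @ w_rec)
    return h

def sliced_rnn(seq, slice_len, params):
    """Hierarchical (sliced) RNN over seq of shape (T, D).

    params is a bottom-up list of (w_in, w_rec) pairs, one per layer.
    Every layer except the last cuts its input into chunks of slice_len
    steps and summarizes each chunk by its last hidden state. The chunks
    within a layer do not depend on each other, so they could be
    processed in parallel. The last layer reads the remaining short
    sequence and returns a single global state.
    """
    current = seq
    for w_in, w_rec in params[:-1]:
        if len(current) % slice_len != 0:
            raise ValueError("sequence length must be divisible by slice_len")
        slices = current.reshape(-1, slice_len, current.shape[-1])
        current = np.stack([rnn_last_state(s, w_in, w_rec) for s in slices])
    w_in, w_rec = params[-1]
    return rnn_last_state(current, w_in, w_rec)

# Toy usage: 27 frames of 6-D features, hidden size 4, three layers (27 -> 9 -> 3 -> 1).
rng = np.random.default_rng(0)
frames = rng.normal(size=(27, 6))
hidden = 4
params = [
    (0.1 * rng.normal(size=(6, hidden)), 0.1 * rng.normal(size=(hidden, hidden))),
    (0.1 * rng.normal(size=(hidden, hidden)), 0.1 * rng.normal(size=(hidden, hidden))),
    (0.1 * rng.normal(size=(hidden, hidden)), 0.1 * rng.normal(size=(hidden, hidden))),
]
summary = sliced_rnn(frames, slice_len=3, params=params)   # shape (4,)
```

With a slice length of 3, the longest recurrent chain in any layer has three steps instead of 27, which is where the potential for parallel speed-up comes from.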
Conclusion The interaction recognition method proposed in this paper, which fuses spatio-temporal graph convolution, achieves high accuracy in recognizing interaction actions and is generally applicable to recognizing behaviors that involve interaction between objects.
Authors: Cheng Keyang (成科扬); Wu Jinxia (吴金霞); Wang Wenshan (王文杉); Rong Lan (荣兰); Zhan Yongzhao (詹永照) (School of Computer Science and Telecommunications Engineering, Jiangsu University, Zhenjiang 212013, China; Jiangsu Province Big Data Ubiquitous Perception and Intelligent Agricultural Application Engineering Research Center, Zhenjiang 212013, China; Cyber Space Security Academy of Jiangsu University, Zhenjiang 212013, China; National Engineering Laboratory for Public Security Risk Perception and Control by Big Data, China Academy of Electronic Sciences, Beijing 100041, China)
Source: Journal of Image and Graphics (《中国图象图形学报》), CSCD, Peking University Core Journal, 2021, Issue 7, pp. 1681-1691 (11 pages)
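Funding: National Natural Science Foundation of China (61972183); Director's Fund of the National Engineering Laboratory for Public Security Risk Perception and Control by Big Data.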
Keywords: action recognition; interaction information; spatial-temporal modeling; graph convolution; sliced recurrent neural network (RNN)
