With more multi-modal data available for visual classification tasks, human action recognition has become an increasingly attractive topic. However, one of the main challenges is to effectively extract complementary features from different modalities for action recognition. In this work, a novel multimodal supervised learning framework based on convolutional neural networks (ConvNets) is proposed to facilitate the extraction of compensation features from different modalities for human action recognition. Built on an information aggregation mechanism and deep ConvNets, our recognition framework represents spatial-temporal information from the base modalities with a designed frame difference aggregation spatial-temporal module (FDA-STM), and bridges information from skeleton data through a multimodal supervised compensation block (SCB) to supervise the extraction of compensation features. We evaluate the proposed framework on three human action datasets: NTU RGB+D 60, NTU RGB+D 120, and PKU-MMD. The results demonstrate that our model with FDA-STM and SCB achieves state-of-the-art recognition performance on all three benchmarks.
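The abstract does not specify the internals of the FDA-STM module, but the underlying idea of frame difference aggregation is standard: motion cues are obtained by differencing consecutive frames and pooling the differences over time. The sketch below is a minimal, generic illustration of that idea in NumPy (the function name and the mean-pooling choice are assumptions, not the authors' design):

```python
import numpy as np

def frame_difference_aggregation(frames: np.ndarray) -> np.ndarray:
    """Aggregate temporal motion cues from a video clip.

    frames: array of shape (T, H, W, C) holding T consecutive frames.
    Returns a single (H, W, C) motion map obtained by averaging the
    absolute differences between consecutive frames over time.
    """
    # Pairwise differences between consecutive frames: shape (T-1, H, W, C)
    diffs = np.abs(np.diff(frames.astype(np.float32), axis=0))
    # Pool over the temporal axis into one spatial motion representation
    return diffs.mean(axis=0)

# Toy example: a clip of 4 random 8x8 RGB frames
video = np.random.rand(4, 8, 8, 3).astype(np.float32)
motion_map = frame_difference_aggregation(video)
print(motion_map.shape)  # (8, 8, 3)
```

A static clip (identical frames) yields an all-zero motion map, which is why such difference maps highlight moving regions and suppress static background before being fed to a spatial-temporal ConvNet.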
Funding: This work was supported by the Natural Science Foundation of Guangdong Province (Grant Nos. 2022A1515140119 and 2023A1515011307), the National Key Laboratory of Air-based Information Perception and Fusion and the Aeronautic Science Foundation of China (Grant No. 20220001068001), the Dongguan Science and Technology Special Commissioner Project (Grant No. 20221800500362), and the National Natural Science Foundation of China (Grant Nos. 62376261, 61972090, and U21A20487).