Funding: This work was supported by the National Natural Science Foundation of China (No. 61872231), the National Key Research and Development Program of China (No. 2021YFC2801000), and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
Abstract: The performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural language in the context to generate relevant answers to questions. However, traditional linear combinations of multimodal features capture only shallow feature interactions and fall far short of the need for deep feature fusion. Attention mechanisms have been used to perform deep fusion, but most of them can only assign weights to single-modal information, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing Mechanism (SAMB). Our model is designed to enhance complex interactions among multimodal features while balancing cross-modal information. In addition, TMCF and SAMB can be used as an extensible plug-in for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation studies to verify the effectiveness of each proposed component.
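The abstract does not spell out how the cyclic fusion or the balancing mechanism is implemented, so the following PyTorch snippet is only a minimal sketch of the general idea: three modality features (appearance, motion, question) are fused through pairwise interactions, and learned softmax weights re-balance the fused terms. The class name, dimensions, and fusion order are all assumptions for illustration, not the authors' architecture.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CyclicFusionSketch(nn.Module):
        """Illustrative sketch: pairwise (cyclic) fusion of three modality
        features plus a learned balancing step. Shapes and the exact fusion
        scheme are assumptions; the abstract does not specify them."""
        def __init__(self, dim=512):
            super().__init__()
            # project appearance, motion, and question features to a shared space
            self.proj_v = nn.Linear(dim, dim)   # visual appearance
            self.proj_m = nn.Linear(dim, dim)   # motion
            self.proj_q = nn.Linear(dim, dim)   # question
            # scalar gates that re-balance the three interaction terms
            self.gate = nn.Linear(3 * dim, 3)

        def forward(self, v, m, q):
            v, m, q = self.proj_v(v), self.proj_m(m), self.proj_q(q)
            # pairwise interactions arranged cyclically: v-m, m-q, q-v
            vm, mq, qv = v * m, m * q, q * v
            fused = torch.cat([vm, mq, qv], dim=-1)
            # softmax weights keep the three interaction terms balanced
            w = F.softmax(self.gate(fused), dim=-1)
            return w[..., 0:1] * vm + w[..., 1:2] * mq + w[..., 2:3] * qv

    # toy usage: batch of 2, 512-d features per modality
    v, m, q = (torch.randn(2, 512) for _ in range(3))
    print(CyclicFusionSketch()(v, m, q).shape)  # torch.Size([2, 512])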
Funding: Supported by the National High Technology Research and Development Program of China (863 Program, 2015AA016306), the National Natural Science Foundation of China (61231015), the Internet of Things Development Funding Project of the Ministry of Industry in 2013 (25), the Technology Research Program of the Ministry of Public Security (2016JSYJA12), and the Natural Science Foundation of Hubei Province (2014CFB712).
Abstract: Action recognition is important for understanding human behaviors in video, and the video representation is the basis for action recognition. This paper presents a new video representation based on convolutional neural networks (CNNs). To capture human motion information in one CNN, we take both optical flow maps and gray images as input and combine multiple convolutional features by max pooling across frames. In another CNN, we input a single color frame to capture context information. Finally, we take the top fully connected layer vectors as the video representation and train classifiers with a linear support vector machine. Experimental results show that the representation integrating optical flow maps and gray images is more discriminative than those that depend on only one element. On the most challenging datasets, HMDB51 and UCF101, this video representation achieves competitive performance.
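The pipeline described above (per-frame CNN features max-pooled across frames, concatenated with a context feature from one color frame, then classified by a linear SVM) can be illustrated with the short Python sketch below. The CNN feature extraction itself is stood in for by random vectors, and the helper name, feature dimension, and sample counts are assumptions chosen only to make the example runnable.

    import numpy as np
    from sklearn.svm import LinearSVC

    def video_representation(motion_feats, context_feat):
        """motion_feats: (num_frames, d) per-frame features from the motion CNN
        (optical flow maps + gray images); context_feat: (d,) feature from the
        context CNN applied to a single color frame."""
        pooled = motion_feats.max(axis=0)              # max pooling across frames
        return np.concatenate([pooled, context_feat])  # final video descriptor

    rng = np.random.default_rng(0)
    # 20 toy videos, 16 frames each, 4096-d fully connected layer activations
    X = np.stack([video_representation(rng.normal(size=(16, 4096)),
                                       rng.normal(size=4096))
                  for _ in range(20)])
    y = rng.integers(0, 2, size=20)                    # toy action labels

    clf = LinearSVC(C=1.0)                             # linear support vector machine
    clf.fit(X, y)
    print(clf.score(X, y))

In a real setup the per-frame descriptors would come from the trained networks, and a one-vs-rest linear SVM would be fit per action class on the training split of HMDB51 or UCF101.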