Funding: This work was supported by the National Natural Science Foundation of China (No. 61872231), the National Key Research and Development Program of China (No. 2021YFC2801000), and the Major Research Plan of the National Social Science Foundation of China (No. 20&ZD130).
Abstract: The performance of Video Question Answering (VQA) systems relies on capturing key information from both the visual images and the natural language in the context in order to generate relevant answers to questions. However, traditional linear combinations of multimodal features capture only shallow feature interactions and fall far short of the deep feature fusion that is needed. Attention mechanisms have been used to perform deep fusion, but most of them can only assign weights within single-modal information, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing mechanism (SAMB). Our model is designed to enhance complex feature interactions among multimodal features while balancing cross-modal information. In addition, TMCF and SAMB can be used as extensible plug-ins for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVDQA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation studies to verify the effectiveness of each proposed component.
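The abstract does not give the exact formulation of TMCF or SAMB, but the general pattern it describes (pairwise fusion of three modality features arranged in a cycle, followed by a balancing step across modalities) can be sketched minimally in NumPy. Everything below is hypothetical: the function names, the Hadamard-product fusion, and the softmax-over-norms gating are illustrative stand-ins for whatever learned operators the paper actually uses.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def triple_cyclic_fusion(v, q, m):
    # Fuse three modality vectors pairwise in a cycle (Hadamard products):
    # (video, question), (question, motion), (motion, video).
    return np.stack([v * q, q * m, m * v])          # shape (3, d)

def self_adaptive_balance(fused):
    # Weight each fused pair by a softmax over its L2 norm --
    # a crude stand-in for a learned cross-modal gating network.
    w = softmax(np.linalg.norm(fused, axis=1))      # shape (3,)
    return (w[:, None] * fused).sum(axis=0)         # shape (d,)

rng = np.random.default_rng(0)
v, q, m = rng.normal(size=(3, 8))                   # toy modality features
out = self_adaptive_balance(triple_cyclic_fusion(v, q, m))
print(out.shape)
```

The cyclic arrangement guarantees that every modality appears in exactly two of the three fused pairs, so no single modality can dominate the combined representation before the balancing step.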
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 61876130 and 61932009).
Abstract: Video question answering (Video QA) involves a thorough understanding of video content and question language, as well as grounding of the textual semantics in the visual content of videos. Thus, to answer questions more accurately, not only should each semantic entity be associated with a particular visual instance in the video frames, but the action or event in the question should also be localized to a corresponding temporal slot. This makes Video QA a more challenging task, one that requires reasoning over correlations between instances along temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded as graph nodes, which benefits the integration of intra- and inter-modality information. We then propose graph causal convolution (GCC) on graph-structured sequences with a large receptive field to capture more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which contains ground truth for instance grounding and temporal localization, as well as on three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method; in particular, it outperforms state-of-the-art methods on these benchmarks.
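The abstract describes graph causal convolution only at a high level. A minimal sketch of the core idea, assuming it combines per-frame spatial aggregation over a node graph with a dilated causal convolution along the temporal axis, could look like the following. The function name, the per-tap channel weights, and the row-normalized adjacency are all assumptions for illustration, not the paper's actual operator.

```python
import numpy as np

def graph_causal_conv(x, adj, kernel, dilation=1):
    """Causal temporal convolution over a sequence of graph node features.

    x       : (T, N, d) features for N graph nodes over T frames
    adj     : (N, N) row-normalized adjacency (spatial aggregation)
    kernel  : (K, d) per-tap channel weights (simplified learned filter)
    The output at frame t depends only on frames <= t (causality).
    """
    T, N, d = x.shape
    K = kernel.shape[0]
    out = np.zeros_like(x)
    for t in range(T):
        for k in range(K):
            tk = t - k * dilation        # causal, possibly dilated, tap
            if tk >= 0:
                # aggregate over graph neighbors, then apply the temporal tap
                out[t] += (adj @ x[tk]) * kernel[k]
    return out

# Toy example: 4 frames, 3 nodes in a chain graph, 2-dim features.
rng = np.random.default_rng(1)
x = rng.normal(size=(4, 3, 2))
adj = np.array([[0.5, 0.5, 0.0],
                [1/3, 1/3, 1/3],
                [0.0, 0.5, 0.5]])
kernel = rng.normal(size=(2, 2))
y = graph_causal_conv(x, adj, kernel)
```

Stacking such layers with growing dilation widens the temporal receptive field exponentially while keeping every connection causal, which matches the abstract's stated motivation of capturing long-range causal connections for instance-sequence reasoning.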
Abstract: Answering students' questions is an essential part of any learning activity, but the intervening network creates obstacles to this two-way communication. How to clear the path to resolving doubts and give learners satisfactory answers has therefore become an important concern in web-based education systems. The video question-answering approach proposed in this paper, based on Flash Media Server and Flex technology, provides an interactive communication platform for learners and teachers. It goes beyond the text, graphics, and non-interactive multimedia of traditional question-answering modules, meeting users' demand for more human-oriented content such as voice and video alongside conventional text and graphics, and enabling teachers to communicate with learners through multiple channels and resolve their questions in a timely manner.