Journal Articles — 2 articles found
1. Instance-sequence reasoning for video question answering
Authors: Rui LIU, Yahong HAN. Frontiers of Computer Science (SCIE, EI, CSCD), 2022, No. 6, pp. 93-101 (9 pages).
Video question answering (Video QA) involves a thorough understanding of video content and question language, as well as grounding of textual semantics to the visual content of videos. Thus, to answer questions more accurately, not only should each semantic entity be associated with a certain visual instance in the video frames, but the action or event in the question should also be localized to a corresponding temporal slot. This makes Video QA a more challenging task, one that requires reasoning over correlations between instances along temporal frames. In this paper, we propose an instance-sequence reasoning network for video question answering with instance grounding and temporal localization. In our model, both visual instances and textual representations are first embedded into graph nodes, which benefits the integration of intra- and inter-modality information. Then, we propose graph causal convolution (GCC) on the graph-structured sequence with a large receptive field to capture more causal connections, which is vital for visual grounding and instance-sequence reasoning. Finally, we evaluate our model on the TVQA+ dataset, which contains ground truth for instance grounding and temporal localization, as well as on three other Video QA datasets and three multimodal language processing datasets. Extensive experiments demonstrate the effectiveness and generalization of the proposed method; specifically, it outperforms state-of-the-art methods on these benchmarks.
Keywords: video question answering; instance grounding; graph causal convolution
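The abstract names graph causal convolution (GCC) but does not specify the operator. As a rough illustration of the general idea only (mixing features over graph neighbors, then filtering causally over time so no future frame leaks into the output), here is a minimal NumPy sketch; the `adj` matrix, the `kernel`, and the plain masked temporal convolution are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def graph_causal_conv(node_feats, adj, kernel):
    """Sketch of a 'graph causal convolution' (hypothetical reading):
    at each time step, features are first aggregated over graph neighbors
    via a row-normalized adjacency, then combined causally over the K
    most recent steps so no future frame influences the output.

    node_feats: (T, N, D) -- T time steps, N graph nodes, D feature dims
    adj:        (N, N)    -- row-normalized adjacency (intra/inter-modality edges)
    kernel:     (K,)      -- temporal weights over the K most recent steps
    """
    T, _, _ = node_feats.shape
    K = len(kernel)
    # Neighbor aggregation: mix node features along graph edges at each step.
    mixed = np.einsum('ij,tjd->tid', adj, node_feats)
    out = np.zeros_like(mixed)
    for t in range(T):
        for k in range(K):
            if t - k >= 0:  # causal: only current and past frames contribute
                out[t] += kernel[k] * mixed[t - k]
    return out
```

The causality constraint is what lets such an operator support temporal localization: the response at frame t is a function of frames ≤ t only, so perturbing a later frame leaves earlier outputs unchanged.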
2. Triple Multimodal Cyclic Fusion and Self-Adaptive Balancing for Video Q&A Systems
Authors: Xiliang Zhang, Jin Liu, Yue Li, Zhongdai Wu, Y. Ken Wang. Computers, Materials & Continua (SCIE, EI), 2022, No. 12, pp. 6407-6424 (18 pages).
The performance of Video Question and Answer (VQA) systems relies on capturing key information from both visual images and natural language in context to generate answers to the relevant questions. However, traditional linear combinations of multimodal features capture only shallow feature interactions, falling far short of the need for deep feature fusion. Attention mechanisms have been used to perform deep fusion, but most can only assign weights to single-modal information, leading to attention imbalance across modalities. To address these problems, we propose a novel VQA model based on Triple Multimodal feature Cyclic Fusion (TMCF) and a Self-Adaptive Multimodal Balancing mechanism (SAMB). Our model is designed to enhance complex feature interactions among multimodal features with cross-modal information balancing. In addition, TMCF and SAMB can be used as an extensible plug-in for exploring new feature combinations in the visual image domain. Extensive experiments were conducted on the MSVD-QA and MSRVTT-QA datasets. The results confirm the advantages of our approach in handling multimodal tasks. We also provide ablation studies to verify the effectiveness of each proposed component.
Keywords: video question and answer systems; feature fusion; scaling matrix; attention mechanism
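The abstract describes cyclic fusion of three modality feature sets plus a balancing step, but gives no operator definitions. A minimal NumPy sketch of one plausible reading follows: each modality gates the next in the cycle so all pairwise interactions occur (rather than one linear mix), and a softmax over mean activations rebalances the fused modalities. Both operators here are assumptions for illustration, not the paper's TMCF/SAMB.

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over a 1-D array."""
    e = np.exp(x - x.max())
    return e / e.sum()

def cyclic_fusion(a, b, c):
    """Gate each modality vector by the next one in the cycle a->b->c->a,
    covering all three pairwise interactions without a single linear mix.
    (Hypothetical stand-in for the paper's triple cyclic fusion.)"""
    fa = a * np.tanh(b)
    fb = b * np.tanh(c)
    fc = c * np.tanh(a)
    return fa, fb, fc

def self_adaptive_balance(feats):
    """Weight each fused modality by a softmax over its mean activation,
    so no single modality dominates the combined representation.
    (Hypothetical stand-in for the paper's balancing mechanism.)"""
    scores = np.array([f.mean() for f in feats])
    weights = softmax(scores)
    return sum(w * f for w, f in zip(weights, feats))
```

With identical inputs the softmax yields uniform weights, so the balanced output reduces to the shared feature vector; with imbalanced activations the weights shift accordingly, which is the behavior the abstract attributes to its balancing step.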