Cross-modal video moment retrieval based on visual-textual relationship alignment
Abstract  In recent years, the growing abundance of video resources has created a series of demands for fine-grained retrieval of video moments, such as highlight moments in sports events and the re-creation of specific video content. In this context, research on cross-modal video moment retrieval, which aims to output the moment of a video that matches an input query text, has gradually emerged. Existing solutions focus primarily on global or local feature representations of the query text and the video moments, but they ignore the matching of the semantic relationships contained in them. For example, given the query text "a person is playing basketball", an existing retrieval system may incorrectly return a video moment of "a person holding a basketball" because it does not consider the semantic relationship "a person playing basketball"; unable to distinguish such relational differences, these systems are limited in retrieval quality. To address this problem, this paper proposes CrossGraphAlign, a graph-convolutional framework for cross-modal relationship alignment. The framework constructs a textual relationship graph and a visual relationship graph to model the semantic relationships in the query text and the video moment, and then evaluates the similarity between textual and visual relationships through cross-modally aligned graph convolutional networks, helping to build a more accurate video moment retrieval system. Experimental results on the publicly available cross-modal video moment retrieval datasets TACoS and ActivityNet Captions demonstrate that the proposed method can effectively exploit semantic relationships to improve recall in cross-modal video moment retrieval.
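To make the abstract's pipeline concrete, the sketch below illustrates the general idea (it is not the authors' implementation): each relationship graph is encoded with a small graph convolutional network into a shared space, and a candidate moment is scored by the similarity of the two graph embeddings. All module names, dimensions, the mean pooling, and the cosine scoring are illustrative assumptions.

```python
# Minimal sketch of cross-modal relationship-graph alignment with GCNs.
# Assumed setup: text nodes carry 300-d word vectors, visual nodes carry
# 2048-d region features; adjacency matrices encode semantic relations.
import torch
import torch.nn as nn
import torch.nn.functional as F


class GCNLayer(nn.Module):
    """One graph-convolution step: aggregate neighbors via the
    row-normalized adjacency matrix, then apply a linear projection."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # x: (num_nodes, in_dim), adj: (num_nodes, num_nodes) with self-loops
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return F.relu(self.proj((adj / deg) @ x))


class CrossGraphAlignSketch(nn.Module):
    """Encode a textual and a visual relationship graph and score alignment.

    Text nodes are query entities (e.g. "person", "basketball") linked by
    their semantic relations; visual nodes are detected objects in the
    candidate moment linked by visual relations.
    """

    def __init__(self, text_dim: int, vis_dim: int, hid_dim: int = 256):
        super().__init__()
        self.text_gcn = GCNLayer(text_dim, hid_dim)
        self.vis_gcn = GCNLayer(vis_dim, hid_dim)

    def forward(self, text_x, text_adj, vis_x, vis_adj) -> torch.Tensor:
        # Relation-aware node embeddings, pooled to graph-level vectors.
        t = self.text_gcn(text_x, text_adj).mean(dim=0)
        v = self.vis_gcn(vis_x, vis_adj).mean(dim=0)
        # Cosine similarity as the query-moment alignment score; candidate
        # moments would be ranked by this score at retrieval time.
        return F.cosine_similarity(t, v, dim=0)


if __name__ == "__main__":
    # Toy graphs: 3 text nodes, 5 visual nodes.
    text_x, vis_x = torch.randn(3, 300), torch.randn(5, 2048)
    text_adj = torch.eye(3) + torch.tensor(
        [[0, 1, 1], [1, 0, 0], [1, 0, 0]], dtype=torch.float)
    vis_adj = torch.eye(5)
    model = CrossGraphAlignSketch(text_dim=300, vis_dim=2048)
    print(model(text_x, text_adj, vis_x, vis_adj))  # scalar alignment score
```

In this sketch the two GCN branches project both modalities into one shared space so the relation structure, not just the entities, influences the score; the paper's cross-modal alignment between the two graphs is more involved than the simple pooled cosine similarity used here.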
Authors  Joya CHEN, Hao DU, Yufei WU, Tong XU, Enhong CHEN (Anhui Province Key Laboratory of Big Data Analysis and Application, University of Science and Technology of China, Hefei 230027, China)
Source  Scientia Sinica Informationis, 2020, No. 6, pp. 862-876 (15 pages). Indexed in CSCD and the Peking University Core Journals list.
Funding  Supported by the National Key R&D Program of China (Grant No. 2018YFB1004300) and the National Natural Science Foundation of China (Grant Nos. 61703386, U1605251).
Keywords  relationship alignment; textual relationship; visual relationship; graph convolutional network; cross-modal video moment retrieval