
Captioning Videos Using Large-Scale Image Corpus (cited by 1)

Abstract: Video captioning is the task of assigning complex high-level semantic descriptions (e.g., sentences or paragraphs) to video data. Unlike previous video analysis techniques such as video annotation, video event detection, and action recognition, video captioning is much closer to human cognition, with a smaller semantic gap. However, the scarcity of captioned video data severely limits the development of video captioning. In this paper, we propose a novel video captioning approach that describes videos by leveraging a freely-available image corpus with abundant literal knowledge. There are two key aspects of our approach: 1) an effective integration strategy bridging videos and images, and 2) high efficiency in handling ever-increasing training data. To achieve these goals, we adopt sophisticated visual hashing techniques to efficiently index and search large-scale image collections for relevant captions, which offers high extensibility to evolving data and the corresponding semantics. Extensive experimental results on various real-world visual datasets show the effectiveness of our approach with different hashing techniques, e.g., LSH (locality-sensitive hashing), PCA-ITQ (principal component analysis iterative quantization), and supervised discrete hashing, compared with state-of-the-art methods. Notably, the empirical computational cost of our approach is much lower than that of an existing method: it takes 1/256 of the memory and 1/64 of the time of the method of Devlin et al.
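The abstract names LSH (locality-sensitive hashing) among the techniques used to index image features as compact binary codes for fast caption retrieval. As a rough illustration only, not the paper's actual pipeline, a random-hyperplane LSH over synthetic image descriptors might look like the sketch below (all names, dimensions, and the toy corpus are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "image corpus": each row stands in for an image feature vector
# (e.g., a CNN descriptor); each image carries one caption string.
corpus = rng.normal(size=(1000, 64))
captions = [f"caption_{i}" for i in range(1000)]

# Random-hyperplane LSH: project features onto random directions and keep
# the sign pattern as a compact binary code (here, 16 bits per image).
n_bits = 16
hyperplanes = rng.normal(size=(64, n_bits))

def lsh_code(x):
    """Binary hash code: 1 where the projection is positive, else 0."""
    return (x @ hyperplanes > 0).astype(np.uint8)

corpus_codes = lsh_code(corpus)

def search(query, k=5):
    """Captions of the k corpus images nearest in Hamming distance."""
    q = lsh_code(query)
    dists = np.count_nonzero(corpus_codes != q, axis=1)
    return [captions[i] for i in np.argsort(dists)[:k]]

# A slightly perturbed copy of corpus item 0 should retrieve caption_0
# among its nearest neighbours.
results = search(corpus[0] + 0.01 * rng.normal(size=64))
print(results)
```

Binary codes are what make the memory and time savings quoted in the abstract plausible: a 16-bit code replaces a 64-float descriptor, and Hamming distance is far cheaper than dense vector comparison.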
Source: Journal of Computer Science & Technology (SCIE, EI, CSCD), 2017, No. 3, pp. 480-493 (14 pages)
Funding: This work was partially supported by the National Basic Research 973 Program of China under Grant No. 2014CB347600, the National Natural Science Foundation of China under Grant Nos. 61522203, 61572108, 61632007, and 61502081, the National Ten-Thousand Talents Program of China (Young Top-Notch Talent), the National Thousand Young Talents Program of China, the Fundamental Research Funds for the Central Universities of China under Grant Nos. ZYGX2014Z007 and ZYGX2015J055, and the Natural Science Foundation of Jiangsu Province of China under Grant No. BK20140058.
Keywords: video captioning, hashing, image captioning