
Audio-Visual Correlated Multimodal Concept Detection (Cited by: 1)
Abstract: With the wide dissemination of online video sharing applications, a massive number of videos is generated online every day. Faced with this volume of video, users require increasingly fine-grained retrieval services. How to organize and manage videos on the Internet by appropriate semantic concepts, so that users can retrieve the videos they need more efficiently and accurately, has become one of the most challenging topics in video analysis. In many scenarios, audio and visual information must appear together to identify a video event. Therefore, this paper proposes a multimodal concept detection task based on audio-visual information. First, a multimodal concept is defined as a noun-verb pair, in which the noun represents visual information and the verb represents audio information; the noun and verb are semantically correlated and together describe the event expressed by the concept. Second, this paper performs end-to-end multimodal concept detection using convolutional neural networks; specifically, audio-visual correlation is used as the objective to train a joint learning network. Experimental results show that on the multimodal concept detection task, the joint network trained via audio-visual correlation outperforms a standalone visual or audio network. Third, the joint network learns fine-grained feature representations: on the Huawei video concept detection task, visual features extracted from the joint network outperform features from an ImageNet pre-trained network on some specific concepts, and on the ESC-50 audio classification task, acoustic features from the joint network exceed those from a VGG network pre-trained on Youtube8m by about 5.7%.
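The abstract does not give the exact formulation of the audio-visual correlation objective. As a rough illustration only, the sketch below (hypothetical function names, NumPy standing in for a deep-learning framework, and random vectors standing in for CNN embeddings) shows one common way such a correlation signal can be turned into a binary training loss: score a visual embedding against an audio embedding by cosine similarity, then apply a cross-entropy loss on whether the pair comes from the same event.

```python
import numpy as np

def cosine_similarity(v, a):
    """Cosine similarity between a visual and an audio embedding."""
    return float(np.dot(v, a) / (np.linalg.norm(v) * np.linalg.norm(a) + 1e-8))

def correspondence_loss(v, a, label):
    """Hypothetical audio-visual correspondence loss:
    label=1 if the audio and visual clips belong to the same event,
    label=0 otherwise. The cosine score is squashed through a sigmoid
    and penalized with binary cross-entropy."""
    p = 1.0 / (1.0 + np.exp(-cosine_similarity(v, a)))  # sigmoid
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

# Toy check: a matched audio-visual pair should score higher (and,
# for label=1, incur lower loss) than a mismatched pair.
rng = np.random.default_rng(0)
v = rng.normal(size=128)                    # stand-in visual embedding
a_matched = v + 0.1 * rng.normal(size=128)  # correlated audio embedding
a_mismatch = rng.normal(size=128)           # unrelated audio embedding
assert cosine_similarity(v, a_matched) > cosine_similarity(v, a_mismatch)
```

In the paper's actual setting, `v` and `a` would be produced by the visual and audio branches of the joint network and the loss back-propagated through both, but those architectural details are not specified in this abstract.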
Authors: Dian Yujie, Jin Qin (School of Information, Renmin University of China, Beijing 100872)
Published in: Journal of Computer Research and Development (《计算机研究与发展》), 2019, Issue 5, pp. 1071-1081 (11 pages). Indexed by EI, CSCD, and the Peking University Core Journals list.
Funding: National Natural Science Foundation of China (61772535); National Key Research and Development Program of China (2016YFB1001202)
Keywords: multimodal information; semantic concepts; video concept detection; video representation; video semantic understanding