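Abstract: Most existing vision-language pre-training methods focus on understanding tasks and are trained with BERT-like objectives (masked language modeling and image-text matching). Although they perform well on many understanding-oriented downstream tasks, such as visual question answering, image-text retrieval, and visual entailment, they lack the ability to generate. To address this problem, unified multimodal pre-training for vision-language understanding and generation (UniVL) is proposed. UniVL handles both understanding and generation tasks. It extends existing pre-training paradigms by using random masks together with causal masks, i.e., triangular masks that hide future tokens, so that the pre-trained model acquires autoregressive generation ability. Several vision-language understanding tasks are reformulated as text generation tasks, and a template-prompt-based method is used to fine-tune on the different downstream tasks. Experiments show that there is a trade-off between understanding and generation tasks when a single model is used, and that a feasible way to improve both is to use more data. The UniVL framework performs comparably to recent vision-language pre-training methods on both understanding and generation tasks. In addition, the experiments show that the template-prompt-based generative method is more effective and even outperforms discriminative methods in a few scenarios.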
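The two kinds of mask named in the abstract can be illustrated with a minimal sketch. This is not the UniVL implementation: the function names `causal_mask` and `random_mask`, the `[MASK]` placeholder string, and the `mask_ratio` parameter are assumptions chosen for illustration only.

```python
import random


def causal_mask(n):
    """Build an n x n lower-triangular mask.

    Row i has 1 in columns 0..i and 0 afterwards, so position i can
    attend to itself and earlier positions but not to future tokens.
    """
    return [[1 if j <= i else 0 for j in range(n)] for i in range(n)]


def random_mask(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Replace each token with mask_token independently with probability mask_ratio.

    mask_token and mask_ratio are hypothetical names, not taken from the paper.
    """
    return [mask_token if rng.random() < mask_ratio else t for t in tokens]


if __name__ == "__main__":
    for row in causal_mask(4):
        print(row)
    print(random_mask(["a", "dog", "runs", "fast"], 0.3, random.Random(0)))
```

The triangular mask is what allows left-to-right (autoregressive) decoding, while random masking corresponds to the BERT-style masked language modeling objective mentioned at the start of the abstract.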
Funding: Project supported by the National Natural Science Foundation of China (Grant Nos. 61925402 and 61851402), the Science and Technology Commission of Shanghai Municipality, China (Grant No. 19JC1416600), the National Key Research and Development Program of China (Grant No. 2017YFB0405600), and the Shanghai Education Development Foundation and Shanghai Municipal Education Commission Shuguang Program, China (Grant No. 18SG01).
Abstract: Given the computing demands of the Internet of things (IoT) and artificial intelligence (AI), the cost of moving data between the central processing unit (CPU) and memory is the key problem, and chips featuring flexible structural units, ultra-low power consumption, and massive parallelism will be needed. In-memory computing, a non-von Neumann architecture that fuses memory units and computing units, can eliminate the time and energy spent on data transfer while performing massively parallel computations. Prototype in-memory computing schemes adapted from different memory technologies have shown orders-of-magnitude improvements in computing efficiency, so that in-memory computing is regarded as the ultimate computing paradigm. Here we review the state-of-the-art memory device technologies with potential for in-memory computing, summarize their versatile applications in neural networks, stochastic generation, and hybrid-precision digital computing, which offer promising solutions for unprecedented computing tasks, and discuss the challenges of stability and integration for general in-memory computing.
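Abstract: Segment-level video semantic recognition aims to identify the semantic concepts of short segments within a video and is an important task in video analysis. Because video segments are extremely numerous and lack web labels that could serve as references, annotating them is very difficult, and usually only a portion of the segments can be labeled. How to use limited semantic labels to improve the accuracy of segment-level video semantic recognition is a key challenge. This paper therefore proposes a video semantic recognition algorithm based on long- and short-term prediction consistency. The algorithm introduces a consistency constraint between the semantics of the full video and the semantics of its segments and uses it to filter the segment-level recognition results, thereby improving recognition accuracy. On the segment-level semantic recognition task of the large-scale video dataset YouTube-8M, the proposed algorithm achieves a mean average precision (MAP) of 82.62% and ranked second in the 3rd YouTube-8M competition.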
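The abstract does not give the form of the consistency constraint. One way such a filter could look, purely as an assumed sketch, is to keep a segment-level label score only when the full-video prediction for the same label is also confident. The function name `filter_segment_predictions`, the dictionary representation, and the `video_threshold` value are all hypothetical and do not come from the paper.

```python
def filter_segment_predictions(segment_scores, video_scores, video_threshold=0.5):
    """Filter segment-level label scores using full-video label scores.

    A segment score for a label is kept only if the full-video score for
    that label reaches video_threshold; otherwise it is set to 0.0.
    This thresholding rule is an illustrative assumption, not the
    algorithm described in the paper.
    """
    return {
        label: (score if video_scores.get(label, 0.0) >= video_threshold else 0.0)
        for label, score in segment_scores.items()
    }


if __name__ == "__main__":
    segment = {"dog": 0.9, "car": 0.8}
    video = {"dog": 0.7, "car": 0.1}
    # "car" is suppressed because the full-video prediction does not support it.
    print(filter_segment_predictions(segment, video))
```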
Funding: Supported by a research grant from City University of Hong Kong under Grant No. 7008178 and the National Natural Science Foundation of China under Grant Nos. 61228205, 61303175 and 61172153.
Abstract: Associating faces appearing in Web videos with names presented in the surrounding context is an important task in many applications. However, the problem has not been well investigated, particularly in large-scale realistic scenarios, mainly because of the scarcity of datasets constructed under such circumstances. In this paper, we introduce a Web video dataset of celebrities, named WebV-Cele, for name-face association. The dataset consists of 75 073 Internet videos totaling over 4 000 hours, covering 2 427 celebrities and 649 001 faces. This is, to our knowledge, the most comprehensive dataset for this problem. We describe the details of dataset construction, discuss several interesting findings from analyzing the dataset, such as celebrity community discovery, and provide experimental results of name-face association using five existing techniques. We also outline important and challenging research problems that could be investigated in the future.