
Knowledge Enhanced Pre-Training Model for Vision-Language-Navigation Task (Cited: 1)

Abstract: Vision-Language-Navigation (VLN) is a cross-modal task that combines natural language processing and computer vision. It requires an agent to move to a destination automatically, following a natural language instruction and the visual information observed along the way. To make the best decision at each step of navigation, the agent should pay particular attention to understanding the objects, their attributes, and the relationships between them; however, most current methods treat all received textual and visual information equally. This paper therefore integrates more detailed semantic connections between visual and textual information through three pre-training tasks: object prediction, object-attribute prediction, and object-relationship prediction. The model learns a better fused representation of, and alignment between, the two modalities, improving the success rate (SR) and generalization. Experiments show that, compared with previous baseline models, SR on the unseen validation set (Val Unseen) increases by 7% and success weighted by path length (SPL) increases by 7%; on the test set (Test), SR increases by 4% and SPL by 3%.
Source: Wuhan University Journal of Natural Sciences (武汉大学学报(自然科学英文版); indexed in CAS, CSCD), 2021, No. 2, pp. 147-155 (9 pages).
Funding: Supported by the National Natural Science Foundation of China (62006150), the Songjiang District Science and Technology Research Project (19SJKJGG83), and the Shanghai Young Science and Technology Talents Sailing Program (19YF1418400).
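
The three pre-training tasks named in the abstract lend themselves to a simple multi-head formulation over the output of a cross-modal fusion encoder. The following is a minimal sketch of that idea, not the paper's released code: all class names, dimensions, and label-set sizes below are hypothetical assumptions for illustration.

```python
# Hypothetical sketch of the three knowledge-oriented pre-training heads
# described in the abstract: object prediction, object-attribute
# prediction, and object-relationship prediction. Names and sizes are
# assumptions, not taken from the paper.
import torch
import torch.nn as nn

class KnowledgePretrainingHeads(nn.Module):
    def __init__(self, hidden_dim=768, num_objects=1600,
                 num_attributes=400, num_relations=50):
        super().__init__()
        # Object and attribute heads classify each fused
        # vision-language token embedding independently.
        self.object_head = nn.Linear(hidden_dim, num_objects)
        self.attribute_head = nn.Linear(hidden_dim, num_attributes)
        # Relationship prediction scores a pair of object embeddings,
        # so its input is the concatenation of the two.
        self.relation_head = nn.Linear(2 * hidden_dim, num_relations)

    def forward(self, fused_tokens, pair_index):
        """fused_tokens: (batch, num_regions, hidden_dim) embeddings
        from a cross-modal encoder; pair_index: (batch, num_pairs, 2)
        indices of object pairs for relationship prediction."""
        obj_logits = self.object_head(fused_tokens)
        attr_logits = self.attribute_head(fused_tokens)
        # Gather subject/object embeddings for each indexed pair.
        b, p, _ = pair_index.shape
        batch_idx = torch.arange(b).unsqueeze(1).expand(b, p)
        subj = fused_tokens[batch_idx, pair_index[..., 0]]
        obj = fused_tokens[batch_idx, pair_index[..., 1]]
        rel_logits = self.relation_head(torch.cat([subj, obj], dim=-1))
        return obj_logits, attr_logits, rel_logits
```

In such a setup, the three heads would typically be trained jointly with cross-entropy losses against masked-region labels, mirroring standard masked-prediction pre-training; the abstract does not specify the paper's exact loss weighting.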