Funding: supported in part by the National Basic Research Program of China (No. 2012CB316400), the National Natural Science Foundation of China (Nos. 61472353 and 61572431), the China Knowledge Centre for Engineering Sciences and Technology, the Fundamental Research Funds for the Central Universities, and the 2015 Qianjiang Talents Program of Zhejiang Province; also supported in part by the US NSF (No. CCF1017828)
Abstract: In this paper, we propose an approach for generating rich, fine-grained textual descriptions of images. In particular, we use an LSTM-in-LSTM (long short-term memory) architecture, which consists of an inner LSTM and an outer LSTM. The inner LSTM effectively encodes the long-range implicit contextual interactions between visual cues (i.e., the spatially concurrent visual objects), while the outer LSTM captures the explicit multi-modal relationship between sentences and images (i.e., the correspondence between sentences and images). This architecture produces a long description by predicting one word at each time step, conditioned on the previously generated word, a hidden vector (via the outer LSTM), and a context vector of fine-grained visual cues (via the inner LSTM). Our model outperforms state-of-the-art methods on several benchmark datasets (Flickr8k, Flickr30k, MSCOCO) when used to generate long, rich, fine-grained descriptions of given images, in terms of four different metrics (BLEU, CIDEr, ROUGE-L, and METEOR).
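The decoding scheme described above can be sketched as follows. This is a minimal NumPy illustration, not the paper's implementation: the cue vectors, dimensions, and the exact way the context vector is fed into the outer LSTM (here, concatenated with the previous word embedding) are assumptions for illustration only.

```python
import numpy as np

def lstm_step(x, h, c, W, U, b):
    """One generic LSTM step; gates stacked as [input, forget, output, cell] rows."""
    z = W @ x + U @ h + b
    H = h.size
    i = 1.0 / (1.0 + np.exp(-z[:H]))        # input gate
    f = 1.0 / (1.0 + np.exp(-z[H:2*H]))     # forget gate
    o = 1.0 / (1.0 + np.exp(-z[2*H:3*H]))   # output gate
    g = np.tanh(z[3*H:])                    # candidate cell state
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new

rng = np.random.default_rng(0)
D, H, V = 8, 16, 50  # cue/word embedding dim, hidden size, vocab size (assumed)

def make_params(in_dim, hid):
    """Randomly initialized weights for a single LSTM (illustrative only)."""
    return (rng.standard_normal((4 * hid, in_dim)) * 0.1,
            rng.standard_normal((4 * hid, hid)) * 0.1,
            np.zeros(4 * hid))

inner = make_params(D, H)      # inner LSTM: runs over the visual cues
outer = make_params(D + H, H)  # outer LSTM: previous word + context vector

# Inner LSTM encodes the contextual interaction among visual cues
# (here, five made-up cue vectors stand in for detected objects).
cues = rng.standard_normal((5, D))
h_in, c_in = np.zeros(H), np.zeros(H)
for v in cues:
    h_in, c_in = lstm_step(v, h_in, c_in, *inner)
context = h_in  # context vector of fine-grained visual cues

# Outer LSTM predicts one word per time step, conditioned on the
# previous word embedding, its own hidden vector, and the context vector.
W_out = rng.standard_normal((V, H)) * 0.1  # hidden -> vocabulary logits
h, c = np.zeros(H), np.zeros(H)
prev_word = rng.standard_normal(D)  # embedding of the previous word (assumed)
generated = []
for t in range(3):
    h, c = lstm_step(np.concatenate([prev_word, context]), h, c, *outer)
    logits = W_out @ h
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    word_id = int(probs.argmax())   # greedy choice for the sketch
    generated.append(word_id)
    prev_word = rng.standard_normal(D)  # placeholder next-word embedding
```

With trained weights and a real word-embedding table, the greedy step would typically be replaced by beam search; the sketch only shows where the inner LSTM's context vector enters each outer-LSTM step.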