We propose a collaborative learning method to solve the natural image captioning problem. Numerous existing methods use pretrained image classification CNNs to obtain feature representations for image caption generation, which ignores the gap in image feature representations between different computer vision tasks. To address this problem, our method exploits the similarity between the image captioning and pix2pix image inversion tasks to narrow the feature representation gap. Specifically, our framework consists of two modules: 1) the pix2pix module (P2PM), which has a shared feature extractor to extract feature representations and a U-Net architecture that encodes the image into a latent code and then decodes it back to the original image; and 2) the natural language generation module (NLGM), which generates descriptions from the feature representations extracted by the P2PM. Consequently, both the feature representations and the generated captions improve during the collaborative learning process. Experimental results on the MSCOCO 2017 dataset demonstrate the effectiveness of our approach compared with other methods.
Funding: This work was supported by grant no. 61862050 from the National Natural Science Foundation of China and grant no. 2020AAC03031 from the Natural Science Foundation of Ningxia, China.
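To make the two-module design concrete, below is a minimal PyTorch sketch of the collaborative setup, not the authors' implementation: the class names (SharedFeatureExtractor, P2PM, NLGM), layer sizes, and the simple summed loss are all assumptions for illustration, and the P2PM decoder here is a plain encoder-decoder standing in for the paper's U-Net (skip connections omitted for brevity). The key idea it shows is that the reconstruction loss and the caption loss both backpropagate into one shared extractor.

```python
import torch
import torch.nn as nn

class SharedFeatureExtractor(nn.Module):
    """Shared CNN backbone feeding both the pix2pix and caption branches."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(128, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
        )
    def forward(self, x):
        return self.net(x)  # (B, feat_dim, H/8, W/8)

class P2PM(nn.Module):
    """Pix2pix branch: encode features to a latent code, decode to the image."""
    def __init__(self, feat_dim=256, latent_dim=512):
        super().__init__()
        self.enc = nn.Conv2d(feat_dim, latent_dim, 4, stride=2, padding=1)
        self.dec = nn.Sequential(  # each transposed conv doubles spatial size
            nn.ConvTranspose2d(latent_dim, feat_dim, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(feat_dim, 128, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, feats):
        return self.dec(self.enc(feats))

class NLGM(nn.Module):
    """Caption branch: LSTM decoder conditioned on pooled shared features."""
    def __init__(self, feat_dim=256, vocab_size=10000, hidden=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.init_h = nn.Linear(feat_dim, hidden)
        self.embed = nn.Embedding(vocab_size, hidden)
        self.lstm = nn.LSTM(hidden, hidden, batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)
    def forward(self, feats, tokens):
        g = self.pool(feats).flatten(1)              # (B, feat_dim)
        h0 = self.init_h(g).unsqueeze(0)             # initial hidden state
        out, _ = self.lstm(self.embed(tokens), (h0, torch.zeros_like(h0)))
        return self.out(out)                         # (B, T, vocab_size)

# Collaborative training step: both task losses update the shared extractor.
extractor, p2pm, nlgm = SharedFeatureExtractor(), P2PM(), NLGM()
opt = torch.optim.Adam(
    [*extractor.parameters(), *p2pm.parameters(), *nlgm.parameters()], lr=1e-4)

images = torch.randn(2, 3, 128, 128)           # dummy batch for illustration
captions = torch.randint(0, 10000, (2, 12))    # dummy caption token ids

feats = extractor(images)
recon_loss = nn.functional.l1_loss(p2pm(feats), images)
logits = nlgm(feats, captions[:, :-1])         # teacher forcing: shift by one
cap_loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), captions[:, 1:].reshape(-1))
loss = cap_loss + recon_loss                   # joint objective couples the tasks
opt.zero_grad(); loss.backward(); opt.step()
```

In this sketch the coupling between the tasks comes entirely from the summed objective: the reconstruction term pushes the shared features to retain pixel-level detail, while the caption term pushes them toward semantics, which is the gap-narrowing effect the abstract attributes to collaborative learning.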