Abstract
Multimodal sentence summarization (MMSS) is a new yet challenging task that aims to generate a concise summary of a long sentence and its corresponding image. Although existing methods have achieved promising results on MMSS, they overlook the powerful generation ability of generative pre-trained language models (GPLMs), which have been shown to be effective in many text generation tasks. To fill this research gap, we propose to use GPLMs to promote the performance of MMSS. Notably, adopting GPLMs to solve MMSS inevitably faces two challenges: 1) What fusion strategy should we use to properly inject visual information into GPLMs? 2) How can we keep the GPLM's generation ability intact to the utmost extent when the visual feature is injected into the GPLM? To address these two challenges, we propose a vision-enhanced generative pre-trained language model for MMSS, dubbed Vision-GPLM. In Vision-GPLM, we obtain features of the visual and textual modalities with two separate encoders and utilize a text decoder to produce the summary. In particular, we use multi-head attention to fuse the features extracted from the visual and textual modalities, thereby injecting the visual features into the GPLM. Meanwhile, we train Vision-GPLM in two stages: a vision-oriented pre-training stage and a fine-tuning stage. In the vision-oriented pre-training stage, we train only the visual encoder with the masked language model task while the other components are frozen, aiming to obtain homogeneous representations of text and image. In the fine-tuning stage, we train all components of Vision-GPLM on the MMSS task. Extensive experiments on a public MMSS dataset verify the superiority of our model over existing baselines.
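To make the described architecture and training scheme more concrete, the following is a minimal PyTorch sketch (not the authors' released code) of the two ideas in the abstract: injecting visual features into a GPLM through multi-head attention, and the two-stage schedule that first updates only the visual encoder and then fine-tunes everything. The class name VisionGPLMSketch, the stand-in Transformer encoder/decoder (a pre-trained BART-style GPLM would be used in practice), and the hidden size, vocabulary size, and region-feature dimension are illustrative assumptions.

```python
import torch
import torch.nn as nn


class VisionGPLMSketch(nn.Module):
    """Two encoders (text + vision) and one text decoder; visual features are
    fused into the textual representation via multi-head attention."""

    def __init__(self, vocab_size=50265, d_model=768, n_heads=8, vis_feat_dim=2048):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # Stand-ins for the pre-trained GPLM encoder/decoder (assumption: a
        # BART-style model would replace these in a real implementation).
        self.text_encoder = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        self.text_decoder = nn.TransformerDecoder(
            nn.TransformerDecoderLayer(d_model, n_heads, batch_first=True), num_layers=2)
        # Visual encoder: projects pre-extracted image features into the GPLM space.
        self.visual_encoder = nn.Linear(vis_feat_dim, d_model)
        # Multi-head attention that fuses textual queries with visual keys/values.
        self.fusion = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, src_ids, image_feats, tgt_ids):
        txt = self.text_encoder(self.embed(src_ids))            # (B, L_t, d)
        vis = self.visual_encoder(image_feats)                   # (B, L_v, d)
        fused, _ = self.fusion(query=txt, key=vis, value=vis)    # text attends to image
        memory = txt + fused                                      # residual fusion of modalities
        out = self.text_decoder(self.embed(tgt_ids), memory)
        return self.lm_head(out)                                  # summary token logits


def set_trainable(model: VisionGPLMSketch, stage: str) -> None:
    """'pretrain': update only the visual encoder (MLM-style objective) while the
    rest is frozen; 'finetune': update every component on the MMSS task."""
    for p in model.parameters():
        p.requires_grad = (stage == "finetune")
    if stage == "pretrain":
        for p in model.visual_encoder.parameters():
            p.requires_grad = True


# Usage with dummy inputs: one forward pass in the pre-training configuration.
model = VisionGPLMSketch()
set_trainable(model, "pretrain")
src = torch.randint(0, 50265, (2, 20))   # tokenized source sentences
img = torch.randn(2, 36, 2048)           # e.g. 36 region features per image (assumption)
tgt = torch.randint(0, 50265, (2, 10))   # decoder inputs (summary prefix)
logits = model(src, img, tgt)            # shape: (2, 10, 50265)
```

The residual addition of the fused features onto the textual representation is one plausible way to keep the GPLM's original generation ability largely intact while still exposing it to visual information, in line with the second challenge raised above; the actual fusion placement in Vision-GPLM may differ.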