期刊文献+
共找到1篇文章
< 1 >
每页显示 20 50 100
Masked Vision-language Transformer in Fashion 被引量:1
1
作者 Ge-Peng Ji Mingchen Zhuge +3 位作者 dehong gao Deng-Ping Fan Christos Sakaridis Luc Van Gool 《Machine Intelligence Research》 EI CSCD 2023年第3期421-434,共14页
We present a masked vision-language transformer(MVLT)for fashion-specific multi-modal representation.Technically,we simply utilize the vision transformer architecture for replacing the bidirectional encoder representa... We present a masked vision-language transformer(MVLT)for fashion-specific multi-modal representation.Technically,we simply utilize the vision transformer architecture for replacing the bidirectional encoder representations from Transformers(BERT)in the pre-training model,making MVLT the first end-to-end framework for the fashion domain.Besides,we designed masked image reconstruction(MIR)for a fine-grained understanding of fashion.MVLT is an extensible and convenient architecture that admits raw multimodal inputs without extra pre-processing models(e.g.,ResNet),implicitly modeling the vision-language alignments.More importantly,MVLT can easily generalize to various matching and generative tasks.Experimental results show obvious improvements in retrieval(rank@5:17%)and recognition(accuracy:3%)tasks over the Fashion-Gen 2018 winner,Kaleido-BERT.The code is available at https://github.com/GewelsJI/MVLT. 展开更多
关键词 Vision-language masked image reconstruction TRANSFORMER FASHION e-commercial
原文传递
上一页 1 下一页 到第
使用帮助 返回顶部