Journal Articles
2 articles found
1. Rethinking Global Context in Crowd Counting
Authors: Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, Luc Van Gool. 《Machine Intelligence Research》 (EI, CSCD), 2024, Issue 4, pp. 640-651 (12 pages)
Abstract: This paper investigates the role of global context for crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence to facilitate information exchange with the tokens corresponding to image patches throughout the transformer layers. Because transformers do not explicitly model the tried-and-true channel-wise interactions, we propose a token-attention module (TAM) to recalibrate encoded features through channel-wise attention informed by the context token. Beyond that, the context token is adopted to predict the total person count of the image through a regression-token module (RTM). Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the proposed context extraction techniques can significantly improve performance over the baselines.
Keywords: Crowd counting, vision transformer, global context, attention, density map
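The channel-wise recalibration the abstract attributes to the token-attention module (TAM) can be sketched in a few lines: a gating vector is derived from the context token and used to rescale each feature channel. This is a minimal illustration only; the linear projection `w`, bias `b`, and sigmoid gating are assumptions for the sketch, not the paper's exact design.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_attention(features, context_token, w, b):
    """Channel-wise recalibration informed by a context token (illustrative sketch).

    features:      (N, C) array of N patch tokens with C channels
    context_token: (C,) context token from the transformer encoder
    w, b:          hypothetical linear projection producing per-channel gates
    """
    gates = sigmoid(w @ context_token + b)  # (C,) gate in (0, 1) per channel
    return features * gates                 # broadcast gates over all tokens

# Usage: recalibrate 16 patch tokens of dimension 8
rng = np.random.default_rng(0)
feats = rng.normal(size=(16, 8))
ctx = rng.normal(size=(8,))
w, b = rng.normal(size=(8, 8)), np.zeros(8)
recalibrated = token_attention(feats, ctx, w, b)  # shape (16, 8)
```

The key point the sketch captures is that the gating depends only on the context token, so the same channel weighting is applied uniformly across all patch tokens.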
2. Vision Transformers with Hierarchical Attention
Authors: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, Luc Van Gool. 《Machine Intelligence Research》 (EI, CSCD), 2024, Issue 4, pp. 670-683 (14 pages)
Abstract: This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches, as commonly done, and each patch is viewed as a token. The proposed H-MHSA then learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models global dependencies over the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since attention is calculated only for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. HAT-Net thus provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
Keywords: Vision transformer, hierarchical attention, global attention, local attention, scene understanding
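The local-then-global scheme the abstract describes for H-MHSA can be sketched as follows: attention is first computed inside small groups of tokens, each group is then merged into one token for a cheap global attention pass, and the two results are aggregated. The group size, mean-pooling merge, and additive aggregation here are simplifying assumptions for illustration, not the paper's exact architecture (which also uses multiple heads and learned projections).

```python
import numpy as np

def attention(q, k, v):
    """Plain scaled dot-product attention (single head, no projections)."""
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v

def h_mhsa(tokens, group):
    """Hierarchical attention sketch: local attention, merge, global attention.

    tokens: (N, C) array, N divisible by `group`
    """
    n, c = tokens.shape
    # Step 1: local attention within each group of `group` tokens
    local = np.concatenate(
        [attention(g, g, g) for g in tokens.reshape(-1, group, c)]
    )
    # Step 2: merge each group into one token (mean pooling) and attend globally
    merged = local.reshape(-1, group, c).mean(axis=1)
    global_out = attention(merged, merged, merged)
    # Step 3: aggregate local and global features (broadcast merged tokens back)
    return local + np.repeat(global_out, group, axis=0)
```

The cost saving is visible in the shapes: each local attention is over `group` tokens and the global attention is over only `N / group` merged tokens, so no single attention matrix is N-by-N.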