Journal Articles
18 articles found
1. On-device audio-visual multi-person wake word spotting
Authors: Yidi Li, Guoquan Wang, Zhan Chen, Hao Tang, Hong Liu. CAAI Transactions on Intelligence Technology (SCIE, EI), 2023, Issue 4, pp. 1578-1589 (12 pages).
Audio-visual wake word spotting is a challenging multi-modal task that exploits visual information of lip motion patterns to supplement acoustic speech to improve overall detection performance. However, most audio-visual wake word spotting models are only suitable for simple single-speaker scenarios and require high computational complexity. Further development is hindered by complex multi-person scenarios and computational limitations in mobile environments. In this paper, a novel audio-visual model is proposed for on-device multi-person wake word spotting. Firstly, an attention-based audio-visual voice activity detection module is presented, which generates an attention score matrix of audio and visual representations to derive active speaker representation. Secondly, the knowledge distillation method is introduced to transfer knowledge from the large model to the on-device model to control the size of our model. Moreover, a new audio-visual dataset, PKU-KWS, is collected for sentence-level multi-person wake word spotting. Experimental results on the PKU-KWS dataset show that this approach outperforms the previous state-of-the-art methods.
Keywords: audio-visual fusion, human-computer interfacing, speech processing
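To make the distillation step above concrete, the following is a minimal PyTorch sketch of Hinton-style logit distillation for a compact on-device student; the two-class setup, temperature, and loss weighting are illustrative assumptions rather than the authors' training recipe.

    import torch
    import torch.nn.functional as F

    def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
        # Hinton-style logit distillation: KL between temperature-softened teacher and
        # student distributions, blended with ordinary cross-entropy on hard labels.
        # T and alpha are generic assumptions, not values from the paper.
        soft = F.kl_div(
            F.log_softmax(student_logits / T, dim=-1),
            F.softmax(teacher_logits / T, dim=-1),
            reduction="batchmean",
        ) * (T * T)
        hard = F.cross_entropy(student_logits, labels)
        return alpha * soft + (1.0 - alpha) * hard

    # Dummy batch: 8 utterances, 2 classes (wake word present / absent) -- hypothetical shapes.
    student_out = torch.randn(8, 2, requires_grad=True)   # small on-device model
    teacher_out = torch.randn(8, 2)                       # large audio-visual model
    loss = distillation_loss(student_out, teacher_out, torch.randint(0, 2, (8,)))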
2. A New Method to Extract Text from Natural Scenes
Authors: 郝峻晟, 戚飞虎, 朱凯华, 蒋人杰. Journal of Donghua University (English Edition) (EI, CAS), 2005, Issue 4, pp. 52-57 (6 pages).
This paper presents a new method for text detection, location and binarization from natural scenes. Several morphological steps are used to detect the general position of the text, including English, Chinese and Japanese characters. Next, bounding boxes are processed by a new “Expand, Break and Merge” (EBM) method to get the precise text areas. Finally, text is binarized by a hybrid method based on Otsu and Niblack. This new approach can extract different kinds of text from complicated natural scenes. It is insensitive to noise, distortion, and text orientation, and performs well on text of various sizes.
Keywords: machine translation, text extraction, mathematical morphology, bounding box, binarization
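The hybrid Otsu/Niblack binarization step lends itself to a short sketch; the combination rule, window size, and k below are assumptions, with scikit-image's threshold_otsu and threshold_niblack standing in for the paper's own implementation.

    import numpy as np
    from skimage.filters import threshold_otsu, threshold_niblack

    def hybrid_binarize(gray, window=25, k=0.2):
        # One plausible Otsu/Niblack hybrid: Niblack's local threshold clamped from below by
        # half of the global Otsu level, so flat background is not amplified into noise.
        # The combination rule, window size, and k are assumptions, not the paper's exact method.
        t_global = threshold_otsu(gray)
        t_local = threshold_niblack(gray, window_size=window, k=k)
        return gray > np.maximum(t_local, 0.5 * t_global)

    # Example on a synthetic grayscale image with values in [0, 1].
    rng = np.random.default_rng(0)
    binary = hybrid_binarize(rng.random((240, 320)))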
3. Masked Vision-language Transformer in Fashion
Authors: Ge-Peng Ji, Mingchen Zhuge, Dehong Gao, Deng-Ping Fan, Christos Sakaridis, Luc Van Gool. Machine Intelligence Research (EI, CSCD), 2023, Issue 3, pp. 421-434 (14 pages).
We present a masked vision-language transformer (MVLT) for fashion-specific multi-modal representation. Technically, we simply utilize the vision transformer architecture to replace the bidirectional encoder representations from Transformers (BERT) in the pre-training model, making MVLT the first end-to-end framework for the fashion domain. Besides, we design masked image reconstruction (MIR) for a fine-grained understanding of fashion. MVLT is an extensible and convenient architecture that admits raw multi-modal inputs without extra pre-processing models (e.g., ResNet), implicitly modeling the vision-language alignments. More importantly, MVLT can easily generalize to various matching and generative tasks. Experimental results show obvious improvements in retrieval (rank@5: 17%) and recognition (accuracy: 3%) tasks over the Fashion-Gen 2018 winner, Kaleido-BERT. The code is available at https://github.com/GewelsJI/MVLT.
Keywords: vision-language, masked image reconstruction, transformer, fashion, e-commerce
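For readers curious what a masked image reconstruction objective can look like in code, here is a toy sketch in the spirit of MIR; the masking ratio, L2 loss, and tensor shapes are assumptions and not MVLT's actual pre-training objective.

    import torch
    import torch.nn.functional as F

    def mir_loss(pred_patches, target_patches, mask_ratio=0.25):
        # Toy masked-image-reconstruction objective: hide a random subset of patches and
        # regress their pixel content. Mask ratio and the L2 loss are assumptions; the
        # actual MVLT objective may differ.
        B, N, D = target_patches.shape
        masked = torch.rand(B, N, device=target_patches.device) < mask_ratio
        return F.mse_loss(pred_patches[masked], target_patches[masked])

    # Dummy decoder outputs vs. ground-truth patch pixels for a 196-patch image (assumed shapes).
    loss = mir_loss(torch.randn(4, 196, 768), torch.randn(4, 196, 768))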
4. Power-line Modem over the Low Voltage Grid Using Direct Sequence Spread Spectrum Techniques
Authors: A. Mahgoub, H. Abdeel-Baky, E. El-Badawy. Journal of Energy and Power Engineering, 2011, Issue 6, pp. 562-568 (7 pages).
Keywords: power-line modem, direct sequence spread spectrum technique, low-voltage grid, power-line network, communication network, developing countries, design, network transmission
5. Acquiring Weak Annotations for Tumor Localization in Temporal and Volumetric Data
Authors: Yu-Cheng Chou, Bowen Li, Deng-Ping Fan, Alan Yuille, Zongwei Zhou. Machine Intelligence Research (EI, CSCD), 2024, Issue 2, pp. 318-330 (13 pages).
Creating large-scale and well-annotated datasets to train AI algorithms is crucial for automated tumor detection and localization. However, with limited resources, it is challenging to determine the best type of annotations when annotating massive amounts of unlabeled data. To address this issue, we focus on polyps in colonoscopy videos and pancreatic tumors in abdominal CT scans; both applications require significant effort and time for pixel-wise annotation due to the high-dimensional nature of the data, involving either temporal or spatial dimensions. In this paper, we develop a new annotation strategy, termed Drag&Drop, which simplifies the annotation process to drag and drop. This annotation strategy is more efficient, particularly for temporal and volumetric imaging, than other types of weak annotations, such as per-pixel, bounding boxes, scribbles, ellipses, and points. Furthermore, to exploit our Drag&Drop annotations, we develop a novel weakly supervised learning method based on the watershed algorithm. Experimental results show that our method achieves better detection and localization performance than alternative weak annotations and, more importantly, achieves similar performance to that trained on detailed per-pixel annotations. Interestingly, we find that, with limited resources, allocating weak annotations from a diverse patient population can foster models more robust to unseen images than allocating per-pixel annotations for a small set of images. In summary, this research proposes an efficient annotation strategy for tumor detection and localization that is less accurate than per-pixel annotations but useful for creating large-scale datasets for screening tumors in various medical modalities.
Keywords: weak annotation, detection, localization, segmentation, colonoscopy, abdomen
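The watershed-based weak supervision idea can be illustrated with a small scikit-image sketch that grows a region from a single weak seed; the seeding, gradient image, and synthetic input below are assumptions, not the authors' Drag&Drop algorithm.

    import numpy as np
    from scipy import ndimage as ndi
    from skimage.segmentation import watershed

    def expand_weak_label(prob_map, click_yx):
        # Grow a rough lesion mask from one weak annotation with the watershed transform:
        # a foreground seed at the user's click, a background seed far away, and a gradient
        # image as the flooding surface. Seeding and gradient choices are assumptions.
        markers = np.zeros(prob_map.shape, dtype=int)
        markers[click_yx] = 1          # foreground seed
        markers[0, 0] = 2              # background seed (image corner)
        gradient = ndi.gaussian_gradient_magnitude(prob_map, sigma=2)
        return watershed(gradient, markers) == 1

    # Synthetic "probability map" with a bright blob around (40, 60).
    yy, xx = np.mgrid[0:96, 0:128]
    prob = np.exp(-((yy - 40) ** 2 + (xx - 60) ** 2) / 200.0)
    mask = expand_weak_label(prob, (40, 60))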
6. PVT v2: Improved baselines with Pyramid Vision Transformer (cited 49 times)
Authors: Wenhai Wang, Enze Xie, Xiang Li, Deng-Ping Fan, Kaitao Song, Ding Liang, Tong Lu, Ping Luo, Ling Shao. Computational Visual Media (SCIE, EI, CSCD), 2022, Issue 3, pp. 415-424 (10 pages).
Transformers have recently led to encouraging progress in computer vision. In this work, we present new baselines by improving the original Pyramid Vision Transformer (PVT v1) by adding three designs: (i) a linear complexity attention layer, (ii) an overlapping patch embedding, and (iii) a convolutional feed-forward network. With these modifications, PVT v2 reduces the computational complexity of PVT v1 to linearity and provides significant improvements on fundamental vision tasks such as classification, detection, and segmentation. In particular, PVT v2 achieves comparable or better performance than recent work such as the Swin transformer. We hope this work will facilitate state-of-the-art transformer research in computer vision. Code is available at https://github.com/whai362/PVT.
Keywords: transformers, dense prediction, image classification, object detection, semantic segmentation
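Design (ii), the overlapping patch embedding, is easy to sketch: a strided convolution whose kernel is wider than its stride, so neighbouring patches share pixels. The kernel size, stride, and channel width below are illustrative assumptions rather than the official PVT v2 configuration.

    import torch
    import torch.nn as nn

    class OverlapPatchEmbed(nn.Module):
        # Overlapping patch embedding in the spirit of design (ii): because the kernel is wider
        # than the stride, adjacent patches overlap instead of tiling the image disjointly.
        def __init__(self, in_ch=3, embed_dim=64, patch=7, stride=4):
            super().__init__()
            self.proj = nn.Conv2d(in_ch, embed_dim, kernel_size=patch, stride=stride,
                                  padding=patch // 2)
            self.norm = nn.LayerNorm(embed_dim)

        def forward(self, x):                      # x: (B, 3, H, W)
            x = self.proj(x)                       # (B, C, H/4, W/4)
            x = x.flatten(2).transpose(1, 2)       # (B, N, C) token sequence
            return self.norm(x)

    tokens = OverlapPatchEmbed()(torch.randn(1, 3, 224, 224))   # (1, 3136, 64)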
7. Practical Blind Image Denoising via Swin-Conv-UNet and Data Synthesis (cited 1 time)
Authors: Kai Zhang, Yawei Li, Jingyun Liang, Jiezhang Cao, Yulun Zhang, Hao Tang, Deng-Ping Fan, Radu Timofte, Luc Van Gool. Machine Intelligence Research (EI, CSCD), 2023, Issue 6, pp. 822-836 (15 pages).
While recent years have witnessed a dramatic upsurge of exploiting deep neural networks toward solving image denoising, existing methods mostly rely on simple noise assumptions, such as additive white Gaussian noise (AWGN), JPEG compression noise and camera sensor noise, and a general-purpose blind denoising method for real images remains unsolved. In this paper, we attempt to solve this problem from the perspective of network architecture design and training data synthesis. Specifically, for the network architecture design, we propose a swin-conv block to incorporate the local modeling ability of the residual convolutional layer and the non-local modeling ability of the swin transformer block, and then plug it as the main building block into the widely used image-to-image translation UNet architecture. For the training data synthesis, we design a practical noise degradation model which takes into consideration different kinds of noise (including Gaussian, Poisson, speckle, JPEG compression, and processed camera sensor noises) and resizing, and also involves a random shuffle strategy and a double degradation strategy. Extensive experiments on AWGN removal and real image denoising demonstrate that the new network architecture design achieves state-of-the-art performance and the new degradation model can help to significantly improve the practicability. We believe our work can provide useful insights into current denoising research. The source code is available at https://github.com/cszn/SCUNet.
Keywords: blind image denoising, real image denoising, data synthesis, transformer, image signal processing (ISP) pipeline
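A heavily simplified sketch of the kind of noise synthesis described above is given here; it samples Gaussian, Poisson, speckle, and JPEG degradations with assumed parameter ranges and omits the paper's resizing, random shuffle, double degradation, and camera-sensor modeling.

    import numpy as np
    import cv2

    def synthesize_noisy(img, rng):
        # Simplified noise synthesis: additive Gaussian, Poisson (shot), speckle, and JPEG
        # compression noise with assumed parameter ranges.
        x = img.astype(np.float32) / 255.0
        x = x + rng.normal(0.0, rng.uniform(0.01, 0.1), x.shape)      # Gaussian
        x = rng.poisson(np.clip(x, 0, 1) * 255.0) / 255.0             # Poisson
        x = x * (1.0 + rng.normal(0.0, 0.05, x.shape))                # speckle
        x = np.clip(x * 255.0, 0, 255).astype(np.uint8)
        quality = int(rng.uniform(30, 95))                            # JPEG quality factor
        ok, buf = cv2.imencode(".jpg", x, [cv2.IMWRITE_JPEG_QUALITY, quality])
        return cv2.imdecode(buf, cv2.IMREAD_UNCHANGED)

    noisy = synthesize_noisy(np.full((64, 64, 3), 128, np.uint8), np.random.default_rng(0))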
8. Sequential interactive image segmentation (cited 1 time)
Authors: Zheng Lin, Zhao Zhang, Zi-Yue Zhu, Deng-Ping Fan, Xia-Lei Liu. Computational Visual Media (SCIE, EI, CSCD), 2023, Issue 4, pp. 753-765 (13 pages).
Interactive image segmentation (IIS) is an important technique for obtaining pixel-level annotations. In many cases, target objects share similar semantics. However, IIS methods neglect this connection, and in particular the cues provided by representations of previously segmented objects, previous user interaction, and previous prediction masks, which can all provide suitable priors for the current annotation. In this paper, we formulate a sequential interactive image segmentation (SIIS) task for minimizing user interaction when segmenting sequences of related images, and we provide a practical approach to this task using two pertinent designs. The first is a novel interaction mode: when annotating a new sample, our method can automatically propose an initial click based on the previous annotation, which dramatically reduces the interaction burden on the user. The second is an online optimization strategy, with the goal of providing semantic information when annotating specific targets, optimizing the model with dense supervision from previously labeled samples. Experiments demonstrate the effectiveness of regarding SIIS as a particular task, and of our methods for addressing it.
Keywords: interactive segmentation, user interaction, object segmentation
9. How Good is Google Bard's Visual Understanding? An Empirical Study on Open Challenges
Authors: Haotong Qin, Ge-Peng Ji, Salman Khan, Deng-Ping Fan, Fahad Shahbaz Khan, Luc Van Gool. Machine Intelligence Research (EI, CSCD), 2023, Issue 5, pp. 605-613 (9 pages).
Google's Bard has emerged as a formidable competitor to OpenAI's ChatGPT in the field of conversational AI. Notably, Bard has recently been updated to handle visual inputs alongside text prompts during conversations. Given Bard's impressive track record in handling textual inputs, we explore its capabilities in understanding and interpreting visual data (images) conditioned by text questions. This exploration holds the potential to unveil new insights and challenges for Bard and other forthcoming multi-modal generative models, especially in addressing complex computer vision problems that demand accurate visual and language understanding. Specifically, in this study, we focus on 15 diverse task scenarios encompassing regular, camouflaged, medical, underwater and remote sensing data to comprehensively evaluate Bard's performance. Our primary finding indicates that Bard still struggles in these vision scenarios, highlighting the significant gap in vision-based understanding that needs to be bridged in future developments. We expect that this empirical study will prove valuable in advancing future models, leading to enhanced capabilities in comprehending and interpreting fine-grained visual data. Our project is released at https://github.com/htqin/GoogleBard-VisUnderstand.
Keywords: Google Bard, multi-modal understanding, visual comprehension, large language models, conversational AI chatbot
10. Specificity-preserving RGB-D saliency detection
Authors: Tao Zhou, Deng-Ping Fan, Geng Chen, Yi Zhou, Huazhu Fu. Computational Visual Media (SCIE, EI, CSCD), 2023, Issue 2, pp. 297-317 (21 pages).
Salient object detection (SOD) in RGB and depth images has attracted increasing research interest. Existing RGB-D SOD models usually adopt fusion strategies to learn a shared representation from RGB and depth modalities, while few methods explicitly consider how to preserve modality-specific characteristics. In this study, we propose a novel framework, the specificity-preserving network (SPNet), which improves SOD performance by exploring both the shared information and modality-specific properties. Specifically, we use two modality-specific networks and a shared learning network to generate individual and shared saliency prediction maps. To effectively fuse cross-modal features in the shared learning network, we propose a cross-enhanced integration module (CIM) and propagate the fused feature to the next layer to integrate cross-level information. Moreover, to capture rich complementary multi-modal information to boost SOD performance, we use a multi-modal feature aggregation (MFA) module to integrate the modality-specific features from each individual decoder into the shared decoder. By using skip connections between encoder and decoder layers, hierarchical features can be fully combined. Extensive experiments demonstrate that our SPNet outperforms cutting-edge approaches on six popular RGB-D SOD and three camouflaged object detection benchmarks. The project is publicly available at https://github.com/taozh2017/SPNet.
Keywords: salient object detection (SOD), RGB-D, cross-enhanced integration module (CIM), multi-modal feature aggregation (MFA)
11. Full-duplex strategy for video object segmentation
Authors: Ge-Peng Ji, Deng-Ping Fan, Keren Fu, Zhe Wu, Jianbing Shen, Ling Shao. Computational Visual Media (SCIE, EI, CSCD), 2023, Issue 1, pp. 155-175 (21 pages).
Previous video object segmentation approaches mainly focus on simplex solutions linking appearance and motion, limiting effective feature collaboration between these two cues. In this work, we study a novel and efficient full-duplex strategy network (FSNet) to address this issue, by considering a better mutual restraint scheme linking motion and appearance, allowing exploitation of cross-modal features from the fusion and decoding stage. Specifically, we introduce a relational cross-attention module (RCAM) to achieve bidirectional message propagation across embedding sub-spaces. To improve the model's robustness and update inconsistent features from the spatiotemporal embeddings, we adopt a bidirectional purification module after the RCAM. Extensive experiments on five popular benchmarks show that our FSNet is robust to various challenging scenarios (e.g., motion blur and occlusion), and compares well to leading methods both for video object segmentation and video salient object detection. The project is publicly available at https://github.com/GewelsJI/FSNet.
Keywords: video object segmentation (VOS), video salient object detection (V-SOD), visual attention
12. Vision Transformers with Hierarchical Attention
Authors: Yun Liu, Yu-Huan Wu, Guolei Sun, Le Zhang, Ajad Chhatkuli, Luc Van Gool. Machine Intelligence Research (EI), 2024, Issue 4, pp. 670-683 (14 pages).
This paper tackles the high computational/space complexity associated with multi-head self-attention (MHSA) in vanilla vision transformers. To this end, we propose hierarchical MHSA (H-MHSA), a novel approach that computes self-attention in a hierarchical fashion. Specifically, we first divide the input image into patches as commonly done, and each patch is viewed as a token. Then, the proposed H-MHSA learns token relationships within local patches, serving as local relationship modeling. Next, the small patches are merged into larger ones, and H-MHSA models the global dependencies for the small number of merged tokens. Finally, the local and global attentive features are aggregated to obtain features with powerful representation capacity. Since we only calculate attention for a limited number of tokens at each step, the computational load is reduced dramatically. Hence, H-MHSA can efficiently model global relationships among tokens without sacrificing fine-grained information. With the H-MHSA module incorporated, we build a family of hierarchical-attention-based transformer networks, namely HAT-Net. To demonstrate the superiority of HAT-Net in scene understanding, we conduct extensive experiments on fundamental vision tasks, including image classification, semantic segmentation, object detection and instance segmentation. Therefore, HAT-Net provides a new perspective for vision transformers. Code and pretrained models are available at https://github.com/yun-liu/HAT-Net.
Keywords: vision transformer, hierarchical attention, global attention, local attention, scene understanding
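A toy PyTorch sketch of the hierarchical idea, local attention inside windows followed by global attention over merged (pooled) tokens, is shown below; the window size, pooling factor, and additive aggregation are assumptions and do not reproduce the actual H-MHSA module.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class HierAttention(nn.Module):
        # Toy hierarchical attention: (1) self-attention inside non-overlapping local windows,
        # (2) self-attention among pooled ("merged") tokens for global context, (3) aggregation.
        def __init__(self, dim, window=7, pool=2, heads=4):
            super().__init__()
            self.local_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.global_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
            self.window, self.pool = window, pool

        def forward(self, x):          # x: (B, H, W, C); H, W divisible by window and pool
            B, H, W, C = x.shape
            w = self.window
            t = x.view(B, H // w, w, W // w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)
            t = self.local_attn(t, t, t)[0]       # local relationship modelling
            x = t.reshape(B, H // w, W // w, w, w, C).permute(0, 1, 3, 2, 4, 5).reshape(B, H, W, C)
            g = F.avg_pool2d(x.permute(0, 3, 1, 2), self.pool).flatten(2).transpose(1, 2)
            g = self.global_attn(g, g, g)[0]      # global dependencies on merged tokens
            g = g.transpose(1, 2).view(B, C, H // self.pool, W // self.pool)
            g = F.interpolate(g, size=(H, W), mode="nearest").permute(0, 2, 3, 1)
            return x + g                          # aggregate local and global features

    feats = HierAttention(64)(torch.randn(2, 14, 14, 64))   # (2, 14, 14, 64)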
13. Polyp-PVT: Polyp Segmentation with Pyramid Vision Transformers
Authors: Bo Dong, Wenhai Wang, Deng-Ping Fan, Jinpeng Li, Huazhu Fu, Ling Shao. CAAI Artificial Intelligence Research, 2023, Issue 1, pp. 1-15 (15 pages).
Most polyp segmentation methods use convolutional neural networks (CNNs) as their backbone, leading to two key issues when exchanging information between the encoder and decoder: (1) taking into account the differences in contribution between different-level features, and (2) designing an effective mechanism for fusing these features. Unlike existing CNN-based methods, we adopt a transformer encoder, which learns more powerful and robust representations. In addition, considering the image acquisition influence and elusive properties of polyps, we introduce three modules: a cascaded fusion module (CFM), a camouflage identification module (CIM), and a similarity aggregation module (SAM). Among these, the CFM is used to collect the semantic and location information of polyps from high-level features; the CIM is applied to capture polyp information disguised in low-level features; and the SAM extends the pixel features of the polyp area with high-level semantic position information to the entire polyp area, thereby effectively fusing cross-level features. The proposed model, named Polyp-PVT, effectively suppresses noise in the features and significantly improves their expressive capabilities. Extensive experiments on five widely adopted datasets show that the proposed model is more robust to various challenging situations (e.g., appearance changes, small objects, and rotation) than existing representative methods. The proposed model is available at https://github.com/DengPingFan/Polyp-PVT.
Keywords: polyp segmentation, pyramid vision transformer, colonoscopy, computer vision
14. Human Gait Recognition Based on Kernel PCA Using Projections (cited 4 times)
Authors: Murat Ekinci, Murat Aykut. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2007, Issue 6, pp. 867-876 (10 pages).
This paper presents a novel approach for human identification at a distance using gait recognition. Recognition of a person from their gait is a biometric of increasing interest. The proposed work introduces a nonlinear machine learning method, kernel Principal Component Analysis (PCA), to extract gait features from silhouettes for individual recognition. The binarized silhouette of a moving object is first represented by four 1-D signals, the basic image features called the distance vectors. A Fourier transform is performed to achieve translation invariance for the gait patterns accumulated from silhouette sequences extracted under different circumstances. Kernel PCA is then used to extract higher-order relations among the gait patterns for future recognition. A fusion strategy is finally executed to produce a final decision. The experiments are carried out on the CMU and USF gait databases and presented based on different training gait cycles.
Keywords: biometrics, gait recognition, gait representation, kernel PCA, pattern recognition
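A compact sketch of the recognition pipeline (FFT-magnitude features followed by kernel PCA and a nearest-neighbour decision) is given below with scikit-learn; the feature sizes, RBF kernel parameters, and random placeholder data are assumptions standing in for the CMU/USF silhouette sequences.

    import numpy as np
    from sklearn.decomposition import KernelPCA
    from sklearn.neighbors import KNeighborsClassifier

    # Placeholder gait features: four concatenated 1-D distance-vector signals per sequence,
    # made translation-invariant via FFT magnitudes (shapes and data are assumptions).
    rng = np.random.default_rng(0)
    X_train = np.abs(np.fft.rfft(rng.random((100, 4 * 128)), axis=1))   # 100 training sequences
    y_train = rng.integers(0, 10, 100)                                   # 10 subjects
    X_test = np.abs(np.fft.rfft(rng.random((20, 4 * 128)), axis=1))

    kpca = KernelPCA(n_components=32, kernel="rbf", gamma=1e-3).fit(X_train)
    clf = KNeighborsClassifier(n_neighbors=1).fit(kpca.transform(X_train), y_train)
    pred = clf.predict(kpca.transform(X_test))      # nearest-neighbour identity decision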
15. Light field salient object detection: A review and benchmark (cited 1 time)
Authors: Keren Fu, Yao Jiang, Ge-Peng Ji, Tao Zhou, Qijun Zhao, Deng-Ping Fan. Computational Visual Media (SCIE, EI, CSCD), 2022, Issue 4, pp. 509-534 (26 pages).
Salient object detection (SOD) is a long-standing research topic in computer vision with increasing interest in the past decade. Since light fields record comprehensive information of natural scenes that benefits SOD in a number of ways, using light field inputs to improve saliency detection over conventional RGB inputs is an emerging trend. This paper provides the first comprehensive review and a benchmark for light field SOD, which has long been lacking in the saliency community. Firstly, we introduce light fields, including theory and data forms, and then review existing studies on light field SOD, covering ten traditional models, seven deep learning-based models, a comparative study, and a brief review. Existing datasets for light field SOD are also summarized. Secondly, we benchmark nine representative light field SOD models together with several cutting-edge RGB-D SOD models on four widely used light field datasets, providing insightful discussions and analyses, including a comparison between light field SOD and RGB-D SOD models. Due to the inconsistency of current datasets, we further generate complete data and supplement focal stacks, depth maps, and multi-view images for them, making them consistent and uniform. Our supplemental data make a universal benchmark possible. Lastly, light field SOD is a specialised problem, because of its diverse data representations and high dependency on acquisition hardware, so it differs greatly from other saliency detection tasks. We provide nine observations on challenges and future directions, and outline several open issues. All the materials, including models, datasets, benchmarking results, and supplemented light field datasets, are publicly available at https://github.com/kerenfu/LFSOD-Survey.
Keywords: light field, salient object detection (SOD), deep learning, benchmarking
16. Palmprint Recognition by Applying Wavelet-Based Kernel PCA
Authors: Murat Ekinci, Murat Aykut. Journal of Computer Science & Technology (SCIE, EI, CSCD), 2008, Issue 5, pp. 851-861 (11 pages).
This paper presents a wavelet-based kernel Principal Component Analysis (PCA) method by integrating the Daubechies wavelet representation of palm images and the kernel PCA method for palmprint recognition. Kernel PCA is a technique for nonlinear dimension reduction of data with an underlying nonlinear spatial structure. The intensity values of the palmprint image are first normalized by using the mean and standard deviation. The palmprint is then transformed into the wavelet domain to decompose palm images, and the lowest-resolution subband coefficients are chosen for palm representation. The kernel PCA method is then applied to extract nonlinear features from the subband coefficients. Finally, similarity measurement is accomplished by using a weighted Euclidean linear distance-based nearest neighbor classifier. Experimental results on the PolyU Palmprint Databases demonstrate that the proposed approach achieves highly competitive performance with respect to the published palmprint recognition approaches.
Keywords: palmprint recognition, kernel PCA, wavelet transform, biometrics, pattern recognition
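The wavelet-plus-kernel-PCA pipeline can be sketched in a few lines with PyWavelets and scikit-learn; the choice of db4 at level 3, the RBF kernel, the random placeholder images, and the plain (unweighted) Euclidean matching below are simplifying assumptions.

    import numpy as np
    import pywt
    from sklearn.decomposition import KernelPCA

    def palm_features(img, levels=3):
        # Normalise intensities, then keep only the lowest-resolution approximation subband
        # of a Daubechies decomposition (db4 and level 3 are assumptions).
        img = (img - img.mean()) / (img.std() + 1e-8)
        return pywt.wavedec2(img, "db4", level=levels)[0].ravel()

    # Placeholder gallery/probe images standing in for the PolyU palmprints.
    rng = np.random.default_rng(0)
    gallery = np.stack([palm_features(rng.random((128, 128))) for _ in range(50)])
    probe = palm_features(rng.random((128, 128)))

    kpca = KernelPCA(n_components=20, kernel="rbf", gamma=1e-2).fit(gallery)
    g, p = kpca.transform(gallery), kpca.transform(probe[None])
    match = int(np.argmin(np.linalg.norm(g - p, axis=1)))   # nearest neighbor in feature space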
17. A volumetric change detection framework using UAV oblique photogrammetry – a case study of ultra-high-resolution monitoring of progressive building collapse
Authors: Ningli Xu, Debao Huang, Shuang Song, Xiao Ling, Chris Strasbaugh, Alper Yilmaz, Halil Sezen, Rongjun Qin. International Journal of Digital Earth (SCIE), 2021, Issue 11, pp. 1705-1720 (16 pages).
In this paper, we present a case study that performs unmanned aerial vehicle (UAV)-based fine-scale 3D change detection and monitoring of the progressive collapse performance of a building during a demolition event. Multi-temporal oblique photogrammetry images are collected, with 3D point clouds generated at different stages of the demolition. The geometric accuracy of the generated point clouds has been evaluated against both airborne and terrestrial LiDAR point clouds, achieving an average distance of 12 cm and 16 cm for the roof and façade, respectively. We propose a hierarchical volumetric change detection framework that unifies multi-temporal UAV images for pose estimation (free of ground control points), reconstruction, and a coarse-to-fine 3D density change analysis. This work provides a solution capable of addressing change detection on full 3D time-series datasets where dramatic scene content changes are presented progressively. Our change detection results on the building demolition event have been evaluated against manually marked ground-truth changes and achieve an F-1 score varying from 0.78 to 0.92, with consistently high precision (0.92-0.99). Volumetric changes through the demolition progress are derived from change detection and have been shown to favorably reflect the qualitative and quantitative building demolition progression.
Keywords: 3D change detection, multi-temporal data registration, oblique photogrammetry
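A coarse stand-in for the volumetric change analysis is to voxelise both epochs over a common bounding box and difference the per-voxel point densities, as sketched below with NumPy; the voxel size, threshold, and random point clouds are assumptions, and the paper's hierarchical coarse-to-fine framework is considerably more involved.

    import numpy as np

    def voxel_density_change(pts_t0, pts_t1, voxel=0.5):
        # Coarse volumetric change: voxelise both epochs on a shared grid and difference
        # the per-voxel point counts. Voxel size (0.5 m) is an assumption.
        lo = np.minimum(pts_t0.min(0), pts_t1.min(0))
        hi = np.maximum(pts_t0.max(0), pts_t1.max(0))
        edges = [np.arange(lo[i], hi[i] + voxel, voxel) for i in range(3)]
        h0, _ = np.histogramdd(pts_t0, bins=edges)
        h1, _ = np.histogramdd(pts_t1, bins=edges)
        return h1 - h0                      # negative values = removed material

    # Random points standing in for two co-registered photogrammetric point clouds.
    rng = np.random.default_rng(0)
    diff = voxel_density_change(rng.random((10000, 3)) * 20, rng.random((8000, 3)) * 20)
    removed_voxels = int((diff < -5).sum())   # the threshold of 5 points is an assumption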
18. Rethinking Global Context in Crowd Counting
Authors: Guolei Sun, Yun Liu, Thomas Probst, Danda Pani Paudel, Nikola Popovic, Luc Van Gool. Machine Intelligence Research (EI), 2024, Issue 4, pp. 640-651 (12 pages).
This paper investigates the role of global context for crowd counting. Specifically, a pure transformer is used to extract features with global information from overlapping image patches. Inspired by classification, we add a context token to the input sequence to facilitate information exchange with tokens corresponding to image patches throughout transformer layers. Because transformers do not explicitly model the tried-and-true channel-wise interactions, we propose a token-attention module (TAM) to recalibrate encoded features through channel-wise attention informed by the context token. Beyond that, the context token is used to predict the total person count of the image through a regression-token module (RTM). Extensive experiments on various datasets, including ShanghaiTech, UCF-QNRF, JHU-CROWD++ and NWPU, demonstrate that the proposed context extraction techniques can significantly improve the performance over the baselines.
Keywords: crowd counting, vision transformer, global context, attention, density map
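One possible reading of the token-attention idea, a context token producing channel-wise gates for the patch tokens, is sketched below in PyTorch; the layer layout and dimensions are assumptions, not the authors' exact TAM.

    import torch
    import torch.nn as nn

    class TokenAttention(nn.Module):
        # Illustrative token attention: the context token produces channel-wise gates that
        # recalibrate the patch tokens (layer layout and sizes are assumptions).
        def __init__(self, dim):
            super().__init__()
            self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

        def forward(self, tokens):              # tokens: (B, 1 + N, C), index 0 = context token
            ctx, patches = tokens[:, :1], tokens[:, 1:]
            return patches * self.gate(ctx)     # (B, N, C) recalibrated patch features

    # Dummy transformer output: 1 context token + 196 patch tokens of width 384.
    recalibrated = TokenAttention(384)(torch.randn(2, 197, 384))   # (2, 196, 384)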