Funding: Supported in part by the Meteorological Joint Funds of the National Natural Science Foundation of China under Grant U2142211; in part by the National Natural Science Foundation of China under Grants 42075141 and 42341202; in part by the National Key Research and Development Program of China under Grant 2020YFA0608000; in part by the Shanghai Municipal Science and Technology Major Project (2021SHZDZX0100); and in part by the Fundamental Research Funds for the Central Universities.
Abstract: In this paper, we introduce TianXing, a transformer-based data-driven model designed with physical augmentation for skillful and efficient global weather forecasting. Previous data-driven transformer models such as Pangu-Weather, FengWu, and FuXi have emerged as promising alternatives to numerical weather prediction. However, these models consume substantial computational resources during training and incorporate little explicit physical guidance in their modeling frameworks. In contrast, TianXing applies a linear-complexity attention mechanism that scales proportionally with input size and significantly reduces GPU resource demands, with only a marginal compromise in accuracy. Furthermore, TianXing introduces an explicit attention decay mechanism in the linear attention, derived from physical insights, to enhance its forecasting skill. The mechanism reweights attention based on Earth's spherical distances and learned sparse multivariate coupling relationships, prompting TianXing to prioritize dynamically relevant neighboring features. Finally, to enhance its performance in medium-range forecasting, TianXing employs a stacked autoregressive forecast algorithm. The model's architecture is validated using ERA5 reanalysis data at 5.625° latitude-longitude resolution, while a high-resolution 0.25° dataset is used to train the actual forecasting model. Notably, TianXing exhibits excellent performance, particularly in the Z500 (geopotential height) and T850 (temperature) fields, surpassing previous data-driven models and operational full-resolution models such as NCEP GFS and ECMWF IFS, as evidenced by latitude-weighted RMSE and ACC metrics. Moreover, TianXing demonstrates remarkable capabilities in predicting extreme weather events such as typhoons.
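The decay mechanism described above can be pictured as penalizing attention between grid points by their great-circle separation. Below is a minimal, illustrative sketch of that idea, assuming a haversine distance and an exponential decay scale; for readability it applies the decay inside ordinary softmax attention rather than TianXing's linear attention, and the function names and decay form are assumptions, not the paper's implementation.

```python
import numpy as np

def haversine(lat1, lon1, lat2, lon2, radius_km=6371.0):
    """Great-circle distance between points on the sphere (km)."""
    lat1, lon1, lat2, lon2 = map(np.radians, (lat1, lon1, lat2, lon2))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    return 2 * radius_km * np.arcsin(np.sqrt(a))

def distance_decayed_attention(q, k, v, lats, lons, decay_scale_km=2000.0):
    """Toy dense attention reweighted by spherical distance.

    q, k, v: (N, d) token features; lats, lons: (N,) grid coordinates in degrees.
    decay_scale_km is a tunable hyperparameter of this sketch, not a paper value.
    """
    d = q.shape[-1]
    scores = q @ k.T / np.sqrt(d)                   # (N, N) content scores
    dist = haversine(lats[:, None], lons[:, None],  # (N, N) pairwise distances
                     lats[None, :], lons[None, :])
    scores = scores - dist / decay_scale_km         # farther grid points are down-weighted
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ v
```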
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62220106003) and the Tsinghua-Tencent Joint Laboratory for Internet Innovation Technology.
Abstract: The cross-view matching of local image features is a fundamental task in visual localization and 3D reconstruction. This study proposes FilterGNN, a transformer-based graph neural network (GNN), aiming to improve the matching efficiency and accuracy of visual descriptors. Based on high matching sparseness and coarse-to-fine covisible-area detection, FilterGNN utilizes cascaded optimal graph-matching filter modules to dynamically reject outlier matches. Moreover, we successfully adapted linear attention in FilterGNN with post-instance normalization support, which reduces the complexity of complete graph learning from O(N²) to O(N). Experiments show that FilterGNN requires only 6% of the time cost and 33.3% of the memory cost of SuperGlue under a large-scale input size, and achieves competitive performance in tasks such as pose estimation, visual localization, and sparse 3D reconstruction.
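The O(N²)-to-O(N) reduction mentioned above comes from the standard associativity trick of linear attention: with a positive feature map φ, (φ(Q)φ(K)ᵀ)V can be computed as φ(Q)(φ(K)ᵀV) without ever forming the N×N similarity matrix. The sketch below uses the common elu(x)+1 feature map and omits FilterGNN's post-instance normalization; it illustrates the generic technique, not the paper's exact layer.

```python
import numpy as np

def feature_map(x):
    """A common positive feature map for linear attention: elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention(q, k, v, eps=1e-6):
    """O(N) attention: compute phi(K)^T V once, then apply it to every query.

    q, k: (N, d), v: (N, d_v). Time and memory scale with N, not N^2,
    because the (N, N) similarity matrix is never materialized.
    """
    q, k = feature_map(q), feature_map(k)
    kv = k.T @ v               # (d, d_v) summary shared by all queries
    z = k.sum(axis=0)          # (d,) normalizer
    return (q @ kv) / (q @ z + eps)[:, None]
```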
Funding: This study was funded by the Science and Technology Project in Xi'an (No. 22GXFW0123) and supported by the Special Fund Construction Project of Key Disciplines in Ordinary Colleges and Universities in Shaanxi Province. The authors would like to thank the anonymous reviewers for their helpful comments and suggestions.
Abstract: As image manipulation technology advances rapidly, the malicious use of image tampering has escalated alarmingly, posing a significant threat to social stability. In image tampering localization, accurately localizing tampered regions of multiple types and various sizes from limited samples remains challenging; these issues impede a model's universality and generalization capability and degrade its performance. To tackle them, we propose FL-MobileViT, an improved MobileViT model devised for image tampering localization. Our proposed model uses a dual-stream architecture that independently processes the RGB and noise domains and captures richer traces of tampering through dual-stream integration. Meanwhile, the model incorporates the Focused Linear Attention mechanism within the lightweight network (MobileViT). This substitution significantly diminishes computational complexity and resolves the homogeneity problems associated with traditional Transformer attention mechanisms, enhancing feature-extraction diversity and improving the model's localization performance. To comprehensively fuse the results from both feature extractors, we introduce the ASPP architecture for multi-scale feature fusion, which facilitates more precise localization of tampered regions of various sizes. Furthermore, to bolster the model's generalization ability, we adopt a contrastive learning method and devise a joint optimization training strategy that leverages fused features and captures the disparities in feature distribution in tampered images. This strategy learns a contrastive loss at various stages of the feature extractor and uses it as an additional constraint alongside the cross-entropy loss. As a result, overfitting is effectively alleviated, and the differentiation between tampered and untampered regions is enhanced. Experimental evaluations on five benchmark datasets (IMD-20, CASIA, NIST-16, Columbia, and Coverage) validate the effectiveness of our proposed model. The calibrated FL-MobileViT model consistently outperforms numerous existing general models in localization accuracy across diverse datasets, demonstrating superior adaptability.
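The joint optimization strategy described above pairs a pixel-wise cross-entropy loss with contrastive constraints computed on intermediate features. The following is a minimal sketch of how such an objective could be assembled; the prototype-margin contrastive term, the stage-wise summation, and the weighting factor are illustrative assumptions rather than the paper's exact losses.

```python
import numpy as np

def cross_entropy(pred_probs, labels, eps=1e-8):
    """Pixel-wise binary cross-entropy; pred_probs and labels are (H*W,) in [0, 1]."""
    return -np.mean(labels * np.log(pred_probs + eps)
                    + (1 - labels) * np.log(1 - pred_probs + eps))

def contrastive_term(features, labels, margin=1.0):
    """Push mean tampered and mean untampered features at least `margin` apart.

    features: (H*W, d) per-pixel embeddings from one stage of the extractor.
    Assumes both classes are present; this is a simple prototype-margin loss
    used only to illustrate a feature-distribution constraint.
    """
    tampered = features[labels == 1].mean(axis=0)
    clean = features[labels == 0].mean(axis=0)
    dist = np.linalg.norm(tampered - clean)
    return max(0.0, margin - dist)

def joint_loss(pred_probs, labels, stage_features, weight=0.1):
    """Cross-entropy plus a weighted contrastive constraint summed over stages."""
    ce = cross_entropy(pred_probs, labels)
    con = sum(contrastive_term(f, labels) for f in stage_features)
    return ce + weight * con
```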
Abstract: In streaming speech recognition, chunk-based recognition breaks parallelism and consumes considerable resources, while restricting the context of the self-attention mechanism makes it difficult to capture all of the information. To address this, this paper proposes a lightweight end-to-end acoustic architecture, CFLASH-Transducer. To capture fine-grained local features, the lightweight FLASH (Fast Linear Attention with a Single Head) is combined with convolutional neural network blocks. The convolutional block adopts the Inception V2 network to extract multi-scale local features of the speech signal, and a Coordinate Attention mechanism then captures the positional information of the features and the correlations across channels. In addition, depthwise separable convolutions are used for feature enhancement and smooth transitions between layers. To process audio in a streaming fashion, the RNN-T (Recurrent Neural Network Transducer) architecture is used for training and decoding. The global attention already computed for the current chunk is passed to subsequent chunks as a latent variable, linking the information of all chunks; this preserves the parallelism and correlations of training, and the computation does not grow as the sequence lengthens. Trained and tested on the open-source THCHS30 dataset, CFLASH-Transducer achieves a high recognition rate, and compared with offline recognition, the accuracy loss of streaming recognition is within 1%.
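Carrying the already-computed attention of earlier chunks forward as a running state is what keeps per-chunk cost constant. Below is a generic sketch of this idea using a linear-attention summary accumulated across chunks; the shapes, the shared q = k = v simplification, and the update rule are illustrative assumptions, not the CFLASH-Transducer implementation.

```python
import numpy as np

def feature_map(x):
    """Positive feature map (elu(x) + 1) used by many linear-attention variants."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def stream_chunks(chunks):
    """Process audio chunk by chunk while carrying a running attention state.

    Each chunk attends to all previous context through the accumulated
    (kv_state, k_sum) summary, so per-chunk cost stays constant and does not
    grow with total sequence length. chunks: list of (chunk_len, d) arrays.
    """
    d = chunks[0].shape[-1]
    kv_state = np.zeros((d, d))   # running sum of phi(k)^T v over all past frames
    k_sum = np.zeros(d)           # running normalizer
    outputs = []
    for x in chunks:              # q = k = v = x for simplicity
        q, k, v = feature_map(x), feature_map(x), x
        kv_state = kv_state + k.T @ v
        k_sum = k_sum + k.sum(axis=0)
        out = (q @ kv_state) / (q @ k_sum + 1e-6)[:, None]
        outputs.append(out)
    return np.concatenate(outputs, axis=0)
```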
Funding: National Natural Science Foundation of China (41671452); China Postdoctoral Science Foundation Funded Project (2017M612510).
Abstract: Hyperspectral Image (HSI) classification based on deep learning has been an attractive area in recent years. However, as a data-driven approach, deep learning usually requires numerous computational resources and high-quality labelled datasets, while high-performance computing and data annotation are expensive. In this paper, to reduce the dependence on massive computation and labelled samples, we propose a deep Double-Channel Dense network (DDCD) for hyperspectral image classification. Specifically, we design a 3D double-channel dense layer to capture the local and global features of the input, and we propose a Linear Attention Mechanism that approximates dot-product attention with far less memory and computational cost. The number of parameters and the amount of computation are notably lower than in comparable deep learning methods, meaning DDCD has a simpler architecture and higher efficiency. A series of quantitative experiments on six widely used hyperspectral datasets shows that the proposed DDCD achieves state-of-the-art performance, even when labelled samples are severely scarce.
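One standard way to approximate dot-product attention at linear cost, of the kind the mechanism above describes, is a first-order Taylor expansion of the softmax kernel over L2-normalized queries and keys, which lets the key and value sums be shared by every query. The sketch below follows that route and is an assumption about the general technique, not the paper's code.

```python
import numpy as np

def taylor_linear_attention(q, k, v, eps=1e-6):
    """Dot-product attention approximated via a first-order Taylor expansion.

    exp(q . k) is approximated by 1 + q_n . k_n with q_n, k_n L2-normalized,
    which keeps the similarity positive and lets the sums over keys/values be
    precomputed once: O(N) memory and compute instead of O(N^2).
    q, k: (N, d), v: (N, d_v).
    """
    q_n = q / (np.linalg.norm(q, axis=-1, keepdims=True) + eps)
    k_n = k / (np.linalg.norm(k, axis=-1, keepdims=True) + eps)
    v_sum = v.sum(axis=0)                      # (d_v,)   shared across queries
    kv = k_n.T @ v                             # (d, d_v) shared across queries
    num = v_sum[None, :] + q_n @ kv            # (N, d_v)
    den = k.shape[0] + q_n @ k_n.sum(axis=0)   # (N,)
    return num / den[:, None]
```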