Funding: supported by the Shanghai Artificial Intelligence Laboratory and the National Natural Science Foundation of China (Grant Nos. 42088101 and 42030605).
Abstract: Recent studies have shown that deep learning (DL) models can skillfully forecast El Niño–Southern Oscillation (ENSO) events more than 1.5 years in advance. However, concerns regarding the reliability of predictions made by DL methods persist, including potential overfitting and a lack of interpretability. Here, we propose ResoNet, a DL model that combines CNN (convolutional neural network) and transformer architectures. This hybrid architecture enables our model to adequately capture local sea surface temperature anomalies as well as long-range inter-basin interactions across oceans. We show that ResoNet can robustly predict ENSO at lead times of 19 months, thus outperforming existing approaches in terms of forecast horizon. Applying an explainability method to ResoNet predictions of El Niño and La Niña at leads of 1 to 18 months, we find that the model predicts the Niño-3.4 index through multiple physically reasonable mechanisms, such as the recharge oscillator concept, the seasonal footprinting mechanism, and the Indian Ocean capacitor effect. Moreover, we demonstrate for the first time that the asymmetry between El Niño and La Niña development can be captured by ResoNet. Our results could help to alleviate skepticism about applying DL models to ENSO prediction and encourage further attempts to discover and predict climate phenomena using AI methods.
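The hybrid design described above, convolution for local anomalies plus attention for basin-scale interactions, can be illustrated with a minimal NumPy sketch. All sizes, weights, and the scalar readout below are illustrative assumptions, not ResoNet's actual configuration.

```python
import numpy as np

# Minimal sketch of a CNN-then-attention hybrid: a convolution captures local
# sea-surface-temperature anomaly structure, and one self-attention layer mixes
# information across all positions, standing in for long-range inter-basin
# interactions. Everything here is a toy stand-in, not the paper's model.

def conv1d_valid(x, kernel):
    """Local feature extraction: valid 1-D convolution along the spatial axis."""
    k = len(kernel)
    return np.array([x[i:i + k] @ kernel for i in range(len(x) - k + 1)])

def self_attention(tokens):
    """Global mixing: scaled dot-product self-attention over all tokens."""
    d = tokens.shape[-1]
    scores = tokens @ tokens.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)   # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ tokens                        # every token attends to every other

rng = np.random.default_rng(0)
sst = rng.standard_normal(64)                           # toy 1-D "SST anomaly" field
local = conv1d_valid(sst, np.array([0.25, 0.5, 0.25]))  # 62 smoothed local features
tokens = local.reshape(-1, 2)                           # 31 two-dimensional tokens
mixed = self_attention(tokens)                          # long-range interactions
index_readout = float(mixed.mean())                     # scalar stand-in for an index output
```

In the real model the convolutional stage and the attention weights are learned; the point of the sketch is only the order of operations: local filtering first, global all-to-all mixing second.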
Funding: National Natural Science Foundation of China (No. 62201457); Natural Science Foundation of Shaanxi Province (Nos. 2022JQ-668 and 2022JQ-588).
Abstract: Convolutional neural networks (CNNs) have an excellent ability to model local contextual information. However, CNNs struggle to describe long-range semantic features, which leads to relatively low classification accuracy on hyperspectral images. To address this problem, this article proposes an algorithm based on multiscale fusion and a transformer network for hyperspectral image classification. First, low-level spatial-spectral features are extracted by a multiscale residual structure. Second, an attention module is introduced to focus on the more important spatial-spectral information. Finally, high-level semantic features are represented and learned by a token learner and an improved transformer encoder. The proposed algorithm is compared with six classical hyperspectral classification algorithms on real hyperspectral images. The experimental results show that it effectively improves the land-cover classification accuracy of hyperspectral images.
Abstract: Sign language fills the communication gap for people with hearing and speech impairments. It includes both visual modalities: manual gestures consisting of hand movements, and non-manual gestures incorporating body movements of the head, facial expressions, eyes, shoulder shrugging, etc. Previously, the two kinds of gestures have been detected separately; identifying each in isolation may yield better accuracy, but much communicational information is lost. A proper sign language mechanism is needed to detect manual and non-manual gestures together so as to convey the appropriately detailed message to others. Our proposed system contributes the Sign Language Action Transformer Network (SLATN), which localizes hand, body, and facial gestures in video sequences. We employ a Transformer-style architecture as a "base network" to extract features from the spatiotemporal domain. The model spontaneously learns to track individual persons and their action context across multiple frames. Furthermore, a "head network" emphasizes hand movements and facial expressions simultaneously, which is often crucial to understanding sign language, using its attention mechanism to create tight bounding boxes around classified gestures. The model is compared with traditional activity-recognition methods; it not only runs faster but also achieves better accuracy. It achieves an overall testing accuracy of 82.66% with a very considerable computational performance of 94.13 giga floating-point operations per second (G-FLOPS). Another contribution is a newly created Pakistan Sign Language dataset for manual and non-manual (PkSLMNM) gestures.
Funding: Young Innovative Talents Project of Guangdong Ordinary Universities (No. 2022KQNCX225); School-level Teaching and Research Project of Guangzhou City Polytechnic (No. 2022xky046).
Abstract: Semantic segmentation methods based on CNNs have made great progress, but shortcomings remain in their application to remote sensing image segmentation; in particular, a small receptive field cannot effectively capture global context. To solve this problem, this paper proposes a hybrid model based on ResNet50 and the Swin Transformer that directly captures long-range dependencies and fuses features through a Cross Feature Modulation Module (CFMM). Experimental results on two publicly available datasets, Vaihingen and Potsdam, yield mIoU scores of 70.27% and 76.63%, respectively. Thus, CFM-UNet maintains high segmentation performance compared with other competitive networks.
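The abstract names the CFMM fusion module but does not specify its internals. A common cross-modulation pattern, offered here purely as an illustrative assumption, is for each branch to gate the other through a sigmoid of the opposite branch's activations before summation:

```python
import numpy as np

def cross_feature_modulation(cnn_feat, swin_feat):
    """Hypothetical cross-feature fusion: each branch is gated by a sigmoid of
    the opposite branch's activations, then the two modulated maps are summed.
    This is an assumption for illustration, not the paper's exact CFMM."""
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    return cnn_feat * sigmoid(swin_feat) + swin_feat * sigmoid(cnn_feat)

# Example: fuse a (channels, H, W) CNN feature map with a Swin feature map.
fused = cross_feature_modulation(np.ones((2, 4, 4)), np.zeros((2, 4, 4)))
```

The appeal of this style of fusion is symmetry: neither branch dominates a priori, and each can suppress or amplify the other's features per pixel.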
Funding: supported by the Postgraduate Research & Practice Innovation Program of Jiangsu Province (SJCX21_1427) and the General Program of Natural Science Research in Jiangsu Universities (21KJB520019).
Abstract: Convolutional neural networks (CNNs) based on U-shaped structures and skip connections play a pivotal role in various image segmentation tasks. Recently, the Transformer has started to lead new trends in image segmentation. A Transformer layer can model relationships among all pixels, so the two approaches can complement each other well. Based on these characteristics, we combine a Transformer pipeline and a convolutional neural network pipeline to gain the advantages of both. The image is fed into a U-shaped encoder-decoder architecture built on an empirical combination of self-attention and convolution, in which skip connections are utilized for local-global semantic feature learning. At the same time, the image is also fed into a convolutional neural network architecture. The final segmentation result is formed by a Mix block that combines both. The mixture model of convolutional neural network and Transformer for road segmentation (MCTNet) achieves effective segmentation results on the KITTI dataset and on an Unstructured Road Scene (URS) dataset built by ourselves. Code, the self-built dataset, and trained models will be available at https://github.com/xflxfl1992/MCTNet.
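The two-pipeline idea above can be sketched minimally: the same input goes through a convolutional branch (local detail) and an attention-like branch (global context), and a Mix block fuses the two. The specific branch operations and the constant fusion gate below are stand-in assumptions; MCTNet's actual branches are learned networks.

```python
import numpy as np

def cnn_branch(img):
    """3x3 mean filter as a stand-in for the convolutional pipeline's local features."""
    padded = np.pad(img, 1, mode="edge")
    h, w = img.shape
    out = np.empty_like(img, dtype=float)
    for i in range(h):
        for j in range(w):
            out[i, j] = padded[i:i + 3, j:j + 3].mean()
    return out

def transformer_branch(img):
    """Uniform global response: every pixel sees the image-wide mean, mimicking
    the all-pixel relationships a Transformer layer can construct."""
    return np.full_like(img, img.mean(), dtype=float)

def mix_block(local_feat, global_feat, gate=0.5):
    """Fuse the two pipelines; a learned gate would replace this constant."""
    return gate * local_feat + (1.0 - gate) * global_feat

img = np.arange(16.0).reshape(4, 4)
fused = mix_block(cnn_branch(img), transformer_branch(img))
```

The design choice mirrored here is that fusion happens once, at the output, so each pipeline can be trained and reasoned about independently before the Mix block combines them.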
Funding: funded in part by the National Natural Science Foundation of China (42271409, 62071084); in part by the Heilongjiang Science Foundation Project of China under Grant LH2021D022; in part by the Leading Talents Project of the State Ethnic Affairs Commission; and in part by the Fundamental Research Funds in Heilongjiang Provincial Universities of China under Grant 145209149.
Abstract: Convolutional neural networks (CNNs) have become one of the most popular deep learning frameworks and have been widely used in hyperspectral image (HSI) classification tasks. Convolution (Conv) in a CNN uses filter weights to extract features in a local receptive field, and the weight parameters are shared globally, which makes Conv focus more on the high-frequency information of the image. Unlike Conv, the Transformer can model long-term dependencies between distant features and adaptively focus on different regions. In addition, the Transformer is considered a low-pass filter that focuses more on the low-frequency information of the image. Given the complementary characteristics of Conv and the Transformer, the two can be integrated for full feature extraction. Moreover, the most important image features correspond to discriminative regions, while secondary image features represent important but easily overlooked regions, which are also conducive to HSI classification. In this study, a complementary integrated Transformer network (CITNet) for hyperspectral image classification is proposed. First, three-dimensional convolution (Conv3D) and two-dimensional convolution (Conv2D) are utilized to extract shallow semantic information from the image. To enhance the secondary features, a channel Gaussian modulation attention module is proposed and embedded between Conv3D and Conv2D. This module not only enhances secondary features but also suppresses the most important and least important features. Then, considering the different and complementary characteristics of Conv and the Transformer, a complementary integrated Transformer module is designed. Finally, through extensive experiments, this study evaluates the classification performance of CITNet and several state-of-the-art networks on five common datasets. The experimental results show that, compared with these networks, CITNet provides better classification performance.
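The stated behavior of the channel Gaussian modulation attention, boost secondary features while damping the most and least important ones, suggests weighting channels by a Gaussian centered on mid-range importance scores. The sketch below is a plausible reading of that behavior, with the normalization and sigma chosen by assumption rather than taken from the paper.

```python
import numpy as np

def gaussian_channel_modulation(features, sigma=0.5):
    """Weight each channel by a Gaussian of its normalised global-pool score:
    mid-scoring ("secondary") channels are amplified most, while the highest-
    and lowest-scoring channels are damped. Hypothetical parameterisation.
    features: array of shape (channels, height, width)."""
    scores = features.mean(axis=(1, 2))                  # global average pooling
    lo, hi = scores.min(), scores.max()
    norm = (scores - lo) / (hi - lo + 1e-8)              # scores mapped into [0, 1]
    weights = np.exp(-((norm - 0.5) ** 2) / (2.0 * sigma ** 2))
    return features * weights[:, None, None]
```

For example, given three channels whose pooled scores are low, middle, and high, the middle channel keeps roughly full weight while the two extremes are scaled down.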
Funding: the Suzhou Key Industrial Technology Innovation Project SYG202031 and the Future Network Scientific Research Fund Project FNSRFP-2021-YB-29.
Abstract: An increase in car ownership brings convenience to people's lives, but it also leads to frequent traffic accidents. Precisely forecasting surrounding agents' future trajectories can effectively decrease vehicle-vehicle and vehicle-pedestrian collisions. Long short-term memory (LSTM) networks are often used for vehicle trajectory prediction, but they have shortcomings such as gradient explosion and low efficiency. A trajectory prediction method based on an improved Transformer network is proposed to forecast agents' future trajectories in a complex traffic environment. It replaces the sequential step-by-step processing of LSTM with the parallel, attention-based processing of the Transformer. To perform trajectory prediction more efficiently, a probabilistic sparse self-attention mechanism is introduced to reduce attention complexity by reducing the number of queried values in the attention mechanism. The activate-or-not (ACON) activation function is adopted to learn whether or not to activate, improving model flexibility. The proposed method is evaluated on the publicly available benchmarks next-generation simulation (NGSIM) and ETH/UCY. The experimental results indicate that it can accurately and efficiently predict agents' trajectories.
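The ACON family mentioned above (Ma et al., "Activate or Not", 2021) makes the choice between activating and staying linear learnable. A scalar sketch of the ACON-C variant, with p1, p2, and beta as plain arguments rather than the per-channel learned parameters used in practice:

```python
import math

def acon_c(x, p1=1.0, p2=0.0, beta=1.0):
    """ACON-C: f(x) = (p1 - p2) * x * sigmoid(beta * (p1 - p2) * x) + p2 * x.
    beta -> infinity recovers max(p1*x, p2*x) (activate); beta -> 0 gives the
    linear mean (p1 + p2) / 2 * x (do not activate)."""
    d = (p1 - p2) * x
    return d / (1.0 + math.exp(-beta * d)) + p2 * x

# With the defaults p1=1, p2=0, beta=1 this reduces to SiLU/Swish: x * sigmoid(x).
```

The switching behavior is what the abstract appeals to: as the learned beta grows the function hardens toward a ReLU-like maximum, and as it shrinks the unit effectively "turns off" its non-linearity.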