Funding: supported by the Beijing Key Laboratory of Behavior and Mental Health, Peking University.
Abstract: The fusion technique is key to the multimodal emotion recognition task. Recently, cross-modal attention-based fusion methods have demonstrated high performance and strong robustness. However, cross-modal attention suffers from redundant features and does not capture complementary features well. We find that it is not necessary to use the entire information of one modality to reinforce the other during cross-modal interaction: the features that can reinforce a modality may contain only a part of it. To this end, we design an innovative Transformer-based Adaptive Cross-modal Fusion Network (TACFN). Specifically, to address the redundant features, we have one modality perform intra-modal feature selection through a self-attention mechanism, so that the selected features can adaptively and efficiently interact with the other modality. To better capture the complementary information between the modalities, we obtain a fused weight vector by splicing (concatenating) the modality features and use this weight vector to reinforce each modality. We apply TACFN to the RAVDESS and IEMOCAP datasets. For a fair comparison, we use the same unimodal representations to validate the effectiveness of the proposed fusion method. The experimental results show that TACFN brings a significant performance improvement over other methods and achieves state-of-the-art performance. All code and models can be accessed at https://github.com/shuzihuaiyu/TACFN.
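The fusion idea described above (intra-modal selection via self-attention, then a spliced weight vector that reinforces each modality) can be illustrated with a minimal numpy sketch. This is not the authors' implementation; the function names (`self_attention`, `fuse`), the mean-pooling step, and the sigmoid gating are simplifying assumptions made here for illustration only.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(x):
    # Single-head scaled dot-product self-attention over x: (seq_len, d).
    # Plays the role of intra-modal feature selection in the sketch.
    d = x.shape[-1]
    scores = softmax(x @ x.T / np.sqrt(d))
    return scores @ x

def fuse(audio, visual):
    # Pool the attention-selected features of each modality, splice
    # (concatenate) them, and derive a gate (the "fused weight vector")
    # that reweights each modality before summing.
    a = self_attention(audio).mean(axis=0)   # pooled audio features, (d,)
    v = self_attention(visual).mean(axis=0)  # pooled visual features, (d,)
    spliced = np.concatenate([a, v])         # (2d,)
    gate = 1.0 / (1.0 + np.exp(-spliced))    # sigmoid weight vector
    d = a.shape[0]
    return gate[:d] * a + gate[d:] * v       # reinforced fused representation

rng = np.random.default_rng(0)
fused = fuse(rng.normal(size=(10, 8)), rng.normal(size=(12, 8)))
print(fused.shape)  # (8,)
```

Note that the two modalities may have different sequence lengths (10 vs. 12 frames here); pooling after self-attention lets them be spliced into a single weight vector regardless.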
Funding: supported by the National Natural Science Foundation of China (61071091, 61071166) and the Project Funded by the Priority Academic Program Development of Jiangsu Higher Education Institutions (Information and Communication Engineering).
Abstract: In this paper, a sampling-adaptive block compressed sensing with smooth projected Landweber reconstruction based on edge detection (SA-BCS-SPL-ED) image reconstruction algorithm is presented. The algorithm takes full advantage of the block structure of compressed sensing by assigning each block a sampling rate according to its texture complexity. Block complexity is measured by the variance of the block's texture gradient: blocks with large variance receive high sampling rates and blocks with small variance receive low ones. Meanwhile, to avoid over-sampling and sub-sampling, we set a maximum and a minimum sampling rate for each block. Through an iterative algorithm, the actual sampling rate of the whole image approximately equals the preset value. For the directional transforms, the discrete cosine transform (DCT), dual-tree discrete wavelet transform (DDWT), discrete wavelet transform (DWT), and contourlet transform (CT) are used in the experiments. Experimental results show that, compared to block compressed sensing with smooth projected Landweber (BCS-SPL), the proposed algorithm performs much better on both simple-texture and complicated-texture images at the same sampling rate. Besides, SA-BCS-SPL-ED-DDWT works well for most images, while SA-BCS-SPL-ED-CT is likely better only for images with more complicated textures.
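The per-block rate assignment described above (gradient-variance complexity score, rate bounds, and a rescaling step toward the target overall rate) can be sketched as follows. This is a simplified illustration, not the paper's code: the function name `assign_block_rates`, the block size, the base rate, and the single rescaling pass (the paper iterates) are all assumptions made here.

```python
import numpy as np

def assign_block_rates(image, block=8, base_rate=0.3, r_min=0.1, r_max=0.7):
    # Score each block by the variance of its gradient magnitude and
    # allocate sampling rates proportionally, clipped to [r_min, r_max]
    # to avoid over-sampling and sub-sampling.
    h, w = image.shape
    gy, gx = np.gradient(image.astype(float))
    grad = np.hypot(gx, gy)
    variances = []
    for i in range(0, h, block):
        for j in range(0, w, block):
            variances.append(grad[i:i + block, j:j + block].var())
    v = np.array(variances)
    weights = v / v.mean() if v.mean() > 0 else np.ones_like(v)
    rates = np.clip(base_rate * weights, r_min, r_max)
    # One rescaling step pushes the mean rate back toward the target;
    # the paper's iterative scheme repeats this until the overall
    # sampling rate approximately equals the preset value.
    rates *= base_rate / rates.mean()
    return np.clip(rates, r_min, r_max)

rng = np.random.default_rng(1)
rates = assign_block_rates(rng.normal(size=(32, 32)))
print(len(rates))  # 16 blocks for a 32x32 image with 8x8 blocks
```

The bounds matter: without `r_max`, a single very busy block could absorb most of the measurement budget, and without `r_min`, smooth blocks would be reconstructed from almost no measurements.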
Funding: supported by the National Science Fund of China for Distinguished Young Scholars (No. 60325310), the Shenzhen Science and Technology Plan Project (No. szkj0502), the Guangdong Provincial Natural Science Team Research Project (No. 04205783), and the Ministry of Science and Technology Major Basic Pre-research Special Project (No. 2005CCA04100).