In the era of big data rich inWe Media,the single mode retrieval system has been unable to meet people’s demand for information retrieval.This paper proposes a new solution to the problem of feature extraction and un...In the era of big data rich inWe Media,the single mode retrieval system has been unable to meet people’s demand for information retrieval.This paper proposes a new solution to the problem of feature extraction and unified mapping of different modes:A Cross-Modal Hashing retrieval algorithm based on Deep Residual Network(CMHR-DRN).The model construction is divided into two stages:The first stage is the feature extraction of different modal data,including the use of Deep Residual Network(DRN)to extract the image features,using the method of combining TF-IDF with the full connection network to extract the text features,and the obtained image and text features used as the input of the second stage.In the second stage,the image and text features are mapped into Hash functions by supervised learning,and the image and text features are mapped to the common binary Hamming space.In the process of mapping,the distance measurement of the original distance measurement and the common feature space are kept unchanged as far as possible to improve the accuracy of Cross-Modal Retrieval.In training the model,adaptive moment estimation(Adam)is used to calculate the adaptive learning rate of each parameter,and the stochastic gradient descent(SGD)is calculated to obtain the minimum loss function.The whole training process is completed on Caffe deep learning framework.Experiments show that the proposed algorithm CMHR-DRN based on Deep Residual Network has better retrieval performance and stronger advantages than other Cross-Modal algorithms CMFH,CMDN and CMSSH.展开更多
In recent years,the development of deep learning has further improved hash retrieval technology.Most of the existing hashing methods currently use Convolutional Neural Networks(CNNs)and Recurrent Neural Networks(RNNs)...In recent years,the development of deep learning has further improved hash retrieval technology.Most of the existing hashing methods currently use Convolutional Neural Networks(CNNs)and Recurrent Neural Networks(RNNs)to process image and text information,respectively.This makes images or texts subject to local constraints,and inherent label matching cannot capture finegrained information,often leading to suboptimal results.Driven by the development of the transformer model,we propose a framework called ViT2CMH mainly based on the Vision Transformer to handle deep Cross-modal Hashing tasks rather than CNNs or RNNs.Specifically,we use a BERT network to extract text features and use the vision transformer as the image network of the model.Finally,the features are transformed into hash codes for efficient and fast retrieval.We conduct extensive experiments on Microsoft COCO(MS-COCO)and Flickr30K,comparing with baselines of some hashing methods and image-text matching methods,showing that our method has better performance.展开更多
In recent years,cross-modal hash retrieval has become a popular research field because of its advantages of high efficiency and low storage.Cross-modal retrieval technology can be applied to search engines,crossmodalm...In recent years,cross-modal hash retrieval has become a popular research field because of its advantages of high efficiency and low storage.Cross-modal retrieval technology can be applied to search engines,crossmodalmedical processing,etc.The existing main method is to use amulti-label matching paradigm to finish the retrieval tasks.However,such methods do not use fine-grained information in the multi-modal data,which may lead to suboptimal results.To avoid cross-modal matching turning into label matching,this paper proposes an end-to-end fine-grained cross-modal hash retrieval method,which can focus more on the fine-grained semantic information of multi-modal data.First,the method refines the image features and no longer uses multiple labels to represent text features but uses BERT for processing.Second,this method uses the inference capabilities of the transformer encoder to generate global fine-grained features.Finally,in order to better judge the effect of the fine-grained model,this paper uses the datasets in the image text matching field instead of the traditional label-matching datasets.This article experiment on Microsoft COCO(MS-COCO)and Flickr30K datasets and compare it with the previous classicalmethods.The experimental results show that this method can obtain more advanced results in the cross-modal hash retrieval field.展开更多
Multimodal sentiment analysis aims to understand people’s emotions and opinions from diverse data.Concate-nating or multiplying various modalities is a traditional multi-modal sentiment analysis fusion method.This fu...Multimodal sentiment analysis aims to understand people’s emotions and opinions from diverse data.Concate-nating or multiplying various modalities is a traditional multi-modal sentiment analysis fusion method.This fusion method does not utilize the correlation information between modalities.To solve this problem,this paper proposes amodel based on amulti-head attention mechanism.First,after preprocessing the original data.Then,the feature representation is converted into a sequence of word vectors and positional encoding is introduced to better understand the semantic and sequential information in the input sequence.Next,the input coding sequence is fed into the transformer model for further processing and learning.At the transformer layer,a cross-modal attention consisting of a pair of multi-head attention modules is employed to reflect the correlation between modalities.Finally,the processed results are input into the feedforward neural network to obtain the emotional output through the classification layer.Through the above processing flow,the model can capture semantic information and contextual relationships and achieve good results in various natural language processing tasks.Our model was tested on the CMU Multimodal Opinion Sentiment and Emotion Intensity(CMU-MOSEI)and Multimodal EmotionLines Dataset(MELD),achieving an accuracy of 82.04% and F1 parameters reached 80.59% on the former dataset.展开更多
With the explosive growth of false information on social media platforms, the automatic detection of multimodalfalse information has received increasing attention. Recent research has significantly contributed to mult...With the explosive growth of false information on social media platforms, the automatic detection of multimodalfalse information has received increasing attention. Recent research has significantly contributed to multimodalinformation exchange and fusion, with many methods attempting to integrate unimodal features to generatemultimodal news representations. However, they still need to fully explore the hierarchical and complex semanticcorrelations between different modal contents, severely limiting their performance detecting multimodal falseinformation. This work proposes a two-stage detection framework for multimodal false information detection,called ASMFD, which is based on image aesthetic similarity to segment and explores the consistency andinconsistency features of images and texts. Specifically, we first use the Contrastive Language-Image Pre-training(CLIP) model to learn the relationship between text and images through label awareness and train an imageaesthetic attribute scorer using an aesthetic attribute dataset. Then, we calculate the aesthetic similarity betweenthe image and related images and use this similarity as a threshold to divide the multimodal correlation matrixinto consistency and inconsistencymatrices. Finally, the fusionmodule is designed to identify essential features fordetectingmultimodal false information. In extensive experiments on four datasets, the performance of the ASMFDis superior to state-of-the-art baseline methods.展开更多
Steganography is a technique for hiding secret messages while sending and receiving communications through a cover item.From ancient times to the present,the security of secret or vital information has always been a s...Steganography is a technique for hiding secret messages while sending and receiving communications through a cover item.From ancient times to the present,the security of secret or vital information has always been a significant problem.The development of secure communication methods that keep recipient-only data transmissions secret has always been an area of interest.Therefore,several approaches,including steganography,have been developed by researchers over time to enable safe data transit.In this review,we have discussed image steganography based on Discrete Cosine Transform(DCT)algorithm,etc.We have also discussed image steganography based on multiple hashing algorithms like the Rivest–Shamir–Adleman(RSA)method,the Blowfish technique,and the hash-least significant bit(LSB)approach.In this review,a novel method of hiding information in images has been developed with minimal variance in image bits,making our method secure and effective.A cryptography mechanism was also used in this strategy.Before encoding the data and embedding it into a carry image,this review verifies that it has been encrypted.Usually,embedded text in photos conveys crucial signals about the content.This review employs hash table encryption on the message before hiding it within the picture to provide a more secure method of data transport.If the message is ever intercepted by a third party,there are several ways to stop this operation.A second level of security process implementation involves encrypting and decrypting steganography images using different hashing algorithms.展开更多
The easy generation, storage, transmission and reproduction of digital images have caused serious abuse and security problems. Assurance of the rightful ownership, integrity, and authenticity is a major concern to the...The easy generation, storage, transmission and reproduction of digital images have caused serious abuse and security problems. Assurance of the rightful ownership, integrity, and authenticity is a major concern to the academia as well as the industry. On the other hand, efficient search of the huge amount of images has become a great challenge. Image hashing is a technique suitable for use in image authentication and content based image retrieval (CBIR). In this article, we review some representative image hashing techniques proposed in the recent years, with emphases on how to meet the conflicting requirements of perceptual robustness and security. Following a brief introduction to some earlier methods, we focus on a typical two-stage structure and some geometric-distortion resilient techniques. We then introduce two image hashing approaches developed in our own research, and reveal security problems in some existing methods due to the absence of secret keys in certain stage of the image feature extraction, or availability of a large quantity of images, keys, or the hash function to the adversary. More research efforts are needed in developing truly robust and secure image hashing techniques.展开更多
There is a steep increase in data encoded as symmetric positive definite(SPD)matrix in the past decade.The set of SPD matrices forms a Riemannian manifold that constitutes a half convex cone in the vector space of mat...There is a steep increase in data encoded as symmetric positive definite(SPD)matrix in the past decade.The set of SPD matrices forms a Riemannian manifold that constitutes a half convex cone in the vector space of matrices,which we sometimes call SPD manifold.One of the fundamental problems in the application of SPD manifold is to find the nearest neighbor of a queried SPD matrix.Hashing is a popular method that can be used for the nearest neighbor search.However,hashing cannot be directly applied to SPD manifold due to its non-Euclidean intrinsic geometry.Inspired by the idea of kernel trick,a new hashing scheme for SPD manifold by random projection and quantization in expanded data space is proposed in this paper.Experimental results in large scale nearduplicate image detection show the effectiveness and efficiency of the proposed method.展开更多
Image hashing is a useful multimedia technology for many applications,such as image authentication,image retrieval,image copy detection and image forensics.In this paper,we propose a robust image hashing based on rand...Image hashing is a useful multimedia technology for many applications,such as image authentication,image retrieval,image copy detection and image forensics.In this paper,we propose a robust image hashing based on random Gabor filtering and discrete wavelet transform(DWT).Specifically,robust and secure image features are first extracted from the normalized image by Gabor filtering and a chaotic map called Skew tent map,and then are compressed via a single-level 2-D DWT.Image hash is finally obtained by concatenating DWT coefficients in the LL sub-band.Many experiments with open image datasets are carried out and the results illustrate that our hashing is robust,discriminative and secure.Receiver operating characteristic(ROC)curve comparisons show that our hashing is better than some popular image hashing algorithms in classification performance between robustness and discrimination.展开更多
Studies on the integration of cross-modal information with taste perception has been mostly limited to uni-modal level.The cross-modal sensory interaction and the neural network of information processing and its contr...Studies on the integration of cross-modal information with taste perception has been mostly limited to uni-modal level.The cross-modal sensory interaction and the neural network of information processing and its control were not fully explored and the mechanisms remain poorly understood.This mini review investigated the impact of uni-modal and multi-modal information on the taste perception,from the perspective of cognitive status,such as emotion,expectation and attention,and discussed the hypothesis that the cognitive status is the key step for visual sense to exert influence on taste.This work may help researchers better understand the mechanism of cross-modal information processing and further develop neutrally-based artificial intelligent(AI)system.展开更多
Speech emotion recognition,as an important component of humancomputer interaction technology,has received increasing attention.Recent studies have treated emotion recognition of speech signals as a multimodal task,due...Speech emotion recognition,as an important component of humancomputer interaction technology,has received increasing attention.Recent studies have treated emotion recognition of speech signals as a multimodal task,due to its inclusion of the semantic features of two different modalities,i.e.,audio and text.However,existing methods often fail in effectively represent features and capture correlations.This paper presents a multi-level circulant cross-modal Transformer(MLCCT)formultimodal speech emotion recognition.The proposed model can be divided into three steps,feature extraction,interaction and fusion.Self-supervised embedding models are introduced for feature extraction,which give a more powerful representation of the original data than those using spectrograms or audio features such as Mel-frequency cepstral coefficients(MFCCs)and low-level descriptors(LLDs).In particular,MLCCT contains two types of feature interaction processes,where a bidirectional Long Short-term Memory(Bi-LSTM)with circulant interaction mechanism is proposed for low-level features,while a two-stream residual cross-modal Transformer block is appliedwhen high-level features are involved.Finally,we choose self-attention blocks for fusion and a fully connected layer to make predictions.To evaluate the performance of our proposed model,comprehensive experiments are conducted on three widely used benchmark datasets including IEMOCAP,MELD and CMU-MOSEI.The competitive results verify the effectiveness of our approach.展开更多
Hashing technology has the advantages of reducing data storage and improving the efficiency of the learning system,making it more and more widely used in image retrieval.Multi-view data describes image information mor...Hashing technology has the advantages of reducing data storage and improving the efficiency of the learning system,making it more and more widely used in image retrieval.Multi-view data describes image information more comprehensively than traditional methods using a single-view.How to use hashing to combine multi-view data for image retrieval is still a challenge.In this paper,a multi-view fusion hashing method based on RKCCA(Random Kernel Canonical Correlation Analysis)is proposed.In order to describe image content more accurately,we use deep learning dense convolutional network feature DenseNet to construct multi-view by combining GIST feature or BoW_SIFT(Bag-of-Words model+SIFT feature)feature.This algorithm uses RKCCA method to fuse multi-view features to construct association features and apply them to image retrieval.The algorithm generates binary hash code with minimal distortion error by designing quantization regularization terms.A large number of experiments on benchmark datasets show that this method is superior to other multi-view hashing methods.展开更多
Blindness provides an unparalleled opportunity to study plasticity of the nervous system in humans.Seminal work in this area examined the often dramatic modifications to the visual cortex that result when visual input...Blindness provides an unparalleled opportunity to study plasticity of the nervous system in humans.Seminal work in this area examined the often dramatic modifications to the visual cortex that result when visual input is completely absent from birth or very early in life(Kupers and Ptito,2014).More recent studies explored what happens to the visual pathways in the context of acquired blindness.This is particularly relevant as the majority of diseases that cause vision loss occur in the elderly.展开更多
文摘In the era of big data rich inWe Media,the single mode retrieval system has been unable to meet people’s demand for information retrieval.This paper proposes a new solution to the problem of feature extraction and unified mapping of different modes:A Cross-Modal Hashing retrieval algorithm based on Deep Residual Network(CMHR-DRN).The model construction is divided into two stages:The first stage is the feature extraction of different modal data,including the use of Deep Residual Network(DRN)to extract the image features,using the method of combining TF-IDF with the full connection network to extract the text features,and the obtained image and text features used as the input of the second stage.In the second stage,the image and text features are mapped into Hash functions by supervised learning,and the image and text features are mapped to the common binary Hamming space.In the process of mapping,the distance measurement of the original distance measurement and the common feature space are kept unchanged as far as possible to improve the accuracy of Cross-Modal Retrieval.In training the model,adaptive moment estimation(Adam)is used to calculate the adaptive learning rate of each parameter,and the stochastic gradient descent(SGD)is calculated to obtain the minimum loss function.The whole training process is completed on Caffe deep learning framework.Experiments show that the proposed algorithm CMHR-DRN based on Deep Residual Network has better retrieval performance and stronger advantages than other Cross-Modal algorithms CMFH,CMDN and CMSSH.
基金This work was partially supported by Science and Technology Project of Chongqing Education Commission of China(KJZD-K202200513)National Natural Science Foundation of China(61370205)+1 种基金Chongqing Normal University Fund(22XLB003)Chongqing Education Science Planning Project(2021-GX-320).
文摘In recent years,the development of deep learning has further improved hash retrieval technology.Most of the existing hashing methods currently use Convolutional Neural Networks(CNNs)and Recurrent Neural Networks(RNNs)to process image and text information,respectively.This makes images or texts subject to local constraints,and inherent label matching cannot capture finegrained information,often leading to suboptimal results.Driven by the development of the transformer model,we propose a framework called ViT2CMH mainly based on the Vision Transformer to handle deep Cross-modal Hashing tasks rather than CNNs or RNNs.Specifically,we use a BERT network to extract text features and use the vision transformer as the image network of the model.Finally,the features are transformed into hash codes for efficient and fast retrieval.We conduct extensive experiments on Microsoft COCO(MS-COCO)and Flickr30K,comparing with baselines of some hashing methods and image-text matching methods,showing that our method has better performance.
基金This work was partially supported by Chongqing Natural Science Foundation of China(Grant No.CSTB2022NSCQ-MSX1417)the Science and Technology Research Program of Chongqing Municipal Education Commission(Grant No.KJZD-K202200513)+2 种基金Chongqing Normal University Fund(Grant No.22XLB003)Chongqing Education Science Planning Project(Grant No.2021-GX-320)Humanities and Social Sciences Project of Chongqing Education Commission of China(Grant No.22SKGH100).
文摘In recent years,cross-modal hash retrieval has become a popular research field because of its advantages of high efficiency and low storage.Cross-modal retrieval technology can be applied to search engines,crossmodalmedical processing,etc.The existing main method is to use amulti-label matching paradigm to finish the retrieval tasks.However,such methods do not use fine-grained information in the multi-modal data,which may lead to suboptimal results.To avoid cross-modal matching turning into label matching,this paper proposes an end-to-end fine-grained cross-modal hash retrieval method,which can focus more on the fine-grained semantic information of multi-modal data.First,the method refines the image features and no longer uses multiple labels to represent text features but uses BERT for processing.Second,this method uses the inference capabilities of the transformer encoder to generate global fine-grained features.Finally,in order to better judge the effect of the fine-grained model,this paper uses the datasets in the image text matching field instead of the traditional label-matching datasets.This article experiment on Microsoft COCO(MS-COCO)and Flickr30K datasets and compare it with the previous classicalmethods.The experimental results show that this method can obtain more advanced results in the cross-modal hash retrieval field.
基金supported by the National Natural Science Foundation of China under Grant 61702462the Henan Provincial Science and Technology Research Project under Grants 222102210010 and 222102210064+2 种基金the Research and Practice Project of Higher Education Teaching Reform in Henan Province under Grants 2019SJGLX320 and 2019SJGLX020the Undergraduate Universities Smart Teaching Special Research Project of Henan Province under Grant JiaoGao[2021]No.489-29the Academic Degrees&Graduate Education Reform Project of Henan Province under Grant 2021SJGLX115Y.
文摘Multimodal sentiment analysis aims to understand people’s emotions and opinions from diverse data.Concate-nating or multiplying various modalities is a traditional multi-modal sentiment analysis fusion method.This fusion method does not utilize the correlation information between modalities.To solve this problem,this paper proposes amodel based on amulti-head attention mechanism.First,after preprocessing the original data.Then,the feature representation is converted into a sequence of word vectors and positional encoding is introduced to better understand the semantic and sequential information in the input sequence.Next,the input coding sequence is fed into the transformer model for further processing and learning.At the transformer layer,a cross-modal attention consisting of a pair of multi-head attention modules is employed to reflect the correlation between modalities.Finally,the processed results are input into the feedforward neural network to obtain the emotional output through the classification layer.Through the above processing flow,the model can capture semantic information and contextual relationships and achieve good results in various natural language processing tasks.Our model was tested on the CMU Multimodal Opinion Sentiment and Emotion Intensity(CMU-MOSEI)and Multimodal EmotionLines Dataset(MELD),achieving an accuracy of 82.04% and F1 parameters reached 80.59% on the former dataset.
文摘With the explosive growth of false information on social media platforms, the automatic detection of multimodalfalse information has received increasing attention. Recent research has significantly contributed to multimodalinformation exchange and fusion, with many methods attempting to integrate unimodal features to generatemultimodal news representations. However, they still need to fully explore the hierarchical and complex semanticcorrelations between different modal contents, severely limiting their performance detecting multimodal falseinformation. This work proposes a two-stage detection framework for multimodal false information detection,called ASMFD, which is based on image aesthetic similarity to segment and explores the consistency andinconsistency features of images and texts. Specifically, we first use the Contrastive Language-Image Pre-training(CLIP) model to learn the relationship between text and images through label awareness and train an imageaesthetic attribute scorer using an aesthetic attribute dataset. Then, we calculate the aesthetic similarity betweenthe image and related images and use this similarity as a threshold to divide the multimodal correlation matrixinto consistency and inconsistencymatrices. Finally, the fusionmodule is designed to identify essential features fordetectingmultimodal false information. In extensive experiments on four datasets, the performance of the ASMFDis superior to state-of-the-art baseline methods.
文摘Steganography is a technique for hiding secret messages while sending and receiving communications through a cover item.From ancient times to the present,the security of secret or vital information has always been a significant problem.The development of secure communication methods that keep recipient-only data transmissions secret has always been an area of interest.Therefore,several approaches,including steganography,have been developed by researchers over time to enable safe data transit.In this review,we have discussed image steganography based on Discrete Cosine Transform(DCT)algorithm,etc.We have also discussed image steganography based on multiple hashing algorithms like the Rivest–Shamir–Adleman(RSA)method,the Blowfish technique,and the hash-least significant bit(LSB)approach.In this review,a novel method of hiding information in images has been developed with minimal variance in image bits,making our method secure and effective.A cryptography mechanism was also used in this strategy.Before encoding the data and embedding it into a carry image,this review verifies that it has been encrypted.Usually,embedded text in photos conveys crucial signals about the content.This review employs hash table encryption on the message before hiding it within the picture to provide a more secure method of data transport.If the message is ever intercepted by a third party,there are several ways to stop this operation.A second level of security process implementation involves encrypting and decrypting steganography images using different hashing algorithms.
基金supported by the National Natural Science Foundation of China(Grant No.60502039),the Shanghai Rising-Star Program(Grant No.06QA14022),and the Key project of Shanghai Municipality for Basic Research (Grant No.04JC14037)
文摘The easy generation, storage, transmission and reproduction of digital images have caused serious abuse and security problems. Assurance of the rightful ownership, integrity, and authenticity is a major concern to the academia as well as the industry. On the other hand, efficient search of the huge amount of images has become a great challenge. Image hashing is a technique suitable for use in image authentication and content based image retrieval (CBIR). In this article, we review some representative image hashing techniques proposed in the recent years, with emphases on how to meet the conflicting requirements of perceptual robustness and security. Following a brief introduction to some earlier methods, we focus on a typical two-stage structure and some geometric-distortion resilient techniques. We then introduce two image hashing approaches developed in our own research, and reveal security problems in some existing methods due to the absence of secret keys in certain stage of the image feature extraction, or availability of a large quantity of images, keys, or the hash function to the adversary. More research efforts are needed in developing truly robust and secure image hashing techniques.
文摘There is a steep increase in data encoded as symmetric positive definite(SPD)matrix in the past decade.The set of SPD matrices forms a Riemannian manifold that constitutes a half convex cone in the vector space of matrices,which we sometimes call SPD manifold.One of the fundamental problems in the application of SPD manifold is to find the nearest neighbor of a queried SPD matrix.Hashing is a popular method that can be used for the nearest neighbor search.However,hashing cannot be directly applied to SPD manifold due to its non-Euclidean intrinsic geometry.Inspired by the idea of kernel trick,a new hashing scheme for SPD manifold by random projection and quantization in expanded data space is proposed in this paper.Experimental results in large scale nearduplicate image detection show the effectiveness and efficiency of the proposed method.
基金This work is partially supported by the National Natural Science Foundation of China(Nos.61562007,61762017,61702332)National Key R&D Plan of China(2018YFB1003701)+3 种基金Guangxi“Bagui Scholar”Teams for Innovation and Research,the Guangxi Natural Science Foundation(Nos.2017GXNSFAA198222,2015GXNSFDA139040)the Project of Guangxi Science and Technology(Nos.GuiKeAD17195062)the Project of the Guangxi Key Lab of Multi-source Information Mining&Security(Nos.16-A-02-02,15-A-02-02)the Guangxi Collaborative Innovation Center of Multi-source Information Integration and Intelligent Processing,and the Innovation Project of Guangxi Graduate Education(No.XYCSZ 2018076).
文摘Image hashing is a useful multimedia technology for many applications,such as image authentication,image retrieval,image copy detection and image forensics.In this paper,we propose a robust image hashing based on random Gabor filtering and discrete wavelet transform(DWT).Specifically,robust and secure image features are first extracted from the normalized image by Gabor filtering and a chaotic map called Skew tent map,and then are compressed via a single-level 2-D DWT.Image hash is finally obtained by concatenating DWT coefficients in the LL sub-band.Many experiments with open image datasets are carried out and the results illustrate that our hashing is robust,discriminative and secure.Receiver operating characteristic(ROC)curve comparisons show that our hashing is better than some popular image hashing algorithms in classification performance between robustness and discrimination.
基金This study was supported by the National Natural Science Foundation of China(Nos.61703058,81873701).
文摘Studies on the integration of cross-modal information with taste perception has been mostly limited to uni-modal level.The cross-modal sensory interaction and the neural network of information processing and its control were not fully explored and the mechanisms remain poorly understood.This mini review investigated the impact of uni-modal and multi-modal information on the taste perception,from the perspective of cognitive status,such as emotion,expectation and attention,and discussed the hypothesis that the cognitive status is the key step for visual sense to exert influence on taste.This work may help researchers better understand the mechanism of cross-modal information processing and further develop neutrally-based artificial intelligent(AI)system.
基金the National Natural Science Foundation of China(No.61872231)the National Key Research and Development Program of China(No.2021YFC2801000)the Major Research plan of the National Social Science Foundation of China(No.2000&ZD130).
文摘Speech emotion recognition,as an important component of humancomputer interaction technology,has received increasing attention.Recent studies have treated emotion recognition of speech signals as a multimodal task,due to its inclusion of the semantic features of two different modalities,i.e.,audio and text.However,existing methods often fail in effectively represent features and capture correlations.This paper presents a multi-level circulant cross-modal Transformer(MLCCT)formultimodal speech emotion recognition.The proposed model can be divided into three steps,feature extraction,interaction and fusion.Self-supervised embedding models are introduced for feature extraction,which give a more powerful representation of the original data than those using spectrograms or audio features such as Mel-frequency cepstral coefficients(MFCCs)and low-level descriptors(LLDs).In particular,MLCCT contains two types of feature interaction processes,where a bidirectional Long Short-term Memory(Bi-LSTM)with circulant interaction mechanism is proposed for low-level features,while a two-stream residual cross-modal Transformer block is appliedwhen high-level features are involved.Finally,we choose self-attention blocks for fusion and a fully connected layer to make predictions.To evaluate the performance of our proposed model,comprehensive experiments are conducted on three widely used benchmark datasets including IEMOCAP,MELD and CMU-MOSEI.The competitive results verify the effectiveness of our approach.
基金This work is supported by the National Natural Science Foundation of China(No.61772561)the Key Research&Development Plan of Hunan Province(No.2018NK2012)+1 种基金the Science Research Projects of Hunan Provincial Education Department(Nos.18A174,18C0262)the Science&Technology Innovation Platform and Talent Plan of Hunan Province(2017TP1022).
文摘Hashing technology has the advantages of reducing data storage and improving the efficiency of the learning system,making it more and more widely used in image retrieval.Multi-view data describes image information more comprehensively than traditional methods using a single-view.How to use hashing to combine multi-view data for image retrieval is still a challenge.In this paper,a multi-view fusion hashing method based on RKCCA(Random Kernel Canonical Correlation Analysis)is proposed.In order to describe image content more accurately,we use deep learning dense convolutional network feature DenseNet to construct multi-view by combining GIST feature or BoW_SIFT(Bag-of-Words model+SIFT feature)feature.This algorithm uses RKCCA method to fuse multi-view features to construct association features and apply them to image retrieval.The algorithm generates binary hash code with minimal distortion error by designing quantization regularization terms.A large number of experiments on benchmark datasets show that this method is superior to other multi-view hashing methods.
基金supported by National Institutes of Health Contracts P30-EY008098 and T32-EY017271-06(BethesdaMD)+14 种基金United States Department of Defense DM090217(ArlingtonVA)Alcon Research Institute Young Investigator Grant(Fort WorthTX)Eye and Ear Foundation(PittsburghPA)Research to Prevent Blindness(New YorkNY)Aging Institute Pilot Seed GrantUniversity of Pittsburgh(PittsburghPA)Postdoctoral Fellowship Program in Ocular Tissue Engineering and Regenerative OphthalmologyLouis J.Fox Center for Vision RestorationUniversity of Pittsburgh and UPMC(PittsburghPA)
文摘Blindness provides an unparalleled opportunity to study plasticity of the nervous system in humans.Seminal work in this area examined the often dramatic modifications to the visual cortex that result when visual input is completely absent from birth or very early in life(Kupers and Ptito,2014).More recent studies explored what happens to the visual pathways in the context of acquired blindness.This is particularly relevant as the majority of diseases that cause vision loss occur in the elderly.