Funding: Supported in part by the National Natural Science Foundation of China under Grant U1911401.
Abstract: Visual question answering (VQA) requires a deep understanding of images and their corresponding textual questions in order to answer questions about images accurately. However, existing models tend to ignore the implicit knowledge in images and focus only on their visual information, which limits the depth at which the image content is understood. Images contain more than just visual objects: some contain textual information about the scene, and slightly more complex images contain relationships between individual visual objects. First, this paper proposes a model that uses image descriptions for feature enhancement. The model encodes images and their descriptions separately with a question-guided co-attention mechanism, which enriches the model's feature representations and strengthens its reasoning ability. In addition, this paper improves the bottom-up attention model by extracting two sets of image region features. After obtaining the two visual features and the spatial position information corresponding to each feature, the two features are concatenated to form the final image feature, which represents the image better. Finally, the spatial position information is processed so that the model can perceive the size and relative position of each object in the image. Our best single model achieves 74.16% overall accuracy on the VQA 2.0 dataset, and even outperforms some multi-modal pre-training models while using fewer images and less training time.
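The fusion step described above, concatenating two sets of region features together with a spatial encoding of each region's box, can be sketched as follows. This is a minimal illustration, not the paper's implementation: the feature dimensions, the exact box encoding (here normalized `[x1, y1, x2, y2, w, h]`), and the function names are assumptions for the example.

```python
import numpy as np

def encode_boxes(boxes, img_w, img_h):
    """Normalize each region box to [x1, y1, x2, y2, w, h] relative to the
    image size, so downstream layers can perceive object size and position."""
    boxes = np.asarray(boxes, dtype=np.float32)
    x1 = boxes[:, 0] / img_w
    y1 = boxes[:, 1] / img_h
    x2 = boxes[:, 2] / img_w
    y2 = boxes[:, 3] / img_h
    return np.stack([x1, y1, x2, y2, x2 - x1, y2 - y1], axis=1)

def fuse_region_features(feat_a, feat_b, boxes, img_w, img_h):
    """Concatenate the two per-region feature sets and the spatial encoding
    along the feature axis to obtain the final image feature."""
    pos = encode_boxes(boxes, img_w, img_h)
    return np.concatenate([feat_a, feat_b, pos], axis=1)

# Toy example: 3 regions, two feature extractors of dims 8 and 16.
feat_a = np.random.rand(3, 8).astype(np.float32)
feat_b = np.random.rand(3, 16).astype(np.float32)
boxes = [[0, 0, 100, 50], [20, 30, 80, 90], [50, 10, 200, 120]]
fused = fuse_region_features(feat_a, feat_b, boxes, img_w=200, img_h=150)
```

Each fused row then carries both appearance features and a 6-dimensional position/size signal for its region.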
Funding: Supported by the Xinjiang Uygur Autonomous Region Key Laboratory Project (2015KL013), sub-topics of the National Key Basic Research and Development Program (973 Program) (2014CB340506, 213-61590), and the National Natural Science Foundation of China (61433012, U1435215, U1603262).
Abstract: Motivated by the practical needs of speech applications such as speech recognition and voiceprint recognition, the acoustic characteristics and recognition of the Hotan dialect were studied for the first time. First, Hotan dialect speech was selected for manual multi-level annotation, and the formants, duration, and intensity of the vowels were analyzed to statistically describe the main vowel patterns of the Hotan dialect and the pronunciation characteristics of male and female speakers. Then, analysis of variance and nonparametric tests were applied to the formant samples of the three dialects of the Uyghur language; the results show significant differences among the three dialects in the formant distribution patterns of male vowels, female vowels, and all vowels combined. Finally, GMM-UBM (Gaussian Mixture Model-Universal Background Model), DNN-UBM (Deep Neural Network-Universal Background Model), and LSTM-UBM (Long Short-Term Memory Network-Universal Background Model) Uyghur dialect recognition models were constructed. Using Mel-frequency cepstral coefficients, both alone and combined with formant frequencies, as the input features, comparative experiments on the distinctiveness of dialect i-vectors were carried out. The experimental results show that adding the formant coefficients to the combined features improves dialect recognition, and that the LSTM-UBM model extracts more discriminative dialect i-vectors than GMM-UBM and DNN-UBM.
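The feature-combination and GMM-based classification steps above can be sketched with a toy example. This is only an illustrative sketch, not the paper's i-vector pipeline: a full GMM-UBM system trains a universal background model on pooled data and adapts it per class, whereas here, for brevity, one Gaussian mixture is fitted per dialect on synthetic frames. The dialect names, dimensions, and data are placeholders.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)

def combine_features(mfcc, formants):
    """Append per-frame formant frequencies (e.g., F1-F3) to MFCC frames,
    mirroring the combined feature used in the experiments."""
    return np.concatenate([mfcc, formants], axis=1)

def toy_frames(mean, n=200, dim=5):
    """Synthetic stand-in for real combined MFCC+formant frames."""
    return rng.normal(mean, 1.0, size=(n, dim))

# Toy training frames for three hypothetical dialect classes.
train = {"dialect_a": toy_frames(0.0),
         "dialect_b": toy_frames(2.0),
         "dialect_c": toy_frames(-2.0)}

# One small GMM per dialect (a real GMM-UBM would adapt a shared UBM instead).
models = {d: GaussianMixture(n_components=2, random_state=0).fit(x)
          for d, x in train.items()}

def classify(frames):
    """Pick the dialect whose model gives the highest mean log-likelihood."""
    return max(models, key=lambda d: models[d].score(frames))
```

Classification by maximum average log-likelihood over an utterance's frames is the standard decision rule for such per-class mixture models.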