Funding: Supported by the Science and Technology Research Project of the Jiangxi Education Department (Grant No. GJJ2203306).
Abstract: Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and images, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation stems from their training on unimodal data and necessitates complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
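The abstract does not give implementation details for the multimodal contrastive objective; the sketch below shows one common formulation such a method could take, a symmetric InfoNCE-style loss over paired image and text embeddings. The function name, temperature value, and PyTorch framing are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of an image-text contrastive objective (InfoNCE-style),
# assuming paired image/text embeddings from a vision-language encoder.
import torch
import torch.nn.functional as F

def multimodal_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    # L2-normalize both modalities so the dot product is cosine similarity
    img_emb = F.normalize(img_emb, dim=-1)
    txt_emb = F.normalize(txt_emb, dim=-1)
    # Pairwise similarity matrix: entry (i, j) compares image i with text j
    logits = img_emb @ txt_emb.t() / temperature
    # Matching image-text pairs lie on the diagonal
    targets = torch.arange(img_emb.size(0), device=img_emb.device)
    # Symmetric loss: image-to-text and text-to-image directions
    loss_i2t = F.cross_entropy(logits, targets)
    loss_t2i = F.cross_entropy(logits.t(), targets)
    return 0.5 * (loss_i2t + loss_t2i)
```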
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12173027, 12303105, 12173062); the National Key R&D Program of China (Grant Nos. 2023YFF0725300, 2022YFF0503402); the Science Research Grants from the Square Kilometre Array (SKA) (2020SKA0110100); the Science Research Grants from the China Manned Space Project (Grant Nos. CMS-CSST-2021-A01, CMS-CSST-2021-A07, CMS-CSST-2021-B05); the CAS Project for Young Scientists in Basic Research, China (Grant No. YSBR-062); the Young Data Scientist Project of the National Astronomical Data Center; and the Program of Science and Education Integration at the School of Astronomy and Space Science, University of Chinese Academy of Sciences, China.
Abstract: The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. In response, astronomers are turning to deep learning techniques, but these methods are limited by their specific training sets, leading to considerable duplicate workloads. To overcome this issue, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratios of galaxy images and the imbalanced distribution of galaxy categories, we designed our LVM to incorporate a human-in-the-loop (HITL) module, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability for all the above-mentioned tasks on galaxy images in the DESI Legacy Imaging Surveys. In particular, for the object detection task, which was trained using 1000 data points, our DST in the LVM achieved an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reached an accuracy of 93.1%. For morphological classification, to obtain an area under the curve (AUC) of ~0.9, the LVM plus DST and HITL required only 1/50 of the training data that ResNet18 required. In addition, multimodal data can be integrated, which creates possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
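As a rough illustration of how a downstream task can sit on top of a frozen large vision model, the following sketch attaches a small trainable classification head to a pre-trained backbone, which is the usual way such few-shot heads are built; the module names, feature dimension, and number of morphology classes are assumptions rather than details from the paper.

```python
# Illustrative sketch: a frozen large-vision-model backbone plus a small
# trainable head for galaxy morphological classification. Dimensions and
# class counts are placeholders, not the authors' configuration.
import torch
import torch.nn as nn

class GalaxyDownstreamHead(nn.Module):
    def __init__(self, backbone, feat_dim=768, n_classes=5):
        super().__init__()
        self.backbone = backbone              # pre-trained LVM, kept frozen
        for p in self.backbone.parameters():
            p.requires_grad = False
        self.head = nn.Linear(feat_dim, n_classes)  # only this layer is trained

    def forward(self, images):
        with torch.no_grad():
            # Assumed backbone output: (batch, feat_dim) image features
            feats = self.backbone(images)
        return self.head(feats)
```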
Funding: Supported by the National Natural Science Foundation of China (No. 61125101).
Abstract: To improve the positioning accuracy and execution efficiency of robot binocular vision, a binocular vision positioning method based on coarse-fine stereo matching is proposed to achieve object positioning. The random ferns algorithm is used in the coarse matching to identify objects in the left and right images, and the pixel coordinates of the object center points in the two images are calculated to complete the center matching. In the fine matching, the right center point is treated as an estimate to set the search range in the right image, within which region matching is performed to find the best match for the left center point. Then, the similar-triangle principle of the binocular vision model is used to calculate the 3D coordinates of the center point, achieving fast and accurate object positioning. Finally, the proposed method is applied to object scene images and a robotic arm grasping platform. The experimental results show that the average absolute positioning error and average relative positioning error of the proposed method are 8.22 mm and 1.96%, respectively, when the object's depth distance is within 600 mm, and the time consumption is less than 1.029 s. The method can meet the needs of a robot grasping system and offers better accuracy and robustness.
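The similar-triangle step can be made concrete for a rectified stereo pair: given the matched center points in the left and right images, depth follows from the horizontal disparity, and the 3D point is recovered by back-projection. The function below is a minimal sketch under that assumption, with placeholder camera intrinsics; it is not the authors' implementation.

```python
# Sketch of the similar-triangle (triangulation) step for a rectified stereo
# pair: recover the 3D coordinates of the matched center point from its
# pixel disparity. Camera parameters are placeholders.
def triangulate_center(u_left, v_left, u_right, fx, fy, cx, cy, baseline):
    """Return (X, Y, Z) in the left-camera frame, in the units of `baseline`."""
    disparity = u_left - u_right          # horizontal pixel disparity
    if disparity <= 0:
        raise ValueError("non-positive disparity: point at infinity or bad match")
    Z = fx * baseline / disparity         # depth from similar triangles
    X = (u_left - cx) * Z / fx            # back-project the pixel to 3D
    Y = (v_left - cy) * Z / fy
    return X, Y, Z
```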
Funding: Supported by the National Natural Science Foundation of China (31671128, 31925020, 31700999, 31700943, and 31500882), the Changjiang Scholar Professorship Award (T2016031), and the Fundamental Research Funds for the Central Universities (2017EYT35).
Abstract: Visual object recognition in humans and nonhuman primates is achieved by the ventral visual pathway (ventral occipital-temporal cortex, VOTC), which shows a well-documented object domain structure. An ongoing question is what type of information processed in the higher-order VOTC underlies such observations, with recent evidence suggesting effects of certain visual features. Combining computational vision models, an fMRI experiment using a parametric-modulation approach, and natural image statistics of common objects, we depicted the neural distribution of a comprehensive set of visual features in the VOTC, identifying voxel sensitivities to specific feature sets across geometry/shape, Fourier power, and color. The visual feature combination pattern in the VOTC is significantly explained by the features' relationships to different types of response-action computation (fight-or-flight, navigation, and manipulation), as derived from behavioral ratings and natural image statistics. These results offer a comprehensive visual feature map of the VOTC and a plausible theoretical explanation as a mapping onto different types of downstream response-action systems.
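One way to picture the parametric-modulation analysis is as a voxel-wise regression of trial responses on the visual-feature values of the presented objects, yielding a per-voxel sensitivity weight for each feature. The snippet below is only a schematic of that idea; the array shapes and the plain least-squares solver are simplifying assumptions, not the study's actual GLM pipeline.

```python
# Schematic voxel-wise feature-sensitivity estimate: regress each voxel's
# trial responses on the feature values of the presented objects.
import numpy as np

def feature_sensitivity_map(voxel_responses, feature_values):
    """voxel_responses: (n_trials, n_voxels); feature_values: (n_trials, n_features).
    Returns an (n_features, n_voxels) array of regression weights."""
    # Add an intercept column and solve ordinary least squares for all voxels
    design = np.column_stack([feature_values, np.ones(len(feature_values))])
    betas, *_ = np.linalg.lstsq(design, voxel_responses, rcond=None)
    return betas[:-1]  # drop the intercept row, keep per-feature weights
```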