The exponential growth of astronomical datasets provides an unprecedented opportunity for humans to gain insight into the Universe. However, effectively analyzing this vast amount of data poses a significant challenge. In response, astronomers are turning to deep learning techniques, but these methods are limited by their specific training sets, leading to considerable duplicated effort. To overcome this issue, we built a framework for the general analysis of galaxy images based on a large vision model (LVM) plus downstream tasks (DST), including galaxy morphological classification, image restoration, object detection, parameter extraction, and more. Considering the low signal-to-noise ratios of galaxy images and the imbalanced distribution of galaxy categories, we designed our LVM to incorporate a human-in-the-loop (HITL) module, which leverages human knowledge to enhance the reliability and interpretability of processing galaxy images interactively. The proposed framework exhibits notable few-shot learning capabilities and versatile adaptability for all the abovementioned tasks on galaxy images in the DESI Legacy Imaging Surveys. In particular, for the object detection task, trained on 1000 data points, our DST in the LVM achieved an accuracy of 96.7%, while ResNet50 plus Mask R-CNN reached an accuracy of 93.1%. For morphological classification, to obtain an area under the curve (AUC) of ~0.9, LVM plus DST and HITL required only 1/50 of the training data that ResNet18 required. In addition, multimodal data can be integrated, which creates possibilities for conducting joint analyses with datasets spanning diverse domains in the era of multi-messenger astronomy.
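As an illustration of the LVM-plus-DST pattern this abstract describes, the sketch below attaches a small trainable task head to a frozen pretrained vision backbone, which is what makes the few-shot behavior possible. The torchvision ResNet50 backbone, the DownstreamClassifier name, and the five-class output are stand-ins chosen for this sketch, not the authors' actual LVM or heads.

```python
# Minimal sketch of the LVM-plus-DST pattern: a frozen pretrained
# backbone supplies general-purpose features, and only a small task
# head is trained on the few labeled galaxy images available.
# The ResNet50 backbone here is a stand-in for the paper's LVM.
import torch
import torch.nn as nn
from torchvision.models import resnet50, ResNet50_Weights

class DownstreamClassifier(nn.Module):  # hypothetical name
    def __init__(self, num_classes: int):
        super().__init__()
        backbone = resnet50(weights=ResNet50_Weights.DEFAULT)
        backbone.fc = nn.Identity()          # expose 2048-d features
        for p in backbone.parameters():      # freeze the "LVM"
            p.requires_grad = False
        self.backbone = backbone
        self.head = nn.Linear(2048, num_classes)  # trainable DST head

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        with torch.no_grad():
            feats = self.backbone(x)
        return self.head(feats)

model = DownstreamClassifier(num_classes=5)
logits = model(torch.randn(4, 3, 224, 224))  # a batch of galaxy cutouts
print(logits.shape)  # torch.Size([4, 5])
```

In a HITL setting, low-confidence predictions from such a head would be routed to a human annotator and the corrected labels fed back into training.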
Multimodal sentiment analysis is an essential area of research in artificial intelligence that combines multiple modalities, such as text and image, to accurately assess sentiment. However, conventional approaches that rely on unimodal pre-trained models for feature extraction from each modality often overlook the intrinsic connections of semantic information between modalities. This limitation is attributed to their training on unimodal data, and it necessitates the use of complex fusion mechanisms for sentiment analysis. In this study, we present a novel approach that combines a vision-language pre-trained model with a proposed multimodal contrastive learning method. Our approach harnesses the power of transfer learning by utilizing a vision-language pre-trained model to extract both visual and textual representations in a unified framework. We employ a Transformer architecture to integrate these representations, thereby enabling the capture of rich semantic information in image-text pairs. To further enhance the representation learning of these pairs, we introduce our proposed multimodal contrastive learning method, which leads to improved performance in sentiment analysis tasks. Our approach is evaluated through extensive experiments on two publicly accessible datasets, where we demonstrate its effectiveness. We achieve a significant improvement in sentiment analysis accuracy, indicating the superiority of our approach over existing techniques. These results highlight the potential of multimodal sentiment analysis and underscore the importance of considering the intrinsic semantic connections between modalities for accurate sentiment assessment.
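To make the contrastive component concrete, here is a minimal sketch of the symmetric image-text contrastive (InfoNCE) objective that this family of methods builds on; the embedding dimension and temperature are illustrative defaults, not the authors' settings.

```python
# Symmetric image-text contrastive loss: matched pairs sit on the
# diagonal of the similarity matrix and are contrasted against all
# other pairs in the batch, pulling aligned image-text embeddings
# together in the shared space.
import torch
import torch.nn.functional as F

def contrastive_loss(img_emb: torch.Tensor,
                     txt_emb: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    img = F.normalize(img_emb, dim=-1)
    txt = F.normalize(txt_emb, dim=-1)
    logits = img @ txt.t() / temperature          # (B, B) similarities
    targets = torch.arange(img.size(0), device=img.device)
    # average the image-to-text and text-to-image directions
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

loss = contrastive_loss(torch.randn(8, 256), torch.randn(8, 256))
```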
This paper theoretically analyzes the coordinate frames of a 3D vision scanning system, establishes a mathematical model of the system's scanning process, and derives both the relationship between the general non-orthonormal sensor coordinate system and the machine coordinate system and the coordinate transformation matrix for the system's extrinsic calibration.
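The abstract does not give the derivation, but the kind of mapping it describes can be sketched as follows: a point expressed in a (possibly non-orthonormal) sensor frame is carried into machine coordinates by a homogeneous matrix whose columns are the sensor basis vectors and origin written in machine coordinates. The numeric values below are made-up placeholders, not calibration results from the paper.

```python
# Illustrative sensor-to-machine transform for a non-orthonormal
# sensor frame: the basis vectors need not be unit length or mutually
# orthogonal, which is exactly why the general matrix form is needed.
import numpy as np

def sensor_to_machine(e1, e2, e3, origin):
    """Build the 4x4 homogeneous transform from sensor to machine frame."""
    T = np.eye(4)
    T[:3, 0] = e1        # sensor x-axis in machine coords
    T[:3, 1] = e2        # sensor y-axis (need not be orthogonal to e1)
    T[:3, 2] = e3        # sensor z-axis
    T[:3, 3] = origin    # sensor origin in machine coords
    return T

T = sensor_to_machine(e1=[1.0, 0.02, 0.0],
                      e2=[0.01, 1.0, 0.03],   # slightly skewed basis
                      e3=[0.0, 0.0, 1.0],
                      origin=[120.0, 45.0, 300.0])
p_sensor = np.array([2.0, 1.5, 10.0, 1.0])    # homogeneous point
p_machine = T @ p_sensor
```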
This paper presents a method for structured scene modeling using a micro stereo vision system with a large field of view. The proposed algorithm includes edge detection with a Canny detector, line fitting with a principal-axis-based approach, finding corresponding lines using a feature-based matching method, and 3D line depth computation.
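The first two stages of that pipeline can be sketched with OpenCV as below. The probabilistic Hough transform here is a stand-in for the paper's principal-axis line fitting, and the thresholds and synthetic input are illustrative only; stereo matching and depth recovery are omitted.

```python
# Canny edge detection followed by straight-line extraction, the
# front end of a structured-scene pipeline. A synthetic rectangle
# is used so the sketch runs without an input image.
import cv2
import numpy as np

img = np.zeros((240, 320), dtype=np.uint8)
cv2.rectangle(img, (60, 60), (260, 180), color=255, thickness=2)

edges = cv2.Canny(img, threshold1=50, threshold2=150)

# Probabilistic Hough as a stand-in for the line-fitting stage.
lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                        threshold=80, minLineLength=30, maxLineGap=5)
if lines is not None:
    for x1, y1, x2, y2 in lines[:, 0]:
        print(f"segment ({x1},{y1}) -> ({x2},{y2})")
```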
As the prevalence of diabetic retinopathy continues to rise, the Chronic Care Model (CCM) offers a transformative, patient-focused approach for efficient diabetic retinopathy care, emphasizing the need for urgent and innovative strategies in the United States. The model integrates community resources, healthcare organizations, self-management support, delivery system design, decision support, and clinical information systems. Addressing challenges and solutions, the model emphasizes proactive and preventive measures, collaborative multidisciplinary care, technological integration, and overcoming resistance to change. This paper proposes the Chronic Care Model (CCM) as a possible public health framework for the comprehensive management of diabetic retinopathy in the United States. Implementing the CCM offers a comprehensive approach to diabetic retinopathy care, addressing both the individual and systemic factors essential for improving public health outcomes.
This paper presents a study of the action control of a dexterous mechanical gripper based on a stereo-vision system. The vision-based system replaces a data-glove for gesture measurement. Stereo vision theory is applied to calculate the 3D information of the hand gesture, and this information is used to generate the grasping action parameters of a 3-finger dexterous mechanical gripper. Combined with a force feedback device, a closed control loop can be constructed. Tests of the precision of the algorithms and action control simulation results are presented.
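The core geometric step in such a system is stereo triangulation of hand feature points. A minimal sketch is given below, assuming rectified cameras; the focal length, baseline, and principal-point values are made-up parameters, not the paper's calibration.

```python
# Depth from disparity for a fingertip seen at pixel (xl, y) in the
# left camera and (xr, y) in the rectified right camera:
#   Z = f * b / (xl - xr), then back-project to camera coordinates.
import numpy as np

def triangulate(xl: float, xr: float, y: float,
                f: float = 700.0,    # focal length in pixels (assumed)
                b: float = 0.12,     # baseline in metres (assumed)
                cx: float = 320.0, cy: float = 240.0) -> np.ndarray:
    disparity = xl - xr
    Z = f * b / disparity            # depth from disparity
    X = (xl - cx) * Z / f            # back-project to camera coords
    Y = (y - cy) * Z / f
    return np.array([X, Y, Z])

fingertip = triangulate(xl=352.0, xr=310.0, y=200.0)
# 3D points like this one parameterize the grasp of the 3-finger gripper.
```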
This paper proposes a novel model fusion approach to enhance the predictive capabilities of vision and language models by strategically integrating object detection and large language models. We name this multimodal integration approach VOLTRON (Vision Object Linguistic Translation for Responsive Observation and Narration). VOLTRON aims to improve responses for self-driving vehicles in detecting small objects crossing roads and identifying merged or narrower lanes. The models are fused using a single layer that provides LLaMA2 (Large Language Model Meta AI) with object detection probabilities from YoloV8-n (You Only Look Once) translated into sentences. Experiments using specialized datasets showed accuracy improvements of up to 88.16%. We provide a comprehensive exploration of the theoretical aspects that inform our model fusion approach, detailing the fundamental principles upon which it is built. Moreover, we elucidate the methodologies employed for merging these two disparate models, shedding light on the techniques and strategies used.
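The "translation into sentences" step can be sketched as a simple verbalization layer over detector outputs. The detection tuples below are mocked for illustration; in the paper they would come from YoloV8-n and the resulting sentence would condition LLaMA2.

```python
# Verbalize (class, confidence) detections into a sentence that a
# language model can condition on.
def detections_to_sentence(detections):
    parts = [f"a {label} with probability {conf:.2f}"
             for label, conf in detections]
    return "The camera detects " + ", ".join(parts) + "."

mock = [("pedestrian", 0.91), ("merged lane", 0.78)]
prompt = detections_to_sentence(mock)
print(prompt)
# The camera detects a pedestrian with probability 0.91,
# a merged lane with probability 0.78.
```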
Funding (galaxy image analysis framework): supported by the National Natural Science Foundation of China (Grant Nos. 12173027, 12303105, 12173062); the National Key R&D Program of China (Grant Nos. 2023YFF0725300, 2022YFF0503402); the Science Research Grants from the Square Kilometre Array (SKA) (2020SKA0110100); the Science Research Grants from the China Manned Space Project (Grant Nos. CMS-CSST-2021-A01, CMS-CSST-2021-A07, CMS-CSST-2021-B05); the CAS Project for Young Scientists in Basic Research, China (Grant No. YSBR-062); the Young Data Scientist Project of the National Astronomical Data Center; and the Program of Science and Education Integration at the School of Astronomy and Space Science, University of Chinese Academy of Sciences, China.
Funding (multimodal sentiment analysis study): supported by the Science and Technology Research Project of the Jiangxi Education Department (Project Grant No. GJJ2203306).