The contribution of this work is twofold: (1) a multimodality prediction method of chaotic time series with the Gaussian process mixture (GPM) model is proposed, which employs a divide-and-conquer strategy. It automatically divides the chaotic time series into multiple modalities with different extrinsic patterns and intrinsic characteristics, and thus can more precisely fit the chaotic time series. (2) An effective sparse hard-cut expectation maximization (SHC-EM) learning algorithm for the GPM model is proposed to improve the prediction performance. SHC-EM replaces a large learning sample set with fewer pseudo inputs, accelerating model learning based on these pseudo inputs. Experiments on Lorenz and Chua time series demonstrate that the proposed method yields not only accurate multimodality prediction, but also the prediction confidence interval. SHC-EM outperforms traditional variational learning in terms of both prediction accuracy and speed. In addition, SHC-EM is more robust and less susceptible to noise than variational learning.
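The confidence-interval prediction mentioned above can be illustrated with a single Gaussian process regressor; the mixture structure and the SHC-EM algorithm themselves are not reproduced here, and the data, kernel, and noise level below are illustrative assumptions only:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

# Illustrative sketch: one Gaussian process fitted to a short noisy sine
# series, standing in for a single expert of a GPM mixture.
rng = np.random.default_rng(0)
t = np.linspace(0, 2 * np.pi, 40)
y = np.sin(t) + 0.05 * rng.standard_normal(t.size)

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0), alpha=0.05**2)
gp.fit(t.reshape(-1, 1), y)

t_new = np.array([[np.pi / 2]])              # predict one point of the series
mean, std = gp.predict(t_new, return_std=True)
lower, upper = mean - 1.96 * std, mean + 1.96 * std  # 95% confidence interval
print(f"mean={mean[0]:.3f}, interval=({lower[0]:.3f}, {upper[0]:.3f})")
```

In a full GPM, each expert would produce such a predictive mean and interval on its own modality, and the mixture would weight the experts.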
The potential for reducing greenhouse gas (GHG) emissions and energy consumption in wastewater treatment can be realized through intelligent control, with machine learning (ML) and multimodality emerging as a promising solution. Here, we introduce an ML technique based on multimodal strategies, focusing specifically on intelligent aeration control in wastewater treatment plants (WWTPs). The generalization of the multimodal strategy is demonstrated on eight ML models. The results demonstrate that this multimodal strategy significantly enhances model indicators for ML in environmental science and the efficiency of aeration control, exhibiting exceptional performance and interpretability. Integrating random forest with visual models achieves the highest accuracy in forecasting aeration quantity among the multimodal models, with a mean absolute percentage error of 4.4% and a coefficient of determination of 0.948. Practical testing in a full-scale plant reveals that the multimodal model can reduce operation costs by 19.8% compared to traditional fuzzy control methods. The potential application of these strategies in critical water science domains is discussed. To foster accessibility and promote widespread adoption, the multimodal ML models are freely available on GitHub, thereby eliminating technical barriers and encouraging the application of artificial intelligence in urban wastewater treatment.
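The two reported metrics (mean absolute percentage error and coefficient of determination) can be computed as in the following hedged sketch, which substitutes a synthetic regression task for the plant data; the feature semantics in the comment are invented for illustration:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_percentage_error, r2_score

# Synthetic stand-in for the plant data: four process features predicting
# an "aeration quantity" target (feature meanings are assumptions).
rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(300, 4))     # e.g. inflow, DO, NH4-N, temperature
y = 5.0 + 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.1 * rng.standard_normal(300)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X[:200], y[:200])              # train on the first 200 samples
pred = model.predict(X[200:])            # evaluate on the held-out 100

mape = mean_absolute_percentage_error(y[200:], pred)
r2 = r2_score(y[200:], pred)
print(f"MAPE={mape:.3f}  R2={r2:.3f}")
```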
This paper presents an end-to-end deep learning method to solve geometry problems via feature learning and contrastive learning of multimodal data. A key challenge in solving geometry problems using deep learning is to automatically adapt to the task of understanding single-modal and multimodal problems. Existing methods focus on either single-modal or multimodal problems and cannot handle both. A general geometry problem solver should obviously be able to process problems of various modalities at the same time. In this paper, a shared feature-learning model of multimodal data is adopted to learn the unified feature representation of text and image, which can solve the heterogeneity issue between multimodal geometry problems. A contrastive learning model of multimodal data enhances the semantic relevance between multimodal features and maps them into a unified semantic space, which can effectively adapt to both single-modal and multimodal downstream tasks. Based on the feature extraction and fusion of multimodal data, the proposed geometry problem solver uses relation extraction, theorem reasoning, and problem solving to present solutions in a readable way. Experimental results show the effectiveness of the method.
This mixed-method empirical study investigated the role of learning strategies and motivation in predicting L2 Chinese learning outcomes in an online multimodal learning environment. Both quantitative and qualitative approaches also examined the learners' perspectives on online multimodal Chinese learning. The participants were fifteen pre-intermediate adult Chinese learners aged 18-26. They were originally from different countries (Spain, Italy, Argentina, Colombia, and Mexico) and lived in Barcelona. They were multilingual, speaking more than two European languages, without exposure to any Asian language other than Chinese. The investigation employed the Strategy Inventory for Language Learning (SILL), a motivation questionnaire, a learner perception questionnaire, and focus group interviews. The trial period lasted three months; after the experiment, the data were analyzed using the Spearman correlation coefficient. The statistical analysis showed that strategy use was highly correlated with online multimodal Chinese learning outcomes, indicating that strategy use played a vital role in online multimodal Chinese learning. Motivation was also found to have a significant effect. The perception questionnaire revealed that the students were overall satisfied with, and favored, the design of the online multimodal learning experience. Detailed insights from the participants are presented in the transcript analysis of the focus group interviews.
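The statistical step described above, Spearman rank correlation between strategy use and outcomes, can be sketched as follows; the fifteen paired values are fabricated for illustration and are not the study's data:

```python
from scipy.stats import spearmanr

# Fabricated example data: SILL strategy-use scores and outcome scores for
# fifteen hypothetical learners (matching the study's sample size only).
sill_scores = [3.1, 3.4, 2.8, 4.0, 3.7, 2.5, 3.9, 3.2, 4.2, 2.9,
               3.5, 3.8, 2.7, 4.1, 3.3]
outcomes = [68, 72, 60, 88, 80, 55, 85, 70, 90, 62,
            75, 83, 58, 87, 71]

# Spearman's rho is rank-based, so it captures monotonic (not only linear)
# association, which suits ordinal questionnaire scores.
rho, p_value = spearmanr(sill_scores, outcomes)
print(f"rho={rho:.2f}, p={p_value:.4f}")
```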
Humans can perceive our complex world through multi-sensory fusion. Under limited visual conditions, people can sense a variety of tactile signals to identify objects accurately and rapidly. However, replicating this unique capability in robots remains a significant challenge. Here, we present a new form of ultralight multifunctional tactile nano-layered carbon aerogel sensor that provides pressure, temperature, material recognition and 3D location capabilities, and is combined with multimodal supervised learning algorithms for object recognition. The sensor exhibits human-like pressure (0.04–100 kPa) and temperature (21.5–66.2 °C) detection, millisecond response times (11 ms), a pressure sensitivity of 92.22 kPa⁻¹ and triboelectric durability of over 6000 cycles. The devised algorithm has universality and can accommodate a range of application scenarios. The tactile system can identify common foods in a kitchen scene with 94.63% accuracy and explore the topographic and geomorphic features of a Mars scene with 100% accuracy. This sensing approach empowers robots with versatile tactile perception to advance future society toward heightened sensing, recognition and intelligence.
Cross-lingual image description, the task of generating image captions in a target language from images and descriptions in a source language, is addressed in this study through a novel approach that combines neural network models and semantic matching techniques. Experiments conducted on the Flickr8k and AraImg2k benchmark datasets, featuring images and descriptions in English and Arabic, showcase remarkable performance improvements over state-of-the-art methods. Our model, equipped with the Image & Cross-Language Semantic Matching module and the Target Language Domain Evaluation module, significantly enhances the semantic relevance of generated image descriptions. For English-to-Arabic and Arabic-to-English cross-language image description, our approach achieves CIDEr scores of 87.9% and 81.7% for English and Arabic, respectively, emphasizing the substantial contributions of our methodology. Comparative analyses with previous works further affirm the superior performance of our approach, and visual results underscore that our model generates image captions that are both semantically accurate and stylistically consistent with the target language. In summary, this study advances the field of cross-lingual image description, offering an effective solution for generating image captions across languages, with the potential to impact multilingual communication and accessibility. Future research directions include expanding to more languages and incorporating diverse visual and textual data sources.
Objective: To develop a multimodal deep-learning model for classifying Chinese medicine constitution, i.e., the balanced and unbalanced constitutions, based on inspection of tongue and face images, pulse waves from palpation, and health information from a total of 540 subjects. Methods: The study data consisted of tongue and face images, pulse waves obtained by palpation, and health information, including personal information, life habits, medical history, and current symptoms, from 540 subjects (202 males and 338 females). Convolutional neural networks, recurrent neural networks, and fully connected neural networks were used to extract deep features from the data. Feature fusion and decision fusion models were constructed for the multimodal data. Results: The optimal models for tongue and face images, pulse waves, and health information were ResNet18, the Gated Recurrent Unit, and entity embedding, respectively. Feature fusion was superior to decision fusion. The multimodal analysis revealed that multimodal data compensated for the loss of information from a single mode, resulting in improved classification performance. Conclusions: Multimodal data fusion can supplement single-mode information and improve classification performance. Our research underscores the effectiveness of multimodal deep learning technology in identifying body constitution for modernizing and improving the intelligent application of Chinese medicine.
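The contrast between feature fusion and decision fusion can be sketched schematically; plain logistic regression on synthetic two-modality data stands in for the paper's deep networks (ResNet18, GRU, entity embeddings), so this shows only the fusion mechanics, not the reported result:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Two synthetic "modalities" carrying the same binary label with noise
# (e.g. image features vs. pulse-wave features; meanings are assumptions).
rng = np.random.default_rng(2)
n = 400
labels = rng.integers(0, 2, n)                               # two classes
mod_a = labels[:, None] + 0.8 * rng.standard_normal((n, 3))
mod_b = labels[:, None] + 0.8 * rng.standard_normal((n, 3))
tr, te = slice(0, 300), slice(300, n)

# Feature fusion: concatenate modality features, train one classifier.
feat = np.hstack([mod_a, mod_b])
clf = LogisticRegression().fit(feat[tr], labels[tr])
acc_feature = clf.score(feat[te], labels[te])

# Decision fusion: one classifier per modality, average their probabilities.
clf_a = LogisticRegression().fit(mod_a[tr], labels[tr])
clf_b = LogisticRegression().fit(mod_b[tr], labels[tr])
proba = (clf_a.predict_proba(mod_a[te]) + clf_b.predict_proba(mod_b[te])) / 2
acc_decision = (proba.argmax(axis=1) == labels[te]).mean()
print(f"feature fusion: {acc_feature:.2f}, decision fusion: {acc_decision:.2f}")
```

Feature fusion lets the classifier model cross-modality interactions; decision fusion only combines per-modality opinions, which is one intuition for the superiority of feature fusion reported above.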
The prediction of drug-drug interactions (DDIs) is a crucial task for drug safety research, and identifying potential DDIs helps us to explore the mechanisms behind combinatorial therapy. Traditional wet-chemistry experiments for DDI are cumbersome, time-consuming, and too small in scale, limiting the efficiency of DDI prediction. Therefore, it is particularly important to develop improved computational methods for detecting drug interactions. With the development of deep learning, several computational models based on deep learning have been proposed for DDI prediction. In this review, we summarize the high-quality deep-learning-based DDI prediction methods of recent years and divide them into four categories: neural network-based methods, graph neural network-based methods, knowledge graph-based methods, and multimodal-based methods. Furthermore, we discuss the challenges of existing methods and future potential perspectives. This review reveals that deep learning can significantly improve DDI prediction performance compared to traditional machine learning. Deep learning models can scale to large datasets and accept multiple data types as input, thus making DDI prediction more efficient and accurate.
FIB-SEM tomography is a powerful technique that integrates a focused ion beam (FIB) and a scanning electron microscope (SEM) to capture high-resolution imaging data of nanostructures. This approach involves collecting in-plane SEM images and using the FIB to remove material layers for imaging subsequent planes, thereby producing image stacks. However, these image stacks in FIB-SEM tomography are subject to the shine-through effect, which makes structures visible from the posterior regions of the current plane. This artifact introduces an ambiguity between image intensity and structures in the current plane, making conventional segmentation methods such as thresholding or the k-means algorithm insufficient. In this study, we propose a multimodal machine learning approach that combines intensity information obtained at different electron beam accelerating voltages to improve the three-dimensional (3D) reconstruction of nanostructures. By treating the increased shine-through effect at higher accelerating voltages as a form of additional information, the proposed method significantly improves segmentation accuracy and leads to more precise 3D reconstructions for real FIB tomography data.
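The core idea, treating intensities recorded at two accelerating voltages as a two-channel feature for clustering, can be sketched on toy data; the intensity values below are invented, and no claim is made about the paper's actual processing pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy pixel populations: at low voltage, shine-through pixels look like
# in-plane structure (both ~0.5); at high voltage the shine-through pixels
# brighten (~0.8), so the two-channel feature separates all three groups.
rng = np.random.default_rng(3)
n = 200
low_kv = np.concatenate([rng.normal(0.2, 0.02, n),    # background
                         rng.normal(0.5, 0.02, n),    # shine-through
                         rng.normal(0.5, 0.02, n)])   # structure in plane
high_kv = np.concatenate([rng.normal(0.2, 0.02, n),
                          rng.normal(0.8, 0.02, n),   # brighter at high kV
                          rng.normal(0.5, 0.02, n)])

# Stack the two voltage channels per pixel and cluster.
X = np.column_stack([low_kv, high_kv])
labels = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X).labels_
print(sorted({labels[0], labels[n], labels[2 * n]}))
```

Clustering on `low_kv` alone would merge the shine-through and in-plane groups; the second channel resolves the ambiguity, which is the "additional information" argument above in miniature.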
Automated waste sorting can dramatically increase waste sorting efficiency and reduce its regulation cost. Most current methods use only a single modality, such as image data or acoustic data, for waste classification, which makes it difficult to classify mixed and confusable wastes. In these complex situations, using multiple modalities becomes necessary to achieve high classification accuracy. Traditionally, the fusion of multiple modalities has been limited by fixed handcrafted features. In this study, a deep-learning approach was applied to multimodal fusion at the feature level for municipal solid-waste sorting. More specifically, the pre-trained VGG16 and one-dimensional convolutional neural networks (1D CNNs) were used to extract features from visual data and acoustic data, respectively. These deeply learned features were then fused in the fully connected layers for classification. The results of comparative experiments proved that the proposed method was superior to the single-modality methods. Additionally, the feature-based fusion strategy performed better than the decision-based strategy with deeply learned features.
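The shape of the feature-level fusion pipeline can be sketched without the trained networks; the fixed convolution filter, the pooling layout, and the 8-dimensional visual feature below are placeholders for the learned VGG16 and 1D-CNN features, chosen only to make the tensor shapes concrete:

```python
import numpy as np

rng = np.random.default_rng(4)

# Placeholder for a VGG16 fully-connected feature (dimension is an assumption).
visual_feat = rng.standard_normal(8)

# Placeholder acoustic branch: one fixed 1D convolution filter plus max
# pooling, standing in for a trained 1D CNN.
signal = np.sin(np.linspace(0, 20, 100))       # toy acoustic waveform
kernel = np.array([0.25, 0.5, 0.25])
conv = np.convolve(signal, kernel, mode="valid")   # length 100 - 3 + 1 = 98
acoustic_feat = conv.reshape(7, 14).max(axis=1)    # max-pool to 7 values

# Feature-level fusion: concatenate before the final classification layers.
fused = np.concatenate([visual_feat, acoustic_feat])
print(fused.shape)
```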
Multimodal machine learning (MML) aims to understand the world from multiple related modalities. It has attracted much attention as multimodal data have become increasingly available in real-world applications. MML has been shown to perform better than single-modal machine learning, since multiple modalities contain more information and can complement each other. However, fusing the modalities remains a key challenge in MML. Unlike previous work, we further consider side-information, which reflects the situation and influences the fusion of the modalities. We recover the multimodal label distribution (MLD) by leveraging the side-information, representing the degree to which each modality contributes to describing the instance. Accordingly, a novel framework named multimodal label distribution learning (MLDL) is proposed to recover the MLD and, with its guidance, fuse the modalities to learn an in-depth joint feature representation. Moreover, two versions of MLDL are proposed to deal with sequential data. Experiments on multimodal sentiment analysis and disease prediction show that the proposed approaches perform favorably against state-of-the-art methods.
Modern computational models have leveraged biological advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed based on deep neural networks, which are inspired by the biology of the visual cortex of the human brain. This unified framework is validated by two practical multimodal learning tasks: image captioning, involving visual and natural language signals, and visual-haptic fusion, involving haptic and visual signals. Extensive experiments are conducted under the framework, and competitive results are achieved.
With the growing awareness of data privacy, federated learning (FL) has gained increasing attention in recent years as a major paradigm for training models with privacy protection in mind, allowing models to be built in a collaborative but private way without exchanging data. However, most FL clients are currently unimodal. With the rise of edge computing, various types of sensors and wearable devices generate a large amount of data from different modalities, which has inspired research efforts in multimodal federated learning (MMFL). In this survey, we explore the area of MMFL to address the fundamental challenges of FL on multimodal data. First, we analyse the key motivations for MMFL. Second, the currently proposed MMFL methods are technically classified according to the modality distributions and modality annotations in MMFL. Then, we discuss the datasets and application scenarios of MMFL. Finally, we highlight the limitations and challenges of MMFL and provide insights and methods for future research.
In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) into a new era. Substantial work has shown that pre-trained models benefit downstream uni-modal tasks and avoid training a new model from scratch. So can such pre-trained models be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that this survey can shed light on future research in the VLP field.
Funding: Supported by the National Natural Science Foundation of China under Grant No. 60972106, the China Postdoctoral Science Foundation under Grant No. 2014M561053, the Humanity and Social Science Foundation of the Ministry of Education of China under Grant No. 15YJA630108, and the Hebei Province Natural Science Foundation under Grant No. E2016202341.
Funding: Supported by the National Natural Science Foundation of China (52230004 and 52293445), the Key Research and Development Project of Shandong Province (2020CXGC011202-005), and the Shenzhen Science and Technology Program (KCXFZ20211020163404007 and KQTD20190929172630447).
Funding: Supported by the National Natural Science Foundation of China (No. 62107014, Jian P.; 62177025, He B.), the Key R&D and Promotion Projects of Henan Province (No. 212102210147, Jian P.), and the Innovative Education Program for Graduate Students at North China University of Water Resources and Electric Power, China (No. YK-2021-99, Guo F.).
Funding: Supported by the National Natural Science Foundation of China (Grant No. 52072041), the Beijing Natural Science Foundation (Grant No. JQ21007), the University of Chinese Academy of Sciences (Grant No. Y8540XX2D2), the Robotics Rhino-Bird Focused Research Project (No. 2020-01-002), and the Tencent Robotics X Laboratory.
Funding: Supported by the National Key Research and Development Program of China under Grant No. 2018YFC1707704.
Funding: National Natural Science Foundation of China, Grant/Award Number: 62102158; Fundamental Research Funds for the Central Universities, Grant/Award Number: 2662022JC004; 2021 Foshan Support Project for Promoting the Development of University Scientific and Technological Achievements Service Industry, Grant/Award Number: 2021DZXX05; Huazhong Agricultural University Scientific and Technological Self-innovation Foundation.
Funding: Funded by the Deutsche Forschungsgemeinschaft (DFG, German Research Foundation), SFB 986, Project number 192346071.
Abstract: FIB-SEM tomography is a powerful technique that integrates a focused ion beam (FIB) and a scanning electron microscope (SEM) to capture high-resolution imaging data of nanostructures. The approach involves collecting in-plane SEM images and using the FIB to remove material layers for imaging subsequent planes, thereby producing image stacks. However, these image stacks are subject to the shine-through effect, which makes structures from regions posterior to the current plane visible. This artifact introduces an ambiguity between image intensity and the structures in the current plane, making conventional segmentation methods such as thresholding or the k-means algorithm insufficient. In this study, we propose a multimodal machine learning approach that combines intensity information obtained at different electron beam accelerating voltages to improve the three-dimensional (3D) reconstruction of nanostructures. By treating the increased shine-through effect at higher accelerating voltages as a source of additional information, the proposed method significantly improves segmentation accuracy and leads to more precise 3D reconstructions for real FIB tomography data.
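The core idea above — that a pixel's intensities at several accelerating voltages form a feature vector that disambiguates in-plane structure from shine-through — can be illustrated with a toy classifier. This is a hedged sketch on synthetic data with invented cluster means, not the paper's method; a nearest-centroid rule stands in for whatever ML model the authors trained.

```python
import numpy as np

# Toy data: each pixel carries intensities at two hypothetical
# accelerating voltages. Shine-through structure brightens mainly at
# the higher voltage, so the 2-D intensity vector separates classes
# that a single-voltage threshold cannot.
rng = np.random.default_rng(1)
pore      = rng.normal([0.2, 0.20], 0.05, size=(100, 2))  # background
solid     = rng.normal([0.8, 0.80], 0.05, size=(100, 2))  # in-plane structure
shine_thr = rng.normal([0.5, 0.85], 0.05, size=(100, 2))  # posterior structure

X = np.vstack([pore, solid, shine_thr])
y = np.repeat([0, 1, 2], 100)

# Nearest-centroid classifier on the multi-voltage feature vector.
centroids = np.stack([X[y == c].mean(axis=0) for c in range(3)])

def classify(pixels):
    d = np.linalg.norm(pixels[:, None, :] - centroids[None], axis=2)
    return d.argmin(axis=1)

acc = (classify(X) == y).mean()
```

At a single voltage the solid and shine-through classes overlap heavily; with both voltages stacked as features, even this trivial classifier separates them almost perfectly.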
基金supported by the National Natural Science Foundation of China(Grant Nos.51875507,52005439)the Key Research and Development Program of Zhejiang Province(Grant No.2021C01018)。
Abstract: Automated waste sorting can dramatically increase waste-sorting efficiency and reduce its regulation cost. Most current methods use only a single modality, such as image data or acoustic data, for waste classification, which makes it difficult to classify mixed and confusable wastes. In these complex situations, using multiple modalities becomes necessary to achieve high classification accuracy. Traditionally, the fusion of multiple modalities has been limited by fixed handcrafted features. In this study, a deep-learning approach was applied to multimodal fusion at the feature level for municipal solid-waste sorting. More specifically, a pre-trained VGG16 and one-dimensional convolutional neural networks (1D CNNs) were used to extract features from visual data and acoustic data, respectively. These deeply learned features were then fused in the fully connected layers for classification. Comparative experiments showed that the proposed method was superior to the single-modality methods. Additionally, the feature-based fusion strategy performed better than the decision-based strategy with deeply learned features.
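The acoustic branch described above rests on 1-D convolution over the raw waveform. The sketch below shows that core operation and the subsequent feature-level concatenation with visual features; the clip, kernel, and feature sizes are invented for illustration and do not reproduce the paper's VGG16/1D-CNN architecture.

```python
import numpy as np

def conv1d(signal, kernel, stride=1):
    """Valid-mode 1-D convolution: the basic op of a 1D CNN branch
    extracting features from an acoustic waveform."""
    k = len(kernel)
    n = (len(signal) - k) // stride + 1
    return np.array([np.dot(signal[i * stride:i * stride + k], kernel)
                     for i in range(n)])

# Hypothetical acoustic clip and a single "learned" kernel.
clip = np.sin(np.linspace(0, 8 * np.pi, 64))
kernel = np.array([1.0, 0.0, -1.0])                  # edge-like detector
acoustic_feat = np.maximum(conv1d(clip, kernel), 0)  # ReLU activation

# Hypothetical visual feature vector from a VGG16-style backbone.
visual_feat = np.full(4096, 0.1)

# Feature-level fusion: concatenate before the fully connected layers.
fused = np.concatenate([visual_feat, acoustic_feat])
```

A real pipeline would stack many such kernels, pool, and backpropagate through both branches; the point here is only where the two modalities meet.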
Funding: This research was supported by the National Key Research and Development Plan of China (2018AAA0100104), the National Natural Science Foundation of China (Grant No. 62076063), and the Fundamental Research Funds for the Central Universities (2242021k30056).
Abstract: Multimodal machine learning (MML) aims to understand the world from multiple related modalities. It has attracted much attention as multimodal data has become increasingly available in real-world applications. MML has been shown to outperform single-modal machine learning, since multiple modalities carry more information and can complement each other. However, fusing the modalities is a key challenge in MML. Unlike previous work, we further consider side information, which reflects the situation and influences the fusion of the modalities. We recover the multimodal label distribution (MLD) by leveraging the side information; the MLD represents the degree to which each modality contributes to describing the instance. Accordingly, a novel framework named multimodal label distribution learning (MLDL) is proposed to recover the MLD and fuse the modalities under its guidance to learn an in-depth joint feature representation. Moreover, two versions of MLDL are proposed to deal with sequential data. Experiments on multimodal sentiment analysis and disease prediction show that the proposed approaches perform favorably against state-of-the-art methods.
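The role of the MLD can be illustrated as a per-instance weighting over modality predictions. This is a minimal sketch under assumed numbers, not the MLDL framework itself: the branch names and the recovered distribution are invented, and in the paper the MLD guides representation learning rather than just a final weighted vote.

```python
import numpy as np

# Hypothetical per-modality class-probability predictions for one
# instance (e.g. text, audio, and video branches of a sentiment model).
preds = np.array([
    [0.7, 0.2, 0.1],   # text branch
    [0.4, 0.4, 0.2],   # audio branch
    [0.5, 0.3, 0.2],   # video branch
])

# A recovered multimodal label distribution: the degree to which each
# modality contributes to describing this instance (sums to 1).
mld = np.array([0.6, 0.1, 0.3])

# MLD-guided fusion: weight each modality's prediction by its
# contribution instead of averaging them uniformly.
fused = mld @ preds
print(fused)  # [0.61 0.25 0.14]
```

Because the weights are instance-specific, a clip whose sentiment is carried mostly by the text gets a text-heavy fusion, while another instance could weight audio or video more.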
基金supported by National Natural Science Foundation of China(Grant Nos.61621136008,61327809,61210013,91420302,and 91520201)
Abstract: Modern computational models have leveraged biological advances in human brain research. This study addresses the problem of multimodal learning with the help of brain-inspired models. Specifically, a unified multimodal learning architecture is proposed based on deep neural networks, which are inspired by the biology of the visual cortex of the human brain. This unified framework is validated on two practical multimodal learning tasks: image captioning, involving visual and natural-language signals, and visual-haptic fusion, involving haptic and visual signals. Extensive experiments are conducted under the framework, and competitive results are achieved.
基金supported by the National Natural Science Foundation of China(No.62036006)the Fundamental Research Funds for the Central Universities,Chinathe Innovation Fund of Xidian University,China.
Abstract: With growing awareness of data privacy, federated learning (FL) has gained increasing attention in recent years as a major paradigm for training models with privacy protection in mind, allowing models to be built collaboratively yet privately, without exchanging data. However, most FL clients are currently unimodal. With the rise of edge computing, various types of sensors and wearable devices generate a large amount of data from different modalities, which has inspired research efforts in multimodal federated learning (MMFL). In this survey, we explore the area of MMFL to address the fundamental challenges of FL on multimodal data. First, we analyse the key motivations for MMFL. Second, the currently proposed MMFL methods are technically classified according to the modality distributions and modality annotations in MMFL. Then, we discuss the datasets and application scenarios of MMFL. Finally, we highlight the limitations and challenges of MMFL and provide insights and directions for future research.
基金supported by the Key Research Program of the Chinese Academy of Sciences(No.ZDBSSSW-JSC006)the Strategic Priority Research Program of the Chinese Academy of Sciences(No.XDA 27030300).
Abstract: In the past few years, the emergence of pre-training models has brought uni-modal fields such as computer vision (CV) and natural language processing (NLP) to a new era. Substantial work has shown that pre-trained models benefit downstream uni-modal tasks and avoid training a new model from scratch. Can such pre-trained models also be applied to multi-modal tasks? Researchers have explored this problem and made significant progress. This paper surveys recent advances and new frontiers in vision-language pre-training (VLP), including image-text and video-text pre-training. To give readers a better overall grasp of VLP, we first review its recent advances in five aspects: feature extraction, model architecture, pre-training objectives, pre-training datasets, and downstream tasks. Then, we summarize the specific VLP models in detail. Finally, we discuss the new frontiers in VLP. To the best of our knowledge, this is the first survey focused on VLP. We hope that it can shed light on future research in the field.