The attention mechanism has been a successful method for multimodal affective analysis in recent years. Despite these advances, several significant challenges remain in fusing language with its nonverbal context information. One is to generate sparse attention coefficients associated with the acoustic and visual modalities, which helps locate critical emotional semantics. The other is fusing complementary cross-modal representations to construct optimal salient feature combinations of multiple modalities. A Conditional Transformer Fusion Network is proposed to handle these problems. Firstly, the authors equip the transformer module with CNN layers to enhance the detection of subtle signal patterns in nonverbal sequences. Secondly, sentiment words are utilised as context conditions to guide the computation of cross-modal attention. As a result, the located nonverbal features are not only salient but also directly complementary to the sentiment words. Experimental results show that the authors' method achieves state-of-the-art performance on several multimodal affective analysis datasets.
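The idea of conditioning cross-modal attention on sentiment words can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the single-head formulation, and the use of plain scaled dot-product attention are all assumptions; here the sentiment-word embeddings act as queries over a nonverbal frame sequence, so the attended features are selected relative to those words.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the given axis
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def conditional_cross_modal_attention(sentiment_words, nonverbal):
    """Hypothetical sketch: sentiment-word embeddings (Tw, d) query a
    nonverbal (acoustic or visual) frame sequence (Tn, d); the returned
    attention map says which frames complement each sentiment word."""
    d = sentiment_words.shape[-1]
    scores = sentiment_words @ nonverbal.T / np.sqrt(d)  # (Tw, Tn)
    weights = softmax(scores, axis=-1)                   # rows sum to 1
    fused = weights @ nonverbal                          # (Tw, d)
    return fused, weights

rng = np.random.default_rng(0)
words = rng.standard_normal((3, 8))    # 3 sentiment-word embeddings
frames = rng.standard_normal((40, 8))  # 40 nonverbal frame features
fused, w = conditional_cross_modal_attention(words, frames)
print(fused.shape, w.shape)  # (3, 8) (3, 40)
```

Sparsity of the coefficients, as discussed in the abstract, would come from replacing the softmax with a sparse variant or adding an entropy penalty; that part is omitted here.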
As one of the most effective ways to improve the accuracy and robustness of speech tasks, the audio-visual fusion approach has recently been introduced into the field of Keyword Spotting (KWS). However, existing audio-visual keyword spotting models are limited to detecting isolated words, while keyword spotting in unconstrained speech remains a challenging problem. To this end, an Audio-Visual Keyword Transformer (AVKT) network is proposed to spot keywords in unconstrained video clips. The authors present a transformer classifier with learnable CLS tokens to extract distinctive keyword features from variable-length audio and visual inputs. The outputs of the audio and visual branches are combined in a decision fusion module. Just as humans can easily notice whether a keyword appears in a sentence, the AVKT network can detect whether a video clip with a spoken sentence contains a pre-specified keyword. Moreover, the position of the keyword is localised in the attention map without additional position labels. Experimental results on the LRS2-KWS dataset and the newly collected PKU-KWS dataset show that the accuracy of AVKT exceeds 99% in clean scenes and 85% in extremely noisy conditions. The code is available at https://github.com/jialeren/AVKT.
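Decision-level fusion of the two branches can be illustrated with a toy sketch. This is an assumption-laden simplification, not the AVKT module: the convex-combination form and the `audio_weight` knob are illustrative stand-ins for whatever learned fusion the paper actually uses.

```python
def decision_fusion(audio_score, visual_score, audio_weight=0.5):
    """Hypothetical decision-level fusion: a convex combination of the
    per-branch keyword posteriors. `audio_weight` is an illustrative
    knob; in a real system it might be learned or adapted to the
    acoustic noise level."""
    return audio_weight * audio_score + (1 - audio_weight) * visual_score

# Clean scene: lean on the audio branch; noisy scene: lean on lip reading.
print(round(decision_fusion(0.95, 0.70, audio_weight=0.8), 2))  # 0.9
print(round(decision_fusion(0.40, 0.85, audio_weight=0.2), 2))  # 0.76
```

The point of the sketch is the robustness argument from the abstract: when the audio branch degrades under noise, a fusion that can shift weight toward the visual branch keeps the combined score informative.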
Aging is associated with a progressive decline in physiological capacities and an increased risk of aging-associated disorders. An increasing body of experimental evidence shows that aging is a complex biological process coordinately regulated by multiple factors at different molecular layers. Thus, it is difficult to delineate the overall systematic aging changes based on single-layer data. Instead, multimodal omics approaches, in which data are acquired and analyzed using complementary omics technologies, such as genomics, transcriptomics, and epigenomics, are needed for gaining insights into the precise molecular regulatory mechanisms that trigger aging. In recent years, multimodal omics sequencing technologies that can reveal complex regulatory networks and specific phenotypic changes have been developed and widely applied to decode aging and age-related diseases. This review summarizes the classification and progress of multimodal omics approaches, as well as the rapidly growing number of articles reporting on their application in the field of aging research, and outlines new developments in the clinical treatment of age-related diseases based on omics technologies.
In the era of big data, where vast amounts of information are being generated and collected at an unprecedented rate, there is a pressing demand for innovative data-driven multi-modal fusion methods. These methods aim to integrate diverse neuroimaging perspectives to extract meaningful insights and attain a more comprehensive understanding of complex psychiatric disorders. However, analyzing each modality separately may only reveal partial insights or miss important correlations between different types of data. This is where data-driven multi-modal fusion techniques come into play. By combining information from multiple modalities in a synergistic manner, these methods enable us to uncover hidden patterns and relationships that would otherwise remain unnoticed. In this paper, we present an extensive overview of data-driven multimodal fusion approaches with or without prior information, with specific emphasis on canonical correlation analysis and independent component analysis. The applications of such fusion methods are wide-ranging and allow us to incorporate multiple factors such as genetics, environment, cognition, and treatment outcomes across various brain disorders. After summarizing the diverse neuropsychiatric magnetic resonance imaging fusion applications, we further discuss emerging neuroimaging analysis trends in big data, such as N-way multimodal fusion, deep learning approaches, and clinical translation. Overall, multimodal fusion emerges as an imperative approach providing valuable insights into the underlying neural basis of mental disorders, which can uncover subtle abnormalities or potential biomarkers that may benefit targeted treatments and personalized medical interventions.
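Canonical correlation analysis, one of the two fusion methods the review emphasizes, can be demonstrated in a few lines. This is a minimal NumPy sketch under synthetic data, not any particular toolbox's implementation: the whitening-plus-SVD formulation and the simulated shared latent factor are illustrative assumptions.

```python
import numpy as np

def cca_first_component(X, Y, reg=1e-6):
    """Minimal CCA sketch: leading canonical correlation between two
    modality matrices X (n x p) and Y (n x q). After whitening each
    block, the singular values of the cross-covariance matrix are the
    canonical correlations."""
    X = X - X.mean(axis=0)
    Y = Y - Y.mean(axis=0)
    n = X.shape[0]
    Sxx = X.T @ X / n + reg * np.eye(X.shape[1])  # regularised covariances
    Syy = Y.T @ Y / n + reg * np.eye(Y.shape[1])
    Sxy = X.T @ Y / n
    Wx = np.linalg.inv(np.linalg.cholesky(Sxx))   # whitening transforms
    Wy = np.linalg.inv(np.linalg.cholesky(Syy))
    M = Wx @ Sxy @ Wy.T
    return np.linalg.svd(M, compute_uv=False)[0]

# Two synthetic "modalities" driven by one shared latent factor plus noise,
# standing in for, e.g., two neuroimaging feature sets from the same subjects.
rng = np.random.default_rng(0)
shared = rng.standard_normal((200, 1))
X = shared @ rng.standard_normal((1, 5)) + 0.5 * rng.standard_normal((200, 5))
Y = shared @ rng.standard_normal((1, 4)) + 0.5 * rng.standard_normal((200, 4))
print(round(float(cca_first_component(X, Y)), 2))  # close to 1: shared factor found
```

This is the core mechanism behind the fusion applications the review surveys: the hidden pattern linking the two modalities is recovered from their cross-covariance, even though neither modality reveals it alone.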
Funding (Conditional Transformer Fusion Network paper): National Key Research and Development Plan of China, Grant/Award Number: 2021YFB3600503; National Natural Science Foundation of China, Grant/Award Numbers: 62276065, U21A20472.
Funding (AVKT paper): Science and Technology Plan of Shenzhen, Grant/Award Number: JCYJ20200109140410340; National Natural Science Foundation of China, Grant/Award Number: 62073004.
Funding (aging multimodal omics review): the National Key Research and Development Program of China (2020YFA0804000, 2022YFA1103700, 2020YFA0112200, 2021YFF1201000); the National Natural Science Foundation of China (81921006, 82125011, 92149301, 92168201, 91949209, 92049304, 92049116, 32121001, 82192863); the Strategic Priority Research Program of the Chinese Academy of Sciences (XDA16010000); CAS Project for Young Scientists in Basic Research (YSBR-076, YSBR-012); the Program of the Beijing Natural Science Foundation (Z190019); Youth Innovation Promotion Association of CAS (E1CAZW0401); the Informatization Plan of Chinese Academy of Sciences (CAS-WX2021SF-0301, CAS-WX2022SDC-XK14, CAS-WX2021SF-0101); New Cornerstone Science Foundation through the XPLORER PRIZE (2021-1045).
Funding (neuroimaging multimodal fusion review): supported by the Natural Science Foundation of China (62373062, 82022035); the China Postdoctoral Science Foundation (2022M710434); National Institutes of Health grants (R01EB005846, R01MH117107, and R01MH118695); and the National Science Foundation (2112455).