Funding: This research was financially supported by the National Natural Science Foundation of China (Grant Nos. 52071306 and 51379195) and the Natural Science Foundation of Shandong Province (Grant No. ZR2019MEE050).
Abstract: The extrapolation of marine environmental design parameters has important applications in marine engineering and coastal disaster prevention. The distribution models used for environmental design parameters usually pass the hypothesis tests of statistical analysis, yet the results calculated from different distribution models often differ widely. In this paper, overall uncertainty test criteria based on information entropy are studied for commonly used distributions, including the Gumbel, Weibull, and Pearson type III distributions. An improved method for estimating the parameters of the maximum entropy distribution model is proposed on the basis of moment estimation. The study shows that the sample size and the degree of dispersion are proportional to the information entropy, and that the overall uncertainty of the maximum entropy distribution model is the smallest among the models compared.
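To make the link between dispersion and information entropy concrete, here is a minimal sketch for the Gumbel case only. It is not the paper's uncertainty criterion: it simply compares the textbook closed form for the differential entropy of a Gumbel distribution, ln(beta) + gamma + 1, with a numerical integral of -f ln f. The function names, the integration range, and the parameter values are assumptions made for illustration.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant


def gumbel_pdf(x, mu, beta):
    """Density of the Gumbel (maximum) distribution with location mu and scale beta."""
    z = (x - mu) / beta
    return math.exp(-(z + math.exp(-z))) / beta


def gumbel_entropy_closed_form(beta):
    """Differential entropy of a Gumbel distribution: ln(beta) + gamma + 1."""
    return math.log(beta) + EULER_GAMMA + 1.0


def entropy_numeric(pdf, lo, hi, steps=20000):
    """Midpoint-rule approximation of -integral of f(x) ln f(x) over [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps):
        f = pdf(lo + (k + 0.5) * h)
        if f > 0.0:  # skip points where the density underflows to zero
            total -= f * math.log(f) * h
    return total


if __name__ == "__main__":
    for beta in (0.5, 1.0, 2.0):  # hypothetical scale (dispersion) values
        numeric = entropy_numeric(
            lambda x: gumbel_pdf(x, 0.0, beta), -10.0 * beta, 30.0 * beta
        )
        print(beta, round(numeric, 4), round(gumbel_entropy_closed_form(beta), 4))
```

In this sketch the entropy grows with the scale parameter, which matches the qualitative statement that a more dispersed sample carries higher information entropy. How the paper combines entropy with sample size into an overall uncertainty measure is not shown here.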
Funding: Supported by the National Natural Science Foundation of China (Nos. 52071306, 51379195), the Natural Science Foundation of Shandong Province (No. ZR2019MEE050), and the Graduate Education Foundation (No. HDYA19006).
Abstract: The accurate calculation of marine environmental design parameters depends on the probability distribution model, and different distribution models often give different results. When extrapolating the recurrence level of a studied sea area, it is therefore important to determine which distribution model is more stable and reasonable. In this paper, we construct a method for evaluating the overall uncertainty of the calculation results and a measure of the uncertainty of the model used to derive the design parameters, incorporating the influence of sample information (sample size, degree of dispersion, and sampling error) on the information entropy of the model. Results show that the sample size and the degree of dispersion are directly proportional to the information entropy. Within the same group of data, the maximum entropy distribution model has the lowest overall uncertainty, while the Gumbel distribution model has the largest. In other words, the maximum entropy distribution model is well suited to the accurate calculation of marine environmental design parameters.
Abstract: After a review of recent developments in precision medicine, population health sciences, innovative clinical trial designs, and health economics and policy, we show how innovations in health analytics can capitalize on the advances in biomedicine and health economics towards developing a data-driven and cost-effective 21st-century health care system. In particular, we propose a mutually beneficial public-private partnership that combines individual responsibility with community solidarity in building this health care system.
Funding: Supported by the National Natural Science Foundation of China (Nos. 52071306, 52101360), the Natural Science Foundation of Shandong Province (No. ZR2019MEE050), and the State Key Laboratory of Coastal and Offshore Engineering (No. LP2104).
Abstract: Within China, the South China Sea suffers heavily from typhoon storm surge disasters, and its northern coastal areas face severe risks. It is therefore necessary and urgent to establish an assessment system for rating typhoon storm surge disasters. We constructed an effective and reliable rating assessment system for typhoon storm surge disasters based on the theories of over-threshold sampling, distribution function families, and composite extreme values. The over-threshold sample was used as the basis of the data analysis, the composite extreme value expansion model was used to derive the design water increment, and the disaster level was then delineated from the return period level. The comparison of extreme value models shows that the Weibull-Pareto distribution is more suitable than the classical extreme value distribution for fitting the over-threshold samples. The return period projections are relatively stable across different analysis samples. The 10 typhoon storm surges that made landfall in the Guangdong area in the past 10 years are taken as examples. The assessment rankings indicate that the risk levels based on the return period levels obtained from different distributions are generally consistent. When classifying low-risk areas, the classification criteria of the State Oceanic Administration, China (SOA, 2012) are more conservative. In high-risk areas, the assessment rankings based on return period are more consistent with those of the SOA.
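As a rough illustration of how a return period level is derived from over-threshold samples, the sketch below uses the standard peaks-over-threshold formula for a generalized Pareto tail. This is a swapped-in textbook technique, not the Weibull-Pareto composite extreme value model of the paper, and the function name, parameter names, and numerical values are all hypothetical.

```python
import math


def gpd_return_level(threshold, scale, shape, rate_per_year, return_period_years):
    """Return level for a generalized Pareto tail above `threshold`.

    rate_per_year is the mean number of threshold exceedances per year.
    x_T = u + (sigma / xi) * ((rate * T) ** xi - 1), with the xi -> 0 limit
    x_T = u + sigma * ln(rate * T).
    """
    m = rate_per_year * return_period_years  # expected exceedances in T years
    if m <= 1.0:
        raise ValueError("rate_per_year * return_period_years must exceed 1")
    if abs(shape) < 1e-12:
        return threshold + scale * math.log(m)
    return threshold + (scale / shape) * (m ** shape - 1.0)


if __name__ == "__main__":
    # Hypothetical surge parameters: threshold 1.0 m, scale 0.5 m, two exceedances per year.
    for period in (10, 50, 100):
        print(period, round(gpd_return_level(1.0, 0.5, 0.1, 2.0, period), 3))
```

A rating scheme of the kind described above could then assign a disaster level by comparing an observed surge with the levels for chosen return periods; the specific class boundaries used in the paper are not reproduced here.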
Funding: Supported by the National Natural Science Foundation of China (No. 52071306) and the Natural Science Foundation of Shandong Province (No. ZR2019MEE050).
Abstract: The marine environmental design parameters obtained from different data sampling methods, model distributions, and parameter estimation methods often vary greatly. To better analyze the uncertainties in the calculation of these parameters, a general model uncertainty assessment method is necessary. We propose a new multivariate model uncertainty assessment method for the calculation of marine environmental design parameters. The method divides the overall model uncertainty into two categories: aleatory uncertainty and epistemic uncertainty. The aleatory uncertainty of the model is obtained by analyzing the influence of the number and the degree of dispersion of the samples on the information entropy of the model. The epistemic uncertainty of the model is calculated from the information entropy of the model itself and the prediction error. The advantages of this method are that it does not require many years of observation data for the marine environmental elements and that it can be used to analyze any specific factor that causes model uncertainty. Applying the method to the South China Sea shows that the aleatory uncertainty of the model increases with the number of samples and then stabilizes. A positive correlation was found between the dispersion of the samples and the aleatory uncertainty of the model. Both the distribution of the model and its parameter estimation results have significant effects on the epistemic uncertainty of the model. When the goodness-of-fit values of the candidate models are relatively close, the best model can be selected by the criterion of lowest overall uncertainty, which both ensures a good model fit and avoids excessive uncertainty in the calculated results. The presented method provides a criterion for weighing the advantages and disadvantages of marine environmental design parameter calculation models from the standpoint of uncertainty, which is of great significance for analyzing the uncertainties in the calculation of marine environmental design parameters and improving the accuracy of the results.
Funding: Supported by the Hong Kong Research Grant Committee (Grant Nos. HKUST6019/10P and HKUST6019/12P), the National Natural Science Foundation of China (Grant Nos. 10871146 and 11271286), and the National University of Singapore (Grant No. R-155-000-106-112).
Abstract: Let $X_1, X_2, \ldots$ be a sequence of independent random variables (r.v.s) belonging to the domain of attraction of a normal or stable law. In this paper, we study moderate deviations for the self-normalized sum $\sum_{i=1}^{n} X_i / V_{n,p}$, where $V_{n,p} = \left(\sum_{i=1}^{n} |X_i|^p\right)^{1/p}$ ($p > 1$). Applications to the self-normalized law of the iterated logarithm, Studentized increments of partial sums, the t-statistic, and weighted sums of independent and identically distributed (i.i.d.) r.v.s are considered.
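The self-normalized sum defined above is straightforward to compute, and for p = 2 it is tied to Student's t-statistic by the classical identity t = R * sqrt((n - 1) / (n - R^2)), where R is the self-normalized sum. The sketch below checks that identity numerically; the function names and the sample values are illustrative assumptions, and nothing here reproduces the moderate deviation results themselves.

```python
import math


def self_normalized_sum(xs, p=2.0):
    """Compute sum(X_i) / V_{n,p} with V_{n,p} = (sum |X_i|^p)^(1/p), p > 1."""
    if p <= 1.0:
        raise ValueError("p must be greater than 1")
    s = sum(xs)
    v = sum(abs(x) ** p for x in xs) ** (1.0 / p)
    return s / v


def t_statistic(xs):
    """One-sample Student t-statistic for the hypothesis that the mean is zero."""
    n = len(xs)
    mean = sum(xs) / n
    var = sum((x - mean) ** 2 for x in xs) / (n - 1)
    return math.sqrt(n) * mean / math.sqrt(var)


def t_from_self_normalized(xs):
    """Recover the t-statistic from the p = 2 self-normalized sum."""
    n = len(xs)
    r = self_normalized_sum(xs, 2.0)
    return r * math.sqrt((n - 1) / (n - r * r))


if __name__ == "__main__":
    data = [1.2, -0.4, 0.9, 2.1, -1.3, 0.5]  # hypothetical sample
    print(self_normalized_sum(data), t_statistic(data), t_from_self_normalized(data))
```

Because the map from R to t is monotone, tail probabilities for the t-statistic can be restated as tail probabilities for the self-normalized sum, which is why the t-statistic appears among the applications listed in the abstract.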
Abstract: The survival analysis literature has always lagged behind the categorical data literature in developing methods to analyze clustered or multivariate data. While estimators based on
Funding: Supported by the Natural Science Foundation of Guangdong Province (Grant No. 2016A030307019), the Higher Education Colleges and Universities Innovation Strong School Project of Guangdong Province (Grant No. 2016KTSCX153), the Science and Technology Development Fund of Macao (Grant No. 127/2016/A3), the National Natural Science Foundation of China (Grant No. 11401607), and a grant at the National University of Singapore (Grant No. R-155-000-181-114).
Abstract: In this paper, we investigate two-sample U-statistics by jackknife empirical likelihood (JEL), a versatile nonparametric approach. More precisely, we propose the method of balanced augmented jackknife empirical likelihood (BAJEL), which adds two artificial points to the original pseudo-value dataset, and we prove that the log likelihood ratio based on the expanded dataset tends to the $\chi^2$ distribution.
Funding: Supported by the National Major Scientific Research Instrument Development Project (No. 62027819): High-Speed Real-Time Analyzer for Laser Chip's Optical Catastrophic Damage Process; the General Object of the National Natural Science Foundation (No. 62076177): Study on the Risk Assessment Model of Heart Failure by Integrating Multi-Modal Big Data; and the Shanxi Province Key Technology and Generic Technology R&D Project (No. 2020XXX007): Energy Internet Integrated Intelligent Data Management and Decision Support Platform.
Abstract: Although deep learning methods have recently attracted considerable attention in the medical field, analyzing large-scale electronic health record data is still a difficult task. In particular, the accurate recognition of heart failure is a key technology that helps doctors make reasonable treatment decisions. This study uses data from the Medical Information Mart for Intensive Care database. Compared with structured data, unstructured data contain abundant patient information. However, this type of data has unfavorable characteristics, e.g., much colloquial vocabulary and sparse content. To solve these problems, we propose the KTI-RNN model for unstructured data recognition. The proposed model overcomes sparse content and obtains good classification results. The term frequency-inverse word frequency (TF-IWF) model is used to extract the keyword set. The latent Dirichlet allocation (LDA) model is adopted to extract the topic word set. These models enable the expansion of the medical record text content. Finally, we embed the global attention mechanism and gating mechanism between the bidirectional recurrent neural network (BiRNN) model and the output layer. We call the result gated-attention-BiRNN (GA-BiRNN) and use it to identify heart failure from extensive medical texts. Results show that the proposed KTI-RNN model achieves an F1 score of 85.57% and an accuracy of 85.59%.
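The keyword-extraction step can be pictured with a small TF-IWF sketch. It assumes one common formulation, in which a word's term frequency within a record is multiplied by the logarithm of the total number of word occurrences in the corpus divided by that word's corpus-wide count; the paper may use a different variant. The tokenized records, function names, and the choice of k are hypothetical.

```python
import math
from collections import Counter


def tf_iwf(documents):
    """Score every word of every document.

    documents: list of token lists. Returns one {word: score} dict per document,
    with score = (count in doc / doc length) * ln(total corpus tokens / corpus count).
    """
    corpus_counts = Counter(tok for doc in documents for tok in doc)
    total_tokens = sum(corpus_counts.values())
    scores = []
    for doc in documents:
        counts = Counter(doc)
        scores.append({
            word: (count / len(doc)) * math.log(total_tokens / corpus_counts[word])
            for word, count in counts.items()
        })
    return scores


def top_keywords(documents, k=3):
    """Return the k highest-scoring words of each document."""
    return [
        [w for w, _ in sorted(s.items(), key=lambda kv: kv[1], reverse=True)[:k]]
        for s in tf_iwf(documents)
    ]


if __name__ == "__main__":
    records = [  # hypothetical tokenized clinical notes
        ["patient", "heart", "failure", "heart"],
        ["patient", "fever"],
        ["patient", "cough"],
    ]
    print(top_keywords(records, k=2))
```

Words that occur in every record, such as "patient" in this toy corpus, are down-weighted, so the extracted keyword set favors terms that distinguish one record from the others before the text is expanded and passed to a classifier.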
Funding: Supported by the National University of Singapore Academic Research Grant (Grant No. R-155-000-085-112).
Abstract: For several decades, much attention has been paid to the two-sample Behrens-Fisher (BF) problem, which tests the equality of the means or mean vectors of two normal populations with unequal variance/covariance structures. Little work, however, has been done on the k-sample BF problem for high-dimensional data, which tests the equality of the mean vectors of several high-dimensional normal populations with unequal covariance structures. In this paper, we study this challenging problem by extending the famous Scheffé transformation method, which reduces the k-sample BF problem to a one-sample problem. The induced one-sample problem can easily be tested by the classical Hotelling's $T^2$ test when the size of the resulting sample is very large relative to its dimensionality. For high-dimensional data, however, the dimensionality of the resulting sample is often very large, even much larger than its sample size, which makes the classical Hotelling's $T^2$ test lack power or even be undefined. To overcome this difficulty, we propose and study an $L^2$-norm based test. The asymptotic powers of the proposed $L^2$-norm based test and Hotelling's $T^2$ test are derived and theoretically compared. Methods for implementing the $L^2$-norm based test are described. Simulation studies are conducted to compare the $L^2$-norm based test with Hotelling's $T^2$ test when the latter is well defined, and otherwise to compare the proposed implementation methods for the $L^2$-norm based test. The methodologies are motivated and illustrated by a real data example.
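To see why an L2-norm statistic remains usable when the dimension exceeds the sample size, consider the induced one-sample problem of testing whether a mean vector is zero. A plausible form of the statistic, assumed here for illustration rather than taken from the paper, is n times the squared Euclidean norm of the sample mean; unlike Hotelling's T^2, it needs no inverse of the sample covariance matrix, which is singular when the dimension exceeds n - 1. The function name and the data are hypothetical.

```python
def l2_norm_statistic(sample):
    """Return n * ||sample mean||^2 for a list of equal-length observation vectors."""
    n = len(sample)
    p = len(sample[0])
    mean = [sum(row[j] for row in sample) / n for j in range(p)]
    return n * sum(m * m for m in mean)


if __name__ == "__main__":
    # Hypothetical data with n = 3 observations in p = 5 dimensions (p > n).
    data = [
        [1.0, 0.0, 2.0, 0.0, 1.0],
        [-1.0, 0.0, 0.0, 0.0, 1.0],
        [0.0, 0.0, 1.0, 0.0, 1.0],
    ]
    print(l2_norm_statistic(data))
```

Turning this statistic into a test requires an approximation to its null distribution, which is where the implementation methods mentioned in the abstract come in; that calibration step is deliberately left out of this sketch.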
Funding: Support provided by the National Alzheimer's Coordinating Center (NACC); supported by National University of Singapore Academic Research Funding (Grant No. R-155-000-109-112), a CBRG grant from the National Medical Research Council in Singapore, NACC (Grant No. U01AG16976), the National Institute of Health (Grant No. R01EB005829), and the National Natural Science Foundation of China (Grant No. 30728019).
Abstract: We consider the estimation of three-dimensional ROC surfaces for continuous tests given covariates. Three-way ROC analysis is important in our motivating example, where patients with Alzheimer's disease are usually classified into three categories and should receive category-specific medical treatment. There has been no discussion of how covariates affect three-way ROC analysis. We propose a regression framework induced from the relationship between test results and covariates. We consider several practical cases and the corresponding inference procedures. Simulations are conducted to validate our methodology. The application to the motivating example clearly illustrates the effects of age and sex on the accuracy of the Mini-Mental State Examination for Alzheimer's disease.
Funding: Li's work was partially supported by the National Medical Research Council in Singapore and AcRF R-155-000-174-114; NNSF [grant number 11371142].
Abstract: We provide a detailed review of the statistical analysis of diagnostic accuracy in multi-category classification tasks. For qualitative response variables with more than two categories, many traditional accuracy measures such as sensitivity, specificity, and area under the ROC curve are no longer applicable. In the recent literature, new diagnostic accuracy measures have been introduced in medical research studies. In this paper, important statistical concepts for multi-category classification accuracy are reviewed and their utility is demonstrated with real medical examples. We offer problem-based R code to illustrate how to perform these statistical computations step by step. We expect that such analysis tools will become more familiar to practitioners and find broader application in biostatistics. Our program can be adapted to many classifiers, among which logistic regression may be the most popular. We therefore base our discussion and illustration entirely on logistic regression in this paper.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 11571337), the Ministry of Education of Singapore (Grant No. ARC 14/11), and the National University of Singapore (Grant No. R-155-151-112).
Abstract: New statistics are proposed to estimate and test for structural change when the data dimension is comparable to or larger than the sample size. Consistency of the new statistic in estimating the change point position is established under the alternative hypothesis. The asymptotic distribution of the new statistic for testing the existence of a change point is obtained under the null hypothesis. Simulation results show that the numerical performance of our method is satisfactory. The method is illustrated via an analysis of the US house price index.
Funding: Ioannis Chalkiadakis acknowledges the support of Heriot-Watt University through a James-Watt scholarship while undertaking this work.
Abstract: This paper establishes a new framework for assessing multimodal statistical causality between cryptocurrency market (cryptomarket) sentiment and cryptocurrency price processes. To achieve this, we present an efficient algorithm for multimodal statistical causality analysis based on Multiple-Output Gaussian Processes. Signals from different information sources (modalities) are jointly modelled as a Multiple-Output Gaussian Process, and then, using a novel approach to statistical causality based on Gaussian Processes (GPs), we study linear and non-linear causal effects between the different modalities. We demonstrate the effectiveness of our approach in a machine learning application by studying the relationship between cryptocurrency spot price dynamics and sentiment time-series data specific to the crypto sector, which we conjecture influences retail investor behaviour. The investor sentiment is extracted from cryptomarket news data via methods developed in the area of statistical machine learning known as Natural Language Processing (NLP). To capture sentiment, we present a novel framework for text to time-series embedding, which we then use to construct a sentiment index from publicly available news articles. We conduct a statistical analysis of our sentiment index model and compare it to alternative state-of-the-art sentiment models popular in the NLP literature. With regard to multimodal causality, investor sentiment is our primary modality of exploration, in addition to price and a blockchain technology-related indicator (hash rate). Analysis shows that our approach is effective in modelling causal structures of varying degrees of complexity between heterogeneous data sources and illustrates the impact that certain modelling choices for the different modalities can have on detecting causality. A solid understanding of these factors is necessary to gauge cryptocurrency adoption by retail investors and to provide sentiment- and technology-based insights into cryptocurrency market dynamics.