Abstract: Cyber losses, measured as the number of records breached in a cyber incident, commonly feature a large proportion of zeros together with distinct behavior in the mid-range and in the large losses, which makes it hard to model the whole range of losses with a standard loss distribution. We tackle this problem by proposing a three-component spliced regression model that simultaneously models zero, moderate, and large losses and accounts for heterogeneous effects in the mixture components. To apply the model to the Privacy Rights Clearinghouse (PRC) data breach chronology, we segment geographical groups using unsupervised cluster analysis, and use a covariate-dependent probability to model zero losses, finite mixture distributions for the moderate body, and an extreme value distribution for large losses, capturing the heavy-tailed nature of the loss data. Parameters and coefficients are estimated using the Expectation-Maximization (EM) algorithm. Combined with our frequency model for data breaches (a generalized linear mixed model), aggregate loss distributions are investigated, and applications to cyber insurance pricing and risk management are discussed.
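The abstract does not state the density, so the following is only a sketch of one common way to write a three-component spliced severity model. The splicing threshold $u$, the tail weight $\pi_T$, the logistic form of the zero probability $\pi_0(x)$ with coefficients $\gamma$, the body components $f_k$ (with distribution functions $F_k$ supported on the positive half-line) and their weights $w_k$, and the generalized Pareto tail $g_{\xi,\sigma}$ are all assumptions made for illustration, not necessarily the authors' specification.

```latex
f(y \mid x) =
\begin{cases}
\pi_0(x), & y = 0 \quad \text{(point mass)},\\[4pt]
\bigl(1-\pi_0(x)\bigr)\,\bigl(1-\pi_T\bigr)\,
  \dfrac{\sum_{k=1}^{K} w_k\, f_k(y \mid x)}{\sum_{k=1}^{K} w_k\, F_k(u \mid x)}, & 0 < y \le u,\\[10pt]
\bigl(1-\pi_0(x)\bigr)\,\pi_T\; g_{\xi,\sigma}(y-u), & y > u,
\end{cases}
\qquad
\pi_0(x) = \frac{1}{1+\exp(-x^{\top}\gamma)},
\qquad
g_{\xi,\sigma}(z) = \frac{1}{\sigma}\Bigl(1+\frac{\xi z}{\sigma}\Bigr)^{-1/\xi-1}.
```

Under these assumptions the three pieces carry total mass $\pi_0 + (1-\pi_0)(1-\pi_T) + (1-\pi_0)\pi_T = 1$: the body mixture is renormalized on $(0, u]$ and the tail density integrates to one over the excesses $y - u > 0$.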
Funding: Funded by the National Key Research and Development Program of China (No. 2022YFD2200503-02).
Abstract: The diameter distribution function (DDF) is a crucial tool for accurately predicting stand carbon storage (CS). The current key issue, however, is how to construct a high-precision DDF based on stand factors, site quality, and aridity index to predict stand CS in multi-species mixed forests with complex structures. This study used data from 70 survey plots for mixed broadleaf Populus davidiana and Betula platyphylla forests in the Mulan Rangeland State Forest, Hebei Province, China, to construct the DDF based on maximum likelihood estimation (MLE) and the finite mixture model (FMM). Ordinary least squares (OLS), linear seemingly unrelated regression (LSUR), and a back propagation neural network (BPNN) were used to investigate the influences of stand factors, site quality, and aridity index on the shape and scale parameters of the DDF and on the predicted stand CS of mixed broadleaf forests. The results showed that the FMM accurately described the stand-level diameter distribution of the mixed P. davidiana and B. platyphylla forests, whereas the Weibull function constructed by MLE was more accurate in describing the species-level diameter distribution. The combined variable of quadratic mean diameter (Dq), stand basal area (BA), and site quality improved the accuracy of the shape parameter models of the FMM; the combined variable of Dq, BA, and the De Martonne aridity index improved the accuracy of the scale parameter models. Compared to OLS and LSUR, the BPNN had higher accuracy in the re-parameterization process of the FMM. OLS, LSUR, and BPNN overestimated the CS of P. davidiana but underestimated the CS of B. platyphylla in the large diameter classes (DBH ≥ 18 cm). The BPNN accurately estimated stand- and species-level CS, but it was more suitable for estimating stand-level CS than species-level CS, thereby providing a scientific basis for the optimization of stand structure and the assessment of carbon sequestration capacity in mixed broadleaf forests.
Funding: Supported by the National Key Natural Science Foundation of China (Grant No. 50635010).
Abstract: Currently prevalent machine performance degradation assessment techniques estimate a machine's current condition by recognizing indications of failure features, which requires complete data collected under different conditions. However, failure data are generally hard to acquire, which makes those techniques hard to apply. In this paper, a novel method that does not need failure history data is introduced. Wavelet packet decomposition (WPD) is used to extract features from raw signals, principal component analysis (PCA) is used to reduce feature dimensions, and a Gaussian mixture model (GMM) is then applied to approximate the feature space distributions. A single-channel confidence value (SCV) is calculated from the overlap between the GMM of the monitoring condition and that of the normal condition, and indicates the performance of a single channel. Furthermore, a multi-channel confidence value (MCV), which can be regarded as the overall performance index across multiple channels, is calculated via logistic regression (LR), which also accomplishes decision-level sensor fusion. Both SCV and MCV can serve as a basis for proactive maintenance measures, thus preventing machine breakdown. The method has been used to assess the performance of the turbine of a centrifugal compressor in a Petro-China factory, and the results show that it can effectively complete this task. The proposed method has engineering significance for machine performance degradation assessment.
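To make the idea of an overlap-based confidence value concrete, here is a minimal sketch for one-dimensional features. The abstract does not define the overlap measure, so the normalized inner product used below, the function names `gmm_pdf` and `confidence_value`, the integration grid, and the example mixture parameters are all assumptions chosen for illustration.

```python
import numpy as np

def gmm_pdf(x, weights, means, stds):
    """Density of a 1-D Gaussian mixture evaluated on the array x."""
    x = np.asarray(x, dtype=float)[:, None]
    w = np.asarray(weights, dtype=float)
    mu = np.asarray(means, dtype=float)
    sd = np.asarray(stds, dtype=float)
    comp = np.exp(-0.5 * ((x - mu) / sd) ** 2) / (sd * np.sqrt(2.0 * np.pi))
    return comp @ w

def confidence_value(normal, monitored, lo=-20.0, hi=20.0, n=20001):
    """Normalized overlap in [0, 1] between two mixtures.

    Assumed definition: <p, q> / sqrt(<p, p> <q, q>), integrated
    numerically on a fixed grid (hypothetical choice for this sketch).
    """
    grid = np.linspace(lo, hi, n)
    dx = grid[1] - grid[0]
    p = gmm_pdf(grid, *normal)
    q = gmm_pdf(grid, *monitored)
    inner = np.sum(p * q) * dx
    return inner / np.sqrt(np.sum(p * p) * dx * np.sum(q * q) * dx)

# Each mixture is (weights, means, standard deviations); values are made up.
normal = ([0.6, 0.4], [0.0, 2.0], [1.0, 0.5])
drifted = ([0.6, 0.4], [1.5, 3.5], [1.0, 0.5])
failed = ([1.0], [12.0], [1.0])

cv_same = confidence_value(normal, normal)
cv_drift = confidence_value(normal, drifted)
cv_fail = confidence_value(normal, failed)
```

With this definition the value stays near one while the monitored distribution matches the normal one and falls toward zero as the two distributions separate, which is the behavior a degradation index needs.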
Funding: Supported by the National Natural Science Foundation of China (11861041, 11261025).
Abstract: Mixture of Experts (MoE) regression models are widely studied in statistics and machine learning for modeling heterogeneity in data for regression, clustering, and classification. The Laplace distribution is one of the most important statistical tools for analyzing thick-tailed data. Laplace Mixture of Linear Experts (LMoLE) regression models are based on the Laplace distribution, which makes them more robust. By analogy with modeling the variance parameter in a homogeneous population, we propose and study a novel class of models, heteroscedastic Laplace mixture of experts regression models, to analyze heteroscedastic data coming from a heterogeneous population. The issues of maximum likelihood estimation are addressed. In particular, a Minorization-Maximization (MM) algorithm for estimating the regression parameters is developed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo simulations. Results from the analysis of two real data sets are presented.
Abstract: In this paper, we study the regression problem for time series data from heterogeneous populations on the basis of the finite mixture regression (FMR) model. We propose two finite mixture time-varying regression models to address it. A regularization method for variable selection in these models is proposed, whose penalty is a mixture of appropriate penalty functions and an l2 penalty. A block-wise minorization-maximization (MM) algorithm is used for maximum penalized log quasi-likelihood estimation of these models. The procedure is illustrated by simulations and by an application analyzing the behavior of urban vehicular traffic in the city of São Paulo from 14 to 18 December 2009, which shows that the proposed models outperform the FMR models.
Abstract: Mixture regression is a regression problem with mixed data: among the observations, some come from one model while others come from other models. EM or other algorithms can be used to solve this problem only once the number of models is assumed known. In this paper, we propose an information criterion for the mixture regression model. Compared with ordinary information criteria in simulation studies, our criterion performs better at choosing the correct number of models.
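As a baseline for what choosing the number of models involves, the sketch below fits Gaussian mixtures of linear regressions by EM for several candidate component counts and compares them with the ordinary BIC. This is not the criterion proposed in the paper, which the abstract does not specify. The function names, the subset-based initialization, the variance floor, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np

def log_mixture(X, y, pis, betas, sigmas):
    """Per-component log densities (n x k) and the log mixture density (n,)."""
    resid = y[:, None] - X @ betas.T
    log_dens = (np.log(pis) - np.log(sigmas) - 0.5 * np.log(2.0 * np.pi)
                - 0.5 * (resid / sigmas) ** 2)
    m = log_dens.max(axis=1, keepdims=True)
    log_mix = m[:, 0] + np.log(np.exp(log_dens - m).sum(axis=1))
    return log_dens, log_mix

def fit_mixture_regression(X, y, k, n_iter=100, n_init=10, seed=0):
    """EM for a k-component Gaussian mixture of linear regressions.
    Returns the best log-likelihood over n_init random starts."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    var_floor = 1e-3 * y.var()   # assumption: guards against collapsing components
    best = -np.inf
    for _start in range(n_init):
        # Start each regression line from a small random subset of points.
        betas = np.empty((k, d))
        for j in range(k):
            idx = rng.choice(n, size=d + 1, replace=False)
            betas[j] = np.linalg.lstsq(X[idx], y[idx], rcond=None)[0]
        sigmas = np.full(k, y.std())
        pis = np.full(k, 1.0 / k)
        for _it in range(n_iter):
            # E-step: posterior probability of each component for each point.
            log_dens, log_mix = log_mixture(X, y, pis, betas, sigmas)
            resp = np.exp(log_dens - log_mix[:, None])
            # M-step: mixing weights, weighted least squares, residual scale.
            nk = resp.sum(axis=0) + 1e-10
            pis = nk / nk.sum()
            for j in range(k):
                Xw = X * resp[:, j][:, None]
                betas[j] = np.linalg.solve(X.T @ Xw + 1e-8 * np.eye(d), Xw.T @ y)
                r = y - X @ betas[j]
                sigmas[j] = np.sqrt(max((resp[:, j] * r ** 2).sum() / nk[j], var_floor))
        best = max(best, log_mixture(X, y, pis, betas, sigmas)[1].sum())
    return best

def bic(loglik, k, d, n):
    """Ordinary BIC: k*d coefficients, k scales, k-1 free mixing weights."""
    return -2.0 * loglik + (k * d + k + (k - 1)) * np.log(n)

# Synthetic data from two crossing regression lines (made-up parameters).
rng = np.random.default_rng(1)
n = 400
x = rng.uniform(-2.0, 2.0, n)
X = np.column_stack([np.ones(n), x])
z = rng.integers(0, 2, n)
true_betas = np.array([[1.0, 3.0], [-1.0, -3.0]])
y = (X * true_betas[z]).sum(axis=1) + rng.normal(0.0, 0.3, n)

scores = {k: bic(fit_mixture_regression(X, y, k), k, X.shape[1], n) for k in (1, 2, 3)}
best_k = min(scores, key=scores.get)
```

The candidate with the smallest score is selected. A criterion tailored to mixture regression, as the abstract proposes, would replace only the `bic` function; the EM fits stay the same.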
Abstract: In this paper, we propose a robust mixture regression model based on skew scale mixtures of normal distributions (RMR-SSMN), which better accommodates asymmetric, heavy-tailed, and contaminated data. For the variable selection problem, a penalized likelihood approach is proposed with a new combined penalty function that balances the SCAD and l2 penalties. An adjusted EM algorithm is presented to obtain parameter estimates of RMR-SSMN models at a faster convergence rate. As simulations show, our mixture models are more robust than general FMR models, and the new combined penalty function outperforms SCAD for variable selection. Finally, the proposed methodology and algorithm are applied to a real data set and achieve reasonable results.
Funding: Supported by the Natural Science Foundation of Jiangsu Province of China (BK20130531), the Priority Academic Program Development of Jiangsu Higher Education Institutions (PAPD[2011]6), and the Jiangsu Government Scholarship.
Abstract: A dynamic soft sensor based on a single Gaussian process regression (GPR) model has been developed for fermentation processes. However, the limitations of single regression models for multiphase/multimode fermentation processes may result in large prediction errors and a complex soft sensor. Therefore, a dynamic soft sensor based on Gaussian mixture regression (GMR) is proposed to overcome these problems. Two structure parameters, the number of Gaussian components and the order of the model, are crucial to the soft sensor model. To achieve a simple and effective soft sensor, an iterative strategy is proposed to optimize the two structure parameters simultaneously. For comparison, the proposed dynamic GMR soft sensor and the existing dynamic GPR soft sensor were both used to estimate biomass concentration in a penicillin simulation process and an industrial erythromycin fermentation process. Results show that the proposed dynamic GMR soft sensor has higher prediction accuracy and is more suitable for dynamic multiphase/multimode fermentation processes.
Abstract: In this paper, we propose a Fast Iteration Method for solving the mixture regression problem, which can be treated as model-based clustering. Compared with the EM algorithm, the proposed method is faster and more flexible, and it can solve mixture regression problems with different error distributions (e.g., Laplace and t distributions). Extensive numerical experiments show that the proposed method performs better on randomly simulated data and real data.
Funding: Supported by the National Natural Science Foundation of China (11261025, 11561075), the Natural Science Foundation of Yunnan Province (2016FB005), and the Program for Middle-aged Backbone Teacher, Yunnan University.
Abstract: Normal mixture regression models are among the most important statistical tools for analyzing data from a heterogeneous population. For data sets with asymmetric outcomes, the skew-normal distribution has proved beneficial over the last two decades in various theoretical and applied problems. In this paper, we propose and study a novel class of models, a skew-normal mixture of joint location, scale, and skewness models, to analyze heteroscedastic skew-normal data coming from a heterogeneous population. The issues of maximum likelihood estimation are addressed. In particular, an Expectation-Maximization (EM) algorithm for estimating the model parameters is developed. Properties of the estimators of the regression coefficients are evaluated through Monte Carlo experiments. Results from the analysis of a real Body Mass Index (BMI) data set are presented.
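Abstract: With the advancement of the "dual carbon" goals, the share of clean energy has increased substantially and distributed photovoltaic (PV) generation has developed rapidly in rural China, but its randomness and intermittency pose major challenges to the accommodation of new energy and to grid stability. PV generation forecasting can, to some extent, ease the accommodation problem and reduce the impact of PV instability on the grid. Therefore, to improve the accuracy of PV power forecasting, a probabilistic PV generation forecasting method is proposed based on a CNN-QRGRU network optimized by an improved weighted mean of vectors algorithm. First, the ReliefF algorithm is used to select feature variables; on this basis, Gaussian mixture model (GMM) clustering divides the weather into three types: sunny, sunny turning cloudy, and overcast or rainy. The processed data are fed into a CNN-GRU model, whose hyperparameters are tuned with the weighted mean of vectors (INFO) optimization algorithm. A quantile regression (QR) model is combined with the INFO-CNN-GRU model to obtain the conditional distribution of PV power, and kernel density estimation is then used to obtain the probability density function from the conditional distribution, completing the probabilistic forecast. Using data from an actual PV power station, the proposed INFO optimization algorithm is compared with several traditional optimization algorithms, and the results show that INFO optimizes better. Probabilistic forecasts built on this basis provide more useful information than point forecasts and have greater practical value.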
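Quantile regression, as mentioned in the abstract above, is usually trained by minimizing the pinball (quantile) loss. The sketch below shows only that loss and checks it on a constant predictor; it does not reproduce the CNN-GRU network or the INFO optimizer. The function name `pinball_loss`, the gamma-distributed stand-in for PV power, and the candidate grid are assumptions made for illustration.

```python
import numpy as np

def pinball_loss(y, q_pred, tau):
    """Average quantile (pinball) loss of predictions q_pred at level tau."""
    diff = np.asarray(y, dtype=float) - np.asarray(q_pred, dtype=float)
    # Under-prediction is weighted by tau, over-prediction by (1 - tau).
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))

rng = np.random.default_rng(0)
power = rng.gamma(shape=2.0, scale=1.5, size=5000)  # synthetic stand-in, not real PV data

tau = 0.9
candidates = np.linspace(0.0, 15.0, 1501)
losses = [pinball_loss(power, c, tau) for c in candidates]

# The constant minimizing the pinball loss should sit at the empirical tau-quantile.
best_constant = candidates[int(np.argmin(losses))]
empirical_q = np.quantile(power, tau)
```

Fitting one such loss per quantile level yields a set of conditional quantiles, from which a density can then be estimated, for example by kernel density estimation as the abstract describes.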
Funding: Supported by the National Natural Science Foundation of China [grant numbers 71931004, 11901200, 71971083, and 11971170] and the National Key R&D Program of China [grant numbers 2021YFA1000100, 2021YFA1000101].
Abstract: The outbreak of COVID-19 on the Diamond Princess cruise ship has attracted much attention. Motivated by the PCR testing data on the Diamond Princess, we propose a novel cure mixture nonparametric model to investigate the detection pattern. It combines a logistic regression for the probability of susceptible subjects with a nonparametric distribution for the detection of infected individuals. Maximum likelihood estimators are proposed and shown to be consistent and asymptotically normal. Simulation studies demonstrate that the proposed approach is appropriate for practical use. Finally, we apply the proposed method to the PCR testing data on the Diamond Princess to show its practical utility.
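Abstract: Variable selection for finite mixture of regression (FMR) models is frequently used in statistical modeling. Current research on FMR models focuses mainly on the case where the regression errors follow a normal distribution, an assumption that is unsuitable for asymmetric data. For skewed data, the mode is more representative than the mean. Based on mixture skew-normal data, this paper introduces a variable selection method for modal regression models and proves the consistency of the variable selection method and the oracle property of the parameter estimates. To estimate the model parameters, a modified EM (Expectation-Maximization) algorithm is proposed; simulation studies and a real-data analysis further illustrate the effectiveness of the proposed model and variable selection method.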