In this paper, a model averaging method is proposed for varying-coefficient models with response missing at random by establishing a weight selection criterion based on cross-validation. Under certain regularity conditions, it is proved that the proposed method is asymptotically optimal in the sense of achieving the minimum squared error.
Conditional independence is a basic assumption in many fields of statistics. Testing its validity is of great importance and has been studied extensively in the literature. However, existing methods focus on the case in which the data are fully observed, and none appears to account for the scenario in which missing data are present. Motivated by this, this paper develops two test statistics for such situations, based on the inverse probability weighted and augmented inverse probability weighted techniques. The asymptotic distributions of the proposed statistics are derived under the null hypothesis. Simulation studies indicate that both test statistics perform well in terms of size and power.
Assuming that random interruptions in the observation process are modeled by a sequence of independent Bernoulli random variables, we first generalize two nonlinear filtering methods to handle random interruption failures in the observations. The methods are based on extended Kalman filtering (EKF) and unscented Kalman filtering (UKF), and are abbreviated as GEKF and GUKF in this paper, respectively. The nonlinear filtering model is then established by taking the prototypes and network weights of a radial basis function neural network (RBFNN) as the state equation and the output of the RBFNN as the observation equation. Finally, we treat filtering with missing observed data as a special case of nonlinear filtering with random intermittent failures by setting each missing value to zero, without needing to pre-estimate the missing data, and use the GEKF-based RBFNN and the GUKF-based RBFNN to predict a ground radioactivity time series with missing data. Experimental results demonstrate that the predictions of the GUKF-based RBFNN agree well with the real ground radioactivity time series, while the predictions of the GEKF-based RBFNN diverge.
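As a rough illustration of filtering with intermittent observations, the sketch below runs a scalar linear Kalman filter that skips the measurement update whenever an observation is missing. It is not the GEKF or GUKF of the abstract: it assumes a linear model, assumes the filter knows which observations were lost, and the function name `kalman_intermittent` and all of its parameters are hypothetical.

```python
import math

def kalman_intermittent(ys, x0=0.0, p0=1.0, a=1.0, q=1.0, r=1.0):
    """Scalar Kalman filter for x[k+1] = a*x[k] + w, y[k] = x[k] + v.

    ys holds the observations; None or NaN marks a missing one.
    q and r are the process and measurement noise variances.
    Returns a list of (estimate, variance) pairs, one per time step.
    """
    x, p = x0, p0
    history = []
    for y in ys:
        # Prediction step: always executed.
        x = a * x
        p = a * a * p + q
        # Update step: executed only when the observation arrived.
        if y is not None and not math.isnan(y):
            gain = p / (p + r)
            x = x + gain * (y - x)
            p = (1.0 - gain) * p
        history.append((x, p))
    return history
```

When every observation is missing, the estimate simply follows the state equation and the variance grows by `q` at each step, which is the behavior one would expect from a filter that receives no new information.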
In this article, nonlinear regression models with missing responses are studied with the aim of improving the doubly robust estimator. Based on the covariate balancing propensity score (CBPS), estimators for the regression coefficients and the population mean are obtained. It is proved that the proposed estimators are asymptotically normal. In simulation studies, the proposed estimators show improved performance relative to the usual augmented inverse probability weighted estimators.
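For orientation, the textbook inverse probability weighted (IPW) and augmented IPW (AIPW) estimators of a population mean with missing responses can be sketched as follows. This is a minimal sketch that takes the propensity scores and outcome-model predictions as given inputs; it does not implement CBPS or the estimators proposed in the article, and the function names are hypothetical.

```python
import numpy as np

def ipw_mean(y, observed, propensity):
    """IPW estimate of E[Y]: mean of observed * y / propensity.

    y may contain NaN where observed == 0; those entries are ignored.
    """
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    y_safe = np.where(observed == 1, y, 0.0)  # mask missing responses
    return float(np.mean(observed * y_safe / propensity))

def aipw_mean(y, observed, propensity, outcome_pred):
    """AIPW estimate of E[Y]: the IPW term minus an augmentation term
    built from outcome-model predictions (assumed supplied by the caller)."""
    y = np.asarray(y, dtype=float)
    observed = np.asarray(observed, dtype=float)
    propensity = np.asarray(propensity, dtype=float)
    outcome_pred = np.asarray(outcome_pred, dtype=float)
    y_safe = np.where(observed == 1, y, 0.0)
    ipw_term = observed * y_safe / propensity
    augmentation = (observed - propensity) / propensity * outcome_pred
    return float(np.mean(ipw_term - augmentation))
```

The AIPW form is what makes the estimator doubly robust: it remains consistent if either the propensity model or the outcome model is correctly specified.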
In this study, we investigate the effects of missing data when estimating HIV/TB co-infection. We revisit the concept of missing data and examine three available approaches for dealing with missingness. The main objective is to identify the best method for correcting for missing data in the TB/HIV co-infection setting. We employ both empirical data analysis and an extensive simulation study to examine the effects of missing data, as well as the accuracy, sensitivity, specificity, and training and test error of the different approaches. The novelty of this work lies in the use of modern statistical learning algorithms when treating missingness. In the empirical analysis, imputation was performed on both the HIV data and the TB-HIV co-infection data, and the missing values were imputed using the different approaches. In the simulation study, 0% (complete case), 10%, 30%, 50% and 80% of the data were drawn at random and replaced with missing values. The results show a co-infection rate (95% confidence interval) of 29% (25%, 33%) for the complete-case analysis, 27% (23%, 31%) for the weighted method, 26% (24%, 28%) for the likelihood-based approach and 21% (20%, 22%) for the multiple imputation (MI) approach. In conclusion, MI remains the best approach for dealing with missing data, and failure to apply it results in overestimation of the HIV/TB co-infection rate by 8 percentage points.
Many real-world datasets suffer from the unavoidable issue of missing values, and therefore classification with missing data has to be carefully handled, since inadequate treatment of missing values will cause large errors. In this paper, we propose a random subspace sampling method, RSS, by sampling missing items from the corresponding feature histogram distributions in random subspaces, which is effective and efficient at different levels of missing data. Unlike most established approaches, RSS does not train on fixed imputed datasets. Instead, we design a dynamic training strategy where the filled values change dynamically by resampling during training. Moreover, thanks to the sampling strategy, we design an ensemble testing strategy where we combine the results of multiple runs of a single model, which is more efficient and resource-saving than previous ensemble methods. Finally, we combine these two strategies with the random subspace method, which makes our estimations more robust and accurate. The effectiveness of the proposed RSS method is well validated by experimental studies.
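The idea of drawing filled values from per-feature histograms can be sketched as below. This is only an illustration under simplifying assumptions, not the authors' RSS implementation: it leaves out the random subspaces, the training loop and the ensemble testing, and the function name `sample_fill` and the `bins` parameter are hypothetical.

```python
import numpy as np

def sample_fill(X, rng, bins=10):
    """Return a copy of X in which each NaN is replaced by a draw from
    the histogram of the observed values in the same column.

    Calling the function again with the same rng yields new draws,
    so the filled values can change between training passes.
    """
    X = np.array(X, dtype=float)  # copy; the caller's data is untouched
    for j in range(X.shape[1]):
        col = X[:, j]  # view into X, so assignments modify X
        missing = np.isnan(col)
        if not missing.any() or missing.all():
            continue  # nothing to fill, or nothing to sample from
        counts, edges = np.histogram(col[~missing], bins=bins)
        # Pick a bin with probability proportional to its count,
        # then draw uniformly inside that bin.
        idx = rng.choice(len(counts), size=int(missing.sum()),
                         p=counts / counts.sum())
        col[missing] = rng.uniform(edges[idx], edges[idx + 1])
    return X
```

A caller would typically create one generator with `np.random.default_rng(seed)` and call `sample_fill` once per epoch, so that the model never sees a single fixed imputed dataset.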
In this paper, three smoothed empirical log-likelihood ratio functions are proposed for the parameters of nonlinear models with missing responses. Under some regularity conditions, the corresponding Wilks phenomena are obtained, so that confidence regions for the parameters can be constructed easily.
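This paper considers the case in which a functional explanatory variable is only partially observed and uses a functional linear model to describe its relationship with a scalar response variable. The missing parts of the samples are reconstructed by functional principal component analysis (FPCA). In an empirical analysis of meteorological data recorded in Beijing from 2010 to 2014, which include partially observed PM2.5 values, the effect of PM2.5, as a partially observed functional explanatory variable, on the scalar response variable, mean temperature, is analyzed. The results show that the method is of practical value for handling missing functional data.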
Missing data filling is a key step in power big data preprocessing, which helps to improve the quality and utilization of electric power data. To address the limitations of traditional methods for filling missing data, an improved random forest filling algorithm is proposed. Because electric power data exhibit time-series characteristics in both the horizontal and vertical directions, the improved random forest method combines linear interpolation, matrix combination and matrix transposition to fill large amounts of missing electric power data. The filling results show that the improved random forest filling algorithm is applicable to electric power data with various forms of missingness. Moreover, the accuracy of the filling results is high and the model is stable, which helps to improve the quality of electric power data.
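Of the ingredients named above, linear interpolation is the simplest to illustrate. The sketch below fills gaps in a single time series with NumPy's `np.interp`; it covers only that one ingredient, not the improved random forest algorithm, and the function name `fill_linear` is hypothetical.

```python
import numpy as np

def fill_linear(series):
    """Fill NaN gaps in a 1-D series by linear interpolation between
    the nearest observed neighbors. Leading and trailing gaps take the
    nearest observed value, which is how np.interp treats points
    outside the observed range."""
    y = np.array(series, dtype=float)  # copy of the input
    missing = np.isnan(y)
    if missing.all():
        return y  # nothing observed, nothing to interpolate from
    t = np.arange(len(y))
    y[missing] = np.interp(t[missing], t[~missing], y[~missing])
    return y
```

For a matrix of readings, the same function could be applied along rows or along columns (for example on the transposed matrix), depending on which direction carries the stronger time-series structure.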
Funding: Supported by the Fundamental Research Funds for the Central Universities (17CX02035A); NNSF of China (11601197, 11461029, 61563018); China Postdoctoral Science Foundation funded project (2016M600511, 2017T100475); NSF of Jiangxi Province (20171ACB21030, 20161BAB201024, 20161ACB200009); and the Key Science Fund Project of Jiangxi Provincial Education Department (GJJ150439).
Funding: Project supported by the State Key Program of the National Natural Science Foundation of China (Grant No. 60835004); the Natural Science Foundation of Jiangsu Province of China (Grant No. BK2009727); the Natural Science Foundation of Higher Education Institutions of Jiangsu Province of China (Grant No. 10KJB510004); and the National Natural Science Foundation of China (Grant No. 61075028).
Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 61772256 and 61921006.
Funding: Supported by the State Grid Power Company of Hunan Province Science and Technology Project (No. 5216A517000U).