This paper simultaneously investigates variable selection and imputation estimation for a semiparametric partially linear varying-coefficient model in the case where responses are missing in clustered data. As is well known, a commonly used approach to dealing with missing data is complete-case analysis. Combining the complete-case idea, shrinkage estimation is discussed separately for the different clusters. To avoid biased results and improve estimation efficiency, this article introduces the Group Least Absolute Shrinkage and Selection Operator (Group Lasso) into the semiparametric model; that is, the method combines local polynomial smoothing with the Lasso, so that nonparametric estimation and variable selection can be carried out in a computationally efficient manner. Under the same criterion, the parametric estimators are also obtained. Additionally, the nonparametric and parametric estimators are derived for each cluster, and their weighted average across clusters is taken as the final estimator. Moreover, the large-sample properties of the estimators are derived.
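The groupwise shrinkage at the heart of Group Lasso can be illustrated by its proximal (soft-thresholding) operator: each coefficient group is shrunk toward zero in Euclidean norm and dropped entirely when its norm falls below the penalty level. This is a generic sketch of that standard operator, not the authors' estimator; the function name and numbers are made up:

```python
from math import sqrt

def group_soft_threshold(beta_groups, lam):
    """Group Lasso proximal step: shrink each group g by the factor
    max(0, 1 - lam / ||g||), which zeroes out whole groups at once."""
    out = []
    for g in beta_groups:
        norm = sqrt(sum(b * b for b in g))
        scale = max(0.0, 1.0 - lam / norm) if norm > 0 else 0.0
        out.append([scale * b for b in g])
    return out
```

A group with norm 5 is shrunk proportionally, while a group with norm below `lam` is removed entirely, which is exactly how Group Lasso performs variable selection at the group level.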
This paper deals with estimation and test procedures for restricted linear errors-in-variables (EV) models with nonignorable missing covariates. We develop a restricted weighted corrected least squares (WCLS) estimator based on the propensity score, which is fitted by an exponentially tilted likelihood method. The limiting distributions of the proposed estimators are discussed when the tilting parameter is known or unknown. To test the validity of the constraints, we construct two test procedures based on the corrected residual sum of squares and the empirical likelihood method, and derive their asymptotic properties. Numerical studies are conducted to examine the finite-sample performance of the proposed methods.
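One ingredient of such estimators is weighting complete cases by the inverse of their estimated observation probability (the propensity score). The sketch below shows only that inverse-probability-weighting step for ordinary least squares, not the paper's full WCLS estimator (which additionally corrects for measurement error and restrictions); all names and data are illustrative:

```python
import numpy as np

def ipw_wls(X, y, observed, propensity):
    """Inverse-probability-weighted least squares: each complete case gets
    weight 1/propensity, incomplete cases (observed == 0) get weight 0."""
    w = observed / propensity
    sw = np.sqrt(w)                       # sqrt-weight trick for lstsq
    beta, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return beta
```

With all cases observed and a constant propensity, the weights cancel and the estimator reduces to ordinary least squares, which is a useful sanity check.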
Objective: With the goal of improving health-related quality of life (HRQOL) in cancer patients, we previously reported a structural equation model (SEM) of subjective QOL and the qualifications of pharmacists, based on a series of questionnaires completed by patients and pharmacists. However, several patients and pharmacists were excluded from the previous study because it was not always possible to obtain all the data intended for collection. To reveal the effect of missing data on the SEM, we established SEMs of HRQOL and the competency of pharmacists, using correlation matrices derived by two different statistical methods for handling missing data. Method: Fifteen patients hospitalized for cancer who were receiving opioid analgesics for pain control, and eight pharmacists, were enrolled in this study. Each subject was asked four times weekly to answer the questions in a questionnaire. SEMs were explored using two correlation matrices derived with pair-wise deletion (PD matrix) and list-wise deletion (LD matrix). The final models were statistically evaluated against standard goodness-of-fit criteria. Results: Data were intended to be collected four times weekly for each patient, but there were some missing values. The same SEMs for HRQOL were optimized using both the LD and PD matrices. Although the path diagrams of the SEMs were not identical for the "competency of pharmacists," both models suggested that higher pharmacist competency lowered the "severity" of the condition and increased the "comfort" of patients, resulting in an increase in subjective QOL. Conclusion: In collecting data for clinical research, missing values are unavoidable.
When the structure of the model was sufficiently robust, the missing data had only a minor effect on our SEM of QOL. In QOL research, both the LD matrix and the PD matrix would be effective, provided the model is sufficiently robust.
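The two deletion schemes compared above differ only in which rows enter each correlation: list-wise deletion keeps only fully observed rows for the whole matrix, while pair-wise deletion uses, for each variable pair, every row where both members are observed. A minimal illustrative sketch (not the study's code; the toy data are made up):

```python
def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(xs, ys))
    sxx = sum((a - mx) ** 2 for a in xs)
    syy = sum((b - my) ** 2 for b in ys)
    return sxy / (sxx * syy) ** 0.5

def corr_matrix(data, method):
    """Correlation matrix with `None` marking missing values.
    method: 'listwise' drops any row with a missing value up front;
    'pairwise' drops rows per variable pair."""
    p = len(data[0])
    if method == "listwise":
        data = [r for r in data if all(v is not None for v in r)]
    out = [[1.0] * p for _ in range(p)]
    for i in range(p):
        for j in range(i + 1, p):
            pairs = [(r[i], r[j]) for r in data
                     if r[i] is not None and r[j] is not None]
            out[i][j] = out[j][i] = pearson([a for a, _ in pairs],
                                            [b for _, b in pairs])
    return out
```

On complete data the two methods coincide; with missing values, the PD matrix retains more information per pair but need not be positive semi-definite, which is one reason the comparison in the study is worthwhile.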
In this paper, we focus on a type of inverse problem in which the data are expressed as an unknown function of the sought and unknown model function (or its discretised representation as a model parameter vector). In particular, we deal with situations in which training data are not available, so that we cannot model the unknown functional relationship between the data and the unknown model function (or parameter vector) with a Gaussian Process of appropriate dimensionality. A Bayesian method based on state space modelling is advanced instead. Within this framework, the likelihood is expressed in terms of the probability density function (pdf) of the state space variable, and the sought model parameter vector is embedded within the domain of this pdf. As the measurable vector lives only inside an identified sub-volume of the system state space, the pdf of the state space variable is projected onto the space of the measurables, and it is in terms of this projected state space density that the likelihood is written; the final form of the likelihood is achieved after convolution with the distribution of measurement errors. Application-motivated vague priors are invoked, and the posterior probability density of the model parameter vectors, given the data, is computed. Inference is performed by taking posterior samples with adaptive MCMC.The method is illustrated on synthetic as well as real galactic data.
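Posterior sampling of the kind described can be sketched with a plain random-walk Metropolis sampler (the paper uses an adaptive variant, which additionally tunes the proposal scale on the fly; that tuning is omitted here). The target below is a stand-in Gaussian log-posterior, not the paper's projected state-space density:

```python
import math, random

def metropolis(logpdf, x0, n, step=1.0, seed=0):
    """Random-walk Metropolis: propose x + N(0, step^2), accept with
    probability min(1, pi(prop)/pi(x))."""
    rng = random.Random(seed)
    x, lp = x0, logpdf(x0)
    samples = []
    for _ in range(n):
        prop = x + rng.gauss(0.0, step)
        lp_prop = logpdf(prop)
        if math.log(rng.random()) < lp_prop - lp:
            x, lp = prop, lp_prop
        samples.append(x)
    return samples

# Stand-in posterior ~ N(2, 1) for a single model parameter
samples = metropolis(lambda t: -0.5 * (t - 2.0) ** 2, 0.0, 20000)
```

After discarding a burn-in, the sample mean and variance should match the target's moments, which is the basic correctness check for any such sampler.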
The absence of some data values in any observed dataset has been a real hindrance to achieving valid results in statistical research. This paper is aimed at the widespread missing-data problem faced by analysts and statisticians in academia and professional environments. Several data-driven methods were studied to obtain accurate data. Projects that rely heavily on data face this missing-data problem, and since machine learning models are only as good as the data used to train them, the missing-data problem has a real impact on the solutions developed for real-world problems. Therefore, this dissertation attempts to solve the problem using different mechanisms, by testing the effectiveness of both traditional and modern data imputation techniques and determining the loss of statistical power when the different approaches are used. At the end of this research dissertation, it should be easy to establish which methods are best for handling the research problem.
Multivariate Imputation by Chained Equations (MICE) is recommended as the best approach to dealing with MAR missingness.
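The core loop of chained-equations imputation is: initialise missing entries crudely, then repeatedly regress each incomplete variable on the others and refill its missing cells from the fitted model. The sketch below is a deterministic simplification (proper MICE draws from predictive distributions and produces multiple imputed datasets); function name and data are illustrative:

```python
import numpy as np

def chained_impute(X, n_iter=5):
    """Chained-equations-style single imputation.
    X: 2-D float array with np.nan marking missing values."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # crude initialisation
    for _ in range(n_iter):
        for j in range(X.shape[1]):
            if not miss[:, j].any():
                continue
            obs = ~miss[:, j]
            # regress column j on an intercept plus the other columns
            A = np.column_stack([np.ones(X.shape[0]), np.delete(X, j, axis=1)])
            coef, *_ = np.linalg.lstsq(A[obs], X[obs, j], rcond=None)
            X[miss[:, j], j] = A[miss[:, j]] @ coef
    return X
```

When the incomplete variable is an exact linear function of the others, the regression step recovers the missing value exactly, which makes the mechanism easy to verify.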
The prevalence of a disease in a population is defined as the proportion of people who are infected. Selection bias in disease prevalence estimates occurs if non-participation in testing is correlated with disease status. Missing data are commonly encountered in most medical research; unfortunately, they are often neglected or improperly handled during analysis, which may substantially bias the results of a study, reduce its power, and lead to invalid conclusions. The goal of this study is to illustrate how to estimate prevalence in the presence of missing data. We consider the case where the variable of interest (the response) is binary and partially missing, and assume that all covariates are fully observed. For binary data, the statistic of interest is usually the prevalence. We develop a two-stage approach to improve the prevalence estimate: in the first stage, we use a logistic regression model to predict the missing binary observations, and in the second stage we recalculate the prevalence using the observed data together with the imputed values. Such a model is of particular interest in HIV/AIDS research, where people often refuse to donate blood for testing yet are willing to provide other covariates. The method is illustrated on simulated data and applied to HIV/AIDS data from the Kenya AIDS Indicator Survey, 2007.
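The second stage of the two-stage approach reduces to a simple formula: combine the observed binary outcomes with the model-predicted probabilities for the missing ones. The sketch below shows only that stage, assuming the stage-one logistic fit has already produced the predicted probabilities; names and numbers are made up:

```python
def two_stage_prevalence(y_observed, p_missing):
    """Stage-2 prevalence estimate: observed 0/1 outcomes plus
    predicted probabilities (from a stage-1 logistic model) for the
    non-respondents, divided by the full sample size."""
    n = len(y_observed) + len(p_missing)
    return (sum(y_observed) + sum(p_missing)) / n
```

Note that the complete-case estimate here would be 3/4 = 0.75, while the two-stage estimate pulls it toward the predicted status of the non-respondents.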
Irregular seismic data cause problems for multi-trace processing algorithms and degrade processing quality. We introduce the Projection onto Convex Sets (POCS) image restoration method into seismic data reconstruction to interpolate irregularly missing traces. For entirely dead traces, we transfer the POCS iterative reconstruction from the time domain to the frequency domain to save computational cost, because forward and inverse Fourier time transforms are then not needed. In each iteration, the choice of the threshold parameter is important for reconstruction efficiency. In this paper, we design two types of threshold models to reconstruct irregularly missing seismic data. The experimental results show that, for the same reconstruction quality, an exponential threshold greatly reduces the number of iterations and improves reconstruction efficiency compared with a linear threshold. We also analyze the anti-noise and anti-alias ability of the POCS reconstruction method. Finally, theoretical model tests and real data examples indicate that the proposed method is efficient and applicable.
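A POCS iteration of this kind alternates two projections: threshold the Fourier coefficients, then re-insert the observed samples. The sketch below applies the exponentially decaying threshold the abstract favours, on a 1-D toy signal rather than seismic traces; it is an illustration of the scheme, not the authors' implementation:

```python
import numpy as np

def pocs_reconstruct(d_obs, mask, n_iter=50, lam_max=None, lam_min=1e-3):
    """POCS interpolation: Fourier-domain hard threshold with an
    exponentially decaying level, followed by projection onto the data."""
    x = d_obs.copy()
    if lam_max is None:
        lam_max = np.abs(np.fft.fft(d_obs)).max()
    for k in range(n_iter):
        # threshold decays exponentially from lam_max to lam_min
        lam = lam_max * (lam_min / lam_max) ** (k / (n_iter - 1))
        F = np.fft.fft(x)
        F[np.abs(F) < lam] = 0.0              # hard threshold
        x = np.real(np.fft.ifft(F))
        x[mask] = d_obs[mask]                 # re-insert observed samples
    return x
```

For a sparse-spectrum signal with a few irregularly missing samples, this loop fills the gaps far more accurately than zero-filling, which mirrors the trace-interpolation use case.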
Seismic data reconstruction is an essential step in the seismic data processing workflow, and it is of profound significance for improving migration imaging quality, multiple suppression, and seismic inversion accuracy. Regularization methods play a central role in solving the underdetermined inverse problem of seismic data reconstruction. In this paper, a novel regularization approach, the low dimensional manifold model (LDMM), is proposed for reconstructing missing seismic data. Our work relies on the fact that seismic patches always occupy a low dimensional manifold. Specifically, we use the dimension of the seismic patch manifold as a regularization term in the reconstruction problem, and reconstruct the missing seismic data by enforcing low dimensionality on this manifold. The crucial step of the proposed method is computing the dimension of the patch manifold. To this end, we adopt an efficient dimensionality calculation method based on low-rank approximation, which provides a reliable safeguard for enforcing the constraints during reconstruction. Numerical experiments on synthetic and field seismic data demonstrate that, compared with the curvelet-based sparsity-promoting L1-norm minimization method and the multichannel singular spectrum analysis method, the proposed method obtains state-of-the-art reconstruction results.
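The low-rank approximation underpinning the dimensionality calculation is the truncated SVD: keep the r largest singular values of the patch matrix and discard the rest. A minimal generic sketch (not the LDMM algorithm itself):

```python
import numpy as np

def low_rank_approx(P, r):
    """Best rank-r approximation of the patch matrix P in the Frobenius
    norm, via the truncated singular value decomposition."""
    U, s, Vt = np.linalg.svd(P, full_matrices=False)
    return (U[:, :r] * s[:r]) @ Vt[:r]
```

A rank-1 matrix is reproduced exactly at r = 1, and for a diagonal matrix the truncation simply keeps the largest diagonal entries, which makes the operator easy to verify.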
The generalized linear model is an indispensable tool for analyzing non-Gaussian response data, with both canonical and non-canonical link functions in widespread use. When missing values are present, many existing methods in the literature depend heavily on an unverifiable assumption about the missing data mechanism, and they fail when that assumption is violated. This paper proposes a missing data mechanism that is as generally applicable as possible, covering both ignorable and nonignorable missing data, as well as missing values in either the response or the covariates. Under this general mechanism, the authors adopt an approximate conditional likelihood method to estimate the unknown parameters, and rigorously establish the regularity conditions under which the parameters are identifiable under this approach. For identifiable parameters, the authors prove the asymptotic normality of the estimators obtained by maximizing the approximate conditional likelihood. Simulation studies evaluate the finite-sample performance of the proposed estimators alongside estimators from several existing methods. Finally, the authors present a biomarker analysis in a prostate cancer study to illustrate the proposed method.
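As background for the GLM setting, maximum likelihood for a canonical-link model is routinely computed by Newton-Raphson (equivalently, iteratively reweighted least squares). The sketch below fits a complete-data logistic GLM; it illustrates the standard likelihood machinery only, not the paper's approximate conditional likelihood for missing data:

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson / IRLS for a logistic GLM with canonical link:
    beta <- beta + H^{-1} grad, with grad = X'(y - p), H = X'WX."""
    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-X @ beta))
        W = p * (1.0 - p)
        grad = X.T @ (y - p)
        H = X.T @ (X * W[:, None])
        beta = beta + np.linalg.solve(H, grad)
    return beta
```

At convergence the score X'(y - p) vanishes, which is the property the test checks; non-separable data keep the Hessian well-conditioned.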
This paper discusses the maximum likelihood estimation of β under the linear inequality constraints A0β ≥ a in a linear model with missing data, proposes the restricted EM algorithm, and proves its convergence.
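The flavour of a restricted EM iteration can be shown in the simplest special case: estimating a scalar normal mean subject to a one-sided constraint when some observations are missing completely at random. This toy reduction is mine, not the paper's algorithm for general A0β ≥ a:

```python
def restricted_em_mean(observed, n_missing, lower, n_iter=100, tol=1e-10):
    """Toy restricted EM for a normal mean mu subject to mu >= lower.
    E-step: fill the missing values with the current mu.
    Restricted M-step: update the mean and project it onto the
    constraint set (here, clip at `lower`)."""
    mu = sum(observed) / len(observed)
    n = len(observed) + n_missing
    for _ in range(n_iter):
        filled_sum = sum(observed) + n_missing * mu    # E-step
        mu_new = max(lower, filled_sum / n)            # restricted M-step
        if abs(mu_new - mu) < tol:
            break
        mu = mu_new
    return mu
```

When the constraint is inactive the iteration settles at the observed-data mean; when it binds, the estimate sits on the boundary, exactly as a restricted MLE should.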
A control valve is one of the most widely used machines in hydraulic systems. However, it often works in harsh environments, and failures occur from time to time. An intelligent and robust control valve fault diagnosis is therefore important for the operation of the system. In this study, a fault diagnosis based on mathematical model (MM) imputation and a modified deep residual shrinkage network (MDRSN) is proposed to address the problem that data-driven models for control valves are susceptible to changing operating conditions and missing data. Multiple fault time-series samples of the control valve at different openings are collected to verify the effectiveness of the proposed method. The effects of the proposed method on missing data imputation and fault diagnosis are analyzed. Compared with random and k-nearest-neighbor (KNN) imputation, the accuracy of MM-based imputation is improved by 17.87% and 21.18%, respectively, at a 20.00% data missing rate with valve openings from 10% to 28%. Furthermore, the results show that the proposed MDRSN maintains high fault diagnosis accuracy with missing data.
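For context, the KNN imputation baseline the study compares against fills a missing feature with the average of that feature over the k samples closest in the observed features (the study's MM imputation instead uses a physical valve model, which is not reproduced here). A generic sketch with made-up data:

```python
def knn_impute_value(rows, target, j, k):
    """Impute feature j of `target` as the mean of feature j over the
    k complete rows nearest to `target` in the remaining features."""
    def dist(r):
        return sum((r[i] - target[i]) ** 2
                   for i in range(len(r)) if i != j)
    nearest = sorted(rows, key=dist)[:k]
    return sum(r[j] for r in nearest) / k
```

Because the distance skips the missing feature itself, the target row can be compared against complete rows even though its feature j is unknown.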
In this paper, the AMSAA-BISE model with missing data is discussed. The ML estimates of the model parameters and the current MTBF are given, and a chi-squared test and a plot of the cumulative number of failures versus cumulative testing time are used to test the goodness of fit of the model. The paper concludes with a numerical example to verify the model.
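For reference, the standard complete-data ML estimates for the AMSAA (Crow) power-law NHPP, which the paper extends to the missing-data case, are beta-hat = n / sum(ln(T/t_i)) and lambda-hat = n / T^beta, with the current (instantaneous) MTBF the reciprocal of the intensity at T. A sketch of those textbook formulas, with made-up failure times:

```python
import math

def amsaa_mle(failure_times, T):
    """Complete-data MLEs for the AMSAA/Crow model with intensity
    lambda * beta * t**(beta - 1), observed on (0, T]."""
    n = len(failure_times)
    beta = n / sum(math.log(T / t) for t in failure_times)
    lam = n / T ** beta
    mtbf = 1.0 / (lam * beta * T ** (beta - 1))   # current MTBF at time T
    return beta, lam, mtbf
```

A beta below 1 indicates reliability growth (the failure intensity is decreasing), which is the usual reading of the fitted model in a growth test.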
Challenges in Big Data analysis arise from the way data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely the Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge in this study is a lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. Merging the data sources produced a structured but noisy input file with inconsistencies and redundancies. Here, it is shown that BRT can process different data granularities, heterogeneous data, and missingness. In particular, BRT deals with missing data by default, by allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and the Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms them in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real-world scenarios; for example, a single BRT or an ensemble of BRTs could be tested alongside existing models to improve results for a wide range of data-driven decisions and applications.
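The "split on whether a value is missing" behaviour can be pictured as a single tree node with three outcomes: a missing branch, a left branch, and a right branch. This is a hand-rolled illustration of that idea, not the BRT library's internals; the split tuple and values are made up:

```python
def stump_predict(x, split):
    """One tree node in the style of BRT's default missing-value handling:
    'missing' gets its own branch alongside the numeric threshold.
    split = (feature index, threshold, value if missing,
             value if x <= threshold, value if x > threshold)."""
    feature, threshold, val_missing, val_left, val_right = split
    v = x[feature]
    if v is None:
        return val_missing
    return val_left if v <= threshold else val_right
```

Because missingness is routed to its own branch, rows with absent values contribute to the fit instead of being deleted, which is why BRT tolerates missingness by default.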
In this paper, we present a variable selection procedure that combines basis function approximations with penalized estimating equations for semiparametric varying-coefficient partially linear models with responses missing at random. The proposed procedure simultaneously selects significant variables in the parametric and nonparametric components. With an appropriate choice of the tuning parameters, we establish the consistency of the variable selection procedure and the convergence rate of the regularized estimators. A simulation study assesses the finite-sample performance of the proposed procedure.
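The basis-function approximation step can be sketched for a single varying coefficient: expand alpha(u) in a finite basis and estimate the basis coefficients by penalized least squares. The sketch substitutes a polynomial basis and a ridge penalty for the paper's spline basis and selection-oriented penalty; all names and data are illustrative:

```python
import numpy as np

def fit_varying_coefficient(u, x, y, degree=3, lam=1e-8):
    """Basis-function approximation of alpha(u) in y = alpha(u) * x + noise:
    expand alpha in a polynomial basis evaluated at u and solve a
    ridge-penalized least squares problem for the basis coefficients."""
    B = np.vander(u, degree + 1, increasing=True)   # basis at the index points
    Z = B * x[:, None]                              # design for alpha(u) * x
    coef = np.linalg.solve(Z.T @ Z + lam * np.eye(degree + 1), Z.T @ y)
    return lambda u_new: np.vander(np.atleast_1d(u_new),
                                   degree + 1, increasing=True) @ coef
```

When the true alpha(u) lies in the span of the basis, the fitted coefficient function is recovered almost exactly, which is the point of the approximation step.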
Funding: Supported by the Zhejiang Provincial Natural Science Foundation of China (LY15A010019) and the National Natural Science Foundation of China (11501250).
Funding: Financially supported by the National 863 Program (Grant No. 2006AA09A102-09) and the National Science and Technology Major Projects (Grant No. 2008ZX05025-001-001).
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 41874146 and 42030103) and the Postgraduate Innovation Project of China University of Petroleum (East China) (No. YCX2021012).
Funding: Supported by the Chinese 111 Project B14019, the US National Science Foundation under Grant Nos. DMS-1305474 and DMS-1612873, and the US National Institutes of Health Award UL1TR001412.
Funding: We would like to thank the referees for many useful suggestions on an earlier draft of the manuscript. This work was supported by the National Natural Science Foundation of China (Grant Nos. 10431010, 10329102 & 10371015), the Science and Technology Keystone Fund of MOE, China (Grant Nos. 104070 & 00041), EYTP, the Distinguished Young Scholars Science Research Program of Jilin Province (Grant No. 20030113), and the Young Teacher's Foundation of Northeast Normal University, China.
Abstract: This paper discusses the maximum likelihood estimate of β under the linear inequality constraints A₀β ≥ a in a linear model with missing data, proposes a restricted EM algorithm, and proves its convergence.
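The following is a minimal sketch of such a restricted EM iteration, under the simplifying assumption that only responses are missing: the E-step fills each missing y_i with its conditional mean x_i'β, and the M-step solves least squares subject to Aβ ≥ a. The constraint matrix, missing pattern, and SLSQP solver are illustrative choices; the convergence proof is the paper's contribution, not demonstrated here.

```python
import numpy as np
from scipy.optimize import minimize

def restricted_em(X, y, A, a, n_iter=30):
    """Toy restricted EM for a linear model with missing responses."""
    miss = np.isnan(y)
    beta = np.zeros(X.shape[1])
    cons = [{'type': 'ineq', 'fun': lambda b: A @ b - a}]  # enforces A b >= a
    for _ in range(n_iter):
        y_fill = np.where(miss, X @ beta, y)                      # E-step
        beta = minimize(lambda b: np.sum((y_fill - X @ b) ** 2),  # M-step
                        beta, constraints=cons).x
    return beta

rng = np.random.default_rng(2)
X = rng.standard_normal((300, 2))
y = X @ np.array([1.0, -0.5]) + 0.1 * rng.standard_normal(300)
y[rng.random(300) < 0.3] = np.nan                 # 30% missing responses
A, a = np.eye(2), np.array([0.0, -1.0])           # beta1 >= 0, beta2 >= -1
beta_hat = restricted_em(X, y, A, a)
```

When the constraints are inactive, as in this example, the fixed point coincides with the complete-case least squares estimate; the constrained M-step matters when the unconstrained solution would violate A₀β ≥ a.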
Funding: Supported by the National Natural Science Foundation of China (No. 51875113), the Natural Science Joint Guidance Foundation of Heilongjiang Province of China (No. LH2019E027), and the PhD Student Research and Innovation Fund of the Fundamental Research Funds for the Central Universities (No. XK2070021009), China.
Abstract: A control valve is one of the most widely used machines in hydraulic systems. However, it often works in harsh environments, and failures occur from time to time. Intelligent and robust control valve fault diagnosis is therefore important for operation of the system. In this study, a fault diagnosis based on mathematical model (MM) imputation and a modified deep residual shrinkage network (MDRSN) is proposed to solve the problem that data-driven models for control valves are susceptible to changing operating conditions and missing data. Multiple fault time-series samples of the control valve at different openings are collected to verify the effectiveness of the proposed method, and its performance in missing data imputation and fault diagnosis is analyzed. Compared with random and k-nearest neighbor (KNN) imputation, the accuracy of MM-based imputation is improved by 17.87% and 21.18%, respectively, at a 20.00% data missing rate and valve openings from 10% to 28%. Furthermore, the results show that the proposed MDRSN can maintain high fault diagnosis accuracy with missing data.
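The KNN baseline mentioned above is easy to sketch. The MM-based imputation in the paper relies on a physical model of the valve, which is not reproduced here; the sketch below only shows why KNN imputation beats plain column-mean imputation on correlated sensor channels, with sizes and noise levels as illustrative assumptions.

```python
import numpy as np

def knn_impute(X, k=5):
    """Fill the NaNs in each row with the column-wise mean of the k
    complete rows closest to it on the row's observed columns."""
    X = X.copy()
    complete = X[~np.isnan(X).any(axis=1)]
    for i in range(X.shape[0]):
        miss = np.isnan(X[i])
        if not miss.any():
            continue
        d = np.linalg.norm(complete[:, ~miss] - X[i, ~miss], axis=1)
        neighbours = complete[np.argsort(d)[:k]]
        X[i, miss] = neighbours[:, miss].mean(axis=0)
    return X

# three strongly correlated channels, 20% of entries missing at random
rng = np.random.default_rng(3)
base = rng.standard_normal(400)
true = np.column_stack([base, base + 0.05 * rng.standard_normal(400),
                        -base + 0.05 * rng.standard_normal(400)])
X = true.copy()
X[rng.random(X.shape) < 0.2] = np.nan
hole = np.isnan(X)
rmse_knn = np.sqrt(np.mean((knn_impute(X)[hole] - true[hole]) ** 2))
col_mean = np.nanmean(X, axis=0)
rmse_mean = np.sqrt(np.mean((np.where(hole, col_mean, X)[hole] - true[hole]) ** 2))
```

KNN exploits the cross-channel correlation that mean imputation discards, which is the same intuition that lets a physical model do even better when one is available.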
Abstract: In this paper, the AMSAA-BISE model with missing data is discussed. The ML estimates of the model parameters and the current MTBF are given, and a chi-squared test, together with a plot of cumulative number of failures versus cumulative testing time, is used to assess the goodness of fit of the model. The paper concludes with a numerical example to verify the model.
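For reference, the complete-data building block is the classical AMSAA (Crow) power-law NHPP, whose ML estimates and current MTBF have closed forms; the paper's AMSAA-BISE extension for missing data modifies these. The sketch below is the standard time-truncated case only, with simulation sizes as illustrative assumptions.

```python
import numpy as np

def amsaa_mle(times, T):
    """ML estimates for the complete-data, time-truncated AMSAA/Crow
    power-law NHPP with intensity lambda * beta * t**(beta - 1):
    returns (beta_hat, lambda_hat, current MTBF at time T)."""
    times = np.asarray(times, dtype=float)
    n = times.size
    beta = n / np.sum(np.log(T / times))
    lam = n / T ** beta
    mtbf = 1.0 / (lam * beta * T ** (beta - 1))  # instantaneous MTBF at T
    return beta, lam, mtbf

# simulate failure times: given N(T) = n, the times are order statistics
# of F(t) = (t / T) ** beta, i.e. t = T * U ** (1 / beta)
rng = np.random.default_rng(4)
beta_true, T, n = 0.7, 1000.0, 2000
times = np.sort(T * rng.random(n) ** (1 / beta_true))
beta_hat, lam_hat, mtbf_hat = amsaa_mle(times, T)
```

A shape parameter β < 1 indicates reliability growth (decreasing failure intensity), and the current MTBF simplifies to T / (nβ̂) at the truncation time.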
Abstract: Challenges in Big Data analysis arise from the way the data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely the Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge of this study is lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. By merging the data sources, a structured but noisy input file, showing inconsistencies and redundancies, was created. Here, it is shown that BRT can process different data granularities, heterogeneous data, and missingness. In particular, BRT handles missing data by default, by allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and the Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real world scenarios. For example, a single or ensemble BRT approach could be tested alongside existing models in order to improve results for a wide range of data-driven decisions and applications.
Funding: Supported by the National Natural Science Foundation of China (Grant No. 10871013), the Natural Science Foundation of Beijing (Grant No. 1072004), and the Natural Science Foundation of Guangxi Province (Grant No. 2010GXNSFB013051).
Abstract: In this paper, we present a variable selection procedure that combines basis function approximations with penalized estimating equations for semiparametric varying-coefficient partially linear models with responses missing at random. The proposed procedure simultaneously selects significant variables in the parametric and nonparametric components. With appropriate selection of the tuning parameters, we establish the consistency of the variable selection procedure and the convergence rate of the regularized estimators. A simulation study is undertaken to assess the finite sample performance of the proposed variable selection procedure.
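The basis-expansion-plus-penalty idea can be sketched as follows: expand each varying coefficient β_j(u) in a small basis, and penalize the whole coefficient block of each covariate so that irrelevant covariates drop out as a group. As an illustrative stand-in for the paper's penalized estimating equations, the sketch uses a group-lasso penalty solved by proximal gradient, a polynomial basis {1, u, u²}, and complete responses.

```python
import numpy as np

def block_soft(v, t):
    """Group-wise soft-thresholding: shrinks the whole block toward zero,
    setting it exactly to zero when its norm falls below the threshold."""
    nrm = np.linalg.norm(v)
    return np.zeros_like(v) if nrm <= t else (1 - t / nrm) * v

def group_lasso(X, y, groups, lam, n_iter=500):
    """Proximal gradient for 0.5/n * ||y - X b||^2 + lam * sum_g ||b_g||_2."""
    n, p = X.shape
    step = n / np.linalg.norm(X, 2) ** 2   # 1 / Lipschitz constant of the grad
    b = np.zeros(p)
    for _ in range(n_iter):
        b = b - step * (X.T @ (X @ b - y) / n)
        for g in groups:
            b[g] = block_soft(b[g], step * lam)
    return b

# varying-coefficient toy: y = (1 + u) * x1 + 0 * x2 + 0 * x3 + noise,
# each beta_j(u) expanded in the polynomial basis {1, u, u^2}
rng = np.random.default_rng(6)
n = 500
u = rng.random(n)
Xraw = rng.standard_normal((n, 3))
y = (1 + u) * Xraw[:, 0] + 0.1 * rng.standard_normal(n)
basis = np.column_stack([np.ones(n), u, u ** 2])
X = np.hstack([Xraw[:, [j]] * basis for j in range(3)])   # 9 columns
groups = [[0, 1, 2], [3, 4, 5], [6, 7, 8]]                # one group per x_j
b = group_lasso(X, y, groups, lam=0.05)
```

Only the block belonging to x1 survives the penalty, which is the sense in which the procedure "simultaneously selects" variables in the nonparametric component: a covariate is kept or dropped with its entire coefficient function.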