This paper investigates the tolerable sample size needed for Ordinary Least Square (OLS) Estimator to be used when there is presence of Multicollinearity among the exogenous variables of a linear regression model. A r...This paper investigates the tolerable sample size needed for Ordinary Least Square (OLS) Estimator to be used when there is presence of Multicollinearity among the exogenous variables of a linear regression model. A regression model with constant term (β0) and two independent variables (with β1 and β2 as their respective regression coefficients) that exhibit multicollinearity was considered. A Monte Carlo study of 1000 trials was conducted at eight levels of multicollinearity (0, 0.25, 0.5, 0.7, 0.75, 0.8, 0.9 and 0.99) and sample sizes (10, 20, 40, 80, 100, 150, 250 and 500). At each specification, the true regression coefficients were set at unity while 1.5, 2.0 and 2.5 were taken as the hypothesized value. The power value rate was obtained at every multicollinearity level for the aforementioned sample sizes. Therefore, whether the hypothesized values highly depart from the true values or not once the multicollinearity level is very high (i.e. 0.99), the sample size needed to work with in order to have an error free estimation or the inference result must be greater than five hundred.展开更多
Heteroscedasticity and multicollinearity are serious problems when they exist in econometrics data. These problems exist as a result of violating the assumptions of equal variance between the error terms and that of i...Heteroscedasticity and multicollinearity are serious problems when they exist in econometrics data. These problems exist as a result of violating the assumptions of equal variance between the error terms and that of independence between the explanatory variables of the model. With these assumption violations, Ordinary Least Square Estimator</span><span style="font-family:""> </span><span style="font-family:""><span style="font-family:Verdana;">(OLS) will not give best linear unbiased, efficient and consistent estimator. In practice, there are several structures of heteroscedasticity and several methods of heteroscedasticity detection. For better estimation result, best heteroscedasticity detection methods must be determined for any structure of heteroscedasticity in the presence of multicollinearity between the explanatory variables of the model. In this paper we examine the effects of multicollinearity on type I error rates of some methods of heteroscedasticity detection in linear regression model in other to determine the best method of heteroscedasticity detection to use when both problems exist in the model. Nine heteroscedasticity detection methods were considered with seven heteroscedasticity structures. Simulation study was done via a Monte Carlo experiment on a multiple linear regression model with 3 explanatory variables. This experiment was conducted 1000 times with linear model parameters of </span><span style="white-space:nowrap;"><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">0</span></sub><span style="font-family:Verdana;"> = 4 , </span><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">1</span></sub><span style="font-family:Verdana;"> = 0.4 , </span><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">2</span></sub><span style="font-family:Verdana;">= 1.5</span></span></span><span style="font-family:""><span style="font-family:Verdana;"> and </span><em style="font-family:""><span style="font-family:Verdana;">β</span><span style="font-family:Verdana;"><sub>3 </sub></span></em><span style="font-family:Verdana;">= 3.6</span><span style="font-family:Verdana;">. </span><span style="font-family:Verdana;">Five (5) </span><span style="font-family:Verdana;"></span><span style="font-family:Verdana;">levels of</span><span style="white-space:nowrap;font-family:Verdana;"> </span><span style="font-family:Verdana;"></span><span style="font-family:Verdana;">mulicollinearity </span></span><span style="font-family:Verdana;">are </span><span style="font-family:Verdana;">with seven</span><span style="font-family:""> </span><span style="font-family:Verdana;">(7) different sample sizes. The method’s performances were compared with the aids of set confidence interval (C.I</span><span style="font-family:Verdana;">.</span><span style="font-family:Verdana;">) criterion. Results showed that whenever multicollinearity exists in the model with any forms of heteroscedasticity structures, Breusch-Godfrey (BG) test is the best method to determine the existence of heteroscedasticity at all chosen levels of significance.展开更多
Multicollinearity in factor analysis has negative effects, including unreliable factor structure, inconsistent loadings, inflated standard errors, reduced discriminant validity, and difficulties in interpreting factor...Multicollinearity in factor analysis has negative effects, including unreliable factor structure, inconsistent loadings, inflated standard errors, reduced discriminant validity, and difficulties in interpreting factors. It also leads to reduced stability, hindered factor replication, misinterpretation of factor importance, increased parameter estimation instability, reduced power to detect the true factor structure, compromised model fit indices, and biased factor loadings. Multicollinearity introduces uncertainty, complexity, and limited generalizability, hampering factor analysis. To address multicollinearity, researchers can examine the correlation matrix to identify variables with high correlation coefficients. The Variance Inflation Factor (VIF) measures the inflation of regression coefficients due to multicollinearity. Tolerance, the reciprocal of VIF, indicates the proportion of variance in a predictor variable not shared with others. Eigenvalues help assess multicollinearity, with values greater than 1 suggesting the retention of factors. Principal Component Analysis (PCA) reduces dimensionality and identifies highly correlated variables. Other diagnostic measures include the condition number and Cook’s distance. Researchers can center or standardize data, perform variable filtering, use PCA instead of factor analysis, employ factor scores, merge correlated variables, or apply clustering techniques for the solution of the multicollinearity problem. Further research is needed to explore different types of multicollinearity, assess method effectiveness, and investigate the relationship with other factor analysis issues.展开更多
This paper considers the approaches and methods for reducing the influence of multi-collinearity. Great attention is paid to the question of using shrinkage estimators for this purpose. Two classes of regression model...This paper considers the approaches and methods for reducing the influence of multi-collinearity. Great attention is paid to the question of using shrinkage estimators for this purpose. Two classes of regression models are investigated, the first of which corresponds to systems with a negative feedback, while the second class presents systems without the feedback. In the first case the use of shrinkage estimators, especially the Principal Component estimator, is inappropriate but is possible in the second case with the right choice of the regularization parameter or of the number of principal components included in the regression model. This fact is substantiated by the study of the distribution of the random variable , where b is the LS estimate and β is the true coefficient, since the form of this distribution is the basic characteristic of the specified classes. For this study, a regression approximation of the distribution of the event based on the Edgeworth series was developed. Also, alternative approaches are examined to resolve the multicollinearity issue, including an application of the known Inequality Constrained Least Squares method and the Dual estimator method proposed by the author. It is shown that with a priori information the Euclidean distance between the estimates and the true coefficients can be significantly reduced.展开更多
Targeting the multicollinearity problem in dam statistical model and error perturbations resulting from the monitoring process, we built a regularized regression model using Truncated Singular Value Decomposition (T...Targeting the multicollinearity problem in dam statistical model and error perturbations resulting from the monitoring process, we built a regularized regression model using Truncated Singular Value Decomposition (TSVD). An earth-rock dam in China is presented and discussed as an example. The analysis consists of three steps: multicollinearity detection, regularization pa- rameter selection, and crack opening modeling and forecasting. Generalized Cross-Validation (GCV) function and L-curve criterion are both adopted in the regularization parameter selection. Partial Least-Squares Regression (PLSR) and stepwise regression are also included for comparison. The result indicates the TSVD can promisingly solve the multicollinearity problem of dam regression models. However, no general rules are available to make a decision when TSVD is superior to stepwise regression and PLSR due to the regularization parameter-choice problem. Both fitting accuracy and coefficients' reasonability should be considered when evaluating the mode/reliability.展开更多
The prediction accuracy of the traditional stepwise regression prediction equation(SRPE)is affected by the multicollinearity among its predictors.This paper introduces the condition number analysis into the predicti...The prediction accuracy of the traditional stepwise regression prediction equation(SRPE)is affected by the multicollinearity among its predictors.This paper introduces the condition number analysis into the prediction modeling to minimize the multicollinearity in the SRPE.In the condition number prediction modeling,the condition number is used to select the combination of predictors with the lowest multicollinearity from the possible combinations of a number of candidate predictors(variables),and the selected combination is then used to construct the condition number regression prediction equation(CNRPE).This novel prediction modeling is performed in typhoon track prediction,which is a difficult task among meteorological disaster predictions.Six pairs of typhoon track latitude/longitude SRPEs and CNRPEs for July,August,and September are built by employing the traditional and the novel prediction modeling approaches,respectively,and by using a large number of identical modeling samples.The comparative analysis indicates that under the condition of the same candidate predictors(variables)and predictands(dependent variables),although the fitting accuracy of the novel prediction models used for the historical samples of South China Sea(SCS)typhoon tracks is slightly lower than that of the traditional prediction models,the prediction accuracy for the independent samples is obviously improved,with the averaged prediction error of the novel models for July,August,and September being 153.9 kin,which is 75.3 km smaller than that of the traditional models(a reduction of 33%).This is because the novel prediction modeling effectively minimizes the multicollinearity by computation and analysis of the condition number.It is shown further that when F=1.0,2.0,and 3.0,the average prediction errors of the traditional SRPEs are obviously larger than those of the CNRPEs.Moreover,extremely large and unreasonable prediction errors occur at some individual points of the typhoon track predicted by the SRPEs due to the multicollinearity existing in the combination of predictors.展开更多
Mediterranean anemia is a genetic disease that currently relies heavily on expert clinical experience to determine whether patients are affected. This method is overly reliant on expert experience and is not precise e...Mediterranean anemia is a genetic disease that currently relies heavily on expert clinical experience to determine whether patients are affected. This method is overly reliant on expert experience and is not precise enough. This paper proposes two modeling methods to predict whether patients have Mediterranean anemia. The first method involves using Principal Component Analysis (PCA) to reduce the dimensionality of the data, followed by logistic regression modeling (PCA-LR) on the reduced dataset. The second method involves building a Partial Least Squares Regression (PLS) model. Experimental results show that the prediction accuracy of the PCA-LR model is 87.5% (degree = 2, λ=4), and the prediction accuracy of the PLS model is 92.5% (ncomp = 4), indicating good predictive performance of the models.展开更多
The parameter estimation problem in linear model is considered when multicollinearity and outliers exist simultaneously.A class of new estimators,robust general shrunken estimators,are proposed by grafting the robust ...The parameter estimation problem in linear model is considered when multicollinearity and outliers exist simultaneously.A class of new estimators,robust general shrunken estimators,are proposed by grafting the robust estimation techniques philosophy into the biased estimator,and their statistical properties are discussed.By appropriate choices of the shrinking parameter matrix,we obtain many useful and important estimators.A numerical example is used to illustrate that these new estimators can not only effectively overcome difficulty caused by multicollinearity but also resist the influence of outliers.展开更多
Accurate prediction of stem diameter is an important prerequisite of forest management.In this study,an appropriate stem taper function was developed for upper stem diameter estimation of white birch(Betula platyphyll...Accurate prediction of stem diameter is an important prerequisite of forest management.In this study,an appropriate stem taper function was developed for upper stem diameter estimation of white birch(Betula platyphylla Sukaczev)in ten sub-regions of the Daxing’an Mountains,northeast China.Three commonly used taper functions were assessed using a diameter and height dataset comprising 1344 trees.A first-order continuous-time error structure accounted for the inherent autocorrelation.The segmented model of Max and Burkhart(For Sci 22:283–289,1976.https://doi.org/10.1093/fores tscie nce/22.3.283)and the variable exponent taper function of Kozak(For Chron 80:507–515,2004.https://doi.org/10.5558/tfc80507-4)described the data accurately.Owing to its lower multicollinearity,the Max and Burkhart(1976)model is recommended for diameter estimation at specific heights along the stem for the ten sub-regions.After comparison,the Max and Burkhart(1976)model was refitted using nonlinear mixed-effects techniques.Mixed-effects models would be used only when additional upper stem diameter measurements are available for calibration.Differences in region-specific taper functions were indicated by the method of the non-linear extra sum of squares.Therefore,the particular taper function should be adjusted accordingly for each sub-region in the Daxing’an Mountains.展开更多
Suppression effect in multiple regression analysis may be more common in research than what is currently recognized. We have reviewed several literatures of interest which treats the concept and types of suppressor va...Suppression effect in multiple regression analysis may be more common in research than what is currently recognized. We have reviewed several literatures of interest which treats the concept and types of suppressor variables. Also, we have highlighted systematic ways to identify suppression effect in multiple regressions using statistics such as: R2, sum of squares, regression weight and comparing zero-order correlations with Variance Inflation Factor (VIF) respectively. We also establish that suppression effect is a function of multicollinearity;however, a suppressor variable should only be allowed in a regression analysis if its VIF is less than five (5).展开更多
In several LUCC studies, statistical methods are being used to analyze land use data. A problem using conventional statistical methods in land use analysis is that these methods assume the data to be statistically ind...In several LUCC studies, statistical methods are being used to analyze land use data. A problem using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. But in fact, they have the tendency to be dependent, a phenomenon known as multicollinearity, especially in the cases of few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region, meanwhile illustrate that the first PLS factor has enough ability to best describe land use patterns quantitatively, and most of the statistical relations derived from it accord with the fact. By the decreasing capacity of the PLS factors, the reliability of model outcome decreases correspondingly.展开更多
In the presence of multicollinearity in logistic regression, the variance of the Maximum Likelihood Estimator (MLE) becomes inflated. Siray et al. (2015) [1] proposed a restricted Liu estimator in logistic regression ...In the presence of multicollinearity in logistic regression, the variance of the Maximum Likelihood Estimator (MLE) becomes inflated. Siray et al. (2015) [1] proposed a restricted Liu estimator in logistic regression model with exact linear restrictions. However, there are some situations, where the linear restrictions are stochastic. In this paper, we propose a Stochastic Restricted Maximum Likelihood Estimator (SRMLE) for the logistic regression model with stochastic linear restrictions to overcome this issue. Moreover, a Monte Carlo simulation is conducted for comparing the performances of the MLE, Restricted Maximum Likelihood Estimator (RMLE), Ridge Type Logistic Estimator(LRE), Liu Type Logistic Estimator(LLE), and SRMLE for the logistic regression model by using Scalar Mean Squared Error (SMSE).展开更多
The paper introduces a new biased estimator namely Generalized Optimal Estimator (GOE) in a multiple linear regression when there exists multicollinearity among predictor variables. Stochastic properties of proposed e...The paper introduces a new biased estimator namely Generalized Optimal Estimator (GOE) in a multiple linear regression when there exists multicollinearity among predictor variables. Stochastic properties of proposed estimator were derived, and the proposed estimator was compared with other existing biased estimators based on sample information in the the Scalar Mean Square Error (SMSE) criterion by using a Monte Carlo simulation study and two numerical illustrations.展开更多
In linear regression analysis, detecting anomalous observations is an important step for model building process. Various influential measures based on different motivational arguments and designed to measure the influ...In linear regression analysis, detecting anomalous observations is an important step for model building process. Various influential measures based on different motivational arguments and designed to measure the influence of observations on different aspects of various regression results are elucidated and critiqued. The presence of influential observations in the data is complicated by the presence of multicollinearity. In this paper, when Liu estimator is used to mitigate the effects of multicollinearity the influence of some observations can be drastically modified. Approximate deletion formulas for the detection of influential points are proposed for Liu estimator. Two real macroeconomic data sets are used to illustrate the methodologies proposed in this paper.展开更多
Tropospheric ozone (O3) is one of the pollutants that have a significant impact on human health. It can increase the rate of asthma crises, cause permanent lung infections and death. Predicting its concentration level...Tropospheric ozone (O3) is one of the pollutants that have a significant impact on human health. It can increase the rate of asthma crises, cause permanent lung infections and death. Predicting its concentration levels is therefore important for planning atmospheric protection strategies. The aim of this study is to predict the daily mean O3 concentration one day ahead in the Grand Casablanca area of Morocco using primary pollutants and meteorological variables. Since the available explanatory variables are multicollinear, multiple linear regressions are likely to lead to unstable models. To counteract the multicollinearity problem, we compared several alternative regression methods: 1) Continuum Regression;2) Ridge & Lasso Regressions;3) Principal component regression (PCR);4) Partial least Square regression & sparse PLS and;5) Biased Power Regression. The aim is to set up a good prediction model of the daily ozone in the Grand Casablanca area. These models are fitted on a training data set (from the years 2013 and 2014), tested on a data set (from 2015) and validated on yet another data set data (from 2015). The Lasso model showed a better performance for the prediction of ozone concentrations compared to multiple linear regression and its other alternative methods.展开更多
In this paper we compare recently developed preliminary test estimator called Preliminary Test Stochastic Restricted Liu Estimator (PTSRLE) with Ordinary Least Square Estimator (OLSE) and Mixed Estimator (ME) in the M...In this paper we compare recently developed preliminary test estimator called Preliminary Test Stochastic Restricted Liu Estimator (PTSRLE) with Ordinary Least Square Estimator (OLSE) and Mixed Estimator (ME) in the Mean Square Error Matrix (MSEM) sense for the two cases in which the stochastic restrictions are correct and not correct. Finally a numerical example and a Monte Carlo simulation study are done to illustrate the theoretical findings.展开更多
This work is geared towards detecting and solving the problem of multicolinearity in regression analysis. As such, Variance Inflation Factor (VIF) and the Condition Index (CI) were used as measures of such detection. ...This work is geared towards detecting and solving the problem of multicolinearity in regression analysis. As such, Variance Inflation Factor (VIF) and the Condition Index (CI) were used as measures of such detection. Ridge Regression (RR) and the Principal Component Regression (PCR) were the two other approaches used in modeling apart from the conventional simple linear regression. For the purpose of comparing the two methods, simulated data were used. Our task is to ascertain the effectiveness of each of the methods based on their respective mean square errors. From the result, we found that Ridge Regression (RR) method is better than principal component regression when multicollinearity exists among the predictors.展开更多
In order to overcome the well-known multicollinearity problem, we propose a new Stochastic Restricted Liu Estimator in logistic regression model. In the mean square error matrix sense, the new estimation is compared w...In order to overcome the well-known multicollinearity problem, we propose a new Stochastic Restricted Liu Estimator in logistic regression model. In the mean square error matrix sense, the new estimation is compared with the Maximum Likelihood Estimation, Liu Estimator Stochastic Restricted Maximum Likelihood Estimator etc. Finally, a numerical example and a Monte Carlo simulation are given to explain some of the theoretical results.展开更多
Multicollinearity constitutes shared variation among predictors that inflates standard errors of regression coefficients. Several years ago, it was proven that the common practice of mean centering in moderated regres...Multicollinearity constitutes shared variation among predictors that inflates standard errors of regression coefficients. Several years ago, it was proven that the common practice of mean centering in moderated regression cannot alleviate multicollinearity among variables comprising an interaction, but merely masks it. Residual centering (orthogonalizing) is unacceptable because it biases parameters for predictors from which the interaction derives, thus precluding interpretation of moderator effects. I propose and validate residual centering in sequential re-estimations of a moderated regression—sequential residual centering (SRC)—by revealing unbiased multicollinearity conditioning across the interaction and its related terms. Across simulations, SRC reduces variance inflation factors (VIF) regardless of distribution shape or pattern of regression coefficients across predictors. For any predictor, the reduced VIF is used to derive a lower standard error of its regression coefficient. A cancer sample illustrates SRC, which allows unbiased interpretations of symptom clusters. SRC can be applied efficiently to alleviate multicollinearity after data collection and shows promise for advancing synergistic frontiers of research.展开更多
When we consider the factors affecting the stock market,we often consider the impact of macroeconomic factors on it.On the basis of the academic achievements of predecessor scholars on macroscopic factors affecting th...When we consider the factors affecting the stock market,we often consider the impact of macroeconomic factors on it.On the basis of the academic achievements of predecessor scholars on macroscopic factors affecting the overall stocks,this article selects six macroeconomic influencing factors:money supply,USDCNY exchange rate,GDP,national fiscal revenue,price index and interest rate,and uses PAC regression analysis method to construct a regression model.Analyze the influencing factors of Shanghai Pudong Development Bank stock;then conduct heteroscedasticity,autocorrelation,and multicollinearity tests to modify and adjust the regression model;finally,make relevant recommendations based on the analysis results.展开更多
文摘This paper investigates the tolerable sample size needed for Ordinary Least Square (OLS) Estimator to be used when there is presence of Multicollinearity among the exogenous variables of a linear regression model. A regression model with constant term (β0) and two independent variables (with β1 and β2 as their respective regression coefficients) that exhibit multicollinearity was considered. A Monte Carlo study of 1000 trials was conducted at eight levels of multicollinearity (0, 0.25, 0.5, 0.7, 0.75, 0.8, 0.9 and 0.99) and sample sizes (10, 20, 40, 80, 100, 150, 250 and 500). At each specification, the true regression coefficients were set at unity while 1.5, 2.0 and 2.5 were taken as the hypothesized value. The power value rate was obtained at every multicollinearity level for the aforementioned sample sizes. Therefore, whether the hypothesized values highly depart from the true values or not once the multicollinearity level is very high (i.e. 0.99), the sample size needed to work with in order to have an error free estimation or the inference result must be greater than five hundred.
文摘Heteroscedasticity and multicollinearity are serious problems when they exist in econometrics data. These problems exist as a result of violating the assumptions of equal variance between the error terms and that of independence between the explanatory variables of the model. With these assumption violations, Ordinary Least Square Estimator</span><span style="font-family:""> </span><span style="font-family:""><span style="font-family:Verdana;">(OLS) will not give best linear unbiased, efficient and consistent estimator. In practice, there are several structures of heteroscedasticity and several methods of heteroscedasticity detection. For better estimation result, best heteroscedasticity detection methods must be determined for any structure of heteroscedasticity in the presence of multicollinearity between the explanatory variables of the model. In this paper we examine the effects of multicollinearity on type I error rates of some methods of heteroscedasticity detection in linear regression model in other to determine the best method of heteroscedasticity detection to use when both problems exist in the model. Nine heteroscedasticity detection methods were considered with seven heteroscedasticity structures. Simulation study was done via a Monte Carlo experiment on a multiple linear regression model with 3 explanatory variables. This experiment was conducted 1000 times with linear model parameters of </span><span style="white-space:nowrap;"><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">0</span></sub><span style="font-family:Verdana;"> = 4 , </span><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">1</span></sub><span style="font-family:Verdana;"> = 0.4 , </span><em><span style="font-family:Verdana;">β</span></em><sub><span style="font-family:Verdana;">2</span></sub><span style="font-family:Verdana;">= 1.5</span></span></span><span style="font-family:""><span style="font-family:Verdana;"> and </span><em style="font-family:""><span style="font-family:Verdana;">β</span><span style="font-family:Verdana;"><sub>3 </sub></span></em><span style="font-family:Verdana;">= 3.6</span><span style="font-family:Verdana;">. </span><span style="font-family:Verdana;">Five (5) </span><span style="font-family:Verdana;"></span><span style="font-family:Verdana;">levels of</span><span style="white-space:nowrap;font-family:Verdana;"> </span><span style="font-family:Verdana;"></span><span style="font-family:Verdana;">mulicollinearity </span></span><span style="font-family:Verdana;">are </span><span style="font-family:Verdana;">with seven</span><span style="font-family:""> </span><span style="font-family:Verdana;">(7) different sample sizes. The method’s performances were compared with the aids of set confidence interval (C.I</span><span style="font-family:Verdana;">.</span><span style="font-family:Verdana;">) criterion. Results showed that whenever multicollinearity exists in the model with any forms of heteroscedasticity structures, Breusch-Godfrey (BG) test is the best method to determine the existence of heteroscedasticity at all chosen levels of significance.
文摘Multicollinearity in factor analysis has negative effects, including unreliable factor structure, inconsistent loadings, inflated standard errors, reduced discriminant validity, and difficulties in interpreting factors. It also leads to reduced stability, hindered factor replication, misinterpretation of factor importance, increased parameter estimation instability, reduced power to detect the true factor structure, compromised model fit indices, and biased factor loadings. Multicollinearity introduces uncertainty, complexity, and limited generalizability, hampering factor analysis. To address multicollinearity, researchers can examine the correlation matrix to identify variables with high correlation coefficients. The Variance Inflation Factor (VIF) measures the inflation of regression coefficients due to multicollinearity. Tolerance, the reciprocal of VIF, indicates the proportion of variance in a predictor variable not shared with others. Eigenvalues help assess multicollinearity, with values greater than 1 suggesting the retention of factors. Principal Component Analysis (PCA) reduces dimensionality and identifies highly correlated variables. Other diagnostic measures include the condition number and Cook’s distance. Researchers can center or standardize data, perform variable filtering, use PCA instead of factor analysis, employ factor scores, merge correlated variables, or apply clustering techniques for the solution of the multicollinearity problem. Further research is needed to explore different types of multicollinearity, assess method effectiveness, and investigate the relationship with other factor analysis issues.
文摘This paper considers the approaches and methods for reducing the influence of multi-collinearity. Great attention is paid to the question of using shrinkage estimators for this purpose. Two classes of regression models are investigated, the first of which corresponds to systems with a negative feedback, while the second class presents systems without the feedback. In the first case the use of shrinkage estimators, especially the Principal Component estimator, is inappropriate but is possible in the second case with the right choice of the regularization parameter or of the number of principal components included in the regression model. This fact is substantiated by the study of the distribution of the random variable , where b is the LS estimate and β is the true coefficient, since the form of this distribution is the basic characteristic of the specified classes. For this study, a regression approximation of the distribution of the event based on the Edgeworth series was developed. Also, alternative approaches are examined to resolve the multicollinearity issue, including an application of the known Inequality Constrained Least Squares method and the Dual estimator method proposed by the author. It is shown that with a priori information the Euclidean distance between the estimates and the true coefficients can be significantly reduced.
基金Supported by the Research Project of Department of Water Resources of Zhejiang Province of China (No. RB1010)
文摘Targeting the multicollinearity problem in dam statistical model and error perturbations resulting from the monitoring process, we built a regularized regression model using Truncated Singular Value Decomposition (TSVD). An earth-rock dam in China is presented and discussed as an example. The analysis consists of three steps: multicollinearity detection, regularization pa- rameter selection, and crack opening modeling and forecasting. Generalized Cross-Validation (GCV) function and L-curve criterion are both adopted in the regularization parameter selection. Partial Least-Squares Regression (PLSR) and stepwise regression are also included for comparison. The result indicates the TSVD can promisingly solve the multicollinearity problem of dam regression models. However, no general rules are available to make a decision when TSVD is superior to stepwise regression and PLSR due to the regularization parameter-choice problem. Both fitting accuracy and coefficients' reasonability should be considered when evaluating the mode/reliability.
基金Supported by the National Natural Science Foundation of China under Grant Nos.40675023 and 41065002the Key Natural Science Foundation of Guangxi Province under Grant No.0832019Z
文摘The prediction accuracy of the traditional stepwise regression prediction equation(SRPE)is affected by the multicollinearity among its predictors.This paper introduces the condition number analysis into the prediction modeling to minimize the multicollinearity in the SRPE.In the condition number prediction modeling,the condition number is used to select the combination of predictors with the lowest multicollinearity from the possible combinations of a number of candidate predictors(variables),and the selected combination is then used to construct the condition number regression prediction equation(CNRPE).This novel prediction modeling is performed in typhoon track prediction,which is a difficult task among meteorological disaster predictions.Six pairs of typhoon track latitude/longitude SRPEs and CNRPEs for July,August,and September are built by employing the traditional and the novel prediction modeling approaches,respectively,and by using a large number of identical modeling samples.The comparative analysis indicates that under the condition of the same candidate predictors(variables)and predictands(dependent variables),although the fitting accuracy of the novel prediction models used for the historical samples of South China Sea(SCS)typhoon tracks is slightly lower than that of the traditional prediction models,the prediction accuracy for the independent samples is obviously improved,with the averaged prediction error of the novel models for July,August,and September being 153.9 kin,which is 75.3 km smaller than that of the traditional models(a reduction of 33%).This is because the novel prediction modeling effectively minimizes the multicollinearity by computation and analysis of the condition number.It is shown further that when F=1.0,2.0,and 3.0,the average prediction errors of the traditional SRPEs are obviously larger than those of the CNRPEs.Moreover,extremely large and unreasonable prediction errors occur at some individual points of the typhoon track predicted by the SRPEs due to the multicollinearity existing in the combination of predictors.
文摘Mediterranean anemia is a genetic disease that currently relies heavily on expert clinical experience to determine whether patients are affected. This method is overly reliant on expert experience and is not precise enough. This paper proposes two modeling methods to predict whether patients have Mediterranean anemia. The first method involves using Principal Component Analysis (PCA) to reduce the dimensionality of the data, followed by logistic regression modeling (PCA-LR) on the reduced dataset. The second method involves building a Partial Least Squares Regression (PLS) model. Experimental results show that the prediction accuracy of the PCA-LR model is 87.5% (degree = 2, λ=4), and the prediction accuracy of the PLS model is 92.5% (ncomp = 4), indicating good predictive performance of the models.
文摘The parameter estimation problem in linear model is considered when multicollinearity and outliers exist simultaneously.A class of new estimators,robust general shrunken estimators,are proposed by grafting the robust estimation techniques philosophy into the biased estimator,and their statistical properties are discussed.By appropriate choices of the shrinking parameter matrix,we obtain many useful and important estimators.A numerical example is used to illustrate that these new estimators can not only effectively overcome difficulty caused by multicollinearity but also resist the influence of outliers.
基金fi nancially supported by the National Natural Science Foundation of China(31570624)Applied Technology Research and Development Plan Project of Heilongjiang Province(GA19C006)Fundamental Research Funds for Central Universities(2572019CP15).
文摘Accurate prediction of stem diameter is an important prerequisite of forest management.In this study,an appropriate stem taper function was developed for upper stem diameter estimation of white birch(Betula platyphylla Sukaczev)in ten sub-regions of the Daxing’an Mountains,northeast China.Three commonly used taper functions were assessed using a diameter and height dataset comprising 1344 trees.A first-order continuous-time error structure accounted for the inherent autocorrelation.The segmented model of Max and Burkhart(For Sci 22:283–289,1976.https://doi.org/10.1093/fores tscie nce/22.3.283)and the variable exponent taper function of Kozak(For Chron 80:507–515,2004.https://doi.org/10.5558/tfc80507-4)described the data accurately.Owing to its lower multicollinearity,the Max and Burkhart(1976)model is recommended for diameter estimation at specific heights along the stem for the ten sub-regions.After comparison,the Max and Burkhart(1976)model was refitted using nonlinear mixed-effects techniques.Mixed-effects models would be used only when additional upper stem diameter measurements are available for calibration.Differences in region-specific taper functions were indicated by the method of the non-linear extra sum of squares.Therefore,the particular taper function should be adjusted accordingly for each sub-region in the Daxing’an Mountains.
文摘Suppression effect in multiple regression analysis may be more common in research than what is currently recognized. We have reviewed several literatures of interest which treats the concept and types of suppressor variables. Also, we have highlighted systematic ways to identify suppression effect in multiple regressions using statistics such as: R2, sum of squares, regression weight and comparing zero-order correlations with Variance Inflation Factor (VIF) respectively. We also establish that suppression effect is a function of multicollinearity;however, a suppressor variable should only be allowed in a regression analysis if its VIF is less than five (5).
基金National Natural Science Foundation of China No.40301038
文摘In several LUCC studies, statistical methods are being used to analyze land use data. A problem using conventional statistical methods in land use analysis is that these methods assume the data to be statistically independent. But in fact, they have the tendency to be dependent, a phenomenon known as multicollinearity, especially in the cases of few observations. In this paper, a Partial Least-Squares (PLS) regression approach is developed to study relationships between land use and its influencing factors through a case study of the Suzhou-Wuxi-Changzhou region in China. Multicollinearity exists in the dataset and the number of variables is high compared to the number of observations. Four PLS factors are selected through a preliminary analysis. The correlation analyses between land use and influencing factors demonstrate the land use character of rural industrialization and urbanization in the Suzhou-Wuxi-Changzhou region, meanwhile illustrate that the first PLS factor has enough ability to best describe land use patterns quantitatively, and most of the statistical relations derived from it accord with the fact. By the decreasing capacity of the PLS factors, the reliability of model outcome decreases correspondingly.
文摘In the presence of multicollinearity in logistic regression, the variance of the Maximum Likelihood Estimator (MLE) becomes inflated. Siray et al. (2015) [1] proposed a restricted Liu estimator in logistic regression model with exact linear restrictions. However, there are some situations, where the linear restrictions are stochastic. In this paper, we propose a Stochastic Restricted Maximum Likelihood Estimator (SRMLE) for the logistic regression model with stochastic linear restrictions to overcome this issue. Moreover, a Monte Carlo simulation is conducted for comparing the performances of the MLE, Restricted Maximum Likelihood Estimator (RMLE), Ridge Type Logistic Estimator(LRE), Liu Type Logistic Estimator(LLE), and SRMLE for the logistic regression model by using Scalar Mean Squared Error (SMSE).
文摘The paper introduces a new biased estimator namely Generalized Optimal Estimator (GOE) in a multiple linear regression when there exists multicollinearity among predictor variables. Stochastic properties of proposed estimator were derived, and the proposed estimator was compared with other existing biased estimators based on sample information in the the Scalar Mean Square Error (SMSE) criterion by using a Monte Carlo simulation study and two numerical illustrations.
文摘In linear regression analysis, detecting anomalous observations is an important step for model building process. Various influential measures based on different motivational arguments and designed to measure the influence of observations on different aspects of various regression results are elucidated and critiqued. The presence of influential observations in the data is complicated by the presence of multicollinearity. In this paper, when Liu estimator is used to mitigate the effects of multicollinearity the influence of some observations can be drastically modified. Approximate deletion formulas for the detection of influential points are proposed for Liu estimator. Two real macroeconomic data sets are used to illustrate the methodologies proposed in this paper.
文摘Tropospheric ozone (O3) is one of the pollutants that have a significant impact on human health. It can increase the rate of asthma crises, cause permanent lung infections and death. Predicting its concentration levels is therefore important for planning atmospheric protection strategies. The aim of this study is to predict the daily mean O3 concentration one day ahead in the Grand Casablanca area of Morocco using primary pollutants and meteorological variables. Since the available explanatory variables are multicollinear, multiple linear regressions are likely to lead to unstable models. To counteract the multicollinearity problem, we compared several alternative regression methods: 1) Continuum Regression;2) Ridge & Lasso Regressions;3) Principal component regression (PCR);4) Partial least Square regression & sparse PLS and;5) Biased Power Regression. The aim is to set up a good prediction model of the daily ozone in the Grand Casablanca area. These models are fitted on a training data set (from the years 2013 and 2014), tested on a data set (from 2015) and validated on yet another data set data (from 2015). The Lasso model showed a better performance for the prediction of ozone concentrations compared to multiple linear regression and its other alternative methods.
文摘In this paper we compare recently developed preliminary test estimator called Preliminary Test Stochastic Restricted Liu Estimator (PTSRLE) with Ordinary Least Square Estimator (OLSE) and Mixed Estimator (ME) in the Mean Square Error Matrix (MSEM) sense for the two cases in which the stochastic restrictions are correct and not correct. Finally a numerical example and a Monte Carlo simulation study are done to illustrate the theoretical findings.
文摘This work is geared towards detecting and solving the problem of multicolinearity in regression analysis. As such, Variance Inflation Factor (VIF) and the Condition Index (CI) were used as measures of such detection. Ridge Regression (RR) and the Principal Component Regression (PCR) were the two other approaches used in modeling apart from the conventional simple linear regression. For the purpose of comparing the two methods, simulated data were used. Our task is to ascertain the effectiveness of each of the methods based on their respective mean square errors. From the result, we found that Ridge Regression (RR) method is better than principal component regression when multicollinearity exists among the predictors.
文摘In order to overcome the well-known multicollinearity problem, we propose a new Stochastic Restricted Liu Estimator in logistic regression model. In the mean square error matrix sense, the new estimation is compared with the Maximum Likelihood Estimation, Liu Estimator Stochastic Restricted Maximum Likelihood Estimator etc. Finally, a numerical example and a Monte Carlo simulation are given to explain some of the theoretical results.
文摘Multicollinearity constitutes shared variation among predictors that inflates standard errors of regression coefficients. Several years ago, it was proven that the common practice of mean centering in moderated regression cannot alleviate multicollinearity among variables comprising an interaction, but merely masks it. Residual centering (orthogonalizing) is unacceptable because it biases parameters for predictors from which the interaction derives, thus precluding interpretation of moderator effects. I propose and validate residual centering in sequential re-estimations of a moderated regression—sequential residual centering (SRC)—by revealing unbiased multicollinearity conditioning across the interaction and its related terms. Across simulations, SRC reduces variance inflation factors (VIF) regardless of distribution shape or pattern of regression coefficients across predictors. For any predictor, the reduced VIF is used to derive a lower standard error of its regression coefficient. A cancer sample illustrates SRC, which allows unbiased interpretations of symptom clusters. SRC can be applied efficiently to alleviate multicollinearity after data collection and shows promise for advancing synergistic frontiers of research.
文摘When we consider the factors affecting the stock market,we often consider the impact of macroeconomic factors on it.On the basis of the academic achievements of predecessor scholars on macroscopic factors affecting the overall stocks,this article selects six macroeconomic influencing factors:money supply,USDCNY exchange rate,GDP,national fiscal revenue,price index and interest rate,and uses PAC regression analysis method to construct a regression model.Analyze the influencing factors of Shanghai Pudong Development Bank stock;then conduct heteroscedasticity,autocorrelation,and multicollinearity tests to modify and adjust the regression model;finally,make relevant recommendations based on the analysis results.