Survival data with a multi-state structure are frequently observed in follow-up studies. An analytic approach based on a multi-state model (MSM) should be used in longitudinal health studies in which a patient experiences a sequence of clinical progression events. One main objective in the MSM framework is variable selection, which attempts to identify the risk factors associated with the transition hazard rates or probabilities of disease progression. The usual variable selection methods, including stepwise and penalized methods, do not provide information about the importance of variables. In this context, we present a two-step algorithm to evaluate the importance of variables for multi-state data. Three widely used machine learning approaches (random forest, gradient boosting, and neural network) are considered to estimate variable importance, in order to identify the factors affecting disease progression and rank them by importance. The performance of the proposed methods is validated by simulation and applied to a COVID-19 data set. The results show that the proposed two-step method performs promisingly in estimating variable importance.
The variable importance measure (VIM) can be used to rank or select important variables, which can effectively reduce the variable dimension and shorten the computational time. Random forest (RF) is an ensemble learning method that constructs multiple decision trees. To improve the prediction accuracy of random forest, an advanced random forest is presented that uses Kriging models as the leaf-node models in all the decision trees. Building on the Mean Decrease Accuracy (MDA) index computed from Out-of-Bag (OOB) data, importance measures for single variables, groups of variables, and correlated variables are proposed to establish a complete VIM system on the basis of the advanced random forest. The link between MDA and the variance-based total sensitivity index is explored, and the corresponding relationship between the proposed VIM indices and variance-based global sensitivity indices is then constructed, which gives a novel way to compute variance-based global sensitivity. Finally, several numerical and engineering examples are given to verify the effectiveness of the proposed VIM system and the validity of the established relationship.
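The Mean Decrease Accuracy idea mentioned above can be illustrated with a minimal permutation sketch. It is not the paper's method: it assumes a plain held-out data set instead of per-tree OOB samples, and a hypothetical stand-in predictor (`toy_model`) instead of a forest with Kriging leaf models. All function and variable names are illustrative.

```python
import random

def accuracy(predict, X, y):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(predict(row) == label for row, label in zip(X, y)) / len(y)

def mean_decrease_accuracy(predict, X, y, n_repeats=20, seed=0):
    """Average drop in accuracy when a single column is randomly shuffled."""
    rng = random.Random(seed)
    baseline = accuracy(predict, X, y)
    importances = []
    for j in range(len(X[0])):
        drops = []
        for _ in range(n_repeats):
            # Shuffle column j only, leaving every other column untouched.
            column = [row[j] for row in X]
            rng.shuffle(column)
            X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
            drops.append(baseline - accuracy(predict, X_perm, y))
        importances.append(sum(drops) / n_repeats)
    return importances

# Toy data (assumption): the label depends only on feature 0; feature 1 is noise.
data_rng = random.Random(42)
X = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y = [int(row[0] > 0.5) for row in X]

def toy_model(row):
    """Hypothetical fitted model that happens to use only feature 0."""
    return int(row[0] > 0.5)

vim = mean_decrease_accuracy(toy_model, X, y)
```

In this toy setting the importance of the noise feature is exactly zero, because the stand-in model never reads it, while shuffling feature 0 destroys most of the accuracy.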
The corn-to-sugar process has long faced the risks of high energy consumption and thin profits. However, it is hard to upgrade or optimize the process using mechanistic unit operation models because of the high complexity of the related processes. Big data technology provides a promising solution thanks to its ability to turn huge amounts of data into insights for operational decisions. In this paper, a neural network-based production process modeling and variable importance analysis approach is proposed for corn-to-sugar processes. It comprises data preprocessing, dimensionality reduction, modeling based on a multilayer perceptron, a convolutional neural network, or a recurrent neural network, and an extended weights connection method. In the established model, the dextrose equivalent value is selected as the output, and 654 sites from the DCS system are selected as the inputs. LASSO analysis is first applied to reduce the data dimension to 155, and the inputs are then further reduced to 50 by means of genetic algorithm optimization. Finally, variable importance analysis is carried out with the extended weights connection method, and the 20 most important sites are selected for each neural network. The results indicate that the multilayer perceptron and recurrent neural network models have a relative error of less than 0.1%, a better prediction result than the other models, and that the 20 most important sites selected are more readily interpretable. This work contributes a high-accuracy process simulation model and supports process optimization based on the selected most important sites, helping to maintain high-quality and stable production in corn-to-sugar processes.
Global urbanization causes more environmental stresses in cities, and energy efficiency is one of the major concerns for urban sustainability. Variable importance techniques have been widely used in building energy analysis to determine the key factors influencing building energy use. Most of these applications, however, use only one type of variable importance approach. Therefore, this paper proposes a procedure for conducting two types of variable importance analysis (predictive and variance-based) to determine robust and effective energy saving measures in urban buildings. Both variable importance methods are metamodeling techniques, which can significantly reduce the computational cost of building energy simulation models for urban buildings. The predictive importance analysis uses the prediction errors of metamodels to obtain importance rankings of inputs, while the variance-based variable importance can explore non-linear effects and interactions among input variables based on variance decomposition. Campus buildings are used to demonstrate the application of the proposed method to explore the characteristics of heating energy, cooling energy, electricity, and carbon emissions of buildings. The results indicate that combining the two types of metamodeling variable importance analysis can provide fast and robust analysis to improve the energy efficiency of urban buildings. Carbon emissions can be reduced by approximately 30% with a few effective energy efficiency measures, and more aggressive measures can lead to a 60% reduction. Moreover, this research demonstrates the application of parallel computing to expedite building energy analysis in urban environments as multi-core computers become increasingly available.
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and increased correlations between the trees of the forest. Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
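The importance-weighted feature sampling step described in the Methods can be sketched as follows. This is an illustrative reading of the abstract, not the viRandomForests implementation: sampling without replacement, the `floor` parameter that keeps weak features selectable, and all names are assumptions.

```python
import random

def sample_features(importances, mtry, rng, floor=0.01):
    """Draw `mtry` distinct feature indices with probability proportional to importance.

    `floor` (hypothetical) is added to every weight so that low-importance
    features are down-weighted rather than excluded outright.
    """
    remaining = list(range(len(importances)))
    weights = [max(importances[j], 0.0) + floor for j in remaining]
    chosen = []
    for _ in range(min(mtry, len(remaining))):
        # Weighted draw over the features not yet chosen for this node.
        pick = rng.choices(range(len(remaining)), weights=weights, k=1)[0]
        chosen.append(remaining.pop(pick))
        weights.pop(pick)
    return chosen

# Simulate the candidate set drawn at 2000 tree nodes.
rng = random.Random(0)
importances = [0.50, 0.30, 0.15, 0.05, 0.0]
counts = [0] * len(importances)
for _ in range(2000):
    for j in sample_features(importances, mtry=2, rng=rng):
        counts[j] += 1
```

A tree builder would then search for the best split only among the indices returned for each node. Feature 0 is offered as a split candidate far more often than features 3 or 4, yet those remain eligible.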
BACKGROUND Colorectal cancer (CRC) is characterized by high heterogeneity, aggressiveness, and high morbidity and mortality rates. With machine learning (ML) algorithms, patient, tumor, and treatment features can be used to develop and validate models for predicting survival. In addition, important variables can be screened and different applications can be provided that could serve as vital references when making clinical decisions, potentially improving patient outcomes in clinical settings. AIM To construct prognostic prediction models and screen important variables for patients with stage I to III CRC. METHODS More than 1000 postoperative CRC patients were grouped according to survival time (with cutoff values of 3 years and 5 years) and assigned to training and testing cohorts (7:3). For each 3-category survival time, predictions were made by 4 ML algorithms (on all-variable and important-variable-only datasets), each of which was validated via 5-fold cross-validation and bootstrap validation. Important variables were screened with multivariable regression methods. Model performance was evaluated and compared before and after variable screening with the area under the curve (AUC). SHapley Additive exPlanations (SHAP) further demonstrated the impact of important variables on model decision-making. Nomograms were constructed for practical model application. RESULTS Our ML models performed well; the model performance before and after important variable identification was consistent, and variable screening was effective. The highest pre- and post-screening model AUCs (95% confidence intervals) in the testing set were 0.87 (0.81-0.92) and 0.89 (0.84-0.93) for overall survival, 0.75 (0.69-0.82) and 0.73 (0.64-0.81) for disease-free survival, 0.95 (0.88-1.00) and 0.88 (0.75-0.97) for recurrence-free survival, and 0.76 (0.47-0.95) and 0.80 (0.53-0.94) for distant metastasis-free survival. Repeated cross-validation and bootstrap validation were performed in both the training and testing datasets. The SHAP values of the important variables were consistent with the clinicopathological characteristics of patients with tumors. The nomograms were created. CONCLUSION We constructed a comprehensive, high-accuracy, important-variable-based ML architecture for predicting the 3-category survival times. This architecture could serve as a vital reference for managing CRC patients.
Slope failures lead to catastrophic consequences in numerous countries, and the stability assessment of slopes is therefore of high interest in geotechnical and geological engineering research. A hybrid stacking ensemble approach is proposed in this study to enhance the prediction of slope stability. In this approach, an artificial bee colony (ABC) algorithm is used to find the best combination of base classifiers (level 0), and a suitable meta-classifier (level 1) is determined from a pool of 11 individual optimized machine learning (OML) algorithms. Finite element analysis (FEA) was conducted to form the synthetic database for the training stage (150 cases) of the proposed model, while 107 real field slope cases were used for the testing stage. The results of the hybrid stacking ensemble approach were then compared with those obtained by the 11 individual OML methods using the confusion matrix, F1-score, and area under the curve (AUC-score). The comparisons showed that the hybrid stacking ensemble achieved a significant improvement in slope stability prediction (AUC = 90.4%), which is 7% higher than the best of the 11 individual OML methods (AUC = 82.9%). A further comparison was then undertaken between the hybrid stacking ensemble method and a basic ensemble classifier on slope stability prediction. The results showed that the hybrid stacking ensemble method clearly outperformed the basic ensemble method. Finally, the importance of the variables for slope stability was studied using the linear vector quantization (LVQ) method.
This investigation assessed the efficacy of 10 widely used machine learning algorithms (MLA), comprising the least absolute shrinkage and selection operator (LASSO), generalized linear model (GLM), stepwise generalized linear model (SGLM), elastic net (ENET), partial least squares (PLS), ridge regression, support vector machine (SVM), classification and regression trees (CART), bagged CART, and random forest (RF), for gully erosion susceptibility mapping (GESM) in Iran. The locations of 462 previously existing gully erosion sites were mapped through widespread field investigations, of which 70% (323) and 30% (139) of observations were arbitrarily assigned to algorithm calibration and validation, respectively. Twelve controlling factors for gully erosion, namely soil texture, annual mean rainfall, digital elevation model (DEM), drainage density, slope, lithology, topographic wetness index (TWI), distance from rivers, aspect, distance from roads, plan curvature, and profile curvature, were ranked in terms of their importance using each MLA. The MLA were compared using a training dataset for gully erosion and statistical measures such as RMSE (root mean square error), MAE (mean absolute error), and R-squared. Based on the comparisons among the MLA, the RF algorithm exhibited the minimum RMSE and MAE and the maximum R-squared, and was therefore selected as the best model. The variable importance evaluation using the RF model revealed that distance from rivers had the highest importance in influencing the occurrence of gully erosion, whereas plan curvature had the least. According to the GESM generated using RF, most of the study area is predicted to have a low (53.72%) or moderate (29.65%) susceptibility to gully erosion, whereas only a small area is identified as having a high (12.56%) or very high (4.07%) susceptibility. The outcome generated by the RF model was validated using the ROC (Receiver Operating Characteristic) curve approach, which returned an area under the curve (AUC) of 0.985, indicating the excellent forecasting ability of the model. The GESM prepared using the RF algorithm can aid decision-makers in targeting remedial actions to minimize the damage caused by gully erosion.
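The three comparison statistics named above (RMSE, MAE, and R-squared) can be computed as in the following minimal sketch. The numbers are made-up toy values, not data from the study, and the function names are illustrative.

```python
import math

def rmse(y_true, y_pred):
    """Root mean square error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Toy observations and predictions (assumed values for illustration only).
y_true = [3.0, 5.0, 7.0, 9.0]
y_pred = [2.0, 5.0, 8.0, 9.0]

scores = {
    "RMSE": rmse(y_true, y_pred),
    "MAE": mae(y_true, y_pred),
    "R2": r_squared(y_true, y_pred),
}
```

Under the selection rule described in the abstract, the preferred model is the one with the smallest RMSE and MAE and the largest R-squared on the same data.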
In this study, some soil phosphorus sorption parameters (PSPs) were predicted using different machine learning models: Cubist (Cu), random forest (RF), support vector machines (SVM), and Gaussian process regression (GPR). The results showed that using the topographic attributes as the sole auxiliary variables was not adequate for predicting the PSPs. However, remote sensing data and its combination with soil properties could be reliably used to predict PSPs (R^2 = 0.41 for MBC by the RF model, R^2 = 0.49 for PBC by the Cu model, R^2 = 0.37 for SPR by the Cu model, and R^2 = 0.38 for SBC by the RF model). The lowest RMSE values were obtained for MBC by the RF model, PBC by the SVM model, SPR by the Cubist model, and SBC by the RF model. The results also showed that remote sensing data, as easily available datasets, could reliably predict PSPs in the given study area. The variable importance analysis revealed that, among the soil properties, cation exchange capacity (CEC) and clay content, and among the remote sensing indices, B5/B7, Midindex, Coloration index, Saturation index, and OSAVI were the most important factors for predicting PSPs. Further studies are recommended to use other proximally sensed data to improve PSP prediction and support precise decision-making throughout the landscape.
Variations in net ecosystem exchange (NEE) of carbon dioxide, and the variables influencing it, at woodland sites over multiple years determine the long-term performance of those sites as carbon sinks. In this study, weekly-averaged data from two AmeriFlux evergreen woodland sites in North America, in different climatic zones and with distinct tree and understory species, are evaluated using four multi-linear regression (MLR) and seven machine learning (ML) models. The site data extend over multiple years and conform to the FLUXNET2015 pre-processing pipeline. Twenty influencing variables are considered for site CA-LP1 and sixteen for site US-Mpj. Rigorous k-fold cross-validation analysis verifies that all eleven models assessed generate reproducible NEE predictions to varying degrees of accuracy. At both sites, the best performing ML models (support vector regression (SVR), extreme gradient boosting (XGB), and multi-layer perceptron (MLP)) substantially outperform the MLR models in terms of their NEE prediction performance. The ML models also generate predicted versus measured NEE distributions that approximate cross-plot trends passing through the origin, confirming that they more realistically capture the actual NEE trend. MLR and ML models assign some level of importance to all influential variables measured, but their degree of influence varies between the two sites. For the best performing SVR models at site CA-LP1, the variables air temperature, shortwave radiation outgoing, net radiation, longwave radiation outgoing, shortwave radiation incoming, and vapor pressure deficit have the most influence on NEE predictions. At site US-Mpj, the variables vapor pressure deficit, shortwave radiation incoming, longwave radiation incoming, air temperature, photosynthetic photon flux density incoming, shortwave radiation outgoing, and precipitation exert the most influence on the model solutions. Sensible heat exerts very low influence at both sites. The methodology applied successfully determines the relative importance of influential variables in determining weekly NEE trends at both conifer woodland sites studied.
The current study aimed to evaluate and compare the capabilities of seven advanced machine learning techniques (MLTs), namely Support Vector Machine (SVM), Random Forest (RF), Multivariate Adaptive Regression Spline (MARS), Artificial Neural Network (ANN), Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), and Naive Bayes (NB), for landslide susceptibility modeling. Coupling machine learning algorithms with spatial data types for landslide susceptibility mapping is a vitally important issue. This study was carried out using GIS and the open source software R at Abha Basin, Asir Region, Saudi Arabia. First, a total of 243 landslide locations were identified at Abha Basin to prepare the landslide inventory map using different data sources. All the landslide areas were randomly separated into two groups, with 70% used for training and 30% for validation. Twelve landslide variables were generated for landslide susceptibility modeling: altitude, lithology, distance to faults, normalized difference vegetation index (NDVI), land use/land cover (LULC), distance to roads, slope angle, distance to streams, profile curvature, plan curvature, slope length (LS), and slope aspect. The area under the ROC curve (AUC-ROC) approach was applied to evaluate, validate, and compare the performance of the MLTs. The results indicated that AUC values for the seven MLTs range from 89.0% for QDA to 95.1% for RF. Our findings showed that RF (AUC=95.1%) and LDA (AUC=941.7%) produced the best performances in comparison to the other MLTs. The outcome of this study and the landslide susceptibility maps would be useful for environmental protection.
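The AUC used above to compare classifiers can be read as the probability that a randomly chosen positive case (for example, a landslide location) receives a higher score than a randomly chosen negative case. A minimal sketch of that pairwise definition follows; the labels and scores are toy values, and the quadratic pairwise loop is chosen for clarity rather than efficiency.

```python
def roc_auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count half."""
    pos = [s for label, s in zip(labels, scores) if label == 1]
    neg = [s for label, s in zip(labels, scores) if label == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))

# Toy example (assumed values): three positives, three negatives.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.2]
auc = roc_auc(labels, scores)
```

Here eight of the nine positive-negative pairs are ordered correctly, so the AUC is 8/9. A value of 0.5 corresponds to random ranking and 1.0 to perfect separation.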
A supersaturated design (SSD), whose run size is not enough for estimating all the main effects, is commonly used in screening experiments. It offers a potentially useful tool to investigate a large number of factors with only a few experimental runs. Many authors have proposed analysis methods to identify active effects in situations where only one response is considered. However, there are often situations where two or more responses are observed simultaneously in one screening experiment, and the analysis of SSDs with multiple responses is thus needed. In this paper, we propose a two-stage variable selection strategy, called the multivariate partial least squares-stepwise regression (MPLS-SR) method, which uses multivariate partial least squares regression in conjunction with the stepwise regression procedure to select the truly active effects in SSDs with multiple responses. Simulation studies show that the MPLS-SR method performs well and is easy to understand and implement.
Statistical models can efficiently establish the relationships between crop growth and environmental conditions while explicitly quantifying uncertainties. This study aimed to test the efficiency of statistical models established using partial least squares regression (PLSR) and artificial neural network (ANN) in predicting seed yields of sunflower (Helianthus annuus). Two-year field trial data on sunflower growth under different salinity levels and nitrogen (N) application rates at the Yichang Experimental Station in Hetao Irrigation District, Inner Mongolia, China, were used to calibrate and validate the statistical models. The variable importance in projection score was calculated in order to select the sensitive crop indices for seed yield prediction. We found that when the most sensitive indices were used as inputs for seed yield estimation, the PLSR could attain an accuracy (root mean square error (RMSE) = 0.93 t ha^-1, coefficient of determination (R^2) = 0.69) comparable to that obtained when using all measured indices (RMSE = 0.81 t ha^-1, R^2 = 0.77). The ANN model outperformed the PLSR for yield prediction with different combinations of inputs for both microplot and field data. The results indicated that sunflower seed yield could be reasonably estimated by using a small number of crop characteristic indices under complex environmental conditions and management options (e.g., saline soils and N application). Since leaf area index and plant height were found to be the most sensitive crop indices for sunflower seed yield prediction, remotely sensed data and the ANN model may be combined for regional crop yield simulation.
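The abstract does not state how the variable importance in projection (VIP) score is computed. For reference, a commonly used definition for a PLS model with $p$ predictors and $A$ components is sketched below; the symbols are assumptions introduced here, not notation taken from the paper.

```latex
\mathrm{VIP}_j
  = \sqrt{\, p \,
      \frac{\sum_{a=1}^{A} \mathrm{SS}_a
            \left( \dfrac{w_{ja}}{\lVert \mathbf{w}_a \rVert} \right)^{2}}
           {\sum_{a=1}^{A} \mathrm{SS}_a} \,},
  \qquad j = 1, \dots, p
```

Here $w_{ja}$ is the weight of predictor $j$ in component $a$, $\mathbf{w}_a$ is the weight vector of component $a$, and $\mathrm{SS}_a$ is the sum of squares of the response explained by component $a$. Under this definition the squared VIP scores average to one across predictors, which is why a threshold near 1 is often used when screening for sensitive variables.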
Machine learning techniques have attracted growing attention as advanced data analytics in building energy analysis. However, most previous studies focus only on the prediction capability of machine learning algorithms to provide reliable energy estimation in buildings. Beyond model prediction, machine learning also has great potential to identify energy patterns for urban buildings. Therefore, this paper explores the energy characteristics of London domestic properties using ten machine learning algorithms from three aspects: the tuning process of the learning models, variable importance, and spatial analysis of model discrepancy. The results indicate that the combination of these three aspects can provide insights into energy patterns for urban buildings. The tuning process of these models indicates that gas use models should have more terms than electricity models in London, and that interaction terms should be considered in both gas and electricity models. The rankings of important variables are very different for gas and electricity prediction in London residential buildings, which suggests that gas and electricity use are affected by different physical and social factors. Moreover, the importance levels of these key variables are markedly different for gas and electricity consumption. Many more variables have importance levels over 40 for electricity use than for gas use. The areas with larger model discrepancies can be determined using local spatial analysis based on these machine learning models. These identified areas have significantly different energy patterns for gas and electricity use. More research is required to understand these unusual patterns of energy use in these areas.
Wildfire is a primary forest disturbance. A better understanding of wildfire susceptibility and its dominant influencing factors is crucial for regional wildfire risk management. This study performed a wildfire susceptibility assessment using multiple methods, including logistic regression, probit regression, an artificial neural network, and a random forest (RF) algorithm. Yunnan Province, China, was used as a case study area. We investigated the sample ratio of ignition and nonignition data to avoid misleading results due to the overwhelming number of nonignition samples in the models. To compare model performance and the importance of variables among the models, the area under the curve of the receiver operating characteristic plot was used as an indicator. The results show that a cost-sensitive RF had the highest accuracy (88.47%) for all samples, and 94.23% accuracy for ignition prediction. The main factors identified as influencing Yunnan wildfire occurrence were forest coverage ratio, month, season, surface roughness, the 10-day minimum of the 6 h maximum humidity, and the 10-day maxima of the 6 h average and maximum temperatures. These seven variables made the greatest contributions to regional wildfire susceptibility. Susceptibility maps developed from the models provide information regarding the spatial variation of ignition susceptibility, which can be used in regional wildfire risk management.
Funding: Supported by the Special Foundation for State Major Basic Research Program of China (Grant No. 2021YFD2101000).
Funding: Supported by the National Natural Science Foundation of China (No. 51778416) and the Key Projects of Philosophy and Social Sciences Research, Ministry of Education of China, "Research on Green Design in Sustainable Development" (contract No. 16JZDH014, approval No. 16JZD014).
Abstract: Global urbanization causes more environmental stress in cities, and energy efficiency is one of the major concerns for urban sustainability. Variable importance techniques have been widely used in building energy analysis to determine the key factors influencing building energy use. Most of these applications, however, use only one type of variable importance approach. Therefore, this paper proposes a procedure for conducting two types of variable importance analysis (predictive and variance-based) to determine robust and effective energy-saving measures in urban buildings. Both variable importance methods are metamodeling techniques, which can significantly reduce the computational cost of building energy simulation models for urban buildings. The predictive importance analysis uses the prediction errors of metamodels to obtain importance rankings of inputs, while the variance-based variable importance analysis can explore non-linear effects and interactions among input variables based on variance decomposition. Campus buildings are used to demonstrate the application of the proposed method to explore the characteristics of heating energy, cooling energy, electricity, and carbon emissions of buildings. The results indicate that combining the two types of metamodeling variable importance analysis can provide fast and robust analysis to improve the energy efficiency of urban buildings. Carbon emissions can be reduced by approximately 30% with a few effective energy efficiency measures, and more aggressive measures can achieve a 60% reduction. Moreover, this research demonstrates the application of parallel computing to expedite building energy analysis in urban environments as multi-core computers become increasingly available.
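The predictive importance analysis described above ranks inputs by how much a metamodel's prediction error changes. One common way to do this is permutation importance, sketched below in plain Python. This is an illustration only: the abstract does not say which error-based procedure the authors used, and `toy_model`, the three inputs, and the coefficients are hypothetical.

```python
import random

def mse(model, rows, targets):
    """Mean squared error of the model's predictions over all rows."""
    return sum((model(r) - t) ** 2 for r, t in zip(rows, targets)) / len(rows)

def permutation_importance(model, rows, targets, n_repeats=20, seed=0):
    """Average increase in MSE when a single input column is shuffled."""
    rng = random.Random(seed)
    baseline = mse(model, rows, targets)
    scores = []
    for j in range(len(rows[0])):
        increases = []
        for _ in range(n_repeats):
            column = [r[j] for r in rows]
            rng.shuffle(column)
            # Rebuild each row with only column j replaced by its shuffled value.
            shuffled = [r[:j] + [v] + r[j + 1:] for r, v in zip(rows, column)]
            increases.append(mse(model, shuffled, targets) - baseline)
        scores.append(sum(increases) / n_repeats)
    return scores

# Hypothetical metamodel: input 0 dominates, input 2 has no effect at all.
def toy_model(x):
    return 5.0 * x[0] + 1.0 * x[1] + 0.0 * x[2]

data_rng = random.Random(42)
rows = [[data_rng.random() for _ in range(3)] for _ in range(200)]
targets = [toy_model(r) for r in rows]

scores = permutation_importance(toy_model, rows, targets)
ranking = sorted(range(3), key=lambda j: scores[j], reverse=True)
print(ranking)
```

Shuffling an input that the model ignores leaves the error unchanged, so its score is zero, while influential inputs receive scores that grow with their effect on the prediction.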
Abstract: Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which uses only the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly because of its rigid feature selection and the increased correlation between trees in the forest. Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest"; it can be freely downloaded from http://zhaocenter.org/software.
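The node-level sampling step described in the Methods can be sketched as follows. This is a minimal illustration, not the viRandomForests implementation: the function name `sample_features`, sampling without replacement, and the use of raw scores as weights are all assumptions, and the importance values are hypothetical.

```python
import random

def sample_features(importance, mtry, rng):
    """Draw mtry distinct feature indices, each draw proportional to importance."""
    candidates = list(range(len(importance)))
    weights = list(importance)
    chosen = []
    for _ in range(mtry):
        # Weighted draw among the features not yet chosen at this node.
        pick = rng.choices(range(len(candidates)), weights=weights, k=1)[0]
        chosen.append(candidates.pop(pick))
        weights.pop(pick)
    return chosen

rng = random.Random(0)
importance = [8.0, 4.0, 1.0, 0.5, 0.5]  # hypothetical importance scores

# Count how often each feature is offered as a split candidate over many nodes.
counts = [0] * len(importance)
for _ in range(2000):
    for j in sample_features(importance, mtry=2, rng=rng):
        counts[j] += 1
print(counts)
```

Features with high scores are offered as split candidates far more often, yet low-scoring features keep a nonzero chance of being drawn, which matches the abstract's point about not completely ignoring less informative features.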
Funding: Supported by the National Natural Science Foundation of China, No. 81802777.
Abstract: BACKGROUND: Colorectal cancer (CRC) is characterized by high heterogeneity, aggressiveness, and high morbidity and mortality rates. With machine learning (ML) algorithms, patient, tumor, and treatment features can be used to develop and validate models for predicting survival. In addition, important variables can be screened and different applications can be provided that could serve as vital references when making clinical decisions and potentially improve patient outcomes in clinical settings. AIM: To construct prognostic prediction models and screen important variables for patients with stage I to III CRC. METHODS: More than 1000 postoperative CRC patients were grouped according to survival time (with cutoff values of 3 years and 5 years) and assigned to training and testing cohorts (7:3). For each 3-category survival time, predictions were made by 4 ML algorithms (on all-variable and important variable-only datasets), each of which was validated via 5-fold cross-validation and bootstrap validation. Important variables were screened with multivariable regression methods. Model performance was evaluated and compared before and after variable screening with the area under the curve (AUC). SHapley Additive exPlanations (SHAP) further demonstrated the impact of important variables on model decision-making. Nomograms were constructed for practical model application. RESULTS: Our ML models performed well; the model performance before and after important parameter identification was consistent, and variable screening was effective. The highest pre- and postscreening model AUCs (95% confidence intervals) in the testing set were 0.87 (0.81-0.92) and 0.89 (0.84-0.93) for overall survival, 0.75 (0.69-0.82) and 0.73 (0.64-0.81) for disease-free survival, 0.95 (0.88-1.00) and 0.88 (0.75-0.97) for recurrence-free survival, and 0.76 (0.47-0.95) and 0.80 (0.53-0.94) for distant metastasis-free survival. Repeated cross-validation and bootstrap validation were performed in both the training and testing datasets. The SHAP values of the important variables were consistent with the clinicopathological characteristics of patients with tumors. The nomograms were created. CONCLUSION: We constructed a comprehensive, high-accuracy, important variable-based ML architecture for predicting the 3-category survival times. This architecture could serve as a vital reference for managing CRC patients.
Funding: We acknowledge the funding support from the Australian Research Council (Grant Nos. DP200100549 and IH180100010).
Abstract: Slope failures lead to catastrophic consequences in numerous countries, and thus the stability assessment of slopes is of high interest in geotechnical and geological engineering research. A hybrid stacking ensemble approach is proposed in this study for enhancing the prediction of slope stability. In the hybrid stacking ensemble approach, we used an artificial bee colony (ABC) algorithm to find the best combination of base classifiers (level 0) and determined a suitable meta-classifier (level 1) from a pool of 11 individual optimized machine learning (OML) algorithms. Finite element analysis (FEA) was conducted to form the synthetic database for the training stage (150 cases) of the proposed model, while 107 real field slope cases were used for the testing stage. The results of the hybrid stacking ensemble approach were then compared with those obtained by the 11 individual OML methods using the confusion matrix, F1-score, and area under the curve, i.e. AUC-score. The comparisons showed that a significant improvement in the prediction ability of slope stability has been achieved by the hybrid stacking ensemble (AUC = 90.4%), which is 7% higher than the best of the 11 individual OML methods (AUC = 82.9%). Then, a further comparison was undertaken between the hybrid stacking ensemble method and a basic ensemble classifier on slope stability prediction. The results showed a prominent performance of the hybrid stacking ensemble method over the basic ensemble method. Finally, the importance of the variables for slope stability was studied using the linear vector quantization (LVQ) method.
Funding: Supported by the College of Agriculture, Shiraz University (Grant No. 97GRC1M271143), and by funding from the UK Biotechnology and Biological Sciences Research Council (BBSRC) under grant award BBS/E/C/000I0330 (Soil to Nutrition project 3, Sustainable intensification: optimisation at multiple scales).
Abstract: This investigation assessed the efficacy of 10 widely used machine learning algorithms (MLA), comprising the least absolute shrinkage and selection operator (LASSO), generalized linear model (GLM), stepwise generalized linear model (SGLM), elastic net (ENET), partial least squares (PLS), ridge regression, support vector machine (SVM), classification and regression trees (CART), bagged CART, and random forest (RF), for gully erosion susceptibility mapping (GESM) in Iran. The locations of 462 previously existing gully erosion sites were mapped through widespread field investigations, of which 70% (323) and 30% (139) of observations were randomly divided for algorithm calibration and validation. Twelve controlling factors for gully erosion, namely soil texture, annual mean rainfall, digital elevation model (DEM), drainage density, slope, lithology, topographic wetness index (TWI), distance from rivers, aspect, distance from roads, plan curvature, and profile curvature, were ranked in terms of their importance using each MLA. The MLA were compared using a training dataset for gully erosion and statistical measures such as RMSE (root mean square error), MAE (mean absolute error), and R-squared. Based on the comparisons among MLA, the RF algorithm exhibited the minimum RMSE and MAE and the maximum value of R-squared, and was therefore selected as the best model. The variable importance evaluation using the RF model revealed that distance from rivers had the highest significance in influencing the occurrence of gully erosion, whereas plan curvature had the least importance. According to the GESM generated using RF, most of the study area is predicted to have a low (53.72%) or moderate (29.65%) susceptibility to gully erosion, whereas only a small area is identified as having a high (12.56%) or very high (4.07%) susceptibility. The outcome generated by the RF model is validated using the ROC (Receiver Operating Characteristic) curve approach, which returned an area under the curve (AUC) of 0.985, indicating the excellent forecasting ability of the model. The GESM prepared using the RF algorithm can aid decision-makers in targeting remedial actions for minimizing the damage caused by gully erosion.
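For readers unfamiliar with the AUC figure reported above, the area under the ROC curve equals the probability that a randomly chosen positive sample (here, a gully site) receives a higher score than a randomly chosen negative one. The sketch below computes it by direct pairwise comparison; the labels and scores are made-up illustrative values, not data from the study.

```python
def roc_auc(labels, scores):
    """AUC as the share of positive/negative pairs ranked correctly (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative sample")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Hypothetical susceptibility scores for three gully and three non-gully cells.
labels = [1, 1, 1, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.5, 0.3, 0.1]
auc = roc_auc(labels, scores)
print(auc)  # 8 of the 9 positive/negative pairs are ordered correctly
```

An AUC of 1.0 means every positive outranks every negative, and 0.5 corresponds to random ordering.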
Abstract: In this study, several soil phosphorus sorption parameters (PSPs) were predicted using different machine learning models (Cubist (Cu), random forest (RF), support vector machines (SVM), and Gaussian process regression (GPR)). The results showed that using the topographic attributes as the sole auxiliary variables was not adequate for predicting the PSPs. However, remote sensing data and its combination with soil properties were reliably used to predict PSPs (R^2 = 0.41 for MBC by the RF model, R^2 = 0.49 for PBC by the Cu model, R^2 = 0.37 for SPR by the Cu model, and R^2 = 0.38 for SBC by the RF model). The lowest RMSE values were obtained for MBC by the RF model, PBC by the SVM model, SPR by the Cubist model, and SBC by the RF model. The results also showed that remote sensing data, as easily available datasets, could reliably predict PSPs in the given study area. The variable importance analysis revealed that, among the soil properties, cation exchange capacity (CEC) and clay content, and among the remote sensing indices, B5/B7, Midindex, Coloration index, Saturation index, and OSAVI, were the most important factors for predicting PSPs. Further studies are recommended to use other proximally sensed data to improve PSP prediction for precise decision-making throughout the landscape.
Abstract: Variations in net ecosystem exchange (NEE) of carbon dioxide, and the variables influencing it, at woodland sites over multiple years determine the long-term performance of those sites as carbon sinks. In this study, weekly-averaged data from two AmeriFlux sites in North America of evergreen woodland, in different climatic zones and with distinct tree and understory species, are evaluated using four multi-linear regression (MLR) and seven machine learning (ML) models. The site data extend over multiple years and conform to the FLUXNET2015 pre-processing pipeline. Twenty influencing variables are considered for site CA-LP1 and sixteen for site US-Mpj. Rigorous k-fold cross-validation analysis verifies that all eleven models assessed generate reproducible NEE predictions to varying degrees of accuracy. At both sites, the best performing ML models (support vector regression (SVR), extreme gradient boosting (XGB), and multi-layer perceptron (MLP)) substantially outperform the MLR models in terms of their NEE prediction performance. The ML models also generate predicted versus measured NEE distributions that approximate cross-plot trends passing through the origin, confirming that they more realistically capture the actual NEE trend. MLR and ML models assign some level of importance to all influential variables measured, but their degree of influence varies between the two sites. For the best performing SVR models, at site CA-LP1, the variables air temperature, shortwave radiation outgoing, net radiation, longwave radiation outgoing, shortwave radiation incoming, and vapor pressure deficit have the most influence on NEE predictions. At site US-Mpj, the variables vapor pressure deficit, shortwave radiation incoming, longwave radiation incoming, air temperature, photosynthetic photon flux density incoming, shortwave radiation outgoing, and precipitation exert the most influence on the model solutions. Sensible heat exerts very low influence at both sites. The methodology applied successfully determines the relative importance of influential variables in determining weekly NEE trends at both conifer woodland sites studied.
Abstract: The current study aimed at evaluating the capabilities of seven advanced machine learning techniques (MLTs), including Support Vector Machine (SVM), Random Forest (RF), Multivariate Adaptive Regression Spline (MARS), Artificial Neural Network (ANN), Quadratic Discriminant Analysis (QDA), Linear Discriminant Analysis (LDA), and Naive Bayes (NB), for landslide susceptibility modeling and at comparing their performances. Coupling machine learning algorithms with spatial data types for landslide susceptibility mapping is a vitally important issue. This study was carried out using GIS and R open source software at Abha Basin, Asir Region, Saudi Arabia. First, a total of 243 landslide locations were identified at Abha Basin to prepare the landslide inventory map using different data sources. All the landslide areas were randomly separated into two groups with a ratio of 70% for training and 30% for validation. Twelve landslide variables were generated for landslide susceptibility modeling, which include altitude, lithology, distance to faults, normalized difference vegetation index (NDVI), landuse/landcover (LULC), distance to roads, slope angle, distance to streams, profile curvature, plan curvature, slope length (LS), and slope aspect. The area under the curve (AUC-ROC) approach has been applied to evaluate, validate, and compare the MLTs' performance. The results indicated that AUC values for the seven MLTs range from 89.0% for QDA to 95.1% for RF. Our findings showed that the RF (AUC = 95.1%) and LDA (AUC = 941.7%) have produced the best performances in comparison to other MLTs. The outcome of this study and the landslide susceptibility maps would be useful for environmental protection.
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 10971107, 11271205), the "131" Talents Program of Tianjin, and the Fundamental Research Funds for the Central Universities (Grant Nos. 65030011, 65011481).
Abstract: A supersaturated design (SSD), whose run size is not enough for estimating all the main effects, is commonly used in screening experiments. It offers a potentially useful tool to investigate a large number of factors with only a few experimental runs. The associated analysis methods have been proposed by many authors to identify active effects in situations where only one response is considered. However, there are often situations where two or more responses are observed simultaneously in one screening experiment, and the analysis of SSDs with multiple responses is thus needed. In this paper, we propose a two-stage variable selection strategy, called the multivariate partial least squares-stepwise regression (MPLS-SR) method, which uses multivariate partial least squares regression in conjunction with the stepwise regression procedure to select true active effects in SSDs with multiple responses. Simulation studies show that the MPLS-SR method performs well and is easy to understand and implement.
Funding: Supported by the National Natural Science Foundation of China (Nos. 51609175, 51790533, 51879196, and 51439006).
Abstract: Statistical models can efficiently establish the relationships between crop growth and environmental conditions while explicitly quantifying uncertainties. This study aimed to test the efficiency of statistical models established using partial least squares regression (PLSR) and artificial neural network (ANN) in predicting seed yields of sunflower (Helianthus annuus). Two-year field trial data on sunflower growth under different salinity levels and nitrogen (N) application rates at the Yichang Experimental Station in Hetao Irrigation District, Inner Mongolia, China, were used to calibrate and validate the statistical models. The variable importance in projection score was calculated in order to select the sensitive crop indices for seed yield prediction. We found that when the most sensitive indices were used as inputs for seed yield estimation, the PLSR could attain an accuracy (root mean square error (RMSE) = 0.93 t ha^-1, coefficient of determination (R^2) = 0.69) comparable to that obtained when using all measured indices (RMSE = 0.81 t ha^-1, R^2 = 0.77). The ANN model outperformed the PLSR for yield prediction with different combinations of inputs from both microplot and field data. The results indicated that sunflower seed yield could be reasonably estimated by using a small number of crop characteristic indices under complex environmental conditions and management options (e.g., saline soils and N application). Since leaf area index and plant height were found to be the most sensitive crop indices for sunflower seed yield prediction, remotely sensed data and the ANN model may be combined for regional crop yield simulation.
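The variable importance in projection (VIP) score mentioned above is usually defined as follows for a PLS model with a single response. The notation is an assumption introduced here for illustration (the abstract gives no formula): p is the number of predictors, A the number of PLS components, w_{ja} the weight of predictor j on component a, and SSY_a the response sum of squares explained by component a.

```latex
\mathrm{VIP}_j = \sqrt{\, p \;
  \frac{\sum_{a=1}^{A} \mathrm{SSY}_a \left( \dfrac{w_{ja}}{\lVert \mathbf{w}_a \rVert} \right)^{2}}
       {\sum_{a=1}^{A} \mathrm{SSY}_a} \,}
```

Because the squared normalized weights of each component sum to one over the predictors, the squared VIP scores average to exactly one across all p predictors. This is why a threshold near 1 is a common rule of thumb for calling an index sensitive; the abstract does not state which cutoff the authors applied.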
Funding: This research was supported by the National Natural Science Foundation of China (No. 51778416) and the Key Projects of Philosophy and Social Sciences Research, Ministry of Education (China), "Research on Green Design in Sustainable Development" (contract No. 16JZDH014, approval No. 16JZD014).
Abstract: Machine learning techniques have attracted increasing attention as advanced data analytics in building energy analysis. However, most previous studies focus only on the prediction capability of machine learning algorithms to provide reliable energy estimation in buildings. Beyond model prediction, machine learning also has great potential to identify energy patterns in urban buildings. Therefore, this paper explores the energy characteristics of London domestic properties using ten machine learning algorithms from three aspects: the tuning process of the learning model; variable importance; and spatial analysis of model discrepancy. The results indicate that the combination of these three aspects can provide insights into energy patterns for urban buildings. The tuning process of these models indicates that gas use models should have more terms than electricity models in London, and that interaction terms should be considered in both gas and electricity models. The rankings of important variables are very different for gas and electricity prediction in London residential buildings, which suggests that gas and electricity use are affected by different physical and social factors. Moreover, the importance levels for these key variables are markedly different for gas and electricity consumption. Many more variables have importance levels over 40 for electricity use than for gas use. The areas with larger model discrepancies can be determined using local spatial analysis based on these machine learning models. These identified areas have significantly different energy patterns for gas and electricity use. More research is required to understand these unusual patterns of energy use in these areas.
Funding: Supported by the International Partnership Program of the Chinese Academy of Sciences (Grant # 131551KYSB20160002) and the National Natural Science Foundation of China (Grants # 41671503 and 41621061).
Abstract: Wildfire is a primary forest disturbance. A better understanding of wildfire susceptibility and its dominant influencing factors is crucial for regional wildfire risk management. This study performed a wildfire susceptibility assessment using multiple methods, including logistic regression, probit regression, an artificial neural network, and a random forest (RF) algorithm. Yunnan Province, China, was used as a case study area. We investigated the sample ratio of ignition and nonignition data to avoid misleading results due to the overwhelming number of nonignition samples in the models. To compare model performance and the importance of variables among the models, the area under the curve of the receiver operating characteristic plot was used as an indicator. The results show that a cost-sensitive RF had the highest accuracy (88.47%) for all samples, and 94.23% accuracy for ignition prediction. The main factors identified as influencing wildfire occurrence in Yunnan were forest coverage ratio, month, season, surface roughness, the 10-day minimum of the 6-h maximum humidity, and the 10-day maxima of the 6-h average and maximum temperatures. These seven variables made the greatest contributions to regional wildfire susceptibility. Susceptibility maps developed from the models provide information regarding the spatial variation of ignition susceptibility, which can be used in regional wildfire risk management.