Machine learning methods that account for the spatial auto-correlation of the response variable have garnered significant attention in the context of spatial prediction. Nonetheless, under these methods, the relationship between the response variable and the explanatory variables is assumed to be homogeneous throughout the entire study area. This assumption, known as spatial stationarity, is questionable in real-world situations because of the influence of contextual factors. It is therefore more reasonable to allow the relationship between the response variable and the predictor variables to vary spatially within the study region. However, existing machine learning techniques that account for this spatially varying relationship do not capture the spatial auto-correlation of the response variable itself. Moreover, under these techniques, each local machine learning model is effectively built from fewer observations, which can lead to well-known issues such as over-fitting and the curse of dimensionality. This paper introduces a novel geostatistical machine learning approach in which both the spatial auto-correlation of the response variable and the spatial non-stationarity of the regression relationship between the response and predictor variables are explicitly considered. The basic idea consists of relying on the local stationarity assumption to build a collection of local machine learning models while leveraging the local spatial auto-correlation of the response variable to locally augment the training dataset. The proposed method’s effectiveness is showcased via experiments conducted on synthetic spatial data with known characteristics as well as on real-world spatial data. In the synthetic and real case studies, the proposed method’s predictive accuracy, as indicated by the Root Mean Square Error (RMSE) on the test set, is 17% and 7% better, respectively, than that of popular machine learning methods dealing with the response variable’s spatial auto-correlation. Additionally, the method is not only valuable for spatial prediction but also offers a deeper understanding of how the relationship between the response and predictor variables varies across space, and it can even be used to investigate the local significance of predictor variables.
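The accuracy comparison reported above can be illustrated with a minimal sketch. The abstract does not state how the percentage is computed; the code below assumes it is the relative reduction in test-set RMSE with respect to a baseline method. The function names `rmse` and `relative_improvement` and the numeric values are hypothetical and serve only as an illustration, not as the authors' evaluation code or results.

```python
import math


def rmse(y_true, y_pred):
    """Root mean square error between two equal-length sequences."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


def relative_improvement(rmse_baseline, rmse_proposed):
    """Percentage by which the proposed RMSE is lower than the baseline RMSE.

    Assumption: "x% better" is read as a relative RMSE reduction.
    """
    return 100.0 * (rmse_baseline - rmse_proposed) / rmse_baseline


# Illustrative values only (not taken from the paper).
baseline_rmse = 1.00
proposed_rmse = 0.83
print(round(relative_improvement(baseline_rmse, proposed_rmse), 1))
```

Under this reading, a lower RMSE for the proposed method yields a positive percentage, and identical RMSE values yield zero.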
The spatial prediction of a continuous response variable when spatially exhaustive predictor variables are available within the region under study has become ubiquitous in many geoscience fields. The response variable is often subject to detection limits due to limitations of the measuring instrument or the sampling protocol used. Consequently, the response variable's observations are censored (left-censored, right-censored, or interval-censored). Machine learning methods dedicated to the spatial prediction of uncensored response variables cannot explicitly account for the response variable's censored observations. In such cases, they are routinely applied through ad hoc approaches such as ignoring the censored observations or replacing them with arbitrary values. As a result, the response variable's spatial prediction may be inaccurate and sensitive to the assumptions and approximations involved in those arbitrary choices. This paper introduces a random forest-based machine learning method for spatially predicting a censored response variable, in which the response variable's censored observations are explicitly taken into account. The basic idea consists of building an ensemble of regression tree predictors by training the classical regression random forest on the subset of data containing only the response variable's uncensored observations. Principal component analysis applied to this ensemble then translates the response variable's observations (uncensored and censored) into a system of linear equalities and inequalities. This system is solved through randomized quadratic programming, which yields an ensemble of reconstructed regression tree predictors that exactly honor the response variable's observations (uncensored and censored). The response variable's spatial prediction is then obtained by averaging this latter ensemble. The effectiveness of the proposed machine learning method is illustrated on simulated data for which ground truth is available and showcased on real-world data, including geochemical data. The results suggest that the proposed machine learning technique allows greater utilization of the response variable's censored observations than ad hoc methods.
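The way censored observations give rise to equalities and inequalities can be sketched as follows. This is only an illustration of the constraint idea, not the paper's method: it does not include the random forest, the principal component analysis, or the quadratic programming step. The `Observation` class, the helper constructors, and the `honors` function are hypothetical names, and the bounds are assumed to be non-strict.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Observation:
    """An observation stored as the interval of values consistent with it."""
    lower: float
    upper: float

    @property
    def is_censored(self):
        # An uncensored observation is an equality: lower == upper.
        return self.lower != self.upper


def uncensored(value):
    return Observation(value, value)          # y == value


def left_censored(limit):
    return Observation(-math.inf, limit)      # y <= limit (below detection limit)


def right_censored(limit):
    return Observation(limit, math.inf)       # y >= limit (above upper limit)


def interval_censored(low, high):
    return Observation(low, high)             # low <= y <= high


def honors(obs, prediction, tol=1e-9):
    """Return True if the prediction satisfies the observation's constraint."""
    return obs.lower - tol <= prediction <= obs.upper + tol


# Illustrative values only.
observations = [uncensored(3.2), left_censored(0.5), interval_censored(1.0, 2.0)]
predictions = [3.2, 0.1, 2.4]
print([honors(o, p) for o, p in zip(observations, predictions)])
# prints [True, True, False]
```

For a physically non-negative quantity such as a concentration, a left-censored observation could instead use 0 as its lower bound; that choice is an assumption the abstract does not address.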