Machine learning methods are increasingly used for spatially predicting a categorical target variable when spatially exhaustive predictor variables are available within the study region. Even though these methods exhibit competitive spatial prediction performance, they do not, by construction, exactly honor the categorical target variable's observed values at sampling locations. By contrast, competing geostatistical methods match the observed values at sampling locations exactly by design. In many geoscience applications, it is desirable to match the observed values of the categorical target variable at sampling locations exactly, especially when its measurements can reasonably be considered error-free. This paper addresses the problem of exact conditioning of machine learning methods for the spatial prediction of categorical variables. It introduces a classification random forest-based approach in which the categorical target variable is exactly conditioned to the data, thus sharing the exact conditioning property of competing geostatistical methods. The proposed method extends previous work dedicated to continuous target variables by using an implicit representation of the categorical target variable. The basic idea is to transform the ensemble of (categorical) classification tree predictors resulting from the traditional classification random forest into an ensemble of (continuous) signed distances associated with each category of the categorical target variable. An orthogonal representation of the ensemble of signed distances is then created through principal component analysis, which allows the exact conditioning problem to be reformulated as a system of linear inequalities on principal component scores. New principal component scores that ensure exact conditioning to the data are then sampled via randomized quadratic programming. The resulting conditional signed distances are converted back into an ensemble of categorical outputs, which perfectly honor the categorical target variable's observed values at sampling locations. A majority vote is then used to aggregate the ensemble of categorical outputs. The effectiveness of the proposed method is illustrated on a simulated dataset for which ground truth is available and showcased on a real-world dataset that includes geochemical data. A comparison with geostatistical and traditional machine learning methods shows that the proposed technique can perfectly match the categorical target variable's observed values at sampling locations while maintaining competitive out-of-sample predictive performance.
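The signed-distance representation and the final majority vote described in this abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`signed_distances`, `to_categories`, `majority_vote`), the sign convention (positive inside a category, negative outside), and the brute-force distance computation are all assumptions made for illustration, and the randomized quadratic programming step is not reproduced.

```python
import numpy as np

def signed_distances(coords, labels, categories):
    """Return signed distances with shape (n_categories, n_locations).

    Assumed convention: positive inside a category (distance to the nearest
    location of another category), negative outside (minus the distance to
    the nearest location of that category). Coordinates are assumed distinct.
    """
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.sqrt((diff ** 2).sum(axis=-1))  # pairwise Euclidean distances
    out = np.empty((len(categories), len(labels)))
    for k, cat in enumerate(categories):
        inside = labels == cat
        to_inside = np.where(inside[None, :], dist, np.inf).min(axis=1)
        to_outside = np.where(~inside[None, :], dist, np.inf).min(axis=1)
        out[k] = np.where(inside, to_outside, -to_inside)
    return out

def to_categories(sd, categories):
    """Map signed distances back to labels: the largest signed distance wins."""
    return categories[np.argmax(sd, axis=0)]

def majority_vote(label_ensemble, categories):
    """Aggregate an ensemble of shape (n_members, n_locations) by majority vote.

    Ties go to the category listed first in `categories`.
    """
    counts = (label_ensemble[:, :, None] == categories[None, None, :]).sum(axis=0)
    return categories[np.argmax(counts, axis=1)]

# Hypothetical 1-D example: ten locations, three categories.
coords = np.linspace(0.0, 1.0, 10)[:, None]
labels = np.array([0, 0, 0, 1, 1, 1, 1, 2, 2, 2])
categories = np.array([0, 1, 2])

sd = signed_distances(coords, labels, categories)
recovered = to_categories(sd, categories)          # round trip back to labels
votes = majority_vote(np.array([[0, 1, 2], [0, 1, 1], [2, 1, 1]]), categories)
```

Under this assumed convention, honoring an observed category at a sampling location amounts to requiring that category's signed distance to be the largest there. That requirement is linear in the signed distances, which is one plausible reading of the "system of linear inequalities" mentioned in the abstract.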
Regression random forest is becoming a widely used machine learning technique for spatial prediction, showing competitive prediction performance in various geoscience fields. Like other popular machine learning methods for spatial prediction, regression random forest does not exactly honor the response variable’s measured values at sampled locations. By contrast, competing methods such as regression-kriging perfectly fit the response variable’s observed values at sampled locations by construction. Exactly matching the response variable’s measured values at sampled locations is desirable in many geoscience applications. This paper presents a new approach ensuring that regression random forest perfectly matches the response variable’s observed values at sampled locations. The main idea is to use principal component analysis to create an orthogonal representation of the ensemble of regression tree predictors resulting from the traditional regression random forest. The exact conditioning problem is then reformulated as a Bayes-linear-Gauss problem on principal component scores. This problem has an analytical solution, which makes it easy to perform Monte Carlo sampling of new principal component scores and then reconstruct regression tree predictors that perfectly match the response variable’s observed values at sampled locations. By construction, the average of the reconstructed regression tree predictors also matches the response variable’s measured values at sampled locations exactly. The proposed method’s effectiveness is illustrated using a synthetic dataset for which the ground truth is available everywhere within the study region, and using a real dataset comprising geochemical concentration data from southwest England. It is compared with regression-kriging and the traditional regression random forest. The results indicate that the proposed method can perfectly fit the response variable’s measured values at sampled locations while achieving good out-of-sample predictive performance compared with regression-kriging and traditional regression random forest.
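The conditioning step on principal component scores can be sketched as follows. This is a hedged illustration, not the paper's algorithm: it assumes a standard normal prior on standardized scores and noise-free observations, and it uses the usual Gaussian "sample, then correct" identity, z_c = z + B_s^T (B_s B_s^T)^{-1} (y - m_s - B_s z), where B_s holds the rows of the PCA basis at sampled locations and m_s is the ensemble mean there. The function name `condition_ensemble` and the synthetic ensemble standing in for regression tree predictions are hypothetical.

```python
import numpy as np

def condition_ensemble(ensemble, obs_idx, obs_values, rng):
    """Resample an ensemble so that every member honors the observations.

    ensemble:   (n_locations, n_members) predictions, one column per member.
    obs_idx:    indices of sampled locations.
    obs_values: observed values at those locations.
    Assumes at least as many retained components as observations.
    """
    n_members = ensemble.shape[1]
    mean = ensemble.mean(axis=1, keepdims=True)
    anomalies = ensemble - mean

    # PCA via thin SVD; scale the basis so that scores have unit variance.
    u, s, _ = np.linalg.svd(anomalies, full_matrices=False)
    keep = s > 1e-10 * s[0]
    basis = u[:, keep] * s[keep] / np.sqrt(n_members - 1)

    # Unconditional scores drawn from the assumed N(0, I) prior.
    z = rng.standard_normal((basis.shape[1], n_members))

    # Correct the scores so the reconstruction matches the data exactly.
    b_obs = basis[obs_idx]
    residual = obs_values[:, None] - mean[obs_idx] - b_obs @ z
    z_cond = z + b_obs.T @ np.linalg.solve(b_obs @ b_obs.T, residual)
    return mean + basis @ z_cond

# Synthetic stand-in for tree predictions: a smooth signal plus noise.
rng = np.random.default_rng(0)
x = np.linspace(0.0, 2.0 * np.pi, 60)
ensemble = np.sin(x)[:, None] + 0.3 * rng.standard_normal((60, 100))
obs_idx = np.array([3, 10, 17, 25, 33, 41, 50, 58])
obs_values = np.sin(x[obs_idx])

conditioned = condition_ensemble(ensemble, obs_idx, obs_values, rng)
# Largest mismatch at sampled locations; should be at rounding-error level.
max_misfit = np.abs(conditioned[obs_idx] - obs_values[:, None]).max()
```

Because every reconstructed member matches the observations, their average does too, which mirrors the property stated in the abstract. The sketch requires the number of retained components to be at least the number of observations; otherwise the linear system is singular.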