Abstract: Hydraulic fracturing is an effective technology for hydrocarbon extraction from unconventional shale and tight gas reservoirs. A potential risk of hydraulic fracturing is the upward migration of stray gas from the deep subsurface to shallow aquifers. The stray gas can dissolve in groundwater, leading to chemical and biological reactions that could negatively affect groundwater quality and contribute to atmospheric emissions. Knowledge of light hydrocarbon solubility in the aqueous environment is essential for the numerical modelling of flow and transport in the subsurface. Herein, we compiled a database containing 2129 experimental data points on methane, ethane, and propane solubility in pure water and various electrolyte solutions over wide ranges of operating temperature and pressure. Two machine learning algorithms, namely regression tree (RT) and boosted regression tree (BRT), tuned with a Bayesian optimization (BO) algorithm, were employed to determine the solubility of gases. The predictions were compared with the experimental data as well as four well-established thermodynamic models. Our analysis shows that the BRT-BO is sufficiently accurate, and the predicted values agree well with those obtained from the thermodynamic models. The coefficient of determination (R²) between experimental and predicted values is 0.99 and the mean squared error (MSE) is 9.97×10^(-8). The leverage statistical approach further confirmed the validity of the model developed.
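As a rough illustration of the boosted regression tree idea and of the two reported metrics (R² and MSE), the sketch below boosts depth-1 trees on a toy one-dimensional problem. It is not the authors' model: the data, `n_trees`, and `learning_rate` are assumptions chosen by hand, and no Bayesian optimization is performed.

```python
# Toy gradient boosting with depth-1 regression trees (stumps), squared-error loss.
# Data and hyperparameters are illustrative assumptions, fixed by hand
# rather than tuned with Bayesian optimization.

def fit_stump(x, r):
    """Return (threshold, left_mean, right_mean) minimising squared error on r."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    return best[1:]

def stump_predict(stump, xi):
    t, ml, mr = stump
    return ml if xi <= t else mr

def fit_brt(x, y, n_trees=50, learning_rate=0.1):
    """Each new stump is fitted to the residuals of the current ensemble."""
    base = sum(y) / len(y)
    pred = [base] * len(y)
    stumps = []
    for _ in range(n_trees):
        resid = [yi - pi for yi, pi in zip(y, pred)]
        stump = fit_stump(x, resid)
        stumps.append(stump)
        pred = [pi + learning_rate * stump_predict(stump, xi)
                for xi, pi in zip(x, pred)]
    return base, stumps, learning_rate

def brt_predict(model, xi):
    base, stumps, learning_rate = model
    return base + learning_rate * sum(stump_predict(s, xi) for s in stumps)

def mse(y, yhat):
    return sum((a - b) ** 2 for a, b in zip(y, yhat)) / len(y)

def r_squared(y, yhat):
    mean_y = sum(y) / len(y)
    ss_res = sum((a - b) ** 2 for a, b in zip(y, yhat))
    ss_tot = sum((a - mean_y) ** 2 for a in y)
    return 1.0 - ss_res / ss_tot

# Hypothetical smooth response standing in for a solubility curve.
x = [i / 10.0 for i in range(30)]
y = [xi ** 2 for xi in x]

model = fit_brt(x, y)
yhat = [brt_predict(model, xi) for xi in x]
baseline_mse = mse(y, [sum(y) / len(y)] * len(y))  # predict the mean everywhere
model_mse = mse(y, yhat)
model_r2 = r_squared(y, yhat)
print(model_mse, model_r2)
```

These are training-set metrics on toy data, so they say nothing about the accuracy reported in the abstract.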
Abstract: Challenges in Big Data analysis arise from the way the data are recorded, maintained, processed and stored. We demonstrate that a hierarchical, multivariate, statistical machine learning algorithm, namely Boosted Regression Tree (BRT), can address Big Data challenges to drive decision making. The challenge in this study is a lack of interoperability, since the data, a collection of GIS shapefiles, remotely sensed imagery, and aggregated and interpolated spatio-temporal information, are stored in monolithic hardware components. For the modelling process, it was necessary to create one common input file. Merging the data sources produced a structured but noisy input file showing inconsistencies and redundancies. Here, it is shown that BRT can process different data granularities, heterogeneous data and missingness. In particular, BRT deals with missing data by default, allowing a split on whether or not a value is missing as well as on what the value is. Most importantly, BRT offers a wide range of possibilities for interpreting results, and variable selection is performed automatically by considering how frequently a variable is used to define a split in the tree. A comparison with two similar regression models (Random Forests and Least Absolute Shrinkage and Selection Operator, LASSO) shows that BRT outperforms these in this instance. BRT can also be a starting point for sophisticated hierarchical modelling in real-world scenarios. For example, a single or ensemble approach of BRT could be tested with existing models in order to improve results for a wide range of data-driven decisions and applications.
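The two mechanisms named in this abstract, splitting on missingness and ranking variables by how often they define a split, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not the behaviour of any particular BRT library: missing values are encoded as `None`, the data are invented, and the feature names (`rainfall`, `elevation`, `ndvi`) are hypothetical.

```python
from collections import Counter

def sse(values):
    """Sum of squared deviations from the mean (0 for an empty list)."""
    if not values:
        return 0.0
    m = sum(values) / len(values)
    return sum((v - m) ** 2 for v in values)

def candidate_splits(x):
    """Yield (label, test) pairs: one missingness test plus value thresholds."""
    if any(v is None for v in x):
        yield ("is_missing", lambda v: v is None)
    observed = sorted({v for v in x if v is not None})
    for lo, hi in zip(observed, observed[1:]):
        t = (lo + hi) / 2.0
        # Missing values fail the threshold test and go to the right branch.
        yield (f"x <= {t}", lambda v, t=t: v is not None and v <= t)

def best_split(x, y):
    """Pick the candidate that minimises the summed squared error of both sides."""
    best_label, best_cost = None, float("inf")
    for label, test in candidate_splits(x):
        left = [yi for xi, yi in zip(x, y) if test(xi)]
        right = [yi for xi, yi in zip(x, y) if not test(xi)]
        if not left or not right:
            continue
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_label, best_cost = label, cost
    return best_label, best_cost

# Invented data in which the response is high exactly where x is missing.
x = [1.0, 2.0, None, 3.0, None, 4.0]
y = [1.0, 1.1, 5.0, 0.9, 5.2, 1.0]
label, cost = best_split(x, y)
print(label)

# Split-frequency importance: count how often each variable defines a split.
# Each inner list holds the (hypothetical) variables used by one tree.
splits_per_tree = [["rainfall", "elevation"], ["rainfall"], ["ndvi", "rainfall"]]
importance = Counter(name for tree in splits_per_tree for name in tree)
print(importance.most_common())
```

On this toy input the missingness test wins because the two `None` rows carry the high responses, which is the situation in which treating missingness as information pays off.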
Abstract: In this paper we analyse the temporal variation of CD4 cell counts in HIV-infected individuals under antiretroviral therapy using statistical methods. This is achieved with a recursive binary regression tree approach [1], [2]. The approach makes it possible to highlight the existence of several segments of the population of interest, described by the interactions between the predictive covariates of the response to the treatment regimen.
Abstract: Researchers in bioinformatics, biostatistics and other related fields seek biomarkers for many purposes, including risk assessment, disease diagnosis and prognosis, each of which can be formulated as a patient classification problem. In this paper, a new method that uses a tree regression to improve a logistic classification model is introduced for biomarker data analysis. The numerical results show that the linear logistic model can be significantly improved by a tree regression on the residuals. Although this research discusses the classification of binary responses, the idea extends easily to the classification of multinomial responses.
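One way to read "a tree regression on the residuals" is sketched below: fit a linear logistic model, compute the residuals (observed label minus predicted probability), fit a single-split regression tree to those residuals, and add its output to the logistic probability. The abstract does not give the authors' exact procedure, so the combination rule, the clipping to [0, 1], the gradient-descent settings, and the toy data are all assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(x, y, lr=0.1, steps=2000):
    """One-feature logistic regression fitted by plain gradient descent."""
    w, b, n = 0.0, 0.0, len(x)
    for _ in range(steps):
        gw = sum((sigmoid(w * xi + b) - yi) * xi for xi, yi in zip(x, y))
        gb = sum(sigmoid(w * xi + b) - yi for xi, yi in zip(x, y))
        w -= lr * gw / n
        b -= lr * gb / n
    return w, b

def fit_stump(x, r):
    """Single-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, ml, mr)
    return best[1:]

def clip01(p):
    return min(1.0, max(0.0, p))

# Invented biomarker values with a non-monotone relation to the class label,
# which a linear logistic model cannot capture on its own.
x = [-3.0, -2.0, -1.0, 0.0, 1.0, 2.0, 3.0, 4.0]
y = [0, 0, 1, 1, 1, 1, 0, 0]

w, b = fit_logistic(x, y)
p_logistic = [sigmoid(w * xi + b) for xi in x]
residuals = [yi - pi for yi, pi in zip(y, p_logistic)]

t, ml, mr = fit_stump(x, residuals)
p_combined = [clip01(pi + (ml if xi <= t else mr))
              for xi, pi in zip(x, p_logistic)]

sse_logistic = sum((yi - pi) ** 2 for yi, pi in zip(y, p_logistic))
sse_combined = sum((yi - pi) ** 2 for yi, pi in zip(y, p_combined))
print(sse_logistic, sse_combined)
```

Because the stump is a least-squares fit to the residuals and the labels lie in [0, 1], the combined training squared error cannot exceed that of the logistic model alone; whether the gain carries over to new patients is a separate question that needs held-out data.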
Funding: Funded by the National Key Research and Development Program of China, Strategic International Cooperation in Science and Technology Innovation Program (2018YFE0207800), and the National Natural Science Foundation of China (31971483).
Abstract: The dead fuel moisture content (DFMC) is the key driver leading to fire occurrence. Accurately estimating the DFMC could help identify locations facing fire risks, prioritise areas for fire monitoring, and facilitate timely deployment of fire-suppression resources. In this study, the DFMC and environmental variables, including air temperature, relative humidity, wind speed, solar radiation, rainfall, atmospheric pressure, soil temperature, and soil humidity, were simultaneously measured in a grassland of Ergun City, Inner Mongolia Autonomous Region of China in 2021. We chose three regression models, i.e., random forest (RF) model, extreme gradient boosting (XGB) model, and boosted regression tree (BRT) model, to model the seasonal DFMC according to the data collected. To ensure accuracy, we added time-lag variables of 3 d to the models. The results showed that the RF model had the best fitting effect, with an R² value of 0.847, and the best prediction accuracy, with a mean absolute error score of 4.764%, among the three models. The accuracies of the models in spring and autumn were higher than those in the other two seasons. In addition, different seasons had different key influencing factors, and the degree of influence of these factors on the DFMC changed with time lags. Moreover, time-lag variables within 44 h clearly improved the fitting effect and prediction accuracy, indicating that environmental conditions within approximately 48 h greatly influence the DFMC. This study highlights the importance of considering 48 h time-lagged variables when predicting the DFMC of grassland fuels and mapping grassland fire risks based on the DFMC to help locate high-priority areas for grassland fire monitoring and prevention.
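Time-lag variables of the kind described here can be built by copying earlier observations of a variable into the current row. The sketch below assumes evenly spaced hourly records; the function name `add_lag_features`, the column name `rh` (relative humidity), and the values are hypothetical and not taken from the study.

```python
def add_lag_features(rows, variable, lags):
    """Append lagged copies of `variable` to each row.

    rows: list of dicts ordered in time at a fixed step (assumed hourly).
    lags: positive integer lags, in steps. Rows without a full lag history
    are dropped rather than padded.
    """
    max_lag = max(lags)
    out = []
    for i in range(max_lag, len(rows)):
        new_row = dict(rows[i])
        for k in lags:
            new_row[f"{variable}_lag{k}h"] = rows[i - k][variable]
        out.append(new_row)
    return out

# Hypothetical hourly relative-humidity readings (%).
records = [{"rh": v} for v in [50, 55, 60, 65, 70]]
lagged = add_lag_features(records, "rh", lags=[1, 2])
for row in lagged:
    print(row)
```

Dropping the first rows is one design choice among several; padding with missing values instead would keep the series length but requires a model that tolerates missingness.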
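Abstract: Objective: To compare the value of decision tree and logistic regression models for predicting pregnancy outcomes in patients undergoing in vitro fertilization and embryo transfer (IVF-ET). Methods: A total of 350 patients who underwent IVF-ET at Heping Hospital Affiliated to Changzhi Medical College between January 2021 and October 2022 were enrolled and divided by pregnancy outcome into a successful pregnancy group (215 cases) and a failed pregnancy group (135 cases). Clinical data were collected, and logistic regression and decision tree models were built to predict pregnancy outcomes in IVF-ET patients. Decision tree models were built both with and without reference to the logistic regression results (decision tree 1 and decision tree 2). Receiver operating characteristic (ROC) curves were used to evaluate predictive performance. Results: Of the 350 patients, 61.43% had a successful pregnancy and 38.57% had a failed pregnancy. Compared with the successful pregnancy group, the failed pregnancy group had higher proportions of patients aged ≥35 years, with infertility duration ≥5 years, with ≥1 cycle, and with psychological or mental disorders, and a higher serum progesterone level on the HCG day; it had lower proportions of patients with ≥10 oocytes retrieved and with a fertilization rate ≥75%, a thinner endometrium on the HCG day, and fewer high-quality embryos (P<0.05). Multivariate logistic regression analysis showed that age, serum progesterone level on the HCG day, number of high-quality embryos, and psychological or mental disorders were all factors influencing pregnancy outcomes in IVF-ET patients (P<0.05). The decision tree model showed that age, serum progesterone level on the HCG day, and number of high-quality embryos were factors influencing pregnancy outcomes in IVF-ET patients. The area under the curve (AUC) of the logistic regression model was 0.832, with sensitivity, specificity, and accuracy of 87.3%, 71.4%, and 83.5%, respectively; the AUC of decision tree 1 was 0.859, with sensitivity, specificity, and accuracy of 85.1%, 76.8%, and 85.6%; the AUC of decision tree 2 was 0.820, with sensitivity, specificity, and accuracy of 83.7%, 73.2%, and 82.4%. The AUC of decision tree 1 was greater than that of decision tree 2 (P<0.05), but did not differ significantly from that of the logistic regression model (P>0.05). Conclusion: Both the logistic regression model and the decision tree model have some value for predicting pregnancy outcomes in IVF-ET patients.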
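The evaluation metrics compared in this abstract (sensitivity, specificity, accuracy, and AUC) can be computed from labels and predicted scores as sketched below. The labels, scores, and the 0.5 decision threshold are invented for illustration and have no connection to the study data; the AUC is computed as the fraction of positive-negative pairs ranked correctly, with ties counted as one half.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, fp, tn, fn) for binary labels coded 0/1."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

def auc(y_true, scores):
    """Probability that a random positive scores above a random negative."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))

# Invented labels (1 = positive class) and model scores.
y_true = [1, 1, 1, 0, 0, 1, 0, 0]
scores = [0.9, 0.8, 0.35, 0.4, 0.3, 0.7, 0.2, 0.6]
y_pred = [1 if s >= 0.5 else 0 for s in scores]  # assumed threshold of 0.5

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)            # true positive rate
specificity = tn / (tn + fp)            # true negative rate
accuracy = (tp + tn) / len(y_true)
area = auc(y_true, scores)
print(sensitivity, specificity, accuracy, area)
```

Sensitivity, specificity, and accuracy depend on the chosen threshold, whereas the AUC summarises ranking quality across all thresholds, which is why two models can differ on one and not the other.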
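Abstract: Objective: To investigate the current status of kinesiophobia in middle-aged and elderly patients with chronic musculoskeletal pain (CMP), and to analyse its influencing factors using a logistic regression model and a decision tree model. Methods: From January to June 2023, convenience sampling was used to select 370 middle-aged and elderly CMP inpatients at a tertiary grade-A orthopaedic specialty hospital in Urumqi, Xinjiang. They were surveyed with a general information questionnaire, a kinesiophobia scoring scale, the General Self-Efficacy Scale, the Medical Coping Modes Questionnaire, and the Social Support Rating Scale. A logistic regression model and a decision tree model were used to analyse the factors influencing kinesiophobia in these patients, and the predictive performance of the two models was evaluated. Results: A total of 352 middle-aged and elderly CMP inpatients completed the study. The incidence of kinesiophobia was 67.3%. Both the logistic regression model and the decision tree model showed that self-efficacy and social support were factors influencing kinesiophobia in middle-aged and elderly CMP inpatients. Comparison of the two models showed that the sensitivity (99.2%) and specificity (89.5%) of the logistic regression model were both higher than those of the decision tree model (96.6% and 87.8%). The area under the curve (AUC) of the logistic regression model was 0.971 (95% CI: 0.952-0.990), with a standard error of 0.010, and the AUC of the decision tree model was 0.948 (95% CI: 0.921-0.974), with a standard error of 0.013; both models showed good predictive value. Conclusion: The incidence of kinesiophobia among middle-aged and elderly CMP inpatients is high. Combining the logistic regression model with the decision tree model can fully reveal the factors influencing kinesiophobia in these patients, and joint use of the two models is recommended to provide a basis for the assessment of and intervention in kinesiophobia in middle-aged and elderly CMP inpatients.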
Funding: Supported by the National Natural Science Foundation of China (Grant Nos. 12071173 and 12171192) and the Huaian Key Laboratory for Infectious Diseases Control and Prevention (HAP201704).
Abstract: Plant epidemics are often associated with weather-related variables. It is difficult to identify weather-related predictors for models predicting plant epidemics. To predict Fusarium head blight (FHB) epidemics of wheat, Shah et al. explored a functional approach using scalar-on-function regression to model a binary outcome (FHB epidemic or non-epidemic) with respect to weather time series spanning 140 days relative to anthesis. The scalar-on-function models fit the data better than previously described logistic regression models. In this work, given the same dataset and models, we attempt to reproduce the article by Shah et al. using a different approach, boosted regression trees. After fitting, the classification accuracy and model statistics are surprisingly good.
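For a binary outcome such as epidemic versus non-epidemic, boosted regression trees are commonly fitted on the log-odds scale with a logistic loss. The sketch below shows that mechanism with single-split trees and a plain gradient step; it is not the authors' model, and the predictor values, labels, `n_trees`, and `learning_rate` are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_loss(y, f):
    """Mean logistic loss for labels y (0/1) and log-odds scores f."""
    total = 0.0
    for yi, fi in zip(y, f):
        p = sigmoid(fi)
        total -= yi * math.log(p) + (1 - yi) * math.log(1.0 - p)
    return total / len(y)

def fit_stump(x, r):
    """Single-split regression tree: (threshold, left_mean, right_mean)."""
    best = None
    xs = sorted(set(x))
    for lo, hi in zip(xs, xs[1:]):
        t = (lo + hi) / 2.0
        left = [ri for xi, ri in zip(x, r) if xi <= t]
        right = [ri for xi, ri in zip(x, r) if xi > t]
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        cost = sum((v - ml) ** 2 for v in left) + sum((v - mr) ** 2 for v in right)
        if best is None or cost < best[0]:
            best = (cost, t, ml, mr)
    return best[1:]

def boost_classifier(x, y, n_trees=30, learning_rate=0.5):
    """Boost stumps on the negative gradient of the logistic loss (y - p)."""
    prior = sum(y) / len(y)
    f0 = math.log(prior / (1.0 - prior))   # start from the prior log-odds
    f = [f0] * len(y)
    stumps = []
    for _ in range(n_trees):
        grad = [yi - sigmoid(fi) for yi, fi in zip(y, f)]
        t, ml, mr = fit_stump(x, grad)
        stumps.append((t, ml, mr))
        f = [fi + learning_rate * (ml if xi <= t else mr)
             for xi, fi in zip(x, f)]
    return f0, stumps, f

# Invented weather summary (x) and epidemic indicator (y).
x = [float(i) for i in range(10)]
y = [0, 0, 0, 1, 0, 1, 1, 1, 1, 1]

f0, stumps, f_final = boost_classifier(x, y)
initial_loss = log_loss(y, [f0] * len(y))
final_loss = log_loss(y, f_final)
train_accuracy = sum(
    1 for yi, fi in zip(y, f_final) if (fi > 0.0) == (yi == 1)
) / len(y)
print(initial_loss, final_loss, train_accuracy)
```

Training loss and accuracy improve almost by construction here, so a "surprisingly good" fit should be checked against held-out data before it is taken as evidence of predictive skill.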