Accurately estimating the interfacial bond capacity of the near-surface mounted (NSM) carbon fiber-reinforced polymer (CFRP) to concrete joint is a fundamental task in the strengthening and retrofit of existing reinforced concrete (RC) structures. The machine learning (ML) approach may provide an alternative to the commonly used semi-empirical or semi-analytical methods. Therefore, in this work we have developed a predictive model based on an artificial neural network (ANN) approach, i.e. using a back-propagation neural network (BPNN), to map the complex data pattern obtained from an NSM CFRP to concrete joint. It involves a set of nine material and geometric input parameters and one output value. Moreover, by employing the neural interpretation diagram (NID) technique, the BPNN model becomes interpretable, as the influence of each input variable on the model can be tracked and quantified based on the connection weights of the neural network. An extensive database of 163 pull-out testing samples, collected from the authors' research group and from published results in the literature, is used to train and verify the ANN. Our results show that the prediction given by the BPNN model agrees well with the experimental data, yielding a coefficient of determination of 0.957 on the whole database. After removing one non-significant feature, the BPNN becomes even more computationally efficient and accurate. In addition, compared with the existing semi-analytical model, the ANN-based approach provides a more accurate estimation. Therefore, the proposed ML method may be a promising alternative for structural engineers in predicting the bond strength of the NSM CFRP to concrete joint.
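As a rough illustration of the connection-weight idea behind the NID technique, the sketch below trains a small scikit-learn MLPRegressor (standing in for the BPNN) on synthetic data and derives a Garson-style importance score per input. The data, layer size, and importance variant are all assumptions for illustration, not the paper's model or database.

```python
import numpy as np
from sklearn.neural_network import MLPRegressor

# Synthetic stand-in for the 163-sample pull-out database (hypothetical data,
# not the authors' measurements): 9 inputs, output dominated by the first one.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(200, 9))
y = 5.0 * X[:, 0] + 0.5 * X[:, 1] + 0.05 * rng.standard_normal(200)

net = MLPRegressor(hidden_layer_sizes=(8,), solver="lbfgs",
                   max_iter=5000, random_state=0).fit(X, y)

# Garson-style connection-weight importance (one simple variant of the
# NID idea): |input->hidden| weights scaled by |hidden->output| weights,
# summed per input and normalized to a share of 1.
w_ih = np.abs(net.coefs_[0])          # shape (9, 8)
w_ho = np.abs(net.coefs_[1]).ravel()  # shape (8,)
contrib = (w_ih * w_ho).sum(axis=1)
importance = contrib / contrib.sum()
# In this toy setup the dominant input usually receives the largest share.
print(importance.round(3))
```

The same bookkeeping applies to any single-hidden-layer network; only the weight matrices change.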
Funding: Financially supported by the National Key Research and Development Program of China (2022YFB3706800, 2020YFB1710100) and the National Natural Science Foundation of China (51821001, 52090042, 52074183).
Abstract: The complex sand-casting process, combined with the interactions between process parameters, makes casting quality difficult to control, resulting in a high scrap rate. A strategy based on a data-driven model was proposed to reduce casting defects and improve production efficiency; it comprises a random forest (RF) classification model, feature importance analysis, and process parameter optimization with Monte Carlo simulation. The collected data, covering four types of defects and the corresponding process parameters, were used to construct the RF model. Classification results show a recall rate above 90% for all categories. The Gini index was used to assess the importance of the process parameters in the formation of the various defects in the RF model. Finally, the classification model was applied to different production conditions for quality prediction. In the case of process parameter optimization for gas porosity defects, the model serves as the experimental process in the Monte Carlo method to estimate a better temperature distribution. The prediction model, when applied in the factory, greatly improved the efficiency of defect detection. Results show that the scrap rate decreased from 10.16% to 6.68%.
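A minimal sketch of the RF-plus-Gini-importance step described above, using scikit-learn on synthetic process data. The feature names and label rule are hypothetical stand-ins, not the foundry's records; scikit-learn's `feature_importances_` is the mean-decrease-in-impurity (Gini) ranking.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score

# Hypothetical process parameters; gas porosity is made to depend mainly on
# sand moisture in this toy label rule.
rng = np.random.default_rng(42)
n = 600
pouring_temp = rng.normal(1400.0, 30.0, n)
moisture = rng.uniform(2.5, 4.5, n)
permeability = rng.uniform(80.0, 160.0, n)
X = np.column_stack([pouring_temp, moisture, permeability])
y = (moisture + 0.1 * rng.standard_normal(n) > 3.5).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
# Gini-based importance ranking of the process parameters.
print(dict(zip(["pouring_temp", "moisture", "permeability"],
               rf.feature_importances_.round(3))))
print(round(recall_score(y, rf.predict(X)), 3))
```

With the label driven by moisture, its Gini importance dominates, mirroring how the paper ranks parameters per defect type.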
Abstract: The prediction of liquefaction-induced lateral spreading/displacement (Dh) is a challenging task for civil/geotechnical engineers. In this study, a new approach is proposed to predict Dh using gene expression programming (GEP). Based on statistical reasoning, individual models were developed for two topographies: free-face and gently sloping ground. Along with a comparison with conventional approaches for predicting Dh, four additional regression-based soft computing models, i.e. Gaussian process regression (GPR), relevance vector machine (RVM), sequential minimal optimization regression (SMOR), and M5-tree, were developed and compared with the GEP model. The results indicate that the GEP models predict Dh with less bias, as evidenced by the root mean square error (RMSE) and mean absolute error (MAE) for training (1.092 and 0.815; and 0.643 and 0.526) and for testing (0.89 and 0.705; and 0.773 and 0.573) in the free-face and gently sloping ground topographies, respectively. The overall performance for the free-face topography was ranked as follows: GEP > RVM > M5-tree > GPR > SMOR, with total scores of 40, 32, 24, 15, and 10, respectively. For the gently sloping condition, the performance was ranked as follows: GEP > RVM > GPR > M5-tree > SMOR, with total scores of 40, 32, 21, 19, and 8, respectively. Finally, the results of the sensitivity analysis showed that for both free-face and gently sloping ground, the liquefiable layer thickness (T_(15)) was the major parameter, with percentage deterioration (%D) values of 99.15 and 90.72, respectively.
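The bias figures quoted above are RMSE and MAE. For reference, a minimal implementation of both metrics on hypothetical predictions (not the study's data):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean square error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def mae(y_true, y_pred):
    """Mean absolute error."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.mean(np.abs(y_true - y_pred)))

y_true = [1.0, 2.0, 4.0]
y_pred = [1.5, 2.0, 3.0]
print(rmse(y_true, y_pred))  # sqrt((0.25 + 0 + 1) / 3) ≈ 0.6455
print(mae(y_true, y_pred))   # (0.5 + 0 + 1) / 3 = 0.5
```

RMSE penalizes large errors more heavily than MAE, which is why the paper reports both when comparing the five models.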
Funding: Supported by the Research Program through the National Research Foundation of Korea, NRF-2018R1D1A1B07050864.
Abstract: This study was conducted to enable prompt classification of increasingly sophisticated malware. To do this, we analyzed the important features of malware and the relative importance of selected features according to a learning model, to assess how those important features were identified. Initially, the analysis features were extracted using Cuckoo Sandbox, an open-source malware analysis tool, and then divided into five categories using the extracted information. The 804 extracted features were reduced by 70% after selecting only the most suitable ones for malware classification using a learning-model-based feature selection method called recursive feature elimination. Next, these important features were analyzed, and the level of contribution from each one was assessed with the random forest classifier. The results showed that system call features accounted for the largest share of the selected features. In the end, it was possible to accurately identify the malware type using only 36 to 76 features for each of the four malware types with the most analysis samples available: Trojan, Adware, Downloader, and Backdoor.
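The selection step described above can be sketched with scikit-learn's `RFE` wrapper around a random forest. The dataset here is synthetic (not the 804 Cuckoo Sandbox features), and the retained fraction is chosen for illustration:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in: 50 candidate features, 8 informative.
X, y = make_classification(n_samples=300, n_features=50, n_informative=8,
                           random_state=0)

# Recursively drop the 5 weakest features per round until 15 remain
# (roughly a 70% reduction, as in the study).
selector = RFE(RandomForestClassifier(n_estimators=50, random_state=0),
               n_features_to_select=15, step=5).fit(X, y)
print(int(selector.support_.sum()))  # → 15 features retained
```

`selector.support_` marks the surviving features; `selector.ranking_` records the elimination order, with rank 1 for every kept feature.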
Abstract: Breast cancer is one of the most common cancers among women in the world, with more than two million new cases every year. The disease is associated with numerous clinical and genetic characteristics. In recent years, machine learning technology has been increasingly applied to the medical field, including predicting the risk of malignant tumors such as breast cancer. Based on clinical and targeted sequencing data of 1980 primary breast cancer samples, this article aimed to analyze these data and predict survival outcomes after breast cancer. After data engineering, feature selection, and comparison of machine learning methods, the light gradient boosting machine model with hyperparameter tuning was found to perform best (precision = 0.818, recall = 0.816, F1 score = 0.817, ROC-AUC = 0.867). The top five determinants were the clinical features age at diagnosis, Nottingham Prognostic Index, and cohort, and the genetic features rheb and nr3c1. The study sheds light on the rational allocation of medical resources and provides insights into early prevention, diagnosis, and treatment of breast cancer through the identified clinical and genetic risk factors.
Funding: Funded by the University Transportation Center for Underground Transportation Infrastructure (UTC-UTI) at the Colorado School of Mines under Grant No. 69A3551747118 from the US Department of Transportation (DOT).
Abstract: Prediction of tunneling-induced ground settlements is an essential task, particularly for tunneling in urban settings. Ground settlements should be limited within a tolerable threshold to avoid damage to aboveground structures. Machine learning (ML) methods are becoming popular in many fields, including tunneling and underground excavations, as a powerful learning and prediction technique. However, the datasets collected from a tunneling project are usually small from the perspective of applying ML methods. Can ML algorithms effectively predict tunneling-induced ground settlements when the available datasets are small? In this study, seven ML methods are used to predict tunneling-induced ground settlement from 14 contributing factors measured before or during tunnel excavation. These methods comprise multiple linear regression (MLR), decision tree (DT), random forest (RF), gradient boosting (GB), support vector regression (SVR), back-propagation neural network (BPNN), and permutation importance-based BPNN (PI-BPNN) models. All methods except BPNN and PI-BPNN are shallow-structure ML methods. The effectiveness of these seven ML approaches on small datasets is evaluated in terms of model accuracy and stability. Model accuracy is measured by the coefficient of determination (R^(2)) on the training and testing datasets, while the stability of a learning algorithm indicates robust predictive performance. In addition, the quantile error (QE) criterion is introduced to assess predictive performance with respect to both underprediction and overprediction. Our study reveals that the RF algorithm outperforms all the other models, with the highest prediction accuracy (0.9) and stability (3.02×10^(-27)). Deep-structure ML models do not perform well on small datasets, with relatively low model accuracy (0.59) and stability (5.76). The PI-BPNN architecture is proposed and designed for small datasets, showing better performance than the typical BPNN. Six important contributing factors of ground settlements are identified, including tunnel depth, the distance between the tunnel face and surface monitoring points (DTM), weighted average soil compressibility modulus (ACM), grouting pressure, penetration rate, and thrust force.
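One plausible reading of the accuracy/stability evaluation above is mean and variance of test R^(2) over repeated train/test splits; the sketch below implements that reading on synthetic data (the exact protocol, split count, and dataset are assumptions, not the paper's).

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Small synthetic dataset: 60 samples, 14 features, echoing the paper's
# "small dataset with 14 contributing factors" setting.
X, y = make_regression(n_samples=60, n_features=14, noise=5.0, random_state=0)

scores = []
for seed in range(10):  # repeated random splits
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                              random_state=seed)
    model = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_tr, y_tr)
    scores.append(model.score(X_te, y_te))  # test R^2 for this split

accuracy = float(np.mean(scores))   # higher is better
stability = float(np.var(scores))   # lower (closer to 0) is more stable
print(round(accuracy, 3), stability)
```

Under this reading, the paper's reported stability of 3.02×10^(-27) for RF would mean near-identical R^(2) across repetitions.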
Funding: Shanghai Philosophy and Social Science Program, China (No. 2019BGL004).
Abstract: The existence of time delay in complex industrial processes or dynamical systems is a common phenomenon and a difficult problem to deal with in industrial control systems, as well as in the textile field. Accurate identification of the time delay can greatly improve the efficiency of industrial process control system design. Time delay identification methods based on mathematical modeling require prior knowledge of the structural information of the model, especially for nonlinear systems. Neural network-based identification methods can predict the time delay of a system but cannot accurately obtain its specific parameters. Benefiting from the interpretability of machine learning, a novel method for delay identification based on an interpretable regression decision tree is proposed. Using the self-explanatory analysis of the decision tree model, the parameters with the highest feature importance are obtained to identify the time delay of the system. Excellent results are obtained on simulation data from linear and nonlinear control systems, and the time delay of the systems can be accurately identified.
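The identification idea above can be shown in miniature: build lagged copies of the input signal as features, fit a regression tree, and read the estimated delay off the lag with the highest feature importance. The signal and true delay below are a toy construction, not the paper's simulation systems.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy system with a pure delay of 3 steps: y(t) = u(t - 3).
rng = np.random.default_rng(0)
u = rng.standard_normal(500)
delay = 3
y = np.roll(u, delay)

# Candidate features: u(t-1) ... u(t-max_lag).
max_lag = 8
t = np.arange(max_lag, len(u))
X = np.column_stack([u[t - k] for k in range(1, max_lag + 1)])
target = y[t]

tree = DecisionTreeRegressor(max_depth=6, random_state=0).fit(X, target)
# The most important lag identifies the delay.
estimated_delay = int(np.argmax(tree.feature_importances_)) + 1
print(estimated_delay)  # → 3
```

Because the tree's splits concentrate on the one lag that explains the output, the importance vector is nearly a point mass at the true delay.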
Abstract: The energy issue is of strategic importance, influencing China's overall economic and social development, and needs systematic planning and far-sighted deliberation. At the present time, the revolution in energy technology is advancing rapidly. Global innovation in energy technology has entered a highly dynamic period featured by multi-point breakthroughs,
Funding: Supported by the National Natural Science Foundation of China (Grant No. 42177291) and the Innovation Capability Support Program of Shaanxi Province (2023-JC-JQ-25 and 2021KJXX-11).
Abstract: As an essential property of frozen soils, the change of unfrozen water content (UWC) with temperature, namely the soil-freezing characteristic curve (SFCC), plays a significant role in numerous physical, hydraulic, and mechanical processes in cold regions, including heat and water transfer within soils and at the land–atmosphere interface, frost heave and thaw settlement, and the simulation of coupled thermo-hydro-mechanical interactions. Although various models have been proposed to estimate the SFCC, their applicability remains limited because they were derived from specific soil types, soil treatments, and test devices. Accordingly, this study proposes a novel data-driven model to predict the SFCC using an extreme gradient boosting (XGBoost) model. A systematic database of SFCCs for frozen soils, compiled from extensive experimental investigations using various testing methods, was used to train the XGBoost model. The soil freezing characteristic curves (UWC as a function of temperature) predicted by the well-trained XGBoost model were compared with the original experimental data and with three conventional models. The results demonstrate the superior performance of the proposed XGBoost model over the traditional models in predicting the SFCC. This study provides valuable insights for future investigations of the SFCC of frozen soils.
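A shape-only sketch of a data-driven SFCC fit: a gradient-boosted regressor learns UWC as a function of temperature from a synthetic freezing curve. The curve form, coefficients, and the use of scikit-learn's `GradientBoostingRegressor` as a stand-in for XGBoost are all assumptions for illustration.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

# Synthetic freezing curve: UWC decays as temperature drops below 0 degC
# (hypothetical power-law-like shape, not a measured SFCC).
rng = np.random.default_rng(1)
T = rng.uniform(-15.0, 0.0, 400)
uwc = 0.30 * (1.0 + np.abs(T)) ** -0.6 + 0.01 * rng.standard_normal(400)

model = GradientBoostingRegressor(random_state=0).fit(T.reshape(-1, 1), uwc)
print(round(model.score(T.reshape(-1, 1), uwc), 2))
```

The fitted model reproduces the expected monotonic trend: predicted UWC near 0 degC exceeds predicted UWC at strongly negative temperatures.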
Abstract: Solar radiation is capable of producing heat, causing chemical reactions, and generating electricity. Thus, the amount of solar radiation at different times of the day must be determined in order to design and equip solar systems. Moreover, a thorough understanding of the different solar radiation components is necessary, including Direct Normal Irradiance (DNI), Diffuse Horizontal Irradiance (DHI), and Global Horizontal Irradiance (GHI). Unfortunately, measurements of solar radiation are not easily accessible for the majority of regions on the globe. This paper aims to develop a set of deep learning models, coupled with feature importance algorithms, to predict DNI. The proposed models are based on historical data of meteorological parameters and solar radiation properties at a specific location in the region of Errachidia, Morocco, from January 1, 2017, to December 31, 2019, at 60-minute intervals. The findings demonstrate that feature selection approaches play a crucial role in accurately forecasting solar radiation.
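One common feature-screening step of the kind described above is ranking candidate meteorological inputs by mutual information with the target. The sketch below uses synthetic stand-ins (not the Errachidia measurements, and a hypothetical set of inputs):

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

# Hypothetical inputs; DNI is driven by sky clearness in this toy setup.
rng = np.random.default_rng(7)
n = 800
clearness = rng.uniform(0.0, 1.0, n)
humidity = rng.uniform(20.0, 90.0, n)
wind = rng.uniform(0.0, 12.0, n)
dni = 900.0 * clearness + 10.0 * rng.standard_normal(n)

X = np.column_stack([clearness, humidity, wind])
mi = mutual_info_regression(X, dni, random_state=0)
print(int(np.argmax(mi)))  # → 0 (clearness dominates in this toy setup)
```

Features with near-zero mutual information would be dropped before training the deep models.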
Funding: The authors express their gratitude to all developers of the databases mentioned and quoted in this article for their important work and the data they shared, and thank users for their comments and suggestions on the previous version of KinasePhos. This work was supported by the National Natural Science Foundation of China (Grant No. 32070659), the Science, Technology and Innovation Commission of Shenzhen Municipality (Grant No. JCYJ20200109150003938), the Guangdong Province Basic and Applied Basic Research Fund (Grant No. 2021A1515012447), and the Ganghong Young Scholar Development Fund (Grant No. 2021E007), China. This work is also supported by Warshel Institute for Computational Biology funding from Shenzhen City and Longgang District, China.
Abstract: The purpose of this work is to enhance KinasePhos, a machine learning-based kinase-specific phosphorylation site prediction tool. Experimentally verified kinase-specific phosphorylation data were collected from PhosphoSitePlus, UniProtKB, GPS 5.0, and Phospho.ELM. In total, 41,421 experimentally verified kinase-specific phosphorylation sites were identified. A total of 1380 unique kinases were identified, including 753 with existing classification information from KinBase and the remaining 627 annotated by building a phylogenetic tree. Based on this kinase classification, a total of 771 predictive models were built at the individual, family, and group levels, using at least 15 experimentally verified substrate sites in the positive training datasets. The improved models demonstrated their effectiveness compared with other prediction tools. For example, the prediction of sites phosphorylated by the protein kinase B, casein kinase 2, and protein kinase A families had accuracies of 94.5%, 92.5%, and 90.0%, respectively. The average prediction accuracy over all 771 models was 87.2%. To enhance interpretability, the SHapley Additive exPlanations (SHAP) method was employed to assess feature importance. The web interface of KinasePhos 3.0 has been redesigned to provide comprehensive annotations of kinase-specific phosphorylation sites on multiple proteins. Additionally, considering the large scale of phosphoproteomic data, a downloadable prediction tool is available at https://awi.cuhk.edu.cn/KinasePhos/download.html or https://github.com/tom-209/KinasePhos-3.0-executable-file.
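The paper uses SHAP for feature attribution; as a lighter model-agnostic alternative with the same goal of ranking input features, the sketch below uses scikit-learn's permutation importance on synthetic data (not phosphosite encodings). This is plainly a swapped-in technique, not the KinasePhos pipeline.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance

# Synthetic data: with shuffle=False the first 3 columns are the
# informative features, the remaining 7 are noise.
X, y = make_classification(n_samples=400, n_features=10, n_informative=3,
                           n_redundant=0, shuffle=False, random_state=0)

clf = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)
result = permutation_importance(clf, X, y, n_repeats=10, random_state=0)
# Informative features should rank first.
print(result.importances_mean.argsort()[::-1][:3])
```

Like SHAP, permutation importance treats the model as a black box, so the same code applies regardless of the underlying classifier.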
Funding: The Strategic Priority Research Program of the Chinese Academy of Sciences (No. XDA20040202) and the Second Tibetan Plateau Scientific Expedition and Research Program (STEP) (No. 2019QZKK0603).
Abstract: Alluvial fans are an important land resource in the Qinghai-Tibet Plateau amid the expansion of human activities. However, the factors driving alluvial fan development are poorly understood. According to our previous investigation and research, approximately 826 alluvial fans exist in the Lhasa River Basin (LRB). The main purpose of this work is to identify the main influencing factors by using machine learning. A development index (Di) of each alluvial fan was created by combining its area, perimeter, height, and gradient. 72% of the data, comprising Di and 11 types of environmental parameters of the catchment matching each alluvial fan, were used with 10 commonly used machine learning algorithms to train and build models. 18% of the data were used to validate the models, and the remaining 10% were used to test model accuracy. The feature importance of the model was used to illustrate the significance of the 11 types of environmental parameters to Di. The preliminary modelling results showed that the accuracies of the ensemble models, including Gradient Boost Decision Tree, Random Forest, and XGBoost, are not less than 0.5 (R²). The accuracies of the Gradient Boost Decision Tree and XGBoost improved after grid search, with R² values of 0.782 and 0.870, respectively. XGBoost was selected as the final model due to its optimal accuracy and generalisation ability at the sites closest to the LRB. Morphology parameters are the main factors in alluvial fan development, with a cumulative relative feature importance of 74.60% in XGBoost. The final model will achieve better accuracy and generalisation ability after adding training samples from other regions.
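The tuning step described above (a grid search over boosted-tree hyperparameters, then reading off relative feature importances) can be sketched as follows. The data are synthetic, the feature split into three "morphology-like" signal columns and two noise columns is an assumption for illustration, and sklearn's GradientBoostingRegressor stands in for XGBoost to avoid an extra dependency.

```python
# Hedged sketch: grid search over a gradient-boosted tree regressor, then
# cumulative feature importance of the signal features. Synthetic stand-in data.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

rng = np.random.default_rng(42)
n = 400
# Hypothetical predictors: 3 signal ("morphology") features + 2 pure-noise features.
X = rng.normal(size=(n, 5))
Di = 2.0 * X[:, 0] + 1.5 * X[:, 1] - X[:, 2] + 0.1 * rng.normal(size=n)

X_train, X_test, y_train, y_test = train_test_split(
    X, Di, test_size=0.1, random_state=0
)

grid = GridSearchCV(
    GradientBoostingRegressor(random_state=0),
    param_grid={"n_estimators": [100, 200], "max_depth": [2, 3]},
    cv=3,
    scoring="r2",
)
grid.fit(X_train, y_train)
r2 = grid.score(X_test, y_test)           # held-out R^2 of the best model

importances = grid.best_estimator_.feature_importances_
morph_share = importances[:3].sum()       # cumulative share of the 3 signal features
print(f"test R^2 = {r2:.3f}, signal-feature share = {morph_share:.2%}")
```

The same pattern (train/validate/test split, grid search, then summing importances over a parameter group) mirrors how the 74.60% cumulative morphology importance is obtained in the paper, though the actual model and data differ.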
Funding: This work was supported by the National Natural Science Foundation of China (No. 51808056), the Hunan Provincial Natural Science Foundation of China (No. 2020JJ5583), the Research Foundation of the Education Bureau of Hunan Province (No. 19B012), and the China Scholarship Council (No. 201808430232).
Abstract: Accurately estimating the interfacial bond capacity of the near-surface mounted (NSM) carbon fiber-reinforced polymer (CFRP) to concrete joint is a fundamental task in the strengthening and retrofit of existing reinforced concrete (RC) structures. The machine learning (ML) approach may provide an alternative to the commonly used semi-empirical or semi-analytical methods. Therefore, in this work we developed a predictive model based on an artificial neural network (ANN), i.e. a back-propagation neural network (BPNN), to map the complex data pattern obtained from an NSM CFRP to concrete joint. It involves a set of nine material and geometric input parameters and one output value. Moreover, by employing the neural interpretation diagram (NID) technique, the BPNN model becomes interpretable, as the influence of each input variable on the model can be tracked and quantified from the connection weights of the neural network. An extensive database of 163 pull-out test samples, collected from the authors' research group and from published results in the literature, is used to train and verify the ANN. Our results show that the prediction given by the BPNN model agrees well with the experimental data, yielding a coefficient of determination of 0.957 on the whole database. After removing one non-significant feature, the BPNN becomes even more computationally efficient and accurate. In addition, compared with existing semi-analytical models, the ANN-based approach demonstrates a more accurate estimation. Therefore, the proposed ML method may be a promising alternative for predicting the bond strength of NSM CFRP to concrete joints for structural engineers.
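The connection-weight interpretation described above can be sketched with a small back-propagation network and Garson's algorithm, a common weight-based importance measure in the same spirit as the NID reading of connection weights. This is a hedged illustration only: the four-input surrogate function and data are synthetic assumptions, not the paper's nine-parameter bond-capacity model, and sklearn's MLPRegressor stands in for the BPNN.

```python
# Hedged sketch: single-hidden-layer back-propagation network plus Garson's
# connection-weight importance. Synthetic data; NOT the paper's model.
import numpy as np
from sklearn.neural_network import MLPRegressor

rng = np.random.default_rng(1)
n = 300
X = rng.uniform(size=(n, 4))
# Hypothetical bond-capacity surrogate: inputs 0 and 1 dominate, input 3 is inert.
y = 3.0 * X[:, 0] + 2.0 * X[:, 1] + 0.2 * X[:, 2]

net = MLPRegressor(hidden_layer_sizes=(8,), max_iter=5000,
                   random_state=0).fit(X, y)

def garson_importance(mlp):
    """Relative input importance from |input-hidden| x |hidden-output| weights."""
    w_ih = np.abs(mlp.coefs_[0])           # shape (n_inputs, n_hidden)
    w_ho = np.abs(mlp.coefs_[1]).ravel()   # shape (n_hidden,)
    # Share of each input within every hidden node, weighted by that node's
    # contribution to the output, summed over hidden nodes and normalised.
    contrib = (w_ih / w_ih.sum(axis=0)) * w_ho
    scores = contrib.sum(axis=1)
    return scores / scores.sum()

imp = garson_importance(net)
print("relative importance:", np.round(imp, 3))
```

Like the NID, this ties each input's influence to the magnitudes of its trained connection weights, which is also how a non-significant feature (small importance share) can be identified and removed.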