Forest habitats are critical for biodiversity, ecosystem services, human livelihoods, and well-being. Capacity to conduct theoretical and applied forest ecology research addressing direct (e.g., deforestation) and indirect (e.g., climate change) anthropogenic pressures has benefited considerably from new field and statistical techniques. We used machine learning and bibliometric structural topic modelling to identify 20 latent topics comprising four principal fields from a corpus of 16,952 forest ecology/forestry articles published in eight ecology and five forestry journals between 2010 and 2022. Articles published per year increased from 820 in 2010 to 2,354 in 2021, shifting toward more applied topics. Publications from China and some countries in North America and Europe dominated, with relatively fewer articles from some countries in West and Central Africa and West Asia, despite globally important forest resources. Most study sites were in some countries in North America, Central Asia, and South America, and in Australia. Articles utilizing R statistical software predominated, increasing from 29.5% in 2010 to 71.4% in 2022. The most frequently used packages included lme4, vegan, nlme, MuMIn, ggplot2, car, MASS, mgcv, multcomp, and raster. R was more often used in forest ecology than in applied forestry articles. R software offers advantages in script and workflow sharing compared to other statistical packages. Our findings demonstrate that the disciplines of forest ecology/forestry are expanding both in number and scope, aided by more sophisticated statistical tools, to tackle the challenges of redressing forest habitat loss and the socio-economic impacts of deforestation.
Automatically detecting Ulva prolifera (U. prolifera) in rainy and cloudy weather using remote sensing imagery has been a long-standing problem. Here, we address this challenge by combining high-resolution Synthetic Aperture Radar (SAR) imagery with machine learning, and detect the U. prolifera of the South Yellow Sea of China (SYS) in 2021. The findings indicate that the Random Forest model can accurately and robustly detect U. prolifera, even in the presence of complex ocean backgrounds and speckle noise. Visual inspection confirmed that the method successfully identified the majority of pixels containing U. prolifera without misidentifying noise pixels or seawater pixels as U. prolifera. Additionally, the method demonstrated consistent performance across different images, with an average Area Under Curve (AUC) of 0.930 (±0.028). The analysis yielded an overall accuracy of over 96%, with an average Kappa coefficient of 0.941 (±0.038). Compared to the traditional thresholding method, the Random Forest model has a lower estimation error of 14.81%. Practical application indicates that this method can be used in the detection of the unprecedented U. prolifera bloom of 2021 to derive continuous spatiotemporal changes. This study provides a potential new method to detect U. prolifera and enhances our understanding of macroalgal outbreaks in the marine environment.
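The overall accuracy and Kappa coefficient reported for the U. prolifera maps can both be read directly off a classification's confusion matrix. A minimal pure-Python sketch, using a hypothetical two-class pixel tally rather than the paper's data:

```python
def kappa_and_accuracy(confusion):
    """Overall accuracy and Cohen's kappa from a square confusion matrix
    (rows = reference classes, columns = predicted classes)."""
    k = len(confusion)
    total = sum(sum(row) for row in confusion)
    observed = sum(confusion[i][i] for i in range(k)) / total
    # Chance agreement: product of the row and column marginal proportions
    expected = sum(
        (sum(confusion[i]) / total) * (sum(row[i] for row in confusion) / total)
        for i in range(k)
    )
    return observed, (observed - expected) / (1 - expected)

# Hypothetical tally: 940 algae pixels hit, 20 missed, 15 false alarms,
# 9025 seawater pixels correctly rejected
acc, kappa = kappa_and_accuracy([[940, 20], [15, 9025]])
```

Kappa discounts the agreement expected by chance, which is why it is the more conservative of the two figures when one class (seawater) dominates the scene.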
Forecasting of ocean currents is critical for both marine meteorological research and ocean engineering and construction. Timely and accurate forecasting of coastal current velocities offers a scientific foundation and decision support for multiple practices such as search and rescue, disaster avoidance and remediation, and offshore construction. This research established a framework to generate short-term surface current forecasts based on ensemble machine learning trained on high-frequency radar observations. Results indicate that an ensemble algorithm that uses random forests to filter forecasting features by weighting them, and then forecasts with the AdaBoost method, can significantly reduce model training time while preserving forecasting effectiveness, with great economic benefits. Model accuracy is a function of surface current variability and the forecasting horizon. To improve the forecasting capability and accuracy of the model, the structure of the ensemble algorithm was optimized, and the random forest algorithm was used to dynamically select model features. The results show that the optimized surface current forecasting model has a more regular error variation, and that the importance of the features varies with the forecasting time step. At a ten-step-ahead forecasting horizon, the model reported a root mean square error, mean absolute error, and correlation coefficient of 2.84 cm/s, 2.02 cm/s, and 0.96, respectively. The model error is affected by factors such as topography, boundaries, and the geometric accuracy of the observation system. This paper demonstrates the potential of ensemble-based machine learning algorithms to improve forecasting of ocean currents.
A machine learning (ML)-based random forest (RF) classification model was employed to investigate the main factors affecting the formation of the core-shell structure of BaTiO3-based ceramics, and their interpretability was analyzed using Shapley additive explanations (SHAP). The F1-score changed from 0.8795 to 0.9310, accuracy from 0.8450 to 0.9070, precision from 0.8714 to 0.9000, and recall from 0.8929 to 0.9643, and a ROC/AUC value of 0.97±0.03 was achieved by the RF classifier with an optimal set containing only 5 features, demonstrating the high accuracy and robustness of our model. The interpretability analysis found that the electronegativity, melting point, and sintering temperature of the dopant contribute strongly to the formation of the core-shell structure. Based on these characteristics, specific ranges were delineated and twelve elements meeting all the requirements were finally obtained, namely Si, Sc, Mn, Fe, Co, Ni, Pd, Er, Tm, Lu, Pa, and Cm. In exploring the core-shell structure, candidate doping elements can thus be effectively narrowed down by choosing the range of features.
Machine learning is currently one of the research hotspots in the field of landslide prediction. To clarify and evaluate the differences in characteristics and prediction performance of different machine learning models, Conghua District, the district of Guangzhou most prone to landslide disasters, was selected for landslide susceptibility evaluation. The evaluation factors were selected using correlation analysis and the variance inflation factor method. Landslide models were constructed by applying four machine learning methods: Logistic Regression (LR), Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGB). Comparative analysis and evaluation of the models were conducted through statistical indices and receiver operating characteristic (ROC) curves. The results showed that the LR, RF, SVM, and XGB models all have good predictive performance for landslide susceptibility, with area under the curve (AUC) values of 0.752, 0.965, 0.996, and 0.998, respectively. The XGB model had the highest predictive ability, followed by the RF, SVM, and LR models. The frequency ratio (FR) accuracy of the LR, RF, SVM, and XGB models was 0.775, 0.842, 0.759, and 0.822, respectively. The RF and XGB models were superior to the LR and SVM models, indicating that ensemble algorithms have better predictive ability than single classification algorithms in regional landslide classification problems.
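The AUC used to rank the four models has a convenient rank-based reading: it is the probability that a randomly chosen landslide cell receives a higher susceptibility score than a randomly chosen non-landslide cell. A minimal sketch with illustrative labels and scores, not the study's data:

```python
def auc(labels, scores):
    """Area under the ROC curve via the Mann-Whitney identity:
    the fraction of positive/negative pairs where the positive outscores
    the negative (ties count half)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# 1 = landslide cell, 0 = stable cell; scores are model susceptibility outputs
a = auc([1, 1, 1, 0, 0, 0], [0.9, 0.8, 0.4, 0.5, 0.3, 0.1])
```

An AUC of 0.5 is random ranking and 1.0 is perfect separation, which is what makes it a threshold-free way to compare LR, RF, SVM, and XGB on the same footing.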
The application of carbon dioxide (CO2) in enhanced oil recovery (EOR) has increased significantly, in which CO2 solubility in oil is a key parameter in predicting CO2 flooding performance. Hydrocarbons are the major constituents of oil, thus the focus of this work lies in investigating the solubility of CO2 in hydrocarbons. However, current experimental measurements are time-consuming, and equations of state can be computationally complex. To address these challenges, we developed an artificial intelligence-based model to predict the solubility of CO2 in hydrocarbons under varying conditions of temperature, pressure, molecular weight, and density. Using experimental data from previous studies, we trained and predicted the solubility using four machine learning models: support vector regression (SVR), extreme gradient boosting (XGBoost), random forest (RF), and multilayer perceptron (MLP). Among the four models, the XGBoost model has the best predictive performance, with an R² of 0.9838. Additionally, sensitivity analysis and evaluation of the relative impacts of each input parameter indicate that the prediction of CO2 solubility in hydrocarbons is most sensitive to pressure. Furthermore, our trained model was compared with existing models, demonstrating the higher accuracy and applicability of our model. The developed machine learning-based model provides a more efficient and accurate approach for predicting CO2 solubility in hydrocarbons, which may contribute to the advancement of CO2-related applications in the petroleum industry.
Manual investigation of chest radiography (CXR) images by physicians is crucial for effective decision-making in COVID-19 diagnosis. However, the high demand during the pandemic necessitates auxiliary help through image analysis and machine learning techniques. This study presents a multi-threshold-based segmentation technique to probe high pixel intensity regions in CXR images of various pathologies, including normal cases. Texture information is extracted using gray-level co-occurrence matrix (GLCM)-based features, while vessel-like features are obtained using Frangi, Sato, and Meijering filters. Machine learning models employing Decision Tree (DT) and Random Forest (RF) approaches are designed to categorize CXR images into common lung infections, lung opacity (LO), COVID-19, and viral pneumonia (VP). The results demonstrate that the fusion of texture- and vessel-based features provides an effective ML model for aiding diagnosis. Model validation using performance measures, including an accuracy of approximately 91.8% with an RF-based classifier, supports the usefulness of the feature set and classifier model in categorizing the four different pathologies. Furthermore, the study investigates the importance of the devised features in identifying the underlying pathology and incorporates histogram-based analysis. This analysis reveals varying natural pixel distributions in CXR images belonging to the normal, COVID-19, LO, and VP groups, motivating the incorporation of additional features such as the mean, standard deviation, skewness, and percentiles of the filtered images. Notably, the study achieves a considerable improvement in categorizing COVID-19 from LO, with a true positive rate of 97%, further substantiating the effectiveness of the methodology implemented.
BACKGROUND: Liver cancer is one of the most prevalent malignant tumors worldwide, and its early detection and treatment are crucial for enhancing patient survival rates and quality of life. However, the early symptoms of liver cancer are often not obvious, resulting in a late-stage diagnosis in many patients, which significantly reduces the effectiveness of treatment. Developing a highly targeted, widely applicable, and practical risk prediction model for liver cancer is crucial for enhancing early diagnosis and long-term survival rates among affected individuals. AIM: To develop a liver cancer risk prediction model by employing machine learning techniques, and subsequently assess its performance. METHODS: In this study, a total of 550 patients were enrolled, with 190 hepatocellular carcinoma (HCC) and 195 cirrhosis patients serving as the training cohort, and 83 HCC and 82 cirrhosis patients forming the validation cohort. Logistic regression (LR), support vector machine (SVM), random forest (RF), and least absolute shrinkage and selection operator (LASSO) regression models were developed in the training cohort. Model performance was assessed in the validation cohort. Additionally, this study conducted a comparative evaluation of the diagnostic efficacy between the ASAP model and the models developed in this study, using receiver operating characteristic curves, calibration curves, and decision curve analysis (DCA) to determine the optimal predictive model for assessing liver cancer risk. RESULTS: Six variables, including age, white blood cell count, red blood cell count, platelet count, alpha-fetoprotein, and protein induced by vitamin K absence or antagonist II levels, were used to develop the LR, SVM, RF, and LASSO regression models. The RF model exhibited superior discrimination, and the area under the curve (AUC) of the training and validation sets was 0.969 and 0.858, respectively. These values significantly surpassed those of the LR (0.850 and 0.827), SVM (0.860 and 0.803), LASSO regression (0.845 and 0.831), and ASAP (0.866 and 0.813) models. Furthermore, calibration and DCA indicated that the RF model exhibited robust calibration and clinical validity. CONCLUSION: The RF model demonstrated excellent prediction capabilities for HCC and can facilitate early diagnosis of HCC in clinical practice.
The application of machine learning (ML) algorithms in various fields of hepatology is an issue of interest. However, we must be cautious with the results. In this letter, based on a published ML prediction model for acute kidney injury after liver surgery, we discuss some limitations of ML models and how they may be addressed in the future. Although the future faces significant challenges, it also holds great potential.
Survival rates following radical surgery for gastric neuroendocrine neoplasms (g-NENs) are low, with high recurrence rates. This fact impacts patient prognosis and complicates postoperative management. Traditional prognostic models, including the Cox proportional hazards (CoxPH) model, have shown limited predictive power for postoperative survival in gastric neuroendocrine tumor patients. Machine learning methods offer a unique opportunity to analyze complex relationships within datasets, providing tools and methodologies to assess the large volumes of high-dimensional, multimodal data generated by the biological sciences. These methods show promise in predicting outcomes across various medical disciplines. In the context of g-NENs, utilizing machine learning to predict survival outcomes holds potential for personalized postoperative management strategies. This editorial reviews a study exploring the advantages and effectiveness of the random survival forest (RSF) model, using the lymph node ratio (LNR), in predicting disease-specific survival (DSS) in postoperative g-NEN patients stratified into low-risk and high-risk groups. The findings demonstrate that the RSF model incorporating LNR outperformed the CoxPH model in predicting DSS and constitutes an important step towards precision medicine.
The incidence of prediabetes has reached a dangerous level in the USA. The likelihood of developing chronic and complex health issues is very high if this stage of prediabetes is ignored. Early detection of the prediabetes condition is therefore critical to decrease or avoid type 2 diabetes and the other health issues that result from untreated and undiagnosed prediabetes. This study was conducted to detect the prediabetes condition with an artificial intelligence method. The data used for this study were collected from the Centers for Disease Control and Prevention's (CDC) survey conducted by the Division of Health and Nutrition Examination Surveys (DHANES). Several machine learning algorithms are exploited and compared to determine the best algorithm based on Average Squared Error (ASE), Kolmogorov-Smirnov (Youden) scores, areas under the ROC curve, and other measures. Based on these scores, a champion model is selected: Random Forest, with approximately 89% accuracy.
Early stroke prediction is vital to prevent damage. A stroke happens when the blood flow to the brain is disrupted by a clot or bleeding, resulting in brain death or injury. Early diagnosis and treatment reduce long-term care needs and lower health costs. We aim for this research to provide a machine-learning method for forecasting early warning signs of stroke. Our methodology employed feature selection techniques and multiple algorithms. Utilizing the XGBoost algorithm, the research findings indicate that the proposed model achieved an accuracy rate of 96.45%. This research shows that machine learning can effectively predict early warning signs of stroke, which can help reduce long-term treatment and rehabilitation needs and lower health costs.
Customer churn poses a significant challenge for the banking and finance industry in the United States, directly affecting profitability and market share. This study conducts a comprehensive comparative analysis of machine learning models for customer churn prediction, focusing on the U.S. context. The research evaluates the performance of logistic regression, random forest, and neural networks using industry-specific datasets, considering the economic impact and practical implications of the findings. The exploratory data analysis reveals unique patterns and trends in the U.S. banking and finance industry, such as the age distribution of customers and the prevalence of dormant accounts. The study incorporates macroeconomic factors to capture the potential influence of external conditions on customer churn behavior. The findings highlight the importance of leveraging advanced machine learning techniques and comprehensive customer data to develop effective churn prevention strategies in the U.S. context. By accurately predicting customer churn, financial institutions can proactively identify at-risk customers, implement targeted retention strategies, and optimize resource allocation. The study discusses the limitations and potential future improvements, serving as a roadmap for researchers and practitioners to further advance the field of customer churn prediction in the evolving landscape of the U.S. banking and finance industry.
Credit card fraud remains a significant challenge, with financial losses and consumer protection at stake. This study addresses the need for practical, real-time fraud detection methodologies. Using a Kaggle credit card dataset, I tackle class imbalance using the Synthetic Minority Oversampling Technique (SMOTE) to enhance modeling efficiency. I compare several machine learning algorithms, including Logistic Regression, Linear Discriminant Analysis, K-nearest Neighbors, Classification and Regression Tree, Naive Bayes, Support Vector, Random Forest, XGBoost, and Light Gradient-Boosting Machine, to classify transactions as fraud or genuine. Rigorous evaluation metrics, such as AUC, PRAUC, F1, KS, Recall, and Precision, identify the Random Forest as the best performer in detecting fraudulent activities. The Random Forest model successfully identifies approximately 92% of transactions scoring 90 and above as fraudulent, equating to a detection rate of over 70% for all fraudulent transactions in the test dataset. Moreover, the model captures more than half of the fraud in each bin of the test dataset. SHAP values provide model explainability, with the SHAP summary plot highlighting the global importance of individual features, such as "V12" and "V14". SHAP force plots offer local interpretability, revealing the impact of specific features on individual predictions. This study demonstrates the potential of machine learning, particularly the Random Forest model, for real-time credit card fraud detection, offering a promising approach to mitigate financial losses and protect consumers.
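SMOTE rebalances the classes by interpolating new minority samples between existing ones rather than duplicating them. A toy stand-in for the idea (not the imbalanced-learn implementation, and using hypothetical points):

```python
import random

def smote_like(minority, n_new, k=2, seed=0):
    """Toy SMOTE: each synthetic point is a random interpolation between a
    minority sample and one of its k nearest minority neighbours."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours by squared Euclidean distance, excluding x itself
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sum((a - b) ** 2 for a, b in zip(x, p)),
        )[:k]
        nb = rng.choice(neighbours)
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append(tuple(a + t * (b - a) for a, b in zip(x, nb)))
    return synthetic

# Three hypothetical fraud samples in 2-D feature space
new_points = smote_like([(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)], n_new=5)
```

Because the new points lie on line segments between real fraud samples, the classifier sees a denser but still plausible minority region instead of exact copies, which is what helps against the extreme imbalance of fraud data.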
The increasing amount and intricacy of network traffic in the modern digital era have worsened the difficulty of identifying abnormal behaviours that may indicate potential security breaches or operational interruptions. Conventional detection approaches face challenges in keeping up with the ever-changing strategies of cyber-attacks, resulting in heightened susceptibility and significant harm to network infrastructures. In order to tackle this urgent issue, this project focused on developing an effective anomaly detection system that utilizes machine learning technology. The suggested model uses contemporary machine learning algorithms and frameworks to autonomously detect deviations from typical network behaviour. It promptly identifies anomalous activities that may indicate security breaches or performance difficulties. The solution entails a multi-faceted approach encompassing data collection, preprocessing, feature engineering, model training, and evaluation. The model is trained on a wide range of datasets that include both regular and abnormal network traffic patterns, ensuring that it can adapt to numerous scenarios. The main priority is to ensure that the system is functional and efficient, with a particular emphasis on reducing false positives to avoid unwanted alerts. Additionally, efforts are directed at improving anomaly detection accuracy so that the model can consistently distinguish between potentially harmful and benign activity. This project aims to greatly strengthen network security by addressing emerging cyber threats and improving resilience and reliability.
Every year, a large number of dogs are abandoned or euthanised due to temperament issues and a lack of understanding by owners regarding dog behaviour and training. This research focuses on the potential to predict adult dog temperament from early puppy behaviours using a machine learning model. Specifically, the research used guard dog breeds such as the American Bully, American Pit Bull Terrier, and German Shepherd. The study collected dog data and general data from dog owners and used the Random Forest approach to build a predictive model. Users can input puppy data and receive adult dog temperament predictions from the model, which is integrated into a web application. The aims of this web application are to enhance responsible dog ownership and reduce abandonment by offering insights and training recommendations based on predicted outcomes. The model achieved a prediction accuracy of 86% on testing and is continually improving, though further refinement is recommended to improve its reliability and applicability across a broader range of breeds. The study contributes to canine welfare by providing a practical solution for predicting temperament outcomes, ultimately helping to reduce shelter populations and euthanasia rates.
Hyperparameter tuning is a key step in developing high-performing machine learning models, but searching large hyperparameter spaces requires extensive computation using standard sequential methods. This work analyzes the performance gains from parallel versus sequential hyperparameter optimization. Using scikit-learn's RandomizedSearchCV, this project tuned a Random Forest classifier for fake news detection via randomized grid search. Setting n_jobs to -1 enabled full parallelization across CPU cores. Results show the parallel implementation achieved over 5× faster CPU times and 3× faster total run times compared to sequential tuning. However, test accuracy slightly dropped from 99.26% sequentially to 99.15% with parallelism, indicating a trade-off between evaluation efficiency and model performance. Still, the significant computational gains allow more extensive hyperparameter exploration within reasonable timeframes, outweighing the small accuracy decrease. Further analysis could better quantify this trade-off across different models, tuning techniques, tasks, and hardware.
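The parallel speed-up described here comes from evaluating independent hyperparameter candidates concurrently, which is what scikit-learn's n_jobs=-1 switches on inside RandomizedSearchCV. A stdlib-only sketch of the same pattern, with a toy scoring function standing in for a cross-validated model fit:

```python
import random
from concurrent.futures import ThreadPoolExecutor

def evaluate(params):
    """Stand-in for one cross-validation run; returns a score for a candidate.
    (A real search would fit and score a model here.)"""
    n_trees, max_depth = params
    # Toy response surface peaking at 300 trees, depth 10
    return 1.0 - abs(n_trees - 300) / 1000 - abs(max_depth - 10) / 100

def random_search(n_candidates=20, workers=4, seed=42):
    """Draw random candidates, score them concurrently, return the best."""
    rng = random.Random(seed)
    candidates = [(rng.randrange(50, 1000), rng.randrange(2, 30))
                  for _ in range(n_candidates)]
    # Candidates are independent, so they can be fanned out across workers,
    # analogous to n_jobs=-1 in RandomizedSearchCV
    with ThreadPoolExecutor(max_workers=workers) as pool:
        scores = list(pool.map(evaluate, candidates))
    return max(zip(scores, candidates))

best_score, best_params = random_search()
```

Because each candidate's evaluation is independent, the search parallelizes embarrassingly well; the accuracy difference reported in the abstract comes from run-to-run variation in which candidates get drawn and evaluated, not from the parallelism itself.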
Customer attrition in the banking industry occurs when consumers quit using the goods and services offered by the bank for some time and, after that, end their connection with the bank. Customer retention is therefore essential in today's extremely competitive banking market. Additionally, having a solid customer base helps attract new consumers by fostering confidence and referrals from the current clientele. These factors make reducing client attrition a crucial step that banks must pursue. In our research, we aim to examine bank data and forecast which users are most likely to discontinue using the bank's services and end their connection with the bank. We use various machine learning algorithms to analyze the data and present a comparative analysis on different evaluation metrics. In addition, we developed a data visualization RShiny app for data science and management regarding customer churn analysis. Analyzing these data will help the bank identify the trend and then try to retain customers on the verge of attrition.
Accurately assessing the State of Charge (SOC) is paramount for optimizing battery management systems, a cornerstone for ensuring peak battery performance and safety across diverse applications, encompassing vehicle powertrains and renewable energy storage systems. Confronted with the challenges of traditional SOC estimation methods, which often struggle with accuracy and cost-effectiveness, this research endeavors to elevate the precision of SOC estimation to a new level, thereby refining battery management strategies. Leveraging the power of integrated learning techniques, the study fuses Random Forest Regressor, Gradient Boosting Regressor, and Linear Regression into a comprehensive framework that substantially enhances the accuracy and overall performance of SOC predictions. By harnessing the publicly accessible National Aeronautics and Space Administration (NASA) Battery Cycle dataset, our analysis reveals that these integrated learning approaches significantly outperform traditional methods like Coulomb counting and electrochemical models, achieving remarkable improvements in SOC estimation accuracy, error reduction, and optimization of key metrics like R² and adjusted R². This pioneering work propels the development of innovative battery management systems grounded in machine learning and deepens our comprehension of how this cutting-edge technology can revolutionize battery technology.
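R² and adjusted R², the key metrics cited above, differ only in that the adjusted form penalizes each extra predictor. A short sketch with illustrative numbers, not the NASA Battery Cycle data:

```python
def r2_scores(y_true, y_pred, n_features):
    """R² (explained variance fraction) and adjusted R², which discounts R²
    for the number of predictors used."""
    n = len(y_true)
    mean = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    r2 = 1 - ss_res / ss_tot
    adj = 1 - (1 - r2) * (n - 1) / (n - n_features - 1)
    return r2, adj

# Hypothetical SOC values (true vs. predicted) with one predictor
r2, adj = r2_scores([1, 2, 3, 4, 5], [1.1, 1.9, 3.2, 3.9, 5.1], n_features=1)
```

Adjusted R² is always at or below R², so a model that only matches a simpler one's R² after adding predictors has not actually improved, which is why both figures are worth reporting for model comparisons like this one.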
Most forest fires in the Margalla Hills are related to human activities, and socioeconomic factors are essential to assess their likelihood of occurrence. This study considers both environmental (altitude, precipitation, forest type, terrain, and humidity index) and socioeconomic (population density, distance from roads and urban areas) factors to analyze how human behavior affects the risk of forest fires. Maximum entropy (Maxent) modelling and random forest (RF) machine learning methods were used to predict the probability and spatial diffusion patterns of forest fires in the Margalla Hills. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used to compare the models. We studied the fire history from 1990 to 2019 to establish the relationship between the probability of forest fire and environmental and socioeconomic changes. Using Maxent, the AUC fire probability values for 1999, 2009, and 2019 were 0.532, 0.569, and 0.518, respectively; using RF, they were 0.782, 0.825, and 0.789, respectively. Fires were mainly distributed in urban areas, and their probability of occurrence was related to accessibility and human behaviour/activity. Validation AUC values were greater for the random forest models than for the Maxent models. Our results can be used to establish preventive measures to reduce the risk of forest fires by considering socioeconomic and environmental conditions.
Funding: Financially supported by the National Natural Science Foundation of China (31971541).
Funding: Under the auspices of the National Natural Science Foundation of China (No. 42071385) and the National Science and Technology Major Project of High Resolution Earth Observation System (No. 79-Y50-G18-9001-22/23).
Abstract: Automatically detecting Ulva prolifera (U. prolifera) in rainy and cloudy weather using remote sensing imagery has been a long-standing problem. Here, we address this challenge by combining high-resolution Synthetic Aperture Radar (SAR) imagery with machine learning, and detect U. prolifera in the South Yellow Sea of China (SYS) in 2021. The findings indicate that the Random Forest model can accurately and robustly detect U. prolifera, even in the presence of complex ocean backgrounds and speckle noise. Visual inspection confirmed that the method successfully identified the majority of pixels containing U. prolifera without misidentifying noise pixels or seawater pixels as U. prolifera. Additionally, the method demonstrated consistent performance across different images, with an average Area Under the Curve (AUC) of 0.930 (±0.028). The analysis yielded an overall accuracy of over 96%, with an average Kappa coefficient of 0.941 (±0.038). Compared to the traditional thresholding method, the Random Forest model has a lower estimation error of 14.81%. Practical application indicates that this method can be used to detect the unprecedented U. prolifera outbreak of 2021 and to derive continuous spatiotemporal changes. This study provides a potential new method to detect U. prolifera and enhances our understanding of macroalgal outbreaks in the marine environment.
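The pixel-level detection workflow described above can be sketched with scikit-learn's RandomForestClassifier. The features below are synthetic stand-ins for per-pixel SAR statistics (backscatter, texture); they are hypothetical values, not real Sentinel-1 data:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic stand-in for per-pixel SAR features (e.g., VV/VH backscatter in dB,
# a texture statistic); real inputs would come from preprocessed SAR tiles.
n = 2000
seawater = rng.normal(loc=[-18.0, -25.0, 0.3], scale=1.0, size=(n, 3))
algae = rng.normal(loc=[-12.0, -19.0, 0.8], scale=1.0, size=(n, 3))
X = np.vstack([seawater, algae])
y = np.array([0] * n + [1] * n)  # 1 = algae-covered pixel

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y)

clf = RandomForestClassifier(n_estimators=200, random_state=0)
clf.fit(X_tr, y_tr)

# AUC is the per-image metric the paper reports (0.930 on real imagery).
auc = roc_auc_score(y_te, clf.predict_proba(X_te)[:, 1])
```

In practice each image's pixels would be classified independently and the resulting masks stacked over time to derive the spatiotemporal changes mentioned in the abstract.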
Funding: Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai) under contract No. SML2020SP009; the National Basic Research and Development Program of China under contract Nos. 2022YFF0802000 and 2022YFF0802004; the "Renowned Overseas Professors" Project of Guangdong Provincial Department of Science and Technology under contract No. 76170-52910004; the Belt and Road Special Foundation of the National Key Laboratory of Water Disaster Prevention under contract No. 2022491711; the National Natural Science Foundation of China under contract No. 51909290; and the Key Research and Development Program of Guangdong Province under contract No. 2020B1111020003.
Abstract: Forecasting of ocean currents is critical for both marine meteorological research and ocean engineering and construction. Timely and accurate forecasting of coastal current velocities offers a scientific foundation and decision support for practices such as search and rescue, disaster avoidance and remediation, and offshore construction. This research established a framework to generate short-term surface current forecasts based on ensemble machine learning trained on high-frequency radar observations. Results indicate that an ensemble algorithm that uses random forests to weight and filter forecasting features, and then forecasts with the AdaBoost method, can significantly reduce model training time while preserving forecasting effectiveness, with considerable economic benefits. Model accuracy is a function of surface current variability and the forecasting horizon. To improve the forecasting capability and accuracy of the model, the structure of the ensemble algorithm was optimized, and the random forest algorithm was used to dynamically select model features. The results show that the optimized surface current forecasting model has a more regular error variation, and that the importance of features varies with the forecasting time step. At a ten-step-ahead forecasting horizon, the model reported a root mean square error of 2.84 cm/s, a mean absolute error of 2.02 cm/s, and a correlation coefficient of 0.96. The model error is affected by factors such as topography, boundaries, and the geometric accuracy of the observation system. This paper demonstrates the potential of ensemble-based machine learning algorithms to improve forecasting of ocean currents.
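A minimal sketch of the two-stage ensemble idea above (random forest feature weighting, then AdaBoost forecasting), using a synthetic lagged-feature dataset in place of real radar-derived currents. The feature count, coefficients, and cutoff of three retained features are all assumptions for illustration:

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, RandomForestRegressor
from sklearn.metrics import mean_absolute_error

rng = np.random.default_rng(1)

# Synthetic lagged-current features: only the first 3 of 10 lags matter.
X = rng.normal(size=(1500, 10))
y = 2.0 * X[:, 0] + 1.0 * X[:, 1] - 1.5 * X[:, 2] + 0.1 * rng.normal(size=1500)

# Stage 1: weight features with a random forest and keep the strongest ones.
rf = RandomForestRegressor(n_estimators=100, random_state=1).fit(X[:1000], y[:1000])
keep = np.argsort(rf.feature_importances_)[::-1][:3]

# Stage 2: fit AdaBoost on the reduced feature set only, which shortens training.
ada = AdaBoostRegressor(n_estimators=100, random_state=1)
ada.fit(X[:1000][:, keep], y[:1000])
mae = mean_absolute_error(y[1000:], ada.predict(X[1000:][:, keep]))
```

In the paper the retained features change with the forecasting time step; re-running stage 1 per horizon would reproduce that dynamic selection.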
Funding: Funded by the National Key Research and Development Program of China (No. 2023YFB3812200).
Abstract: A machine learning (ML)-based random forest (RF) classification model was employed to investigate the main factors affecting the formation of the core-shell structure of BaTiO₃-based ceramics, and its interpretability was analyzed using Shapley additive explanations (SHAP). With an optimal set of only five features, the RF classification achieved an F1-score that improved from 0.8795 to 0.9310, accuracy from 0.8450 to 0.9070, precision from 0.8714 to 0.9000, recall from 0.8929 to 0.9643, and an ROC/AUC value of 0.97±0.03, demonstrating the high accuracy and robustness of our model. The interpretability analysis found that the electronegativity, melting point, and sintering temperature of the dopant contribute strongly to the formation of the core-shell structure. Based on these characteristics, specific ranges were delineated, and twelve elements meeting all the requirements were finally obtained, namely Si, Sc, Mn, Fe, Co, Ni, Pd, Er, Tm, Lu, Pa, and Cm. When exploring core-shell structures, candidate doping elements can thus be effectively narrowed down by constraining these feature ranges.
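The paper couples an RF classifier with SHAP. As a lightweight stand-in for SHAP (which requires the separate `shap` package), the sketch below ranks features with the forest's built-in impurity importances instead; the dopant descriptors and the rule generating the core-shell label are entirely hypothetical:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(2)

# Hypothetical dopant descriptors; only the first two columns
# (electronegativity, melting point) drive this synthetic label.
n = 1200
X = rng.uniform(size=(n, 5))  # [electroneg., melting_pt, radius, valence, sinter_T]
y = ((X[:, 0] > 0.5) & (X[:, 1] > 0.4)).astype(int)  # 1 = core-shell forms

X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, random_state=2, stratify=y)
clf = RandomForestClassifier(n_estimators=300, random_state=2).fit(X_tr, y_tr)

f1 = f1_score(y_te, clf.predict(X_te))
# Impurity importances recover the two truly informative descriptors;
# SHAP would additionally give signed, per-sample attributions.
top2 = set(np.argsort(clf.feature_importances_)[::-1][:2].tolist())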
Funding: supported by projects of the China Geological Survey (DD20221729, DD20190291) and the Zhuhai Urban Geological Survey (including informatization) (MZCD-2201-008).
Abstract: Machine learning is currently one of the research hotspots in the field of landslide prediction. To clarify and evaluate the differences in characteristics and prediction performance of different machine learning models, Conghua District, the district of Guangzhou most prone to landslide disasters, was selected for landslide susceptibility evaluation. The evaluation factors were selected using correlation analysis and the variance inflation factor method. Four machine learning methods, namely Logistic Regression (LR), Random Forest (RF), Support Vector Machines (SVM), and Extreme Gradient Boosting (XGB), were applied to construct landslide models. Comparative analysis and evaluation of the models were conducted using statistical indices and receiver operating characteristic (ROC) curves. The results showed that the LR, RF, SVM, and XGB models all have good predictive performance for landslide susceptibility, with area under the curve (AUC) values of 0.752, 0.965, 0.996, and 0.998, respectively. The XGB model had the highest predictive ability, followed by the RF, SVM, and LR models. The frequency ratio (FR) accuracy of the LR, RF, SVM, and XGB models was 0.775, 0.842, 0.759, and 0.822, respectively. The RF and XGB models were superior to the LR and SVM models, indicating that integrated algorithms have better predictive ability than single classification algorithms in regional landslide classification problems.
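The four-model AUC comparison can be sketched as below. GradientBoostingClassifier stands in for XGBoost so the example needs only scikit-learn, and the data are synthetic rather than real susceptibility factors:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the evaluation factors (slope, rainfall, lithology, ...).
X, y = make_classification(n_samples=2000, n_features=8, n_informative=5,
                           random_state=3)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=3)

models = {
    "LR": LogisticRegression(max_iter=1000),
    "RF": RandomForestClassifier(n_estimators=200, random_state=3),
    "SVM": SVC(probability=True, random_state=3),
    "GB": GradientBoostingClassifier(random_state=3),  # stand-in for XGB
}
aucs = {}
for name, model in models.items():
    model.fit(X_tr, y_tr)
    aucs[name] = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])
```

On synthetic data the ranking will not necessarily match the paper's (XGB > RF > SVM > LR); the point is the shared fit/score loop used for the comparison.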
Funding: supported by the Fundamental Research Funds for the National Major Science and Technology Projects of China (No. 2017ZX05009-005).
Abstract: The application of carbon dioxide (CO₂) in enhanced oil recovery (EOR) has increased significantly, in which CO₂ solubility in oil is a key parameter for predicting CO₂ flooding performance. Hydrocarbons are the major constituents of oil, so this work focuses on investigating the solubility of CO₂ in hydrocarbons. However, current experimental measurements are time-consuming, and equations of state can be computationally complex. To address these challenges, we developed an artificial intelligence-based model to predict the solubility of CO₂ in hydrocarbons under varying conditions of temperature, pressure, molecular weight, and density. Using experimental data from previous studies, we trained and predicted the solubility using four machine learning models: support vector regression (SVR), extreme gradient boosting (XGBoost), random forest (RF), and multilayer perceptron (MLP). Among the four models, the XGBoost model has the best predictive performance, with an R² of 0.9838. Additionally, sensitivity analysis and evaluation of the relative impacts of each input parameter indicate that the prediction of CO₂ solubility in hydrocarbons is most sensitive to pressure. Furthermore, our trained model was compared with existing models, demonstrating the higher accuracy and applicability of our model. The developed machine learning-based model provides a more efficient and accurate approach for predicting CO₂ solubility in hydrocarbons, which may contribute to the advancement of CO₂-related applications in the petroleum industry.
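A hedged sketch of the regression-plus-sensitivity workflow, with GradientBoostingRegressor standing in for XGBoost and a synthetic solubility function in which pressure dominates by construction (the coefficients are not physical):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)

# Hypothetical inputs: temperature (K), pressure (MPa), molecular weight,
# density (kg/m^3); the synthetic "solubility" is most sensitive to pressure.
n = 1500
T = rng.uniform(300, 400, n)
P = rng.uniform(1, 30, n)
MW = rng.uniform(100, 500, n)
rho = rng.uniform(600, 900, n)
sol = 0.8 * P - 0.05 * (T - 300) + 0.002 * MW + 0.3 * rng.normal(size=n)

X = np.column_stack([T, P, MW, rho])
X_tr, X_te, y_tr, y_te = train_test_split(X, sol, test_size=0.25, random_state=4)

gbr = GradientBoostingRegressor(n_estimators=300, random_state=4).fit(X_tr, y_tr)
r2 = r2_score(y_te, gbr.predict(X_te))

# Crude sensitivity check via feature importances: index 1 is pressure.
most_sensitive = int(np.argmax(gbr.feature_importances_))
```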
Abstract: Manual investigation of chest radiography (CXR) images by physicians is crucial for effective decision-making in COVID-19 diagnosis. However, the high demand during the pandemic necessitates auxiliary help through image analysis and machine learning techniques. This study presents a multi-threshold-based segmentation technique to probe high-pixel-intensity regions in CXR images of various pathologies, including normal cases. Texture information is extracted using gray-level co-occurrence matrix (GLCM)-based features, while vessel-like features are obtained using Frangi, Sato, and Meijering filters. Machine learning models employing Decision Tree (DT) and Random Forest (RF) approaches are designed to categorize CXR images into common lung infections, lung opacity (LO), COVID-19, and viral pneumonia (VP). The results demonstrate that the fusion of texture- and vessel-based features provides an effective ML model for aiding diagnosis. Model validation using performance measures, including an accuracy of approximately 91.8% with an RF-based classifier, supports the usefulness of the feature set and classifier model in categorizing the four different pathologies. Furthermore, the study investigates the importance of the devised features in identifying the underlying pathology and incorporates histogram-based analysis. This analysis reveals varying natural pixel distributions in CXR images belonging to the normal, COVID-19, LO, and VP groups, motivating the incorporation of additional features such as the mean, standard deviation, skewness, and percentiles computed on the filtered images. Notably, the study achieves a considerable improvement in distinguishing COVID-19 from LO, with a true positive rate of 97%, further substantiating the effectiveness of the methodology.
Funding: Cuiying Scientific and Technological Innovation Program of the Second Hospital (Nos. CY2021-BJ-A16 and CY2022-QN-A18); Clinical Medical School of Lanzhou University and Lanzhou Science and Technology Development Guidance Plan Project (No. 2023-ZD-85).
Abstract: BACKGROUND Liver cancer is one of the most prevalent malignant tumors worldwide, and its early detection and treatment are crucial for enhancing patient survival rates and quality of life. However, the early symptoms of liver cancer are often not obvious, resulting in late-stage diagnosis in many patients, which significantly reduces the effectiveness of treatment. Developing a highly targeted, widely applicable, and practical risk prediction model for liver cancer is crucial for enhancing early diagnosis and long-term survival rates among affected individuals. AIM To develop a liver cancer risk prediction model employing machine learning techniques, and subsequently to assess its performance. METHODS In this study, a total of 550 patients were enrolled, with 190 hepatocellular carcinoma (HCC) and 195 cirrhosis patients serving as the training cohort, and 83 HCC and 82 cirrhosis patients forming the validation cohort. Logistic regression (LR), support vector machine (SVM), random forest (RF), and least absolute shrinkage and selection operator (LASSO) regression models were developed in the training cohort. Model performance was assessed in the validation cohort. Additionally, this study compared the diagnostic efficacy of the ASAP model and the model developed here using receiver operating characteristic curves, calibration curves, and decision curve analysis (DCA) to determine the optimal predictive model for assessing liver cancer risk. RESULTS Six variables, including age, white blood cell count, red blood cell count, platelet count, alpha-fetoprotein, and protein induced by vitamin K absence or antagonist II levels, were used to develop the LR, SVM, RF, and LASSO regression models. The RF model exhibited superior discrimination: the area under the curve for the training and validation sets was 0.969 and 0.858, respectively. These values significantly surpassed those of the LR (0.850 and 0.827), SVM (0.860 and 0.803), LASSO regression (0.845 and 0.831), and ASAP (0.866 and 0.813) models. Furthermore, calibration and DCA indicated that the RF model exhibited robust calibration and clinical validity. CONCLUSION The RF model demonstrated excellent prediction capabilities for HCC and can facilitate early diagnosis of HCC in clinical practice.
Abstract: The application of machine learning (ML) algorithms in various fields of hepatology is an issue of interest. However, we must be cautious with the results. In this letter, based on a published ML prediction model for acute kidney injury after liver surgery, we discuss some limitations of ML models and how they may be addressed in the future. Although the future faces significant challenges, it also holds great potential.
Abstract: Survival rates following radical surgery for gastric neuroendocrine neoplasms (g-NENs) are low, with high recurrence rates. This impacts patient prognosis and complicates postoperative management. Traditional prognostic models, including the Cox proportional hazards (CoxPH) model, have shown limited predictive power for postoperative survival in g-NEN patients. Machine learning methods offer a unique opportunity to analyze complex relationships within datasets, providing tools and methodologies to assess large volumes of high-dimensional, multimodal data generated by the biological sciences. These methods show promise in predicting outcomes across various medical disciplines. In the context of g-NENs, using machine learning to predict survival outcomes holds potential for personalized postoperative management strategies. This editorial reviews a study exploring the advantages and effectiveness of the random survival forest (RSF) model, using the lymph node ratio (LNR), in predicting disease-specific survival (DSS) in postoperative g-NEN patients stratified into low-risk and high-risk groups. The findings demonstrate that the RSF model incorporating LNR outperformed the CoxPH model in predicting DSS and constitutes an important step toward precision medicine.
Abstract: The incidence of prediabetes in the USA has reached a dangerous level. The likelihood of developing chronic and complex health issues is very high if the prediabetes stage is ignored, so early detection of the prediabetes condition is critical to decrease or avoid type 2 diabetes and the other health issues that result from untreated and undiagnosed prediabetes. This study detects the prediabetes condition with an artificial intelligence method. The data used for this study were collected from the Centers for Disease Control and Prevention's (CDC) survey conducted by the Division of Health and Nutrition Examination Surveys (DHANES). Several machine learning algorithms are exploited and compared to determine the best algorithm based on Average Squared Error (ASE), Kolmogorov-Smirnov (Youden) scores, areas under the ROC curve, and other measures. Based on these scores, a champion model is selected; Random Forest is the champion model, with approximately 89% accuracy.
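The Kolmogorov-Smirnov (Youden) score mentioned above is the maximum of TPR − FPR over classification thresholds, and can be read directly off a ROC curve. A sketch on synthetic data (the CDC/DHANES features are not reproduced here):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score, roc_curve
from sklearn.model_selection import train_test_split

# Synthetic stand-in for survey features (age, BMI, glucose, ...).
X, y = make_classification(n_samples=2000, n_features=10, n_informative=6,
                           random_state=5)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=5)

clf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
scores = clf.predict_proba(X_te)[:, 1]

fpr, tpr, thresholds = roc_curve(y_te, scores)
youden = tpr - fpr                        # Youden's J at each threshold
ks = float(youden.max())                  # KS statistic = max J
best_threshold = float(thresholds[np.argmax(youden)])
auc = roc_auc_score(y_te, scores)
```

Comparing models on KS and AUC, as the study does, then amounts to repeating this for each candidate and picking the champion.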
Abstract: Early stroke prediction is vital to prevent damage. A stroke happens when the blood flow to the brain is disrupted by a clot or bleeding, resulting in brain death or injury; early diagnosis and treatment reduce long-term care needs and lower health costs. This research develops a machine-learning method for forecasting early warning signs of stroke. The methodology employed feature selection techniques and multiple algorithms. Utilizing the XGBoost algorithm, the research findings indicate that the proposed model achieved an accuracy rate of 96.45%. This research shows that machine learning can effectively predict early warning signs of stroke, which can help reduce long-term treatment and rehabilitation needs and lower health costs.
Abstract: Customer churn poses a significant challenge for the banking and finance industry in the United States, directly affecting profitability and market share. This study conducts a comprehensive comparative analysis of machine learning models for customer churn prediction, focusing on the U.S. context. The research evaluates the performance of logistic regression, random forest, and neural networks using industry-specific datasets, considering the economic impact and practical implications of the findings. The exploratory data analysis reveals unique patterns and trends in the U.S. banking and finance industry, such as the age distribution of customers and the prevalence of dormant accounts. The study incorporates macroeconomic factors to capture the potential influence of external conditions on customer churn behavior. The findings highlight the importance of leveraging advanced machine learning techniques and comprehensive customer data to develop effective churn prevention strategies in the U.S. context. By accurately predicting customer churn, financial institutions can proactively identify at-risk customers, implement targeted retention strategies, and optimize resource allocation. The study discusses limitations and potential future improvements, serving as a roadmap for researchers and practitioners seeking to advance customer churn prediction in the evolving landscape of the U.S. banking and finance industry.
文摘Credit card fraud remains a significant challenge, with financial losses and consumer protection at stake. This study addresses the need for practical, real-time fraud detection methodologies. Using a Kaggle credit card dataset, I tackle class imbalance using the Synthetic Minority Oversampling Technique (SMOTE) to enhance modeling efficiency. I compare several machine learning algorithms, including Logistic Regression, Linear Discriminant Analysis, K-nearest Neighbors, Classification and Regression Tree, Naive Bayes, Support Vector, Random Forest, XGBoost, and Light Gradient-Boosting Machine to classify transactions as fraud or genuine. Rigorous evaluation metrics, such as AUC, PRAUC, F1, KS, Recall, and Precision, identify the Random Forest as the best performer in detecting fraudulent activities. The Random Forest model successfully identifies approximately 92% of transactions scoring 90 and above as fraudulent, equating to a detection rate of over 70% for all fraudulent transactions in the test dataset. Moreover, the model captures more than half of the fraud in each bin of the test dataset. SHAP values provide model explainability, with the SHAP summary plot highlighting the global importance of individual features, such as “V12” and “V14”. SHAP force plots offer local interpretability, revealing the impact of specific features on individual predictions. This study demonstrates the potential of machine learning, particularly the Random Forest model, for real-time credit card fraud detection, offering a promising approach to mitigate financial losses and protect consumers.
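A minimal sketch of oversampling before a Random Forest on imbalanced data, as in the fraud pipeline above. Note it uses plain random oversampling via `sklearn.utils.resample` as a stand-in for SMOTE (imbalanced-learn's `SMOTE` would synthesize interpolated minority samples instead of duplicating them), and synthetic "transactions" rather than the Kaggle dataset:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Heavily imbalanced synthetic "transactions": ~2% fraud (class 1).
X, y = make_classification(n_samples=5000, n_features=10, n_informative=5,
                           weights=[0.98], class_sep=2.0, flip_y=0.0,
                           random_state=6)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.3, random_state=6, stratify=y)

# Random-oversampling stand-in for SMOTE: upsample fraud to match genuine count.
X_min, y_min = X_tr[y_tr == 1], y_tr[y_tr == 1]
X_up, y_up = resample(X_min, y_min, n_samples=int((y_tr == 0).sum()),
                      random_state=6)
X_bal = np.vstack([X_tr[y_tr == 0], X_up])
y_bal = np.concatenate([y_tr[y_tr == 0], y_up])

clf = RandomForestClassifier(n_estimators=200, random_state=6).fit(X_bal, y_bal)
recall = recall_score(y_te, clf.predict(X_te))  # fraud detection rate
```

Evaluation stays on the untouched (imbalanced) test split, which is why recall and precision-oriented metrics such as PR-AUC matter more here than raw accuracy.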
Abstract: The increasing amount and intricacy of network traffic in the modern digital era have worsened the difficulty of identifying abnormal behaviours that may indicate potential security breaches or operational interruptions. Conventional detection approaches face challenges in keeping up with the ever-changing strategies of cyber-attacks, resulting in heightened susceptibility and significant harm to network infrastructures. To tackle this urgent issue, this project focused on developing an effective anomaly detection system that utilizes machine learning technology. The proposed model uses contemporary machine learning algorithms and frameworks to autonomously detect deviations from typical network behaviour, promptly identifying anomalous activities that may indicate security breaches or performance difficulties. The solution entails a multi-faceted approach encompassing data collection, preprocessing, feature engineering, model training, and evaluation. The model is trained on a wide range of datasets that include both regular and abnormal network traffic patterns, ensuring that it can adapt to numerous scenarios. The main priority is to ensure that the system is functional and efficient, with particular emphasis on reducing false positives to avoid unwanted alerts. Additionally, efforts are directed at improving anomaly detection accuracy so that the model can consistently distinguish between potentially harmful and benign activity. This project aims to significantly strengthen network security against emerging cyber threats and improve the resilience and reliability of network infrastructures.
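One common unsupervised choice for the anomaly-detection task described above is an Isolation Forest, which flags points that are easy to isolate as outliers. The sketch below uses synthetic "flow statistics", not real traffic captures, and the contamination rate is an assumption:

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(7)

# Synthetic flow features: normal traffic clusters tightly, attacks are outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(950, 4))
attacks = rng.normal(loc=6.0, scale=1.0, size=(50, 4))
X = np.vstack([normal, attacks])
y_true = np.array([1] * 950 + [-1] * 50)  # IsolationForest convention: -1 = anomaly

iso = IsolationForest(contamination=0.05, random_state=7).fit(X)
pred = iso.predict(X)

detection_rate = (pred[y_true == -1] == -1).mean()
false_positive_rate = (pred[y_true == 1] == -1).mean()
```

Keeping `contamination` close to the true anomaly rate is one lever for the false-positive reduction the abstract emphasizes.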
Abstract: Every year, a growing number of dogs are abandoned or euthanised due to temperament issues and a lack of understanding by owners regarding dog behaviour and training. This research focuses on the potential to predict adult dog temperament from early puppy behaviours using a machine learning model. Specifically, the research used guard dog breeds such as the American Bully, American Pit Bull Terrier, and German Shepherd. The study collected dog-specific and general data from dog owners and used the Random Forest approach to build a predictive model. Users can input puppy data and receive adult-dog temperament predictions from the model, which is integrated into a web application. The aims of this web application are to enhance responsible dog ownership and reduce abandonment by offering insights and training recommendations based on predicted outcomes. The model achieved a prediction accuracy of 86% in testing and is continually improving, though further refinement is recommended to improve its reliability and applicability across a broader range of breeds. The study contributes to canine welfare by providing a practical solution for predicting temperament outcomes, ultimately helping to reduce shelter populations and euthanasia rates.
Abstract: Hyperparameter tuning is a key step in developing high-performing machine learning models, but searching large hyperparameter spaces requires extensive computation using standard sequential methods. This work analyzes the performance gains from parallel versus sequential hyperparameter optimization. Using scikit-learn's RandomizedSearchCV, this project tuned a Random Forest classifier for fake news detection via randomized grid search. Setting n_jobs to -1 enabled full parallelization across CPU cores. Results show the parallel implementation achieved over 5× faster CPU times and 3× faster total run times compared to sequential tuning. However, test accuracy slightly dropped from 99.26% sequentially to 99.15% with parallelism, indicating a trade-off between evaluation efficiency and model performance. Still, the significant computational gains allow more extensive hyperparameter exploration within reasonable timeframes, outweighing the small accuracy decrease. Further analysis could better quantify this trade-off across different models, tuning techniques, tasks, and hardware.
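The parallel setup described above amounts to passing `n_jobs=-1` to RandomizedSearchCV so candidate settings are evaluated on all CPU cores. A minimal sketch on synthetic data (the fake news dataset and the study's exact search space are not reproduced; the grid below is illustrative):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

X, y = make_classification(n_samples=1000, n_features=20, random_state=8)

# Illustrative search space; lists are sampled uniformly by RandomizedSearchCV.
param_dist = {
    "n_estimators": [50, 100, 150, 200],
    "max_depth": [4, 8, 12, None],
    "min_samples_split": [2, 5, 10],
}

# n_jobs=-1 spreads the n_iter * cv fits across all cores;
# n_jobs=1 would run the same candidates sequentially.
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=8),
    param_distributions=param_dist,
    n_iter=8, cv=3, n_jobs=-1, random_state=8,
)
search.fit(X, y)
best_score = search.best_score_
```

Because the candidates and folds are independent fits, parallelism changes wall-clock time but not which models are evaluated, so any accuracy difference like the one reported above stems from run-to-run variation rather than the parallel execution itself.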
Abstract: Customer attrition in the banking industry occurs when consumers stop using the goods and services offered by the bank for some time and, after that, end their connection with the bank. Customer retention is therefore essential in today's extremely competitive banking market. Additionally, a solid customer base helps attract new consumers by fostering confidence and referrals from the current clientele. These factors make reducing client attrition a crucial step that banks must pursue. In our research, we aim to examine bank data and forecast which users are most likely to discontinue using the bank's services. We use various machine learning algorithms to analyze the data and present a comparative analysis on different evaluation metrics. In addition, we developed a Data Visualization RShiny app for data science and management regarding customer churn analysis. Analyzing these data will help the bank identify the trend and then try to retain customers on the verge of attrition.
Abstract: Accurately assessing the State of Charge (SOC) is paramount for optimizing battery management systems, a cornerstone for ensuring peak battery performance and safety across diverse applications, encompassing vehicle powertrains and renewable energy storage systems. Confronted with the challenges of traditional SOC estimation methods, which often struggle with accuracy and cost-effectiveness, this research endeavors to raise the precision of SOC estimation to a new level, thereby refining battery management strategies. Leveraging the power of ensemble learning techniques, the study fuses a Random Forest Regressor, a Gradient Boosting Regressor, and Linear Regression into a comprehensive framework that substantially enhances the accuracy and overall performance of SOC predictions. Analysis of the publicly accessible National Aeronautics and Space Administration (NASA) Battery Cycle dataset reveals that these ensemble learning approaches significantly outperform traditional methods such as Coulomb counting and electrochemical models, achieving remarkable improvements in SOC estimation accuracy, error reduction, and key metrics such as R² and adjusted R². This pioneering work propels the development of innovative battery management systems grounded in machine learning and deepens our comprehension of how this cutting-edge technology can revolutionize battery technology.
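The fused three-model ensemble can be sketched with scikit-learn's VotingRegressor, which averages the regressors' predictions; the SOC data below are synthetic, not the NASA Battery Cycle dataset, and the feature-to-SOC mapping is an assumption for illustration:

```python
import numpy as np
from sklearn.ensemble import (GradientBoostingRegressor, RandomForestRegressor,
                              VotingRegressor)
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(9)

# Hypothetical normalized cycle features (voltage, current, temperature,
# cycle count) mapped to a synthetic SOC percentage.
n = 1500
X = rng.uniform(size=(n, 4))
soc = 100 * (0.5 * X[:, 0] + 0.3 * X[:, 1] - 0.2 * X[:, 2]) + 2 * rng.normal(size=n)

X_tr, X_te, y_tr, y_te = train_test_split(X, soc, test_size=0.25, random_state=9)

# VotingRegressor averages the three fitted models' predictions.
ensemble = VotingRegressor([
    ("rf", RandomForestRegressor(n_estimators=100, random_state=9)),
    ("gb", GradientBoostingRegressor(random_state=9)),
    ("lr", LinearRegression()),
]).fit(X_tr, y_tr)

r2 = r2_score(y_te, ensemble.predict(X_te))
```

A stacking variant (StackingRegressor) that learns weights for the three base models is a natural alternative to plain averaging.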
Funding: supported by the National Key Research and Development Program of China (Grant No. 2019YFE0127700).
Abstract: Most forest fires in the Margalla Hills are related to human activities, and socioeconomic factors are essential to assess their likelihood of occurrence. This study considers both environmental (altitude, precipitation, forest type, terrain, and humidity index) and socioeconomic (population density, distance from roads and urban areas) factors to analyze how human behavior affects the risk of forest fires. Maximum entropy (Maxent) modelling and random forest (RF) machine learning methods were used to predict the probability and spatial diffusion patterns of forest fires in the Margalla Hills. The receiver operating characteristic (ROC) curve and the area under the ROC curve (AUC) were used to compare the models. We studied the fire history from 1990 to 2019 to establish the relationship between the probability of forest fire and environmental and socioeconomic changes. Using Maxent, the AUC fire probability values for 1999, 2009, and 2019 were 0.532, 0.569, and 0.518, respectively; using RF, they were 0.782, 0.825, and 0.789, respectively. Fires were mainly distributed in urban areas, and their probability of occurrence was related to accessibility and human behaviour/activity. AUC values for validation were higher in the random forest models than in the Maxent models. Our results can be used to establish preventive measures that reduce the risk of forest fires by considering socio-economic and environmental conditions.