Leaf area index (LAI) is a key parameter for describing vegetation structures and is closely associated with vegetative photosynthesis and energy balance. The accurate retrieval of LAI is important when modeling biophysical processes of vegetation and the productivity of earth systems. The Random Forests (RF) method aggregates an ensemble of decision trees to improve the prediction accuracy and demonstrates a more robust capacity than other regression methods. This study evaluated the RF method for predicting grassland LAI using ground measurements and remote sensing data. Parameter optimization and variable reduction were conducted before model prediction. Two variable reduction methods were examined: the Variable Importance Value method and the principal component analysis (PCA) method. Finally, the sensitivity of RF to highly correlated variables was tested. The results showed that the RF parameters have a small effect on the performance of RF, and a satisfactory prediction was acquired with a root mean square error (RMSE) of 0.1956. The two variable reduction methods for RF prediction produced different results; variable reduction based on the Variable Importance Value method achieved nearly the same prediction accuracy as the prediction without reduction, whereas variable reduction using the PCA method produced an obviously degraded result that may have been caused by the loss of subtle variations and the fusion of noise information. After removing highly correlated variables, the relative variable importance remained steady, and variables selected based on the best-performing vegetation indices performed better than variables based on all vegetation indices or those selected based on the most important one. The results of this study demonstrate the practical and powerful ability of the RF method in predicting grassland LAI, which can also be applied to the estimation of other vegetation traits as an alternative to conventional empirical regression models and to the selection of relevant variables used in ecological models.
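A minimal sketch of the workflow this abstract describes, using scikit-learn on synthetic data as a stand-in for the field-measured LAI and remote sensing variables: an RF regressor is scored by cross-validated RMSE with all candidate variables, with the variables ranked highest by the built-in importance scores, and with PCA-compressed inputs, so the two reduction strategies can be compared. Variable counts and data shapes are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for a table of candidate predictors (bands, vegetation indices)
# against field-measured LAI; shapes and counts are illustrative only.
X, y = make_regression(n_samples=300, n_features=30, n_informative=8, noise=0.1, random_state=42)

rf = RandomForestRegressor(n_estimators=500, random_state=42)

def cv_rmse(features):
    return -cross_val_score(rf, features, y, scoring="neg_root_mean_squared_error", cv=5).mean()

# Baseline: all candidate variables.
base_rmse = cv_rmse(X)

# Reduction 1: keep the variables ranked highest by RF importance scores.
# (In a real study the ranking should be nested inside the cross-validation.)
importances = rf.fit(X, y).feature_importances_
top = np.argsort(importances)[::-1][:10]
vi_rmse = cv_rmse(X[:, top])

# Reduction 2: compress the variables onto the first principal components.
pca_rmse = cv_rmse(PCA(n_components=10, random_state=42).fit_transform(X))

print(f"CV RMSE - all: {base_rmse:.3f}  importance-reduced: {vi_rmse:.3f}  PCA-reduced: {pca_rmse:.3f}")
```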
Massive Open Online Course (MOOC) has become a popular way of online learning used across the world by millions of people. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on these educational data, many studies have investigated the prediction of a MOOC learner's final grade. However, there are still two problems in this research field. The first problem is how to select the most proper features to improve the prediction accuracy, and the second problem is how to use or modify the data mining algorithms for a better analysis of the MOOC data. In order to solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is further established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, and the superiority of our method is demonstrated.
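The class-imbalance handling described here can be sketched with the imbalanced-learn and scikit-learn libraries. Plain SMOTE is used below as a stand-in for the paper's Clustering-SMOTE, and the synthetic pass/fail data is an assumption in place of the CNPC dataset; placing the oversampler inside a pipeline keeps it restricted to the training folds during cross-validation.

```python
from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for MOOC learner features with an imbalanced final-grade label
# (e.g., 1 = passed, 0 = did not); the CNPC dataset is not reproduced here.
X, y = make_classification(n_samples=2000, n_features=20, weights=[0.9, 0.1], random_state=0)

# Oversampling sits inside the pipeline so it is applied only to the training folds.
model = Pipeline([
    ("smote", SMOTE(random_state=0)),   # plain SMOTE as a stand-in for Clustering-SMOTE
    ("rf", RandomForestClassifier(n_estimators=300, random_state=0)),
])
print("F1 over 5-fold CV:", cross_val_score(model, X, y, scoring="f1", cv=5).mean())
```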
On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time arrival, complexity, and high processing difficulty. Therefore, data cleaning is essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. Due to the insufficiency of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic and adaptive window size is proposed. The efficiency of the algorithm is elevated by improving the key-selection method, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
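The adaptive-window Sorted Neighborhood Method at the core of this abstract can be illustrated with a few toy records. The sketch below covers only the basic sort-key, sliding-window, and window-adaptation steps, not the random-forest-based key selection or the dimensionality reduction the paper adds; record fields, the similarity threshold, and the window bounds are all illustrative assumptions.

```python
from difflib import SequenceMatcher

# Toy records; real on-site programming data would have many more fields.
records = [
    {"id": 1, "name": "Zhang Wei", "email": "zwei@example.com"},
    {"id": 2, "name": "Zhang Wei", "email": "zwei@example.com"},
    {"id": 3, "name": "Li Na",     "email": "lina@example.com"},
    {"id": 4, "name": "Li  Na",    "email": "lina@example.com"},
    {"id": 5, "name": "Wang Fang", "email": "wfang@example.com"},
]

def key(r):                        # sorting key built from the record fields
    return (r["name"].replace(" ", "").lower(), r["email"].lower())

def similar(a, b, threshold=0.9):  # pairwise similarity of two records
    return SequenceMatcher(None, " ".join(key(a)), " ".join(key(b))).ratio() >= threshold

def snm_adaptive(records, w_min=2, w_max=6):
    """Sorted Neighborhood Method with a window that grows while matches
    keep being found and shrinks back towards w_min otherwise."""
    rows = sorted(records, key=key)
    dupes, w = set(), w_min
    for i in range(len(rows)):
        hit = False
        for j in range(i + 1, min(i + w, len(rows))):
            if similar(rows[i], rows[j]):
                dupes.add((rows[i]["id"], rows[j]["id"]))
                hit = True
        w = min(w + 1, w_max) if hit else max(w - 1, w_min)
    return dupes

print(snm_adaptive(records))   # detects the pairs (1, 2) and (3, 4)
```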
Background: To create and validate nomograms for the personalized prediction of survival in octogenarians with newly diagnosed non-small-cell lung cancer (NSCLC) with sole brain metastases (BMs). Methods: Random forests (RF) were applied to identify independent prognostic factors for building the nomogram models. The predictive accuracy of the model was evaluated based on the receiver operating characteristic (ROC) curve, C-index, and calibration plots. Results: The area under the curve (AUC) values for overall survival at 6, 12, and 18 months in the validation cohort were 0.837, 0.867, and 0.849, respectively; the AUC values for cancer-specific survival prediction were 0.819, 0.835, and 0.818, respectively. The calibration curves visualized the accuracy of the model. Conclusion: The new nomograms have good predictive power for survival among octogenarians with sole BMs related to NSCLC.
HIV and AIDS have continued to be a major public health concern, and hence constitute one of the epidemics that the world resolved to end by 2030, as highlighted in the Sustainable Development Goals (SDGs). A colossal amount of effort has been made to reduce new HIV infections, but a significant number of new infections are still reported. HIV prevalence is more skewed towards the key populations, which include female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID). The study design was retrospective and focused on key populations enrolled in a comprehensive HIV and AIDS programme by the Kenya Red Cross Society from July 2019 to June 2021. Individuals who were either lost to follow-up, defaulted (dropped out, transferred out, or relocated), or died were classified as attrition, while those who were active and alive by the end of the study were classified as retention. The study used density analysis to determine the spatial differences of key population attrition in the 19 targeted counties, and used Kilifi County as an example to map attrition cases in smaller administrative areas (sub-county level). The study used the synthetic minority oversampling technique-nominal continuous (SMOTE-NC) to balance the dataset, since the cases of attrition were much fewer than those of retention. The random survival forests model was then fitted to the balanced dataset. The model correctly identified attrition cases using the predicted ensemble mortality and their survival time using the estimated Kaplan-Meier survival function. The predictive performance of the model was strong and far better than random chance, with concordance indices greater than 0.75.
Real-time intelligent lithology identification while drilling is vital to realizing downhole closed-loop drilling. The complex and changeable geological environment encountered during drilling makes lithology identification face many challenges. This paper studies the problems of difficult feature information extraction, low precision of thin-layer identification, and limited applicability of the model in intelligent lithology identification. The authors try to improve the comprehensive performance of the lithology identification model from three aspects: data feature extraction, class balance, and model design. A new real-time intelligent lithology identification model based on a dynamic felling strategy weighted random forest algorithm (DFW-RF) is proposed. According to the feature selection results, gamma ray and 2 MHz phase resistivity are the logging-while-drilling (LWD) parameters that significantly influence lithology identification. The comprehensive performance of the DFW-RF lithology identification model has been verified in applications to three wells in different areas. By comparing the prediction results of five typical lithology identification algorithms, the DFW-RF model achieves a higher lithology identification accuracy rate and F1 score. This model improves the identification accuracy of thin-layer lithology and is effective and feasible in different geological environments. The DFW-RF model plays a truly efficient role in the real-time intelligent identification of lithologic information in closed-loop drilling and has greater applicability, which makes it worthy of being widely used in logging interpretation.
As massive underground projects have become popular in dense urban cities, a problem has arisen: which model best predicts Tunnel Boring Machine (TBM) performance in these tunneling projects? The performance level of TBMs in complex geological conditions is still a great challenge for practitioners and researchers. On the other hand, a reliable and accurate prediction of TBM performance is essential to planning an applicable tunnel construction schedule. The performance of TBM is very difficult to estimate due to various geotechnical and geological factors and machine specifications. The previously proposed intelligent techniques in this field are mostly based on a single or base model with a low level of accuracy. Hence, this study aims to introduce a hybrid random forest (RF) technique optimized by global harmony search with generalized opposition-based learning (GOGHS) for forecasting TBM advance rate (AR). Optimizing the RF hyper-parameters, e.g., tree number and maximum tree depth, is the main objective of using the GOGHS-RF model. In the modelling of this study, a comprehensive database with the most influential parameters on TBM, together with TBM AR, was used as input and output variables, respectively. To examine the capability and power of the GOGHS-RF model, three more hybrid models, particle swarm optimization-RF, genetic algorithm-RF and artificial bee colony-RF, were also constructed to forecast TBM AR. Evaluation of the developed models was performed by calculating several performance indices, including the determination coefficient (R2), root-mean-square error (RMSE), and mean absolute percentage error (MAPE). The results showed that the GOGHS-RF is a more accurate technique for estimating TBM AR compared to the other applied models. The newly developed GOGHS-RF model achieved R2 = 0.9937 and 0.9844 for the train and test stages, respectively, which are higher than those of a pre-developed RF. Also, the importance of the input parameters was interpreted through the SHapley Additive exPlanations (SHAP) method, and it was found that thrust force per cutter is the most important variable for TBM AR. The GOGHS-RF model can be used in mechanized tunnel projects for predicting and checking performance.
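As a rough stand-in for the GOGHS metaheuristic, the sketch below tunes the same two RF hyper-parameters named in the abstract (tree number and maximum tree depth) with scikit-learn's RandomizedSearchCV on synthetic data; the TBM database itself and the search ranges are assumptions, and random search is plainly a substitute, not the paper's optimizer.

```python
from scipy.stats import randint
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the TBM database: machine/geology parameters vs. advance rate.
X, y = make_regression(n_samples=800, n_features=10, noise=0.2, random_state=1)

# Random search plays the role of the GOGHS metaheuristic here; the tuned
# hyper-parameters (tree number, maximum depth) are the ones named in the abstract.
search = RandomizedSearchCV(
    RandomForestRegressor(random_state=1),
    param_distributions={"n_estimators": randint(50, 400), "max_depth": randint(3, 25)},
    n_iter=10, cv=5, scoring="r2", random_state=1,
)
search.fit(X, y)
print("best hyper-parameters:", search.best_params_, " CV R2:", round(search.best_score_, 4))
```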
In recent years, machine learning (ML) and deep learning (DL) have significantly advanced intrusion detection systems, effectively addressing potential malicious attacks across networks. This paper introduces a robust method for detecting and categorizing attacks within the Internet of Things (IoT) environment, leveraging the NSL-KDD dataset. To achieve high accuracy, the authors used a feature extraction technique in combination with an autoencoder integrated with a gated recurrent unit (GRU). The relevant features are then selected using the cuckoo search algorithm integrated with particle swarm optimization (PSO), and PSO is employed for training the features. The final classification of features is carried out using the proposed RF-GNB model, a random forest combined with a Gaussian Naïve Bayes classifier. The proposed model has been evaluated, its performance verified with standard metrics such as precision, accuracy rate, recall, and F1-score, and compared with different existing models. The generated results, which detected approximately 99.87% of intrusions within the IoT environment, demonstrated the high performance of the proposed method. These results affirm the efficacy of the proposed method in increasing the accuracy of intrusion detection within IoT network systems.
Driven piles are used in many geological environments as a practical and convenient structural component. Hence, the determination of the drivability of piles is of great importance in complex geotechnical applications. Conventional methods of predicting pile drivability often rely on simplified physical models or empirical formulas, which may lack accuracy or applicability in complex geological conditions. Therefore, this study presents a practical machine learning approach, namely a Random Forest (RF) optimized by Bayesian Optimization (BO) and Particle Swarm Optimization (PSO), which not only enhances prediction accuracy but also better adapts to varying geological environments, to predict the drivability parameters of piles (i.e., maximum compressive stress, maximum tensile stress, and blows per foot). In addition, support vector regression, extreme gradient boosting, k-nearest neighbor, and decision tree models are also applied for comparison purposes. In order to train and test these models, 3258 of the 4072 collected datasets with 17 model inputs were randomly selected for training, and the remaining 814 datasets were used for model testing. Lastly, the results of these models were compared and evaluated using two performance indices, i.e., the root mean square error (RMSE) and the coefficient of determination (R2). The results indicate that the optimized RF model achieved lower RMSE than the other prediction models in predicting the three parameters, specifically 0.044, 0.438, and 0.146, and higher R2 values than the other implemented techniques, specifically 0.966, 0.884, and 0.977. In addition, the sensitivity and uncertainty of the optimized RF model were analyzed using Sobol sensitivity analysis and Monte Carlo (MC) simulation. It can be concluded that the optimized RF model can be used to predict the performance of the pile, and it may provide a useful reference for solving similar problems under comparable engineering conditions.
Precise and timely prediction of crop yields is crucial for food security and the development of agricultural policies. However, crop yield is influenced by multiple factors within complex growth environments. Previous research has paid relatively little attention to the interference of environmental factors and drought on the growth of winter wheat. Therefore, there is an urgent need for more effective methods to explore the inherent relationship between these factors and crop yield, making precise yield prediction increasingly important. This study used four types of indicators, including meteorological, crop growth status, environmental, and drought indices, from October 2003 to June 2019 in Henan Province as the basic data for predicting winter wheat yield. Using the sparrow search algorithm combined with random forest (SSA-RF) under different input indicators, the accuracy of winter wheat yield estimation was calculated. The estimation accuracy of SSA-RF was compared with partial least squares regression (PLSR), extreme gradient boosting (XGBoost), and random forest (RF) models. Finally, the determined optimal yield estimation method was used to predict winter wheat yield in three typical years. The findings are as follows: 1) SSA-RF demonstrates superior performance in estimating winter wheat yield compared to the other algorithms; the best yield estimation was achieved by combining all four types of indicators with SSA-RF (R2 = 0.805, RRMSE = 9.9%). 2) Crop growth status and environmental indicators play significant roles in wheat yield estimation, accounting for 46% and 22% of the yield importance among all indicators, respectively. 3) Selecting indicators from October to April of the following year yielded the highest accuracy in winter wheat yield estimation, with an R2 of 0.826 and an RMSE of 9.0%; yield estimates can thus be completed two months before the winter wheat harvest in June. 4) The prediction performance is slightly affected by severe drought. Compared with a severe drought year (2011, R2 = 0.680) and a normal year (2017, R2 = 0.790), the SSA-RF model has higher prediction accuracy for a wet year (2018, R2 = 0.820). This study provides an innovative approach for remote sensing estimation of winter wheat yield.
Fatigue reliability-based design optimization of aeroengine structures involves multiple repeated calculations of reliability degree and large-scale calls of an implicit, highly nonlinear limit state function, which renders the traditional direct Monte Carlo and surrogate methods prone to unacceptable computing efficiency and accuracy. In this case, by fusing the random subspace strategy and a weight allocation technology into bagging ensemble theory, a random forest (RF) model is presented to enhance the computing efficiency of reliability degree; moreover, by embedding the RF model into a multilevel optimization model, an efficient RF-assisted fatigue reliability-based design optimization framework is developed. Taking the low-cycle fatigue reliability-based design optimization of an aeroengine turbine disc as a case, the effectiveness of the presented framework is validated. The reliability-based design optimization results exhibit that the proposed framework holds high computing accuracy and computing efficiency. The current efforts shed light on the theory and method development of reliability-based design optimization of complex engineering structures.
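The role of the RF model as a cheap surrogate for reliability-degree evaluation can be sketched as follows: a forest is trained on a small design of experiments for a hypothetical limit state function (standing in for the expensive fatigue analysis), and Monte Carlo sampling is then run on the surrogate instead of the true function. The limit state, the input distributions, and the sample sizes below are illustrative assumptions, not the paper's turbine-disc model.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)

def limit_state(x):
    """Hypothetical limit state g(x); failure when g(x) < 0.
    Stands in for an expensive finite-element fatigue analysis."""
    return 18.0 - x[:, 0] ** 2 - 1.5 * x[:, 1]

# Small design of experiments used to train the surrogate.
X_doe = rng.normal(loc=[2.0, 3.0], scale=[0.6, 0.9], size=(300, 2))
rf = RandomForestRegressor(n_estimators=300, random_state=0).fit(X_doe, limit_state(X_doe))

# Monte Carlo on the cheap surrogate instead of the true limit state function.
X_mc = rng.normal(loc=[2.0, 3.0], scale=[0.6, 0.9], size=(200_000, 2))
pf = np.mean(rf.predict(X_mc) < 0.0)
print(f"estimated failure probability: {pf:.4f}  (reliability degree: {1 - pf:.4f})")
```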
In the era of the Internet, widely used web applications have become the target of hacker attacks because they contain a large amount of personal information. Among these vulnerabilities, stealing private data through cross-site scripting (XSS) attacks is one of the most common attacks used by hackers. Currently, deep learning-based XSS attack detection methods have good application prospects; however, they suffer from problems such as being prone to overfitting, a high false alarm rate, and low accuracy. To address these issues, we propose a multi-stage feature extraction and fusion model for XSS detection based on Random Forest feature enhancement. The model utilizes Random Forests to capture the intrinsic structure and patterns of the data by extracting leaf node indices as features, which are subsequently merged with the original data features to form a feature set with richer information content. Further feature extraction is conducted through three parallel channels. Channel I utilizes parallel one-dimensional convolutional layers (1D convolutional layers) with different convolutional kernel sizes to extract local features at different scales and perform multi-scale feature fusion; Channel II employs maximum one-dimensional pooling layers (max 1D pooling layers) of various sizes to extract key features from the data; and Channel III extracts global information bi-directionally using a Bi-Directional Long Short-Term Memory network (Bi-LSTM) and incorporates a multi-head attention mechanism to enhance global features. Finally, effective classification and prediction of XSS are performed by fusing the features of the three channels. To test the effectiveness of the model, we conduct experiments on six datasets. We achieve an accuracy of 100% on the UNSW-NB15 dataset and 99.99% on the CICIDS2017 dataset, which is higher than that of existing models.
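The Random Forest feature-enhancement step, extracting leaf-node indices and merging them with the original features, maps directly onto scikit-learn's `apply` method. In the sketch below a second RF stands in for the paper's three-channel deep network purely to keep the example self-contained, and the synthetic data is an assumption in place of real HTTP/script features.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# Synthetic stand-in for numerical features extracted from web requests / scripts.
X, y = make_classification(n_samples=3000, n_features=30, random_state=7)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=7)

# Step 1: a Random Forest provides leaf-node indices as extra structural features.
rf = RandomForestClassifier(n_estimators=50, random_state=7).fit(X_tr, y_tr)
leaves_tr, leaves_te = rf.apply(X_tr), rf.apply(X_te)   # shape: (n_samples, n_trees)

# Step 2: merge leaf indices with the original features to form an enriched set.
X_tr_aug = np.hstack([X_tr, leaves_tr])
X_te_aug = np.hstack([X_te, leaves_te])

# A downstream classifier (the paper feeds this into a multi-channel deep network;
# a second RF is used here only to keep the sketch self-contained).
clf = RandomForestClassifier(n_estimators=200, random_state=7).fit(X_tr_aug, y_tr)
print("accuracy with enriched features:", accuracy_score(y_te, clf.predict(X_te_aug)))
```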
This study proposed a new real-time manufacturing process monitoring method to monitor and detect process shifts in manufacturing operations, since real-time production process monitoring is critical in today's smart manufacturing. The more robust the monitoring model, the more reliably a process can be kept under control. In the past, many researchers have developed real-time monitoring methods to detect process shifts early. However, these methods have limitations in detecting process shifts as quickly as possible and in handling various data volumes and varieties. In this paper, a robust monitoring model combining a Gated Recurrent Unit (GRU) and Random Forest (RF) with Real-Time Contrast (RTC), called GRU-RF-RTC, was proposed to detect process shifts rapidly. The effectiveness of the proposed GRU-RF-RTC model is first evaluated using multivariate normal and non-normal distribution datasets. Then, to prove the applicability of the proposed model in a real manufacturing setting, the model was evaluated using real-world normal and non-normal problems. The results demonstrate that the proposed GRU-RF-RTC outperforms other methods in detecting process shifts quickly, with the lowest average out-of-control run length (ARL1) in all synthetic and real-world problems under both normal and non-normal cases. The experimental results on real-world problems highlight the significance of the proposed GRU-RF-RTC model in modern manufacturing process monitoring applications. The results reveal that the proposed method improves the shift detection capability by 42.14% in normal and 43.64% in gamma distribution problems.
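The Real-Time Contrast idea referenced in GRU-RF-RTC can be sketched with an RF alone: the reference (in-control) sample is labeled 0, the most recent window is labeled 1, and a high out-of-bag separability between the two signals a shift. The GRU component, the control limits, and the data below are not from the paper; they are illustrative assumptions.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Reference (in-control) data and a stream containing a mean shift after t = 300.
reference = rng.normal(0.0, 1.0, size=(500, 5))
stream = np.vstack([rng.normal(0.0, 1.0, size=(300, 5)),
                    rng.normal(1.0, 1.0, size=(200, 5))])

window = 50  # size of the moving window of recent observations

def rtc_statistic(ref, win):
    """Fit an RF to separate reference data (label 0) from the current window
    (label 1); a high out-of-bag accuracy suggests a process shift."""
    X = np.vstack([ref, win])
    y = np.r_[np.zeros(len(ref)), np.ones(len(win))]
    rf = RandomForestClassifier(n_estimators=100, oob_score=True, random_state=0)
    rf.fit(X, y)
    return rf.oob_score_

# In practice the statistic is compared against a control limit estimated
# from in-control runs; here it is simply printed for each window.
for t in range(window, len(stream) + 1, window):
    print(f"t={t:3d}  RTC statistic={rtc_statistic(reference, stream[t - window:t]):.3f}")
```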
Background: Random Forests is a popular classification and regression method that has proven powerful for various prediction problems in biological studies. However, its performance often deteriorates when the number of features increases. To address this limitation, feature elimination Random Forests was proposed, which only uses the features with the largest variable importance scores. Yet the performance of this method is not satisfactory, possibly due to its rigid feature selection and the increased correlations between the trees of the forest. Methods: We propose variable importance-weighted Random Forests, which, instead of sampling features with equal probability at each node to build up trees, samples features according to their variable importance scores and then selects the best split from the randomly selected features. Results: We evaluate the performance of our method through comprehensive simulation and real data analyses, for both regression and classification. Compared to the standard Random Forests and the feature elimination Random Forests methods, our proposed method has improved performance in most cases. Conclusions: By incorporating the variable importance scores into the random feature selection step, our method can better utilize more informative features without completely ignoring less informative ones, and hence has improved prediction accuracy in the presence of weak signals and large noise. We have implemented an R package "viRandomForests" based on the original R package "randomForest", and it can be freely downloaded from http://zhaocenter.org/software.
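A rough Python approximation of the idea (the actual package is in R and applies the weighting at every node): variable importance scores from a standard RF are turned into sampling probabilities, and each tree of a second forest is then grown on a feature subset drawn with those probabilities. This forest-level weighting is a simplification of the per-node sampling the paper describes; the data shapes and subset size are assumptions.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.tree import DecisionTreeRegressor
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

X, y = make_regression(n_samples=500, n_features=50, n_informative=10, noise=5.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Step 1: variable importance scores from a standard RF, turned into sampling probabilities.
imp = RandomForestRegressor(n_estimators=200, random_state=0).fit(X_tr, y_tr).feature_importances_
p = (imp + 1e-6) / (imp + 1e-6).sum()   # smoothing keeps every feature selectable

# Step 2: grow each tree on a bootstrap sample and a feature subset drawn
# with probability proportional to importance (a forest-level approximation).
rng = np.random.default_rng(0)
n_trees, mtry = 200, 15
trees, feats = [], []
for b in range(n_trees):
    cols = rng.choice(X.shape[1], size=mtry, replace=False, p=p)
    rows = rng.integers(0, len(X_tr), len(X_tr))           # bootstrap sample
    trees.append(DecisionTreeRegressor(random_state=b).fit(X_tr[np.ix_(rows, cols)], y_tr[rows]))
    feats.append(cols)

pred = np.mean([t.predict(X_te[:, c]) for t, c in zip(trees, feats)], axis=0)
print("importance-weighted subspace forest RMSE:", round(mean_squared_error(y_te, pred) ** 0.5, 3))
```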
The random forests (RF) algorithm, which combines the predictions from an ensemble of random trees, has achieved significant improvements in terms of classification accuracy. In many real-world applications, however, ranking is often required in order to make optimal decisions. Thus, we focus our attention on the ranking performance of RF in this paper. Our experimental results, based on all 36 UC Irvine Machine Learning Repository (UCI) data sets published on the main website of the Weka platform, show that RF does not perform well in ranking and is even about the same as a single C4.4 tree. This fact raises the question of whether several improvements to RF can scale up its ranking performance. To answer this question, we single out an improved random forests (IRF) algorithm. Instead of the information gain measure and the maximum-likelihood estimate, the average gain measure and the similarity-weighted estimate are used in IRF. Our experiments show that IRF significantly outperforms all the other compared algorithms in terms of ranking while maintaining the high classification accuracy characteristic of RF.
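The evaluation protocol behind this comparison, judging ranking by AUC computed from class-probability estimates rather than by 0/1 accuracy, can be reproduced in a few lines with scikit-learn. The sketch below shows only that protocol on synthetic data, not the IRF modifications (average gain measure, similarity-weighted estimate) themselves.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, n_features=20, random_state=3)
rf = RandomForestClassifier(n_estimators=300, random_state=3)

# Ranking quality is judged by AUC, which depends on the class-probability
# estimates (predict_proba) rather than the hard predictions used for accuracy.
print("accuracy:", cross_val_score(rf, X, y, scoring="accuracy", cv=10).mean())
print("AUC     :", cross_val_score(rf, X, y, scoring="roc_auc", cv=10).mean())
```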
Alzheimer's disease (AD) is a serious neurodegenerative disorder and its cause remains largely elusive. In past years, genome-wide association (GWA) studies have provided an effective means for AD research. However, the univariate method that is commonly used in GWA studies cannot effectively detect the biological mechanisms associated with this disease. In this study, we propose a new strategy for the GWA analysis of AD that combines random forests with enrichment analysis. First, backward feature selection using random forests was performed on a GWA dataset of AD patients carrying the apolipoprotein E ε4 allele (APOEε4), and 1058 susceptible single nucleotide polymorphisms (SNPs) were detected, including several known AD-associated SNPs. Next, the susceptible SNPs were investigated by enrichment analysis, and significantly associated gene functional annotations, such as 'alternative splicing', 'glycoprotein', and 'neuron development', were successfully discovered, indicating that these biological mechanisms play important roles in the development of AD in APOEε4 carriers. These findings may provide insights into the pathogenesis of AD and helpful guidance for further studies. Furthermore, this strategy can easily be modified and applied to GWA studies of other complex diseases.
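Backward feature selection driven by RF importance scores can be sketched with scikit-learn's RFE, which repeatedly drops the least important features. The synthetic genotype matrix and the number of retained features below are assumptions, and the downstream enrichment-analysis step is not shown.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFE

# Synthetic stand-in for a SNP genotype matrix (samples x SNPs) with case/control labels.
X, y = make_classification(n_samples=400, n_features=200, n_informative=15, random_state=11)

# Recursive feature elimination driven by RF variable importance: at each round,
# the least important 10% of the remaining features are dropped (backward selection).
selector = RFE(RandomForestClassifier(n_estimators=200, random_state=11),
               n_features_to_select=20, step=0.1).fit(X, y)
selected = [i for i, keep in enumerate(selector.support_) if keep]
print("retained feature indices:", selected)
```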
To address the problems of wind power abandonment and the stoppage of electricity transmission caused by a short circuit in a power line of a doubly-fed induction generator (DFIG) based wind farm, this paper proposes an intelligent location method for single-phase grounding faults based on a multiple random forests (multi-RF) algorithm. First, the simulation model is built, and the fundamental amplitudes of the zero-sequence currents are extracted by a fast Fourier transform (FFT) to construct the feature set. Then, the random forest classification algorithm is applied to establish the fault section locator. The model is resampled on the basis of the bootstrap method to generate multiple sample subsets, which are used to establish multiple classification and regression tree (CART) classifiers. The CART classifiers use the mean decrease in node impurity as the feature importance, which is used to mine the relationship between features and fault sections. Subsequently, a fault section is identified by voting on the test results of each classifier. Finally, a multi-RF regression fault locator is built to output the predicted fault distance. Experimental results with PSCAD/EMTDC software show that the proposed method can overcome the shortcomings of a single RF and has the advantage of locating faults on a short hybrid overhead/cable line with multiple branches. Compared with support vector machines (SVMs) and previously reported methods, the proposed method can better meet the location accuracy and efficiency requirements of a DFIG-based wind farm.
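A compact sketch of the feature-construction and section-classification steps: the fundamental (50 Hz) amplitude of each simulated zero-sequence current is extracted with an FFT and fed to an RF classifier that votes on the fault section. The toy signal model below replaces the PSCAD/EMTDC simulation and is purely an illustrative assumption, as are the numbers of measuring points and sections.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

rng = np.random.default_rng(5)
fs, f0, n = 10_000, 50, 2000          # sampling rate, power frequency, samples per record

def fundamental_amplitude(signal):
    """Amplitude of the 50 Hz component from an FFT of one measured current."""
    spectrum = np.abs(np.fft.rfft(signal)) * 2 / len(signal)
    freqs = np.fft.rfftfreq(len(signal), d=1 / fs)
    return spectrum[np.argmin(np.abs(freqs - f0))]

# Toy zero-sequence currents at 4 measuring points; the faulted section changes
# which points see a large fundamental component.
def make_record(section):
    t = np.arange(n) / fs
    amps = np.where(np.arange(4) <= section, 5.0, 0.5) + rng.normal(0, 0.2, 4)
    return [fundamental_amplitude(a * np.sin(2 * np.pi * f0 * t) + rng.normal(0, 0.1, n))
            for a in amps]

y = rng.integers(0, 4, 600)                       # fault section labels 0..3
X = np.array([make_record(s) for s in y])

X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=5)
clf = RandomForestClassifier(n_estimators=200, random_state=5).fit(X_tr, y_tr)
print("fault-section accuracy:", accuracy_score(y_te, clf.predict(X_te)))
```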
Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting the relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
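A minimal sketch of the prediction step on assumed data: RF classification of case/control status from relative-abundance profiles, scored by cross-validated AUC. For the cross-dataset setting described in the abstract, the cross-validation call would be replaced by fitting on one cohort and scoring on another; the synthetic abundance matrix here stands in for the Centrifuge/MetaPhlAn2/Bracken outputs.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(8)

# Synthetic stand-in for relative-abundance profiles (samples x taxa), rows summing to 1,
# with labels 1 = colorectal cancer and 0 = healthy control.
n_samples, n_taxa = 300, 500
counts = rng.gamma(shape=0.5, scale=1.0, size=(n_samples, n_taxa))
y = rng.integers(0, 2, n_samples)
counts[y == 1, :20] *= 3.0                     # a few taxa shifted in cases
X = counts / counts.sum(axis=1, keepdims=True) # relative abundances

rf = RandomForestClassifier(n_estimators=500, random_state=8)
print("within-dataset AUC (10-fold CV):",
      cross_val_score(rf, X, y, scoring="roc_auc", cv=10).mean())
```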