On-site programming big data refers to the massive data generated during software development, characterized by real-time generation, complexity, and high processing difficulty. Data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning; it saves storage resources and enhances data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining the key-selection method, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm performs better and achieves higher accuracy.
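As a rough illustration of the sorted-neighborhood idea, the sketch below sorts records by a key and compares each record only with its neighbours inside a sliding window that grows when a match is found at the window edge. It is a minimal sketch, not the algorithm from the paper: the function names, the `difflib` similarity measure, the threshold, and the window-growth rule are all assumptions made for illustration, and the random-forest-based key selection is not reproduced.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """String similarity in [0, 1] (difflib ratio; an illustrative choice)."""
    return SequenceMatcher(None, a, b).ratio()


def snm_duplicates(records, key, threshold=0.85, min_window=2, max_window=6):
    """Return sorted index pairs (i, j) of records judged to be duplicates.

    records: list of strings; key: function mapping a record to its sort key.
    The window starts at min_window and grows by one (up to max_window)
    whenever the record at its far edge still matches the anchor record.
    This growth rule is a hypothetical stand-in for an adaptive window.
    """
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    for pos, i in enumerate(order):
        window = min_window
        offset = 1
        while offset < window and pos + offset < len(order):
            j = order[pos + offset]
            if similarity(records[i], records[j]) >= threshold:
                pairs.add((min(i, j), max(i, j)))
                # A match at the window edge suggests more duplicates may follow.
                if offset == window - 1 and window < max_window:
                    window += 1
            offset += 1
    return sorted(pairs)


records = [
    "john smith, 12 oak street",
    "mary jones, 4 elm road",
    "john smith, 12 oak st",
    "jon smith, 12 oak street",
]
# Sort key: first three characters (a deliberately crude, hypothetical key).
print(snm_duplicates(records, key=lambda r: r[:3]))
```

The quality of the sort key largely determines which duplicates end up close enough to be compared, which is why key selection is one of the three improvements the abstract names.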
Massive Open Online Courses (MOOCs) have become a popular way of learning online, used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on these educational data, many studies have investigated the prediction of MOOC learners' final grades. However, two problems remain in this research field. The first is how to select the most suitable features to improve prediction accuracy; the second is how to use or modify data mining algorithms for a better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure feature importance, and a rule is established for feature selection. Then, a clustering-based Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, the performance of the proposed method is verified on the Canvas Network Person-Course (CNPC) dataset. Four well-known prediction methods are also applied for comparison, and the proposed method is shown to outperform them.
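The oversampling step can be pictured with a plain SMOTE sketch: each synthetic minority sample is placed on the line segment between a minority sample and one of its nearest minority neighbours. This shows ordinary SMOTE only, under the assumption of numeric features; the clustering stage of the method described above is not reproduced, and the function name and parameters are hypothetical.

```python
import random


def smote(minority, n_new, k=2, seed=0):
    """Generate n_new synthetic points from a list of minority-class tuples.

    Each new point is x + u * (neighbour - x), where x is a randomly chosen
    minority sample, neighbour is one of its k nearest minority neighbours,
    and u is uniform in [0, 1). Requires at least two minority samples.
    """
    rng = random.Random(seed)

    def sq_dist(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))

    synthetic = []
    for _ in range(n_new):
        x = rng.choice(minority)
        # k nearest neighbours of x among the other minority samples
        neighbours = sorted(
            (p for p in minority if p is not x),
            key=lambda p: sq_dist(x, p),
        )[:k]
        nb = rng.choice(neighbours)
        u = rng.random()  # interpolation factor
        synthetic.append(tuple(a + u * (b - a) for a, b in zip(x, nb)))
    return synthetic


minority_samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
print(smote(minority_samples, n_new=5))
```

Because every synthetic point is an interpolation between two real minority samples, the new points stay inside the region already occupied by the minority class.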
HIV and AIDS remain a major public health concern and are among the epidemics that the world has resolved to end by 2030, as highlighted in the Sustainable Development Goals (SDGs). A great deal of effort has gone into reducing new HIV infections, but a significant number of new infections are still reported. HIV prevalence is skewed towards key populations, which include female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID). The study design was retrospective and focused on key populations enrolled in a comprehensive HIV and AIDS programme run by the Kenya Red Cross Society from July 2019 to June 2021. Individuals who were lost to follow-up, defaulted (dropped out, transferred out, or relocated), or died were classified as attrition, while those who were active and alive at the end of the study were classified as retention. The study used density analysis to determine the spatial differences in key population attrition across the 19 targeted counties, and used Kilifi County as an example to map attrition cases in smaller administrative areas (sub-county level). The synthetic minority oversampling technique-nominal continuous (SMOTE-NC) was used to balance the datasets, since attrition cases were far fewer than retention cases. A random survival forests model was then fitted to the balanced dataset. The model correctly identified attrition cases using the predicted ensemble mortality, and their survival time using the estimated Kaplan-Meier survival function. The predictive performance of the model was strong and well above random chance, with concordance indices greater than 0.75.
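The concordance index quoted above can be computed in a few lines. The sketch below implements Harrell's C for right-censored data, assuming that a higher predicted risk means an earlier expected event; the function and argument names are illustrative, and the abstract does not state which variant of the index was used.

```python
def concordance_index(times, events, risks):
    """Harrell's C: share of comparable pairs ordered correctly by risk.

    times: observed follow-up times; events: 1 if the event (here, attrition)
    was observed, 0 if censored; risks: predicted risk scores, with higher
    values meaning an earlier expected event. Tied risks count as 0.5.
    Raises ZeroDivisionError if no pair is comparable.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            # A pair is comparable when i had the event strictly before j's time.
            if events[i] == 1 and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


times = [2, 4, 6, 8]
events = [1, 1, 0, 1]
print(concordance_index(times, events, [0.9, 0.7, 0.4, 0.1]))
```

A value of 0.5 corresponds to random ordering and 1.0 to perfect ordering, which is the scale against which "greater than 0.75" should be read.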
As massive underground projects have become common in dense urban areas, a question has arisen: which model best predicts Tunnel Boring Machine (TBM) performance in these tunneling projects? Predicting TBM performance in complex geological conditions remains a great challenge for practitioners and researchers, yet a reliable and accurate prediction is essential for planning a workable tunnel construction schedule. TBM performance is very difficult to estimate because of the many geotechnical and geological factors and machine specifications involved. Previously proposed intelligent techniques in this field are mostly based on a single (base) model with a low level of accuracy. Hence, this study introduces a hybrid random forest (RF) technique optimized by global harmony search with generalized opposition-based learning (GOGHS) for forecasting the TBM advance rate (AR). The main purpose of the GOGHS-RF model is to optimize the RF hyper-parameters, e.g., the number of trees and the maximum tree depth. In the modelling, a comprehensive database containing the parameters most influential on TBM performance, together with the TBM AR, was used as input and output variables, respectively. To examine the capability of the GOGHS-RF model, three further hybrid models (particle swarm optimization-RF, genetic algorithm-RF, and artificial bee colony-RF) were also constructed to forecast TBM AR. The developed models were evaluated with several performance indices, including the coefficient of determination (R²), root-mean-square error (RMSE), and mean absolute percentage error (MAPE). The results showed that GOGHS-RF is a more accurate technique for estimating TBM AR than the other applied models. The newly developed GOGHS-RF model achieved R² = 0.9937 and 0.9844 for the training and test stages, respectively, which are higher than those of a pre-developed RF. In addition, the importance of the input parameters was interpreted through the SHapley Additive exPlanations (SHAP) method, which found thrust force per cutter to be the most important variable for TBM AR. The GOGHS-RF model can be used in mechanized tunnel projects for predicting and checking performance.
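For readers unfamiliar with the three performance indices named above, the sketch below computes them from their standard definitions. The sample values are made up for illustration and have no connection to the TBM database; MAPE is expressed in percent and assumes no true value is zero.

```python
import math


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def rmse(y_true, y_pred):
    """Root-mean-square error."""
    mse = sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)
    return math.sqrt(mse)


def mape(y_true, y_pred):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return 100.0 * total / len(y_true)


# Illustrative values only (not from the study).
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.0, 2.0, 3.0, 5.0]
print(r2_score(observed, predicted), rmse(observed, predicted), mape(observed, predicted))
```

R² rewards explained variance, while RMSE and MAPE measure absolute and relative error, so reporting all three gives a more complete picture than any single index.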
Precise and timely prediction of crop yields is crucial for food security and the development of agricultural policies. However, crop yield is influenced by multiple factors within complex growth environments. Previous research has paid relatively little attention to the effects of environmental factors and drought on the growth of winter wheat. There is therefore an urgent need for more effective methods to explore the relationship between these factors and crop yield. This study used four types of indicators (meteorological, crop growth status, environmental, and drought index) from October 2003 to June 2019 in Henan Province as the basic data for predicting winter wheat yield. The accuracy of winter wheat yield estimation was calculated using the sparrow search algorithm combined with random forest (SSA-RF) under different input indicators. The estimation accuracy of SSA-RF was compared with that of partial least squares regression (PLSR), extreme gradient boosting (XGBoost), and random forest (RF) models. Finally, the optimal yield estimation method was used to predict winter wheat yield in three typical years. The findings are as follows: 1) SSA-RF outperforms the other algorithms in estimating winter wheat yield. The best yield estimate is obtained by combining all four types of indicators with SSA-RF (R² = 0.805, RRMSE = 9.9%). 2) Crop growth status and environmental indicators play significant roles in wheat yield estimation, accounting for 46% and 22% of the yield importance among all indicators, respectively. 3) Selecting indicators from October to April of the following year yielded the highest accuracy in winter wheat yield estimation, with an R² of 0.826 and an RMSE of 9.0%. Yield estimates can therefore be completed two months before the winter wheat harvest in June. 4) Prediction performance is slightly affected by severe drought. Compared with the severe drought year (2011, R² = 0.680) and the normal year (2017, R² = 0.790), the SSA-RF model has higher prediction accuracy for the wet year (2018, R² = 0.820). This study could provide an innovative approach for remote sensing estimation of winter wheat yield.
Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases from metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approach for predicting disease status from metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes, and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status from the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that RF prediction performance using the relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method that integrates the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
To address the problems of wind power abandonment and the stoppage of electricity transmission caused by a short circuit in a power line of a doubly-fed induction generator (DFIG) based wind farm, this paper proposes an intelligent location method for single-phase grounding faults based on a multiple random forests (multi-RF) algorithm. First, the simulation model is built, and the fundamental amplitudes of the zero-sequence currents are extracted by a fast Fourier transform (FFT) to construct the feature set. Then, the random forest classification algorithm is applied to establish the fault section locator. The training set is resampled with the bootstrap method to generate multiple sample subsets, which are used to build multiple classification and regression tree (CART) classifiers. The CART classifiers use the mean decrease in node impurity as the feature importance, which is used to mine the relationship between features and fault sections. Subsequently, the fault section is identified by voting on the test results of the individual classifiers. Finally, a multi-RF regression fault locator is built to output the predicted fault distance. Experimental results with PSCAD/EMTDC software show that the proposed method overcomes the shortcomings of a single RF and is well suited to locating faults on a short hybrid overhead/cable line with multiple branches. Compared with support vector machines (SVMs) and previously reported methods, the proposed method better meets the location accuracy and efficiency requirements of a DFIG-based wind farm.
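The bootstrap-and-vote mechanism behind the fault section locator can be sketched as follows. This is a toy illustration under stated assumptions: a one-dimensional 1-nearest-neighbour rule stands in for the CART classifiers, the feature values and section labels are invented, and every function name is hypothetical.

```python
import random
from collections import Counter


def bootstrap_sample(data, rng):
    """Draw len(data) rows from data with replacement."""
    return [rng.choice(data) for _ in data]


def train_1nn(sample):
    """Toy base learner (a stand-in for CART): 1-nearest neighbour on one feature."""
    def predict(x):
        return min(sample, key=lambda row: abs(row[0] - x))[1]
    return predict


def majority_vote(predictions):
    """Return the most frequent label among the base learners' predictions."""
    return Counter(predictions).most_common(1)[0][0]


def bagged_predict(data, x, n_models=15, seed=0):
    """Train n_models base learners on bootstrap samples and vote on x."""
    rng = random.Random(seed)
    votes = [train_1nn(bootstrap_sample(data, rng))(x) for _ in range(n_models)]
    return majority_vote(votes)


# Invented (feature, fault section) pairs for illustration only.
training_rows = [(0.1, "section-1"), (0.2, "section-1"), (0.3, "section-1"),
                 (0.9, "section-2"), (1.0, "section-2"), (1.1, "section-2")]
print(bagged_predict(training_rows, 0.15))
```

Because each base learner sees a different resampled subset, individual errors tend to be uncorrelated, and the vote is more stable than any single learner.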
Purpose: Ensemble methods have been widely used in pattern recognition because it is difficult to find a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses and, in most cases, contain redundant classifiers. Several state-of-the-art works attempt to reduce the set of hypotheses without affecting performance. Design/methodology/approach: In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes, and between each classifier and the rest of the set. The authors used the random forest algorithm as a tree-based ensemble classifier, and pruning was performed with a technique inspired by the CFS (correlation feature selection) algorithm. Findings: The proposed method, CES (correlation-based ensemble selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared with that of six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in less time while improving classification rates compared with the state-of-the-art methods. Originality/value: CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms the whole forest and the other state-of-the-art techniques used in this study.
Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. Operation data are an important basis for state monitoring, optimal control, and fault diagnosis. However, the operating environment of acid production with flue gas is complex and involves a large amount of equipment. The data obtained by the detection equipment are seriously polluted and prone to anomalies such as data loss and outliers. Therefore, to solve the problem of abnormal data in the process of acid production with flue gas, a data cleaning method based on an improved random forest is proposed. First, an outlier recognition model based on isolation forest is designed to identify and eliminate the outliers in the dataset. Second, an improved random forest regression model is established, with a genetic algorithm used to optimize its hyperparameters. The optimal parameter combination is then found in the search space and the trend of the data is predicted. Finally, the improved random forest data cleaning method is used to fill in the data missing after abnormal data are eliminated, completing the data cleaning. Results show that the proposed method can accurately eliminate and compensate for abnormal data in the process of acid production with flue gas, and that it improves the accuracy of compensation for missing data. With the cleaned data, a more accurate model can be established, which is significant for subsequent temperature control. The conversion rate of SO₂ can be further improved, thereby improving the yield of sulfuric acid and the economic benefits.
A huge number of old arch bridges in rural regions are reaching their peak maintenance period. Health monitoring technology for long-span bridges is hardly applicable to small-span bridges, owing to the lack of technical resources and sufficient funds in rural regions. There is an urgent need for an economical, fast, and accurate damage identification solution. The authors propose a damage identification system for old arch bridges implemented with a machine learning algorithm, which takes the vehicle-induced response as the excitation. A damage index was defined based on wavelet packet theory, and a machine learning sample database of the denoised responses was constructed. A comparison of three machine learning algorithms, namely Back-Propagation Neural Network (BPNN), Support Vector Machine (SVM), and Random Forest (R.F.), found that the R.F. damage identification model had the best recognition ability. Finally, the Particle Swarm Optimization (PSO) algorithm was used to optimize the number of subtrees and split features of the R.F. model. The PSO-optimized R.F. model was capable of identifying different damage levels of old arch bridges with a sensitive damage index. The proposed framework is practical and promising for structural damage identification of old bridges in rural regions.
Stock trend forecasting is a complex and much-studied problem in finance that uses a large amount of data and many related indicators; it is therefore difficult to obtain sustainable and effective results by relying on empirical analysis alone. Researchers in machine learning have shown that random forest can form better judgements on this kind of problem and can play an auxiliary role in predicting stock trends. This study uses historical trading data of four listed companies in the US stock market, with the aim of improving the performance of the random forest model in medium- and long-term stock trend prediction. The study applies the exponential smoothing method to process the initial data, calculates the relevant technical indicators as candidate features, and proposes the D-RF-RS method to optimize random forest. Because random forest is an ensemble learning model closely related to the decision tree, the D-RF-RS method uses a decision tree to screen features by importance and obtains an effective set of strong features as the model input. The parameter combination of the model is then optimized through random parameter search. The experimental results show that the average accuracy of random forest increases by 0.17 after this optimization, and is 0.18 higher than the average accuracy of the light gradient boosting machine model. Together with the ROC and Precision–Recall curves, the results indicate that the model is also stable, which further demonstrates the advantages of random forest in medium- and long-term trend prediction of the stock market.
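The preprocessing step mentioned above can be illustrated with simple exponential smoothing, in which each smoothed value is a weighted average of the current observation and the previous smoothed value. This is a minimal sketch of the standard formula; the smoothing factor and the price series are invented, and the abstract does not specify which parameters the study used.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing.

    s[0] = x[0];  s[t] = alpha * x[t] + (1 - alpha) * s[t - 1].
    A larger alpha follows the raw series more closely.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Invented closing prices for illustration only.
closing_prices = [10.0, 20.0, 30.0]
print(exponential_smoothing(closing_prices, alpha=0.5))
```

Smoothing damps day-to-day noise before technical indicators are computed, which is a plausible reason for applying it ahead of feature selection.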
The power transformer is one of the most crucial devices in the power grid, and it is important to identify its incipient faults quickly and accurately. Input features play a critical role in fault diagnosis accuracy. To further improve the fault diagnosis performance for power transformers, this study presents a random forest feature selection method coupled with an optimized kernel extreme learning machine. First, the random forest feature selection approach is adopted to rank 42 related input features derived from gas concentration, gas ratio, and energy-weighted dissolved gas analysis. Afterwards, a kernel extreme learning machine tuned by the Aquila optimization algorithm is used to adjust crucial parameters and select the optimal feature subsets. Diagnosis accuracy is used to assess the fault diagnosis capability of the candidate feature subsets. Finally, the optimal feature subsets are used to establish the fault diagnosis model. Experimental results on two public datasets, and a comparison with 5 conventional approaches, show that the average accuracy of the proposed method is up to 94.5%, which is superior to that of the conventional approaches. The fault diagnosis results verify that the optimal feature subset obtained by the presented method can dramatically improve power transformer fault diagnosis accuracy.
Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, although these three predictors cannot be interpreted to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical and statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.
Objective: Body fluid mixtures are complex biological samples that frequently occur at crime scenes and can provide important clues for criminal case analysis. DNA methylation assays have been applied to the identification of human body fluids and have shown excellent performance in predicting single-source body fluids. The present study aims to develop a methylation SNaPshot multiplex system for body fluid identification and to predict mixture samples accurately. In addition, the value of DNA methylation in the prediction of body fluid mixtures was further explored. Methods: In the present study, 420 samples of body fluid mixtures and 250 samples of single body fluids were tested using an optimized multiplex methylation system. Each kind of body fluid sample presented a specific methylation profile across the 10 markers. Results: Significant differences in methylation levels were observed between the mixtures and single body fluids. For all kinds of mixtures, Spearman's correlation analysis revealed a significantly strong correlation between the methylation levels and the component proportions (1:20, 1:10, 1:5, 1:1, 5:1, 10:1, and 20:1). Two random forest classification models were trained, one for predicting mixture types and one for predicting the mixture proportion of 2 components, based on the methylation levels of the 10 markers. For mixture prediction, Model-1 achieved a prediction accuracy of 99.3% in 427 training samples and 100% in 243 independent test samples. For mixture proportion prediction, Model-2 achieved an accuracy of 98.8% in 252 training samples and 98.2% in 168 independent test samples. The total prediction accuracy reached 99.3% for body fluid mixtures and 98.6% for mixture proportions. Conclusion: These results indicate the excellent capability and value of the multiplex methylation system in the identification of forensic body fluid mixtures.
Layered pavements usually exhibit complicated mechanical behavior owing to complex material properties and the external environment. In some cases, such as the launching of missiles or rockets, layered pavements are required to bear large impulse loads. However, traditional methods cannot detect the internal structure of pavements non-destructively and quickly. Accurate and fast prediction of the mechanical properties of layered pavements is therefore of great importance. In recent years, machine learning has shown great strength in solving nonlinear problems. In this work, we present a method for predicting the maximum deflection and damage factor of layered pavements under instantaneous large impact, based on random forest regression with the deflection basin parameters obtained from falling weight deflection testing. The coefficients of determination (R²) on the testing datasets are above 0.94 in predicting the elastic moduli of the structural layers and the mechanical responses, which indicates that the predictions are highly consistent with the finite element simulation results. This paper provides a novel method for fast and accurate prediction of pavement mechanical responses under instantaneous large impact load using partial structural parameters of pavements, and has application potential in the non-destructive evaluation of pavement structure.
A 60 GHz millimeter wave (mmWave) system provides extremely high time resolution and multipath component (MPC) separation, and has great potential to achieve high precision in indoor positioning. However, the ranging data are often contaminated by non-line-of-sight (NLOS) transmission. First, six features of the 60 GHz mmWave signal under line-of-sight (LOS) and NLOS conditions are evaluated. Next, a classifier built with the random forest (RF) algorithm is used to identify LOS and NLOS channels. The identification mechanism has excellent generalization performance, and the classification accuracy is over 97%. Finally, based on the identification results, a residual weighted least squares positioning method is proposed. All ranging information, including that obtained under NLOS channels, is fully utilized, so positioning failure caused by insufficient LOS links can be avoided. Compared with the conventional least squares approach, the positioning error of the proposed algorithm is reduced by 49%.
Information on the decay process of nuclides in the superheavy region is critical for investigating new elements beyond oganesson and the island of stability. This paper presents the application of a random forest algorithm to examine the competition among different decay modes in the superheavy region, including α decay, β⁻ decay, β⁺ decay, electron capture, and spontaneous fission. The observed half-lives and dominant decay modes are well reproduced. The dominant decay mode of 96.9% of the nuclei beyond ²¹²Po is correctly obtained. Further, α decay is predicted to be the dominant decay mode for isotopes of the new elements Z = 119-122, except for spontaneous fission in certain even–even elements owing to the increased Coulomb repulsion and odd–even effect. The predicted half-lives demonstrate the existence of a long-lived spontaneous fission island southwest of ²⁹⁸Fl caused by the competition between the fission barrier and Coulomb repulsion. A better understanding of spontaneous fission, particularly beyond ²⁸⁶Fl, is crucial in the search for new elements and the island of stability.
Coronary artery disease (CAD) is one of the most serious cardiovascular afflictions because it is an exceptionally severe heart condition. Coronary heart disease is one of the principal causes of death all over the world. Cardiovascular deterioration is a challenge, especially in youthful and rural countries where trained professionals are lacking. Since heart diseases occur without apparent signs, high-level detection is desirable. This paper proposes a robust random forest model tuned with the randomized grid search technique to predict CAD. The proposed framework improves CAD prediction by identifying risk indicators and learning the complex interactions between them. Nowadays, the healthcare industry has a lot of data but needs to gain more knowledge from it. The proposed framework extracts knowledge from data stores and uses that knowledge to help doctors diagnose heart disease (HD) accurately and effectively. We evaluated the proposed framework on two public databases, the Cleveland and Framingham datasets. First, the datasets were preprocessed using a cleaning technique, a normalization technique, and an outlier detection technique. Second, the principal component analysis (PCA) algorithm was used to reduce the feature dimensionality of the two datasets. Finally, we used a hyperparameter tuning technique, randomized grid search, to tune a random forest (RF) machine learning (ML) model. The randomized grid search selected the best parameters and produced the best CAD analysis. The proposed framework was evaluated and compared with traditional classifiers. Its accuracy, sensitivity, precision, specificity, and F1-score were 100%. The evaluation showed that, with tuning, the proposed framework outperformed other recent frameworks.
Multifaceted asymmetric radiation from the edge (MARFE) movement, which can cause density limit disruption, is often encountered during high-density operation on many tokamaks. Identifying and predicting MARFE movement is therefore useful for mitigating or avoiding density limit disruption in steady-state high-density plasma operation. A machine learning method, random forest (RF), has been used to predict MARFE movement based on the density ramp-up experiment in the first 2022 campaign of the Experimental Advanced Superconducting Tokamak (EAST). The RF model shows that, besides the Greenwald fraction (the ratio of the plasma density to the Greenwald density limit), dβp/dt, H98, and dWmhd/dt are relatively important parameters for MARFE-movement prediction. When the RF model is applied to test discharges, the successful alarm rate for MARFE movement causing density limit disruption reaches ~85%, with a minimum alarm time of ~40 ms and a mean alarm time of ~700 ms. At the same time, the false alarm rate for non-disruptive and non-density-limit disruptive discharges can be kept below 5%. These results provide a reference for the prediction of MARFE movement in high-density plasmas, which can help to avoid or mitigate density limit disruption in future fusion reactors.
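The Greenwald fraction named above is easy to compute once the Greenwald limit is known. The sketch below uses the commonly quoted empirical form n_G = I_p / (π a²), with n_G in 10²⁰ m⁻³, plasma current I_p in MA, and minor radius a in m. That formula is not stated in the abstract and is included here as background; the input values are invented and are not EAST measurements.

```python
import math


def greenwald_limit(plasma_current_ma, minor_radius_m):
    """Greenwald density limit in units of 1e20 m^-3.

    n_G = I_p / (pi * a^2), with I_p in MA and a in metres.
    """
    return plasma_current_ma / (math.pi * minor_radius_m ** 2)


def greenwald_fraction(density_1e20, plasma_current_ma, minor_radius_m):
    """Ratio of the plasma density (in 1e20 m^-3) to the Greenwald limit."""
    return density_1e20 / greenwald_limit(plasma_current_ma, minor_radius_m)


# Invented values for illustration only.
print(greenwald_fraction(density_1e20=0.6, plasma_current_ma=0.5, minor_radius_m=0.45))
```

A fraction approaching 1 indicates operation close to the density limit, which is the regime in which the abstract reports MARFE movement becoming a disruption risk.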
Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China under Grant No. 61971032, and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: On-site programming big data refers to the massive data generated in the process of software development, characterized by real-time generation, complexity, and high processing difficulty. Data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning, which can save storage resources and enhance data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic and adaptive window size is proposed. The efficiency of the algorithm is improved by refining the key-selection method, reducing the dimension of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm exhibits better performance and achieves higher accuracy.
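The basic Sorted Neighborhood Method that the abstract above builds on can be sketched as follows. This is a minimal illustration with a fixed window, not the paper's adaptive-window, random-forest-based variant; the function name, the `key` and `is_duplicate` callables, and the sample records are all hypothetical.

```python
def sorted_neighborhood(records, key, is_duplicate, window=3):
    """Return sorted index pairs (i, j) judged duplicates by basic SNM.

    Assumption: fixed window size; the adaptive window from the abstract
    is NOT implemented here.
    """
    # Step 1: sort record indices by a sorting key.
    order = sorted(range(len(records)), key=lambda i: key(records[i]))
    pairs = set()
    # Step 2: slide a window over the sorted order and compare only
    # records that fall inside the same window.
    for pos, i in enumerate(order):
        for j in order[pos + 1 : pos + window]:
            if is_duplicate(records[i], records[j]):
                pairs.add((min(i, j), max(i, j)))
    return sorted(pairs)


# Hypothetical sample data; exact string equality stands in for a real matcher.
records = ["john smith", "jon smith", "mary jones", "john smith", "mary jonas"]
dupes = sorted_neighborhood(
    records, key=lambda r: r, is_duplicate=lambda a, b: a == b
)
print(dupes)
```

Sorting first means that only nearby records are compared, so the number of comparisons grows with the window size rather than with the square of the number of records.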
Funding: Supported by the National Natural Science Foundation of China under Grant No. 61801222, in part by the Fundamental Research Funds for the Central Universities under Grant No. 30919011230, and in part by the Jiangsu Provincial Department of Education Degree and Graduate Education Research Fund under Grant No. JGZD18_012.
Abstract: The Massive Open Online Course (MOOC) has become a popular form of online learning used by millions of people across the world. Meanwhile, a vast amount of information has been collected from MOOC learners and institutions. Based on this educational data, much research has been conducted on predicting MOOC learners' final grades. However, two problems remain in this research field. The first is how to select the most suitable features to improve prediction accuracy, and the second is how to use or modify data mining algorithms for a better analysis of MOOC data. To solve these two problems, an improved random forests method is proposed in this paper. First, a hybrid indicator is defined to measure the importance of the features, and a rule is established for feature selection; then, a Clustering-Synthetic Minority Over-sampling Technique (SMOTE) is embedded into the traditional random forests algorithm to solve the class imbalance problem. In the experiments, we verify the performance of the proposed method using the Canvas Network Person-Course (CNPC) dataset. Furthermore, four well-known prediction methods are applied for comparison, and the results demonstrate the superiority of our method.
Abstract: HIV and AIDS have continued to be a major public health concern, and hence one of the epidemics that the world resolved to end by 2030, as highlighted in the Sustainable Development Goals (SDGs). A colossal amount of effort has gone into reducing new HIV infections, but a significant number of new infections are still reported. HIV prevalence is skewed towards the key population, which includes female sex workers (FSW), men who have sex with men (MSM), and people who inject drugs (PWID). The study design was retrospective and focused on the key population enrolled in a comprehensive HIV and AIDS programme run by the Kenya Red Cross Society from July 2019 to June 2021. Individuals who were lost to follow-up, defaulted (dropped out, transferred out, or relocated), or died were classified as attrition, while those who were active and alive at the end of the study were classified as retention. The study used density analysis to determine the spatial differences in key population attrition across the 19 targeted counties, and used Kilifi county as an example to map attrition cases in smaller administrative areas (sub-county level). The study used the synthetic minority oversampling technique-nominal continuous (SMOTE-NC) to balance the datasets, since attrition cases were far fewer than retention cases. The random survival forests model was then fitted to the balanced dataset. The model correctly identified attrition cases using the predicted ensemble mortality, and their survival time using the estimated Kaplan-Meier survival function. The predictive performance of the model was strong and well above random chance, with concordance indices greater than 0.75.
Funding: The National Natural Science Foundation of China (Grant 42177164) and the Distinguished Youth Science Foundation of Hunan Province of China (2022JJ10073).
Abstract: As massive underground projects have become popular in dense urban cities, a problem has arisen: which model best predicts Tunnel Boring Machine (TBM) performance in these tunneling projects? The performance level of TBMs in complex geological conditions is still a great challenge for practitioners and researchers. On the other hand, a reliable and accurate prediction of TBM performance is essential to planning an applicable tunnel construction schedule. TBM performance is very difficult to estimate due to various geotechnical and geological factors and machine specifications. The previously proposed intelligent techniques in this field are mostly based on a single or base model with a low level of accuracy. Hence, this study aims to introduce a hybrid random forest (RF) technique optimized by global harmony search with generalized opposition-based learning (GOGHS) for forecasting TBM advance rate (AR). The main objective of the GOGHS-RF model is to optimize the RF hyper-parameters, e.g., the number of trees and the maximum tree depth. In the modelling, a comprehensive database containing the most influential parameters on TBM performance, together with TBM AR, was used as input and output variables, respectively. To examine the capability and power of the GOGHS-RF model, three more hybrid models, particle swarm optimization-RF, genetic algorithm-RF, and artificial bee colony-RF, were also constructed to forecast TBM AR. The developed models were evaluated by calculating several performance indices, including the determination coefficient (R^2), root-mean-square error (RMSE), and mean absolute percentage error (MAPE). The results showed that GOGHS-RF is a more accurate technique for estimating TBM AR than the other applied models. The newly developed GOGHS-RF model achieved R^2 = 0.9937 and 0.9844 for the train and test stages, respectively, which are higher than those of a pre-developed RF. Also, the importance of the input parameters was interpreted through the SHapley Additive exPlanations (SHAP) method, and it was found that thrust force per cutter is the most important variable for TBM AR. The GOGHS-RF model can be used in mechanized tunnel projects for predicting and checking performance.
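The three performance indices named in the abstract above (R^2, RMSE, MAPE) can be computed as in the following minimal sketch. The standard textbook definitions are assumed here; the function name and the sample values are hypothetical and are not taken from the paper.

```python
import math


def regression_metrics(y_true, y_pred):
    """Return (R^2, RMSE, MAPE in percent) using the standard definitions.

    Assumes y_true contains no zeros (MAPE is undefined otherwise) and is
    not constant (R^2 is undefined otherwise).
    """
    n = len(y_true)
    mean_true = sum(y_true) / n
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot
    rmse = math.sqrt(ss_res / n)
    mape = 100.0 / n * sum(abs((t - p) / t) for t, p in zip(y_true, y_pred))
    return r2, rmse, mape


# Hypothetical measured vs. predicted values.
r2, rmse, mape = regression_metrics([2.0, 4.0], [1.0, 5.0])
print(r2, rmse, mape)
```

A higher R^2 and lower RMSE and MAPE indicate a better fit, which is how the compared models would be ranked.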
Funding: Under the auspices of the National Natural Science Foundation of China (No. 52079103).
Abstract: Precise and timely prediction of crop yields is crucial for food security and the development of agricultural policies. However, crop yield is influenced by multiple factors within complex growth environments. Previous research has paid relatively little attention to the interference of environmental factors and drought on the growth of winter wheat. Therefore, there is an urgent need for more effective methods to explore the inherent relationship between these factors and crop yield. This study used four types of indicators (meteorological, crop growth status, environmental, and drought index) from October 2003 to June 2019 in Henan Province as the basic data for predicting winter wheat yield. Using the sparrow search algorithm combined with random forest (SSA-RF) under different input indicators, the accuracy of winter wheat yield estimation was calculated. The estimation accuracy of SSA-RF was compared with that of partial least squares regression (PLSR), extreme gradient boosting (XGBoost), and random forest (RF) models. Finally, the optimal yield estimation method was used to predict winter wheat yield in three typical years. The findings are as follows: 1) SSA-RF demonstrates superior performance in estimating winter wheat yield compared to the other algorithms. The best yield estimation is achieved by combining all four types of indicators with SSA-RF (R^2 = 0.805, RRMSE = 9.9%). 2) Crop growth status and environmental indicators play significant roles in wheat yield estimation, accounting for 46% and 22% of the yield importance among all indicators, respectively. 3) Selecting indicators from October to April of the following year yielded the highest accuracy in winter wheat yield estimation, with an R^2 of 0.826 and an RMSE of 9.0%. Yield estimates can be completed two months before the winter wheat harvest in June. 4) The prediction performance is slightly affected by severe drought. Compared with a severe drought year (2011) (R^2 = 0.680) and a normal year (2017) (R^2 = 0.790), the SSA-RF model has higher prediction accuracy for a wet year (2018) (R^2 = 0.820). This study could provide an innovative approach for remote sensing estimation of winter wheat yield.
Abstract: Dysfunction of microbial communities in various human body sites has been shown to be associated with a variety of diseases, raising the possibility of predicting diseases based on metagenomic samples. Although many studies have investigated this problem, there is no consensus on the optimal approaches for predicting disease status based on metagenomic samples. Using six human gut metagenomic datasets consisting of large numbers of colorectal cancer patients and healthy controls from different countries, we investigated different software packages for extracting relative abundances of known microbial genomes and for integrating mapping and assembly approaches to obtain the relative abundance profiles of both known and novel genomes. The random forests (RF) classification algorithm was then used to predict colorectal cancer status based on the microbial relative abundance profiles. Based on within-dataset cross-validation and cross-dataset prediction, we show that the RF prediction performance using the microbial relative abundance profiles estimated by Centrifuge is generally higher than that using the profiles estimated by MetaPhlAn2 and Bracken. We also develop a novel method to integrate the relative abundance profiles of both known and novel microbial organisms to further increase the prediction performance for colorectal cancer from metagenomes.
Funding: Supported in part by the National Natural Science Foundation of China (No. 51677072).
Abstract: To address the problems of wind power abandonment and the stoppage of electricity transmission caused by a short circuit in a power line of a doubly-fed induction generator (DFIG) based wind farm, this paper proposes an intelligent location method for a single-phase grounding fault based on a multiple random forests (multi-RF) algorithm. First, the simulation model is built, and the fundamental amplitudes of the zero-sequence currents are extracted by a fast Fourier transform (FFT) to construct the feature set. Then, the random forest classification algorithm is applied to establish the fault section locator. The model is resampled on the basis of the bootstrap method to generate multiple sample subsets, which are used to establish multiple classification and regression tree (CART) classifiers. The CART classifiers use the mean decrease in node impurity as the feature importance, which is used to mine the relationship between features and fault sections. Subsequently, a fault section is identified by voting on the test results of each classifier. Finally, a multi-RF regression fault locator is built to output the predicted fault distance. Experimental results with PSCAD/EMTDC software show that the proposed method can overcome the shortcomings of a single RF and has the advantage of locating faults on a short hybrid overhead/cable line with multiple branches. Compared with support vector machines (SVMs) and previously reported methods, the proposed method better meets the location accuracy and efficiency requirements of a DFIG-based wind farm.
Funding: The authors would like to thank the Directorate-General of Scientific Research and Technological Development (Direction Generale de la Recherche Scientifique et du Developpement Technologique, DGRSDT, URL: www.dgrsdt.dz, Algeria) for the financial assistance towards this research.
Abstract: Purpose: Ensemble methods have been widely used in the field of pattern recognition due to the difficulty of finding a single classifier that performs well on a wide variety of problems. Despite the effectiveness of these techniques, studies have shown that ensemble methods generate a large number of hypotheses, which in most cases contain redundant classifiers. Several state-of-the-art works attempt to reduce the set of hypotheses without affecting performance. Design/methodology/approach: In this work, the authors propose a pruning method that takes into consideration the correlation between classifiers and classes, and between each classifier and the rest of the set. The authors used the random forest algorithm as a tree-based ensemble classifier, and the pruning was performed by a technique inspired by the CFS (correlation feature selection) algorithm. Findings: The proposed method, CES (correlation-based Ensemble Selection), was evaluated on ten datasets from the UCI machine learning repository, and its performance was compared to six ensemble pruning techniques. The results showed that the proposed pruning method selects a small ensemble in a smaller amount of time while improving classification rates compared to the state-of-the-art methods. Originality/value: CES is a new ordering-based method that uses the CFS algorithm. CES selects, in a short time, a small sub-ensemble that outperforms the results obtained from the whole forest and the other state-of-the-art techniques used in this study.
Funding: Supported by the National Natural Science Foundation of China (61873006) and the Beijing Natural Science Foundation (4204087, 4212040).
Abstract: Acid production with flue gas is a complex nonlinear process with multiple variables and strong coupling. The operation data is an important basis for state monitoring, optimal control, and fault diagnosis. However, the operating environment of acid production with flue gas is complex and involves a large amount of equipment. The data obtained by the detection equipment is seriously polluted and prone to abnormal phenomena such as data loss and outliers. Therefore, to solve the problem of abnormal data in the process of acid production with flue gas, a data cleaning method based on an improved random forest is proposed. Firstly, an outlier recognition model based on isolation forest is designed to identify and eliminate the outliers in the dataset. Secondly, an improved random forest regression model is established, in which a genetic algorithm is used to optimize the hyperparameters of the random forest regression model. The optimal parameter combination is then found in the search space and the trend of the data is predicted. Finally, the improved random forest data cleaning method is used to compensate for the missing data after eliminating abnormal data, completing the data cleaning. Results show that the proposed method can accurately eliminate and compensate for the abnormal data in the process of acid production with flue gas. The method improves the accuracy of compensation for missing data. With the cleaned data, a more accurate model can be established, which is significant for subsequent temperature control. The conversion rate of SO2 can be further improved, thereby improving the yield of sulfuric acid and the economic benefits.
Funding: Supported by the Elite Scholar Program of Northwest A&F University (Grant No. Z111022001), the Research Fund of the Department of Transport of Shaanxi Province (Grant No. 22-23K), and the Student Innovation and Entrepreneurship Training Program of China (Project Nos. S202110712555 and S202110712534).
Abstract: A huge number of old arch bridges located in rural regions are at the peak of their maintenance needs. The health monitoring technology of long-span bridges is hardly applicable to small-span bridges, owing to the absence of technical resources and sufficient funds in rural regions. There is an urgent need for an economical, fast, and accurate damage identification solution. The authors proposed a damage identification system for an old arch bridge implemented with a machine learning algorithm, which took the vehicle-induced response as the excitation. A damage index was defined based on wavelet packet theory, and a machine learning sample database collecting the denoised response was constructed. A comparison of three machine learning algorithms, Back-Propagation Neural Network (BPNN), Support Vector Machine (SVM), and Random Forest (R.F.), found that the R.F. damage identification model had the better recognition ability. Finally, the Particle Swarm Optimization (PSO) algorithm was used to optimize the number of subtrees and split features of the R.F. model. The PSO-optimized R.F. model was capable of identifying different damage levels of old arch bridges with a sensitive damage index. The proposed framework is practical and promising for structural damage identification of old bridges in rural regions.
Funding: National Natural Science Foundation of China, Grant/Award Number: 61673084; The Fundamental Research Foundation for Universities of Heilongjiang Province, Grant/Award Number: LGYC2018JC017.
Abstract: As a complex and much-studied problem in the financial field, stock trend forecasting uses a large amount of data and many related indicators; hence it is difficult to obtain sustainable and effective results by relying on empirical analysis alone. Researchers in the field of machine learning have shown that random forest can form better judgements on this kind of problem, and that it has an auxiliary role in the prediction of stock trends. This study uses historical trading data of four listed companies in the USA stock market, and its purpose is to improve the performance of the random forest model in medium- and long-term stock trend prediction. This study applies the exponential smoothing method to process the initial data, calculates the relevant technical indicators as candidate features, and proposes the D-RF-RS method to optimize random forest. As random forest is an ensemble learning model closely related to the decision tree, the D-RF-RS method uses a decision tree to screen the importance of features and obtains an effective strong feature set as the model input. Then, the parameter combination of the model is optimized through random parameter search. The experimental results show that the average accuracy of random forest is increased by 0.17 after the above optimization, and is 0.18 higher than the average accuracy of the light gradient boosting machine model. Judging by the ROC curve and the Precision-Recall curve, the stability of the model is also guaranteed, which further demonstrates the advantages of random forest in medium- and long-term trend prediction of the stock market.
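The exponential smoothing step mentioned in the abstract above can be illustrated with the following minimal sketch. It assumes simple (single) exponential smoothing, s_t = alpha * x_t + (1 - alpha) * s_{t-1}; the abstract does not state which variant or smoothing factor the study used, and the function name and the price series are hypothetical.

```python
def exponential_smoothing(series, alpha):
    """Simple exponential smoothing; the first value seeds the recursion.

    Assumption: single exponential smoothing with a fixed factor alpha.
    """
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must be in (0, 1]")
    smoothed = [series[0]]
    for x in series[1:]:
        # Blend the new observation with the previous smoothed value.
        smoothed.append(alpha * x + (1.0 - alpha) * smoothed[-1])
    return smoothed


# Hypothetical closing prices.
print(exponential_smoothing([10.0, 20.0, 30.0], alpha=0.5))
```

A smaller `alpha` weights past values more heavily and removes more short-term noise before technical indicators are computed.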
Funding: Support of the National Natural Science Foundation of China (No. 52067021), the Natural Science Foundation of Xinjiang (2022D01C35), the Excellent Youth Scientific and Technological Talents Plan of Xinjiang (No. 2019Q012), and the Major Science and Technology Special Project of Xinjiang Uygur Autonomous Region (2022A01002-2).
Abstract: The power transformer is one of the most crucial devices in the power grid. It is important to identify incipient faults of power transformers quickly and accurately. Input features play critical roles in fault diagnosis accuracy. In order to further improve the fault diagnosis performance of power transformers, a random forest feature selection method coupled with an optimized kernel extreme learning machine is presented in this study. Firstly, the random forest feature selection approach is adopted to rank 42 related input features derived from gas concentration, gas ratio, and energy-weighted dissolved gas analysis. Afterwards, a kernel extreme learning machine tuned by the Aquila optimization algorithm is implemented to adjust crucial parameters and select the optimal feature subsets. The diagnosis accuracy is used to assess the fault diagnosis capability of the candidate feature subsets. Finally, the optimal feature subsets are applied to establish the fault diagnosis model. According to the experimental results on two public datasets and a comparison with 5 conventional approaches, the average accuracy of the proposed method is up to 94.5%, which is superior to that of the other conventional approaches. The fault diagnosis performance verifies that the optimum feature subset obtained by the presented method can dramatically improve power transformer fault diagnosis accuracy.
Abstract: Many of the best predictors for complex problems are typically regarded as hard to interpret physically. These include kernel methods, Shtarkov solutions, and random forests. We show that, despite the inability to interpret these three predictors to infinite precision, they can be asymptotically approximated and admit conceptual interpretations in terms of their mathematical/statistical properties. The resulting expressions can be in terms of polynomials, basis elements, or other functions that an analyst may regard as interpretable.
Funding: Supported by grants from the Natural Science Foundation of Hubei Province (No. 2020CFB780) and the Fundamental Research Funds for the Central Universities (No. 2017KFYXJJ020).
Abstract: Objective: Body fluid mixtures are complex biological samples that frequently occur at crime scenes and can provide important clues for criminal case analysis. DNA methylation assays have been applied in the identification of human body fluids and have exhibited excellent performance in predicting single-source body fluids. The present study aims to develop a methylation SNaPshot multiplex system for body fluid identification and to accurately predict mixture samples. In addition, the value of DNA methylation in the prediction of body fluid mixtures was further explored. Methods: In the present study, 420 samples of body fluid mixtures and 250 samples of single body fluids were tested using an optimized multiplex methylation system. Each kind of body fluid sample presented a specific methylation profile across the 10 markers. Results: Significant differences in methylation levels were observed between the mixtures and single body fluids. For all kinds of mixtures, Spearman's correlation analysis revealed a significantly strong correlation between the methylation levels and component proportions (1:20, 1:10, 1:5, 1:1, 5:1, 10:1, and 20:1). Two random forest classification models were trained for the prediction of mixture types and the prediction of the mixture proportion of 2 components, based on the methylation levels of the 10 markers. For the mixture prediction, Model-1 presented outstanding prediction accuracy, which reached 99.3% in 427 training samples and 100% in 243 independent test samples. For the mixture proportion prediction, Model-2 demonstrated an accuracy of 98.8% in 252 training samples and 98.2% in 168 independent test samples. The total prediction accuracy reached 99.3% for body fluid mixtures and 98.6% for the mixture proportions. Conclusion: These results indicate the excellent capability and value of the multiplex methylation system in the identification of forensic body fluid mixtures.
Funding: Project supported in part by the National Natural Science Foundation of China (Grant No. 12075168) and the Fund from the Science and Technology Commission of Shanghai Municipality (Grant No. 21JC1405600).
Abstract: Layered pavements usually exhibit complicated mechanical properties owing to complex material properties and the external environment. In some cases, such as launching missiles or rockets, layered pavements are required to bear large impulse loads. However, traditional methods cannot non-destructively and quickly detect the internal structure of pavements. Thus, accurate and fast prediction of the mechanical properties of layered pavements is of great importance. In recent years, machine learning has shown great superiority in solving nonlinear problems. In this work, we present a method of predicting the maximum deflection and damage factor of layered pavements under instantaneous large impact, based on random forest regression with the deflection basin parameters obtained from falling weight deflection testing. The regression coefficient R^2 on the testing datasets is above 0.94 when predicting the elastic moduli of structural layers and the mechanical responses, which indicates that the prediction results are highly consistent with finite element simulation results. This paper provides a novel method for fast and accurate prediction of pavement mechanical responses under instantaneous large impact load using partial structural parameters of pavements, and has application potential in non-destructive evaluation of pavement structure.
Funding: Supported by the National Natural Science Foundation of China (No. 62101298) and the Collaborative Education Project between Industry and Academia, China (22050609312501).
Abstract: The 60 GHz millimeter wave (mmWave) system provides extremely high time resolution and multipath component (MPC) separation, and has great potential to achieve high precision in indoor positioning. However, the ranging data is often contaminated by non-line-of-sight (NLOS) transmission. First, six features of the 60 GHz mmWave signal under line-of-sight (LOS) and NLOS conditions are evaluated. Next, a classifier constructed with the random forest (RF) algorithm is used to identify LOS or NLOS channels. The identification mechanism has excellent generalization performance, and the classification accuracy is over 97%. Finally, based on the identification results, a residual weighted least squares positioning method is proposed. Because all ranging information, including that from NLOS channels, is fully utilized, positioning failure caused by insufficient LOS links can be avoided. Compared with the conventional least squares approach, the positioning error of the proposed algorithm is reduced by 49%.
Funding: Supported by the Guangdong Major Project of Basic and Applied Basic Research under Grant No. 2021B0301030006, and by the computational resources from SYSU and the National Supercomputer Center in Guangzhou.
Abstract: Information on the decay process of nuclides in the superheavy region is critical in investigating new elements beyond oganesson and the island of stability. This paper presents the application of a random forest algorithm to examine the competition among different decay modes in the superheavy region, including α decay, β- decay, β+ decay, electron capture, and spontaneous fission. The observed half-lives and dominant decay modes are well reproduced. The dominant decay mode of 96.9% of the nuclei beyond ^212Po is correctly obtained. Further, α decay is predicted to be the dominant decay mode for isotopes of the new elements Z = 119-122, except for spontaneous fission in certain even-even elements owing to the increased Coulomb repulsion and odd-even effect. The predicted half-lives demonstrate the existence of a long-lived spontaneous fission island southwest of ^298Fl caused by the competition between the fission barrier and Coulomb repulsion. A better understanding of spontaneous fission, particularly beyond ^286Fl, is crucial in the search for new elements and the island of stability.
Abstract: Coronary artery disease (CAD) is one of the most serious cardiovascular afflictions because it is an uncommonly overwhelming heart issue. Coronary cardiovascular disease is one of the principal causes of death all over the world. Cardiovascular deterioration is a challenge, especially in youthful and rural countries where there is an absence of trained professionals. Since heart diseases happen without apparent signs, high-level detection is desirable. This paper proposes a robust and tuned random forest model using the randomized grid search technique to predict CAD. The proposed framework improves CAD prediction by tracking down risk indicators and learning the complex interactions between them. Nowadays, the healthcare industry has a lot of data but needs to gain more knowledge from it. Our proposed framework is used for extracting knowledge from data stores and using that knowledge to help doctors accurately and effectively diagnose heart disease (HD). We evaluated the proposed framework on two public databases, the Cleveland and Framingham datasets. First, the datasets were preprocessed using a cleaning technique, a normalization technique, and an outlier detection technique. Secondly, the principal component analysis (PCA) algorithm was utilized to reduce the feature dimensionality of the two datasets. Finally, we used a hyperparameter tuning technique, randomized grid search, to tune a random forest (RF) machine learning (ML) model. The randomized grid search selected the best parameters and obtained the ideal CAD analysis. The proposed framework was evaluated and compared with traditional classifiers. The accuracy, sensitivity, precision, specificity, and f1-score of our proposed framework were 100%. The evaluation showed that the tuned framework gives an unrivaled predictive outcome compared with other existing frameworks.
Funding: This work is supported by the National MCF Energy R&D Program of China (Grant Nos. 2018YFE0302100 and 2019YFE03010003), the National Natural Science Foundation of China (Grant Nos. 12005264, 12105322, and 12075285), the National Magnetic Confinement Fusion Science Program of China (Grant No. 2022YFE03100003), the Natural Science Foundation of Anhui Province of China (Grant No. 2108085QA38), the Chinese Postdoctoral Science Found (Grant No. 2021000278), and the Presidential Foundation of Hefei Institutes of Physical Science (Grant No. YZJJ2021QN12).
Abstract: Multifaceted asymmetric radiation from the edge (MARFE) movement, which can cause density limit disruption, is often encountered during high-density operation on many tokamaks. Therefore, identifying and predicting MARFE movement is meaningful for mitigating or avoiding density limit disruption in steady-state high-density plasma operation. A machine learning method named random forest (RF) has been used to predict MARFE movement based on the density ramp-up experiment in the first 2022 campaign of the Experimental Advanced Superconducting Tokamak (EAST). The RF model shows that, besides the Greenwald fraction, which is the ratio of the plasma density to the Greenwald density limit, dβp/dt, H98, and dWmhd/dt are relatively important parameters for MARFE-movement prediction. Applying the RF model to test discharges shows that the successful alarm rate for MARFE movement causing density limit disruption reaches ~85%, with a minimum alarm time of ~40 ms and a mean alarm time of ~700 ms. At the same time, the false alarm rate for non-disruptive and non-density-limit disruptive discharges can be kept below 5%. These results provide a reference for the prediction of MARFE movement in high-density plasmas, which can help the avoidance or mitigation of density limit disruption in future fusion reactors.
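The Greenwald fraction defined in the abstract above can be computed as in this minimal sketch. It assumes the commonly used form of the Greenwald limit, n_G = I_p / (pi * a^2), with n_G in 10^20 m^-3, plasma current I_p in MA, and minor radius a in m; the function name and the input values are hypothetical and are not EAST data.

```python
import math


def greenwald_fraction(density_1e20, plasma_current_ma, minor_radius_m):
    """Ratio of plasma density to the Greenwald density limit.

    Assumed units: density in 1e20 m^-3, current in MA, minor radius in m.
    """
    greenwald_limit = plasma_current_ma / (math.pi * minor_radius_m**2)
    return density_1e20 / greenwald_limit


# Hypothetical discharge parameters, for illustration only.
f_gw = greenwald_fraction(density_1e20=0.5, plasma_current_ma=0.5, minor_radius_m=0.45)
print(round(f_gw, 3))
```

Values approaching 1 indicate operation close to the density limit, which is why this ratio is a natural input feature for a disruption-related predictor.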