Funding: Supported by the National Key R&D Program of China (No. 2018YFB1003905), the National Natural Science Foundation of China (Grant No. 61971032), and the Fundamental Research Funds for the Central Universities (No. FRF-TP-18-008A3).
Abstract: On-site programming big data refers to the massive data generated during software development, characterized by real-time generation, complexity, and difficulty of processing. Data cleaning is therefore essential for on-site programming big data. Duplicate data detection is an important step in data cleaning that can save storage resources and enhance data consistency. To address the shortcomings of the traditional Sorted Neighborhood Method (SNM) and the difficulty of detecting duplicates in high-dimensional data, an optimized algorithm based on random forests with a dynamic, adaptive window size is proposed. The efficiency of the algorithm is improved by refining the key-selection method, reducing the dimensionality of the data set, and using an adaptive variable-size sliding window. Experimental results show that the improved SNM algorithm performs better and achieves higher accuracy.
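To make the sliding-window idea concrete, the following is a minimal sketch of sorted-neighborhood duplicate detection. It is not the algorithm from the paper: the window-resizing rule (grow after a match, shrink after a miss), the use of `difflib.SequenceMatcher` as the similarity measure, and all parameter names and defaults are assumptions chosen for illustration, and the random forest key selection is not shown.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a string similarity ratio in [0, 1]."""
    return SequenceMatcher(None, a, b).ratio()


def snm_duplicates(records, key=lambda r: r, threshold=0.85,
                   min_window=2, max_window=6):
    """Sorted-neighborhood duplicate detection with a simple adaptive window.

    Hypothetical resizing rule (not from the paper): the window grows by one
    after a record that had a match and shrinks by one after one that had none.
    """
    ordered = sorted(records, key=key)
    window = min_window
    pairs = []
    for i, current in enumerate(ordered):
        found = False
        # Compare only with the next (window - 1) records in sort order.
        for j in range(i + 1, min(i + window, len(ordered))):
            if similarity(key(current), key(ordered[j])) >= threshold:
                pairs.append((current, ordered[j]))
                found = True
        if found:
            window = min(window + 1, max_window)
        else:
            window = max(window - 1, min_window)
    return pairs


records = ["john smith", "jon smith", "mary jones", "john smyth", "mary jonse"]
print(snm_duplicates(records))
```

Sorting first means that likely duplicates end up near each other, so each record is compared with a handful of neighbours rather than with every other record.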
Funding: Supported by the State Grid Power Company of Hunan Province Science and Technology Project (No. 5216A517000U).
Abstract: Missing data filling is a key step in power big data preprocessing that helps improve the quality and utilization of electric power data. Because of the limitations of traditional methods for filling missing data, an improved random forest filling algorithm is proposed. Because electric power data have time-series characteristics in both the horizontal and vertical directions, the improved random forest method combines linear interpolation, matrix combination, and matrix transposition to fill large amounts of missing electric power data. The filling results show that the improved random forest filling algorithm is applicable to electric power data with various missing patterns. Moreover, the accuracy of the filling results is high and the model is stable, which helps improve the quality of electric power data.
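As an illustration of two of the building blocks named above, linear interpolation and matrix transposition, here is a minimal sketch that fills gaps along rows and, by transposing, along columns. The abstract does not say how the paper combines these steps with the random forest, so that part is omitted; the function names, the sample matrix, and the choice to fill edge gaps with the nearest known value are assumptions.

```python
def fill_linear(series):
    """Fill None gaps by linear interpolation between known neighbours.

    Leading and trailing gaps are filled with the nearest known value
    (an assumption made for this sketch).
    """
    filled = list(series)
    known = [i for i, v in enumerate(filled) if v is not None]
    if not known:
        raise ValueError("series has no observed values")
    for i in range(known[0]):
        filled[i] = filled[known[0]]
    for i in range(known[-1] + 1, len(filled)):
        filled[i] = filled[known[-1]]
    for left, right in zip(known, known[1:]):
        step = (filled[right] - filled[left]) / (right - left)
        for i in range(left + 1, right):
            filled[i] = filled[left] + step * (i - left)
    return filled


def transpose(matrix):
    """Swap rows and columns of a rectangular list of lists."""
    return [list(col) for col in zip(*matrix)]


# Hypothetical load matrix: rows are days, columns are times of day.
load = [[10.0, None, 14.0],
        [None, 13.0, 15.0],
        [12.0, 14.0, None]]

by_row = [fill_linear(row) for row in load]                        # horizontal
by_col = transpose([fill_linear(col) for col in transpose(load)])  # vertical
```

Transposing lets the same one-dimensional routine serve both directions, which is one plausible reason a method would pair interpolation with transposition.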
Abstract: The Very Fast Decision Tree (VFDT) algorithm is a classification algorithm for data streams. When processing large amounts of data, VFDT requires less time than traditional decision tree algorithms. However, when fewer training samples are available, the labels at VFDT leaf nodes contain more errors, and the classification ability of a single VFDT decision tree is limited. The Random Forest algorithm is an ensemble classifier with high prediction accuracy and good noise tolerance. It consists of multiple decision trees and can compensate for the shortcomings of a single decision tree. In this paper, to improve classification accuracy on data streams, the Random Forest algorithm is integrated into the tree-building process of the VFDT algorithm, and a new Random Forest Based Very Fast Decision Tree algorithm named RFVFDT is designed. The RFVFDT algorithm adopts the tree-building criterion of a Random Forest classifier and extends the Random Forest algorithm with a sliding window to cope with the unbounded nature of data streams and avoid processing delay and data loss. Experimental results on the KDD CUP data sets show that the classification accuracy of RFVFDT is higher than that of VFDT. The fewer the samples, the more pronounced the advantage. RFVFDT is fast when running in multithreaded mode.
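The sensitivity of VFDT to small sample counts can be seen from the Hoeffding bound that the original VFDT algorithm uses to decide when a leaf may be split. The sketch below computes that bound, epsilon = sqrt(R^2 ln(1/delta) / (2n)); the function and parameter names are illustrative, and nothing here reflects how RFVFDT itself changes the split decision.

```python
import math


def hoeffding_bound(value_range: float, delta: float, n: int) -> float:
    """Hoeffding bound for a variable with the given range after n samples.

    With probability 1 - delta, the observed mean is within this distance
    of the true mean.
    """
    return math.sqrt(value_range ** 2 * math.log(1.0 / delta) / (2.0 * n))


def should_split(best_gain: float, second_gain: float,
                 value_range: float, delta: float, n: int) -> bool:
    """Split when the best attribute beats the runner-up by more than the bound."""
    return (best_gain - second_gain) > hoeffding_bound(value_range, delta, n)


for n in (100, 1000, 10000):
    print(n, hoeffding_bound(1.0, 1e-7, n))
```

The bound shrinks only with the square root of the sample count, so a leaf that has seen few samples needs a large gap between candidate attributes before it can split with confidence.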
Funding: Supported by the Major Program of the National Natural Science Foundation of China (No. 32192434), the Fundamental Research Funds of the Chinese Academy of Forestry (No. CAFYBB2019ZD001), and the National Key Research and Development Program of China (No. 2016YFD060020602).
Abstract: Accurately estimating the volume growth of forest ecosystems is important for understanding carbon sequestration and achieving carbon neutrality goals. However, the key environmental factors affecting volume growth differ across scales and plant functional types. This study therefore used random forest algorithms to estimate the volume growth of Larix and Quercus forests, and its influencing factors, based on national-scale forest inventory data in China. The results showed that the models of volume growth performed better in natural forests (R² = 0.65 for Larix and 0.66 for Quercus) than in planted forests (R² = 0.44 for Larix and 0.40 for Quercus). In both natural and planted forests, stand age had a strong relative importance for volume growth (8.6%–66.2%), while edaphic and climatic variables had a limited relative importance (<6.0%). The relationship between stand age and volume growth was unimodal in natural forests and increased linearly in planted Quercus forests. The specific locations of sampling plots (i.e., altitude and aspect) had a high relative importance for volume growth in planted forests (4.1%–18.2%). Altitude affected volume growth positively in planted Larix forests but negatively in planted Quercus forests. Similarly, the effects of other environmental factors on volume growth differed between stand origins (planted versus natural) and plant functional types (Larix versus Quercus). These results highlight that stand age was the most important predictor of volume growth and that the effects of environmental factors on volume growth varied among stand origins and plant functional types. Our findings provide a framework for site-specific recommendations on the management practices needed to maintain volume growth in China's forest ecosystems.
Funding: Supported by the CERNET Innovation Project (No. NGII20190315) and the Foundation of A Hundred Youth Talents Training Program of Lanzhou Jiaotong University.
Abstract: When applied to unbalanced data, the traditional random forest algorithm cannot achieve satisfactory prediction results for the minority class and suffers from a parameter selection dilemma. To address this problem, this paper proposes an unbalanced accuracy weighted random forest algorithm (UAW_RF) based on adaptive step size artificial bee colony optimization. It combines the ideas of decision tree optimization, sampling selection, and weighted voting to improve the ability of the random forest algorithm to classify biased data. An adaptive step size and the optimal solution were introduced to improve the position-updating formula of the artificial bee colony algorithm, and the improved algorithm was then used to iteratively optimize the parameter combination of the random forest. Experimental results show satisfactory accuracy and demonstrate that the method can effectively improve the classification accuracy of the random forest algorithm.
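The weighted-voting idea mentioned above can be sketched as follows. This is a generic weighted vote, not the UAW_RF weighting scheme, which the abstract does not specify; the weights here are hypothetical per-tree scores (for example, each tree's accuracy on held-out minority-class samples), and the function name is illustrative.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Return the label with the largest total weight.

    predictions: one predicted label per tree.
    weights: one non-negative weight per tree (hypothetical, e.g. a
    per-tree accuracy score; the paper's actual weighting is not shown).
    """
    if len(predictions) != len(weights):
        raise ValueError("need exactly one weight per prediction")
    totals = defaultdict(float)
    for label, weight in zip(predictions, weights):
        totals[label] += weight
    return max(totals, key=totals.get)


# Two weak trees vote "majority"; one strong tree votes "minority".
print(weighted_vote(["majority", "majority", "minority"], [0.4, 0.4, 0.9]))
```

With equal weights this reduces to ordinary majority voting; unequal weights let a few trees that handle the minority class well outvote a larger number that do not.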
Funding: Supported by the National Key R&D Program of China (Grant Nos. 2016YFC0600107 and 2016YFC0600108).
Abstract: The North China district has been the subject of significant research on the ore-forming dynamics, processes, and quantitative forecasting of gold deposits; it accounts for the largest gold reserves and annual production in China. Based on the top-level design of geoscience theory and the method adopted by the National Key R&D Project (deep process and metallogenic mechanism of the North China Craton (NCC) metallogenic system), this paper systematically collects and constructs geoscience data (at district, camp, and deposit scales) for four key gold districts of North China (Jiaojia-Sanshandao, Southern Zhaoping, Wulong, and Qingchengzi). The settings associated with the geological dynamics of gold deposits were analyzed quantitatively and synthetically, namely NCC destruction, metallogenic events, genetic models, and exploration models. Three-dimensional (3D) and four-dimensional (4D) geological modeling was performed using the big data on the districts, and the district-scale 3D exploration criteria were integrated to construct a quantitative exploration model. FLAC3D modeling and the GeoCube software (version 3.0) were used to run the numerical simulation of the 3D geological models and to constrain the fluid saturation parameters of the Jiaojia fault, in order to reconstruct 4D fault structure models of the Jiaojia fault (to a depth of 5000 m). Using GeoCube 3.0, multiple integration modules (general weights of evidence (WofE), Boost WofE, Fuzzy WofE, Logistic Regression, Information Entropy, and Random Forest) and exploration criteria were integrated, and the C-V fractal classification of class A, B, and C targets in the four districts was carried out. The research results are summarized in the following four areas: (1) The four gold districts in the study area have more than three targets (at a depth of 3000 m), and the class A, B, and C targets show a good spatial correlation with gold bodies that are controlled by mining engineering at depths greater than 1000 m. (2) The Boost WofE method was used for target optimization in 3D space (at depths of 3000–5000 m) in the Jiaojia-Sanshandao, Southern Zhaoping, and Wulong districts. (3) The general WofE method is based on Bayesian theory in 3D space and provides robust integration and target optimization suitable for the Jiaojia-Sanshandao and Southern Zhaoping districts in the Jiaodong area; it can also be applied to the Wulong district in the Liaodong area using a quantitative genetic model and an exploration model. Random forest is a multi-objective integration and target optimization method for 3D space, and it is suitable for the complex exploration model in the Qingchengzi district of the Liaodong area. The genetic model and exploration criteria associated with the exploration model of the Qingchengzi district were constrained by the common characteristics of the gold fault structure, magmatic rock emplacement in North China, and the strata fold and interlayer detachment structure. (4) Based on the gold reserves and the 3D block unit model of the Sanshandao gold deposit in the Jiaojia-Sanshandao district, the gold contents of the 3D block units in class A and B targets of the ore concentration were estimated to be 65.5% and 25.1%, respectively. The total Au resources of the optimized targets below a depth of 3000 m were 3908 t (including 1700 t of reserves), and the total Au resources of the targets at depths from 3000 to 5000 m were 936 t. The study shows that the deep gold deposits in the four gold districts of North China exhibit a strong "transport-deposition" spatial correlation with potential targets. These "transport-deposition" spatial models represent the tectonic-magmatic-hydrothermal activities of the metallogenic system associated with the NCC destruction events and indicate the Au enrichment zones.
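For readers unfamiliar with weights of evidence, the sketch below computes the standard positive and negative weights for a single binary evidence layer from unit-cell counts: W+ = ln[P(B|D) / P(B|not D)] and W- = ln[P(not B|D) / P(not B|not D)], where B is the evidence pattern and D is a known deposit. It does not reproduce the GeoCube modules or the Boost and Fuzzy variants; the function name, argument names, and counts are made up for illustration.

```python
import math


def weights_of_evidence(n_total, n_deposit, n_pattern, n_both):
    """Return (W+, W-, contrast) for one binary evidence pattern.

    n_total:   number of unit cells in the study volume
    n_deposit: cells containing a known deposit
    n_pattern: cells where the evidence pattern is present
    n_both:    cells with both the pattern and a deposit
    All four counts are hypothetical inputs for this sketch.
    """
    p_pattern_given_deposit = n_both / n_deposit
    p_pattern_given_barren = (n_pattern - n_both) / (n_total - n_deposit)
    w_plus = math.log(p_pattern_given_deposit / p_pattern_given_barren)
    w_minus = math.log((1 - p_pattern_given_deposit)
                       / (1 - p_pattern_given_barren))
    return w_plus, w_minus, w_plus - w_minus


w_plus, w_minus, contrast = weights_of_evidence(
    n_total=10000, n_deposit=100, n_pattern=2000, n_both=60)
print(w_plus, w_minus, contrast)
```

A positive W+ paired with a negative W- means the pattern is spatially associated with deposits, and the contrast summarizes the strength of that association.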
Funding: This work was funded by the USDA Forest Service Resources Planning Act Assessment, via an agreement with North Carolina State University.
Abstract: Background: Knowledge of the different kinds of tree communities that currently exist can provide a baseline for assessing the ecological attributes of forests and monitoring future changes. Forest inventory data can facilitate the development of this baseline knowledge across broad extents, but they must first be classified into forest community types. Here, we compared three alternative classifications across the United States using data from over 117,000 U.S. Department of Agriculture Forest Service Forest Inventory and Analysis (FIA) plots. Methods: Each plot had three forest community type labels: (1) "FIA" types were assigned by the FIA program using a supervised method; (2) "USNVC" types were assigned via a key based on the U.S. National Vegetation Classification; (3) "empirical" types resulted from unsupervised clustering of tree species information. We assessed the degree to which analog classes occurred among classifications, compared indicator species values, and used random forest models to determine how well the classifications could be predicted using environmental variables. Results: The classifications generated groups of classes that had broadly similar distributions, but often there was no one-to-one analog across the classifications. The longleaf pine forest community type stood out as the exception: it was the only class with strong analogs across all classifications. Analogs were most lacking for forest community types with species that occurred across a range of geographic and environmental conditions, such as loblolly pine types. Indicator species metrics were generally high for the USNVC, suggesting that USNVC classes are floristically well-defined. The empirical classification was best predicted by environmental variables. The most important predictors differed slightly but were broadly similar across all classifications, and included slope, amount of forest in the surrounding landscape, average minimum temperature, and other climate variables. Conclusions: The classifications have similarities and differences that reflect their differing approaches and objectives. They are most consistent for forest community types that occur in a relatively narrow range of environmental conditions, and differ most for types with wide-ranging tree species. Environmental variables at a variety of scales were important for predicting all classifications, though most strongly for the empirical and FIA classifications, suggesting that each is useful for studying how forest communities respond to multi-scale environmental processes, including global change drivers.
Funding: Supported by the cooperation project "Research on Green Cloud IDC Resource Scheduling" with ZTE Corporation.
Abstract: MapReduce is a programming model for processing large data sets, and Hadoop is the most popular open-source implementation of MapReduce. To achieve high performance, up to 190 Hadoop configuration parameters must be manually tuned. This is not only time-consuming but also error-prone. In this paper, we propose a new performance model based on random forest, a recently developed machine-learning algorithm. The model, called RFMS, is used to predict the performance of a Hadoop system according to the system's configuration parameters. RFMS is created from 2000 distinct fine-grained performance observations with different Hadoop configurations. We test RFMS against the measured performance of representative workloads from the Hadoop Micro-benchmark suite. The results show that the prediction accuracy of RFMS reaches 95% on average and up to 99%. This new, highly accurate prediction model can be used to automatically optimize the performance of Hadoop systems.