To improve nitrogen removal in the anoxic/oxic (A/O) process for treating domestic wastewater, the following influencing factors were studied: dissolved oxygen (DO), nitrate recirculation, sludge recycle, solids residence time (SRT), influent COD/TN ratio, and hydraulic retention time (HRT). Results indicated that nitrogen removal could be increased by corresponding control strategies, such as adjusting the DO set point according to the effluent ammonia concentration and manipulating the nitrate recirculation flow according to the nitrate concentration at the end of the anoxic zone. Based on the experimental results, a knowledge-based approach for supervising nitrogen removal problems was considered, and decision trees for diagnosing nitrification and denitrification problems were built and successfully applied to the A/O process.
AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C. METHODS: We used the C4.5 classification algorithm to construct decision trees with data from 261 patients with chronic hepatitis C without a liver biopsy. The FibroTest attributes of age, gender, bilirubin, apolipoprotein, haptoglobin, α2 macroglobulin, and γ-glutamyl transpeptidase were used as predictors, and the FibroTest score as the target. For testing, 10-fold cross-validation was used. RESULTS: The overall classification error was 14.9% (accuracy 85.1%). Cases with true FibroTest scores of F0 and F4 were classified with very high accuracy (18/20 for F0, 9/9 for F0-1 and 92/96 for F4), and the largest confusion centered on F3. The algorithm produced a set of compound rules from the ten classification trees, which was used to classify the 261 patients. The rules for classifying patients in F0 and F4 were effective in more than 75% of the cases in which they were tested. CONCLUSION: The recognition of clinical subgroups should help to enhance our ability to assess differences in fibrosis scores in clinical studies and improve our understanding of fibrosis progression.
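The 10-fold cross-validation described in the FibroTest abstract can be sketched as follows. This is a minimal illustration, not the authors' code: `train_and_count_errors` is a hypothetical callback standing in for training a C4.5 tree on the training indices and counting misclassifications on the test indices, and the shuffling seed is an arbitrary assumption.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Split range(n_samples) into k shuffled test folds of near-equal size."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th index starting at i, so sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def cross_validated_error(n_samples, train_and_count_errors, k=10, seed=0):
    """Pool test errors over k folds; every sample is tested exactly once."""
    folds = k_fold_indices(n_samples, k, seed)
    total_errors = 0
    for i, test_idx in enumerate(folds):
        # Training set = all folds except the one held out for testing.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        # Hypothetical callback: fit on train_idx, return error count on test_idx.
        total_errors += train_and_count_errors(train_idx, test_idx)
    return total_errors / n_samples


# 261 samples, as in the abstract; only the split is shown here.
folds = k_fold_indices(261, k=10)
```

Because each fold trains a separate tree, a run with k=10 yields ten trees, which matches the abstract's mention of ten classification trees.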
This paper explores the use of soft decision trees [1] in basic reinforcement learning applications to examine the efficacy of using passive-expert-like networks for optimal Q-value learning on Artificial Neural Networks (ANN). The soft decision tree networks were built using the PyTorch machine learning framework and OpenAI's Gym environment framework. The study aimed to assess the performance of soft decision tree networks on CartPole as provided in the OpenAI Gym software package. The baseline against which the soft decision tree networks were compared was a simple deep neural network using several linear layers, with ReLU and Softmax activation functions for the input and output layers, respectively. All networks were trained using the backpropagation algorithm provided generically by PyTorch's Autograd module.
Based on a fuzzy neural network, this letter presents an approach for the induction of decision trees. The approach makes use of the weights of the fuzzy mappings in a trained fuzzy neural network. It optimizes fuzzy decision trees by branch cutting and improves both the accuracy and the efficiency of decision tree induction.
The ID3 algorithm is a classical decision tree learning algorithm in data mining. The algorithm tends to choose attributes with more values, which affects the efficiency of classification and prediction when building a decision tree. This article proposes a new approach based on an improved ID3 algorithm. The new algorithm introduces an importance factor λ when calculating the information entropy, which strengthens the emphasis on important attributes of a tree and reduces the emphasis on unimportant ones. The algorithm overcomes the flaw of the traditional ID3 algorithm, which tends to choose attributes with more values, and also improves the efficiency and flexibility of decision tree generation.
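The bias that the ID3 abstract refers to can be seen in a small sketch of the standard entropy and information gain calculation. The toy data and attribute names (`id`, `windy`) are assumptions made for illustration, and the importance factor λ is not reproduced because the abstract does not give its formula.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Standard ID3 gain: parent entropy minus weighted child entropy."""
    n = len(labels)
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attribute], []).append(label)
    remainder = sum(len(p) / n * entropy(p) for p in partitions.values())
    return entropy(labels) - remainder


# Hypothetical toy data: "id" is unique per row, "windy" has two values.
rows = [
    {"id": 1, "windy": "no"},
    {"id": 2, "windy": "no"},
    {"id": 3, "windy": "yes"},
    {"id": 4, "windy": "yes"},
]
labels = ["play", "stay", "play", "stay"]

gain_id = information_gain(rows, labels, "id")        # every partition is pure
gain_windy = information_gain(rows, labels, "windy")  # partitions stay mixed
```

The unique `id` attribute obtains the maximum possible gain even though it cannot generalize to unseen rows, which is the many-valued-attribute bias that weighting schemes such as the one proposed in the abstract aim to counteract.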
In many decision-making tasks, the features and the decision are ordinal. Several ordinal classification learning algorithms have been developed in recent years; however, it has been shown that these algorithms are sensitive to noisy samples and do not work well in real-world applications. In this work, we propose a new measure of feature quality, called rank mutual information. We then design an ordinal decision tree (REOT) construction technique based on rank mutual information. Theoretical and experimental analysis shows that the proposed algorithm is effective.
Traditional 3Ni weathering steel cannot completely meet the requirements of offshore engineering development, so designing novel 3Ni steel with added microalloy elements such as Mn or Nb for strength enhancement has become a trend. The stress-assisted corrosion behavior of a newly designed high-strength 3Ni steel was investigated in the current study using the corrosion big data method. Information on the corrosion process was recorded using the galvanic corrosion current monitoring method. The gradient boosting decision tree (GBDT) machine learning method was used to mine the corrosion mechanism, and the importance of the structure factor was investigated. Field exposure tests were conducted to verify the results calculated with the GBDT method. Results indicated that the GBDT method can be effectively used to study the influence of structural factors on the corrosion process of 3Ni steel. The different mechanisms by which Mn and Cu additions affect the stress-assisted corrosion of 3Ni steel suggested that Mn and Cu have no obvious effect on the corrosion rate of non-stressed 3Ni steel during the early stage of corrosion. When the corrosion reached a stable state, increasing the Mn content increased the corrosion rate of 3Ni steel, while Cu reduced this rate. In the presence of stress, both increasing the Mn content and adding Cu can inhibit the corrosion process. The corrosion law of outdoor-exposed 3Ni steel is consistent with the law based on corrosion big data technology, verifying the reliability of the big data evaluation method and the selection of the data prediction model.
The Shanxi-Beijing natural gas pipeline crosses the North China Plain and the agricultural region. Residents in the area use rototillers for planting and harvesting; however, the rototillers penetrate the ground more deeply than the burial depth of the pipeline, posing a significant threat to its safe operation. It is therefore of great significance to study the dynamic response of pipelines to rotary tiller impact in order to ensure safe pipeline operation. This article focuses on the Shanxi-Beijing natural gas pipeline, using finite element simulation software to establish a finite element model of the interaction among the machinery, pipeline, and soil, and analyzing the dynamic response of the pipeline. A decision tree model is also introduced to classify pipeline damage under different working conditions, and the boundary value and importance of each factor influencing pipeline damage are derived. Considering the actual conditions in the hemp yam planting area, targeted management measures are proposed to ensure the operational safety of the Shanxi-Beijing natural gas pipeline in this region.
Mangroves are woody plant communities in the intertidal zone of tropical and subtropical coasts that play an important role in these zones. The infrared wave band is one of the key bands in the remote sensing identification of mangrove forest, and ALI (Advanced Land Imager) has a large number of infrared bands. Two angle indices were proposed based on liquid water absorption at band 5p and band 5 of EO-1 ALI, denoted as β1.25 and β1.65 respectively. A decision tree method using β1.25–β1.65 and NDVI (normalized difference vegetation index) was adopted to identify mangrove forest in EO-1 ALI imagery acquired at Shenzhen Bay. The results showed that the reflectance of mangrove forests at band 5p and band 5 was significantly lower than that of terrestrial vegetation because of the coastal wetland characteristics of mangrove forests. This resulted in a greater β1.25–β1.65 value for mangrove forest than for terrestrial vegetation. The decision tree method using β1.25–β1.65 and NDVI effectively distinguishes mangrove forest from other land cover categories. The misclassification and leakage rates were 4.29% and 5.11% respectively. ALI sensors with many infrared bands could play an important role in discriminating mangrove forest.
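A two-level decision rule of the kind the mangrove abstract describes might look like the sketch below. The NDVI formula is the standard one; everything else is an assumption made for illustration: the threshold values are hypothetical, and the angle-index difference β1.25–β1.65 is taken as a precomputed input because the abstract does not give its definition.

```python
def ndvi(nir, red):
    """Normalized difference vegetation index from NIR and red reflectance."""
    total = nir + red
    # Guard against division by zero for pixels with no signal.
    return 0.0 if total == 0 else (nir - red) / total


def classify_pixel(nir, red, beta_diff, ndvi_min=0.3, beta_diff_min=0.0):
    """Classify one pixel with a two-level decision tree.

    beta_diff is the precomputed difference of the two angle indices.
    ndvi_min and beta_diff_min are hypothetical thresholds, not values
    reported in the abstract.
    """
    # Level 1: separate vegetation from everything else using NDVI.
    if ndvi(nir, red) < ndvi_min:
        return "non-vegetation"
    # Level 2: the abstract reports a greater angle-index difference for
    # mangrove than for terrestrial vegetation.
    if beta_diff > beta_diff_min:
        return "mangrove"
    return "terrestrial vegetation"
```

In practice the thresholds would be fitted from labeled training pixels rather than fixed by hand.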
Decision trees and their ensembles have become quite popular for data analysis during the past decade. One of the main reasons is the current boom in big data, where traditional statistical methods (such as multiple linear regression) are not very efficient. In chemometrics, however, these methods are still not very widespread, primarily because of several limitations related to the ratio between the number of variables and the number of observations. This paper presents several examples of how decision trees and their ensembles can be used in the analysis of NIR spectroscopic data, both for regression and for classification. We try to consider all important aspects, including optimization and validation of models, evaluation of results, treatment of missing data, and selection of the most important variables. The performance and outcome of the decision tree-based methods are compared with a more traditional approach based on partial least squares.
Power systems transport an increasing amount of electricity and, in the future, will involve more distributed renewables and dynamic interactions of the equipment. The system response to disturbances must be secure and predictable to avoid power blackouts. The system response can be simulated in the time domain. However, this dynamic security assessment (DSA) is not computationally tractable in real time. A particularly promising approach is to train decision trees (DTs) from machine learning as interpretable classifiers to predict whether the system-wide responses to disturbances are secure. In most research, selecting the best DT model focuses on predictive accuracy. However, it is insufficient to focus solely on predictive accuracy. Missed alarms and false alarms have drastically different costs, and as security assessment is a critical task, interpretability is crucial for operators. In this work, the multiple objectives of interpretability, varying costs, and accuracies are considered for DT model selection. We propose a rigorous workflow to select the best classifier. In addition, we present two graphical approaches for visual inspection to illustrate the selection sensitivity to probability and impacts of disturbances. We propose cost curves to inspect selection combining all three objectives for the first time. Case studies on the IEEE 68 bus system and the French system show that the proposed approach allows for better DT selections, with an 80% increase in interpretability and a 5% reduction in expected operating cost, while making almost zero accuracy compromises. The proposed approach scales well with larger systems and can be used for models beyond DTs. Hence, this work provides insights into criteria for model selection in a promising application for methods from artificial intelligence (AI).
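The point that missed alarms and false alarms carry different costs can be made concrete with the textbook expected-cost formula. The sketch below is an assumption-laden illustration rather than the workflow or cost curves proposed in the abstract: the candidate names, error rates, and cost values are hypothetical, and interpretability is not modeled.

```python
def expected_cost(p_insecure, miss_rate, false_alarm_rate,
                  cost_miss, cost_false_alarm):
    """Expected cost per assessed disturbance for one classifier.

    p_insecure: prior probability that a disturbance is insecure.
    miss_rate: fraction of insecure cases predicted secure.
    false_alarm_rate: fraction of secure cases predicted insecure.
    """
    return (p_insecure * miss_rate * cost_miss
            + (1.0 - p_insecure) * false_alarm_rate * cost_false_alarm)


def select_classifier(candidates, p_insecure, cost_miss, cost_false_alarm):
    """Return the name of the candidate with the lowest expected cost."""
    best = min(
        candidates,
        key=lambda c: expected_cost(p_insecure, c["miss_rate"],
                                    c["false_alarm_rate"],
                                    cost_miss, cost_false_alarm),
    )
    return best["name"]


# Hypothetical candidates: one misses less, the other raises fewer false alarms.
candidates = [
    {"name": "shallow_tree", "miss_rate": 0.02, "false_alarm_rate": 0.10},
    {"name": "deep_tree", "miss_rate": 0.05, "false_alarm_rate": 0.03},
]

costly_misses = select_classifier(candidates, 0.1, cost_miss=100.0,
                                  cost_false_alarm=1.0)
equal_costs = select_classifier(candidates, 0.1, cost_miss=1.0,
                                cost_false_alarm=1.0)
```

With these made-up numbers the preferred tree changes when the cost ratio changes, which is why selecting on accuracy alone can be misleading.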
Decision trees can be used to enhance the interpretability of neural networks. In this work, we compare the classification and interpretability performance of the normal decision tree and a type of soft decision tree when they are used to interpret the decision paths of CNNs. With the help of feature visualization and human-labeled features, we demonstrate that the soft decision trees identify more consistent features while maintaining much higher classification performance than the normal decision tree.
Background: Various methods can be applied to build predictive models for clinical data with a binary outcome variable. This research aims to explore the process of constructing common predictive models, namely logistic regression (LR), decision tree (DT), and multilayer perceptron (MLP), and to focus on specific details of applying these methods: what preconditions should be satisfied, how to set the model parameters, how to screen variables and build accurate models quickly and efficiently, and how to assess generalization ability (that is, prediction performance) reliably by the Monte Carlo method when the sample size is small.
Most stream data classification algorithms apply a supervised learning strategy, which requires massive amounts of labeled data. Such approaches are impractical because labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams that contain both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced or multi-class datasets, for which an average improvement of 5%-9% in AUC performance is reported.
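The two statistical tools named in the ConfDTree abstract, confidence intervals and two-proportion tests, are standard and easy to sketch. The code below shows only the textbook formulas (a Wald interval and a pooled two-proportion z statistic); how ConfDTree applies them to tree nodes is not described in the abstract, so nothing here should be read as the authors' procedure.

```python
from math import sqrt


def wald_interval(successes, n, z=1.96):
    """Approximate 95% confidence interval for a proportion (Wald form)."""
    p = successes / n
    half_width = z * sqrt(p * (1.0 - p) / n)
    # Clip to the valid probability range.
    return max(0.0, p - half_width), min(1.0, p + half_width)


def two_proportion_z(x1, n1, x2, n2):
    """Pooled z statistic for testing whether two proportions differ."""
    p1, p2 = x1 / n1, x2 / n2
    pooled = (x1 + x2) / (n1 + n2)
    se = sqrt(pooled * (1.0 - pooled) * (1.0 / n1 + 1.0 / n2))
    # If both samples are all-success or all-failure, there is no difference.
    return 0.0 if se == 0 else (p1 - p2) / se


# Example: class proportions of 60/100 and 40/100 in two sibling branches.
z_stat = two_proportion_z(60, 100, 40, 100)
low, high = wald_interval(50, 100)
```

A |z| above roughly 1.96 indicates a difference at the 5% level under the usual normal approximation, which becomes unreliable for very small samples.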
Accurate prediction of monthly oil and gas production is essential for oil enterprises to make reasonable production plans, avoid blind investment, and achieve sustainable development. Traditional methods for predicting oil well production trends are based on years of oil field production experience and expertise, and their application conditions are very demanding. With the rapid development of artificial intelligence technology, big data analysis methods are gradually being applied in various sub-fields of oil and gas reservoir development. Based on the data-driven artificial intelligence algorithm Gradient Boosting Decision Tree (GBDT), this paper predicts the initial single-layer production by considering geological data, fluid PVT data, and well data. The results show that the GBDT prediction model has high accuracy, significantly improved efficiency, and strong general applicability. The GBDT model trained in this paper can predict production, which is helpful for well site optimization, perforation layer optimization, and engineering parameter optimization, and has guiding significance for oilfield development.
Traditional dormitory allocation systems neglect students' personality traits, which causes dormitory disputes and affects students' quality of life and academic performance; as a result, the trend has shifted toward intelligent allocation systems based on students' individual differences and needs. This paper collects freshmen's data according to college students' personal preferences, conducts a classification comparison, and uses a decision tree classification algorithm based on the information gain principle as the core algorithm for dormitory allocation. It determines the description rules for students' personal preferences and decision tree classification preferences, completes the conceptual database design of entity relationships and data dictionaries, meets students' personality classification requirements for dormitories, and lays the foundation for an intelligent dormitory allocation system.
Big data is usually unstructured, and many applications require analysis in real time. The decision tree (DT) algorithm is widely used to analyze big data. Selecting the optimal depth of a DT is a time-consuming process because it requires many iterations. In this paper, we design a modified version of the DT. The tree aims to achieve optimal depth by self-tuning its running parameters and thereby improving accuracy. The efficiency of the modified DT was verified using two datasets (an airport dataset and a fire dataset). The airport dataset has 500000 instances and the fire dataset has 600000 instances. The modified DT was compared with the standard DT, and the results show that the modified DT performs better. The comparison was conducted on multiple nodes with Apache Spark on Amazon Web Services, yielding an accuracy increase of 6.85% for the first dataset and 8.85% for the airport dataset. In conclusion, the modified DT showed better accuracy than the standard DT algorithm in handling datasets of different sizes.
To address the multiple types of power quality composite disturbances, strong feature correlation, and high recognition error rate, a method for identifying power quality composite disturbances based on the multiresolution S-transform and decision trees is proposed. First, according to the IEEE standard, signal models of seven single power quality disturbances and 17 combined power quality disturbances are given, and disturbance waveform samples are generated in batches. Then, to improve recognition accuracy, an adjustment factor is introduced to obtain controllable time-frequency resolution through multiresolution S-transform time-frequency domain analysis. On this basis, five time-frequency domain features are extracted that quantitatively reflect the characteristics of the analyzed power quality disturbance signal, fewer than in the traditional S-transform-based method. Finally, three classifiers, namely K-nearest neighbor, support vector machine, and decision tree, are used to identify power quality composite disturbances. Simulation results show that the classification accuracy of the decision tree algorithm is higher than that of K-nearest neighbor and support vector machine. The proposed method is also compared with other commonly used recognition algorithms. Experimental results show that the proposed method is effective in terms of detection accuracy, especially for combined PQ disturbances.
Funding (FibroTest decision tree study): Supported by a grant of the Universidad Nacional Autonoma de Mexico, SDI.PTID.05.6.
Funding (rank mutual information ordinal decision tree study): Supported by the National Natural Science Foundation of China under Grants 60703013 and 10978011; the Key Program of the National Natural Science Foundation of China under Grant 60932008; the National Science Fund for Distinguished Young Scholars under Grant 50925625; and the China Postdoctoral Science Foundation.
Funding (3Ni steel corrosion study): Supported by the National Natural Science Foundation of China (No. 52203376) and the National Key Research and Development Program of China (No. 2023YFB3813200).
Funding (mangrove remote sensing study): National Natural Science Foundation of China (41201461).
文摘Mangroves are woody plant communities in the intertidal zone of tropical and subtropical coasts that play an important role in these zones. The infrared wave band is one of the key bands in the remote sensing identification of mangrove forest, and ALI(advanced land imagery) has a large number of infrared bands. Two angle indices were proposed based on liquid water absorption at band 5p and band 5 of EO-1 ALI, denoted as β1.25 and β1.65 respectively. A decision tree method was adopted to identify mangrove forest using remote sensing techniques for β1.25–β1.65 and NDVI(normalized difference vegetation index) for EO-1 ALI imagery acquired at Shenzhen Bay. The results showed that the reflectance of mangrove forests at band 5p and band 5 was significantly lower than that of terrestrial vegetation due to the characteristics of coastal wetlands of mangrove forests. This resulted in a greater β1.25–β1.65 value for mangrove forest than terrestrial vegetation. The decision tree method using β1.25–β1.65 and NDVI effectively identifies mangrove forest from other land cover categories. The misclassification and leakage rates were 4.29% and 5.11% respectively. ALI sensors with many infrared bands could play an important role in discriminating mangrove forest.
Abstract: Decision trees and their ensembles have become quite popular for data analysis during the past decade. One of the main reasons is the current boom in big data, where traditional statistical methods (e.g., multiple linear regression) are not very efficient. In chemometrics, however, these methods are still not widespread, primarily because of several limitations related to the ratio between the number of variables and the number of observations. This paper presents several examples of how decision trees and their ensembles can be used in the analysis of NIR spectroscopic data for both regression and classification. We try to consider all important aspects, including optimization and validation of models, evaluation of results, treatment of missing data, and selection of the most important variables. The performance and outcome of the decision tree-based methods are compared with a more traditional approach based on partial least squares.
Funding: The authors were supported by a scholarship funded by the Nigerian National Petroleum Corporation (NNPC), the TU Delft AI Labs Programme, NL, and the research project IDLES, UK (EP/R045518/1).
Abstract: Power systems transport an increasing amount of electricity and, in the future, will involve more distributed renewables and dynamic interactions among equipment. The system response to disturbances must be secure and predictable to avoid power blackouts. The system response can be simulated in the time domain. However, this dynamic security assessment (DSA) is not computationally tractable in real time. A particularly promising approach is to train decision trees (DTs) from machine learning as interpretable classifiers that predict whether the system-wide responses to disturbances are secure. In most research, selection of the best DT model focuses on predictive accuracy. However, focusing solely on predictive accuracy is insufficient. Missed alarms and false alarms have drastically different costs, and because security assessment is a critical task, interpretability is crucial for operators. In this work, the multiple objectives of interpretability, varying costs, and accuracy are considered for DT model selection. We propose a rigorous workflow to select the best classifier. In addition, we present two graphical approaches for visual inspection that illustrate the sensitivity of the selection to the probability and impact of disturbances. We propose cost curves to inspect selection, combining all three objectives for the first time. Case studies on the IEEE 68-bus system and the French system show that the proposed approach allows for better DT selections, with an 80% increase in interpretability and a 5% reduction in expected operating cost, while making almost zero compromise in accuracy. The proposed approach scales well to larger systems and can be used for models beyond DTs. Hence, this work provides insights into criteria for model selection in a promising application of methods from artificial intelligence (AI).
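The point that missed alarms and false alarms carry different costs can be illustrated with a small sketch. It covers only the cost objective, not the interpretability criterion or the authors' workflow, and the model names, error counts, and cost values are all hypothetical.

```python
def expected_cost(missed_alarms, false_alarms, n_samples,
                  cost_missed_alarm, cost_false_alarm):
    """Average misclassification cost per sample for one classifier."""
    total = cost_missed_alarm * missed_alarms + cost_false_alarm * false_alarms
    return total / n_samples


def select_model(candidates, n_samples,
                 cost_missed_alarm=10.0, cost_false_alarm=1.0):
    """Return the candidate name with the lowest expected cost.

    candidates maps a model name to (missed_alarms, false_alarms) counted
    on a validation set. The default costs are hypothetical.
    """
    return min(
        candidates,
        key=lambda name: expected_cost(
            *candidates[name], n_samples, cost_missed_alarm, cost_false_alarm
        ),
    )


# Hypothetical validation counts: (missed alarms, false alarms).
candidates = {"shallow_tree": (5, 20), "deep_tree": (8, 4)}
best_when_misses_are_costly = select_model(candidates, n_samples=1000)
best_when_costs_are_equal = select_model(
    candidates, n_samples=1000, cost_missed_alarm=1.0, cost_false_alarm=1.0
)
```

With these numbers the preferred model changes with the cost ratio, which is why accuracy alone can pick the wrong tree.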
Funding: National Defense Science and Technology Innovation Special Zone Project (No. 18-163-11-ZT-002-045-04); Engineering Research Center of State Financial Security, Ministry of Education, Central University of Finance and Economics, Beijing, 102206, China; Program for Innovation Research in Central University of Finance and Economics; National College Students' Innovation and Entrepreneurship Training Program "Research and development of interpretable algorithms and prototype system for small sample image recognition".
Abstract: Decision trees can be used to enhance the interpretability of neural networks. In this work, we compare the classification and interpretability performance of the normal decision tree and a type of soft decision tree when they are used to interpret the decision paths of convolutional neural networks (CNNs). With the help of feature visualization and human-labeled features, we demonstrate that the soft decision trees identify more consistent features while maintaining much higher classification performance than the normal decision tree.
Funding: This work was supported by grants from the National Natural Science Foundation of China (No. 21003077), the College of Public Health of Tianjin Medical University in China (No. GWKY-2010-01), the Open Project of Key Laboratory of Advanced Energy Materials Chemistry (No. KLAEMC-OP201101), and the Natural Science Foundation of Tianjin, China (No. 08JCZDJC21400).
Abstract: Background: Various methods can be applied to build predictive models for clinical data with a binary outcome variable. This research explores the process of constructing three common predictive models, logistic regression (LR), decision tree (DT), and multilayer perceptron (MLP), and focuses on specific details of applying them: what preconditions should be satisfied, how to set the model parameters, how to screen variables and build accurate models quickly and efficiently, and how to reliably assess generalization ability (that is, prediction performance) by the Monte Carlo method when the sample size is small.
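One common way to assess generalization by a Monte Carlo method on a small sample is repeated random train/test splitting (Monte Carlo cross-validation). The sketch below assumes that reading; the abstract does not specify the procedure, and `monte_carlo_cv`, `majority_baseline`, and all parameter values are illustrative names and choices rather than the authors' own.

```python
import random
from statistics import mean


def monte_carlo_cv(X, y, fit_predict, n_repeats=100, test_fraction=0.3, seed=0):
    """Mean test accuracy over repeated random train/test splits.

    fit_predict(X_train, y_train, X_test) must return predicted labels
    for X_test; any model (LR, DT, MLP) can be wrapped this way.
    """
    rng = random.Random(seed)
    indices = list(range(len(y)))
    n_test = max(1, int(round(test_fraction * len(y))))
    scores = []
    for _ in range(n_repeats):
        rng.shuffle(indices)  # a fresh random partition on every repeat
        test_idx, train_idx = indices[:n_test], indices[n_test:]
        predictions = fit_predict(
            [X[i] for i in train_idx],
            [y[i] for i in train_idx],
            [X[i] for i in test_idx],
        )
        correct = sum(p == y[i] for p, i in zip(predictions, test_idx))
        scores.append(correct / n_test)
    return mean(scores)


def majority_baseline(X_train, y_train, X_test):
    """Trivial stand-in model: always predict the most frequent training label."""
    majority = max(set(y_train), key=y_train.count)
    return [majority] * len(X_test)
```

Averaging over many random splits reduces the dependence of the estimate on any single partition, which matters most when few samples are available.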
Funding: Supported by the National Natural Science Foundation of China (No. 60673024) and the "Eleventh Five" Preliminary Research Project of PLA (No. 102060206).
Abstract: Most stream data classification algorithms apply a supervised learning strategy, which requires massive amounts of labeled data. Such approaches are impractical because labeled data are usually hard to obtain in reality. In this paper, we build a clustering feature decision tree model, CFDT, from data streams containing both unlabeled examples and a small number of labeled examples. CFDT applies a micro-clustering algorithm that scans the data only once to provide statistical summaries of the data for incremental decision tree induction. Micro-clusters also serve as classifiers in tree leaves to improve classification accuracy and reinforce the any-time property. Our experiments on synthetic and real-world datasets show that CFDT is highly scalable for data streams while achieving high classification accuracy at high speed.
Abstract: Decision trees have three main disadvantages: reduced performance when the training set is small; rigid decision criteria; and the fact that a single "uncharacteristic" attribute might "derail" the classification process. In this paper we present ConfDTree (Confidence-Based Decision Tree), a post-processing method that enables decision trees to better classify outlier instances. This method, which can be applied to any decision tree algorithm, uses easy-to-implement statistical methods (confidence intervals and two-proportion tests) to identify hard-to-classify instances and to propose alternative routes. The experimental study indicates that the proposed post-processing method consistently and significantly improves the predictive performance of decision trees, particularly for small, imbalanced, or multi-class datasets, for which an average improvement of 5%-9% in AUC performance is reported.
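For reference, a standard two-proportion z-test, one of the statistical tools this abstract names, can be written as below. This is the textbook pooled-variance test, not ConfDTree's actual procedure, which the abstract does not detail; the function name and the example counts are illustrative.

```python
from math import erf, sqrt


def two_proportion_z_test(successes_a, n_a, successes_b, n_b):
    """Two-sided z-test for the difference between two proportions.

    Returns (z, p_value). Uses the pooled proportion under the null
    hypothesis that both groups share the same underlying rate.
    Undefined (division by zero) when the pooled proportion is 0 or 1.
    """
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_a - p_b) / std_err
    # Standard normal CDF evaluated at |z|, expressed through erf.
    normal_cdf = 0.5 * (1 + erf(abs(z) / sqrt(2)))
    p_value = 2 * (1 - normal_cdf)
    return z, p_value


# Illustrative counts: class frequency at two tree nodes.
z_stat, p_val = two_proportion_z_test(60, 100, 40, 100)
```

A small p-value suggests the class proportions in the two groups genuinely differ; a large one suggests the observed split may not be reliable.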
Abstract: Accurate prediction of monthly oil and gas production is essential for oil enterprises to make reasonable production plans, avoid blind investment, and achieve sustainable development. Traditional methods for predicting oil well production trends are based on years of oil field production experience and expertise, and their application conditions are very demanding. With the rapid development of artificial intelligence technology, big data analysis methods are gradually being applied in various sub-fields of oil and gas reservoir development. Based on the data-driven artificial intelligence algorithm Gradient Boosting Decision Tree (GBDT), this paper predicts initial single-layer production by considering geological data, fluid PVT data, and well data. The results show that the GBDT prediction model achieves high accuracy, significantly improved efficiency, and strong general applicability. The GBDT model trained in this paper can predict production, which is helpful for well site optimization, perforation layer optimization, and engineering parameter optimization, and provides guidance for oilfield development.
Abstract: Traditional dormitory allocation systems neglect students' personality traits, which causes dormitory disputes and affects students' quality of life and academic performance; designing an intelligent allocation system based on students' individual differences and needs has therefore become a priority. This paper collects freshmen's data according to college students' personal preferences, conducts a classification comparison, and uses a decision tree classification algorithm based on the information gain principle as the core algorithm for dormitory allocation. It determines the description rules for students' personal preferences and the decision tree classification of preferences, completes the conceptual design of the database (entity relations and data dictionaries), meets students' personality classification requirements for the dormitory, and lays the foundation for an intelligent dormitory allocation system.
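The information gain principle mentioned here selects, at each tree node, the attribute whose split most reduces label entropy. A minimal sketch follows; the attribute names (`sleep`, `smoker`) and the labels are hypothetical toy data, not taken from the study.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy reduction obtained by splitting the rows on one attribute."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(row[attribute], []).append(label)
    # Weighted average entropy of the subsets produced by the split.
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


# Hypothetical preference records and dormitory-type labels.
rows = [
    {"sleep": "early", "smoker": "no"},
    {"sleep": "early", "smoker": "yes"},
    {"sleep": "late", "smoker": "no"},
    {"sleep": "late", "smoker": "yes"},
]
labels = ["quiet", "quiet", "lively", "lively"]
gain_sleep = information_gain(rows, labels, "sleep")
gain_smoker = information_gain(rows, labels, "smoker")
```

In this toy data `sleep` separates the labels perfectly while `smoker` carries no information, so a tree built on information gain would split on `sleep` first.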
Abstract: Big data is usually unstructured, and many applications require it to be analyzed in real time. The decision tree (DT) algorithm is widely used to analyze big data. Selecting the optimal depth of a DT is a time-consuming process because it requires many iterations. In this paper, we design a modified version of the DT. The tree aims to achieve optimal depth by self-tuning its running parameters, thereby improving accuracy. The efficiency of the modified DT was verified using two datasets (airport and fire datasets). The airport dataset has 500000 instances and the fire dataset has 600000 instances. A comparison between the modified DT and the standard DT showed that the modified version performs better. This comparison was conducted on multiple nodes with Apache Spark on Amazon Web Services. Accuracy increased by 6.85% for the first dataset and 8.85% for the airport dataset. In conclusion, the modified DT showed better accuracy than the standard DT algorithm in handling datasets of different sizes.
Funding: Foundation of China (No. 52067013); the Key Natural Science Fund Project of Gansu Provincial Department of Science and Technology (No. 21JR7RA280); the Tianyou Innovation Team Science Foundation of Intelligent Power Supply and State Perception for Rail Transit (No. TY202010); the Natural Science Foundation of Gansu Province (No. 20JR5RA395).
Abstract: To address the problems of multiple types of composite power quality disturbances, strong feature correlation, and high recognition error rates, a method for identifying composite power quality disturbances based on the multi-resolution S-transform and decision trees is proposed. First, following the IEEE standard, signal models of seven single power quality disturbances and 17 combined power quality disturbances are given, and disturbance waveform samples are generated in batches. Then, to improve recognition accuracy, an adjustment factor is introduced to obtain controllable time-frequency resolution through multi-resolution S-transform time-frequency analysis. On this basis, five time-frequency domain features are extracted that quantitatively reflect the characteristics of the analyzed power quality disturbance signal; this is fewer than in the traditional S-transform-based method. Next, three classifiers (K-nearest neighbor, support vector machine, and decision tree) are used to identify the composite power quality disturbances. Simulation results show that the classification accuracy of the decision tree algorithm is higher than that of K-nearest neighbor and support vector machine. Finally, the proposed method is compared with other commonly used recognition algorithms. Experimental results show that the proposed method is effective in terms of detection accuracy, especially for combined power quality disturbances.