[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intel...[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intelligibility of the land evaluation knowledge.[Method] The land evaluation method combining classification rule extracted by C4.5 algorithm with fuzzy decision was proposed in this study.[Result] The result of Second General Soil Survey of Guangdong Province had demonstrated that the method was convenient to extract classification rules,and by using only 100 rules,quantity correct rate 86.67% and area correct rate 84.80% of land evaluation could be obtained.[Conclusions] The use of C4.5 algorithm to obtain the rules,combined with fuzzy decision algorithm to build classifiers had got satisfactory results,which provided a practical algorithm for the land evaluation.展开更多
As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to...As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to detect, speculative execution is usually used for dealing with this problem, by simply backing up those stragglers on alternative nodes. In this paper, we design a new Speculative Execution algorithm based on C4.5 Decision Tree, SECDT, for Hadoop. In SECDT, we speculate completion time of stragglers and also of backup tasks, based on a kind of decision tree method: C4.5 decision tree. After we speculate the completion time, we compare the completion time of stragglers and of the backup tasks, calculating their differential value, and selecting the straggler with the maximum differential value to start the backup task.Experiment result shows that the SECDT can predict execution time more accurately than other speculative execution methods, hence reduce the job completion time.展开更多
Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting mo...Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.展开更多
Machine learning algorithms are an important measure with which to perform landslide susceptibility assessments, but most studies use GIS-based classification methods to conduct susceptibility zonation.This study pres...Machine learning algorithms are an important measure with which to perform landslide susceptibility assessments, but most studies use GIS-based classification methods to conduct susceptibility zonation.This study presents a machine learning approach based on the C5.0 decision tree(DT) model and the K-means cluster algorithm to produce a regional landslide susceptibility map. Yanchang County, a typical landslide-prone area located in northwestern China, was taken as the area of interest to introduce the proposed application procedure. A landslide inventory containing 82 landslides was prepared and subsequently randomly partitioned into two subsets: training data(70% landslide pixels) and validation data(30% landslide pixels). Fourteen landslide influencing factors were considered in the input dataset and were used to calculate the landslide occurrence probability based on the C5.0 decision tree model.Susceptibility zonation was implemented according to the cut-off values calculated by the K-means cluster algorithm. The validation results of the model performance analysis showed that the AUC(area under the receiver operating characteristic(ROC) curve) of the proposed model was the highest, reaching 0.88,compared with traditional models(support vector machine(SVM) = 0.85, Bayesian network(BN) = 0.81,frequency ratio(FR) = 0.75, weight of evidence(WOE) = 0.76). The landslide frequency ratio and frequency density of the high susceptibility zones were 6.76/km^(2) and 0.88/km^(2), respectively, which were much higher than those of the low susceptibility zones. The top 20% interval of landslide occurrence probability contained 89% of the historical landslides but only accounted for 10.3% of the total area.Our results indicate that the distribution of high susceptibility zones was more focused without containing more " stable" pixels. Therefore, the obtained susceptibility map is suitable for application to landslide risk management practices.展开更多
AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with d...AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with data from 261 patients with chronic hepatitis C without a liver biopsy. The FibroTest attributes of age, gender, bilirubin, apolipoprotein, haptoglobin, α2 macroglobulin, and γ-glutamyl transpeptidase were used as predictors, and the FibroTest score as the target. For testing, a 10-fold cross validation was used.RESULTS: The overall classification error was 14.9% (accuracy 85.1%). FibroTest's cases with true scores of FO and F4 were classified with very high accuracy (18/20 for FO, 9/9 for FO-1 and 92/96 for F4) and the largest confusion centered on F3. The algorithm produced a set of compound rules out of the ten classification trees and was used to classify the 261 patients. The rules for the classification of patients in FO and F4 were effective in more than 75% of the cases in which they were tested.CONCLUSION: The recognition of clinical subgroups should help to enhance our ability to assess differences in fibrosis scores in clinical studies and improve our understanding of fibrosis progression,展开更多
Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world...Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world refineries. In academic studies, refinery scheduling is usually treated as an integrated, large-scale optimization problem,though such complex optimization problems are extremely difficult to solve. In this paper, we proposed a way to exploit the prior knowledge existing in refineries, and developed a decision making system to guide the scheduling process. For a real world fuel oil oriented refinery, ten adjusting process scales are predetermined. A C4.5 decision tree works based on the finished oil demand plan to classify the corresponding category(i.e. adjusting scale). Then,a specific sub-scheduling problem with respect to the determined adjusting scale is solved. The proposed strategy is demonstrated with a scheduling case originated from a real world refinery.展开更多
In this work, a hardware intrusion detection system (IDS) model and its implementation are introduced to perform online real-time traffic monitoring and analysis. The introduced system gathers some advantages of man...In this work, a hardware intrusion detection system (IDS) model and its implementation are introduced to perform online real-time traffic monitoring and analysis. The introduced system gathers some advantages of many IDSs: hardware based from implementation point of view, network based from system type point of view, and anomaly detection from detection approach point of view. In addition, it can detect most of network attacks, such as denial of services (DOS), leakage, etc. from detection behavior point of view and can detect both internal and external intruders from intruder type point of view. Gathering these features in one IDS system gives lots of strengths and advantages of the work. The system is implemented by using field programmable gate array (FPGA), giving a more advantages to the system. A C5.0 decision tree classifier is used as inference engine to the system and gives a high detection ratio of 99.93%.展开更多
Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the betteri...Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.展开更多
Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds...Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds of classification rules in the application,two fuzzy classifiers were established by combining with fuzzy decision algorithm especially based on Second General Soil Survey of Guangdong Province.The results of experiments demonstrated that the fuzzy classifier based on association rules obtain a higher accuracy rate,but with more complex calculation process and more computational overhead;the fuzzy classifier based on C4.5 rules obtain a slightly lower accuracy,but with fast computation and simpler calculation.展开更多
Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to...Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to the C4.5 algorithm for network intrusion detection in order to detect misuse behaviors of internal attackers through effective classification and decision making in computer networks. This enhanced C4.5 algorithm derives a set of classification rules from network audit data and then the generated rules are used to detect network intrusions in a real-time environment. Unlike most existing decision tree based approaches, the spawned rules generated and fired in this work are more effective because the information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found. The main advantage of this proposed algorithm is that the generalization ability of enhanced C4.5 decision trees is better than that of C4.5 decision trees. We have employed data from the third international knowledge discovery and data mining tools competition (KDDcup’99) to train and test the feasibility of this proposed model. By applying the enhanced C4.5 algorithm an average detection rate of 93.28 percent and a false positive rate of 0.7 percent have respectively been obtained in this work.展开更多
Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construct...Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.展开更多
Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play...Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.展开更多
基金Supported by Science and Technology Plan Project of Guangdong Province (2009B010900026,2009CD058,2009CD078,2009CD079,2009CD080)Special Funds for Support Program of Development of Modern Information Service Industry of Guangdong Province(06120840B0370124 )Fund Project of South China Agricultural University (2007K017)~~
文摘[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intelligibility of the land evaluation knowledge.[Method] The land evaluation method combining classification rule extracted by C4.5 algorithm with fuzzy decision was proposed in this study.[Result] The result of Second General Soil Survey of Guangdong Province had demonstrated that the method was convenient to extract classification rules,and by using only 100 rules,quantity correct rate 86.67% and area correct rate 84.80% of land evaluation could be obtained.[Conclusions] The use of C4.5 algorithm to obtain the rules,combined with fuzzy decision algorithm to build classifiers had got satisfactory results,which provided a practical algorithm for the land evaluation.
文摘As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to detect, speculative execution is usually used for dealing with this problem, by simply backing up those stragglers on alternative nodes. In this paper, we design a new Speculative Execution algorithm based on C4.5 Decision Tree, SECDT, for Hadoop. In SECDT, we speculate completion time of stragglers and also of backup tasks, based on a kind of decision tree method: C4.5 decision tree. After we speculate the completion time, we compare the completion time of stragglers and of the backup tasks, calculating their differential value, and selecting the straggler with the maximum differential value to start the backup task.Experiment result shows that the SECDT can predict execution time more accurately than other speculative execution methods, hence reduce the job completion time.
基金Supported by Science and Technology Plan of Mudanjiang City (G200920064)Teaching Reform Construction of Mudanjiang Normal University (10-xj11080)
文摘Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.
基金This research is funded by the National Natural Science Foundation of China(Grant Nos.41807285 and 51679117)Key Project of the State Key Laboratory of Geohazard Prevention and Geoenvironment Protection(SKLGP2019Z002)+3 种基金the National Science Foundation of Jiangxi Province,China(20192BAB216034)the China Postdoctoral Science Foundation(2019M652287 and 2020T130274)the Jiangxi Provincial Postdoctoral Science Foundation(2019KY08)Fundamental Research Funds for National Universities,China University of Geosciences(Wuhan)。
文摘Machine learning algorithms are an important measure with which to perform landslide susceptibility assessments, but most studies use GIS-based classification methods to conduct susceptibility zonation.This study presents a machine learning approach based on the C5.0 decision tree(DT) model and the K-means cluster algorithm to produce a regional landslide susceptibility map. Yanchang County, a typical landslide-prone area located in northwestern China, was taken as the area of interest to introduce the proposed application procedure. A landslide inventory containing 82 landslides was prepared and subsequently randomly partitioned into two subsets: training data(70% landslide pixels) and validation data(30% landslide pixels). Fourteen landslide influencing factors were considered in the input dataset and were used to calculate the landslide occurrence probability based on the C5.0 decision tree model.Susceptibility zonation was implemented according to the cut-off values calculated by the K-means cluster algorithm. The validation results of the model performance analysis showed that the AUC(area under the receiver operating characteristic(ROC) curve) of the proposed model was the highest, reaching 0.88,compared with traditional models(support vector machine(SVM) = 0.85, Bayesian network(BN) = 0.81,frequency ratio(FR) = 0.75, weight of evidence(WOE) = 0.76). The landslide frequency ratio and frequency density of the high susceptibility zones were 6.76/km^(2) and 0.88/km^(2), respectively, which were much higher than those of the low susceptibility zones. The top 20% interval of landslide occurrence probability contained 89% of the historical landslides but only accounted for 10.3% of the total area.Our results indicate that the distribution of high susceptibility zones was more focused without containing more " stable" pixels. Therefore, the obtained susceptibility map is suitable for application to landslide risk management practices.
基金Supported by A grant of the Universidad Nacional Autonoma de Mexico SDI.PTID.05.6
文摘AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with data from 261 patients with chronic hepatitis C without a liver biopsy. The FibroTest attributes of age, gender, bilirubin, apolipoprotein, haptoglobin, α2 macroglobulin, and γ-glutamyl transpeptidase were used as predictors, and the FibroTest score as the target. For testing, a 10-fold cross validation was used.RESULTS: The overall classification error was 14.9% (accuracy 85.1%). FibroTest's cases with true scores of FO and F4 were classified with very high accuracy (18/20 for FO, 9/9 for FO-1 and 92/96 for F4) and the largest confusion centered on F3. The algorithm produced a set of compound rules out of the ten classification trees and was used to classify the 261 patients. The rules for the classification of patients in FO and F4 were effective in more than 75% of the cases in which they were tested.CONCLUSION: The recognition of clinical subgroups should help to enhance our ability to assess differences in fibrosis scores in clinical studies and improve our understanding of fibrosis progression,
基金Supported by the National Natural Science Foundation of China(21706282,21276137,61273039,61673236)Science Foundation of China University of Petroleum,Beijing(No.2462017YJRC028)the National High-tech 863 Program of China(2013AA 040702)
文摘Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world refineries. In academic studies, refinery scheduling is usually treated as an integrated, large-scale optimization problem,though such complex optimization problems are extremely difficult to solve. In this paper, we proposed a way to exploit the prior knowledge existing in refineries, and developed a decision making system to guide the scheduling process. For a real world fuel oil oriented refinery, ten adjusting process scales are predetermined. A C4.5 decision tree works based on the finished oil demand plan to classify the corresponding category(i.e. adjusting scale). Then,a specific sub-scheduling problem with respect to the determined adjusting scale is solved. The proposed strategy is demonstrated with a scheduling case originated from a real world refinery.
文摘In this work, a hardware intrusion detection system (IDS) model and its implementation are introduced to perform online real-time traffic monitoring and analysis. The introduced system gathers some advantages of many IDSs: hardware based from implementation point of view, network based from system type point of view, and anomaly detection from detection approach point of view. In addition, it can detect most of network attacks, such as denial of services (DOS), leakage, etc. from detection behavior point of view and can detect both internal and external intruders from intruder type point of view. Gathering these features in one IDS system gives lots of strengths and advantages of the work. The system is implemented by using field programmable gate array (FPGA), giving a more advantages to the system. A C5.0 decision tree classifier is used as inference engine to the system and gives a high detection ratio of 99.93%.
文摘Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.
基金Supported by Science and Technology Plan Project of Guangdong Province (2009B010900026,2009CD058,2009CD078,2009CD079,2009CD080)Special Funds for Support Program of Development of Modern Information Service Industry of Guangdong Province(06120840B0370124)Funded Fund Project of South China Agricultural University (2007K017)~~
文摘Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds of classification rules in the application,two fuzzy classifiers were established by combining with fuzzy decision algorithm especially based on Second General Soil Survey of Guangdong Province.The results of experiments demonstrated that the fuzzy classifier based on association rules obtain a higher accuracy rate,but with more complex calculation process and more computational overhead;the fuzzy classifier based on C4.5 rules obtain a slightly lower accuracy,but with fast computation and simpler calculation.
文摘Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to the C4.5 algorithm for network intrusion detection in order to detect misuse behaviors of internal attackers through effective classification and decision making in computer networks. This enhanced C4.5 algorithm derives a set of classification rules from network audit data and then the generated rules are used to detect network intrusions in a real-time environment. Unlike most existing decision tree based approaches, the spawned rules generated and fired in this work are more effective because the information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found. The main advantage of this proposed algorithm is that the generalization ability of enhanced C4.5 decision trees is better than that of C4.5 decision trees. We have employed data from the third international knowledge discovery and data mining tools competition (KDDcup’99) to train and test the feasibility of this proposed model. By applying the enhanced C4.5 algorithm an average detection rate of 93.28 percent and a false positive rate of 0.7 percent have respectively been obtained in this work.
文摘Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.
基金sponsored by the National Science and Technology Major Project(No.2011ZX05023-005-006)
文摘Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.