As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to...As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to detect, speculative execution is usually used for dealing with this problem, by simply backing up those stragglers on alternative nodes. In this paper, we design a new Speculative Execution algorithm based on C4.5 Decision Tree, SECDT, for Hadoop. In SECDT, we speculate completion time of stragglers and also of backup tasks, based on a kind of decision tree method: C4.5 decision tree. After we speculate the completion time, we compare the completion time of stragglers and of the backup tasks, calculating their differential value, and selecting the straggler with the maximum differential value to start the backup task.Experiment result shows that the SECDT can predict execution time more accurately than other speculative execution methods, hence reduce the job completion time.展开更多
[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intel...[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intelligibility of the land evaluation knowledge.[Method] The land evaluation method combining classification rule extracted by C4.5 algorithm with fuzzy decision was proposed in this study.[Result] The result of Second General Soil Survey of Guangdong Province had demonstrated that the method was convenient to extract classification rules,and by using only 100 rules,quantity correct rate 86.67% and area correct rate 84.80% of land evaluation could be obtained.[Conclusions] The use of C4.5 algorithm to obtain the rules,combined with fuzzy decision algorithm to build classifiers had got satisfactory results,which provided a practical algorithm for the land evaluation.展开更多
Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the betteri...Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.展开更多
Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting mo...Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.展开更多
AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with d...AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with data from 261 patients with chronic hepatitis C without a liver biopsy. The FibroTest attributes of age, gender, bilirubin, apolipoprotein, haptoglobin, α2 macroglobulin, and γ-glutamyl transpeptidase were used as predictors, and the FibroTest score as the target. For testing, a 10-fold cross validation was used.RESULTS: The overall classification error was 14.9% (accuracy 85.1%). FibroTest's cases with true scores of FO and F4 were classified with very high accuracy (18/20 for FO, 9/9 for FO-1 and 92/96 for F4) and the largest confusion centered on F3. The algorithm produced a set of compound rules out of the ten classification trees and was used to classify the 261 patients. The rules for the classification of patients in FO and F4 were effective in more than 75% of the cases in which they were tested.CONCLUSION: The recognition of clinical subgroups should help to enhance our ability to assess differences in fibrosis scores in clinical studies and improve our understanding of fibrosis progression,展开更多
Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world...Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world refineries. In academic studies, refinery scheduling is usually treated as an integrated, large-scale optimization problem,though such complex optimization problems are extremely difficult to solve. In this paper, we proposed a way to exploit the prior knowledge existing in refineries, and developed a decision making system to guide the scheduling process. For a real world fuel oil oriented refinery, ten adjusting process scales are predetermined. A C4.5 decision tree works based on the finished oil demand plan to classify the corresponding category(i.e. adjusting scale). Then,a specific sub-scheduling problem with respect to the determined adjusting scale is solved. The proposed strategy is demonstrated with a scheduling case originated from a real world refinery.展开更多
Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to...Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to the C4.5 algorithm for network intrusion detection in order to detect misuse behaviors of internal attackers through effective classification and decision making in computer networks. This enhanced C4.5 algorithm derives a set of classification rules from network audit data and then the generated rules are used to detect network intrusions in a real-time environment. Unlike most existing decision tree based approaches, the spawned rules generated and fired in this work are more effective because the information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found. The main advantage of this proposed algorithm is that the generalization ability of enhanced C4.5 decision trees is better than that of C4.5 decision trees. We have employed data from the third international knowledge discovery and data mining tools competition (KDDcup’99) to train and test the feasibility of this proposed model. By applying the enhanced C4.5 algorithm an average detection rate of 93.28 percent and a false positive rate of 0.7 percent have respectively been obtained in this work.展开更多
Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds...Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds of classification rules in the application,two fuzzy classifiers were established by combining with fuzzy decision algorithm especially based on Second General Soil Survey of Guangdong Province.The results of experiments demonstrated that the fuzzy classifier based on association rules obtain a higher accuracy rate,but with more complex calculation process and more computational overhead;the fuzzy classifier based on C4.5 rules obtain a slightly lower accuracy,but with fast computation and simpler calculation.展开更多
Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construct...Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.展开更多
Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss abo...Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss about multi-document summarization that differs from the single one in which the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Since the number and variety of online medical news make them difficult for experts in the medical field to read all of the medical news, an automatic multi-document summarization can be useful for easy study of information on the web. Hence we propose a new approach based on machine learning meta-learner algorithm called AdaBoost that is used for summarization. We treat a document as a set of sentences, and the learning algorithm must learn to classify as positive or negative examples of sentences based on the score of the sentences. For this learning task, we apply AdaBoost meta-learning algorithm where a C4.5 decision tree has been chosen as the base learner. In our experiment, we use 450 pieces of news that are downloaded from different medical websites. Then we compare our results with some existing approaches.展开更多
Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play...Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.展开更多
Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the mana...Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: 170,246 outpatient data was extracted from the hospital management information system (HIS) during January 2016 to July 2016, 43,448 data was formed after the data cleaning. K-Means clustering algorithm was used to classify patients with chronic infectious diseases, and then C5.0 decision tree algorithm was used to predict the situation of patients with chronic infectious diseases. Results: Male patients accounted for 58.7%, patients living in Shanghai accounted for 85.6%. The average age of patients is 45.88 years old, the high incidence age is 25 to 65 years old. Patients was gathered into three categories: 1) Clusters 1—Important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95);2) Clustering 2—Major patients (23,103, 53.2%, R = 5.22, F = 3.45, M = 9146.39);3) Cluster 3—Potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). C5.0 decision tree algorithm was used to predict the treatment situation of patients with chronic infectious diseases, the final treatment time (weeks) is an important predictor, the accuracy rate is 99.94% verified by the confusion model. Conclusion: Medical institutions should strengthen the adherence education for patients with chronic infectious diseases, establish the chronic infectious diseases and customer relationship management database, take the initiative to help them improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information, establish the chronic infectious disease database, strengthen the blocking of mother-to-child transmission, to effectively curb chronic infectious diseases, reduce disease burden and mortality.展开更多
Classification can be regarded as dividing the data space into decision regions separated by decision boundaries.In this paper we analyze decision tree algorithms and the NBTree algorithm from this perspective.Thus,a ...Classification can be regarded as dividing the data space into decision regions separated by decision boundaries.In this paper we analyze decision tree algorithms and the NBTree algorithm from this perspective.Thus,a decision tree can be regarded as a classifier tree,in which each classifier on a non-root node is trained in decision regions of the classifier on the parent node.Meanwhile,the NBTree algorithm,which generates a classifier tree with the C4.5 algorithm and the naive Bayes classifier as the root and leaf classifiers respectively,can also be regarded as training naive Bayes classifiers in decision regions of the C4.5 algorithm.We propose a second division (SD) algorithm and three soft second division (SD-soft) algorithms to train classifiers in decision regions of the naive Bayes classifier.These four novel algorithms all generate two-level classifier trees with the naive Bayes classifier as root classifiers.The SD and three SD-soft algorithms can make good use of both the information contained in instances near decision boundaries,and those that may be ignored by the naive Bayes classifier.Finally,we conduct experiments on 30 data sets from the UC Irvine (UCI) repository.Experiment results show that the SD algorithm can obtain better generali-zation abilities than the NBTree and the averaged one-dependence estimators (AODE) algorithms when using the C4.5 algorithm and support vector machine (SVM) as leaf classifiers.Further experiments indicate that our three SD-soft algorithms can achieve better generalization abilities than the SD algorithm when argument values are selected appropriately.展开更多
Education is the base of the survival and growth of any state,but due to resource scarcity,students,particularly at the university level,are forced into a difficult situation.Scholarships are the most significant fina...Education is the base of the survival and growth of any state,but due to resource scarcity,students,particularly at the university level,are forced into a difficult situation.Scholarships are the most significant financial aid mechanisms developed to overcome such obstacles and assist the students in continuing with their higher studies.In this study,the convoluted situation of scholarship eligibility criteria,including parental income,responsibilities,and academic achievements,is addressed.In an attempt to maximize the scholarship selection process,numerous machine learning algorithms,including Support Vector Machines,Neural Networks,K-Nearest Neighbors,and the C4.5 algorithm,were applied.The C4.5 algorithm,owing to its efficiency in the prediction of scholarship beneficiaries based on extraneous factors,was capable of predicting a phenomenal 95.62%of predictions using extensive data of a well-esteemed government sector university from Pakistan.This percentage is 4%and 15%better than the remainder of the methods tested,and it depicts the extent of the potential for the technique to enhance the scholarship selection process.The Decision Support Systems(DSS)would not only save the administrative cost but would also create a fair and transparent process in place.In a world where accessibility to education is the key,this research provides data-oriented consolidation to ensure that deserving students are helped and allowed to get the financial assistance that they need to reach higher studies and bridge the gap between the demands of the day and the institutions of intellect.展开更多
文摘As a distributed computing platform, Hadoop provides an effective way to handle big data. In Hadoop, the completion time of job will be delayed by a straggler. Although the definitive cause of the straggler is hard to detect, speculative execution is usually used for dealing with this problem, by simply backing up those stragglers on alternative nodes. In this paper, we design a new Speculative Execution algorithm based on C4.5 Decision Tree, SECDT, for Hadoop. In SECDT, we speculate completion time of stragglers and also of backup tasks, based on a kind of decision tree method: C4.5 decision tree. After we speculate the completion time, we compare the completion time of stragglers and of the backup tasks, calculating their differential value, and selecting the straggler with the maximum differential value to start the backup task.Experiment result shows that the SECDT can predict execution time more accurately than other speculative execution methods, hence reduce the job completion time.
基金Supported by Science and Technology Plan Project of Guangdong Province (2009B010900026,2009CD058,2009CD078,2009CD079,2009CD080)Special Funds for Support Program of Development of Modern Information Service Industry of Guangdong Province(06120840B0370124 )Fund Project of South China Agricultural University (2007K017)~~
文摘[Objective]The aim was to overcome the shortage of being difficult to build land evaluation model when the impact factors had continuous value in the traditional land evaluation process,as well as to improve the intelligibility of the land evaluation knowledge.[Method] The land evaluation method combining classification rule extracted by C4.5 algorithm with fuzzy decision was proposed in this study.[Result] The result of Second General Soil Survey of Guangdong Province had demonstrated that the method was convenient to extract classification rules,and by using only 100 rules,quantity correct rate 86.67% and area correct rate 84.80% of land evaluation could be obtained.[Conclusions] The use of C4.5 algorithm to obtain the rules,combined with fuzzy decision algorithm to build classifiers had got satisfactory results,which provided a practical algorithm for the land evaluation.
文摘Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.
基金Supported by Science and Technology Plan of Mudanjiang City (G200920064)Teaching Reform Construction of Mudanjiang Normal University (10-xj11080)
文摘Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.
基金Supported by A grant of the Universidad Nacional Autonoma de Mexico SDI.PTID.05.6
文摘AIM: To assess the usefulness of FibroTest to forecast scores by constructing decision trees in patients with chronic hepatitis C.METHODS: We used the C4.5 classification algorithm to construct decision trees with data from 261 patients with chronic hepatitis C without a liver biopsy. The FibroTest attributes of age, gender, bilirubin, apolipoprotein, haptoglobin, α2 macroglobulin, and γ-glutamyl transpeptidase were used as predictors, and the FibroTest score as the target. For testing, a 10-fold cross validation was used.RESULTS: The overall classification error was 14.9% (accuracy 85.1%). FibroTest's cases with true scores of FO and F4 were classified with very high accuracy (18/20 for FO, 9/9 for FO-1 and 92/96 for F4) and the largest confusion centered on F3. The algorithm produced a set of compound rules out of the ten classification trees and was used to classify the 261 patients. The rules for the classification of patients in FO and F4 were effective in more than 75% of the cases in which they were tested.CONCLUSION: The recognition of clinical subgroups should help to enhance our ability to assess differences in fibrosis scores in clinical studies and improve our understanding of fibrosis progression,
基金Supported by the National Natural Science Foundation of China(21706282,21276137,61273039,61673236)Science Foundation of China University of Petroleum,Beijing(No.2462017YJRC028)the National High-tech 863 Program of China(2013AA 040702)
文摘Refinery scheduling attracts increasing concerns in both academic and industrial communities in recent years.However, due to the complexity of refinery processes, little has been reported for success use in real world refineries. In academic studies, refinery scheduling is usually treated as an integrated, large-scale optimization problem,though such complex optimization problems are extremely difficult to solve. In this paper, we proposed a way to exploit the prior knowledge existing in refineries, and developed a decision making system to guide the scheduling process. For a real world fuel oil oriented refinery, ten adjusting process scales are predetermined. A C4.5 decision tree works based on the finished oil demand plan to classify the corresponding category(i.e. adjusting scale). Then,a specific sub-scheduling problem with respect to the determined adjusting scale is solved. The proposed strategy is demonstrated with a scheduling case originated from a real world refinery.
文摘Intrusion detection systems provide additional defense capacity to a networked information system in addition to the security measures provided by the firewalls. This paper proposes an active rule based enhancement to the C4.5 algorithm for network intrusion detection in order to detect misuse behaviors of internal attackers through effective classification and decision making in computer networks. This enhanced C4.5 algorithm derives a set of classification rules from network audit data and then the generated rules are used to detect network intrusions in a real-time environment. Unlike most existing decision tree based approaches, the spawned rules generated and fired in this work are more effective because the information-theoretic approach minimizes the expected number of tests needed to classify an object and guarantees that a simple (but not necessarily the simplest) tree is found. The main advantage of this proposed algorithm is that the generalization ability of enhanced C4.5 decision trees is better than that of C4.5 decision trees. We have employed data from the third international knowledge discovery and data mining tools competition (KDDcup’99) to train and test the feasibility of this proposed model. By applying the enhanced C4.5 algorithm an average detection rate of 93.28 percent and a false positive rate of 0.7 percent have respectively been obtained in this work.
基金Supported by Science and Technology Plan Project of Guangdong Province (2009B010900026,2009CD058,2009CD078,2009CD079,2009CD080)Special Funds for Support Program of Development of Modern Information Service Industry of Guangdong Province(06120840B0370124)Funded Fund Project of South China Agricultural University (2007K017)~~
文摘Association rules and C4.5 rules can overcome the shortage of the traditional land evaluation methods and improve the intelligibility and efficiency of the land evaluation knowledge.In order to compare these two kinds of classification rules in the application,two fuzzy classifiers were established by combining with fuzzy decision algorithm especially based on Second General Soil Survey of Guangdong Province.The results of experiments demonstrated that the fuzzy classifier based on association rules obtain a higher accuracy rate,but with more complex calculation process and more computational overhead;the fuzzy classifier based on C4.5 rules obtain a slightly lower accuracy,but with fast computation and simpler calculation.
文摘Classification is an important machine learning problem, and decision tree construction algorithms are an important class of solutions to this problem. RainForest is a scalable way to implement decision tree construction algorithms. It consists of several algorithms, of which the best one is a hybrid between a traditional recursive implementation and an iterative implementation which uses more memory but involves less write operations. We propose an optimized algorithm inspired by RainForest. By using a more sophisticated switching criterion between the two algorithms, we are able to get a performance gain even when all statistical information fits in memory. Evaluations show that our method can achieve a performance boost of 2.8 times in average than the traditional recursive implementation.
文摘Automatic text summarization involves reducing a text document or a larger corpus of multiple documents to a short set of sentences or paragraphs that convey the main meaning of the text. In this paper, we discuss about multi-document summarization that differs from the single one in which the issues of compression, speed, redundancy and passage selection are critical in the formation of useful summaries. Since the number and variety of online medical news make them difficult for experts in the medical field to read all of the medical news, an automatic multi-document summarization can be useful for easy study of information on the web. Hence we propose a new approach based on machine learning meta-learner algorithm called AdaBoost that is used for summarization. We treat a document as a set of sentences, and the learning algorithm must learn to classify as positive or negative examples of sentences based on the score of the sentences. For this learning task, we apply AdaBoost meta-learning algorithm where a C4.5 decision tree has been chosen as the base learner. In our experiment, we use 450 pieces of news that are downloaded from different medical websites. Then we compare our results with some existing approaches.
基金sponsored by the National Science and Technology Major Project(No.2011ZX05023-005-006)
文摘Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.
文摘Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: 170,246 outpatient data was extracted from the hospital management information system (HIS) during January 2016 to July 2016, 43,448 data was formed after the data cleaning. K-Means clustering algorithm was used to classify patients with chronic infectious diseases, and then C5.0 decision tree algorithm was used to predict the situation of patients with chronic infectious diseases. Results: Male patients accounted for 58.7%, patients living in Shanghai accounted for 85.6%. The average age of patients is 45.88 years old, the high incidence age is 25 to 65 years old. Patients was gathered into three categories: 1) Clusters 1—Important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95);2) Clustering 2—Major patients (23,103, 53.2%, R = 5.22, F = 3.45, M = 9146.39);3) Cluster 3—Potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). C5.0 decision tree algorithm was used to predict the treatment situation of patients with chronic infectious diseases, the final treatment time (weeks) is an important predictor, the accuracy rate is 99.94% verified by the confusion model. Conclusion: Medical institutions should strengthen the adherence education for patients with chronic infectious diseases, establish the chronic infectious diseases and customer relationship management database, take the initiative to help them improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information, establish the chronic infectious disease database, strengthen the blocking of mother-to-child transmission, to effectively curb chronic infectious diseases, reduce disease burden and mortality.
基金supported by the National Natural Science Foundation of China (No.60970081)the National Basic Research Program (973) of China (No.2010CB327903)
文摘Classification can be regarded as dividing the data space into decision regions separated by decision boundaries.In this paper we analyze decision tree algorithms and the NBTree algorithm from this perspective.Thus,a decision tree can be regarded as a classifier tree,in which each classifier on a non-root node is trained in decision regions of the classifier on the parent node.Meanwhile,the NBTree algorithm,which generates a classifier tree with the C4.5 algorithm and the naive Bayes classifier as the root and leaf classifiers respectively,can also be regarded as training naive Bayes classifiers in decision regions of the C4.5 algorithm.We propose a second division (SD) algorithm and three soft second division (SD-soft) algorithms to train classifiers in decision regions of the naive Bayes classifier.These four novel algorithms all generate two-level classifier trees with the naive Bayes classifier as root classifiers.The SD and three SD-soft algorithms can make good use of both the information contained in instances near decision boundaries,and those that may be ignored by the naive Bayes classifier.Finally,we conduct experiments on 30 data sets from the UC Irvine (UCI) repository.Experiment results show that the SD algorithm can obtain better generali-zation abilities than the NBTree and the averaged one-dependence estimators (AODE) algorithms when using the C4.5 algorithm and support vector machine (SVM) as leaf classifiers.Further experiments indicate that our three SD-soft algorithms can achieve better generalization abilities than the SD algorithm when argument values are selected appropriately.
文摘Education is the base of the survival and growth of any state,but due to resource scarcity,students,particularly at the university level,are forced into a difficult situation.Scholarships are the most significant financial aid mechanisms developed to overcome such obstacles and assist the students in continuing with their higher studies.In this study,the convoluted situation of scholarship eligibility criteria,including parental income,responsibilities,and academic achievements,is addressed.In an attempt to maximize the scholarship selection process,numerous machine learning algorithms,including Support Vector Machines,Neural Networks,K-Nearest Neighbors,and the C4.5 algorithm,were applied.The C4.5 algorithm,owing to its efficiency in the prediction of scholarship beneficiaries based on extraneous factors,was capable of predicting a phenomenal 95.62%of predictions using extensive data of a well-esteemed government sector university from Pakistan.This percentage is 4%and 15%better than the remainder of the methods tested,and it depicts the extent of the potential for the technique to enhance the scholarship selection process.The Decision Support Systems(DSS)would not only save the administrative cost but would also create a fair and transparent process in place.In a world where accessibility to education is the key,this research provides data-oriented consolidation to ensure that deserving students are helped and allowed to get the financial assistance that they need to reach higher studies and bridge the gap between the demands of the day and the institutions of intellect.