Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of informatio...Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of information that help</span><span style="font-family:Verdana;"><span style="font-family:Verdana;">s</span></span><span style="font-family:Verdana;"> to market the appropriate products at the appropriate time. Moreover, services are considered recently as products. The development of education and health services </span><span style="font-family:Verdana;"><span style="font-family:Verdana;">is</span></span><span style="font-family:Verdana;"> depending on historical data. For the more, reducing online social media networks problems and crimes need a significant source of information. Data analysts need to use an efficient classification algorithm to predict the future of such businesses. However, dealing with a huge quantity of data requires great time to process. Data mining involves many useful techniques that are used to predict statistical data in a variety of business applications. The classification technique is one of the most widely used with a variety of algorithms. In this paper, various classification algorithms are revised in terms of accuracy in different areas of data mining applications. A comprehensive analysis is made after delegated reading of 20 papers in the literature. This paper aims to help data analysts to choose the most suitable classification algorithm for different business applications including business in general, online social media networks, agriculture, health, and education. Results show FFBPN is the most accurate algorithm in the business domain. The Random Forest algorithm is the most accurate in classifying online social networks (OSN) activities. Na<span style="white-space:nowrap;">ï</span>ve Bayes algorithm is the most accurate to classify agriculture datasets. OneR is the most accurate algorithm to classify instances within the health domain. The C4.5 Decision Tree algorithm is the most accurate to classify students’ records to predict degree completion time.展开更多
Two important performance indicators for data mining algorithms are accuracy of classification/ prediction and time taken for training. These indicators are useful for selecting best algorithms for classification/pred...Two important performance indicators for data mining algorithms are accuracy of classification/ prediction and time taken for training. These indicators are useful for selecting best algorithms for classification/prediction tasks in data mining. Empirical studies on these performance indicators in data mining are few. Therefore, this study was designed to determine how data mining classification algorithm perform with increase in input data sizes. Three data mining classification algorithms—Decision Tree, Multi-Layer Perceptron (MLP) Neural Network and Naïve Bayes— were subjected to varying simulated data sizes. The time taken by the algorithms for trainings and accuracies of their classifications were analyzed for the different data sizes. Results show that Naïve Bayes takes least time to train data but with least accuracy as compared to MLP and Decision Tree algorithms.展开更多
In this research article, we analyze the multimedia data mining and classification algorithm based on database optimization techniques. Of high performance application requirements of various kinds are springing up co...In this research article, we analyze the multimedia data mining and classification algorithm based on database optimization techniques. Of high performance application requirements of various kinds are springing up constantly makes parallel computer system structure is valued by more and more common but the corresponding software system development lags far behind the development of the hardware system, it is more obvious in the field of database technology application. Multimedia mining is different from the low level of computer multimedia processing technology and the former focuses on the extracted from huge multimedia collection mode which focused on specific features of understanding or extraction from a single multimedia objects. Our research provides new paradigm for the methodology which will be meaningful and necessary.展开更多
Tuberculosis remains an important problem in public health that threatens the world, including the Philippines. Treatment relapse continues to place a severe problem on patients and TB programs worldwide. A significan...Tuberculosis remains an important problem in public health that threatens the world, including the Philippines. Treatment relapse continues to place a severe problem on patients and TB programs worldwide. A significant reason for the development of decline is poor compliance with medical treatments. The objectives of this research are to generate a predictive data mining model to classify the treatment relapse of TB patients and to identify the features influencing the category of treatment relapse. The TB patient dataset is applied and tested in decision tree J48 algorithm using WEKA. The J48 model identified the three (3) significant independent variables (DSSM Result, Age, and Sex) as predictors of category treatment relapse.展开更多
Big data is usually unstructured, and many applications require theanalysis in real-time. Decision tree (DT) algorithm is widely used to analyzebig data. Selecting the optimal depth of DT is time-consuming process as ...Big data is usually unstructured, and many applications require theanalysis in real-time. Decision tree (DT) algorithm is widely used to analyzebig data. Selecting the optimal depth of DT is time-consuming process as itrequires many iterations. In this paper, we have designed a modified versionof a (DT). The tree aims to achieve optimal depth by self-tuning runningparameters and improving the accuracy. The efficiency of the modified (DT)was verified using two datasets (airport and fire datasets). The airport datasethas 500000 instances and the fire dataset has 600000 instances. A comparisonhas been made between the modified (DT) and standard (DT) with resultsshowing that the modified performs better. This comparison was conductedon multi-node on Apache Spark tool using Amazon web services. Resultingin accuracy with an increase of 6.85% for the first dataset and 8.85% for theairport dataset. In conclusion, the modified DT showed better accuracy inhandling different-sized datasets compared to standard DT algorithm.展开更多
A new classification algorithm for web mining is proposed on the basis of general classification algorithm for data mining in order to implement personalized information services. The building tree method of detecting...A new classification algorithm for web mining is proposed on the basis of general classification algorithm for data mining in order to implement personalized information services. The building tree method of detecting class threshold is used for construction of decision tree according to the concept of user expectation so as to find classification rules in different layers. Compared with the traditional C4.5 algorithm, the disadvantage of excessive adaptation in C4.5 has been improved so that classification results not only have much higher accuracy but also statistic meaning.展开更多
Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play...Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.展开更多
Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the betteri...Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.展开更多
Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting mo...Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.展开更多
Background:Given the importance of customers as the most valuable assets of organizations,customer retention seems to be an essential,basic requirement for any organization.Banks are no exception to this rule.The comp...Background:Given the importance of customers as the most valuable assets of organizations,customer retention seems to be an essential,basic requirement for any organization.Banks are no exception to this rule.The competitive atmosphere within which electronic banking services are provided by different banks increases the necessity of customer retention.Methods:Being based on existing information technologies which allow one to collect data from organizations’databases,data mining introduces a powerful tool for the extraction of knowledge from huge amounts of data.In this research,the decision tree technique was applied to build a model incorporating this knowledge.Results:The results represent the characteristics of churned customers.Conclusions:Bank managers can identify churners in future using the results of decision tree.They should be provide some strategies for customers whose features are getting more likely to churner’s features.展开更多
This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when there are equal proportions of the...This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when there are equal proportions of the target class outcome in the leaf node's records that leads to a situation where majority voting cannot be applied. To solve the above mentioned exception, we propose to base the prediction of the result on the naive Bayes (NB) estimate, k-nearest neighbour (k-NN) and association rule mining (ARM). The other features used for splitting the parent nodes are also taken into consideration.展开更多
Sheet metal is widely used on auto-bodies, plane-bodies and metal furniture, etc. For instance, a typical auto-body commonly consists of hundreds of sheet metal stamping parts. Because of its complexity of structure a...Sheet metal is widely used on auto-bodies, plane-bodies and metal furniture, etc. For instance, a typical auto-body commonly consists of hundreds of sheet metal stamping parts. Because of its complexity of structure and manufacturing process, auto-bodies inevitably have geometrical variation results from a number of different sources, such as the geometrical variation of stamping parts, the transformation of assembly process parameters and even the improper design concept. As more than 30% quality defects of an auto-body are born from the dimensional deviation of Body-In-White originated during the manufacturing process, effective diagnosis and control of dimensional faults are essential to the continuous improvement of the quality of vehicles. Especially during the period of new car launching or model changing when the assembly process was changed and adjusted frequently. For continuously improving the quality of modern cars, rapid dimensional variation causes identification becomes a challenging but essential work. In this paper, main variation causes of auto-body was firstly been cataloged and analyzed, then, a dimensional variation diagnostic reasoning and decision approach was developed through the combination of data mining and knowledge discovery techniques. This approach is driven by variation pattern identification which can be discovered from the dispersive, isolated massive measured data: Correlation Analysis (CA) and Maximal Tree (MT) methods were applied to extract the large variation group from massive multidimensional measured data, while multivariate statistical analysis (MSA) approach was used to discovery the principle variation pattern. A Decision Tree (DT) approach based on the knowledge of product and assembly process was developed to fulfill the "Hypothesis and Validation" characterized variation causes reasoning procedure. An practical application case with sudden and severe dimension variation on rear end panel in up/down direction was analyzed and successfully solved aided by the devloped variation diagnostic method, which have proved that the approach is effective and efficient.展开更多
To make business policy, market analysis, corporate decision, fraud detection, etc., we have to analyze and work with huge amount of data. Generally, such data are taken from different sources. Researchers are using d...To make business policy, market analysis, corporate decision, fraud detection, etc., we have to analyze and work with huge amount of data. Generally, such data are taken from different sources. Researchers are using data mining to perform such tasks. Data mining techniques are used to find hidden information from large data source. Data mining is using for various fields: Artificial intelligence, Bank, health and medical, corruption, legal issues, corporate business, marketing, etc. Special interest is given to associate rules, data mining algorithms, decision tree and distributed approach. Data is becoming larger and spreading geographically. So it is difficult to find better result from only a central data source. For knowledge discovery, we have to work with distributed database. On the other hand, security and privacy considerations are also another factor for de-motivation of working with centralized data. For this reason, distributed database is essential for future processing. In this paper, we have proposed a framework to study data mining in distributed environment. The paper presents a framework to bring out actionable knowledge. We have shown some level by which we can generate actionable knowledge. Possible tools and technique for these levels are discussed.展开更多
Corporations focus on web based education to train their employees ever more than before. Unlike traditional learning environments, web based education applications store large amount of data. This growing availabilit...Corporations focus on web based education to train their employees ever more than before. Unlike traditional learning environments, web based education applications store large amount of data. This growing availability of data stimulated the emergence of a new field called educational data mining. In this study, the classification method is implemented on a data that is obtained from a company which uses web based education to train their employees. The authors' aim is to find out the most critical factors that influence the users' success. For the classification of the data, two decision tree algorithms, Classification and Regression Tree (CART) and Quick, Unbiased and Efficient Statistical Tree (QUEST) are applied. According to the results, assurance of a certificate at the end of the training is found to be the most critical factor that influences the users' success. Position, number of work years and the education level of the user, are also found as important factors.展开更多
There is still no effective means to analyze in depth and utilize domestic mass data about agricultural product quality safety tests in china now. The neural network algorithm, the classification regression tree algor...There is still no effective means to analyze in depth and utilize domestic mass data about agricultural product quality safety tests in china now. The neural network algorithm, the classification regression tree algorithm, the Bayesian network algorithm were selected according to the principle of selecting combination model and were used to build models respectively and then combined, innovatively establishing a combination model which has relatively high precision, strong robustness and better explanatory to predict the results of perishable food transportation meta-morphism monitoring. The relative optimal prediction model of the perishable food transportation metamorphism monitoring system could be got. The relative perfect prediction model can guide the actual sampling work about food quality and safety by prognosticating the occurrence of unqualified food to select the typical and effective samples for test, thus improving the efficiency and effectiveness of sampling work effectively, so as to avoid deteriorated perishable food’s approaching the market to ensure the quality and safety of perishable food transportation. A solid protective wall was built in the protection of general perishable food consumers’ health.展开更多
Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the mana...Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: 170,246 outpatient data was extracted from the hospital management information system (HIS) during January 2016 to July 2016, 43,448 data was formed after the data cleaning. K-Means clustering algorithm was used to classify patients with chronic infectious diseases, and then C5.0 decision tree algorithm was used to predict the situation of patients with chronic infectious diseases. Results: Male patients accounted for 58.7%, patients living in Shanghai accounted for 85.6%. The average age of patients is 45.88 years old, the high incidence age is 25 to 65 years old. Patients was gathered into three categories: 1) Clusters 1—Important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95);2) Clustering 2—Major patients (23,103, 53.2%, R = 5.22, F = 3.45, M = 9146.39);3) Cluster 3—Potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). C5.0 decision tree algorithm was used to predict the treatment situation of patients with chronic infectious diseases, the final treatment time (weeks) is an important predictor, the accuracy rate is 99.94% verified by the confusion model. Conclusion: Medical institutions should strengthen the adherence education for patients with chronic infectious diseases, establish the chronic infectious diseases and customer relationship management database, take the initiative to help them improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information, establish the chronic infectious disease database, strengthen the blocking of mother-to-child transmission, to effectively curb chronic infectious diseases, reduce disease burden and mortality.展开更多
Information disclosure can reduce information asymmetry between health care providers and patients, thus improving both patient safety and medical quality. The National Bureau of Health Insurance (NBHI) inTaiwancurren...Information disclosure can reduce information asymmetry between health care providers and patients, thus improving both patient safety and medical quality. The National Bureau of Health Insurance (NBHI) inTaiwancurrently publishes health-related information online in order to enhance service efficiency and enable the public to monitor the country’s medical system. A data mining technique, classification and regression tree (CART), is used in this work to investigate online public quality information to compare the characteristics of hospital. The hospital quality indicators and characteristics data are available on the websites of the NBHI (http://www.nhi.gov.tw/AmountInfoWeb/Index.aspx) and the Department of Health (http://www.doh.gov.tw/). The full classification and regression tree presented in this work, grown using the hospitals’ quality medical indicators and characteristic values, classifies all hospitals into seven groups. The rate of stays longer than 30 days, which is the dependent variable in this study, is most influenced by the number of medical staff. This reflects the fact that the fewer medical staffs that are employed, the smaller the hospital is, and patients who are likely to have longer stays tend to go to the medium or large hospitals. Policy makers should work to decrease or eliminate persistent healthcare disparities among different socioeconomic groups and offer more online healthrelated services to reduce information asymmetry between health care providers and patients.展开更多
Trauma is the most common cause of death to young people and many of these deaths are preventable [1]. The prediction of trauma patients outcome was a difficult problem to investigate till present times. In this study...Trauma is the most common cause of death to young people and many of these deaths are preventable [1]. The prediction of trauma patients outcome was a difficult problem to investigate till present times. In this study, prediction models are built and their capabilities to accurately predict the mortality are assessed. The analysis includes a comparison of data mining techniques using classification, clustering and association algorithms. Data were collected by Hellenic Trauma and Emergency Surgery Society from 30 Greek hospitals. Dataset contains records of 8544 patients suffering from severe injuries collected from the year 2005 to 2006. Factors include patients' demographic elements and several other variables registered from the time and place of accident until the hospital treatment and final outcome. Using this analysis the obtained results are compared in terms of sensitivity, specificity, positive predictive value and negative predictive value and the ROC curve depicts these methods performance.展开更多
The increasing volume of data in the area of environmental sciences needs analysis and interpretation. Among the challenges generated by this “data deluge”, the development of efficient strategies for the knowledge ...The increasing volume of data in the area of environmental sciences needs analysis and interpretation. Among the challenges generated by this “data deluge”, the development of efficient strategies for the knowledge discovery is an important issue. Here, statistical and tools from computational intelligence are applied to analyze large data sets from meteorology and climate sciences. Our approach allows a geographical mapping of the statistical property to be easily interpreted by meteorologists. Our data analysis comprises two main steps of knowledge extraction, applied successively in order to reduce the complexity from the original data set. The goal is to identify a much smaller subset of climatic variables that might still be able to describe or even predict the probability of occurrence of an extreme event. The first step applies a class comparison technique: p-value estimation. The second step consists of a decision tree (DT) configured from the data available and the p-value analysis. The DT is used as a predictive model, identifying the most statistically significant climate variables of the precipitation intensity. The methodology is employed to the study the climatic causes of an extreme precipitation events occurred in Alagoas and Pernambuco States (Brazil) at June/2010.展开更多
文摘Many business applications rely on their historical data to predict their business future. The marketing products process is one of the core processes for the business. Customer needs give a useful piece of information that help</span><span style="font-family:Verdana;"><span style="font-family:Verdana;">s</span></span><span style="font-family:Verdana;"> to market the appropriate products at the appropriate time. Moreover, services are considered recently as products. The development of education and health services </span><span style="font-family:Verdana;"><span style="font-family:Verdana;">is</span></span><span style="font-family:Verdana;"> depending on historical data. For the more, reducing online social media networks problems and crimes need a significant source of information. Data analysts need to use an efficient classification algorithm to predict the future of such businesses. However, dealing with a huge quantity of data requires great time to process. Data mining involves many useful techniques that are used to predict statistical data in a variety of business applications. The classification technique is one of the most widely used with a variety of algorithms. In this paper, various classification algorithms are revised in terms of accuracy in different areas of data mining applications. A comprehensive analysis is made after delegated reading of 20 papers in the literature. This paper aims to help data analysts to choose the most suitable classification algorithm for different business applications including business in general, online social media networks, agriculture, health, and education. Results show FFBPN is the most accurate algorithm in the business domain. The Random Forest algorithm is the most accurate in classifying online social networks (OSN) activities. Na<span style="white-space:nowrap;">ï</span>ve Bayes algorithm is the most accurate to classify agriculture datasets. OneR is the most accurate algorithm to classify instances within the health domain. The C4.5 Decision Tree algorithm is the most accurate to classify students’ records to predict degree completion time.
文摘Two important performance indicators for data mining algorithms are accuracy of classification/ prediction and time taken for training. These indicators are useful for selecting best algorithms for classification/prediction tasks in data mining. Empirical studies on these performance indicators in data mining are few. Therefore, this study was designed to determine how data mining classification algorithm perform with increase in input data sizes. Three data mining classification algorithms—Decision Tree, Multi-Layer Perceptron (MLP) Neural Network and Naïve Bayes— were subjected to varying simulated data sizes. The time taken by the algorithms for trainings and accuracies of their classifications were analyzed for the different data sizes. Results show that Naïve Bayes takes least time to train data but with least accuracy as compared to MLP and Decision Tree algorithms.
文摘In this research article, we analyze the multimedia data mining and classification algorithm based on database optimization techniques. Of high performance application requirements of various kinds are springing up constantly makes parallel computer system structure is valued by more and more common but the corresponding software system development lags far behind the development of the hardware system, it is more obvious in the field of database technology application. Multimedia mining is different from the low level of computer multimedia processing technology and the former focuses on the extracted from huge multimedia collection mode which focused on specific features of understanding or extraction from a single multimedia objects. Our research provides new paradigm for the methodology which will be meaningful and necessary.
文摘Tuberculosis remains an important problem in public health that threatens the world, including the Philippines. Treatment relapse continues to place a severe problem on patients and TB programs worldwide. A significant reason for the development of decline is poor compliance with medical treatments. The objectives of this research are to generate a predictive data mining model to classify the treatment relapse of TB patients and to identify the features influencing the category of treatment relapse. The TB patient dataset is applied and tested in decision tree J48 algorithm using WEKA. The J48 model identified the three (3) significant independent variables (DSSM Result, Age, and Sex) as predictors of category treatment relapse.
文摘Big data is usually unstructured, and many applications require theanalysis in real-time. Decision tree (DT) algorithm is widely used to analyzebig data. Selecting the optimal depth of DT is time-consuming process as itrequires many iterations. In this paper, we have designed a modified versionof a (DT). The tree aims to achieve optimal depth by self-tuning runningparameters and improving the accuracy. The efficiency of the modified (DT)was verified using two datasets (airport and fire datasets). The airport datasethas 500000 instances and the fire dataset has 600000 instances. A comparisonhas been made between the modified (DT) and standard (DT) with resultsshowing that the modified performs better. This comparison was conductedon multi-node on Apache Spark tool using Amazon web services. Resultingin accuracy with an increase of 6.85% for the first dataset and 8.85% for theairport dataset. In conclusion, the modified DT showed better accuracy inhandling different-sized datasets compared to standard DT algorithm.
文摘A new classification algorithm for web mining is proposed on the basis of general classification algorithm for data mining in order to implement personalized information services. The building tree method of detecting class threshold is used for construction of decision tree according to the concept of user expectation so as to find classification rules in different layers. Compared with the traditional C4.5 algorithm, the disadvantage of excessive adaptation in C4.5 has been improved so that classification results not only have much higher accuracy but also statistic meaning.
基金sponsored by the National Science and Technology Major Project(No.2011ZX05023-005-006)
文摘Data mining is the process of extracting implicit but potentially useful information from incomplete, noisy, and fuzzy data. Data mining offers excellent nonlinear modeling and self-organized learning, and it can play a vital role in the interpretation of well logging data of complex reservoirs. We used data mining to identify the lithologies in a complex reservoir. The reservoir lithologies served as the classification task target and were identified using feature extraction, feature selection, and modeling of data streams. We used independent component analysis to extract information from well curves. We then used the branch-and- bound algorithm to look for the optimal feature subsets and eliminate redundant information. Finally, we used the C5.0 decision-tree algorithm to set up disaggregated models of the well logging curves. The modeling and actual logging data were in good agreement, showing the usefulness of data mining methods in complex reservoirs.
文摘Under the modern education system of China, the annual scholarship evaluation is a vital thing for many of the collegestudents. This paper adopts the classification algorithm of decision tree C4.5 based on the bettering of ID3 algorithm and constructa data set of the scholarship evaluation system through the analysis of the related attributes in scholarship evaluation information.And also having found some factors that plays a significant role in the growing up of the college students through analysis and re-search of moral education, intellectural education and culture&PE.
基金Supported by Science and Technology Plan of Mudanjiang City (G200920064)Teaching Reform Construction of Mudanjiang Normal University (10-xj11080)
文摘Based on the discuss of the basic concept of data mining technology and the decision tree method,combining with the data samples of wind and hailstorm disasters in some counties of Mudanjiang region,the forecasting model of agro-meteorological disaster grade was established by adopting the C4.5 classification algorithm of decision tree,which can forecast the direct economic loss degree to provide rational data mining model and obtain effective analysis results.
文摘Background:Given the importance of customers as the most valuable assets of organizations,customer retention seems to be an essential,basic requirement for any organization.Banks are no exception to this rule.The competitive atmosphere within which electronic banking services are provided by different banks increases the necessity of customer retention.Methods:Being based on existing information technologies which allow one to collect data from organizations’databases,data mining introduces a powerful tool for the extraction of knowledge from huge amounts of data.In this research,the decision tree technique was applied to build a model incorporating this knowledge.Results:The results represent the characteristics of churned customers.Conclusions:Bank managers can identify churners in future using the results of decision tree.They should be provide some strategies for customers whose features are getting more likely to churner’s features.
文摘This paper focuses on improving decision tree induction algorithms when a kind of tie appears during the rule generation procedure for specific training datasets. The tie occurs when there are equal proportions of the target class outcome in the leaf node's records that leads to a situation where majority voting cannot be applied. To solve the above mentioned exception, we propose to base the prediction of the result on the naive Bayes (NB) estimate, k-nearest neighbour (k-NN) and association rule mining (ARM). The other features used for splitting the parent nodes are also taken into consideration.
文摘Sheet metal is widely used on auto-bodies, plane-bodies and metal furniture, etc. For instance, a typical auto-body commonly consists of hundreds of sheet metal stamping parts. Because of its complexity of structure and manufacturing process, auto-bodies inevitably have geometrical variation results from a number of different sources, such as the geometrical variation of stamping parts, the transformation of assembly process parameters and even the improper design concept. As more than 30% quality defects of an auto-body are born from the dimensional deviation of Body-In-White originated during the manufacturing process, effective diagnosis and control of dimensional faults are essential to the continuous improvement of the quality of vehicles. Especially during the period of new car launching or model changing when the assembly process was changed and adjusted frequently. For continuously improving the quality of modern cars, rapid dimensional variation causes identification becomes a challenging but essential work. In this paper, main variation causes of auto-body was firstly been cataloged and analyzed, then, a dimensional variation diagnostic reasoning and decision approach was developed through the combination of data mining and knowledge discovery techniques. This approach is driven by variation pattern identification which can be discovered from the dispersive, isolated massive measured data: Correlation Analysis (CA) and Maximal Tree (MT) methods were applied to extract the large variation group from massive multidimensional measured data, while multivariate statistical analysis (MSA) approach was used to discovery the principle variation pattern. A Decision Tree (DT) approach based on the knowledge of product and assembly process was developed to fulfill the "Hypothesis and Validation" characterized variation causes reasoning procedure. An practical application case with sudden and severe dimension variation on rear end panel in up/down direction was analyzed and successfully solved aided by the devloped variation diagnostic method, which have proved that the approach is effective and efficient.
文摘To make business policy, market analysis, corporate decision, fraud detection, etc., we have to analyze and work with huge amount of data. Generally, such data are taken from different sources. Researchers are using data mining to perform such tasks. Data mining techniques are used to find hidden information from large data source. Data mining is using for various fields: Artificial intelligence, Bank, health and medical, corruption, legal issues, corporate business, marketing, etc. Special interest is given to associate rules, data mining algorithms, decision tree and distributed approach. Data is becoming larger and spreading geographically. So it is difficult to find better result from only a central data source. For knowledge discovery, we have to work with distributed database. On the other hand, security and privacy considerations are also another factor for de-motivation of working with centralized data. For this reason, distributed database is essential for future processing. In this paper, we have proposed a framework to study data mining in distributed environment. The paper presents a framework to bring out actionable knowledge. We have shown some level by which we can generate actionable knowledge. Possible tools and technique for these levels are discussed.
文摘Corporations focus on web based education to train their employees ever more than before. Unlike traditional learning environments, web based education applications store large amount of data. This growing availability of data stimulated the emergence of a new field called educational data mining. In this study, the classification method is implemented on a data that is obtained from a company which uses web based education to train their employees. The authors' aim is to find out the most critical factors that influence the users' success. For the classification of the data, two decision tree algorithms, Classification and Regression Tree (CART) and Quick, Unbiased and Efficient Statistical Tree (QUEST) are applied. According to the results, assurance of a certificate at the end of the training is found to be the most critical factor that influences the users' success. Position, number of work years and the education level of the user, are also found as important factors.
文摘There is still no effective means to analyze in depth and utilize domestic mass data about agricultural product quality safety tests in china now. The neural network algorithm, the classification regression tree algorithm, the Bayesian network algorithm were selected according to the principle of selecting combination model and were used to build models respectively and then combined, innovatively establishing a combination model which has relatively high precision, strong robustness and better explanatory to predict the results of perishable food transportation meta-morphism monitoring. The relative optimal prediction model of the perishable food transportation metamorphism monitoring system could be got. The relative perfect prediction model can guide the actual sampling work about food quality and safety by prognosticating the occurrence of unqualified food to select the typical and effective samples for test, thus improving the efficiency and effectiveness of sampling work effectively, so as to avoid deteriorated perishable food’s approaching the market to ensure the quality and safety of perishable food transportation. A solid protective wall was built in the protection of general perishable food consumers’ health.
文摘Objective: According to RFM model theory of customer relationship management, data mining technology was used to group the chronic infectious disease patients to explore the effect of customer segmentation on the management of patients with different characteristics. Methods: 170,246 outpatient data was extracted from the hospital management information system (HIS) during January 2016 to July 2016, 43,448 data was formed after the data cleaning. K-Means clustering algorithm was used to classify patients with chronic infectious diseases, and then C5.0 decision tree algorithm was used to predict the situation of patients with chronic infectious diseases. Results: Male patients accounted for 58.7%, patients living in Shanghai accounted for 85.6%. The average age of patients is 45.88 years old, the high incidence age is 25 to 65 years old. Patients was gathered into three categories: 1) Clusters 1—Important patients (4786 people, 11.72%, R = 2.89, F = 11.72, M = 84,302.95);2) Clustering 2—Major patients (23,103, 53.2%, R = 5.22, F = 3.45, M = 9146.39);3) Cluster 3—Potential patients (15,559 people, 35.8%, R = 19.77, F = 1.55, M = 1739.09). C5.0 decision tree algorithm was used to predict the treatment situation of patients with chronic infectious diseases, the final treatment time (weeks) is an important predictor, the accuracy rate is 99.94% verified by the confusion model. Conclusion: Medical institutions should strengthen the adherence education for patients with chronic infectious diseases, establish the chronic infectious diseases and customer relationship management database, take the initiative to help them improve treatment adherence. Chinese governments at all levels should speed up the construction of hospital information, establish the chronic infectious disease database, strengthen the blocking of mother-to-child transmission, to effectively curb chronic infectious diseases, reduce disease burden and mortality.
文摘Information disclosure can reduce information asymmetry between health care providers and patients, thus improving both patient safety and medical quality. The National Bureau of Health Insurance (NBHI) inTaiwancurrently publishes health-related information online in order to enhance service efficiency and enable the public to monitor the country’s medical system. A data mining technique, classification and regression tree (CART), is used in this work to investigate online public quality information to compare the characteristics of hospital. The hospital quality indicators and characteristics data are available on the websites of the NBHI (http://www.nhi.gov.tw/AmountInfoWeb/Index.aspx) and the Department of Health (http://www.doh.gov.tw/). The full classification and regression tree presented in this work, grown using the hospitals’ quality medical indicators and characteristic values, classifies all hospitals into seven groups. The rate of stays longer than 30 days, which is the dependent variable in this study, is most influenced by the number of medical staff. This reflects the fact that the fewer medical staffs that are employed, the smaller the hospital is, and patients who are likely to have longer stays tend to go to the medium or large hospitals. Policy makers should work to decrease or eliminate persistent healthcare disparities among different socioeconomic groups and offer more online healthrelated services to reduce information asymmetry between health care providers and patients.
文摘Trauma is the most common cause of death to young people and many of these deaths are preventable [1]. The prediction of trauma patients outcome was a difficult problem to investigate till present times. In this study, prediction models are built and their capabilities to accurately predict the mortality are assessed. The analysis includes a comparison of data mining techniques using classification, clustering and association algorithms. Data were collected by Hellenic Trauma and Emergency Surgery Society from 30 Greek hospitals. Dataset contains records of 8544 patients suffering from severe injuries collected from the year 2005 to 2006. Factors include patients' demographic elements and several other variables registered from the time and place of accident until the hospital treatment and final outcome. Using this analysis the obtained results are compared in terms of sensitivity, specificity, positive predictive value and negative predictive value and the ROC curve depicts these methods performance.
文摘The increasing volume of data in the area of environmental sciences needs analysis and interpretation. Among the challenges generated by this “data deluge”, the development of efficient strategies for the knowledge discovery is an important issue. Here, statistical and tools from computational intelligence are applied to analyze large data sets from meteorology and climate sciences. Our approach allows a geographical mapping of the statistical property to be easily interpreted by meteorologists. Our data analysis comprises two main steps of knowledge extraction, applied successively in order to reduce the complexity from the original data set. The goal is to identify a much smaller subset of climatic variables that might still be able to describe or even predict the probability of occurrence of an extreme event. The first step applies a class comparison technique: p-value estimation. The second step consists of a decision tree (DT) configured from the data available and the p-value analysis. The DT is used as a predictive model, identifying the most statistically significant climate variables of the precipitation intensity. The methodology is employed to the study the climatic causes of an extreme precipitation events occurred in Alagoas and Pernambuco States (Brazil) at June/2010.