[Objectives] To explore brand trends in the design of waist protection products through data mining, and to provide a reference for the concept design of waist protection pillow contours. [Methods] Structural design information on all waist protection equipment was collected from national Internet platforms; the data were classified and a database was established. IBM SPSS 26.0 and MATLAB 2018a were used to analyze the data, which were tabulated in Tableau 2022.4. After the association rules were clarified, the data were imported into Cinema 4D R21 to create the concept contour of the waist protection pillow. [Results] The mean and standard deviation of the single airbag design were the highest among all groups (mean 0.511, standard deviation 0.502); those of the upper and lower dual airbags were the lowest (mean 0.015, standard deviation 0.120). The correlation coefficient between single airbag and 120° arc stretching was 0.325, a positive correlation (P<0.01); the correlation coefficient between multiple airbags and 360° encircling fitting was 0.501, a positive correlation and the highest among all groups (P<0.01). [Conclusions] The single airbag design is well recognized by companies and has received the highest attention among all brand products. While focusing on single airbag design, most brands also consider adding 120° arc stretching elements to their products; when focusing on multiple airbag design, some brands add 360° encircling fitting elements, and the correlation between the two is the highest among all groups. Funding: Municipal Public Welfare Science and Technology Project of Zhoushan Science and Technology Bureau, Zhejiang Province (2021C31064).
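The correlation step above amounts to computing Pearson correlations over binary design-feature indicators. A minimal sketch of that analysis (pandas/SciPy stand-ins for the SPSS workflow; the feature matrix below is a hypothetical illustration, not the study's data):

```python
# A minimal sketch of the correlation analysis described above, assuming a
# binary feature matrix (1 = product has the design element).
import pandas as pd
from scipy.stats import pearsonr

products = pd.DataFrame({
    "single_airbag":    [1, 1, 0, 1, 0, 1, 1, 0],
    "arc_stretch_120":  [1, 0, 0, 1, 0, 1, 1, 0],
    "multiple_airbags": [0, 0, 1, 0, 1, 0, 0, 1],
    "encircle_fit_360": [0, 0, 1, 0, 1, 0, 0, 1],
})

print(products.mean())                       # per-feature mean (share of products)
print(products.std())                        # per-feature standard deviation
r, p = pearsonr(products["single_airbag"], products["arc_stretch_120"])
print(f"r = {r:.3f}, P = {p:.3f}")           # Pearson r on 0/1 data = phi coefficient
```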
Along with the rapid development of the Internet, CRM has become one of the most important factors in keeping enterprises competitive. At the same time, analytical CRM based on a Data Warehouse is the kernel of a CRM system. This paper mainly explains the idea of CRM and the DW model of an analytical CRM system.
According to the chaotic and non-linear character of power load data, a time series matrix is established with the theory of phase-space reconstruction, and Lyapunov exponents of the chaotic time series are computed to determine the time delay and the embedding dimension. Because of the differing features of the data, a data mining algorithm is used to classify the data into groups. Redundant information is eliminated by data mining technology, and the system searches for the historical loads whose features are most similar to those of the forecasting day. As a result, the training data can be reduced and the computing speed improved when constructing the support vector machine (SVM) model. SVM is then used to predict the power load with the parameters obtained in preprocessing. To prove the effectiveness of the new model, the results of the data mining SVM (DSVM) algorithm are compared with those of a single SVM and a back propagation (BP) network. The new DSVM algorithm improves forecast accuracy by 0.75%, 1.10% and 1.73% compared with single SVM at two randomly chosen embedding dimensions (11 and 14) and with the BP network, respectively. This indicates that DSVM achieves a clear improvement in short-term power load forecasting. Funding: Project (70671039) supported by the National Natural Science Foundation of China.
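A minimal sketch of the core pipeline, assuming the time delay and embedding dimension are already known (the paper derives them from Lyapunov-exponent analysis; here they are hard-coded, and the load series is synthetic):

```python
# Phase-space reconstruction of a load series followed by SVM regression.
import numpy as np
from sklearn.svm import SVR

def delay_embed(x, m, tau):
    """Rows are delay vectors x[i], x[i+tau], ..., x[i+(m-1)tau]; target is the next value."""
    n = len(x) - (m - 1) * tau - 1
    X = np.array([x[i:i + m * tau:tau] for i in range(n)])
    y = x[(m - 1) * tau + 1:(m - 1) * tau + 1 + n]
    return X, y

load = np.sin(np.linspace(0, 60, 600)) + 0.1 * np.random.randn(600)  # synthetic load
X, y = delay_embed(load, m=11, tau=2)
model = SVR(kernel="rbf", C=10.0).fit(X[:-24], y[:-24])              # hold out last 24 steps
print("held-out mean abs error:", np.mean(np.abs(model.predict(X[-24:]) - y[-24:])))
```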
Objective speech quality is difficult to measure without the input reference speech. Mapping methods using data mining are investigated and designed to improve output-based speech quality assessment. The degraded speech is first separated into three classes (unvoiced, voiced and silence); the consistency between the degraded speech signal and a pre-trained reference model for each class is then calculated and mapped to an objective speech quality score using data mining. A fuzzy Gaussian mixture model (GMM) is used to generate the artificial reference model, trained on perceptual linear predictive (PLP) features. Mean opinion score (MOS) mapping methods including multivariate non-linear regression (MNLR), fuzzy neural network (FNN) and support vector regression (SVR) are designed and compared with the standard ITU-T P.563 method. Experimental results show that the assessment methods with data mining perform better than ITU-T P.563. Moreover, FNN and SVR are more efficient than MNLR, and FNN performs best, with a 14.50% increase in the correlation coefficient and a 32.76% decrease in the root-mean-square MOS error. Funding: Projects (61001188, 1161140319) supported by the National Natural Science Foundation of China; Project (2012ZX03001034) supported by the National Science and Technology Major Project; Project (YETP1202) supported by the Beijing Higher Education Young Elite Teacher Project, China.
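A minimal sketch of the output-based idea with a plain (non-fuzzy) GMM as the reference model and SVR as the MOS mapper; the PLP features and MOS labels below are synthetic stand-ins:

```python
# Score degraded features against a reference GMM, then map the
# consistency score to a MOS with support vector regression.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVR

rng = np.random.default_rng(0)
clean_feats = rng.normal(0, 1, (2000, 12))         # stand-in for clean-speech PLP frames
ref_model = GaussianMixture(n_components=8, random_state=0).fit(clean_feats)

def consistency(utterance_feats):
    # Consistency measure: average log-likelihood under the reference model
    return ref_model.score(utterance_feats)

# Map consistency scores to (synthetic) subjective MOS labels with SVR
scores = np.array([[consistency(rng.normal(0, s, (200, 12)))] for s in np.linspace(1, 2, 30)])
mos = np.linspace(4.5, 1.5, 30)                    # synthetic MOS, degrading with noise
mapper = SVR(kernel="rbf").fit(scores, mos)
print("predicted MOS:", mapper.predict(scores[:3]))
```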
In order to compete in the global manufacturing market, agility is the only possible way to respond to fragmented market segments and frequently changing customer requirements. However, manufacturing agility can only be attained through the deployment of knowledge. Embedding knowledge into a CAD system to form a knowledge intensive CAD (KIC) system is one way to enhance the design capability of a manufacturing company. The most difficult phase in developing a KIC system is capitalizing a huge amount of legacy data to form a knowledge database. In the past, such capitalization could only be done manually or semi-automatically. In this paper, a five-step model for automatic design knowledge capitalization through data mining is proposed, and details of how to select, verify and benchmark the performance of an appropriate data mining algorithm for a specific design task are also discussed. A case study concerning the design of a plastic toaster casing illustrates the proposed methodology; the average absolute error of the predictions for the most appropriate algorithm was within 17%.
Objective: To analyze the experience of chief physician Xiong Lu in treating metaphase and advanced lung cancer using the TCM inheritance support system (V2.5). Methods: The prescriptions used for metaphase and advanced lung cancer from November 1, 2014 to February 1, 2015 were collected and entered into the TCM inheritance support system. Composing principles were analyzed by principle analysis, revised mutual information, complex system entropy clustering and unsupervised hierarchical clustering. Results: Based on the analysis of 228 prescriptions, the frequency of each Chinese medicinal herb and the association rules among herbs in the database were computed. 15 core combinations and 2 new prescriptions were discovered. Conclusion: In treating metaphase and advanced lung cancer, chief physician Xiong Lu pays attention to Fuzheng Peiben (therapy to support Zheng-qi and prop up the root), combined, according to the situation, with Tong Luo (dredging collaterals), San Jie (dissipating a mass), Huo Xue (activating blood), Gong Du (counteracting toxic substances) and so on. Xiong Lu is also skilled at using toxic drugs and incompatible medicaments.
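A minimal sketch of the herb-frequency and association-rule step using the mlxtend library rather than the TCM inheritance support system; the prescriptions and herb names are hypothetical:

```python
# Mine herb usage frequencies and association rules over prescriptions.
import pandas as pd
from mlxtend.preprocessing import TransactionEncoder
from mlxtend.frequent_patterns import apriori, association_rules

prescriptions = [
    ["astragalus", "atractylodes", "poria"],
    ["astragalus", "poria", "licorice"],
    ["astragalus", "atractylodes", "licorice"],
    ["poria", "licorice"],
]
te = TransactionEncoder()
onehot = pd.DataFrame(te.fit(prescriptions).transform(prescriptions), columns=te.columns_)

print(onehot.sum().sort_values(ascending=False))   # herb usage frequency
frequent = apriori(onehot, min_support=0.5, use_colnames=True)
print(association_rules(frequent, metric="confidence", min_threshold=0.7))
```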
On the basis of the reality of material supply management in coal enterprises, this paper expounds plans for material management systems based on specific IT, and indicates their deficiencies and problems and the necessity of improving them. The structure, models and data organization schema of a material management decision support system are investigated based on a new data management technology (data warehousing).
The paper introduces data mining and issues related to it. Data mining is a technique by which we can extract useful knowledge from a huge set of data. Data mining tasks are used to perform various operations and to solve various problems related to data mining. A data warehouse is a collection of methods and techniques used to extract useful information from raw data. A genetic algorithm is based on Darwin's theory, in which low-standard chromosomes are removed from the population due to their inability to survive the process of selection; the high-standard chromosomes survive and are mixed by recombination to form fitter individuals. In this way a huge amount of data is used to predict future results by following several steps.
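A minimal sketch of the selection-and-recombination cycle described above, with a toy fitness function (count of 1-bits) standing in for a real objective:

```python
# Fitter bit-string "chromosomes" survive and recombine; weak ones are removed.
import random

def fitness(chrom):
    return sum(chrom)                       # toy objective: maximize number of 1s

pop = [[random.randint(0, 1) for _ in range(16)] for _ in range(20)]
for generation in range(30):
    pop.sort(key=fitness, reverse=True)
    survivors = pop[:10]                    # selection: drop low-standard chromosomes
    children = []
    while len(children) < 10:               # recombination: single-point crossover
        a, b = random.sample(survivors, 2)
        cut = random.randrange(1, 15)
        children.append(a[:cut] + b[cut:])
    pop = survivors + children

best = max(pop, key=fitness)
print("best chromosome:", best, "fitness:", fitness(best))
```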
The aim of this study was to discriminate organic from conventional orange juice based on chemical elements and data mining applications. A comprehensive sampling of organic and conventional oranges was carried out in Borborema, state of Sao Paulo, Brazil. The fruits of the variety Valencia (Citrus sinensis (L.) Osbeck) budded on Rangpur lime (Citrus limonia Osbeck) were analyzed. Eleven chemical elements were determined in 57 orange samples grown in organic and conventional systems. To classify these samples, data mining techniques (Support Vector Machine (SVM) and Multilayer Perceptron (MLP)) were combined with feature selection (F-score and chi-squared). SVM with chi-squared performed best, reaching 93.00% accuracy using only seven chemical elements (Cu, Cs, Zn, Al, Mn, Rb and Sr) and correctly classifying 96.73% of the samples grown in an organic system.
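A minimal sketch of the chi-squared feature selection plus SVM pipeline on synthetic stand-in data (57 samples by 11 elements, matching the study's shape but not its values):

```python
# Select the k most discriminative elements by chi2, then classify with SVM.
import numpy as np
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(1)
X = np.abs(rng.normal(1, 0.3, (57, 11)))   # 57 samples x 11 element concentrations
y = rng.integers(0, 2, 57)                 # 0 = conventional, 1 = organic (synthetic)

# chi2 requires non-negative inputs, hence the MinMax scaling
clf = make_pipeline(MinMaxScaler(), SelectKBest(chi2, k=7), SVC(kernel="rbf"))
print("CV accuracy:", cross_val_score(clf, X, y, cv=5).mean())
```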
Credit scoring has become a critical and challenging management science issue as the credit industry has faced stiffer competition in recent years. Many classification methods have been suggested in the literature to tackle this problem. In this paper, we investigate the performance of various credit scoring models and the corresponding credit risk cost for three real-life credit scoring data sets. Besides the well-known classification algorithms (e.g. linear discriminant analysis, logistic regression, neural networks and k-nearest neighbor), we also investigate the suitability and performance of some recently proposed, advanced data mining techniques such as support vector machines (SVMs), classification and regression trees (CART), and multivariate adaptive regression splines (MARS). Performance is assessed using classification accuracy and the cost of credit scoring errors. The experimental results show that SVM, MARS, logistic regression and neural networks yield very good performance; however, the explanatory capability of CART and MARS outperforms the other methods. Funding: Supported in part by the National Science Foundation of China under Grant No. 70171015.
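A minimal sketch of this kind of head-to-head comparison with generic scikit-learn estimators on a synthetic credit dataset (MARS is not in scikit-learn and is omitted; CART is represented by a decision tree):

```python
# Compare credit scoring classifiers by cross-validated accuracy.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.linear_model import LogisticRegression
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC

# Imbalanced synthetic data: ~70% good payers, ~30% defaulters
X, y = make_classification(n_samples=600, n_features=15, weights=[0.7], random_state=0)
models = {
    "LDA": LinearDiscriminantAnalysis(),
    "logistic": LogisticRegression(max_iter=1000),
    "k-NN": KNeighborsClassifier(),
    "CART": DecisionTreeClassifier(random_state=0),
    "neural net": MLPClassifier(max_iter=2000, random_state=0),
    "SVM": SVC(),
}
for name, model in models.items():
    print(f"{name:>10}: {cross_val_score(model, X, y, cv=5).mean():.3f}")
```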
In this paper, we design a customer-centered data warehouse system with five subjects: listing, bidding, transaction, accounts, and customer contact, based on the business process of online auction companies. For each subject, we analyze its fact indexes and dimensions. Then, taking the transaction subject as an example, we analyze the data warehouse model in detail and obtain its multi-dimensional analysis structure. Finally, using data mining for customer segmentation, we divide customers into four types: impulse customers, prudent customers, potential customers, and ordinary customers. With the results of multi-dimensional customer data analysis, online auction companies can do more targeted marketing and increase customer loyalty. Funding: Supported by the National Natural Science Foundation of China (70471037) and the 211 Project Foundation of Shanghai University (8011040506).
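A minimal sketch of the four-way segmentation step with k-means, assuming per-customer behavioral features derived from the warehouse (feature names are hypothetical):

```python
# Cluster customers into four segments from warehouse-derived features.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans

rng = np.random.default_rng(2)
customers = pd.DataFrame({
    "bids_per_month":   rng.poisson(5, 200),
    "avg_spend":        rng.gamma(2.0, 50.0, 200),
    "days_since_visit": rng.integers(1, 90, 200),
})
X = StandardScaler().fit_transform(customers)
customers["segment"] = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)
# Inspect segment centers to name them, e.g. impulse / prudent / potential / ordinary
print(customers.groupby("segment").mean().round(1))
```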
Workers' exposure to excessive noise is a major universal work-related challenge. One of the main consequences of exposure to noise is permanent or transient hearing loss. The current study sought to utilize audiometric data to weigh and prioritize the factors affecting workers' hearing loss using the Support Vector Machine (SVM) algorithm. This cross-sectional descriptive study was conducted in 2017 in a mining industry in southeast Iran. The participating workers (n=150) were divided into three groups of 50 based on the sound pressure level to which they were exposed (two experimental groups and one control group). Audiometric tests were carried out for all members of each group. The study generally entailed the following steps: (1) selecting predictor variables to weigh and prioritize factors affecting hearing loss; (2) conducting audiometric tests, assessing permanent hearing loss in each ear and then evaluating total hearing loss; (3) categorizing different types of hearing loss; (4) weighing and prioritizing factors that affect hearing loss based on the SVM algorithm; and (5) assessing the error rate and accuracy of the models. The collected data were fed into SPSS 18, followed by linear regression and paired-samples t-tests. In the first model (SPL<70 dBA), the frequency of 8 kHz had the greatest impact (weight 33%), while noise had the smallest influence (weight 5%); the accuracy of this model was 100%. In the second model (70<SPL<80 dBA), the frequency of 4 kHz had the most profound effect (weight 21%), whereas the frequency of 250 Hz had the lowest impact (weight 6%); the accuracy of this model was also 100%. In the third model (SPL>85 dBA), the frequency of 4 kHz had the highest impact (weight 22%), while the frequency of 250 Hz had the smallest influence (weight 3%); the accuracy of this model was also 100%. In the fourth model, the frequency of 4 kHz had the greatest effect (weight 24%), while the frequency of 500 Hz had the smallest effect (weight 4%); the accuracy of this model was 94%. According to the modeling conducted with the SVM algorithm, the frequency of 4 kHz has the most profound effect on predicting changes in hearing loss. Given the high accuracy of the obtained models, this algorithm is an appropriate and powerful tool for predicting and modeling hearing loss. Funding: This study stemmed from a research project (code number: 96000838) sponsored by the Institute for Futures Studies in Health at Kerman University of Medical Sciences.
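A minimal sketch of deriving feature weights from an SVM: with a linear kernel, the normalized coefficient magnitudes can be read as weights. The audiometric features and labels below are synthetic stand-ins:

```python
# Weight predictors of hearing loss via linear-SVM coefficients.
import numpy as np
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(3)
features = ["250 Hz", "500 Hz", "1 kHz", "2 kHz", "4 kHz", "8 kHz", "noise", "age"]
X = StandardScaler().fit_transform(rng.normal(0, 1, (150, len(features))))
y = (X[:, 4] + 0.3 * X[:, 5] + 0.1 * rng.normal(size=150) > 0).astype(int)  # 4 kHz dominant

svm = SVC(kernel="linear").fit(X, y)
weights = np.abs(svm.coef_[0]) / np.abs(svm.coef_[0]).sum()   # normalize to shares
for name, w in sorted(zip(features, weights), key=lambda t: -t[1]):
    print(f"{name:>7}: {w:.0%}")
```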
Background knowledge is important for data mining, especially in complicated situations. Ontological engineering is the successor of knowledge engineering, and sharable knowledge bases built on ontologies can provide background knowledge to direct the process of data mining. This paper gives a general introduction to the method and presents a practical analysis example using an SVM (support vector machine) as the classifier. Gene Ontology and its accompanying annotations compose a large knowledge base on which much research has been carried out. A microarray dataset is the output of a DNA chip. With the help of Gene Ontology, we present a more elaborate analysis of microarray data than previous researchers. The method can also be used in other fields with similar scenarios. Funding: Project (No. 20040248001) supported by the Ph.D. Programs Foundation of the Ministry of Education of China.
Association rule mining is a major data mining field that leads to the discovery of associations and correlations among items in today's big data environment. Conventional association rule mining focuses mainly on positive itemsets generated from frequently occurring itemsets (PFIS). However, there has been significant study of infrequent itemsets, utilizing negative association rules to mine interesting frequent itemsets (NFIS) from transactions. In this work, we propose an efficient backward-calculating negative frequent itemset algorithm, EBC-NFIS, for computing backward supports that can extract both positive and negative frequent itemsets synchronously from a dataset. EBC-NFIS is based on the popular e-NFIS algorithm, which computes the supports of negative itemsets from the supports of positive itemsets. The proposed algorithm reuses previously computed supports from memory to minimize computation time. In addition, association rules, i.e. positive and negative association rules (PNARs), are generated from the frequent itemsets discovered by EBC-NFIS. The efficiency of the proposed algorithm is verified by several experiments comparing its results with the e-NFIS algorithm. The experimental results confirm that the proposed algorithm successfully discovers NFIS and PNARs and runs significantly faster than the conventional e-NFIS algorithm.
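A minimal sketch of the e-NFIS identity that EBC-NFIS builds on: negative-itemset supports follow arithmetically from positive supports via inclusion-exclusion, so no extra database scan is needed:

```python
# Derive negative-itemset supports from already-mined positive supports.
def negative_supports(s_a, s_b, s_ab):
    """Given supp(A), supp(B), supp(A∪B) from positive mining,
    derive the three negative combinations by inclusion-exclusion."""
    return {
        "supp(A, ¬B)":  s_a - s_ab,
        "supp(¬A, B)":  s_b - s_ab,
        "supp(¬A, ¬B)": 1 - s_a - s_b + s_ab,
    }

# Example: A in 60% of transactions, B in 50%, A and B together in 30%
print(negative_supports(0.6, 0.5, 0.3))
# Caching these values, as EBC-NFIS does with previously computed supports,
# is what avoids recomputation in the backward-calculating variant.
```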
Technological evolution is producing a unified (Industrial) Internet of Things network, in which loosely coupled smart manufacturing devices build smart manufacturing systems and enable comprehensive collaboration possibilities that increase the dynamics and volatility of their ecosystems. On the one hand, this evolution generates a huge field for exploitation; on the other hand, it also increases complexity, including new challenges and requirements demanding new approaches to several issues. One challenge is the analysis of such systems, which generate huge amounts of (continuously generated) data potentially containing valuable information useful for several use cases, such as knowledge generation, key performance indicator (KPI) optimization, diagnosis, prediction, feedback to design, or decision support. This work presents a review of Big Data analysis in smart manufacturing systems. It includes the status quo in research, innovation and development, the next challenges, and a comprehensive list of potential use cases and exploitation possibilities.
The explosion of online information with the recent advent of digital technology in information processing, information storing, information sharing, natural language processing, and text mining techniques has enabled stock investors to uncover market movement and volatility from heterogeneous content. For example, a typical stock market investor reads the news, explores market sentiment, and analyzes technical details in order to make a sound decision prior to purchasing or selling a particular company's stock. However, capturing a dynamic stock market trend is challenging owing to the high fluctuation and non-stationary nature of the stock market. Although existing studies have attempted to enhance stock prediction, few have provided a complete decision-support system for investors to retrieve real-time data from multiple sources and extract insightful information for sound decision-making. To address this challenge, we propose a unified solution for data collection, analysis, and visualization in real-time stock market prediction that retrieves and processes relevant financial data from news articles, social media, and company technical information. We aim to provide not only useful information for stock investors but also meaningful visualization that enables investors to effectively interpret the storyline of events affecting stock prices. Specifically, we utilize an ensemble stacking of diversified machine-learning-based estimators and innovative contextual feature engineering to predict the next day's stock prices. Experimental results show that our proposed stock forecasting method outperforms a traditional baseline with an average mean absolute percentage error of 0.93. Our findings confirm that leveraging an ensemble scheme of machine learning methods with contextual information improves stock prediction performance. Finally, our study could be extended to a wide variety of innovative financial applications that seek to incorporate external insight from contextual information such as large-scale online news articles and social media data. Funding: Supported by Mahidol University (Grant No. MU-MiniRC02/2564), with partial computing resources from Grant No. RSA6280105, funded by Thailand Science Research and Innovation (TSRI, formerly the Thailand Research Fund (TRF)) and the National Research Council of Thailand (NRCT).
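A minimal sketch of an ensemble-stacking regressor evaluated by MAPE, with synthetic stand-ins for the contextual features (sentiment, technical indicators) the paper engineers:

```python
# Stack diversified base regressors under a linear meta-learner.
import numpy as np
from sklearn.ensemble import StackingRegressor, RandomForestRegressor, GradientBoostingRegressor
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(4)
X = rng.normal(0, 1, (500, 8))                              # e.g. sentiment, volume, momentum
y = X @ rng.normal(0, 1, 8) + 0.2 * rng.normal(size=500)    # synthetic next-day target

stack = StackingRegressor(
    estimators=[("rf", RandomForestRegressor(random_state=0)),
                ("gbr", GradientBoostingRegressor(random_state=0))],
    final_estimator=Ridge(),
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)
stack.fit(X_tr, y_tr)
pred = stack.predict(X_te)
mape = np.mean(np.abs((y_te - pred) / np.maximum(np.abs(y_te), 1e-8))) * 100
print(f"MAPE: {mape:.2f}%")
```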
Detecting naturally arising structures in data is central to knowledge extraction. In most applications, the main challenge is the choice of an appropriate model for exploring the data features; the choice is generally poorly understood, and any tentative choice may be too restrictive. Growing volumes of data, disparate data sources and modelling techniques entail the need for model optimization via adaptability rather than comparability. We propose a novel two-stage algorithm for modelling continuous data, consisting of an unsupervised stage in which the algorithm searches the data for optimal parameter values, and a supervised stage that adapts the parameters for predictive modelling. The method is implemented on the sunspots data, which have inherently Gaussian distributional properties and assumed bi-modality. Optimal values separating high from low cycles are obtained via multiple simulations. Early patterns for each recorded cycle reveal that the first 3 years provide a sufficient basis for predicting the peak. Multiple Support Vector Machine runs using repeatedly improved data parameters show that the approach yields greater accuracy and reliability than conventional approaches and provides a good basis for model selection. Model reliability is established via multiple simulations of this type.
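A minimal sketch of the two-stage idea: an unsupervised pass (here a two-component Gaussian mixture) estimates the value separating high from low cycles, then a supervised SVM predicts the class from early-cycle features. All data below are synthetic stand-ins for sunspot records:

```python
# Stage 1: unsupervised threshold search; Stage 2: supervised SVM prediction.
import numpy as np
from sklearn.mixture import GaussianMixture
from sklearn.svm import SVC

rng = np.random.default_rng(5)
peaks = np.concatenate([rng.normal(80, 15, 60), rng.normal(160, 20, 60)])  # bimodal peaks

# Stage 1: fit a two-component mixture; take the midpoint of the component
# means as the high/low separating value
gmm = GaussianMixture(n_components=2, random_state=0).fit(peaks.reshape(-1, 1))
threshold = gmm.means_.mean()
labels = (peaks > threshold).astype(int)

# Stage 2: predict high/low cycle from (noisy) early-cycle observations
early = peaks.reshape(-1, 1) * 0.3 + rng.normal(0, 5, (120, 1))
svm = SVC(kernel="rbf").fit(early, labels)
print("threshold:", round(float(threshold), 1), "accuracy:", svm.score(early, labels))
```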