Funding: Supported by the E-learning Platform, National Torch Project (No. z20040010).
Abstract: With the fast development of business logic and information technology, today's best solutions are tomorrow's legacy systems. In China, the situation in the education domain follows the same path. Currently, there exist a number of e-learning legacy assets with accumulated practical business experience, such as program resources and usage behaviour data resources. In order to use these legacy assets adequately and efficiently, we should not only utilize the explicit assets but also discover the hidden assets. The usage behaviour data resource is the set of practical operation sequences requested by all users. The hidden patterns in this data resource capture users' practical experience, which can benefit service composition in service-oriented architecture (SOA) migration. Namely, the discovered patterns become the candidate coarse-grained composite services in SOA systems. Although data mining techniques have been used for software engineering tasks, little is known about how they can be used for service composition in migrating an e-learning legacy system (MELS) to SOA. In this paper, we propose a service composition approach based on sequence mining techniques for MELS. Composite services found by this approach complement the business logic analysis results of MELS. The core of this approach is to develop an appropriate sequence mining algorithm for mining related data collected from an e-learning legacy system. According to the features of the execution trace data on usage behaviour from this e-learning legacy system and the needs of further pattern analysis, we propose a sequential mining algorithm to mine this kind of data from the legacy system. For validation, this approach has been applied to real data collected from the e-learning legacy system; meanwhile, questionnaires were administered to collect satisfaction data. The survey results agree with the results obtained through our approach in 90% of cases.
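The paper's own sequential mining algorithm is not reproduced in this abstract, but the underlying idea of extracting frequent operation subsequences from usage traces can be illustrated with a minimal sketch. The code below counts contiguous subsequences (n-grams) across user operation traces and keeps those above a minimum support; the trace data, `max_len`, and `min_support` values are illustrative assumptions, not the authors' algorithm or dataset.

```python
from collections import Counter

def frequent_subsequences(traces, max_len=4, min_support=2):
    """Count contiguous operation subsequences (n-grams) across traces and
    return those occurring in at least `min_support` traces."""
    counts = Counter()
    for trace in traces:
        seen = set()
        for n in range(2, max_len + 1):
            for i in range(len(trace) - n + 1):
                seen.add(tuple(trace[i:i + n]))
        counts.update(seen)          # each subsequence counted once per trace
    return {s: c for s, c in counts.items() if c >= min_support}

# hypothetical operation traces from an e-learning system
traces = [
    ["login", "browse_course", "open_lesson", "submit_quiz"],
    ["login", "browse_course", "open_lesson", "logout"],
    ["login", "search", "open_lesson", "submit_quiz"],
]
for seq, support in sorted(frequent_subsequences(traces).items(), key=lambda x: -x[1]):
    print(support, "->", " > ".join(seq))
```

Frequent subsequences such as login > browse_course > open_lesson would then be candidates for review as coarse-grained composite services.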
Abstract: After the digital revolution, large quantities of data have been generated over time through various networks. These networks have made the process of analysing data to detect attacks with suitable techniques very difficult. While Intrusion Detection Systems (IDSs) secure resources against threats, they still face challenges in improving detection accuracy, reducing false alarm rates, and detecting unknown attacks. This paper presents a framework that integrates data mining classification algorithms and association rules to implement network intrusion detection. Several experiments have been performed and evaluated to assess various machine learning classifiers based on the KDD99 intrusion dataset. Our study focuses on several data mining algorithms such as naïve Bayes, decision trees, support vector machines, decision tables, k-nearest neighbor algorithms, and artificial neural networks. Moreover, this paper is concerned with the association process of creating attack rules to identify attacks in the network audit data, utilizing the KDD99 dataset for anomaly detection. The focus is on false negative and false positive performance metrics to enhance the detection rate of the intrusion detection system. The implemented experiments compare the results of each algorithm and demonstrate that the decision tree is the most powerful algorithm, as it has the highest accuracy (0.992) and the lowest false positive rate (0.009).
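As a rough illustration of the classifier comparison described above, the sketch below trains a few scikit-learn classifiers on a labelled feature matrix and reports accuracy and false positive rate. The synthetic data stands in for preprocessed KDD99 records; the feature encoding, train/test split, and parameter choices are assumptions, not the paper's experimental setup.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier

# synthetic stand-in for encoded intrusion records (1 = attack, 0 = normal)
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.8, 0.2], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

for name, clf in [("naive Bayes", GaussianNB()),
                  ("decision tree", DecisionTreeClassifier(random_state=0)),
                  ("k-NN", KNeighborsClassifier(n_neighbors=5))]:
    y_pred = clf.fit(X_tr, y_tr).predict(X_te)
    tn, fp, fn, tp = confusion_matrix(y_te, y_pred).ravel()
    fpr = fp / (fp + tn)          # false positive rate: normal traffic flagged as attack
    print(f"{name}: accuracy={accuracy_score(y_te, y_pred):.3f}, FPR={fpr:.3f}")
```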
Funding: This project was supported by the National Natural Science Foundation of China (60364001), the Shandong Provincial Natural Science Foundation of China (Y2004A04), and the Fujian Provincial Education Foundation of China (JA04268).
Abstract: Using S-rough sets, this paper gives the concepts of f-heredity knowledge and its heredity coefficient, and of the f-variation coefficient of knowledge; it presents the theorem of f-attribute dependence of the variation coefficient and the relation theorem of heredity-variation. The attribute dependence of the f-variation coefficient and the heredity-variation relation are important characteristics of S-rough sets. From this discussion, the paper puts forward the heredity mining of f-knowledge and an algorithm for heredity mining, and gives a related application.
Funding: Self-supported as part of the PhD Program on a CSC scholarship at the China University of Geosciences (Wuhan).
Abstract: In a developing country like Ghana, the study of land use and land cover change (LULCC) based on satellite imagery still remains a challenge due to cost, resolution, and availability, together with a shortage of skilled manpower. Existing research is skewed towards the southern part of Ghana, thereby leaving the Northern sectors uncovered. The maximum likelihood classification (MLC) algorithm was employed for the LULCC between 2000 and 2014 in Nadowli: an area characterized by an upsurge in mining in the Northern belt of Ghana. A spatial-social approach was utilized, combining both satellite imagery and socio-economic data. A land use transition matrix and land use integrated index/degree indices were used to depict the character of the change. Semi-structured interviews, pairwise ranking, and key informant interviews were used to correlate the socio-economic impact of the different LULC classes. Overall changes in the landscape showed an increase in bare ground by 19.22% and open savannah by 16.8%, whereas closed savannah decreased by 50%. The land use change matrix showed increasing trends of bare ground at the expense of vegetation. The integrated land use index highlighted bare ground and built-up areas rising with a decreasing closed vegetation woodlot. Large farm sizes are shrinking, while the majority of the people view mining as the main socio-economic activity affecting the environment and the reduction in vegetation. This study therefore provides a strategic guide and baseline data for land use policy actors in the Northern belt of Ghana. This will aid in developing models for future land use change implications in surrounding areas where mining is on the rise.
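A land use transition matrix of the kind mentioned above is essentially a cross-tabulation of the class assigned to each pixel at the earlier date against the class assigned at the later date. The sketch below builds one with pandas from two small, made-up classified rasters; the class codes and values are placeholders, not the study's data.

```python
import numpy as np
import pandas as pd

classes = {0: "closed savannah", 1: "open savannah", 2: "bare ground", 3: "built-up"}

# hypothetical classified rasters for two dates (values are class codes)
lulc_2000 = np.array([[0, 0, 1, 1],
                      [0, 1, 1, 2],
                      [0, 1, 2, 2],
                      [1, 1, 2, 3]])
lulc_2014 = np.array([[0, 1, 1, 2],
                      [1, 1, 2, 2],
                      [1, 2, 2, 2],
                      [1, 2, 3, 3]])

# rows: class in 2000, columns: class in 2014, values: pixel counts
transition = pd.crosstab(pd.Series(lulc_2000.ravel(), name="2000").map(classes),
                         pd.Series(lulc_2014.ravel(), name="2014").map(classes))
print(transition)
```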
Abstract: Classification systems such as Slope Mass Rating (SMR) are currently used to undertake slope stability analysis. In the SMR classification system, data are allocated to certain classes based on linguistic and experience-based criteria. In order to eliminate the linguistic criteria that result from experience-based judgments and to account for uncertainties in determining the class boundaries of the SMR system, the system's classification results were corrected using two clustering algorithms, namely K-means and fuzzy c-means (FCM), for the ratings obtained via continuous and discrete functions. By applying clustering algorithms to the SMR classification system, no in-advance experience-based judgment was made on the number of extracted classes, and it was only after all steps of the clustering algorithms were completed that a new classification scheme was proposed for the SMR system under different failure modes, based on the ratings obtained via continuous and discrete functions. The results of this study showed that engineers can achieve more reliable and objective evaluations of slope stability by using the SMR system based on ratings calculated via continuous and discrete functions.
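To illustrate the clustering step described above, the sketch below groups a one-dimensional set of SMR-style ratings with K-means and derives class boundaries from the sorted cluster centres. The rating values and the choice of five clusters are placeholders, and scikit-learn's KMeans stands in for the paper's K-means/FCM pipeline.

```python
import numpy as np
from sklearn.cluster import KMeans

# hypothetical SMR ratings (0-100 scale) computed via a continuous function
ratings = np.array([12, 18, 22, 35, 38, 41, 47, 55, 58, 63, 71, 76, 82, 88, 93], dtype=float)

km = KMeans(n_clusters=5, n_init=10, random_state=0).fit(ratings.reshape(-1, 1))
centres = np.sort(km.cluster_centers_.ravel())

# class boundaries taken as midpoints between adjacent cluster centres
boundaries = (centres[:-1] + centres[1:]) / 2
print("cluster centres:", np.round(centres, 1))
print("class boundaries:", np.round(boundaries, 1))
```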
Funding: Project supported by the National Basic Research Program of China (Grant Nos. 2004CB318201 and 2011CB302300), the US National Science Foundation (Grant No. CCF-0621526), the National Natural Science Foundation of China (Grant No. 60703046), and HUST-SRF (Grant No. 2007Q021B).
Abstract: File semantics have proven effective in optimizing large-scale distributed file systems. As a consequence of the elaborate and rich I/O interfaces between upper-layer applications and file systems, the file system can provide useful and insightful semantic information. Hence, file semantic mining has become an increasingly important practice in both the engineering and research communities. Unfortunately, it is a challenge to exploit file semantic knowledge because a variety of factors could affect this information exploration process. Even worse, the challenges are exacerbated by the intricate interdependency between these factors, which makes it difficult to fully exploit the potentially important correlations among the various kinds of semantic knowledge. This article proposes a file access correlation mining and evaluation reference (FARMER) model, where a file is treated as a multivariate vector space and each item within the vector corresponds to a separate factor of the given file. The selection of factors depends on the application; examples of factors are file path, creator, and executing program. If a particular factor occurs in both files, its value is non-zero. It is clear that the extent of inter-file relationships can be measured based on the likeness of their factor values in the semantic vectors. Benefiting from this model, FARMER represents files as structured vectors of identifiers, and basic vector operations can be leveraged to quantify the correlation between two file vectors. The FARMER model leverages a linear regression model to estimate the strength of the relationship between file correlation and a set of influencing factors so that "bad knowledge" can be filtered out. To demonstrate the ability of the new FARMER model, FARMER is incorporated into a real large-scale object-based storage system as a case study to dynamically infer file correlations. In addition, FARMER-enabled optimization services for a metadata prefetching algorithm and an object data layout algorithm are implemented. Experimental results show that the FARMER-enabled prefetching algorithm reduces metadata operation latency by approximately 30%-40% when compared to a state-of-the-art metadata prefetching algorithm and a commonly used replacement policy.
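The vector-space idea described above can be sketched compactly: each file becomes a vector over factor identifiers, and a basic vector operation such as cosine similarity quantifies how closely two files are related. The factor names and unit weights below are illustrative placeholders, not FARMER's actual feature set or scoring.

```python
import math

def file_vector(factors):
    """Represent a file as a sparse vector: factor identifier -> weight."""
    return {f: 1.0 for f in factors}

def cosine(u, v):
    dot = sum(w * v.get(f, 0.0) for f, w in u.items())
    norm = math.sqrt(sum(w * w for w in u.values())) * math.sqrt(sum(w * w for w in v.values()))
    return dot / norm if norm else 0.0

# hypothetical factor values: path prefix, creator, executing program
a = file_vector(["path:/home/alice/proj", "creator:alice", "prog:gcc"])
b = file_vector(["path:/home/alice/proj", "creator:alice", "prog:ld"])
c = file_vector(["path:/var/log", "creator:root", "prog:syslogd"])

print("corr(a, b) =", round(cosine(a, b), 3))   # shares two factors -> high correlation
print("corr(a, c) =", round(cosine(a, c), 3))   # shares no factors  -> zero correlation
```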
Abstract: Clustering is one of the most widely used data mining techniques for creating homogeneous clusters. K-means is one of the popular clustering algorithms that, despite its inherent simplicity, also has some major problems. One way to resolve these problems and improve the k-means algorithm is to use evolutionary algorithms in clustering. In this study, the Imperialist Competitive Algorithm (ICA) is developed and then used in the clustering process. Clustering the IRIS, Wine, and CMC datasets with the developed ICA and comparing the results with those of the original ICA, GA, and PSO algorithms demonstrates the improvement achieved by the developed Imperialist Competitive Algorithm.
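The paper's developed ICA is not detailed in the abstract, but the basic shape of ICA-based clustering can be sketched: each candidate solution ("country") is a set of k centroids, its cost is the within-cluster sum of squared distances, and colonies are repeatedly pulled toward the best solutions ("imperialists"). The sketch below is a heavily simplified illustration on the Iris data (it omits the revolution and imperialistic-competition steps), and all parameter values are arbitrary assumptions.

```python
import numpy as np
from sklearn.datasets import load_iris

def sse_cost(centroids, X):
    """Within-cluster sum of squared distances for one candidate solution."""
    d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
    return (d.min(axis=1) ** 2).sum()

rng = np.random.default_rng(0)
X = load_iris().data
k, n_countries, n_imperialists, n_iters = 3, 30, 5, 200

# each country = k candidate centroids, initialised from random data points
countries = np.stack([X[rng.choice(len(X), k, replace=False)] for _ in range(n_countries)])

for _ in range(n_iters):
    costs = np.array([sse_cost(c, X) for c in countries])
    order = np.argsort(costs)
    imperialists, colonies = order[:n_imperialists], order[n_imperialists:]
    for col in colonies:
        # assimilation: move the colony a random step toward a random imperialist
        imp = countries[rng.choice(imperialists)]
        step = rng.uniform(0, 2) * rng.uniform(0, 1, size=(k, 1))
        countries[col] += step * (imp - countries[col])

best = countries[np.argmin([sse_cost(c, X) for c in countries])]
labels = np.linalg.norm(X[:, None, :] - best[None, :, :], axis=2).argmin(axis=1)
print("best SSE:", round(sse_cost(best, X), 2), "cluster sizes:", np.bincount(labels))
```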
Funding: The work was supported by the National Natural Science Foundation of China under Grant No. 61374135.
Abstract: Purpose – Among the growing number of data mining (DM) techniques, outlier detection has gained importance in many applications and has attracted much attention in recent times. In the past, outlier detection research in safety care could be viewed as searching for needles in a haystack. However, outliers are not always erroneous. Therefore, the purpose of this paper is to investigate the role of outliers in healthcare services in general and in patient safety care in particular. Design/methodology/approach – A combined DM technique (clustering and the nearest neighbor) is used for outlier detection, which provides a clear understanding and meaningful insights for visualizing data behaviour for healthcare safety. The outcomes, or the implicit knowledge, are vitally essential to a proper clinical decision-making process. The method is important to the semantics, and the novel treatment of patients' events and situations is shown to play a significant role in the process of patient safety care and medication. Findings – The paper discusses a novel and integrated methodology that can be applied to the analysis of different biological data. Integrated DM techniques are discussed as a way to optimize performance in the fields of health and medical science. The integrated outlier detection method can be extended to search for valuable information and implicit knowledge based on selected patient factors. On this basis, outliers are detected as clusters and point events, and novel ideas are proposed to empower clinical services with customer satisfaction in mind. It can also serve as a baseline for further healthcare strategic development and research work. Research limitations/implications – This paper mainly focused on outlier detection. Outlier isolation, which is essential to investigate why an outlier happened and to communicate how to mitigate it, was not touched upon. Therefore, the research can be extended towards the hierarchy of patient problems. Originality/value – DM is a dynamic and successful gateway to discovering useful knowledge for enhancing healthcare performance and patient safety. Clinical-data-based outlier detection is a basic task for achieving a healthcare strategy. Therefore, in this paper, the authors focused on combined DM techniques for a deep analysis of clinical data, which provides an optimal level of clinical decision making. Proper clinical decisions can be obtained through attribute selection, which is important for knowing the influential factors or parameters of healthcare services. Therefore, using integrated clustering and nearest-neighbor techniques gives a more acceptable search for such complex data outliers, which could be fundamental to further analysis of healthcare and patient safety situations.
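A minimal version of the combined clustering-and-nearest-neighbor idea above is to cluster the records, then score each record by its distance to its cluster centre plus its distance to its k-th nearest neighbor, and flag the largest scores as outliers. The sketch below uses synthetic numeric records and scikit-learn; the score, threshold, and feature values are illustrative assumptions, not clinical data or the paper's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

rng = np.random.default_rng(1)
# synthetic stand-in for numeric clinical attributes (two dense groups + a few oddities)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(6, 1, (100, 2)), [[3, 12], [-5, 8]]])

km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
dist_to_centre = np.linalg.norm(X - km.cluster_centers_[km.labels_], axis=1)

# nearest-neighbor score: distance to the 5th nearest neighbor
nn = NearestNeighbors(n_neighbors=6).fit(X)   # the query point itself is included
knn_dist = nn.kneighbors(X)[0][:, -1]

score = dist_to_centre + knn_dist             # simple combined outlier score
outliers = np.argsort(score)[-3:]
print("flagged record indices:", outliers, "scores:", np.round(score[outliers], 2))
```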
Funding: Project supported by the National Natural Science Foundation of China (Nos. 10876036 and 70871111) and the Ningbo Natural Science Foundation, China (No. 2010A610113).
Abstract: To overcome the failure of traditional association rule mining to eliminate suspicious patterns or association rules, we propose a novel method to mine item-item and between-set correlated association rules. First, we present three measurements: the association, correlation, and item-set correlation measurements. In the association measurement, the all-confidence measure is used to filter suspicious cross-support patterns, while the all-item-confidence measure is applied in the correlation measurement to eliminate spurious association rules that contain negatively correlated items. Then, we define the item-set correlation measurement and show its corresponding properties. By using this measurement, spurious association rules in which the antecedent and consequent item-sets are negatively correlated can be eliminated. Finally, we propose item-item and between-set correlated association rules and two mining algorithms, I&ISCoMine_AP and I&ISCoMine_CT. Experimental results with synthetic and real retail datasets show that the proposed method is effective and valid.
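The all-confidence measure mentioned above is commonly defined as the support of an itemset divided by the largest support of any single item in it, so cross-support patterns (which mix very frequent and very rare items) score low. The sketch below computes it over a toy transaction list; the transactions are illustrative, and the paper's other measures are not reproduced.

```python
from itertools import combinations

transactions = [
    {"bread", "milk"}, {"bread", "milk", "beer"}, {"milk", "beer"},
    {"bread", "milk"}, {"bread", "diaper"}, {"milk", "diaper"},
]

def support(itemset):
    return sum(itemset <= t for t in transactions) / len(transactions)

def all_confidence(itemset):
    """supp(X) / max single-item support within X; low for cross-support patterns."""
    return support(itemset) / max(support({i}) for i in itemset)

items = sorted(set().union(*transactions))
for pair in combinations(items, 2):
    x = set(pair)
    print(pair, "support=%.2f" % support(x), "all-conf=%.2f" % all_confidence(x))
```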
Abstract: Over the years, online social networks have evolved from profile and communication websites to online portals where people interact with each other, share and consume multimedia-enriched data, and play different types of games. Due to the immense popularity of these online games and their huge revenue potential, the number of these games increases every day, resulting in a current offering of thousands of online social games. In this paper, the applicability of neighborhood-based collaborative filtering (CF) algorithms for the recommendation of online social games is evaluated. This evaluation is based on a large dataset of an online social gaming platform containing game ratings (explicit data) and online gaming behavior (implicit data) of millions of active users. Several similarity metrics were implemented and evaluated on the explicit data, the implicit data, and a combination thereof. It is shown that the neighborhood-based CF algorithms greatly outperform the content-based algorithm currently often used on online social gaming websites. The results also show that a combined approach, i.e., taking into account both implicit and explicit data at the same time, yields overall good results on all evaluation metrics for all scenarios, while performing only slightly worse than the explicit-only or implicit-only approaches in their respective areas of strength. The best performing algorithms have been implemented in a live setup of the online game platform.
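Neighborhood-based CF of the kind evaluated above can be sketched with a small user-item rating matrix: compute user-user cosine similarities over co-rated items, then predict a missing rating as the similarity-weighted average of the neighbors' ratings. The matrix, the neighborhood size, and the fallback rule are illustrative assumptions, not the platform's data or metrics.

```python
import numpy as np

# rows: users, columns: games; 0 means "not rated" (hypothetical explicit ratings)
R = np.array([[5, 4, 0, 1],
              [4, 5, 1, 0],
              [1, 0, 5, 4],
              [0, 1, 4, 5]], dtype=float)

def cosine_sim(a, b):
    mask = (a > 0) & (b > 0)                      # compare only co-rated games
    if not mask.any():
        return 0.0
    return float(a[mask] @ b[mask] / (np.linalg.norm(a[mask]) * np.linalg.norm(b[mask])))

def predict(user, item, k=2):
    sims = np.array([cosine_sim(R[user], R[v]) if v != user and R[v, item] > 0 else 0.0
                     for v in range(len(R))])
    top = np.argsort(sims)[-k:]
    top = top[sims[top] > 0]
    if len(top) == 0:
        return R[R[:, item] > 0, item].mean()     # fall back to the item's mean rating
    return float(sims[top] @ R[top, item] / sims[top].sum())

print("predicted rating of user 0 for game 2: %.2f" % predict(0, 2))
```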
Funding: Supported by the National Natural Science Foundation of China (Nos. 61073117 and 61175046), the Provincial Natural Science Research Program of Higher Education Institutions of Anhui Province (No. KJ2013A016), the Academic Innovative Research Projects of Anhui University Graduate Students (No. 10117700183), and the 211 Project of Anhui University.
Abstract: Mining from ambiguous data is very important in data mining. This paper discusses one of the tasks of mining from ambiguous data, known as the multi-instance problem. In the multi-instance problem, each pattern is a labeled bag that consists of a number of unlabeled instances. A bag is negative if all instances in it are negative, and positive if it has at least one positive instance. Because the instances in a positive bag are not individually labeled, each positive bag is ambiguous. The mining aim is to classify unseen bags. The main idea of existing multi-instance algorithms is to find the true positive instances in positive bags and convert the multi-instance problem into a supervised problem, obtaining the labels of test bags by predicting the labels of their unknown instances. In this paper, we aim at mining multi-instance data from another point of view, i.e., excluding the false positive instances in positive bags and predicting the label of an entire unknown bag. We propose an algorithm called Multi-Instance Covering kNN (MICkNN) for mining from multi-instance data. Briefly, a constructive covering algorithm is first utilized to restructure the original multi-instance data. Then, the kNN algorithm is applied to discriminate the false positive instances. In the test stage, we label a tested bag directly according to the similarity between the unseen bag and the sphere neighbors obtained in the previous two steps. Experimental results demonstrate that the proposed algorithm is competitive with most of the state-of-the-art multi-instance methods in both classification accuracy and running time.
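MICkNN itself cannot be reconstructed from this abstract, but the general flavour of nearest-neighbor classification over bags can be sketched with a bag-level distance such as the minimal Hausdorff distance (the smallest instance-to-instance distance between two bags) and a plain kNN vote. Everything below, including the bags, the distance choice, and k, is an illustrative assumption rather than the authors' method.

```python
import numpy as np
from collections import Counter

def min_hausdorff(bag_a, bag_b):
    """Minimal Hausdorff distance: the closest pair of instances across two bags."""
    return min(np.linalg.norm(a - b) for a in bag_a for b in bag_b)

def knn_bag_predict(train_bags, train_labels, test_bag, k=3):
    order = sorted(range(len(train_bags)),
                   key=lambda i: min_hausdorff(train_bags[i], test_bag))
    votes = Counter(train_labels[i] for i in order[:k])
    return votes.most_common(1)[0][0]

# hypothetical bags: positive bags contain at least one instance near (5, 5)
train_bags = [np.array([[0.1, 0.2], [0.3, 0.1]]),            # negative
              np.array([[0.2, 0.0], [5.1, 4.9]]),            # positive
              np.array([[4.8, 5.2], [0.4, 0.3]]),            # positive
              np.array([[0.0, 0.1]])]                        # negative
train_labels = [0, 1, 1, 0]

test_bag = np.array([[0.2, 0.3], [5.0, 5.1]])
print("predicted bag label:", knn_bag_predict(train_bags, train_labels, test_bag))
```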
基金supported by the "National Science Council" under Grant Nos. NSC 97-2218-E-155-001 and NSC96-2752-E-001-001-PAEthe Research Center for Humanities and Social Sciencesthe Thematic Program of "Academia Sinica" under Grant No.AS95ASIA02
Abstract: The massive flow of scholarly publications from traditional paper journals to online outlets has benefited biologists because of its ease of access. However, due to the sheer volume of available biological literature, researchers are finding it increasingly difficult to locate needed information. As a result, recent biology contests, notably JNLPBA and BioCreAtIvE, have focused on evaluating various methods by which the literature may be navigated. Among these methods, text-mining technology has shown the most promise. With recent advances in text-mining technology and the fact that publishers are now making the full texts of articles available in XML format, text-mining systems (TMSs) can be adapted to accelerate literature curation, maintain the integrity of information, and ensure proper linkage of data to other resources. Even so, several new challenges have emerged in relation to full text analysis, life-science terminology, complex relation extraction, and information fusion. These challenges must be overcome in order for text-mining to be more effective. In this paper, we identify the challenges, discuss how they might be overcome, and consider the resources that may be helpful in achieving that goal.
Abstract: Personalization is the adaptation of services to fit the user's interests, characteristics, and needs. The key to effective personalization is user profiling. Apart from traditional collaborative and content-based approaches, a number of classification and clustering algorithms have been used to classify user-related information to create user profiles. However, they are not able to achieve accurate user profiles. In this paper, we present a new clustering algorithm, namely Multi-Dimensional Clustering (MDC), to determine user profiles. MDC is a version of the Instance-Based Learner (IBL) algorithm that assigns weights to feature values and considers these weights during clustering. Three feature-weighting methods are proposed for MDC, and all three have been tested and evaluated. Simulations were conducted using two user profile datasets: a training dataset (10,000 instances) and a test dataset (1,000 instances). These datasets reflect each user's personal information, preferences, and interests. Additional simulations and comparisons with existing weighted and non-weighted instance-based algorithms were carried out in order to demonstrate the performance of the proposed algorithm. Experimental results using the user profile datasets demonstrate that the proposed algorithm has better clustering accuracy than the other algorithms. This work is based on the doctoral thesis of the corresponding author.
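The core mechanism described above, feature weights that stretch or shrink each dimension before distances are compared, can be illustrated with a tiny weighted nearest-prototype assignment. The sketch below uses a weighted Euclidean distance to assign profile instances to the closer of two cluster prototypes; the feature weights, prototypes, and instances are illustrative placeholders, not the MDC algorithm or its datasets.

```python
import numpy as np

# hypothetical encoded user-profile features: [age_norm, sports_interest, music_interest]
instances = np.array([[0.2, 0.9, 0.1],
                      [0.3, 0.8, 0.2],
                      [0.7, 0.1, 0.9],
                      [0.8, 0.2, 0.8]])

prototypes = np.array([[0.25, 0.85, 0.15],   # cluster 0: young, sports-oriented
                       [0.75, 0.15, 0.85]])  # cluster 1: older, music-oriented

weights = np.array([0.5, 2.0, 2.0])          # interests weighted more heavily than age here

def weighted_dist(x, c, w):
    """Euclidean distance with per-feature weights applied to squared differences."""
    return np.sqrt(np.sum(w * (x - c) ** 2))

assignments = [int(np.argmin([weighted_dist(x, c, weights) for c in prototypes]))
               for x in instances]
print("cluster assignments:", assignments)   # expected: [0, 0, 1, 1]
```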