The dimensionality of data is increasing very rapidly,which creates challenges for most of the current mining and learning algorithms,such as large memory requirements and high computational costs.The literature inclu...The dimensionality of data is increasing very rapidly,which creates challenges for most of the current mining and learning algorithms,such as large memory requirements and high computational costs.The literature includes much research on feature selection for supervised learning.However,feature selection for unsupervised learning has only recently been studied.Finding the subset of features in unsupervised learning that enhances the performance is challenging since the clusters are indeterminate.This work proposes a hybrid technique for unsupervised feature selection called GAk-MEANS,which combines the genetic algorithm(GA)approach with the classical k-Means algorithm.In the proposed algorithm,a new fitness func-tion is designed in addition to new smart crossover and mutation operators.The effectiveness of this algorithm is demonstrated on various datasets.Fur-thermore,the performance of GAk-MEANS has been compared with other genetic algorithms,such as the genetic algorithm using the Sammon Error Function and the genetic algorithm using the Sum of Squared Error Function.Additionally,the performance of GAk-MEANS is compared with the state-of-the-art statistical unsupervised feature selection techniques.Experimental results show that GAk-MEANS consistently selects subsets of features that result in better classification accuracy compared to others.In particular,GAk-MEANS is able to significantly reduce the size of the subset of selected features by an average of 86.35%(72%–96.14%),which leads to an increase of the accuracy by an average of 3.78%(1.05%–6.32%)compared to using all features.When compared with the genetic algorithm using the Sammon Error Function,GAk-MEANS is able to reduce the size of the subset of selected features by 41.29%on average,improve the accuracy by 5.37%,and reduce the time by 70.71%.When compared with the genetic algorithm using the Sum of Squared Error Function,GAk-MEANS on average is able to reduce the size of the subset of selected features by 15.91%,and improve the accuracy by 9.81%,but the time is increased by a factor of 3.When compared with the machine-learning based methods,we observed that GAk-MEANS is able to increase the accuracy by 13.67%on average with an 88.76%average increase in time.展开更多
As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability throug...As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability through unsupervised feature selection.Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words.The discrimination capability of a word is approximated by the Information Gain of the word for topics,which is used to distinguish between "general word" and "special word" in LDA topics.Therefore,we add a constraint to the LDA objective function to let the "general words" only happen in "general topics" other than "special topics".Then a heuristic algorithm is presented to get the solution.Experiments show that this method can not only improve the information gain of topics,but also make the topics easier to understand by human.展开更多
Unsupervised feature selection has become an important and challenging problem faced with vast amounts of unlabeled and high-dimension data in machine learning. We propose a novel unsupervised feature selection method...Unsupervised feature selection has become an important and challenging problem faced with vast amounts of unlabeled and high-dimension data in machine learning. We propose a novel unsupervised feature selection method using Structured Self-Representation( SSR) by simultaneously taking into account the selfrepresentation property and local geometrical structure of features. Concretely,according to the inherent selfrepresentation property of features,the most representative features can be selected. Mean while,to obtain more accurate results,we explore local geometrical structure to constrain the representation coefficients to be close to each other if the features are close to each other. Furthermore,an efficient algorithm is presented for optimizing the objective function. Finally,experiments on the synthetic dataset and six benchmark real-world datasets,including biomedical data,letter recognition digit data and face image data,demonstrate the encouraging performance of the proposed algorithm compared with state-of-the-art algorithms.展开更多
Feature selection (FS) is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that not all features are important. Some of the features ma...Feature selection (FS) is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that not all features are important. Some of the features may be redundant, and others may be irrelevant and noisy. The conventional supervised FS methods evaluate various feature subsets using an evaluation function or metric to select only those features which are related to the decision classes of the data under consideration. However, for many data mining applications, decision class labels are often unknown or incomplete, thus indicating the significance of unsupervised feature selection. However, in unsupervised learning, decision class labels are not provided. In this paper, we propose a new unsupervised quick reduct (QR) algorithm using rough set theory. The quality of the reduced data is measured by the classification performance and it is evaluated using WEKA classifier tool. The method is compared with existing supervised methods and the result demonstrates the efficiency of the proposed algorithm.展开更多
文摘The dimensionality of data is increasing very rapidly,which creates challenges for most of the current mining and learning algorithms,such as large memory requirements and high computational costs.The literature includes much research on feature selection for supervised learning.However,feature selection for unsupervised learning has only recently been studied.Finding the subset of features in unsupervised learning that enhances the performance is challenging since the clusters are indeterminate.This work proposes a hybrid technique for unsupervised feature selection called GAk-MEANS,which combines the genetic algorithm(GA)approach with the classical k-Means algorithm.In the proposed algorithm,a new fitness func-tion is designed in addition to new smart crossover and mutation operators.The effectiveness of this algorithm is demonstrated on various datasets.Fur-thermore,the performance of GAk-MEANS has been compared with other genetic algorithms,such as the genetic algorithm using the Sammon Error Function and the genetic algorithm using the Sum of Squared Error Function.Additionally,the performance of GAk-MEANS is compared with the state-of-the-art statistical unsupervised feature selection techniques.Experimental results show that GAk-MEANS consistently selects subsets of features that result in better classification accuracy compared to others.In particular,GAk-MEANS is able to significantly reduce the size of the subset of selected features by an average of 86.35%(72%–96.14%),which leads to an increase of the accuracy by an average of 3.78%(1.05%–6.32%)compared to using all features.When compared with the genetic algorithm using the Sammon Error Function,GAk-MEANS is able to reduce the size of the subset of selected features by 41.29%on average,improve the accuracy by 5.37%,and reduce the time by 70.71%.When compared with the genetic algorithm using the Sum of Squared Error Function,GAk-MEANS on average is able to reduce the size of the subset of selected features by 15.91%,and improve the accuracy by 9.81%,but the time is increased by a factor of 3.When compared with the machine-learning based methods,we observed that GAk-MEANS is able to increase the accuracy by 13.67%on average with an 88.76%average increase in time.
基金supported by National Nature Science Foundation of China under Grant No.60905017,61072061National High Technical Research and Development Program of China(863 Program)under Grant No.2009AA01A346+1 种基金111 Project of China under Grant No.B08004the Special Project for Innovative Young Researchers of Beijing University of Posts and Telecommunications
文摘As a generative model,Latent Dirichlet Allocation Model,which lacks optimization of topics' discrimination capability focuses on how to generate data,This paper aims to improve the discrimination capability through unsupervised feature selection.Theoretical analysis shows that the discrimination capability of a topic is limited by the discrimination capability of its representative words.The discrimination capability of a word is approximated by the Information Gain of the word for topics,which is used to distinguish between "general word" and "special word" in LDA topics.Therefore,we add a constraint to the LDA objective function to let the "general words" only happen in "general topics" other than "special topics".Then a heuristic algorithm is presented to get the solution.Experiments show that this method can not only improve the information gain of topics,but also make the topics easier to understand by human.
基金Sponsored by the Major Program of National Natural Science Foundation of China(Grant No.13&ZD162)the Applied Basic Research Programs of China National Textile and Apparel Council(Grant No.J201509)
文摘Unsupervised feature selection has become an important and challenging problem faced with vast amounts of unlabeled and high-dimension data in machine learning. We propose a novel unsupervised feature selection method using Structured Self-Representation( SSR) by simultaneously taking into account the selfrepresentation property and local geometrical structure of features. Concretely,according to the inherent selfrepresentation property of features,the most representative features can be selected. Mean while,to obtain more accurate results,we explore local geometrical structure to constrain the representation coefficients to be close to each other if the features are close to each other. Furthermore,an efficient algorithm is presented for optimizing the objective function. Finally,experiments on the synthetic dataset and six benchmark real-world datasets,including biomedical data,letter recognition digit data and face image data,demonstrate the encouraging performance of the proposed algorithm compared with state-of-the-art algorithms.
基金supported by the UGC, SERO, Hyderabad under FDP during XI plan periodthe UGC, New Delhi for financial assistance under major research project Grant No. F-34-105/2008
文摘Feature selection (FS) is a process to select features which are more informative. It is one of the important steps in knowledge discovery. The problem is that not all features are important. Some of the features may be redundant, and others may be irrelevant and noisy. The conventional supervised FS methods evaluate various feature subsets using an evaluation function or metric to select only those features which are related to the decision classes of the data under consideration. However, for many data mining applications, decision class labels are often unknown or incomplete, thus indicating the significance of unsupervised feature selection. However, in unsupervised learning, decision class labels are not provided. In this paper, we propose a new unsupervised quick reduct (QR) algorithm using rough set theory. The quality of the reduced data is measured by the classification performance and it is evaluated using WEKA classifier tool. The method is compared with existing supervised methods and the result demonstrates the efficiency of the proposed algorithm.