Abstract
Large numbers of redundant features in massive datasets can degrade classification performance. To address this problem, an automatic feature selection method named FCC-MI, which integrates Mutual Information (MI) and Fuzzy C-Means (FCM) clustering, was proposed. First, MI and its correlation function were analyzed, and the features were ranked by their correlation values. Next, the data were grouped according to the feature with the maximum correlation, and the optimal number of features was determined automatically by FCM clustering. Finally, the features were selected according to their correlation values. Experiments were conducted on seven datasets from the UCI machine learning repository, comparing FCC-MI with three methods from the literature: WCMFS (Within-Class variance and correlation Measure Feature Selection), B-AMBDMI (feature selection Based on Approximate Markov Blanket and Dynamic Mutual Information), and T-MI-GA (Two-stage feature selection based on MI and a Genetic Algorithm). Theoretical analysis and experimental results show that FCC-MI not only improves the efficiency of data classification, but also determines the optimal feature subset automatically while preserving classification accuracy, thereby reducing the number of features in the dataset. The method is therefore suitable for feature reduction and analysis of massive data whose features are highly correlated.
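The abstract only outlines the FCC-MI pipeline, so the following Python sketch is a rough illustration rather than the authors' implementation. It assumes that a feature's "correlation" can be approximated by its mutual information with the class label, I(X; C) = Σ p(x, c) log[ p(x, c) / (p(x) p(c)) ], estimated here with scikit-learn's mutual_info_classif, and it uses a small hand-written one-dimensional fuzzy C-means over those scores to decide how many of the top-ranked features to keep. The helpers fcm_1d and fcc_mi_select are hypothetical names introduced for this sketch only.

# Minimal sketch of the FCC-MI idea described in the abstract (assumptions noted above).
import numpy as np
from sklearn.feature_selection import mutual_info_classif


def fcm_1d(values, c=2, m=2.0, max_iter=100, tol=1e-5, seed=0):
    """Fuzzy C-means on a 1-D array of scores; returns cluster centers and memberships."""
    rng = np.random.default_rng(seed)
    x = np.asarray(values, dtype=float).reshape(-1, 1)
    u = rng.random((len(x), c))
    u /= u.sum(axis=1, keepdims=True)                     # random initial membership matrix
    for _ in range(max_iter):
        um = u ** m
        centers = (um.T @ x) / um.sum(axis=0).reshape(-1, 1)   # membership-weighted centers
        dist = np.abs(x - centers.T) + 1e-12                   # |score_i - center_k|
        new_u = 1.0 / dist ** (2.0 / (m - 1.0))
        new_u /= new_u.sum(axis=1, keepdims=True)              # standard FCM membership update
        if np.linalg.norm(new_u - u) < tol:
            u = new_u
            break
        u = new_u
    return centers.ravel(), u


def fcc_mi_select(X, y):
    """Rank features by MI with the class and keep those whose score falls in the
    FCM cluster with the highest center (one possible reading of FCC-MI)."""
    mi = mutual_info_classif(X, y)             # I(feature; class) for every column of X
    centers, u = fcm_1d(mi)
    labels = u.argmax(axis=1)                  # hard assignment of each feature's score
    best = int(centers.argmax())               # cluster of "highly relevant" features
    selected = np.flatnonzero(labels == best)
    return selected[np.argsort(-mi[selected])] # retained features, most relevant first

Given a feature matrix X and class labels y from one of the UCI datasets, fcc_mi_select(X, y) would return the indices of the retained features, ordered by decreasing relevance; the number of features kept is set by the cluster sizes rather than by a user-chosen threshold.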
Source
《计算机应用》
CSCD
Peking University Core Journal (北大核心)
2014, Issue 9, pp. 2608-2611 and 2649 (5 pages)
Journal of Computer Applications
Keywords
Mutual Information (MI)
feature selection
Fuzzy C-Means (FCM) clustering
data grouping