
An Unsupervised Feature Selection Approach Based on Mutual Information

Cited by: 69
Abstract: In data analysis, feature selection can be used to reduce the redundancy of features, improve the comprehensibility of models, and identify hidden structures in high-dimensional data. In this paper, we propose a novel unsupervised feature selection approach based on mutual information, called UFS-MI. In UFS-MI, the importance of each feature is evaluated with a selection criterion, UmRMR (unsupervised minimum redundancy and maximum relevance), which takes both relevance and redundancy into account. Both quantities are measured by mutual information: relevance as the dependence of a feature on the latent class variable, and redundancy as the dependence between features. Features are selected or ranked in a stepwise way, one at a time, by estimating the capability of each candidate feature to decrease the uncertainty of the other features (i.e., its capability to retain the information contained in them). The effectiveness of UFS-MI is confirmed by a theoretical proof showing that it selects features highly correlated with the latent class. An empirical comparison between UFS-MI and several traditional feature selection methods, conducted on popular data sets, shows that UFS-MI attains better or comparable performance and is applicable to both numerical and non-numerical features.
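The abstract describes UmRMR only at a high level, so the following Python sketch is purely illustrative rather than the authors' algorithm: it assumes that the relevance of a candidate feature can be approximated by its average mutual information with the remaining features (a stand-in for the unobservable latent class variable) and that redundancy is its average mutual information with the features already selected. The helper names `mutual_information` and `ufs_mi_sketch`, the histogram MI estimator, and the simple score difference are all assumptions of this sketch, not the paper's exact formulation.

```python
import numpy as np

def mutual_information(x, y, bins=10):
    """Estimate I(X; Y) in nats from two 1-D samples via a 2-D histogram."""
    joint, _, _ = np.histogram2d(x, y, bins=bins)
    pxy = joint / joint.sum()
    px = pxy.sum(axis=1, keepdims=True)   # marginal of x, shape (bins, 1)
    py = pxy.sum(axis=0, keepdims=True)   # marginal of y, shape (1, bins)
    nz = pxy > 0                          # skip empty cells to avoid log(0)
    return float(np.sum(pxy[nz] * np.log(pxy[nz] / (px @ py)[nz])))

def ufs_mi_sketch(X, k, bins=10):
    """Greedy unsupervised mRMR-style selection of k columns of X (a sketch)."""
    n_features = X.shape[1]
    # Pairwise MI between all feature columns, computed once.
    mi = np.zeros((n_features, n_features))
    for i in range(n_features):
        for j in range(i + 1, n_features):
            mi[i, j] = mi[j, i] = mutual_information(X[:, i], X[:, j], bins)

    selected, remaining = [], list(range(n_features))
    while remaining and len(selected) < k:
        best, best_score = None, -np.inf
        for f in remaining:
            others = [g for g in remaining if g != f]
            # Relevance surrogate: how much of the other features'
            # information this candidate retains, on average.
            relevance = mi[f, others].mean() if others else 0.0
            # Redundancy: average MI with already-selected features.
            redundancy = mi[f, selected].mean() if selected else 0.0
            score = relevance - redundancy  # mRMR-style trade-off
            if score > best_score:
                best, best_score = f, score
        selected.append(best)
        remaining.remove(best)
    return selected

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    signal = rng.normal(size=(500, 3))
    # Three signal columns, three near-duplicates, three pure-noise columns.
    X = np.hstack([signal,
                   signal + 0.1 * rng.normal(size=(500, 3)),
                   rng.normal(size=(500, 3))])
    print(ufs_mi_sketch(X, k=3))  # expect mostly signal-bearing columns
```

Because the MI estimate only needs a joint histogram, the same sketch works for numerical features (after binning) and for non-numerical ones (after integer coding), which is consistent with the paper's claim that UFS-MI handles both feature types.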
Source: Journal of Computer Research and Development (《计算机研究与发展》), 2012, No. 2, pp. 372-382 (11 pages). Indexed in EI, CSCD, and the Peking University Core Journal list.
Funding: National Natural Science Foundation of China (61073029, 90818027, 60633010); National High-Tech R&D Program of China (863 Program, 2009AA01Z147); National Basic Research Program of China (973 Program, 2009CB320703).
Keywords: feature selection; unsupervised feature selection; mutual information; minimum redundancy and maximum relevance; unsupervised minimum redundancy and maximum relevance



