Abstract
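Feature selection is an important problem in pattern recognition, machine learning, data mining and related fields. It has become a research focus in recent years, and a large number of feature selection algorithms have emerged. Most existing algorithms target only one specific domain, so their applicability is limited. Using a kernel method based on the Hilbert-Schmidt independence criterion (HSIC) to measure the dependence between a feature subset and the target concept, this paper proposes a more widely applicable feature selection method, FSM-HSIC. The method unifies the feature selection process under the supervised, semi-supervised and unsupervised models, allows the whole process to be described abstractly from the kernel perspective, and gives deeper insight into some existing algorithms. Building on this method, a novel algorithm, FSI, is designed for the interactive feature selection problem. Theoretical analysis and extensive experiments on real and simulated data show that, compared with several feature selection algorithms, the proposed algorithm has good efficiency and stability, and that the FSM-HSIC method offers important guidance for developing new algorithms.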
Feature selection is one of the most important problems in pattern recognition, machine learning and data mining, where it serves as a basic pre-processing step for compressing data. Most current algorithms were proposed separately for a particular domain, which limits their applicability. In particular, different applications often fall under different supervision models: supervised, semi-supervised or unsupervised. A concrete feature selection algorithm is usually designed for one given setting. When the setting changes, an algorithm that used to run smoothly and efficiently becomes inefficient or even useless, and a new algorithm has to be developed. This paper presents a general feature selection method based on the Hilbert-Schmidt Independence Criterion, which evaluates the dependence between a feature subset and the target concept. The method exploits intrinsic properties of feature selection under the supervised, semi-supervised and unsupervised models and expresses all three in a uniform format. Furthermore, some existing algorithms can be explained from the viewpoint of kernel-based methods, which brings a deeper understanding of them. A novel algorithm is also derived from this method to solve a challenging problem known as interactive feature selection. The experimental results demonstrate the efficiency and stability of the algorithm, and they suggest that the method can give considerable guidance for developing novel feature selection algorithms.
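To make the dependence measure mentioned in the abstract more concrete, the following is a minimal sketch of the standard biased empirical HSIC estimate, tr(KHLH)/(n-1)^2, where K and L are Gram matrices of the features and the target and H is the centering matrix. It is an illustration only and not the paper's FSM-HSIC or FSI algorithm: the Gaussian kernel, the bandwidth `sigma`, and the helper names `gaussian_kernel_matrix`, `center`, `hsic` and `rank_features` are all assumptions introduced here. The ranking helper scores each feature on its own in a supervised setting, so it would not capture the interactive features that the paper's FSI algorithm addresses.

```python
import math


def gaussian_kernel_matrix(rows, sigma=1.0):
    """Gram matrix K[i][j] = exp(-||x_i - x_j||^2 / (2 * sigma^2)).

    The Gaussian kernel and sigma are illustrative assumptions.
    """
    n = len(rows)
    return [
        [
            math.exp(
                -sum((a - b) ** 2 for a, b in zip(rows[i], rows[j]))
                / (2.0 * sigma ** 2)
            )
            for j in range(n)
        ]
        for i in range(n)
    ]


def center(K):
    """Double-center a Gram matrix: H K H with H = I - (1/n) 1 1^T."""
    n = len(K)
    row_mean = [sum(row) / n for row in K]
    col_mean = [sum(K[i][j] for i in range(n)) / n for j in range(n)]
    total_mean = sum(row_mean) / n
    return [
        [K[i][j] - row_mean[i] - col_mean[j] + total_mean for j in range(n)]
        for i in range(n)
    ]


def hsic(X, Y, sigma=1.0):
    """Biased empirical HSIC: tr(K H L H) / (n - 1)^2.

    Because H is idempotent and the centered matrices are symmetric,
    the trace equals the element-wise product sum of H K H and H L H.
    """
    n = len(X)
    Kc = center(gaussian_kernel_matrix(X, sigma))
    Lc = center(gaussian_kernel_matrix(Y, sigma))
    total = sum(Kc[i][j] * Lc[i][j] for i in range(n) for j in range(n))
    return total / (n - 1) ** 2


def rank_features(X, y, sigma=1.0):
    """Hypothetical helper: rank single features by HSIC with the target,
    most dependent first (supervised case only)."""
    target = [[float(v)] for v in y]
    scores = [
        (hsic([[row[f]] for row in X], target, sigma), f)
        for f in range(len(X[0]))
    ]
    return [f for _, f in sorted(scores, reverse=True)]


if __name__ == "__main__":
    # Toy data: feature 0 copies the label, feature 1 is constant.
    X = [[0.0, 5.0], [0.0, 5.0], [0.0, 5.0], [1.0, 5.0], [1.0, 5.0], [1.0, 5.0]]
    y = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
    print(rank_features(X, y))
```

A subset-level search would pass several columns at once to `hsic` instead of one column per call; how subsets are searched and how the target kernel is built in the semi-supervised and unsupervised cases is specific to the paper and is not reproduced here.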
Source
《计算机研究与发展》
EI
CSCD
PKU Core (Peking University Chinese Core Journals)
2010, No. 9, pp. 1548-1557 (10 pages)
Journal of Computer Research and Development
Funding
National High Technology Research and Development Program of China ("863" Program) (2006AA01Z451, 2007AA01Z474, 2007AA010502)
National Natural Science Foundation of China (60873204)
Keywords
data mining
pattern recognition
feature selection
kernel-based method
interactive feature
stability