Abstract
Feature selection is an important problem in classification for data mining. This paper derives SCD, a new measure of the relevance between a feature and the class, defined as the coefficient of variation (CV) describing the class distribution over the sequence of feature values. Based on this measure, it presents StaFSOS, a scalable feature selection algorithm that is linear in I/O. The SCD measure is proven to satisfy the monotonicity required by the branch-and-bound method when there are exactly two classes, which yields a complete form of StaFSOS named BBStaFS. On 12 standard benchmark data sets, the features selected by StaFSOS are almost identical to the target feature sets, and StaFSOS is more efficient than the other algorithms; on one further benchmark, BBStaFS obtains the exact target features. In a test on real-world data with 1000 samples and 20 features, the running time of StaFSOS is half that of GRSR, one of the fastest current algorithms, and the selected feature set is accurate and effective.
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
Peking University Core Journals (北大核心)
2005, No. 7, pp. 1223-1229 (7 pages)
Funding
Supported by the National High Technology Research and Development Program of China (863 Program), Grant No. 2004AA114030.
Keywords
data mining
classification
feature selection