Abstract
To reduce the dimensionality of the feature vector and improve the precision of Chinese Web page classification, feature selection is performed with a method that combines Correlation-based Feature Selection (CFS) with a Genetic Algorithm (GA). In this CFS-GA algorithm, each feature subset is binary-encoded as a GA chromosome, and the CFS heuristic value serves as the GA fitness function for evaluating individuals: the larger an individual's CFS value, the higher its probability of being passed on to the next generation. Combined with the global search capability of GA, the algorithm can ensure that the resulting feature subset is globally optimal. Experiments are conducted on the Weka platform with the Chinese Web page dataset provided by Sogou Labs. The results show that the algorithm effectively reduces the dimensionality of the feature space and improves classification precision.
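The scheme described in the abstract (binary-encoded feature subsets, CFS value as fitness, fitness-proportional survival) can be illustrated with a minimal Python sketch. It is not the authors' Weka implementation. Several details are assumptions, since the abstract does not give them: the fitness is taken to be the commonly cited CFS merit, k·r_cf / sqrt(k + k(k-1)·r_ff), where r_cf is the mean feature-class correlation and r_ff the mean feature-feature correlation of the k selected features; absolute Pearson correlation stands in for the correlation measure; and roulette-wheel selection, single-point crossover, bit-flip mutation, and all names and parameter values (`cfs_merit`, `cfs_ga`, `pop_size`, `p_cross`, `p_mut`) are hypothetical choices made for illustration.

```python
import math
import random


def pearson(x, y):
    """Absolute Pearson correlation; 0.0 if either column is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0 or syy == 0:
        return 0.0
    return abs(sxy / math.sqrt(sxx * syy))


def cfs_merit(mask, columns, labels):
    """Assumed CFS merit of the subset encoded by a 0/1 mask (one bit per feature)."""
    idx = [i for i, bit in enumerate(mask) if bit]
    k = len(idx)
    if k == 0:
        return 0.0  # an empty subset carries no information
    r_cf = sum(pearson(columns[i], labels) for i in idx) / k
    pairs = [(i, j) for a, i in enumerate(idx) for j in idx[a + 1:]]
    r_ff = sum(pearson(columns[i], columns[j]) for i, j in pairs) / len(pairs) if pairs else 0.0
    return k * r_cf / math.sqrt(k + k * (k - 1) * r_ff)


def cfs_ga(columns, labels, pop_size=20, generations=30,
           p_cross=0.8, p_mut=0.05, seed=0):
    """Search for a high-merit feature mask; returns (best_mask, best_merit)."""
    rng = random.Random(seed)
    n = len(columns)
    # Each chromosome is a list of n bits: bit i == 1 means feature i is selected.
    pop = [[rng.randint(0, 1) for _ in range(n)] for _ in range(pop_size)]
    best_mask, best_fit = pop[0][:], cfs_merit(pop[0], columns, labels)
    for gen in range(generations + 1):
        fitness = [cfs_merit(m, columns, labels) for m in pop]
        for m, f in zip(pop, fitness):
            if f > best_fit:
                best_mask, best_fit = m[:], f
        if gen == generations:
            break  # final population evaluated; no further breeding
        # Roulette-wheel selection: higher merit gives a higher survival chance.
        # The small constant keeps the weights valid if every merit is zero.
        weights = [f + 1e-9 for f in fitness]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        children = []
        for i in range(0, pop_size, 2):
            a, b = parents[i][:], parents[(i + 1) % pop_size][:]
            if n > 1 and rng.random() < p_cross:
                cut = rng.randint(1, n - 1)  # single-point crossover
                a[cut:], b[cut:] = b[cut:], a[cut:]
            children += [a, b]
        # Bit-flip mutation, applied independently to every gene.
        pop = [[bit ^ 1 if rng.random() < p_mut else bit for bit in c]
               for c in children[:pop_size]]
    return best_mask, best_fit
```

Here `columns` is a list of feature columns (one list of numbers per feature) and `labels` a numeric class vector, so a call looks like `mask, merit = cfs_ga(columns, labels)`. The sketch remembers the best chromosome seen across generations; because a GA is a stochastic search, the sketch itself makes no optimality claim about the mask it returns.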
Source
Journal of Shanghai Maritime University
PKU Core Journal (北大核心)
2012, No. 1, pp. 77-81 (5 pages)
Funding
National Natural Science Foundation of China (61175044)
Keywords
Chinese Web page classification
feature selection
correlation-based feature selection
genetic algorithm