Abstract
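Text feature spaces often run to tens of thousands of dimensions and contain many redundant and irrelevant features, which makes traditional classification methods slow and inaccurate. To improve the speed and accuracy of text classification, a method combining a genetic algorithm (GA) with a support vector machine (SVM) is proposed. Each combination of text features is treated as a chromosome in the genetic algorithm and encoded in binary, and the SVM classification accuracy serves as the fitness function for evaluating each individual. The genetic operations of selection, crossover, and mutation then yield the optimal text features, which the SVM finally uses for classification. Simulation experiments on the Fudan University Chinese text classification corpus show that, compared with traditional text classification methods, the proposed method quickly obtains the optimal feature subset for classification and greatly improves classification accuracy. It has good application prospects in text mining.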
In text categorization, feature spaces often contain 10,000 or more dimensions, sometimes exceeding the number of available training samples, which makes precision difficult to improve. To increase operating speed and reduce memory usage, a feature selection method based on a genetic algorithm and a support vector machine is presented. In this algorithm, a feature combination is regarded as a chromosome and encoded in binary, and the classification precision of the support vector machine is used as the fitness function to evaluate each individual. Through the operations of selection, crossover, and mutation, the optimal feature set can be obtained rapidly. The improved genetic algorithm is applied to text categorization data in a feature optimization simulation. The results show that the method obtains a subset of features that contribute to classification. Because classification accuracy and computational efficiency are both improved, the method has good prospects in text mining.
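The wrapper loop described in the abstract (binary chromosomes, a fitness score per individual, then selection, crossover, and mutation) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the tournament selection, the single-point crossover, the elitism step, and all parameter values are assumptions, since the abstract does not specify them. The fitness function here is a toy stand-in; in the method described, it would be the classification accuracy of an SVM trained on the selected features.

```python
import random


def evolve(n_features, fitness, pop_size=20, generations=30,
           crossover_rate=0.8, mutation_rate=0.02, seed=0):
    """Search for a binary feature mask that maximizes `fitness` (hypothetical helper)."""
    rng = random.Random(seed)
    # Each chromosome is a list of 0/1 genes: 1 keeps the feature, 0 drops it.
    population = [[rng.randint(0, 1) for _ in range(n_features)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for _ in range(generations):
        scores = [fitness(ind) for ind in population]

        def select():
            # Assumed operator: binary tournament, the fitter of two random picks.
            a, b = rng.randrange(pop_size), rng.randrange(pop_size)
            return population[a] if scores[a] >= scores[b] else population[b]

        # Assumed elitism: carry the best mask forward unchanged.
        children = [best[:]]
        while len(children) < pop_size:
            p1, p2 = select(), select()
            if n_features > 1 and rng.random() < crossover_rate:
                cut = rng.randrange(1, n_features)  # single-point crossover
                child = p1[:cut] + p2[cut:]
            else:
                child = p1[:]
            # Bit-flip mutation: each gene flips with probability mutation_rate.
            child = [1 - g if rng.random() < mutation_rate else g for g in child]
            children.append(child)
        population = children
        best = max(population, key=fitness)
    return best


# Toy stand-in for SVM accuracy (an assumption for illustration only):
# features 0, 3 and 7 are "informative", every other kept feature costs a little.
INFORMATIVE = {0, 3, 7}


def toy_fitness(mask):
    hits = sum(mask[i] for i in INFORMATIVE)
    noise = sum(mask) - hits
    return hits - 0.1 * noise


best_mask = evolve(12, toy_fitness)
print(best_mask, toy_fitness(best_mask))
```

To approximate the described method, `toy_fitness` would be replaced by a function that trains an SVM on the columns where the mask is 1 and returns its accuracy on held-out documents. Because every individual is scored this way in every generation, the cost of that evaluation dominates the running time.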
Source
《计算机仿真》
CSCD
Peking University Core Journal (北大核心)
2011, Issue 1, pp. 222-225 (4 pages)
Computer Simulation
Keywords
Text categorization
Genetic algorithm (GA)
Support vector machine (SVM)
Feature selection