Abstract
Traditional feature selection methods for text clustering cannot find the optimal feature set, whereas a genetic algorithm can reach a global optimum with high search efficiency. A genetic algorithm is therefore proposed for feature selection in text clustering. Each feature combination is treated as a chromosome and binary-encoded, and the density of the text set is introduced as the fitness function for evaluating each individual. Through the genetic operations of selection, crossover and mutation, the optimal feature set can be obtained relatively quickly. Experiments on a public text classification corpus show that, compared with clustering without feature selection, feature selection based on the genetic algorithm improves text clustering precision by 5.9% and reduces clustering time by 15 s.
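The procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The abstract does not define the text set density, so `density` below assumes it is the mean cosine similarity between each document and the collection centroid, and the usage line assumes that a lower density is fitter. The paper's actual definition and direction may differ. The names `density`, `select_features`, `DOCS` and all parameter values are hypothetical.

```python
import math
import random


def density(docs, mask):
    """ASSUMED definition: mean cosine similarity between each document
    vector and the collection centroid, over the selected features only."""
    cols = [j for j, bit in enumerate(mask) if bit]
    if not cols:
        return 1.0  # no feature selected: treat as the worst case
    centroid = [sum(d[j] for d in docs) / len(docs) for j in cols]
    c_norm = math.sqrt(sum(c * c for c in centroid))
    total = 0.0
    for d in docs:
        v = [d[j] for j in cols]
        v_norm = math.sqrt(sum(x * x for x in v))
        if v_norm == 0.0 or c_norm == 0.0:
            total += 1.0  # penalize masks that erase a document entirely
        else:
            dot = sum(x * c for x, c in zip(v, centroid))
            total += dot / (v_norm * c_norm)
    return total / len(docs)


def select_features(n_features, fitness, pop_size=20, generations=40,
                    p_cross=0.8, p_mut=0.02, seed=0):
    """Binary-coded GA: one chromosome is one 0/1 mask over the features.
    pop_size is assumed to be even; all defaults are illustrative."""
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_features)]
           for _ in range(pop_size)]
    best = max(pop, key=fitness)
    for _ in range(generations):
        # Selection: roulette wheel, with scores shifted to stay positive.
        scores = [fitness(c) for c in pop]
        low = min(scores)
        weights = [s - low + 1e-9 for s in scores]
        parents = rng.choices(pop, weights=weights, k=pop_size)
        # Crossover: single cut point, applied to consecutive parent pairs.
        nxt = []
        for a, b in zip(parents[0::2], parents[1::2]):
            a, b = a[:], b[:]
            if n_features > 1 and rng.random() < p_cross:
                cut = rng.randrange(1, n_features)
                a[cut:], b[cut:] = b[cut:], a[cut:]
            nxt += [a, b]
        # Mutation: flip each bit independently with probability p_mut.
        for chrom in nxt:
            for i in range(n_features):
                if rng.random() < p_mut:
                    chrom[i] ^= 1
        nxt[0] = best[:]  # elitism: carry the best mask found so far
        pop = nxt
        best = max(pop, key=fitness)
    return best


# Toy term-frequency matrix (rows: documents, columns: features).
DOCS = [
    [3, 0, 1, 0, 2, 1],
    [2, 0, 1, 0, 3, 1],
    [0, 4, 1, 3, 0, 1],
    [0, 3, 1, 4, 0, 1],
]

# Assumption: lower density is better, so fitness is the negated density.
best_mask = select_features(len(DOCS[0]), lambda m: -density(DOCS, m))
selected = [j for j, bit in enumerate(best_mask) if bit]
```

The fitness function is passed in as a parameter, so a different density definition, or the opposite optimization direction, can be substituted without touching the genetic operators.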
Source
《华南理工大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core Journals (北大核心)
2004, Issue z1, pp. 133-136 (4 pages)
Journal of South China University of Technology(Natural Science Edition)
Keywords
genetic algorithm
text clustering
feature selection
Chinese information processing