Abstract
[Objective] To survey existing feature extraction methods and improve the traditional feature extraction process and methods. [Methods] A feature pool is used to pre-select candidate feature words; a genetic algorithm then encodes the candidates in groups and extracts the best feature vector. [Results] The improved text feature extraction method performs best when KNN is used to compute the fitness value, and its advantage is more pronounced at low feature dimensions. Its classification accuracy is also more stable across feature dimensions and corpora. [Limitations] The quality of the experimental corpus needs improvement; only two feature extraction methods, CHI (chi-square) and IG (information gain), are used to build the feature pool; group coding ignores semantic relationships between words; and the population size and number of iterations are limited by computational complexity. [Conclusions] Pre-extracting features with a feature pool improves the stability of text classification accuracy, and introducing a genetic algorithm into text feature extraction improves the extraction result. The group coding rule used by the genetic algorithm reduces overfitting of features and speeds up the algorithm.
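The pipeline summarized in the abstract (CHI and IG pre-selection into a feature pool, then a genetic algorithm with group coding and a KNN-based fitness) can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the toy corpus, the function names, the GA settings, and in particular the group coding scheme (split the pool into groups, one gene per group selecting a single term) are all assumptions, since the abstract does not specify them.

```python
import math
import random
from collections import Counter

# Hypothetical toy corpus, for illustration only.
DOCS = [
    ("goal match team score report", "sport"),
    ("team coach match win", "sport"),
    ("score win league goal", "sport"),
    ("stock market price fund report", "finance"),
    ("fund bank price rate", "finance"),
    ("bank market rate stock", "finance"),
]
TOKENS = [(set(text.split()), label) for text, label in DOCS]
LABELS = sorted({label for _, label in TOKENS})
VOCAB = sorted(set().union(*(words for words, _ in TOKENS)))

def entropy(counts):
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c)

def chi_square(term):
    """Largest 2x2 chi-square statistic of the term against any class."""
    n, best = len(TOKENS), 0.0
    for label in LABELS:
        a = sum(1 for w, l in TOKENS if term in w and l == label)
        b = sum(1 for w, l in TOKENS if term in w and l != label)
        c = sum(1 for w, l in TOKENS if term not in w and l == label)
        d = n - a - b - c
        denom = (a + c) * (b + d) * (a + b) * (c + d)
        if denom:
            best = max(best, n * (a * d - b * c) ** 2 / denom)
    return best

def info_gain(term):
    """Class entropy minus the entropy conditioned on term presence."""
    n = len(TOKENS)
    gain = entropy(list(Counter(l for _, l in TOKENS).values()))
    for present in (True, False):
        subset = [l for w, l in TOKENS if (term in w) == present]
        if subset:
            gain -= len(subset) / n * entropy(list(Counter(subset).values()))
    return gain

def build_feature_pool(top_k):
    """Feature pool: union of the top_k terms under CHI and under IG."""
    by_chi = sorted(VOCAB, key=chi_square, reverse=True)[:top_k]
    by_ig = sorted(VOCAB, key=info_gain, reverse=True)[:top_k]
    return sorted(set(by_chi) | set(by_ig))

def knn_fitness(features, k=1):
    """Leave-one-out KNN accuracy on binary term-presence vectors."""
    vectors = [[int(f in w) for f in features] for w, _ in TOKENS]
    correct = 0
    for i, (_, label) in enumerate(TOKENS):
        others = [j for j in range(len(TOKENS)) if j != i]
        # Hamming distance between binary vectors.
        others.sort(key=lambda j: sum(x != y for x, y in zip(vectors[i], vectors[j])))
        votes = Counter(TOKENS[j][1] for j in others[:k])
        correct += votes.most_common(1)[0][0] == label
    return correct / len(TOKENS)

def decode(chromosome, groups):
    # Assumed group coding: gene g picks one term from group g.
    return [group[gene] for gene, group in zip(chromosome, groups)]

def evolve(pool, n_groups=4, pop_size=10, generations=15, seed=0):
    assert 2 <= n_groups <= len(pool)
    rng = random.Random(seed)
    groups = [pool[i::n_groups] for i in range(n_groups)]
    fitness = lambda ch: knn_fitness(decode(ch, groups))
    population = [[rng.randrange(len(g)) for g in groups] for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)
        parents = population[: pop_size // 2]  # elitist truncation selection
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            cut = rng.randrange(1, n_groups)  # one-point crossover
            child = a[:cut] + b[cut:]
            if rng.random() < 0.2:  # mutate one gene within its own group
                g = rng.randrange(n_groups)
                child[g] = rng.randrange(len(groups[g]))
            children.append(child)
        population = parents + children
    best = max(population, key=fitness)
    return decode(best, groups), fitness(best)

pool = build_feature_pool(top_k=8)
best_features, accuracy = evolve(pool)
print(best_features, accuracy)
```

Under this assumed encoding, each chromosome always decodes to exactly `n_groups` distinct terms, so the feature dimension stays fixed and the search space is smaller than with a free bit string over the whole pool. That is one plausible reading of how group coding could help running time; the paper's actual rule may differ.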
Source
《现代图书情报技术》
CSSCI
Peking University Core Journal
2014, Issue 4, pp. 48-57 (10 pages)
New Technology of Library and Information Service
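Funding
This work is one of the research outcomes of the following projects:
National Natural Science Foundation of China project "Theoretical and Experimental Research on Multidisciplinary Collaborative Modeling for Text Classification" (Grant No. 71373291)
National High Technology Research and Development Program of China (863 Program) project "Multi-source Information Sensing Technology and Product Development for the Whole Agricultural Product Supply Chain" (Grant No. 2012AA101701)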
Keywords
Text categorization
Feature extraction
Genetic algorithms
Feature pool