Abstract
Feature selection is an important preprocessing step in data mining and machine learning and has received wide attention in recent years. The high dimensionality of text data often reduces the efficiency of data mining tasks such as classification, so feature selection is commonly used as a component of the text categorization process to reduce dimensionality. With the rapid development of classification techniques and the increasingly fine-grained division of categories, multi-class text categorization poses further challenges for feature selection methods. Focusing on multi-class text categorization, this paper describes the main challenges that feature selection methods currently face and presents the main categories of multi-class feature selection methods. Following the development path of the related research, from simple to complex, it summarizes and reviews how current multi-class feature selection algorithms are applied. Finally, it concludes the paper and proposes possible directions for future research.
Source
Computer Engineering & Science (《计算机工程与科学》)
CSCD
Peking University Core Journal (北大核心)
2010, Issue 8, pp. 90-93, 148 (5 pages in total)
Funding
National 863 Program of China (2006AA01Z451, 2007AA01Z474, 2007AA010502)
National Natural Science Foundation of China (60873204)
NCET060928
Keywords
feature selection
text categorization
data mining
hierarchical structure