摘要
随着网络的发展,大量的文档数据涌现在网上,用于处理海量数据的自动文本分类技术变得越来越重要,自动文本分类已成为处理和组织大量文档数据的关键技术.对于采用矢量空间模型(VSM)的大多数分类器来说,文本预处理成为分类的瓶颈,高维的特征空间对于大多数分类器来说是难以忍受的,因此采用适当的文本特征选择算法降低原始文本特征空间的维数成为文本分类的首要任务.目前也有很多的文本特征选择算法,介绍了另一种新的基于基尼指数的文本特征选择算法,使用基尼指数原理进行了文本特征选择的研究,构造了基于基尼指数的适合于文本特征选择的特征选择评估函数.实验表明,基于基尼指数的文本特征选择能进一步提高分类性能,而且计算复杂度小.
With the rapid development of World Wide Web, large numbers of documents are available on the Internet. Automatic text categorization becomes more and more important for dealing with massive data. Text categorization has become a key technology in organizing and processing large amount of text data. For most classifiers using vector space model (VSM), text preprocessing has become the bottleneck of categorization. High dimensionality of the feature space is impossible for many classifiers. So adopting appropriate text feature selection algorithms to reduce the dimensionality of the feature space is becoming the key role. At present, there are many text feature selection algorithms. In this paper, all these text feature selection methods are not discussed in detail, but another new text feature selection method--Gini index is presented, lmproved Gini-index is used for text feature selection, constructing the measure function based on Gini-index. The experiment results show that the text feature selection based on Gini index can improve the categorization performance further, and that its complexity of computing is small.
出处
《计算机研究与发展》
EI
CSCD
北大核心
2006年第10期1688-1694,共7页
Journal of Computer Research and Development
基金
国家自然科学基金项目(60503017)
北京交通大学人才基金项目(JSJ04002)~~
关键词
文本分类
文本特征选择
基尼指数
文本预处理
text categorization
text feature selection
Gini index
text preprocessing