
Research on the Algorithm of Feature Selection Based on Gini Index for Text Categorization

Cited by: 38
Abstract: With the rapid development of the World Wide Web, large numbers of documents have become available on the Internet, and automatic text categorization has become a key technology for organizing and processing massive amounts of text data. For most classifiers based on the vector space model (VSM), text preprocessing is the bottleneck of categorization: the high dimensionality of the feature space is intractable for many classifiers, so applying an appropriate feature selection algorithm to reduce the dimensionality of the original feature space is the first task of text categorization. Many text feature selection algorithms already exist; rather than discussing them in detail, this paper presents a new one based on the Gini index. An improved Gini index is applied to text feature selection, and a Gini-index-based evaluation function suitable for selecting text features is constructed. Experiments show that feature selection based on the Gini index further improves categorization performance while keeping computational complexity low.
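The abstract does not reproduce the paper's exact evaluation function, so the sketch below uses one improved Gini formulation reported in the text-categorization feature-selection literature, Gini(w) = Σᵢ P(w|Cᵢ)² · P(Cᵢ|w)², purely as an illustration of how such a measure scores and ranks terms; the paper's own function may differ in detail.

```python
from collections import defaultdict

def gini_scores(docs, labels):
    """Score each term with an improved Gini-index measure.

    docs: list of token lists; labels: parallel list of class labels.
    Uses Gini(w) = sum_i P(w|Ci)^2 * P(Ci|w)^2 -- one formulation
    from the feature-selection literature, shown here as an
    illustration; the paper's own evaluation function may differ.
    """
    # df[w][c] = number of documents of class c that contain term w
    df = defaultdict(lambda: defaultdict(int))
    class_docs = defaultdict(int)          # number of documents per class
    for tokens, c in zip(docs, labels):
        class_docs[c] += 1
        for w in set(tokens):              # document frequency, not term frequency
            df[w][c] += 1

    scores = {}
    for w, per_class in df.items():
        total_w = sum(per_class.values())  # documents containing w (any class)
        s = 0.0
        for c, n in per_class.items():
            p_w_given_c = n / class_docs[c]
            p_c_given_w = n / total_w
            s += (p_w_given_c ** 2) * (p_c_given_w ** 2)
        scores[w] = s
    return scores

def select_features(docs, labels, k):
    """Keep the k terms with the highest Gini scores."""
    scores = gini_scores(docs, labels)
    return sorted(scores, key=scores.get, reverse=True)[:k]
```

A term that occurs in every document of exactly one class scores 1.0, while a term spread evenly across classes scores much lower, which is why ranking by this measure concentrates the reduced feature space on class-discriminative terms. Note the low computational cost: one pass over the corpus plus one pass over the vocabulary, consistent with the abstract's claim of small complexity.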
Source: Journal of Computer Research and Development (EI, CSCD, Peking University core journal), 2006, No. 10, pp. 1688-1694 (7 pages).
Funding: National Natural Science Foundation of China (60503017); Beijing Jiaotong University Talent Fund (JSJ04002).
Keywords: text categorization; text feature selection; Gini index; text preprocessing


