
Class information feature selection method for text classification (cited by: 5)
Abstract: With the rapid growth of electronic documents on the web, text classification is becoming increasingly important in information retrieval. As feature dimensionality grows, the statistical characteristics of the samples become harder to estimate, which reduces the classifier's generalization ability and leads to overfitting. Feature selection and extraction are therefore a necessary precondition for text classification. This paper proposes a feature selection method based on class information, which takes the class information of the documents into account while retaining as much document information as possible. Experiments show that the method classifies well and, particularly on the micro-averaged measure, improves substantially over OCFS and the chi-square statistic (CHI).
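Source: Journal of Shandong University (Natural Science), 2006, No. 3, pp. 10-13, 59 (5 pages in total). Indexed in: CAS, CSCD, Peking University Core Journals.
Funding: Key Science and Technology Project of the Ministry of Education (03070); Natural Science Foundation of Jiangxi Province (0311041)
Keywords: feature selection; text classification; between-class distribution; within-class distribution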


