
A Practical and Efficient Text Classification Algorithm (cited by: 20)

A Simple and Efficient Algorithm to Classify a Large Scale of Texts
Abstract: Most existing classification algorithms in pattern recognition research are based on the vector space model (VSM), and the most widely used of them is kNN (k-nearest neighbors). Most of these algorithms, however, have a computational complexity too high for large-scale use. They also scale poorly: whenever the training set grows, the classifier must be rebuilt. To address this, the paper introduces two concepts, mutual dependence (MD) and equivalent radius (ER), and combines them into a new classification algorithm, SECTILE, an easily updatable classifier based on mutual dependence and equivalent radius. SECTILE has relatively low computational complexity and good scalability, which makes it suitable for large-scale settings. SECTILE is applied to Chinese text classification and compared with kNN and the class-center vector method (CCC). The results show that SECTILE improves classification accuracy while substantially increasing classification speed, which favors real-time, online automatic classification of large-scale collections of samples.
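Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2005, No. 1, pp. 85-93 (9 pages)
Funding: National Natural Science Foundation of China (60173027)
Keywords: classification; equivalent radius (ER); vector space model (VSM); mutual dependence (MD); SECTILE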
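The abstract contrasts kNN with the class-center vector method on grounds of classification cost. The sketch below illustrates only those two baselines, not SECTILE itself, since this record does not define mutual dependence or equivalent radius. It assumes documents are sparse term-weight vectors compared by cosine similarity; the function names and the toy data are hypothetical and are not taken from the paper.

```python
import math
from collections import Counter, defaultdict

def cosine(u, v):
    """Cosine similarity between two sparse term-weight vectors (dicts)."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

def knn_classify(train, query, k=3):
    """kNN baseline: the query is compared with every training vector."""
    ranked = sorted(train, key=lambda item: cosine(item[0], query), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

def build_centroids(train):
    """Class-center baseline: sum the vectors of each class.
    The sums are left unnormalized because cosine similarity ignores scale."""
    sums = defaultdict(lambda: defaultdict(float))
    for vec, label in train:
        for term, weight in vec.items():
            sums[label][term] += weight
    return sums

def centroid_classify(centroids, query):
    """The query is compared with one vector per class only."""
    return max(centroids, key=lambda label: cosine(centroids[label], query))

# Hypothetical toy corpus: (term-weight vector, class label).
train = [
    ({"stock": 2, "market": 1}, "finance"),
    ({"bank": 1, "market": 2}, "finance"),
    ({"match": 2, "goal": 1}, "sports"),
    ({"team": 1, "goal": 2}, "sports"),
]
query = {"market": 1, "bank": 1}

knn_label = knn_classify(train, query, k=3)
centroid_label = centroid_classify(build_centroids(train), query)
print(knn_label, centroid_label)
```

In this sketch kNN performs one similarity computation per training document for each query, while the class-center method performs one per class, which is the kind of cost gap the abstract points to for large collections.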

