
Impact of Text Representations on the Performance of Linear Support Vector Machines (cited by: 4)
Abstract: Since the 1990s, automatic text categorization has attracted wide attention and produced a large body of research. Most of that work, however, concentrates on inventing or improving the machine learning algorithms used for categorization; theoretical and experimental studies of text representation are comparatively few. Through extensive comparative experiments on the benchmark corpus Reuters-21578, this paper examines how five text representation factors (stopwords, word stemming, indexing, scaling, and normalization) affect the classification performance of linear support vector machines, and how these factors interact. It identifies the best text representation, one that significantly improves classification performance and outperforms the prevailing approaches.
Source: Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》); indexed in EI, CSCD, and the Peking University Core Journals list; 2004, No. 2, pp. 161-166 (6 pages).
Keywords: Text Categorization; Text Representation; Support Vector Machines; Design of Experiments; Linear Classification
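The five representation factors named in the abstract can be pictured as switches in a small vectorization routine. The following is only a minimal sketch under stated assumptions, not the procedure used in the paper: the stopword list, the toy `crude_stem` function (a stand-in for a real stemmer such as Porter's), the option names `binary`/`tf`/`logtf`, and the choice of idf as the scaling scheme are all hypothetical illustrations. The record itself does not say which concrete settings were compared.

```python
import math
from collections import Counter

# Hypothetical, tiny stopword list used only for illustration.
STOPWORDS = {"the", "a", "of", "and", "in", "to", "is"}


def crude_stem(token):
    """Toy suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) > len(suffix) + 2:
            return token[: -len(suffix)]
    return token


def represent(docs, remove_stopwords=True, stem=True,
              indexing="tf", scaling="idf", normalize=True):
    """Map raw documents to sparse {term: weight} vectors.

    Each keyword argument corresponds to one representation factor:
    stopwords, stemming, indexing, scaling, and normalization.
    """
    tokenized = []
    for text in docs:
        tokens = text.lower().split()
        if remove_stopwords:
            tokens = [t for t in tokens if t not in STOPWORDS]
        if stem:
            tokens = [crude_stem(t) for t in tokens]
        tokenized.append(tokens)

    # Document frequency of each term, needed for idf scaling.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    n_docs = len(docs)

    vectors = []
    for tokens in tokenized:
        vec = {}
        for term, tf in Counter(tokens).items():
            # Indexing: how the within-document value is taken.
            if indexing == "binary":
                value = 1.0
            elif indexing == "logtf":
                value = 1.0 + math.log(tf)
            else:  # plain term frequency
                value = float(tf)
            # Scaling: collection-level weight applied to the value.
            if scaling == "idf":
                value *= math.log(n_docs / df[term])
            vec[term] = value
        # Normalization: rescale the vector to unit Euclidean length.
        if normalize:
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Calling `represent(docs, indexing="binary", scaling=None)` and `represent(docs, indexing="logtf", scaling="idf")` on the same corpus yields two different inputs for the same linear classifier, which is the kind of comparison the abstract describes.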

References (9)

  • 1. Sebastiani F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 2002, 34(1): 1-47.
  • 2. Salton G, McGill M J. An Introduction to Modern Information Retrieval. New York: McGraw-Hill, 1983.
  • 3. Baker L D, McCallum A K. Distributional Clustering of Words for Text Categorisation. In: Proc of the 21st ACM International Conference on Research and Development in Information Retrieval. Melbourne, Australia, 1998, 96-103.
  • 4. Dumais S, Platt J, Heckerman D, Sahami M. Inductive Learning Algorithms and Representations for Text Categorization. In: Proc of the 7th ACM International Conference on Information and Knowledge Management. Washington, USA, 1998, 148-155.
  • 5. Yang Y, Liu X. A Re-Examination of Text Categorization Methods. In: Proc of the 22nd ACM International Conference on Research and Development in Information Retrieval. Berkeley, USA, 1999, 42-49.
  • 6. Ma J, Zhao Y, Ahalt S. OSU SVM Classifier Matlab Toolbox (ver 3.00). http://www.eleceng.ohio-state.edu/~maj/osu.
  • 7. Lewis D. Reuters-21578, Distribution 1.0. http://www.research.att.com/~lewis/reuters21578.html.
  • 8. Porter M F. An Algorithm for Suffix Stripping. Program, 1980, 14(3): 130-137.
  • 9. Yang Y, Pedersen J O. A Comparative Study on Feature Selection in Text Categorization. In: Proc of the 14th International Conference on Machine Learning. Nashville, USA, 1997, 412-420.

Co-cited literature (28)

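  • 1. Wang Hongjun, Xu Lin, Zhang Jihong. A Chinese predicate identification algorithm based on fuzzy comprehensive decision-making [J]. Journal of Sichuan University (Natural Science Edition), 2004, 41(z1): 630-634. Cited by: 1
  • 2. Song Fengxi, Gao Lin. Performance evaluation metrics for text classifiers [J]. Computer Engineering, 2004, 30(13): 107-109. Cited by: 33
  • 3. Luo Zhensheng, Zheng Bixia. Research on algorithms and strategies for automatic analysis and distribution statistics of Chinese sentence patterns [J]. Journal of Chinese Information Processing, 1994, 8(2): 1-19. Cited by: 21
  • 4. Chen Wenliang, Zhu Jingbo, Zhu Muhua, Yao Tianshun. Text feature representation based on a domain dictionary [J]. Journal of Computer Research and Development, 2005, 42(12): 2155-2160. Cited by: 22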
  • 5. SEBASTIANI F. Machine learning in automated text categorization [J]. ACM Computing Surveys, 2002, 34(1): 1-47.
  • 6. SALTON G, MCGILL M J. An introduction to modern information retrieval [M]. [S.l.]: McGraw-Hill, 1983.
  • 7. SHANKAR S, KARYPIS G. A feature weight adjustment algorithm for document categorization [C]//6th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. New York: ACM, 2000.
  • 8. DEBOLE F, SEBASTIANI F. Supervised term weighting for automated text categorization [C]//SAC 03: 18th ACM Symposium on Applied Computing. New York: ACM, 2003: 784-788.
  • 9. YANG YI-MING. An evaluation of statistical approaches to text categorization [J]. Information Retrieval, 1999, 1(1): 69-90.
  • 10. Sebastiani F. Machine Learning in Automated Text Categorization. ACM Computing Surveys, 2002, 34(1): 1-47.

Citing documents: 4

Second-level citing documents: 15
