Abstract
Since the 1990s, automatic text categorization has attracted broad attention and produced a large body of research. Most of that work, however, focuses on innovating or improving the machine learning algorithms used for text categorization; theoretical and experimental studies of text representation are comparatively few. Through extensive comparative experiments on the benchmark corpus Reuters-21578, this paper examines how five main text representation factors (stopwords, word stemming, indexing, scaling, and normalization) affect the classification performance of linear support vector machines, and how these factors interact with one another. It identifies the text representation that significantly improves categorization performance and outperforms the prevailing approaches.
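The five factors can be read as independent switches in a document-to-vector pipeline. The following is a minimal sketch of that reading, not the paper's implementation: the abstract does not define the factors precisely, so treating "indexing" as the raw term value (binary, term frequency, or log term frequency), "scaling" as inverse document frequency weighting, and "normalization" as L2 length normalization are assumptions made here. The stopword list, `toy_stem`, `tokenize`, and `represent` are hypothetical names introduced only for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (illustration only).
STOPWORDS = {"the", "a", "of", "in", "and", "to", "is"}


def toy_stem(word):
    """Crude suffix stripper standing in for a real stemmer (assumption)."""
    for suffix in ("ing", "ed", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def tokenize(text, remove_stopwords, stem):
    """Lowercase, split on letters, then apply the two lexical factors."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if remove_stopwords:
        tokens = [t for t in tokens if t not in STOPWORDS]
    if stem:
        tokens = [toy_stem(t) for t in tokens]
    return tokens


def represent(docs, remove_stopwords=True, stem=True,
              indexing="tf", scaling="idf", normalize=True):
    """Return one sparse {term: weight} dict per document."""
    token_lists = [tokenize(d, remove_stopwords, stem) for d in docs]
    n_docs = len(docs)
    # Document frequency: number of documents containing each term.
    df = Counter(term for tokens in token_lists for term in set(tokens))

    vectors = []
    for tokens in token_lists:
        vec = {}
        for term, tf in Counter(tokens).items():
            # Factor "indexing": how the raw term value is taken.
            if indexing == "binary":
                value = 1.0
            elif indexing == "logtf":
                value = 1.0 + math.log(tf)
            else:  # plain term frequency
                value = float(tf)
            # Factor "scaling": optional collection-level weight (assumed idf).
            if scaling == "idf":
                value *= math.log(n_docs / df[term])
            vec[term] = value
        # Factor "normalization": scale the vector to unit L2 length.
        if normalize:
            norm = math.sqrt(sum(v * v for v in vec.values()))
            if norm > 0:
                vec = {t: v / norm for t, v in vec.items()}
        vectors.append(vec)
    return vectors
```

Each keyword argument of `represent` corresponds to one factor, so a comparative experiment of the kind the abstract describes would amount to enumerating argument combinations and training a linear classifier on each resulting representation; the training step is not shown here.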
Source
《模式识别与人工智能》 (Pattern Recognition and Artificial Intelligence)
EI
CSCD
Peking University Core Journals (北大核心)
2004, Issue 2, pp. 161-166 (6 pages)
Keywords
Text Categorization
Text Representation
Support Vector Machines
Design of Experiments
Linear Classification