Abstract
This paper proposes an example-based mapping model for automatic Chinese text categorization that relies on the statistical properties of Chinese characters. Texts are represented as Chinese character frequency vectors. The model's most distinctive feature is its use of the Linear Least Squares Fit (LLSF) technique to build the classifier. By learning from a manually categorized and indexed training corpus, together with relevance judgments between texts and categories, the model derives a mapping function from the Chinese character vector space to the category vector space that minimizes the global mapping error. This function is then used to categorize test texts.
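The mapping described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the toy corpus, the category names, the relative-frequency normalization, and all function names (`build_vocab`, `char_freq_vector`, `fit_llsf`, `predict`) are assumptions made for illustration, and the least-squares problem is solved here with NumPy's pseudo-inverse, which the abstract does not specify.

```python
from collections import Counter

import numpy as np


def build_vocab(texts):
    """Map every non-space character seen in the training texts to a column index."""
    chars = sorted({ch for text in texts for ch in text if not ch.isspace()})
    return {ch: i for i, ch in enumerate(chars)}


def char_freq_vector(text, vocab):
    """Relative character frequencies of `text` (normalization is an assumption)."""
    vec = np.zeros(len(vocab))
    for ch, n in Counter(ch for ch in text if ch in vocab).items():
        vec[vocab[ch]] = n
    total = vec.sum()
    return vec / total if total > 0 else vec


def fit_llsf(X, Y):
    """Return W minimizing ||X @ W - Y||_F (minimum-norm least-squares solution)."""
    return np.linalg.pinv(X) @ Y


def predict(text, vocab, W, categories):
    """Map a text into category space and return the best category with all scores."""
    scores = char_freq_vector(text, vocab) @ W
    return categories[int(np.argmax(scores))], scores


# Hypothetical, manually labeled toy corpus (not from the paper).
train_texts = ["股票市场上涨", "银行利率下调", "足球比赛胜利", "篮球联赛冠军"]
train_labels = ["finance", "finance", "sports", "sports"]
categories = ["finance", "sports"]

vocab = build_vocab(train_texts)

# X: one character frequency vector per training text.
X = np.vstack([char_freq_vector(t, vocab) for t in train_texts])

# Y: one row per training text, 1.0 in the column of its assigned category.
Y = np.zeros((len(train_texts), len(categories)))
for row, label in enumerate(train_labels):
    Y[row, categories.index(label)] = 1.0

W = fit_llsf(X, Y)

label_sports, _ = predict("足球冠军", vocab, W, categories)
label_finance, _ = predict("银行股票", vocab, W, categories)
```

In this sketch each row of `W` holds one character's contribution to every category, so classifying a new text reduces to a single vector-matrix product followed by picking the highest-scoring category. A real corpus would need far more training texts than categories, and some thresholding rule if a text may belong to several categories.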
Source
Journal of the China Society for Scientific and Technical Information (《情报学报》)
CSSCI
PKU Core Journals (北大核心)
1999, No. 1, pp. 27-32 (6 pages)
Keywords
Chinese text
automatic categorization
character frequency vector
mapping function
automatic Chinese text categorization, Chinese character frequency vector, example-based mapping method