Abstract
To improve the efficiency and accuracy of Chinese text classification, this paper designs a new classifier. The classifier segments text using corpus-based forward-scanning statistical word segmentation. In the word-frequency statistics stage, frequencies are counted by category during training and by region of the article during testing. To select feature words more effectively, three strong-information feature criteria are proposed: frequency, concentration, and correlation. For feature weighting, a method is proposed that combines word frequency with a comprehensive feature selection function. Finally, classification is performed using the naive Bayes principle. Experiments show that the classifier is simple and effective.
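The pipeline outlined in the abstract (forward-scanning segmentation, per-category word counts, naive Bayes decision) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the abstract does not define the segmentation statistics, the three feature criteria, or the weighting function, so the sketch substitutes plain forward maximum matching against a toy lexicon and ordinary Laplace-smoothed multinomial naive Bayes, and it omits feature selection, feature weighting, and region-based counting entirely. The names `forward_max_match`, `train`, `classify`, and the toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def forward_max_match(text, lexicon, max_len=4):
    """Greedy forward scan: at each position take the longest lexicon word.

    Assumption: stands in for the paper's corpus-based segmentation,
    whose details are not given in the abstract.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            # Fall back to a single character when nothing matches.
            if size == 1 or piece in lexicon:
                tokens.append(piece)
                i += size
                break
    return tokens


def train(docs, lexicon):
    """Count word frequencies per category (docs is a list of (text, label))."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    for text, label in docs:
        class_counts[label] += 1
        word_counts[label].update(forward_max_match(text, lexicon))
    return word_counts, class_counts


def classify(text, lexicon, word_counts, class_counts):
    """Pick the category with the highest Laplace-smoothed log posterior."""
    vocab = {w for counts in word_counts.values() for w in counts}
    total_docs = sum(class_counts.values())
    tokens = forward_max_match(text, lexicon)
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        counts = word_counts[label]
        denom = sum(counts.values()) + len(vocab)
        score = math.log(n_docs / total_docs)  # log prior
        for tok in tokens:
            score += math.log((counts[tok] + 1) / denom)  # smoothed likelihood
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy data, purely illustrative.
LEXICON = {"足球", "比赛", "球队", "股票", "市场", "上涨"}
DOCS = [
    ("足球比赛", "sports"),
    ("球队比赛", "sports"),
    ("股票市场", "finance"),
    ("股票上涨", "finance"),
]

if __name__ == "__main__":
    wc, cc = train(DOCS, LEXICON)
    print(forward_max_match("足球比赛", LEXICON))
    print(classify("足球球队", LEXICON, wc, cc))
```

In a fuller version, the frequency, concentration, and correlation criteria would filter the vocabulary before classification, and the combined weight would replace the raw counts; those steps are left out because the abstract does not specify their formulas.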
Source
Computer Applications and Software (《计算机应用与软件》)
CSCD
Peking University Core Journal (北大核心)
2014, Issue 8, pp. 330-333 (4 pages)
Keywords
Chinese text classification
Feature selection
Feature weighting
Classification algorithm