
A Chinese Text Classifier Based on Strong Information Features of Category and Bayesian Algorithm (cited by: 5)
Abstract: To improve the efficiency and accuracy of Chinese text classification, this paper designs a new classifier. The classifier segments text with a corpus-based, forward-scanning statistical word segmentation method. In the word-frequency statistics stage, it counts by category during training and by region of the article during testing. To select feature words more effectively, three strong-information feature criteria are proposed: frequency, concentration, and correlation. For feature weighting, a calculation method is proposed that combines word frequency with a comprehensive feature selection function. Finally, classification is carried out using the naive Bayes principle. Experiments show that the classifier is simple and effective.
Source: Computer Applications and Software (《计算机应用与软件》), CSCD, Peking University Core Journal, 2014, No. 8, pp. 330-333 (4 pages)
Keywords: Chinese text categorisation; feature selection; feature weighting; classification algorithm
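
The pipeline outlined in the abstract (segment, count word frequencies per category, classify with naive Bayes) can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not the paper's implementation: `forward_max_match` uses plain dictionary-based forward maximum matching in place of the paper's corpus-based statistical segmentation, `NaiveBayes` is a standard multinomial model with Laplace smoothing (`alpha` is a hypothetical parameter), and the frequency, concentration, and correlation criteria and the combined weighting function are omitted because the abstract does not define them.

```python
import math
from collections import Counter, defaultdict


def forward_max_match(text, vocab, max_len=4):
    """Greedy left-to-right segmentation (illustrative stand-in, not the paper's method).

    At each position, take the longest substring (up to max_len characters)
    found in vocab; fall back to a single character if nothing matches.
    """
    tokens, i = [], 0
    while i < len(text):
        for size in range(min(max_len, len(text) - i), 0, -1):
            piece = text[i:i + size]
            if size == 1 or piece in vocab:
                tokens.append(piece)
                i += size
                break
    return tokens


class NaiveBayes:
    """Multinomial naive Bayes with Laplace smoothing (assumed variant)."""

    def __init__(self, alpha=1.0):
        self.alpha = alpha                      # smoothing constant (assumption)
        self.class_docs = Counter()             # documents seen per category
        self.term_counts = defaultdict(Counter) # word frequencies per category
        self.vocab = set()

    def fit(self, docs, labels):
        # Training stage: word frequencies are accumulated by category.
        for tokens, label in zip(docs, labels):
            self.class_docs[label] += 1
            self.term_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_docs.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_docs.items():
            counts = self.term_counts[label]
            denom = sum(counts.values()) + self.alpha * len(self.vocab)
            score = math.log(n_docs / total_docs)  # log prior
            for tok in tokens:
                if tok in self.vocab:              # skip words never seen in training
                    score += math.log((counts[tok] + self.alpha) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label
```

A toy usage, with a made-up dictionary and labels: segment each training text with `forward_max_match(text, vocab)`, pass the token lists and their categories to `NaiveBayes().fit(...)`, then call `predict` on the tokens of a new text. A feature-selection step in the spirit of the paper would filter the token lists before `fit`.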

