
Effects of category distribution in a training set on text categorization (cited by 26)
Abstract: To reduce the effect of an uneven category distribution in the training set on classification performance, this paper applies a category homogenizing method to the original training set: the set is reassembled category by category so that the category distribution of the new training set is as uniform as possible. Training and classification are then carried out on the balanced categories, which reduces the unfair treatment of small categories during training. The method was applied to the Fudan University corpus with Naïve Bayes and Rocchio classifiers. For Naïve Bayes, the macro-average F1 rose from 48.62% to 80.99%; for Rocchio, the macro-average F1 rose from 64.58% to 80.26% and the micro-average F1 from 73.99% to 80.47%. The experimental results show that the category homogenizing method significantly improves classification performance.
Source: Journal of Tsinghua University (Science and Technology), 2005, Issue S1, pp. 1802-1805 (4 pages). Indexed in EI, CAS, CSCD, and the Peking University Core Journals list.
Funding: Teaching and Research Award Program for Outstanding Young Teachers in Higher Education Institutions
Keywords: text categorization; training set; category homogenizing
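
The abstract reports both macro- and micro-averaged F1, and the macro figures move far more than the micro ones. A minimal sketch of why that can happen on an imbalanced set is given below. It is not the paper's code: the function names (`f1`, `macro_micro_f1`) and the toy labels are assumptions made for illustration, and only the standard single-label definitions of the two averages are used.

```python
from collections import Counter


def f1(tp, fp, fn):
    """F1 from raw counts; defined as 0.0 when there is nothing to score."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def macro_micro_f1(y_true, y_pred):
    """Return (macro-F1, micro-F1) for single-label classification."""
    labels = sorted(set(y_true) | set(y_pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for t, p in zip(y_true, y_pred):
        if t == p:
            tp[t] += 1
        else:
            fp[p] += 1  # predicted class receives a false positive
            fn[t] += 1  # true class receives a false negative
    # Macro: average the per-category F1, so every category weighs the same.
    macro = sum(f1(tp[c], fp[c], fn[c]) for c in labels) / len(labels)
    # Micro: pool the counts first, so large categories dominate.
    micro = f1(sum(tp.values()), sum(fp.values()), sum(fn.values()))
    return macro, micro


# Hypothetical toy data: 9 documents in a large category, 1 in a small one,
# and a classifier that assigns everything to the large category.
y_true = ["large"] * 9 + ["small"]
y_pred = ["large"] * 10
macro, micro = macro_micro_f1(y_true, y_pred)
print(f"macro-F1 = {macro:.4f}, micro-F1 = {micro:.4f}")
```

In this toy case the small category contributes an F1 of zero, which halves the macro average while barely affecting the pooled micro average. That is consistent with the direction of the reported results, where balancing the categories lifts macro-average F1 the most.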

