
Comparative study on text representation schemes in Chinese text classification (cited by: 5)
Abstract: We study text representation methods for Chinese text classification, propose a framework for analyzing the factors involved in representing Chinese text, and, by analyzing experimental results on three data sets, determine how each representation factor affects classification performance. Splitting text directly into Chinese characters can already give fairly good classification results. Simple word segmentation without a large dictionary, segmentation with a large dictionary, and more complex segmentation algorithms differ little in their effect on classification. Representing each feature only by a 0/1 value that indicates whether it occurs in a text can also give fairly good results. Choosing reasonable vector values (for example, by applying a suitable normalization algorithm) can improve classification accuracy considerably. These conclusions provide guidelines for subsequent applications.
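Source: Journal of the Graduate School of the Chinese Academy of Sciences (《中国科学院研究生院学报》), indexed in CAS, CSCD, and the Peking University Core Journals list; 2009, No. 3, pp. 400-407 (8 pages)
Funding: Supported by the National 863 Research Program of China (grant 2006AA01Z454)
Keywords: Chinese text classification, text representation, vectorization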
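
The representation choices compared in the abstract (single characters as features, 0/1 occurrence values, and normalized vector values) can be illustrated with a minimal sketch. The abstract does not say which normalization algorithm the paper uses, so L2 normalization of raw character counts is an assumption here, and all function names are hypothetical.

```python
import math
from collections import Counter


def char_unigrams(text):
    """Split text into single non-whitespace characters (no word segmentation)."""
    return [ch for ch in text if not ch.isspace()]


def build_vocab(docs):
    """Map each distinct character to a column index, in order of first appearance."""
    vocab = {}
    for doc in docs:
        for tok in char_unigrams(doc):
            vocab.setdefault(tok, len(vocab))
    return vocab


def binary_vector(text, vocab):
    """0/1 representation: 1.0 if the character occurs in the text, else 0.0."""
    present = set(char_unigrams(text))
    return [1.0 if tok in present else 0.0 for tok in vocab]


def l2_normalized_tf(text, vocab):
    """Character counts scaled to unit Euclidean length.

    Assumption: L2 normalization stands in for the paper's unspecified
    "suitable normalization algorithm".
    """
    counts = Counter(char_unigrams(text))
    vec = [float(counts.get(tok, 0)) for tok in vocab]
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else vec


docs = ["文本分类", "文本表示"]
vocab = build_vocab(docs)
print(binary_vector(docs[0], vocab))
print(l2_normalized_tf("文文本", vocab))
```

Either vector could be passed to any standard classifier. Swapping `char_unigrams` for a word segmenter would change only the feature unit, which is the factor the abstract reports as having little effect on classification.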

