Feature extraction for text classification based on word clustering (文本分类中基于词条聚合的特征抽取)

Cited by: 4
Abstract: Feature extraction is an important research area in text classification. The high dimensionality and sparsity of the original feature space expose classification algorithms to the curse of dimensionality. To address this, the paper examines feature extraction based on word clustering and designs a text classification scheme built on it. The scheme uses an improved tree-structured growing self-organizing map (TGSOM) to cluster words. Based on the characteristics of the clustered features, it proposes a new weight calculation method that accounts for how the words in each cluster differ in document frequency and in their ability to discriminate between document categories. Finally, the SPRINT decision tree algorithm performs the classification. Experiments show that the proposed method improves classification precision by 4.32% over the ordinary method.
Source: Journal of Harbin Engineering University (《哈尔滨工程大学学报》), indexed in EI, CAS, CSCD, and the Peking University Core Journals list (北大核心); 2008, No. 11, pp. 1205-1209 (5 pages)
Keywords: feature extraction; word clustering; TGSOM; weight calculation
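
The abstract describes replacing individual word features with clustered word features and then weighting them. The following is a minimal sketch of that general idea only, not the paper's method. The word-to-cluster mapping is written by hand here, whereas the paper obtains it from TGSOM. The weight is a plain smoothed TF-IDF sum, standing in for the paper's new weight formula, which this record does not state. The function name `cluster_features` and the toy data are hypothetical.

```python
import math
from collections import Counter


def cluster_features(docs, word_to_cluster):
    """Map tokenized documents to cluster-level feature vectors.

    Assumption: each cluster weight is the sum of tf * idf over the
    document's words that fall in that cluster. This is a stand-in,
    not the weight formula proposed in the paper.
    """
    n_docs = len(docs)

    # Document frequency of every word that belongs to some cluster.
    df = Counter()
    for doc in docs:
        for word in set(doc):
            if word in word_to_cluster:
                df[word] += 1

    n_clusters = max(word_to_cluster.values(), default=-1) + 1
    vectors = []
    for doc in docs:
        vec = [0.0] * n_clusters
        for word, tf in Counter(doc).items():
            if word in word_to_cluster:
                # Smoothed inverse document frequency.
                idf = math.log((1 + n_docs) / (1 + df[word])) + 1.0
                vec[word_to_cluster[word]] += tf * idf
        vectors.append(vec)
    return vectors


# Toy data: six words collapse into two cluster features.
docs = [
    ["goal", "match", "team"],
    ["stock", "market", "goal"],
    ["market", "bank"],
]
word_to_cluster = {
    "goal": 0, "match": 0, "team": 0,
    "stock": 1, "market": 1, "bank": 1,
}
vectors = cluster_features(docs, word_to_cluster)
```

Each document is now a 2-dimensional vector instead of a 6-dimensional one, which is the dimensionality reduction the abstract is after. The resulting vectors could then be passed to any decision tree learner; the paper uses SPRINT, which is not reproduced here.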