
Text categorization algorithm based on feature order pair quantization

Cited by: 4
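Abstract (translated from the Chinese) Text categorization techniques should capture as many of the constraints present in language as possible, yet the commonly used text representations ignore the order of the language features that make up a text. This paper uses a clustering-based method to quantize ordered pairs of language features quickly, and derives from it a new text representation based on feature ordered pairs that exposes the feature order information present in a text. Using the vector space centroid method, texts were represented by word pairs and by word-class pairs, and experiments were run on three data sets. Both representations outperformed those based on single words or single word classes: macro-averaged F1 rose by 3% to 4% and 5% to 7% in absolute terms (4% to 5% and 8% to 10% in relative terms), respectively. This indicates that feature order information plays an important role in improving text categorization performance.

Abstract (original English) Text categorization algorithms should capture the various constraints present in the language, but most neglect the order information of language features in the text. This paper presents a document representation scheme based on feature pair quantization using clustering to identify feature order information in the text, which is then combined with the vector space centroid algorithm. Tests were done representing documents by word pairs and by word sense pairs in three different data sets. The results show that the method outperforms traditional representations based on words or word senses. The average improvement in Micro-F1 is 3%-4% for word pairs and 5%-7% for word sense pairs. Therefore, feature order information plays an important role in improving text categorization performance.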
Source Journal of Tsinghua University (Science and Technology), 2006, No. 4, pp. 527-529, 533 (4 pages). Indexed in: EI, CAS, CSCD, Peking University Core.
Funding National "863" High-Tech Program (2001AA114071)
Keywords text categorization; feature selection; feature abstraction; feature transformation; singular value decomposition
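The abstract describes representing a text by ordered feature pairs and classifying it with the vector space centroid method. The following is a minimal sketch of that idea, not the authors' implementation. It assumes adjacent-token pairs, raw pair counts as weights, and cosine similarity to class centroids. The optional `cluster_of` mapping is a hypothetical stand-in for the clustering-based quantization, whose details the abstract does not give. All function names and the toy data are illustrative.

```python
import math
from collections import Counter, defaultdict


def ordered_pairs(tokens, cluster_of=None):
    """Return adjacent ordered pairs (a, b); order is kept, so (a, b) != (b, a).

    cluster_of is a hypothetical token -> cluster-id mapping standing in for
    the clustering-based quantization; unmapped tokens are left unchanged.
    """
    if cluster_of:
        tokens = [cluster_of.get(t, t) for t in tokens]
    return list(zip(tokens, tokens[1:]))


def normalize(vec):
    """Scale a sparse vector (dict of feature -> weight) to unit length."""
    norm = math.sqrt(sum(v * v for v in vec.values()))
    return {k: v / norm for k, v in vec.items()} if norm else {}


def vectorize(tokens, cluster_of=None):
    """Unit-length vector of ordered-pair counts for one document."""
    return normalize(Counter(ordered_pairs(tokens, cluster_of)))


def train_centroids(labeled_docs, cluster_of=None):
    """Sum the document vectors of each class, then normalize the sums."""
    sums = defaultdict(Counter)
    for label, tokens in labeled_docs:
        for pair, weight in vectorize(tokens, cluster_of).items():
            sums[label][pair] += weight
    return {label: normalize(vec) for label, vec in sums.items()}


def classify(tokens, centroids, cluster_of=None):
    """Return the label whose centroid has the highest cosine similarity."""
    doc = vectorize(tokens, cluster_of)

    def cosine(centroid):
        return sum(w * centroid.get(pair, 0.0) for pair, w in doc.items())

    return max(centroids, key=lambda label: cosine(centroids[label]))


# Toy training data (illustrative only, not from the paper).
train = [
    ("sports", "team wins match".split()),
    ("sports", "team loses match".split()),
    ("finance", "bank raises rates".split()),
    ("finance", "bank cuts rates".split()),
]
centroids = train_centroids(train)
predicted = classify("team wins final".split(), centroids)
```

Because the pairs are ordered, "team wins" and "wins team" produce different features, which is the order information a bag-of-words vector discards. Mapping tokens to cluster ids before pairing keeps the number of distinct pairs manageable, which is one plausible reading of "quantization" here.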

References (6)

  • 1. Salton G, Buckley C. Term weighting approaches in automatic text retrieval [J]. Information Processing and Management, 1988, 24(5): 513-523.
  • 2. YANG Yiming, Pedersen J O. A comparative study on feature selection in text categorization [A]. Proceedings of the 14th International Conference on Machine Learning [C]. Nashville: Morgan Kaufmann, 1997: 412-420.
  • 3. Kehagias A, Petridis V, Kaburlasos V G, et al. A comparison of word and sense-based text categorization using several classification algorithms [J]. Journal of Intelligent Information Systems, 2003, 21(3): 227-247.
  • 4. Deerwester S, Dumais S T, Furnas G W, et al. Indexing by latent semantic analysis [J]. Journal of the American Society for Information Science, 1990, 41: 391-407.
  • 5. Berry M W. Large-scale sparse singular value computations [J]. The International Journal of Supercomputer Applications, 1992, 6: 13-49.
  • 6. Ney H, Essen U, Kneser R. On structuring probabilistic dependences in stochastic language modeling [J]. Computer Speech and Language, 1994, 8: 1-38.


