
Chinese Short-Text Classification Based on LDA High-Frequency Word Expansion (基于LDA高频词扩展的中文短文本分类) Cited by: 38

A New Method of Key words Extraction for Chinese Short-text Classification
Abstract: Short texts differ from traditional documents in their brevity and sparseness, and they tend to be noisy. Feature extension can ease the high sparsity of the vector space model, but it inevitably introduces noise. To address this, the paper proposes a high-frequency word expansion method based on LDA. The high-frequency words of each category are extracted to form the feature space of the vector space model, and each short text is represented as a vector using TF-IDF. LDA is then used to obtain the latent topic features of each text, and the high-frequency words corresponding to latent topics whose probability exceeds a threshold are added to the text, reducing the effects of noise and sparsity. Experiments on Chinese short messages and news titles show that the method achieves higher classification performance than conventional classification methods.
Source: New Technology of Library and Information Service (《现代图书情报技术》), CSSCI, PKU Core Journal, 2013, No. 6, pp. 42-48 (7 pages)
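Funding: One of the research outcomes of the National 863 Program project "Multi-source Information Sensing Technology and Product Development for the Whole Agricultural Product Supply Chain" (project no. 2012AA101701-03)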
Keywords: short-text classification; high-frequency words; LDA; feature expansion
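
The pipeline described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy corpus, the smoothed IDF formula, and the choice to compute the TF-IDF vector after expansion are all assumptions. The LDA outputs (`topic_words`, `doc_topics`) are invented stand-ins for what a trained topic model would supply.

```python
import math
from collections import Counter


def category_high_frequency_words(docs, labels, k):
    """Union of the k most frequent words of each category (the feature space)."""
    per_category = {}
    for tokens, label in zip(docs, labels):
        per_category.setdefault(label, Counter()).update(tokens)
    features = set()
    for counter in per_category.values():
        features.update(word for word, _ in counter.most_common(k))
    return sorted(features)


def expand_text(tokens, doc_topics, topic_words, features, threshold):
    """Append the high-frequency words of every topic with P(topic | text) > threshold."""
    allowed = set(features)
    expanded = list(tokens)
    for topic, prob in enumerate(doc_topics):
        if prob > threshold:
            # Assumption: words already present are appended again,
            # which raises their term frequency.
            expanded.extend(w for w in topic_words[topic] if w in allowed)
    return expanded


def tfidf_vector(tokens, features, doc_freq, n_docs):
    """TF-IDF vector over the feature space, using a smoothed IDF (an assumption)."""
    tf = Counter(tokens)
    return [
        tf[w] * (math.log((1 + n_docs) / (1 + doc_freq[w])) + 1.0)
        for w in features
    ]


# Toy, pre-segmented corpus (hypothetical data).
docs = [
    ["stock", "market", "rise"], ["stock", "market"], ["market", "fall"],
    ["team", "match", "win"], ["team", "match"], ["match", "goal"],
]
labels = ["finance"] * 3 + ["sports"] * 3

features = category_high_frequency_words(docs, labels, k=2)
doc_freq = Counter(w for tokens in docs for w in set(tokens))

# Stand-ins for a trained LDA model: top words per topic and the topic
# distribution of one short text. These values are invented for illustration.
topic_words = [["market", "stock", "price"], ["match", "team", "score"]]
short_text = ["market", "fall"]
doc_topics = [0.85, 0.15]

expanded = expand_text(short_text, doc_topics, topic_words, features, threshold=0.5)
vector = tfidf_vector(expanded, features, doc_freq, len(docs))
print(features)
print(expanded)
print(vector)
```

In this sketch only topic 0 passes the 0.5 threshold, so only its words that also belong to the high-frequency feature space are appended. Words outside that space, such as "price", are ignored. The resulting vector could then be passed to any standard classifier.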

