An Improved Text Feature Extraction Algorithm (一种改进的文本特征提取算法) Cited by: 6

Study on the extraction of characteristics of Chinese text based on the LDA model
Abstract: Conventional feature extraction methods such as term frequency-inverse document frequency (TF-IDF) ignore the semantics of feature terms. To address this, an improved feature extraction algorithm based on the Latent Dirichlet Allocation (LDA) model is proposed. The algorithm works from the latent topic distribution of a document: each document is represented as a mixture, in specific proportions, of latent topics and the word distributions under those topics, and a document is generated by repeatedly choosing a topic with some probability and then choosing a word from that topic with some probability. Features are thus extracted according to the key semantics of the text. In addition, because standard LDA treats all feature terms "equally", the LDA model is modified with Gaussian weighting. Experimental results show that, compared with TF-IDF and the information gain method (given as "mutual information" in the record's English abstract), the proposed algorithm extracts more effective features and improves classification accuracy.
Authors: MA Li (马力), LIU Huifu (刘惠福)
Source: Journal of Xi’an University of Posts and Telecommunications (《西安邮电大学学报》), 2015, No. 6, pp. 79-81, 120 (4 pages)
Funding: Xi'an Science and Technology Plan Project (CXY1437(8))
Keywords: text classification; feature extraction; latent Dirichlet allocation (LDA); support vector machine (SVM)
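
The generative process summarized in the abstract (choose a topic with some probability, then choose a word from that topic with some probability) can be illustrated with a minimal sketch. This is only the standard LDA generative story, not the authors' improved algorithm: the Gaussian weighting is not shown because the record does not give its formula. The vocabulary, the topic-word probabilities, the Dirichlet prior values, and all function names below are hypothetical, chosen purely for illustration.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet distribution by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index according to the categorical distribution `probs`."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_document(topic_word, alpha, length, rng):
    """Generate one document under the standard LDA generative story.

    topic_word: one word distribution per topic (each row sums to 1)
    alpha:      Dirichlet prior over topics (hypothetical values)
    length:     number of word tokens to generate
    """
    theta = sample_dirichlet(alpha, rng)      # document-topic proportions
    topics, word_ids = [], []
    for _ in range(length):
        z = sample_index(theta, rng)          # choose a topic for this token
        w = sample_index(topic_word[z], rng)  # choose a word from that topic
        topics.append(z)
        word_ids.append(w)
    return theta, topics, word_ids


# Hypothetical toy vocabulary and two topics ("sports" and "finance").
VOCAB = ["goal", "match", "team", "stock", "market", "bank"]
TOPIC_WORD = [
    [0.4, 0.3, 0.3, 0.0, 0.0, 0.0],  # topic 0 only emits sports words
    [0.0, 0.0, 0.0, 0.4, 0.3, 0.3],  # topic 1 only emits finance words
]

rng = random.Random(0)  # fixed seed so the run is repeatable
theta, topics, word_ids = generate_document(TOPIC_WORD, [0.5, 0.5], 10, rng)
doc = [VOCAB[w] for w in word_ids]
print("topic proportions:", theta)
print("document:", " ".join(doc))
```

Feature extraction with LDA runs this story in reverse: given observed documents, inference estimates the topic proportions of each document, and those proportions (rather than raw term weights such as TF-IDF) serve as the feature vector passed to a classifier such as an SVM.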