Short text classification method based on the LDA topic model (cited by: 75)

Short text classification using latent Dirichlet allocation
Abstract: To address two key problems in short text classification, feature sparsity and strong context dependency, a method based on latent Dirichlet allocation (LDA) is proposed. The topics generated by the model serve two purposes: they distinguish the contexts in which the same word appears and decrease its weight, and they relate different words to reduce sparsity and increase their weights. A short text dataset was built by automatically crawling the titles of NetEase pages, and the titles were classified with K-nearest neighbors. Experiments show that the proposed method outperforms the traditional vector space model and topic-based similarity measures by about 5% and 2.5%, respectively, in classification performance.
Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2013, No. 6, pp. 1587-1590 (4 pages)
Funding: National Natural Science Foundation of China (60970061, 61075056, 61103067); Fundamental Research Funds for the Central Universities
Keywords: short text; classification; K-nearest neighbors (K-NN); similarity measure; latent Dirichlet allocation
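The abstract's point that topics can relate different words, and so reduce sparsity, can be illustrated with a minimal sketch. It is not the authors' weighting scheme: the word-topic probabilities in `WORD_TOPICS` are hypothetical, hand-picked values standing in for the output of a trained LDA model, and the toy training set and all function names are assumptions made for illustration. It shows only that two one-word texts with no shared terms have zero similarity in a bag-of-words vector space but high similarity in topic space, and that K-nearest neighbors can then vote over topic-space neighbors.

```python
import math
from collections import Counter

# Hypothetical P(topic | word) values, hand-picked for illustration only.
# In practice they would be estimated by a trained LDA model.
WORD_TOPICS = {
    "stock":   [0.9, 0.1],
    "market":  [0.8, 0.2],
    "shares":  [0.9, 0.1],
    "goal":    [0.1, 0.9],
    "match":   [0.2, 0.8],
    "striker": [0.1, 0.9],
}

def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def bow_vector(tokens, vocab):
    """Term-count vector over a fixed vocabulary (vector space model)."""
    counts = Counter(tokens)
    return [counts[w] for w in vocab]

def topic_vector(tokens):
    """Average the topic distributions of the known words in a text."""
    known = [WORD_TOPICS[t] for t in tokens if t in WORD_TOPICS]
    if not known:
        return [0.0, 0.0]
    return [sum(col) / len(known) for col in zip(*known)]

def knn_predict(query, training, k=3):
    """Majority vote among the k training texts closest in topic space."""
    q = topic_vector(query)
    ranked = sorted(training,
                    key=lambda item: cosine(q, topic_vector(item[0])),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy labeled titles (hypothetical data, not the NetEase dataset).
TRAIN = [
    (["stock", "market"], "finance"),
    (["shares", "market"], "finance"),
    (["goal", "match"], "sports"),
    (["striker", "goal"], "sports"),
]

VOCAB = sorted(WORD_TOPICS)
a, b = ["stock"], ["shares"]
bow_sim = cosine(bow_vector(a, VOCAB), bow_vector(b, VOCAB))  # no shared word
topic_sim = cosine(topic_vector(a), topic_vector(b))          # same topic mix
prediction = knn_predict(["striker", "match"], TRAIN, k=3)
```

With these toy values, "stock" and "shares" share no term, so their bag-of-words similarity is zero, while their topic vectors coincide. A real pipeline would learn the word-topic distributions from a corpus and would also need the context-dependent down-weighting the abstract describes, which this sketch omits.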

References (16)

  • 1. PARK E K, RA D Y, JANG M G. Techniques for improving Web retrieval effectiveness[J]. Information Processing & Management, 2005, 41(5): 1207-1223.
  • 2. LIU W Y, HAO T Y, CHEN W, et al. A Web-based platform for user-interactive question-answering[J]. World Wide Web, 2009, 12(2): 107-124.
  • 3. ZHENG Feiran, MIAO Duoqian, ZHANG Zhifei, GAO Can. A method for detecting news topics in Chinese microblogs[J]. Computer Science, 2012, 39(1): 138-141. (in Chinese) Cited by: 84
  • 4. HE Tao, CAO Xianbin, TAN Hui. An immune-based clustering algorithm for Chinese Web short texts[J]. Acta Automatica Sinica, 2009, 35(7): 896-902. (in Chinese) Cited by: 18
  • 5. SALTON G, WONG A, YANG C S. A vector space model for automatic indexing[J]. Communications of the ACM, 1975, 18(11): 613-620.
  • 6. PHAN X H, NGUYEN M L, HORIGUCHI S. Learning to classify short and sparse text & Web with hidden topics from large-scale data collections[C]// Proceedings of the 17th Conference on World Wide Web. New York: ACM, 2008: 91-100.
  • 7. WANG L, JIA Y, HAN W H. Instant message clustering based on extended vector space model[C]// Proceedings of the 2nd International Conference on Advances in Computation and Intelligence. Berlin: Springer-Verlag, 2007: 435-443.
  • 8. SAHAMI M, HEILMAN T D. A Web-based kernel function for measuring the similarity of short text snippets[C]// Proceedings of the 15th Conference on World Wide Web. New York: ACM, 2006: 377-386.
  • 9. YIH W, MEEK C. Improving similarity measures for short segments of text[C]// Proceedings of the 22nd Conference on Artificial Intelligence. Menlo Park: AAAI Press, 2007: 1489-1494.
  • 10. ZHAI Yandong, WANG Kangping, ZHANG Dongna, HUANG Lan, ZHOU Chunguang. A WordNet-based semantic similarity algorithm for short texts[J]. Acta Electronica Sinica, 2012, 40(3): 617-620. (in Chinese) Cited by: 34

Secondary references (41)

  • 1. ZHONG Jiang, WU Zhongfu, WU Kaigui, OU Ling. A dynamic clustering algorithm based on artificial immune networks[J]. Acta Electronica Sinica, 2004, 32(8): 1268-1272. (in Chinese) Cited by: 24
  • 2. MA Jing. Internet language from a linguistic perspective[J]. Journal of Northwestern Polytechnical University (Social Sciences), 2002, 22(1): 52-56. (in Chinese) Cited by: 22
  • 3. HUANG Yongguang, LIU Ting, CHE Wanxiang, HU Xiaoguang. A fast clustering algorithm for variant short texts[J]. Journal of Chinese Information Processing, 2007, 21(2): 63-68. (in Chinese) Cited by: 17
  • 4. WANG Yongheng, JIA Yan, YANG Shuqiang. Research on clustering techniques for massive short-message texts[J]. Computer Engineering, 2007, 33(14): 38-40. (in Chinese) Cited by: 13
  • 5. Wang L, Jia Y, Han W H. Instant message clustering based on extended vector space model. In: Proceedings of the 2nd International Symposium on Intelligence Computation and Applications. Wuhan, China: Springer, 2007. 435-443
  • 6. He H, Chen B, Xu W R, Guo J. Short text feature extraction and clustering for web topic mining. In: Proceedings of the 3rd International Conference on Semantics, Knowledge and Grid. Washington D.C., USA: IEEE, 2007. 382-385
  • 7. de Castro L N, Von Zuben F J. aiNet: an artificial immune network for data analysis. Data Mining: A Heuristic Approach. New York: Idea Group Publishing, 2001. 231-259
  • 8. Xia Y Q, Wong K F. Anomaly detecting within dynamic Chinese chat text. In: Proceedings of New Text Workshop at the 11th Conference for European Chapter of the Association for Computational Linguistics. Trento, Italy: ACL Anthology Network, 2006. 48-55
  • 9. Xia Y Q, Wong K F, Gao W. NIL is not nothing: recognition of Chinese network informal language expressions. In: Proceedings of the 4th SIGHAN Workshop on Chinese Language Processing. Jeju Island, Republic of Korea: ACL Anthology Network, 2005. 95-102
  • 10. Hang X S, Dai H H. An immune network approach for web document clustering. In: Proceedings of the IEEE/WIC/ACM International Conference on Web Intelligence. Beijing, China: IEEE, 2004. 278-284

Co-citing documents: 132

Co-cited documents: 592

Citing documents: 75

Secondary citing documents: 456
