Topic mining method on massive microblog data

Cited by: 4
Abstract: As microblogging grows in popularity, Sina Weibo has become one of the main platforms through which the public obtains and spreads information, and topic mining on microblog data has become an active research area. This paper proposes a topic mining method for large-scale microblog data. It first analyzes the large-scale microblog data and removes duplicate records with a Bloom Filter-based algorithm. It then preprocesses the text according to the structure specific to microblogs and proposes Social Network LDA (SNLDA), an improved LDA topic model whose inference is derived with Gibbs sampling, to mine microblog topics. Experimental results show that the method can effectively mine topic information from large-scale microblog data.
Source: Computer Engineering and Applications (《计算机工程与应用》), CSCD, 2014, No. 22, pp. 32-37 (6 pages)
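Funding: National Natural Science Foundation of China (No. 11205179, No. 11305196); National High Technology Research and Development Program of China (863 Program) (No. 2014AA015205)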
Keywords: microblog; Bloom Filter; Social Network LDA (SNLDA); topic mining
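
The abstract mentions Bloom Filter-based deduplication but gives no implementation details. The following is a minimal sketch of how such a step might look, not the authors' code. The class `BloomFilter`, the function `deduplicate`, the use of SHA-256 with double hashing, and the default sizes are all assumptions made for illustration. Posts are treated as duplicates when their whitespace-stripped text is identical, which is also an assumption.

```python
import hashlib


class BloomFilter:
    """Fixed-size Bloom filter (hypothetical helper, not from the paper)."""

    def __init__(self, num_bits: int, num_hashes: int) -> None:
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray((num_bits + 7) // 8)

    def _positions(self, item: str):
        # Derive k bit positions from one SHA-256 digest via double hashing.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1  # force an odd, non-zero step
        for i in range(self.num_hashes):
            yield (h1 + i * h2) % self.num_bits

    def add(self, item: str) -> None:
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def __contains__(self, item: str) -> bool:
        # True means "probably seen"; False means "definitely not seen".
        return all(
            self.bits[pos // 8] & (1 << (pos % 8))
            for pos in self._positions(item)
        )


def deduplicate(posts, num_bits=1 << 20, num_hashes=5):
    """Keep the first occurrence of each post text, in input order."""
    seen = BloomFilter(num_bits, num_hashes)
    unique = []
    for text in posts:
        key = text.strip()  # assumed normalization: trim surrounding whitespace
        if key in seen:
            continue  # probable duplicate; a false positive drops a unique post
        seen.add(key)
        unique.append(text)
    return unique


if __name__ == "__main__":
    sample = ["weibo post A", "weibo post B", "weibo post A ", "weibo post C"]
    print(deduplicate(sample))
```

A Bloom filter never reports a previously added item as unseen, but it can occasionally report a new item as seen. In a deduplication setting this means a small fraction of unique posts may be discarded, and that fraction shrinks as `num_bits` grows relative to the number of posts.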