

Microblog user interest mining based on distributed LDA-Spark
Abstract: To mine the latent semantic information in massive microblog data, a parallelized LDA topic-model algorithm is implemented using Gibbs sampling on the Spark distributed computing framework. Tailored to the characteristics of microblog data, the three-layer Bayesian probability model is changed to a user-topic-word model. To meet the parallel-processing requirements of LDA, a conflict-free data-partitioning method divides the data set into P × P blocks; the blocks are then reordered into P subsets so that each subset contains exactly P blocks, and each subset is sampled in parallel. The improved algorithm is compared with standard LDA on three measures: perplexity, convergence speed, and speedup. The perplexity of the two algorithms is close. The improved algorithm converges more slowly than standard LDA, but this has little effect on efficiency in practice. In the speedup experiment, with a total of 1 million words and 8 worker nodes, the improved algorithm takes 16.78% of the time of standard LDA. The experimental results show that the improved algorithm yields a fairly accurate model and achieves good acceleration in a big-data environment.
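The conflict-free partitioning described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names `partition_tokens` and `schedule_rounds` are hypothetical, and a simple diagonal-shift schedule is assumed for grouping the P × P blocks into P conflict-free subsets.

```python
from collections import defaultdict

def partition_tokens(tokens, P):
    """Split (user, word) token pairs into a P x P grid of blocks.

    Block (i, j) holds the tokens whose user falls into user-group i
    and whose word falls into word-group j.
    """
    blocks = defaultdict(list)
    for user, word in tokens:
        blocks[(hash(user) % P, hash(word) % P)].append((user, word))
    return blocks

def schedule_rounds(P):
    """Regroup the P x P blocks into P conflict-free subsets.

    In round r, worker i processes block (i, (i + r) % P).  Within a
    round, no two blocks share a user group or a word group, so the
    user-topic and word-topic counters they update never collide and
    the P blocks can be Gibbs-sampled in parallel.
    """
    return [[(i, (i + r) % P) for i in range(P)] for r in range(P)]
```

In each round the P blocks of one subset would be sampled in parallel (e.g. as Spark tasks); after P rounds, every one of the P × P blocks has been visited exactly once, completing one full Gibbs sweep over the data.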
Source: Journal of Beijing Information Science and Technology University (Natural Science Edition), 2017, No. 3, pp. 70-74 (5 pages)
Fund: National 863 Program project "Intelligent assessment of knowledge and abilities for basic education and human-like question-answering verification system" (2015AA015409)
Keywords: Spark; distributed framework; latent Dirichlet allocation; microblog; topic model