
ZenLDA: Large-Scale Topic Model Training on Distributed Data-Parallel Platform (Cited by: 1)

Abstract: Recently, topic models such as Latent Dirichlet Allocation (LDA) have been widely used in large-scale web mining. Many large-scale LDA training systems have been developed, and they usually adopt a customized top-to-bottom design with sophisticated synchronization support. We propose an LDA training system named ZenLDA, which instead follows a generalized design for distributed data-parallel platforms. The novelty of ZenLDA consists of three main aspects: (1) it converts the commonly used serial Collapsed Gibbs Sampling (CGS) inference algorithm into a Monte-Carlo Collapsed Bayesian (MCCB) estimation method, which is embarrassingly parallel; (2) it decomposes the LDA inference formula into parts that can be sampled more efficiently, reducing computational complexity; (3) it proposes a distributed LDA training framework that represents the corpus as a directed graph, with the parameters annotated on the corresponding vertices, and implements ZenLDA and other well-known inference methods on top of Spark. Experimental results indicate that MCCB converges with accuracy similar to that of CGS while running much faster. On top of MCCB, the ZenLDA formula decomposition achieved the fastest speed among the well-known inference methods compared. ZenLDA also showed good scalability when training large-scale topic models on the data-parallel platform. Overall, ZenLDA achieves computing performance comparable to, and in some cases better than, that of state-of-the-art dedicated systems.
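The record does not reproduce the inference formula itself, so for orientation only: below is the standard collapsed Gibbs conditional for LDA together with a well-known SparseLDA-style split of the sampling mass into a dense smoothing term and two sparse terms. This illustrates the kind of decomposition that point (2) refers to; ZenLDA's own decomposition may differ in its exact form.

% Standard CGS conditional for token i of document d with word w;
% the superscript -di excludes the current assignment, V is the vocabulary size.
\[
p(z_{di}=k \mid \mathbf{z}^{-di}, \mathbf{w}) \;\propto\;
\left(n_{dk}^{-di} + \alpha\right)\,
\frac{n_{kw}^{-di} + \beta}{\,n_{k}^{-di} + V\beta\,}
\]
% SparseLDA-style expansion of the numerator: a constant smoothing mass,
% a part nonzero only for topics present in document d, and a part nonzero
% only for topics in which word w actually occurs.
\[
\left(n_{dk} + \alpha\right)\left(n_{kw} + \beta\right)
= \alpha\beta \;+\; n_{dk}\,\beta \;+\; \left(n_{dk} + \alpha\right) n_{kw}
\]

Because the last two terms are sparse (most documents contain few topics, and most words occur in few topics), a sampler can precompute the dense mass once and touch only the nonzero counts per token, which is what makes the decomposed formula cheaper to sample.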
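Point (3) describes representing the corpus as a directed graph with parameters annotated on vertices. The following minimal Scala sketch shows one such layout in Spark GraphX, with words and documents as two disjoint vertex sets and each token occurrence as an edge carrying its topic assignment; the bipartite encoding and all names here (CorpusGraphSketch, buildCorpusGraph, TopicCounts) are illustrative assumptions, not ZenLDA's actual API.

// Minimal sketch: corpus as a directed bipartite graph in Spark GraphX.
// Word and document IDs occupy disjoint VertexId ranges; each edge is one
// (word -> document) token occurrence annotated with its current topic.
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object CorpusGraphSketch {
  // Vertex attribute: per-topic counts (word-topic or doc-topic vector).
  type TopicCounts = Array[Int]

  def buildCorpusGraph(tokens: RDD[(Long, Long)], // (docId, wordId) pairs
                       numWords: Long,
                       numTopics: Int): Graph[TopicCounts, Int] = {
    // Shift document IDs past the word-ID range so the two vertex sets
    // never collide in the shared VertexId space.
    val edges: RDD[Edge[Int]] = tokens.map { case (docId, wordId) =>
      val initialTopic = scala.util.Random.nextInt(numTopics)
      Edge(wordId, numWords + docId, initialTopic)
    }
    // Every vertex starts with an empty topic-count vector; aggregating
    // edge topics into these vectors yields the word-topic and doc-topic
    // tables that a sampler would read.
    Graph.fromEdges(edges, defaultValue = Array.fill(numTopics)(0))
  }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("corpus-graph").setMaster("local[*]"))
    val tokens = sc.parallelize(Seq((0L, 1L), (0L, 2L), (1L, 2L))) // toy corpus
    val graph = buildCorpusGraph(tokens, numWords = 3L, numTopics = 4)
    println(s"vertices=${graph.vertices.count()}, edges=${graph.edges.count()}")
    sc.stop()
  }
}

Encoding each token occurrence as an edge means the count tables can be rebuilt by a single aggregation over edge attributes, which is what makes this representation a natural fit for a data-parallel platform.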
Source: Big Data Mining and Analytics, 2018, Issue 1, pp. 57-74 (18 pages).
Funding: Partially supported by the National Natural Science Foundation of China (No. 61572250) and the Science and Technology Program of Jiangsu Province (No. BE2017155).
Keywords: latent Dirichlet allocation; collapsed Gibbs sampling; Monte-Carlo; graph computing; large-scale machine learning