Abstract
To address the computation-time bottleneck in large-scale text sentiment analysis experiments that use ensemble learning, this paper proposes a parallelization algorithm for ensemble learning models based on the Spark platform. Comparative ensemble learning experiments were run on text datasets at three orders of magnitude. The results show that the algorithm substantially shortens text classification time while keeping the F-score and other evaluation metrics close to those of the single-machine version. The algorithm also scales well, which substantially reduces the time cost of model optimization and parameter tuning.
Authors
Yang Liyue (杨立月)
Wang Yizhi (王移芝)
College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (北大核心)
2020, Issue 6, pp. 130-134 (5 pages)
Keywords
Spark
Distributed computing
Model parallelization
Ensemble learning
Text sentiment analysis