Abstract
The Self-optimization Transductive Support Vector Machine (SoTSVM) cannot process large-scale training data efficiently. To extend it to massive data sets and shorten its running time, a parallel SoTSVM algorithm based on the Message Passing Interface (MPI) is proposed. First, analysis of the serial SoTSVM algorithm identifies the prediction-confidence determination module as the most time-consuming part. Second, this module is divided into two mutually independent sub-modules: distance calculation and confidence determination. Finally, the tasks of the distance calculation sub-module are assigned to the MPI processes in round-robin order, and the confidence determination sub-module is handled with a parallel sorting algorithm, which together parallelize the algorithm. Simulation results show that the parallel algorithm saves more than 90% of the running time of the serial algorithm while keeping the parallel efficiency above 0.8. The parallel SoTSVM algorithm is therefore suitable for semi-supervised classification on massive data.
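The round-robin task assignment and the parallel efficiency metric mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `round_robin_tasks` and `parallel_efficiency` are hypothetical, the ranks are simulated in a plain loop rather than obtained from MPI, and all numbers are made up for illustration. Efficiency is taken in its usual sense, speedup divided by the number of processes.

```python
from typing import List


def round_robin_tasks(num_tasks: int, rank: int, size: int) -> List[int]:
    """Return the task indices handled by process `rank` out of `size`.

    Task i goes to process i % size, so consecutive tasks are dealt
    out to the processes in turn (hypothetical helper, not from the paper).
    """
    return list(range(rank, num_tasks, size))


def parallel_efficiency(t_serial: float, t_parallel: float, num_procs: int) -> float:
    """Parallel efficiency = speedup / number of processes."""
    speedup = t_serial / t_parallel
    return speedup / num_procs


if __name__ == "__main__":
    size = 4        # assumed number of processes
    num_tasks = 10  # assumed number of distance-calculation tasks
    for rank in range(size):
        # In a real MPI program, rank and size would come from the communicator.
        print(rank, round_robin_tasks(num_tasks, rank, size))

    # Illustrative timings only; these are not results from the paper.
    print(parallel_efficiency(t_serial=100.0, t_parallel=10.0, num_procs=12))
```

With this assignment every task is handled by exactly one process, and the per-process task counts differ by at most one, which keeps the load balanced when the tasks have similar cost.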
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journals (北大核心)
2017, Issue A02, pp. 27-31, 56 (6 pages)
Funding
National Key Research and Development Program of China (2017YFB0701501)
Natural Science Foundation of Shanghai (17ZR1409900)
Keywords
semi-supervised learning
Support Vector Machine (SVM)
machine learning
parallelization
classification