摘要
基于大数据挖掘的实时性要求和数据样本的多样性特征,提出一种面向大数据挖掘的机器学习模型训练优化算法。分析当前算法的迭代计算过程,根据模型向量的改变量将迭代过程分为粗调和微调两个阶段,并发现在微调阶段绝大部分样本对计算结果的影响极小,因此可以在微调阶段不计算此类样本的梯度而直接采用上次迭代的计算结果,从而减小计算量,提升计算效率。试验结果表明,算法在分布式集群环境下可以减小模型训练约35%的计算量,且训练得到的模型准确度在正常范围内,可有效提高大数据挖掘的实时性。
Traditional machine learning algorithms for small data were not applicable for mining of big data. An optimization algorithm for machine learning and big data mining was proposed. The iterative computation of machine learning algorithms was divided into two phases according to the change of model vector. According to the observation that most samples contributed little to the model update during the iteration,the computation load of machine learning algorithms could be reduced by reusing the iterative computing results of this kind of samples. The experimental results showed that the proposed method could reduce the computation load by 35%,with little effect on prediction accuracy of the training model.
出处
《山东大学学报(工学版)》
CAS
北大核心
2017年第4期1-6,共6页
Journal of Shandong University(Engineering Science)
基金
河南省重点科技攻关资助项目(162102210096
152102210088
142102210090)
河南省高等学校重点科研资助项目(18A520014)