Abstract
The naive Bayes (NB) algorithm achieves high accuracy in spam classification and separates spam from legitimate mail well. However, its training phase, which precedes classification, consumes substantial system and network resources and seriously reduces classification efficiency. To address this, the Spark platform is introduced: mail classification is approached in parallel, and the lineage of Spark's RDDs is used to organize the stages of NB mail classification. Experimental results show that, compared with other traditional classification methods, naive Bayes performs well on precision, recall, and related metrics. Moreover, compared with traditional single-machine mail classification, the distributed approach on a Spark cluster greatly increases classification speed.
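The abstract does not give implementation details, so the following is only a minimal sketch of the general idea, not the authors' method. It assumes a multinomial naive Bayes model with Laplace smoothing and whitespace tokenization, and it imitates partition-wise counting followed by a merge step in plain Python rather than calling Spark itself. The function names, the labels "spam" and "ham", and the toy messages are all hypothetical.

```python
import math
from collections import Counter
from functools import reduce

LABELS = ("spam", "ham")  # hypothetical label names


def count_partition(partition):
    """Map step: class counts and per-class word counts for one partition."""
    docs = Counter()
    words = {label: Counter() for label in LABELS}
    for label, text in partition:
        docs[label] += 1
        words[label].update(text.lower().split())
    return docs, words


def merge_counts(a, b):
    """Reduce step: combine the statistics gathered from two partitions."""
    docs = a[0] + b[0]
    words = {label: a[1][label] + b[1][label] for label in LABELS}
    return docs, words


def train(partitions):
    """Aggregate counts over all partitions and collect the vocabulary."""
    docs, words = reduce(merge_counts, map(count_partition, partitions))
    vocab = set().union(*(words[label] for label in LABELS))
    return docs, words, vocab


def predict(model, text):
    """Return the label with the highest log-posterior score."""
    docs, words, vocab = model
    total_docs = sum(docs.values())
    scores = {}
    for label in LABELS:
        total_words = sum(words[label].values())
        score = math.log(docs[label] / total_docs)  # class prior
        for token in text.lower().split():
            if token in vocab:  # tokens never seen in training are skipped
                # Laplace (add-one) smoothed word likelihood
                score += math.log(
                    (words[label][token] + 1) / (total_words + len(vocab))
                )
        scores[label] = score
    return max(scores, key=scores.get)


def precision_recall(y_true, y_pred, positive="spam"):
    """Precision and recall for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


# Toy data split into two "partitions", standing in for distributed input.
partitions = [
    [("spam", "win free money now"), ("ham", "meeting schedule for monday")],
    [("spam", "free prize claim now"), ("ham", "project report attached")],
]
model = train(partitions)
```

Because the per-partition counts are simple sums, they can be computed independently and merged in any order, which is what makes the training phase a natural fit for a cluster. Calling predict(model, "claim free money") on the toy model classifies the message with the merged counts, and precision_recall compares a list of predictions against the true labels.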
Authors
刘月峰
张亚斌
苑江浩
LIU Yue-feng; ZHANG Ya-bin; YUAN Jiang-hao (School of Information Engineering, Inner Mongolia University of Science and Technology, Baotou 014010, China)
Source
《微电子学与计算机》
CSCD
Peking University Core Journal (北大核心)
2018, Issue 8, pp. 60-63 (4 pages)
Microelectronics & Computer