Abstract
Traditional classification algorithms require the entire training dataset before the model can be trained. In a big data environment, however, data flow into the system continuously as a stream, so the whole training dataset cannot be obtained in advance. This paper studies the online classification of noisy streaming data in a big data environment. It formulates online classification of streaming data as an optimization problem, proposes a weighted Naive Bayes classifier and an Error Adaptive classifier, and validates the proposed algorithms on two real datasets. The experiments show that, when the stream contains no noise, the Error Adaptive classifier predicts more accurately than related algorithms; when the stream is noisy, the classifier is insensitive to the noise and still achieves good prediction accuracy. It can therefore be applied to online classification of streaming data in big data environments.
Source
Computer Science (《计算机科学》)
CSCD
Peking University Core Journals (北大核心)
2014, No. 5, pp. 227-229, 234 (4 pages)
Funding
Supported by the National Natural Science Foundation of China (61170121)
Keywords
Big data
Decision tree
Classification algorithm
Data streaming