Abstract
As the number of microblog posts published by users grows rapidly, datasets have reached the TB or even PB scale. Because sentiment classification over such massive microblog datasets cannot be completed well in a single-machine environment, this paper proposes a Chinese microblog sentiment classification scheme based on the Hadoop cloud platform. Taking into account the linguistic features specific to microblog text, the scheme parallelizes preprocessing, feature selection, text vectorization, and the KNN classification algorithm in turn on the MapReduce framework. A comparison of single-machine and cluster experiments shows that classification efficiency on the Hadoop cloud platform improves rapidly as the cluster is scaled up, without affecting classification quality.
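The abstract does not give implementation details, so the following is only a minimal sketch of how a KNN classifier can be split into a map step and a reduce step. It simulates the two phases in a single Python process rather than on Hadoop. The function names (`knn_map`, `knn_reduce`, `run_job`), the sparse dict representation of document vectors, and the use of cosine similarity are all assumptions made for illustration, not the paper's method.

```python
import heapq
import math
from collections import Counter, defaultdict


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts (assumed metric)."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def knn_map(train_split, test_docs, k):
    """Map step: for one split of the training set, emit each test
    document's k most similar training examples from that split."""
    for test_id, test_vec in test_docs.items():
        scored = [(cosine(test_vec, vec), label) for vec, label in train_split]
        for sim, label in heapq.nlargest(k, scored):
            yield test_id, (sim, label)


def knn_reduce(test_id, candidates, k):
    """Reduce step: merge the local candidates for one test document,
    keep the global top k, and take a majority vote over their labels."""
    top = heapq.nlargest(k, candidates)
    votes = Counter(label for _, label in top)
    return test_id, votes.most_common(1)[0][0]


def run_job(train_splits, test_docs, k=3):
    """Hypothetical in-process driver: runs the map step on every split,
    groups the emitted pairs by key (the shuffle), then reduces each group."""
    grouped = defaultdict(list)
    for split in train_splits:
        for key, value in knn_map(split, test_docs, k):
            grouped[key].append(value)
    return dict(knn_reduce(key, values, k) for key, values in grouped.items())
```

Because each split only forwards its local top k candidates, the reduce step handles at most k candidates per split for each test document, which is what lets the distance computation scale out with the number of splits.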
Source
Information Technology (《信息技术》)
2015, No. 9, pp. 215-218 (4 pages)
Keywords
sentiment classification
Hadoop
massive data
KNN classification algorithm
parallelization