Study on sentiment classification for Chinese microblog based on Hadoop
(Original title: 基于Hadoop云平台的中文微博情感分类研究)


Abstract: As the number of microblog posts grows rapidly, datasets have reached the terabyte and even petabyte scale. Because sentiment classification over such massive microblog datasets cannot be completed satisfactorily on a single machine, this paper proposes a sentiment classification scheme for Chinese microblogs based on the Hadoop cloud platform. Taking into account the linguistic features specific to microblog text, the scheme parallelizes preprocessing, feature selection, text vectorization, and the KNN classification algorithm in turn on the MapReduce framework. A comparison of single-machine and cluster experiments shows that classification efficiency on the Hadoop cloud platform improves rapidly as the cluster grows, without affecting classification quality.
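The abstract does not give the paper's implementation, so the following is only a minimal in-memory sketch of how a KNN classifier can be split into a map phase and a reduce phase. It makes several assumptions that are not in the source: documents are already vectorized as sparse `{term: weight}` dictionaries, similarity is cosine similarity, each mapper sees one split of the training set and emits its local top-k neighbours, and the function names (`knn_map`, `knn_reduce`, `run_job`) are hypothetical. It simulates the shuffle step with a dictionary and does not use the Hadoop API.

```python
import heapq
import math
from collections import Counter, defaultdict


def cosine(a, b):
    """Cosine similarity of two sparse vectors stored as {term: weight} dicts."""
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0


def knn_map(train_split, tests, k):
    """Map phase (assumed design): for one split of the training set, emit
    each test document's k most similar local neighbours as
    (test_id, (similarity, label)) pairs."""
    for test_id, test_vec in tests.items():
        scored = ((cosine(test_vec, vec), label) for vec, label in train_split)
        for pair in heapq.nlargest(k, scored):
            yield test_id, pair


def knn_reduce(test_id, candidates, k):
    """Reduce phase: merge the local candidates for one test document,
    keep the global top k, and return the majority label."""
    top = heapq.nlargest(k, candidates)
    votes = Counter(label for _, label in top)
    return test_id, votes.most_common(1)[0][0]


def run_job(train_splits, tests, k=3):
    """Simulate shuffle-and-sort in memory by grouping map output by key."""
    grouped = defaultdict(list)
    for split in train_splits:
        for test_id, pair in knn_map(split, tests, k):
            grouped[test_id].append(pair)
    return dict(knn_reduce(tid, cands, k) for tid, cands in grouped.items())


# Toy data (invented for illustration): two training splits, two test posts.
train_splits = [
    [({"good": 1.0, "happy": 1.0}, "pos"), ({"bad": 1.0, "hate": 1.0}, "neg")],
    [({"good": 1.0, "love": 1.0}, "pos"), ({"bad": 1.0, "awful": 1.0}, "neg")],
]
tests = {"t1": {"good": 1.0, "love": 1.0}, "t2": {"bad": 1.0}}
result = run_job(train_splits, tests, k=3)
print(result)
```

Emitting only the local top-k from each split keeps the data sent to the reducer proportional to k times the number of splits rather than to the size of the training set, which is one common reason to structure KNN this way; whether the paper does the same is not stated in the abstract.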

Authors: Shao Qiu (邵丘), Yang Hebiao (杨鹤标)

Affiliation: School of Computer Science and Communication Engineering, Jiangsu University

Source: Information Technology (《信息技术》), 2015, No. 9, pp. 215-218 (4 pages)

Keywords: sentiment classification; Hadoop; massive data; KNN classification algorithm; parallelization

Classification code: TP391 [Automation and Computer Technology: Computer Application Technology]


