Prediction of retweeting behavior for imbalanced dataset in microblogs (cited by: 2)
Abstract To address the class imbalance in datasets used for microblog retweet prediction, a prediction method that combines oversampling with the Random Forest (RF) algorithm is proposed. First, three categories of features related to retweeting behavior are defined (individual information, social relationships, and microblog topics), and the key features are selected using information gain. Second, the Synthetic Minority Over-sampling Technique (SMOTE) is adapted to the characteristics of microblog feature data: the probability distribution of the original dataset is estimated nonparametrically, and the dataset is oversampled according to this approximate distribution so that the numbers of positive and negative examples are balanced. Finally, a random forest classifier is trained on the key retweet features, and its parameters are analyzed and set using the Out-Of-Bag (OOB) error estimate. The method is compared with retweet prediction methods based on Decision Tree (DT), Support Vector Machine (SVM), Naive Bayes (NB), and RF. Its overall performance exceeds that of the SVM-based method, the best of the baselines, with recall improved by 8% and F-measure improved by 5%. The experimental results show that the proposed method can effectively improve the accuracy of microblog retweet prediction in practical applications.
Source Journal of Computer Applications (《计算机应用》), CSCD, Peking University Core Journal, 2015, No. 7, pp. 1959-1964 (6 pages)
Funding National Natural Science Foundation of China (61272458)
Keywords microblog; retweet prediction; imbalanced dataset; oversampling; Random Forest (RF)
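
The oversampling step mentioned in the abstract can be illustrated with a minimal sketch of classic SMOTE interpolation. This is not the authors' improved variant: the abstract does not give the details of the nonparametric distribution estimate, so that part is left out. The function name `smote`, its parameters, and the toy feature vectors are all hypothetical and chosen only for illustration.

```python
import random


def smote(minority, n_new, k=5, seed=0):
    """Create n_new synthetic minority samples by classic SMOTE interpolation.

    Hypothetical helper for illustration; not the improved method from the paper.
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # k nearest minority neighbours of the base sample
        # (squared Euclidean distance, base itself excluded).
        neighbours = sorted(
            (s for s in minority if s is not base),
            key=lambda s: sum((a - b) ** 2 for a, b in zip(base, s)),
        )[:k]
        mate = rng.choice(neighbours)
        gap = rng.random()  # position along the segment from base to mate
        synthetic.append([a + gap * (b - a) for a, b in zip(base, mate)])
    return synthetic


# Toy feature vectors (made-up numbers, not data from the paper).
retweeted = [[0.9, 0.8], [0.8, 0.9], [1.0, 0.7], [0.7, 1.0]]
not_retweeted = [[0.1 * i, 0.05 * i] for i in range(12)]

# Add enough synthetic positives to match the number of negatives.
extra = smote(retweeted, n_new=len(not_retweeted) - len(retweeted), k=3)
balanced_positive = retweeted + extra
print(len(balanced_positive), len(not_retweeted))
```

Each synthetic point lies on the line segment between a real minority sample and one of its nearest minority neighbours, so the new positives stay inside the region already occupied by the retweeted class. A distribution-aware variant such as the one the abstract describes would presumably choose where to place new samples using the estimated density rather than uniformly at random.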