Prediction of retweeting behavior for imbalanced dataset in microblogs (面向不平衡微博数据集的转发行为预测方法)

Cited by: 2

Abstract: To address the problem that imbalanced datasets degrade the prediction of retweeting behavior in microblogs, a prediction method combining an oversampling technique with the Random Forest (RF) algorithm is proposed. First, three types of retweeting-related features are defined (individual information, social relationships, and microblog topics), and key features are selected with an information gain algorithm. Second, the Synthetic Minority Over-sampling Technique (SMOTE) is improved to suit the characteristics of microblog feature data: the probability distribution of the original dataset is estimated nonparametrically, and the dataset is oversampled according to this approximate distribution so that the numbers of positive and negative examples are balanced. Finally, a classifier is trained with the random forest algorithm on the key retweeting features, and the random forest parameters are analyzed and set using Out-Of-Bag (OOB) error estimation. In comparison with retweet prediction methods based on Decision Tree (DT), Support Vector Machine (SVM), Naive Bayes (NB), and RF, the proposed method outperforms the SVM-based method, the best of the baselines, improving recall by 8% and F-measure by 5%. The experimental results show that the proposed method can effectively improve the accuracy of retweeting behavior prediction in practical applications.
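The oversampling step in the abstract can be illustrated with a minimal sketch. This is plain SMOTE (interpolation between a minority sample and one of its nearest minority neighbours), not the improved variant described above, which additionally relies on a nonparametric estimate of the data distribution. The function names `smote` and `balance`, the parameter defaults, and the toy feature vectors are assumptions made for illustration only.

```python
import math
import random


def smote(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between a minority
    sample and one of its k nearest minority neighbours (plain SMOTE)."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to `base`.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        neighbour = minority[rng.choice(others[:k])]
        gap = rng.random()  # position along the segment base -> neighbour
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbour)))
    return synthetic


def balance(positives, negatives, k=3, seed=0):
    """Oversample the smaller class until both classes have the same size."""
    if len(positives) < len(negatives):
        positives = positives + smote(positives, len(negatives) - len(positives), k, seed)
    elif len(negatives) < len(positives):
        negatives = negatives + smote(negatives, len(positives) - len(negatives), k, seed)
    return positives, negatives


# Hypothetical two-feature data: few retweeted (positive) examples, many ignored ones.
retweeted = [(0.9, 0.8), (0.8, 0.9), (1.0, 1.0)]
ignored = [(0.1, 0.2), (0.2, 0.1), (0.0, 0.3), (0.3, 0.0), (0.1, 0.1), (0.2, 0.2)]
pos, neg = balance(retweeted, ignored)
print(len(pos), len(neg))  # 6 6
```

Because every synthetic point lies on a segment between two real minority samples, the new examples stay inside the region the minority class already occupies. A balanced set like this could then be passed to any classifier, such as a random forest.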

Authors: Zhao Yu (赵煜), Shao Bilin (邵必林), Bian Genqing (边根庆), Song Dan (宋丹)

Affiliation: School of Management, Xi'an University of Architecture and Technology (西安建筑科技大学管理学院)

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2015, No. 7, pp. 1959-1964 (6 pages)

Funding: National Natural Science Foundation of China (61272458)

Keywords: microblog; retweet prediction; imbalanced dataset; oversampling; Random Forest (RF)

Classification number: TP391.1 [Automation and Computer Technology: Computer Application Technology]
