Detection of hype groups based on mining maximum frequent itemsets in Microblogs

Abstract: In recent years, hype accounts on microblogs have proliferated, using illicit means to run online public relations campaigns and seriously disrupting normal order on the Internet. Traditional hype-account detection relies mainly on feature analysis, which overlooks the organized and planned nature of these accounts and struggles to find well-concealed ones. To address this, the paper exploits the group behavior of hype accounts, which tend to take part in the same hype microblogs together, and recasts hype group detection as the problem of mining maximum frequent itemsets. The resulting method finds groups of accounts that have repeatedly participated together in spreading hype microblogs. To make the mining efficient, and drawing on the research background and the characteristics of the transaction database, the paper proposes a maximum frequent itemset discovery algorithm based on iterative intersection (referred to as IIA). The algorithm shrinks the transaction database with a binary-search-based strategy for screening maximum frequent candidate itemsets, and uses several techniques to reduce the number of intersections computed between transactions. Finally, experiments evaluate the performance of the IIA algorithm and validate the detection method on a real Sina Weibo dataset. The results show that the hype groups found have an accuracy above 90%, and that the method uncovers concealed hype accounts that traditional feature-analysis methods have difficulty identifying.

Authors: LIU Yan; ZHANG Jin; CHEN Jing; YIN Meijuan; ZHANG Weili (State Key Laboratory of Mathematical Engineering and Advanced Computing, Zhengzhou 450002, China)

Affiliation: State Key Laboratory of Mathematical Engineering and Advanced Computing

Source: Computer Engineering and Applications (《计算机工程与应用》; indexed in CSCD and the Peking University Core Journals list), 2017, Issue 4, pp. 90-97 (8 pages)

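Funding: National Natural Science Foundation of China (No.61309007); National High Technology Research and Development Program of China (863 Program) (No.2012AA012902)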

Keywords: data mining; microblog; hype groups; maximum frequent itemsets

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
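
The abstract recasts hype group detection as mining maximum (often called maximal) frequent itemsets: each hype microblog is a transaction, the accounts that took part in it are its items, and a hype group is a largest set of accounts that appears together in at least a minimum number of transactions. The sketch below illustrates that formulation with a plain intersection-closure approach. It is not the paper's IIA algorithm, whose binary-search-based candidate screening and intersection-saving techniques are not detailed in this record. The function name `maximal_frequent_groups`, the parameters `min_support` and `min_size`, and the sample data are all assumptions made for illustration.

```python
from typing import FrozenSet, Iterable, List, Set


def maximal_frequent_groups(
    transactions: Iterable[Iterable[str]],
    min_support: int,
    min_size: int = 2,
) -> List[FrozenSet[str]]:
    """Return maximal account sets found together in >= min_support transactions."""
    txs = [frozenset(t) for t in transactions]

    # Every maximal frequent itemset is closed, and every closed itemset is the
    # intersection of some transactions, so closing the transactions under
    # intersection yields every candidate that is needed.
    candidates: Set[FrozenSet[str]] = set()
    for t in txs:
        new = {t} if len(t) >= min_size else set()
        for c in candidates:
            common = c & t
            # Intersections only shrink, so undersized sets can be dropped early.
            if len(common) >= min_size:
                new.add(common)
        candidates |= new

    # Keep candidates contained in enough transactions.
    frequent = [
        c for c in candidates
        if sum(1 for t in txs if c <= t) >= min_support
    ]
    # Keep candidates that have no frequent proper superset.
    maximal = [c for c in frequent if not any(c < other for other in frequent)]
    return sorted(maximal, key=lambda s: (-len(s), sorted(s)))


# Hypothetical data: each set lists the accounts that reposted one hype microblog.
hype_posts = [
    {"a", "b", "c", "x"},
    {"a", "b", "c", "y"},
    {"a", "b", "c", "z"},
    {"d", "e", "x"},
    {"d", "e", "y"},
    {"a", "d"},
]

for group in maximal_frequent_groups(hype_posts, min_support=2):
    print(sorted(group))
```

With this sample data and `min_support=2`, the sketch reports the groups `['a', 'b', 'c']` and `['d', 'e']`. The number of distinct intersections can grow exponentially with the number of transactions, which is the kind of cost that database reduction and intersection pruning, as the abstract describes for IIA, are meant to contain.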
