FSMBUS: A Spark-Based Algorithm for Large-Scale Frequent Subgraph Mining (cited by: 20)

FSMBUS:A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark

Abstract: With the rapid growth in the number of social network users, demand for mining frequent subgraphs in a single large-scale graph keeps increasing. Single-machine (serial) algorithms run inefficiently on large graphs and struggle to mine frequent subgraphs at low support thresholds. Existing distributed algorithms for frequent subgraph mining in a single graph do not support pattern-growth mining, and the Hadoop framework they rely on is poorly suited to iterative algorithms. This paper proposes FSMBUS, a Spark-based distributed algorithm for mining frequent subgraphs in a single large-scale graph. FSMBUS builds the candidate subgraphs for parallel computation using a suboptimal CAM tree and returns all frequent subgraphs for a given user-defined minimum support. It is further optimized with infrequent-pattern detection and search-order selection, and a lightweight data-partitioning method named Sorted-Greedy is designed to balance the workload. Experiments on real datasets show that FSMBUS is an order of magnitude faster than the latest existing single-graph algorithm, supports lower minimum support thresholds and larger graph datasets, and runs 2 to 4 times faster than its Hadoop port.

Authors: Yan Yuliang (严玉良), Dong Yihong (董一鸿), He Xianmang (何贤芒), Wang Wei (汪卫)

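Affiliations: College of Information Science and Engineering, Ningbo University (宁波大学信息科学与工程学院); School of Computer Science, Fudan University (复旦大学计算机科学技术学院)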

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the Peking University Core Journals list; 2015, No. 8, pp. 1768-1783 (16 pages)

Funding: National Natural Science Foundation of China (61170006, 61202007); Ningbo Natural Science Foundation (2013A610063, 2013A610110)

Keywords: frequent subgraph; single large-scale graph; distributed mining; Spark; workload balance

Classification: TP301 [Automation and Computer Technology: Computer System Architecture]


