期刊文献+

FSMBUS:一种基于Spark的大规模频繁子图挖掘算法 被引量:20

FSMBUS:A Frequent Subgraph Mining Algorithm in Single Large-Scale Graph Using Spark
下载PDF
导出
摘要 随着社交网络用户数的快速增加,大规模单图上频繁子图挖掘的需求越来越强烈.单机算法对大规模图的运行效率较低,难以支撑支持度较低的频繁子图的挖掘;现有的分布式环境下单图的频繁子图挖掘算法不支持子图增长模式的挖掘,它们所使用的Hadoop框架也不适合运行迭代式算法.提出了一种基于Spark的大规模单图频繁子图挖掘算法FSMBUS,通过次优树构建并行计算的候选子图,在给定最小支持度时挖掘出所有的频繁子图,并利用非频繁检测和搜索顺序选择实现优化,还设计了一种名为Sorted-Greedy的轻量级数据划分方法.实验结果表明,FSMBUS的效率要比现有单图上最新的算法快一个数量级,并支持更低最小支持度阈值以及更大规模图数据的挖掘,同时FSMBUS比其Hadoop的移植版要快2~4倍. Mining frequent subgraphs in a single large-scale graph is of huge demand with the rapid growth of the social networking. However, it is inefficient for the serial algorithms to mine frequent subgraphs in low support when mining for a single large-scale graph. Meanwhile, few existing distributed algorithms can't support the growth pattern mining, and the Hadoop framework they worked is not suitable for iterative running. In this paper, a distributed algorithm named FSMBUS for mining frequent subgraph in a single large-scale graph under Spark framework is proposed. It constructs the parallel computing candidate subgraphs by suboptimal CAM Tree, which returns all the frequent subgraphs for given user-defined minimum support. Additionally, infrequent patterns' test and searching order chosen is introduced to optimize the algorithm. Sorted-Greedy method is designed for data partition to balance the workload. Our experiments show that FSMBUS runs faster and more effective than the existing algorithms with real datasets, and even can run with the lower support threshold and the larger graph datasets as well. At the same time, FSMBUS runs 2~4 times faster on Spark framework than that on Hadoop framework.
出处 《计算机研究与发展》 EI CSCD 北大核心 2015年第8期1768-1783,共16页 Journal of Computer Research and Development
基金 国家自然科学基金项目(61170006 61202007) 宁波市自然科学基金项目(2013A610063 2013A610110)
关键词 频繁子图 大规模单图 分布式挖掘 SPARK 负载均衡 frequent subgraph single large-scale graph distribute mining Spark workload balance
  • 相关文献

参考文献28

  • 1Borgelt C, Berthold M R, Patterson D E. Molecular fragment mining for drug discovery [G] //Symbolic and Quantitative Approaches to Reasoning with Uncertainty. Berlin: Springer, 2005 : 1002-1013.
  • 2王桂娟,印鉴,詹卫许.一种新的基于嵌入集的图分类方法[J].计算机研究与发展,2012,49(11):2311-2319. 被引量:5
  • 3Guralnik V, Karypis G. A scalable algorithm for clustering sequential data [C] //Proc of the 1st IEEE Int Conf on Data Mining. Piscataway, NJ: IEEE, 2001:179-186.
  • 4Yan X, Yu P S, Han J. Graph indexing: A frequent structure-based approach [C] //Proc of the 17th ACM SIGMOD Int Conf on Management of Data. New York: ACM, 2004: 335-346.
  • 5Liu Y, Jiang X, Chen H, et al. Mapreduce-based pattern finding algorithm applied in motif detection for prescription compatibility network [G] //Advanced Parallel Processing Technologies. Berlin: Springer, 2009: 341-355.
  • 6Shahrivari S, Jalili S. Distributed discovery of /requent subgraphs of a network using MapReduce [OL]. [2015-03- 25]. http://link, springer, corn/article/10. 1007/s00607-015 0446 9.
  • 7Elseidy M, Abdelhamid E, Skiadopoulos S, et al. GRAMI: Frequent subgraph and pattern mining in a single large graph [C] //Proc of the 40th Int Conf on Very Large Data Bases. Berlin: Springer, 2014:517-528.
  • 8Bhuiyan M A, A1 Hasan M. An iterative MapReduce based frequent subgraph mining algorithm [J]. IEEE Trans on Knowledge and Data Engineering, 2013, 27(3): 608-620.
  • 9Lu W, Chen G, Tung A K H, et al. Efficiently extracting frequent subgraphs using mapreduce [C] //Proc of the 1st IEEE Int Conf on Big Data. Piscataway, NJ: IEEE, 2013: 639-647.
  • 10Dean J, Ghemawat S. MapReduce: Simplified data processing on large clusters [J]. Communications of the ACM, 2008, 51(1): 107-113.

二级参考文献45

  • 1Rakesh Agrawal, Ramakrishnan Srikant. Fast algorithms for mining association rules in large databases. VLDB1994, Santiago,Chile, 1994.
  • 2Heikki Mannila, et al. Search and borders of theories in knowledge discovery. Data Mining and Knowledge Discovery,1997, 1(3): 241~258.
  • 3Jong Soo Park, et al. An effective Hash based algorithm for mining association rules. SIGMOD1995, San Jose, USA, 1995.
  • 4Sergey Brin, et al. Dynamic itemset counting and implication rules for market basket data. SIGMOD1997, Tucson, USA,1997.
  • 5Ramesh C. Agarwal, et al. Depth first generation of long patterns, KDD 2000, Boston, USA, 2000.
  • 6Ramesh C. Agarwal, et al. A tree projection algorithm for generation of frequent itemsets. J. of Parallel and Distributed Computing, 2001, 61(3): 350~371.
  • 7Jiawei Han, Jian Pei, Yiwen Yin. Mining frequent patterns without candidate generation. SIGMOD2000, Dallas, USA, 2000.
  • 8J. Pei, et al.. H-Mine: Hyper-structure mining of frequent patterns in large databases. ICDM'01, San Jose, CA, 2001.
  • 9Mike Perkowitz, Oren Etzioni. Adaptive sites: Automatically learning from user access patterns. WWW' 97, Santa Clara, 1997.
  • 10J. Pei, et al.. PrefixSpan: Mining sequential patterns efficiently by prefix-projected pattern growth. ICDE'01, Heidelberg, 2001.

共引文献49

同被引文献96

  • 1刘正伟,文中领,张海涛.云计算和云数据管理技术[J].计算机研究与发展,2012,49(S1):26-31. 被引量:170
  • 2李洪波,吴凤鸽,孙增圻,孙富春.网络控制系统仿真平台的设计与实现[J].系统仿真学报,2006,18(6):1700-1704. 被引量:21
  • 3谢莹,吴建国,李炜,许荣斌.基于gSpan算法的未知化合物毒性预测[J].合肥工业大学学报(自然科学版),2007,30(10):1278-1280. 被引量:4
  • 4刘玉艳,沈明玉.LVS负载均衡技术在网络服务中的应用[J].合肥工业大学学报(自然科学版),2007,30(12):1592-1595. 被引量:12
  • 5IDC. The Digital Universe of Opportunities:Rich Data and the Incdreasing Value of the Internet of Things [EB/OL]. [2014-04]. http://www.emc.com/leadership/digital-universe/2014iview/executive-summary.htm.
  • 6FERRERIA C R L , Traina J C, MACHADO T A J, et al. Clustering Very Large Multi-Dimensional Datasets with Mapreduce [C]. 17th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2011 ACM. San Diego: ACM Press, 2011: 690-698.
  • 7YU Y, HUANG C, LEE Y. An Intelligent Touring System Based on Mobile Social Network and Cloud Computing for Travel Recom- mendation[C]. 28th International Conference on Advanced Information Networking and Applications Workshops(AINA), 2014 IEEE. Victoria, Canada: IEEE Press, 2014:19-24.
  • 8WALUNJ S G, SADAFALE K. An Online Recommendation System for E-commerce Based on Apache Mahout Framework[C]. 2013 Annual Conference on Computers and People Research, 2013 ACM. Cincinnati: ACM Press,2013: 153-158.
  • 9ZAHARIA M, CHOWDHURY M, FRANKLIN M J, et al. Spark: Cluster Computing with Working Sets[C]. Proceedings of the 2nd USENIX Conference on Hot Topics in Cloud Computing , 2010:10-10.
  • 10ZAHARIA M, CHOWDHURY M, DAS T, et al. Resilient Distributed Datasets: A Fault-Tolerant Abstraction for in-Memory Cluster Computing[C]. Proceedings of the 9th USENIX Conference on Networked Systems Design and Implementation. USENIX Association, 2012:2-2.

引证文献20

二级引证文献69

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部