摘要
随着单个图数据规模的扩大以及应用领域的扩展,大规模单图的频繁模式挖掘的需求越来越强烈.传统的单机环境已经无法满足大规模图数据挖掘的要求,而现有的并行或者分布式环境下的挖掘方法,普遍受到并行性以及数据倾斜等问题的限制,论文在分析了现有的频繁模式挖掘算法后,提出了一种基于数据流的单个大图频繁模式挖掘方法.首先,建立基于数据流的频繁模式挖掘模型,将MapReduce模型中的“批”数据变成“微批”数据,提高了数据处理的并行度,并且其迭代方式也满足频繁子图挖掘的反单调性;其二,设计了数据流模型中的频繁模式检查、子图实例扩展以及正规编码计算等操作,实现了基于数据流模型的频繁模式挖掘算法;其三,为解决正规编码计算中的复杂性问题,提出了基于不变关系的正规编码计算策略以及基于编码树的优化策略,优化正规编码比未优化编码的计算性能提升了30%,基于编码树的优化策略比原始编码计算策略在性能上提升了10%;最后,对涉及的相关算法进行了实验测试,实验证明,算法提高了频繁模式挖掘的并行性,大幅度减少了大图的搜索空间,降低了正规编码的计算时间,相比于传统算法大规模单图中频繁模式挖掘的效率提升了30%.
Big graph data mining has been highly motivated not only by the tremendously increasing size of graphs but also by its large number of applications,such as bioinformatics,chemoinformatics,and social networks.One of the most challenging tasks in big graph mining is pattern mining.These tasks consist on using data mining algorithms to discover interesting,unexpected and useful patterns in large amounts of graph data.Several algorithms exist for frequent pattern mining,but they are mainly used on centralized computing systems and evaluated on relatively small datasets.While modern graphs are growing dramatically,several parallel and distributed solutions have been proposed to solve this problem.However,those methods do not have better performance in scalability and balancing.So that we propose an algorithm based on dataflow model for mining frequent patterns in a large single graph.We construct a dataflow model for Mining frequent patterns,which include three operators:IsFrequent,Expand and Code.At first,the frequent pattern mining method based on dataflow model separates large graph into many micro graphs and has following advantages.These micro graphs can be expanded and calculated simultaneously,because they are independent of each other.At the same time,since each iteration is based on the subgraph instance generated in the previous iteration,only one vertex or one edge needs to be extended,it decreases the generation of redundant subgraph in Expand operator.Secondly,we propose a regular code computing strategy based on invariant relation and an optimization strategy based on coding tree.These two approaches solve the problem that it is difficult to calculate the regular code.The results show that our regular code computing strategy improves performance by 30%over the original approach and our optimization strategy improves performance by 10%over the original strategy.Thirdly,we design operators of checking frequent pattern using micro batch data.After the large batch data is decomposed into multiple micro batch data,each micro batch data can be regarded as a single processing unit,a lot of tasks can be generated concurrently,which reduce data skew.These micro batch data can be iteratively computed more easily in parallel computing.And its iterative approach also satisfies the anti-monotonicity of frequent patterns mining.At last,the algorithm of frequent pattern mining is implemented.The experiments on our cluster show that the algorithm can effectively process a variety of large graphs with millions of vertices and tens of millions of frequent pattern mining,and scales well with the degree of available parallelism.
作者
汤小春
樊雪枫
周佳文
李战怀
TANG Xiao-Chun;FAN Xue-Feng;ZHOU Jia-Wen;LI Zhan-Huai(School of Computer Science,Northwestern Polytechnical University,Xi’an 710129)
出处
《计算机学报》
EI
CSCD
北大核心
2020年第7期1293-1311,共19页
Chinese Journal of Computers
基金
科技部云计算与大数据重点专项(2018YFB1003403)资助。
关键词
图挖掘
频繁模式
数据流模型
并行算法
编码树
graph mining
frequent pattern
dataflow model
parallel algorithm
coding tree