An Algorithm Based on Dataflow Model for Mining Frequent Patterns from a Large Graph

Original title: 基于数据流的大图中频繁模式挖掘算法研究 (cited by: 6)

Abstract: As individual graphs grow in size and graph applications spread to new domains, demand for frequent pattern mining on a single large graph keeps increasing. A traditional single-machine environment can no longer meet the requirements of large-scale graph mining, and existing parallel or distributed mining methods are generally limited by their degree of parallelism and by data skew. After analyzing existing frequent pattern mining algorithms, the paper proposes a dataflow-based method for mining frequent patterns in a single large graph. First, it builds a dataflow-based frequent pattern mining model that turns the "batch" data of the MapReduce model into "micro-batch" data, which raises the parallelism of data processing and whose iteration scheme also satisfies the anti-monotonicity of frequent subgraph mining. Second, it designs the dataflow operations for frequent pattern checking, subgraph instance expansion, and regular (canonical) code computation, and implements a frequent pattern mining algorithm on the dataflow model. Third, to address the complexity of regular code computation, it proposes a regular code computation strategy based on invariant relations and an optimization strategy based on a coding tree; the optimized regular code improves computing performance by 30% over the unoptimized code, and the coding-tree optimization improves performance by 10% over the original code computation strategy. Finally, the algorithms are tested experimentally. The experiments show that the algorithm improves the parallelism of frequent pattern mining, greatly reduces the search space on large graphs, lowers the computation time of the regular code, and improves the efficiency of frequent pattern mining on a large single graph by 30% compared with traditional algorithms.

Big graph data mining is motivated not only by the rapidly increasing size of graphs but also by its many applications, such as bioinformatics, chemoinformatics, and social networks. One of the most challenging tasks in big graph mining is pattern mining, which uses data mining algorithms to discover interesting, unexpected, and useful patterns in large amounts of graph data. Several algorithms exist for frequent pattern mining, but they mainly run on centralized computing systems and have been evaluated on relatively small datasets. As modern graphs grow dramatically, several parallel and distributed solutions have been proposed to solve this problem. However, those methods do not perform well in scalability and load balancing. We therefore propose an algorithm based on the dataflow model for mining frequent patterns in a large single graph. We construct a dataflow model for mining frequent patterns that includes three operators: IsFrequent, Expand, and Code. First, the method separates the large graph into many micro graphs, which has the following advantages. The micro graphs can be expanded and computed simultaneously because they are independent of each other. In addition, since each iteration builds on the subgraph instances generated in the previous iteration, only one vertex or one edge needs to be added, which reduces the generation of redundant subgraphs in the Expand operator. Second, we propose a regular code computing strategy based on invariant relations and an optimization strategy based on a coding tree. These two approaches address the difficulty of computing the regular code. The results show that our regular code computing strategy improves performance by 30% over the original approach and that our optimization strategy improves performance by 10% over the original strategy. Third, we design operators that check frequent patterns using micro-batch data. After a large batch is decomposed into multiple micro-batches, each micro-batch can be treated as a single processing unit, so many tasks can be generated concurrently, which reduces data skew. Micro-batches are also easier to compute iteratively in parallel, and the iterative approach satisfies the anti-monotonicity of frequent pattern mining. Finally, the frequent pattern mining algorithm is implemented. Experiments on our cluster show that the algorithm can effectively process a variety of large graphs with millions of vertices and tens of millions of frequent pattern mining, and scales well with the degree of available parallelism.
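The abstract names three operators (IsFrequent, Expand, Code) but does not define them, so the following is only a minimal single-process sketch of how such a level-wise loop could fit together, not the authors' implementation. Several things here are assumptions: instances grow one edge at a time, support is measured with the minimum-image (MNI) count because it is anti-monotone, and the code is computed by brute force over vertex orderings rather than by the invariant-relation or coding-tree strategies the paper proposes. The function names `code`, `is_frequent`, `expand`, and `mine` are hypothetical, and no dataflow engine is involved.

```python
from collections import defaultdict
from itertools import permutations


def code(instance, labels):
    """Code operator (brute force): canonical code plus all embeddings of an instance.

    `instance` is a frozenset of edges (u, v) with u < v. Every vertex ordering
    is tried, so the cost is factorial in the number of vertices.
    """
    vertices = sorted({v for edge in instance for v in edge})
    best, embeddings = None, []
    for perm in permutations(vertices):
        pos = {v: i for i, v in enumerate(perm)}
        candidate = (
            tuple(labels[v] for v in perm),
            tuple(sorted(tuple(sorted((pos[u], pos[v]))) for u, v in instance)),
        )
        if best is None or candidate < best:
            best, embeddings = candidate, [perm]
        elif candidate == best:
            embeddings.append(perm)  # another isomorphism onto the same pattern
    return best, embeddings


def is_frequent(embeddings, min_support):
    """IsFrequent operator (assumed MNI support): the smallest number of
    distinct graph vertices seen at any single pattern position."""
    images = [set(column) for column in zip(*embeddings)]
    return min(len(image) for image in images) >= min_support


def expand(instance, adjacency):
    """Expand operator: grow an instance by exactly one adjacent edge."""
    vertices = {v for edge in instance for v in edge}
    for u in vertices:
        for w in adjacency[u]:
            edge = (min(u, w), max(u, w))
            if edge not in instance:
                yield instance | {edge}


def mine(edges, labels, min_support, max_edges=3):
    """Level-wise loop: group instances by code, prune, then expand survivors."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    # Every single edge is an independent seed instance (a "micro" unit of work).
    frontier = {frozenset([(min(u, v), max(u, v))]) for u, v in edges}
    frequent = {}
    size = 1
    while frontier and size <= max_edges:
        groups = defaultdict(lambda: ([], []))  # code -> (instances, embeddings)
        for instance in frontier:
            pattern, embeddings = code(instance, labels)
            groups[pattern][0].append(instance)
            groups[pattern][1].extend(embeddings)
        survivors = []
        for pattern, (instances, embeddings) in groups.items():
            if is_frequent(embeddings, min_support):
                frequent[pattern] = len(instances)
                survivors.extend(instances)
        # Anti-monotonicity: only instances of frequent patterns are grown further.
        frontier = {grown for inst in survivors for grown in expand(inst, adjacency)}
        size += 1
    return frequent


# Toy graph: two A-B-C paths plus one isolated A-B edge.
demo_labels = {0: "A", 1: "B", 2: "C", 3: "A", 4: "B", 5: "C", 6: "A", 7: "B"}
demo_edges = [(0, 1), (1, 2), (3, 4), (4, 5), (6, 7)]
result = mine(demo_edges, demo_labels, min_support=2)
```

On the toy graph, `result` maps each frequent pattern's code to its number of instances. Because each instance in `frontier` is coded and expanded independently of the others, the per-instance work is the part that a dataflow runtime could distribute; the grouping step before `is_frequent` is the only point in this sketch that needs all instances of a pattern together.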

Authors: TANG Xiao-Chun (汤小春); FAN Xue-Feng (樊雪枫); ZHOU Jia-Wen (周佳文); LI Zhan-Huai (李战怀) (School of Computer Science, Northwestern Polytechnical University, Xi’an 710129)

Affiliation: School of Computer Science, Northwestern Polytechnical University

Source: Chinese Journal of Computers (《计算机学报》), indexed in EI, CSCD, and the Peking University Core Journals list; 2020, No. 7, pp. 1293-1311 (19 pages)

Funding: Supported by the Ministry of Science and Technology Key Special Project on Cloud Computing and Big Data (2018YFB1003403).

Keywords: graph mining; frequent pattern; dataflow model; parallel algorithm; coding tree

Classification code: TP18 [Automation and Computer Technology: Control Theory and Control Engineering]
