With the development of information technology, the amount of power grid topology data has gradually increased. Therefore, accurate querying of this data has become particularly important. Several researchers have cho...With the development of information technology, the amount of power grid topology data has gradually increased. Therefore, accurate querying of this data has become particularly important. Several researchers have chosen different indexing methods in the filtering stage to obtain more optimized query results because currently there is no uniform and efficient indexing mechanism that achieves good query results. In the traditional algorithm, the hash table for index storage is prone to "collision" problems, which decrease the index construction efficiency. Aiming at the problem of quick index entry, based on the construction of frequent subgraph indexes, a method of serialized storage optimization based on multiple hash tables is proposed. This method mainly uses the exploration sequence to make the keywords evenly distributed; it avoids conflicts of the stored procedure and performs a quick search of the index. The proposed algorithm mainly adopts the "filterverify" mechanism; in the filtering stage, the index is first established offline, and then the frequent subgraphs are found using the "contains logic" rule to obtain the candidate set. Experimental results show that this method can reduce the time and scale of candidate set generation and improve query efficiency.展开更多
Graph data mining has been a crucial as well as inevitable area of research.Large amounts of graph data are produced in many areas,such as Bioinformatics,Cheminformatics,Social Networks,etc.Scalable graph data mining ...Graph data mining has been a crucial as well as inevitable area of research.Large amounts of graph data are produced in many areas,such as Bioinformatics,Cheminformatics,Social Networks,etc.Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities.Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs.To tackle this problem,many main memory-based methods were proposed,which proved to be inefficient as the data size grew exponentially over time.In the past few years,several research groups have attempted to handle the Frequent Subgraph Mining(FSM)problem in multiple ways.Many authors have tried to achieve better performance using Graphic Processing Units(GPUs)which has multi-fold improvement over in-memory while dealing with large datasets.Later,Google’s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing.Although MapReduce came with many benefits,its disk I/O and noniterative style model could not help much for FSM domain since subgraph mining process is an iterative approach.In recent years,Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability.This is a right fit solution for iterative style of programming as well.In this survey,we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results.展开更多
很多频繁子图挖掘算法已被提出.然而,这些算法产生的频繁子图数量太多而不能被用户有效地利用.为此,提出了一个新的研究问题:挖掘图数据库中的频繁跳跃模式.挖掘频繁跳跃模式既可以大幅度地减少输出模式的数量,又能使有意义的图模式保...很多频繁子图挖掘算法已被提出.然而,这些算法产生的频繁子图数量太多而不能被用户有效地利用.为此,提出了一个新的研究问题:挖掘图数据库中的频繁跳跃模式.挖掘频繁跳跃模式既可以大幅度地减少输出模式的数量,又能使有意义的图模式保留在挖掘结果中.此外,跳跃模式还具有抗噪声干扰能力强等优点.然而,由于跳跃模式不具有反单调性质,挖掘它们非常具有挑战性.通过研究跳跃模式自身的特性,提出了两种新的裁剪技术:基于内扩展的裁剪和基于外扩展的裁剪.在此基础上又给出了一种高效的挖掘算法GraphJP(an algorithm for mining jump patterns from graph databases).另外,还严格证明了裁剪技术和算法GraphJP的正确性.实验结果表明,所提出的裁剪技术能够有效地裁剪图模式搜索空间,算法GraphJP是高效、可扩展的.展开更多
基金supported by the State Grid Science and Technology Project (Title: Research on High Performance Analysis Technology of Power Grid GIS Topology Based on Graph Database, 5455HJ160005)
文摘With the development of information technology, the amount of power grid topology data has gradually increased. Therefore, accurate querying of this data has become particularly important. Several researchers have chosen different indexing methods in the filtering stage to obtain more optimized query results because currently there is no uniform and efficient indexing mechanism that achieves good query results. In the traditional algorithm, the hash table for index storage is prone to "collision" problems, which decrease the index construction efficiency. Aiming at the problem of quick index entry, based on the construction of frequent subgraph indexes, a method of serialized storage optimization based on multiple hash tables is proposed. This method mainly uses the exploration sequence to make the keywords evenly distributed; it avoids conflicts of the stored procedure and performs a quick search of the index. The proposed algorithm mainly adopts the "filterverify" mechanism; in the filtering stage, the index is first established offline, and then the frequent subgraphs are found using the "contains logic" rule to obtain the candidate set. Experimental results show that this method can reduce the time and scale of candidate set generation and improve query efficiency.
文摘Graph data mining has been a crucial as well as inevitable area of research.Large amounts of graph data are produced in many areas,such as Bioinformatics,Cheminformatics,Social Networks,etc.Scalable graph data mining methods are getting increasingly popular and necessary due to increased graph complexities.Frequent subgraph mining is one such area where the task is to find overly recurring patterns/subgraphs.To tackle this problem,many main memory-based methods were proposed,which proved to be inefficient as the data size grew exponentially over time.In the past few years,several research groups have attempted to handle the Frequent Subgraph Mining(FSM)problem in multiple ways.Many authors have tried to achieve better performance using Graphic Processing Units(GPUs)which has multi-fold improvement over in-memory while dealing with large datasets.Later,Google’s MapReduce model with the Hadoop framework proved to be a major breakthrough in high performance large batch processing.Although MapReduce came with many benefits,its disk I/O and noniterative style model could not help much for FSM domain since subgraph mining process is an iterative approach.In recent years,Spark has emerged to be the De Facto industry standard with its distributed in-memory computing capability.This is a right fit solution for iterative style of programming as well.In this survey,we cover how high-performance computing has helped in improving the performance tremendously in the transactional directed and undirected aspect of graphs and performance comparisons of various FSM techniques are done based on experimental results.
基金Supported by the National Natural Science Foundation of China under Grant Nos.60473075 60773063 (国家自然科学基金)+2 种基金the Key Program National Natural Science Foundation of China under Grant No.60533110 (国家自然科学基金重点项目)the National Basic Research Program of China under Grant No.2006CB303000 (国家重点基础研究发展计划(973))the Program for New Century Excellent Talents in University (NCET) under Grant No.NCET-05-0333 (新世纪优秀人才支持计划)
文摘很多频繁子图挖掘算法已被提出.然而,这些算法产生的频繁子图数量太多而不能被用户有效地利用.为此,提出了一个新的研究问题:挖掘图数据库中的频繁跳跃模式.挖掘频繁跳跃模式既可以大幅度地减少输出模式的数量,又能使有意义的图模式保留在挖掘结果中.此外,跳跃模式还具有抗噪声干扰能力强等优点.然而,由于跳跃模式不具有反单调性质,挖掘它们非常具有挑战性.通过研究跳跃模式自身的特性,提出了两种新的裁剪技术:基于内扩展的裁剪和基于外扩展的裁剪.在此基础上又给出了一种高效的挖掘算法GraphJP(an algorithm for mining jump patterns from graph databases).另外,还严格证明了裁剪技术和算法GraphJP的正确性.实验结果表明,所提出的裁剪技术能够有效地裁剪图模式搜索空间,算法GraphJP是高效、可扩展的.