摘要
查找图的连通分量在生物信息学领域有着重要应用价值,其中的关键问题之一是查询最大连通Steiner分量(SMCC).针对已有最大连通Steiner分量查询方法中存在的查询效率低的问题,本文首先提出利用k-edge连通分量与(k+1)-edge连通分量之间的包含关系建立顶点集合的分层索引KST.和现有的专用索引相比,KST索引规模得到了缩减;然后本文提出了基于KST索引的SMCC查询算法以及具有顶点数量限制的SMCC L查询算法.和已有方法中索引的是图中顶点不同,KST索引中维护的是顶点集合的包含关系.其优点在于将已有方法在遍历过程中的一次一顶点的查询方式转换为更高效的一次一集合的查询方式,显著减少了需要访问的索引点数量,极大提升了查询处理的效率;最后,基于15个真实数据集进行实验测试,从不同角度验证了本文所提方法的高效性.
Given a graph G,and a set of query nodes q,the Steiner Maximum-Connected Component(SMCC)is a connected subgraph of G with the maximum connectivity and the maximum number nodes which contains q.And SMCC L is the SMCC of G with the constraint of the number of nodes L.Finding SMCC or SMCC L is one of the fundamental operations in graph data processing,and is one of the hot issues,which has attracted much attention in the research field and can be applied in many applications,including bio-informatics,etc.Existing methods for SMCC and SMCC L query processing are mainly classified into the following two categories:(1)Finding SMCC or SMCC L based on existing k-edge connected component algorithms which decrease the value of k from|V|(the number of nodes in graph G)to 1 in turn,calculate all k-edge connected components in G,then the first k-edge connected component containing query q is SMCC of q,and the first k-edge connected component containing q whose number of nodes is greater than or equal to L is SMCC L of q.The shortcomings of such methods are that the search process needs to traverse graph G many times and the computation cost is high when the size of graph is large.(2)Finding SMCC or SMCC L based on special index which constructs the MST(Maximum Spanning Tree)index of G offline.When processing the query,it firstly calculates the connectivity of q,traverses the MST with any node in q as the start node,and only accesses the nodes corresponding to the edges satisfying specific conditions until all nodes in q are covered,then the visited nodes are the nodes in the SMCC of q.The shortcomings of such methods are that querying SMCC needs expensive traversal operations,each step of traversal can obtain at most one useful node(namely one-node-a-step).When the number of nodes in SMCC is large,the number of nodes that need to be accessed in the index tree will increase.Moreover,when query requests are executed frequently,the system load will increase rapidly.Considering that existing approaches on querying SMCC and SMCC L suffer from inefficiency,we first propose a KST index,which maintains the relationship between k-edge connected subgraph and(k+1)-edge connected subgraph.Compared with existing methods,the index size is largely reduced.Based on the KST index,we propose a new SMCC finding algorithm(namely SMCC-KST),and propose another algorithm to find SMCC with size constraint(namely SMCC L-KST).Compared with existing index which maintains the relationship of nodes in the original graph,in our KST index,each index node represents a nodes set.The benefit is that in this way,we can improve existing approaches from one-node-a-step when traversing on the index to one-set-a-step,such that significantly reduces the number of visited index nodes,while at the same time improves the query efficiency.We conduct extensive experimental study on 15 real datasets.The experimental results show that our approaches perform much better than existing ones when querying SMCC and SMCC L.
作者
陈子阳
陈伟
贾勇
周军锋
CHEN Zi-Yang;CHEN Wei;JIA Yong;ZHOU Jun-Feng(School of Information and Management,Shanghai Lixin University of Accounting and Finance,Shanghai 201620;School of Information Science and Engineering,Yanshan University,Qinhuangdao,Hebei 066004;Department of Information Engineering,Hebei University of Environmental Engineering,Qinhuangdao,Hebei 066102;School of Computer Science and Technology,Donghua University,Shanghai 201620)
出处
《计算机学报》
EI
CSCD
北大核心
2020年第7期1215-1229,共15页
Chinese Journal of Computers
基金
国家自然科学基金(61472339,61572421,61873337)资助.