摘要
gSpan算法是一种基于频繁图的挖掘算法。该算法基于无候选人产生的频繁子图,在图中建立字典序标号,将每个图映射为最小DFS code,再采用深度优先搜索策略挖掘频繁连接子图。与前人算法相比,该算法在生成候选子图时,冗余子图的产生量大大减少;在计算候选子图支持度时避免了大量重复扫描数据库,性能卓越。该文的贡献是将gSpan算法应用在挖掘与已知毒性化合物具有相同子结构的化合物研究工作中,进行未知化合物的毒性预测,对相关领域应用发展具有重要意义。
The graph-based substructure pattern mining algorithm called gSpan, which discovers frequent substructures without candidate generation, is introduced first, gSpan builds a new lexicographic order among graphs and maps each graph to a unique minimum DFS code as its canonical label. Based on this lexicographic order, gSpan adopts the depth-first search strategy to mine frequent con- nected subgraphs efficiently. When it is applied to the chemical compound dataset Chemical_340, gSpan can find the compound set with same substructure as a chemical compound with toxicity, and then the toxicity of unknown chemicals can be predicted. This study is important to the related fields.
出处
《合肥工业大学学报(自然科学版)》
CAS
CSCD
北大核心
2007年第10期1278-1280,共3页
Journal of Hefei University of Technology:Natural Science
关键词
频繁子图
毒性预测
化合物
frequent subgraph
toxicity prediction
chemical compound