Abstract
Motif discovery plays an important role in revealing genome-level patterns of gene expression regulation and in locating conserved domains in protein sequences. This paper presents an algorithm for identifying common motifs in biological sequences. The algorithm uses a repeat-detection algorithm based on a suffix array or QSA array to mine the maximal repeats in a string as primitives. After the primitives are filtered and pruned, the optimized primitives are computed and processed according to the constraints to obtain the common motifs. Compared with similar algorithms based on suffix trees or tries, the algorithm improves both time and space efficiency.
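The abstract does not spell out the repeat-detection step, so the following is only a minimal sketch of how maximal repeats could be mined with a suffix array. It assumes a naive sorted-suffix construction and a directly computed LCP array. The function names (`build_suffix_array`, `build_lcp`, `maximal_repeats`) and the `min_len` parameter are hypothetical and are not taken from the paper. The QSA array, the filtering and pruning of primitives, and the constraint-based derivation of common motifs are not covered.

```python
def build_suffix_array(s):
    """Naive suffix array: start positions of all suffixes in sorted order.
    Fine for illustration only; real tools use faster constructions."""
    return sorted(range(len(s)), key=lambda i: s[i:])


def build_lcp(s, sa):
    """lcp[k] = length of the longest common prefix of the suffixes
    starting at sa[k-1] and sa[k]; lcp[0] is 0 by convention."""
    lcp = [0] * len(sa)
    for k in range(1, len(sa)):
        a, b = sa[k - 1], sa[k]
        h = 0
        while a + h < len(s) and b + h < len(s) and s[a + h] == s[b + h]:
            h += 1
        lcp[k] = h
    return lcp


def maximal_repeats(s, min_len=1):
    """Return {repeat: sorted start positions} for every maximal repeat
    of length >= min_len (hypothetical helper, assumed definition:
    no left or right extension keeps all occurrences)."""
    sa = build_suffix_array(s)
    lcp = build_lcp(s, sa)
    n = len(s)
    found = {}
    for i in range(1, n):
        length = lcp[i]
        if length < min_len or length == 0:
            continue
        # Grow the block of adjacent suffixes sharing a prefix of this length.
        left = i - 1
        while left > 0 and lcp[left] >= length:
            left -= 1
        right = i
        while right + 1 < n and lcp[right + 1] >= length:
            right += 1
        positions = sorted(sa[left:right + 1])
        # Right-maximality follows from lcp[i] == length inside the block.
        # Left-maximality: occurrences must differ in their preceding
        # character (the start of the string counts as a distinct one).
        preceding = {s[p - 1] if p > 0 else None for p in positions}
        if len(preceding) > 1:
            found[s[positions[0]:positions[0] + length]] = positions
    return found
```

For the toy string `"xabcyabcz"`, `maximal_repeats` returns `{"abc": [1, 5]}`: the shorter repeats `"bc"` and `"c"` are dropped because every occurrence is preceded by the same character. This sketch runs in quadratic time or worse, so it illustrates the idea rather than the efficiency claimed in the abstract.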
Source
《电脑知识与技术(过刊)》
2016, Issue 4X, pp. 164-168 (5 pages)
Computer Knowledge and Technology
Funding
Natural Science Foundation of Xinjiang Uygur Autonomous Region (No. 2012211A056)