期刊文献+

DNA序列中基于适应性后缀树的重复体识别算法 被引量:4

An Adaptive Suffix Tree Based Algorithm for Repeats Identification in a DNA Sequence
下载PDF
导出
摘要 现有的在DNA序列中识别重复体的算法多数是基于比对的,对识别速度和吞吐量有很大的限制.针对这个问题文中根据一个平衡重复体的长度和频率的定义,提出了一种基于Ukkonen后缀树的快速识别重复体的RepSeeker算法.算法采用最低限制频率,最大程度地扩展了重复体的长度,同时为了进一步地提高RepSeeker算法的效率,对Ukkonen的后缀树构造算法进行了适应性改进,在构造时加入RepSeeker算法所需的结点信息并将叶子结点和分支结点加以区分,从而使得RepSeeker算法能通过直接读取结点信息来求得子串频率和子串位置.这种改进较大地提高了RepSeeker算法的性能,而且空间开销不大.实验中使用了NCBI中的9条典型DNA序列作为测试数据,并对后缀树改进前后的重复体识别算法做了比较分析.结果表明,RepSeeker在没有损失精度的情况下缩短了算法的运行时间.实验结果与理论上的分析一致. Many existing methods for repeats identification are based on alignments.Their speed and time significantly limit their applications.This paper presents the fast Rep(eats)Seeker algorithm for repeats identification based on the adaptive Ukkonen suffix tree construction algorithm.The RepSeeker algorithm uses the lowest frequency limit to maximize the extension of repeats.The adaptive improvements to the Ukkonen algorithm are made to increase the efficiency of the RepSeeker algorithm.The node information required by the RepSeeker algorithm is added during the suffix tree construction.Because information on leaves and branch nodes are different,the RepSeeker algorithm directly obtains the needed information from the nodes to find out the frequency and locate the positions of a substring.The improvement is considerable for the repeats identification at a little extra cost in space.Nine sequences from the National Center for Biotechnology Information (NCBI) are used to test the performance of the RepSeeker algorithm.Comparisons between before and after improvements of the suffix tree construction show that the running time of the RepSeeker algorithm is reduced without losing the accuracy.The experimental results agree with the theoretical expectations.
出处 《计算机学报》 EI CSCD 北大核心 2010年第4期747-754,共8页 Chinese Journal of Computers
基金 国家自然科学基金(69601003) 青年科学基金(60705004)资助
关键词 重复体识别 适应性后缀树 Ukkonen算法 RepSeeker算法 repeats identification adaptive suffix tree Ukkonen algorithm RepSeeker algorithm
  • 相关文献

参考文献17

  • 1Lander E S, Linton L M, Birren Bet al, Initial sequencing and analysis of the human genome. Nature, 2001, 409 (6822) : 860-921.
  • 2霍红卫,白帆.一种具有精确边界的重复体识别算法[J].计算机学报,2008,31(2):214-219. 被引量:1
  • 3Saha Surya, Bridges Susan, Magbanua Zenaida V, Peterson Daniel G. Empirical comparison of ab initio repeat finding programs. Nucleic Acids Research, 2008, 36(7) : 2284-2294.
  • 4Lefebvre A, Leeroq T, Dauchel H, Alexandre J. FORRepeats: Detects repeats on entire chromosomes and between genomes. Bioinformatics, 2003, 19(3): 319-326.
  • 5Jones Nell C, Pevzner Pavel A. Introduction to Bioinformatics Algorithms. Cambridge, Massachusetts: MIT Press, 2004.
  • 6Huntington's Disease Collaborative Research Group. A novel gene containing a trinucleotide repeat that is expanded an unstable on Huntington's disease chromosomes. Cell, 1993, 72(6), 971-983.
  • 7Bergman Casey M, Quesneville Hadi. Discovering and detecting transposable elements in genome sequences. Briefings in Bioinformatics, 2007, 8(6) : 382-392.
  • 8Pevzner P A, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Research, 2004, 14 (9): 1786-1796.
  • 9Kurtz S, Schleiermacher C. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 1999, 15(5): 426-427.
  • 10Price A L, Jones N C, Pevzner P A. De novo identification of repeat families in large genomes. Bioinformatics, 2005, 21 (Supplement) : i351-i358.

二级参考文献12

  • 1Myers E.W. A four russians algorithm for regular expression pattern matching. Journal of ACM, 1992, 39(4) : 430-448
  • 2Deepak Sharma, Biju Issac, Raghava G P S, Ramaswamy R. Spectral Repeat Finder (SRF) : Identification of repetitive sequences using Fourier transformation. Bioinformatics, 2004, 20(9) : 1405-1412
  • 3Guillaume A, Frederic B, Eduardo P C Rocha, Alain Vian, Eric Coissac. Repseek, a tool to retrieve approximate repeats from large DNA sequences. Bioinformatics, 2006, 23(1): 119-121
  • 4Kurtz S, Schleiermacher C. REPuter: Fast computation of maximal repeats in complete genomes. Bioinformatics, 1999, 15(5) : 426-427
  • 5Volfovsky N, Haas B J, Salzberg S L. A clustering method for repeat analysis in DNA sequences. Genome Biology. , 2001, 2(8) : research0027, 1-0027.11
  • 6Bao Z, Eddy S R. Automated de novo identification of repeat sequence families in sequenced genomes. Genome Research, 2002, 12(8): 1269-1276
  • 7Pevzner P A, Tang H, Tesler G. De novo repeat classification and fragment assembly. Genome Research, 2004, 14 (9) : 1786-1796
  • 8Price A L, Jones N C, Pevzner P A. De novo identification of repeat families in large genomes. Bioinformatics, 2005, 21(Supplement):351-358
  • 9Edgar R C, Myers E W. PILER: Identification and classification of genomic repeats. Bioinformatics, 2005, 1(1): 1-7
  • 10Altschul S F, Gish W, Miller W, Myers E W, Lipman D J. Basic local alignment search tool. Journal of Molecular Biology, 1990, 215(3): 403-410

同被引文献59

  • 1钱能,金文东.DNA序列比对分析中的统计特征方法[J].浙江工业大学学报,2005,33(2):173-175. 被引量:4
  • 2胡吉祥,许洪波,刘悦,程学旗.重复串特征提取算法及其在文本聚类中的应用[J].计算机工程,2007,33(2):65-67. 被引量:6
  • 3Pevsner J著.孙之荣,等译.生物信息学与功能基因组学[M].北京:化学工业出版社,2006:213-257
  • 4Surya S,Susan B,Zenaida M V.Empirical Comparison of AB Initio Repeat Finding Programs[J].Nucleic Acids Research,2008,36(7):2284-2294.
  • 5Cormen T H,Leiserson C E,Rivest R L.Introduction to Algorithms[M].[S.l.] :MIT Press,2001.
  • 6Aoe J.Computer Algorithms:String Pattern Matching Strategies[M].[S.l.] :IEEE Computer Society,1994.
  • 7Tompa M et al. Assessing computational tools for the discov- ery of transcription factor binding sites. Nature Biotechnology, 2005, 23(1): 137-144.
  • 8Das Modan K, Dai Ho-Kwok. A survey of DNA motif find- ing algorithms. BMC Bioinformaties, 2007, 8(Suppl 7)~ $21.
  • 9GuhaThakurta D. Computational identification of transcrip- tional regulatory elements in DNA sequence. Nucleic Acids Research, 2006, 34(12): 3585-3598.
  • 10Sinha S, Tompa M. YMF: A program for discovery of novel transcription factor binding sites by statistical overrepresent- ation. Nucleic Acids Research, 2003, 31(13): 3586-3588.

引证文献4

二级引证文献15

相关作者

内容加载中请稍等...

相关机构

内容加载中请稍等...

相关主题

内容加载中请稍等...

浏览历史

内容加载中请稍等...
;
使用帮助 返回顶部