Abstract
A DNA sequence is divided into short segments of 64 bases each. Based on the base-arrangement characteristics of each segment, the nucleotide triplet that repeats most often within that segment is encoded with a dedicated codebook. On this basis, a DNA sequence compression method based on statistical analysis and a segmented codebook is proposed. Experiments show that the algorithm achieves good compression performance on most of the commonly used benchmark sequences.
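The segmentation and triplet-statistics step described in the abstract can be sketched roughly as follows. This is only an illustration under stated assumptions, not the paper's implementation: the abstract does not say whether triplets are counted with an overlapping window, how ties are broken, or what the codebook looks like. The function names, the overlapping count, and the `<T>` placeholder token are all hypothetical.

```python
from collections import Counter

SEGMENT_LENGTH = 64  # segment size stated in the abstract


def split_segments(sequence, size=SEGMENT_LENGTH):
    """Cut the sequence into consecutive segments of `size` bases.

    The last segment may be shorter than `size`.
    """
    return [sequence[i:i + size] for i in range(0, len(sequence), size)]


def most_frequent_triplet(segment):
    """Return the most frequent three-base substring of a segment.

    Assumption: triplets are counted with an overlapping sliding window.
    Returns None when the segment has fewer than three bases.
    """
    counts = Counter(segment[i:i + 3] for i in range(len(segment) - 2))
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def tokenize(segment, triplet):
    """Greedily replace left-to-right occurrences of `triplet` with a token.

    "<T>" is a hypothetical stand-in for whatever code the per-segment
    codebook would assign; the remaining bases are kept as single symbols.
    """
    tokens = []
    i = 0
    while i < len(segment):
        if triplet is not None and segment.startswith(triplet, i):
            tokens.append("<T>")
            i += 3
        else:
            tokens.append(segment[i])
            i += 1
    return tokens


if __name__ == "__main__":
    dna = "ACGACGACGTT"
    for seg in split_segments(dna):
        top = most_frequent_triplet(seg)
        print(top, tokenize(seg, top))
```

In a sketch like this, each segment would carry its own most frequent triplet, so the stored output would need to record that triplet per segment alongside the encoded symbols. How the paper actually lays this out is not stated in the abstract.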
Source
Science Technology and Engineering (《科学技术与工程》)
Peking University Core Journal (北大核心)
2012, Issue 29, pp. 7505-7509, 7514 (6 pages)
Keywords
DNA sequences
statistical analysis
codebook
segmented encoding