Abstract
Advances in high-throughput sequencing have caused sequence databases to grow rapidly, and most of the new entries are ESTs (Expressed Sequence Tags) of unknown function. The function of an EST is usually inferred by protein-EST alignment: the EST is translated in all six reading frames and the resulting protein sequences are compared against selected protein databases. However, ESTs contain roughly 5% sequencing errors, and frameshift errors are the most serious because six-frame translation followed by protein-level alignment cannot handle them well. Our alignment model takes the possible types of sequencing error in ESTs into account and reverse-translates the amino acid sequence into a putative nucleotide sequence, so that the comparison is made directly at the nucleotide level. This gives a more accurate match between protein and EST sequences and allows frameshift errors in the ESTs to be identified and corrected.
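The idea of comparing a protein with an EST at the nucleotide level can be illustrated with a small dynamic-programming sketch. It is not the authors' model: the scoring scheme (+1 per matching base, -1 per mismatch), the `shift_penalty` value, and the names `fragment_score` and `align` are assumptions made for illustration only. Each amino acid is reverse-translated into its set of codons under the standard genetic code and matched against EST fragments of 3 bases (in frame), 4 bases (one inserted base), or 2 bases (one deleted base).

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = {}  # amino acid -> list of codons (its reverse translation)
for codon, aa in zip(("".join(c) for c in product(BASES, repeat=3)), AMINO):
    CODONS.setdefault(aa, []).append(codon)


def fragment_score(aa, frag):
    """Best score of an EST fragment (2, 3 or 4 bases) against any codon of aa."""
    def matches(a, b):
        return sum(1 if x == y else -1 for x, y in zip(a, b))

    best = -10
    for codon in CODONS[aa]:
        if len(frag) == 3:    # in frame
            cand = [matches(codon, frag)]
        elif len(frag) == 4:  # one extra base in the EST: drop each in turn
            cand = [matches(codon, frag[:k] + frag[k + 1:]) for k in range(4)]
        else:                 # one base missing in the EST: drop a codon base
            cand = [matches(codon[:k] + codon[k + 1:], frag) for k in range(3)]
        best = max(best, max(cand))
    return best


def align(protein, est, shift_penalty=4):
    """Align the whole protein to a stretch of the EST.

    Returns (score, frameshifts); each frameshift is a tuple
    (amino acid index, EST start position, fragment length).
    """
    m, n = len(protein), len(est)
    neg = float("-inf")
    # score[i][j]: best score for the first i residues ending at EST position j.
    score = [[0] * (n + 1)] + [[neg] * (n + 1) for _ in range(m)]
    back = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(2, n + 1):
            for length in (3, 2, 4):
                if j < length or score[i - 1][j - length] == neg:
                    continue
                s = score[i - 1][j - length] + fragment_score(
                    protein[i - 1], est[j - length:j])
                if length != 3:
                    s -= shift_penalty
                if s > score[i][j]:
                    score[i][j], back[i][j] = s, length
    end = max(range(n + 1), key=lambda j: score[m][j])
    best = score[m][end]
    if best == neg:
        raise ValueError("EST is too short for this protein")
    # Trace back and collect the fragments that are not 3 bases long.
    shifts, j = [], end
    for i in range(m, 0, -1):
        length = back[i][j]
        if length != 3:
            shifts.append((i - 1, j - length, length))
        j -= length
    return best, shifts[::-1]
```

For example, `align("MWVKD", "ATGTGCGGTTAAAGAT")` reports one frameshift at residue index 1 (the `W`), where the EST fragment `TGCG` carries an extra base relative to the codon `TGG`. A six-frame translation of the same EST would lose the reading frame from that point on.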
Source
《生物物理学报》
CAS
CSCD
PKU Core (北大核心)
2000, Issue 2, pp. 322-333 (12 pages)
Acta Biophysica Sinica
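Funding
Major Program of the National Natural Science Foundation of China (grant no. 39990600-03)
Project of the Chinese National Human Genome Center at Shanghai (国家人类基因组南方研究中心)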
Keywords
Sequencing error
Frameshift error
Reverse translation
Protein-EST alignment