Abstract
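The sequence coding potential given by coding measures commonly used in gene prediction software is often closely correlated with the C+G content of the sequence, which impairs the recognition of protein-coding regions. The study found an approximately linear correlation between hexamer usage preference and the hexamer's own C+G content, and on this basis proposes an improved hexamer usage preference model. By jointly considering hexamer usage frequency and hexamer C+G content, the model simply and effectively reduces the dependence of sequence coding potential on sequence C+G content. Tests show that, compared with the strategy of building separate models by C+G class, the method requires less training data and recognizes protein-coding regions better, so it can be used in gene prediction software to improve the accuracy of protein-coding region and gene structure prediction.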
Statistical characteristics of nucleotide composition are important for identifying protein-coding regions. However, the coding potentials calculated by some widely used coding measures are closely related to the C+G content of the sequence, which degrades the recognition of protein-coding regions. For this reason, some well-known eukaryotic gene identification programs learn their parameters separately from reference sets of different C+G content. An improved hexamer usage preference model that reduces the dependence of coding potential on C+G content is presented. The proposed algorithm needs less training data than the separate-training strategy, yet recognizes protein-coding regions better. It is hoped that the algorithm will help improve the accuracy of existing gene-finding programs.
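The abstract does not give the model's formula, so the following is only an illustrative sketch of one way the idea could be realized, not the paper's actual method. It assumes the hexamer usage preference is a log ratio of in-frame coding frequency to noncoding frequency, fits a straight line of that preference against each hexamer's C+G count, and subtracts the fitted trend. All function names, the pseudocount, and the scoring rule are hypothetical.

```python
import math
from collections import Counter
from itertools import product

HEXAMERS = ["".join(p) for p in product("ACGT", repeat=6)]

def hexamer_freqs(seqs, step, pseudocount=1.0):
    """Relative frequency of every hexamer; the pseudocount avoids zeros."""
    counts = Counter()
    for s in seqs:
        s = s.upper()
        for i in range(0, len(s) - 5, step):
            counts[s[i:i + 6]] += 1
    total = sum(counts[h] for h in HEXAMERS) + pseudocount * len(HEXAMERS)
    return {h: (counts[h] + pseudocount) / total for h in HEXAMERS}

def gc_count(hexamer):
    """Number of C or G bases in a hexamer (0 to 6)."""
    return sum(1 for base in hexamer if base in "GC")

def fit_line(xs, ys):
    """Ordinary least-squares line; returns (slope, intercept)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

def train(coding_seqs, noncoding_seqs):
    """Return (raw, corrected) preference tables.

    Assumption: preference = log(coding freq / noncoding freq), and the
    correction removes the linear trend of preference on C+G count.
    """
    fc = hexamer_freqs(coding_seqs, step=3)     # in-frame hexamers
    fn = hexamer_freqs(noncoding_seqs, step=1)  # all positions
    raw = {h: math.log(fc[h] / fn[h]) for h in HEXAMERS}
    slope, intercept = fit_line([gc_count(h) for h in HEXAMERS],
                                [raw[h] for h in HEXAMERS])
    corrected = {h: raw[h] - (slope * gc_count(h) + intercept)
                 for h in HEXAMERS}
    return raw, corrected

def coding_potential(seq, table):
    """Mean table score over in-frame hexamers; unknown hexamers are skipped."""
    seq = seq.upper()
    scores = [table[seq[i:i + 6]] for i in range(0, len(seq) - 5, 3)
              if seq[i:i + 6] in table]
    return sum(scores) / len(scores) if scores else 0.0
```

Scoring a candidate region with the corrected table rather than the raw one is meant to keep C+G-rich sequences from receiving a high score on composition alone. Whether this matches the published model would need to be checked against the full paper.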
Source
《华中科技大学学报(自然科学版)》
EI
CAS
CSCD
PKU Core (北大核心)
2005, No. 7, pp. 107-110 (4 pages)
Journal of Huazhong University of Science and Technology(Natural Science Edition)
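Funding
Supported by the National Natural Science Foundation of China (90203011)
Supported by the Natural Science Foundation of Hubei Province (2002AC014).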