Abstract
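Using a support vector machine as the classifier and the k-letter words of a sequence as features, a gene recognition model for prokaryotes was built. The training set consists of genes of known function as positive samples and, as negative samples, random mutation sequences of the same length as the positive samples. Results of 5-fold cross-validation show that prediction accuracy differs across support vector machines with different kernel functions and across word features of different lengths: the highest exceeds 94% and the lowest is below 60%. Word features of length 3 give the best classification results, followed by length 4. This indicates that trinucleotides are a relatively good statistical feature of gene sequences.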
A gene recognition model for prokaryotes is built, with a support vector machine (SVM) as the classifier and the k-letter words of a sequence as features. The training set consists of positive samples chosen from genes of known function and negative samples of equal length generated by random mutation of the corresponding positive samples. Results of 5-fold cross-validation indicate that the prediction accuracy of the SVM varies with the kernel function and the word length, from above 94% at best to below 60% at worst; 3-letter words give the best classification results, followed by 4-letter words. This suggests that trinucleotides are a relatively good statistical feature of gene sequences.
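The feature construction and negative-sample generation described in the abstract can be illustrated with a minimal sketch. The abstract does not give implementation details, so several choices here are assumptions rather than the authors' method: words are counted with an overlapping sliding window, counts are normalized to frequencies, and negatives are produced by independent per-base substitution at a hypothetical `rate`. The function names `kmer_features` and `random_mutant` are illustrative only, and the SVM training and cross-validation steps are not shown.

```python
import itertools
import random

ALPHABET = "ACGT"  # nucleotide alphabet; the source only says "k-letter words"


def kmer_features(seq, k):
    """Return normalized frequencies of all 4**k k-letter words in seq.

    Assumption: words are counted with an overlapping sliding window and
    windows containing letters outside ALPHABET are skipped.
    """
    words = ["".join(p) for p in itertools.product(ALPHABET, repeat=k)]
    counts = dict.fromkeys(words, 0)
    total = 0
    for i in range(len(seq) - k + 1):
        word = seq[i:i + k]
        if word in counts:
            counts[word] += 1
            total += 1
    return [counts[w] / total if total else 0.0 for w in words]


def random_mutant(seq, rate, rng):
    """Return an equal-length copy of seq with random base substitutions.

    Assumption: each position is substituted independently with
    probability `rate`; the source does not specify the mutation model.
    """
    mutated = []
    for base in seq:
        if rng.random() < rate:
            mutated.append(rng.choice([b for b in ALPHABET if b != base]))
        else:
            mutated.append(base)
    return "".join(mutated)


if __name__ == "__main__":
    rng = random.Random(0)
    gene = "ATGGCGAAATAA"                      # toy positive sample
    negative = random_mutant(gene, 0.5, rng)   # toy negative sample
    print(len(kmer_features(gene, 3)), len(negative) == len(gene))
```

With k = 3 each sequence maps to a 64-dimensional vector and with k = 4 to a 256-dimensional one; such vectors, labeled positive or negative, would be the input to whichever SVM implementation is used.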
Source
Journal of Hunan First Normal University (《湖南第一师范学院学报》)
2011, No. 2, pp. 133-136 (4 pages)
Funding
Scientific Research Project of the Hunan Provincial Department of Education (09C888)