Abstract
With the arrival of the post-genomic era, uncovering the biological information hidden in sequences has become a central topic in the life sciences. Promoter sequences control gene expression, so recognizing and predicting the promoter regions of a sequence is a key problem in gene research. Hidden Markov models (HMMs) have been the principal models for gene analysis in recent years. This paper first discusses the EM algorithm and proposes a random iterated algorithm. With the initial state distribution and the emission matrix assumed at random and the transition matrix computed from the sequences, the method recognizes human promoter sequences with a mean recognition rate of 92.05%. The paper also improves the "voting strategy" used in multi-class classification by proposing a "one ticket determines" algorithm, which reduces the order of computation from O(N^2) to O(N). Applied to the classification of multiple DNA families, it reaches an accuracy of 90.73%. The results indicate that the support vector machine (SVM) outperforms the HMM on two-class problems, whereas the HMM has stronger classification ability than the SVM on multi-class problems.
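The abstract does not describe how the "one ticket determines" algorithm works internally, so the following is only a sketch of how the number of pairwise decisions can fall from quadratic to linear. It contrasts ordinary pairwise voting with a sequential knockout, which is a stand-in technique chosen for illustration and not necessarily the authors' method. The names `pairwise_vote`, `knockout`, and the `duel` callback are hypothetical.

```python
from itertools import combinations
from typing import Callable, Hashable, Sequence, Tuple

# Hypothetical pairwise decider: given two class labels, return the one
# that the underlying two-class model prefers for the current sample.
Duel = Callable[[Hashable, Hashable], Hashable]


def pairwise_vote(classes: Sequence[Hashable], duel: Duel) -> Tuple[Hashable, int]:
    """Classic voting: every pair of classes duels, most votes wins.

    Uses N*(N-1)/2 duels, i.e. O(N^2).
    """
    votes = {c: 0 for c in classes}
    n_duels = 0
    for a, b in combinations(classes, 2):
        votes[duel(a, b)] += 1
        n_duels += 1
    winner = max(classes, key=lambda c: votes[c])
    return winner, n_duels


def knockout(classes: Sequence[Hashable], duel: Duel) -> Tuple[Hashable, int]:
    """Sequential knockout (illustrative stand-in, an assumption).

    The current champion meets each remaining class once and the loser
    is discarded, so only N-1 duels are needed, i.e. O(N).
    """
    champion = classes[0]
    n_duels = 0
    for challenger in classes[1:]:
        champion = duel(champion, challenger)
        n_duels += 1
    return champion, n_duels
```

When the pairwise decisions are consistent with a single underlying ranking, both procedures pick the same class. With inconsistent decisions they can disagree, which is the usual trade-off for the cheaper scheme.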
Source
Computer Science (《计算机科学》)
CSCD
PKU Core Journal (北大核心)
2006, No. 6, pp. 195-199 (5 pages)
Funding
Supported by the National Natural Science Foundation of China (No. 10371135).
Keywords
Hidden Markov models
Random iterated algorithm
"One ticket determines" algorithm
Promoter recognition and classification