Semi-supervised Learning of Promoter Sequences Based on EM Algorithm (基于EM的启动子序列半监督学习)

Cited by: 3

Abstract: Predicting eukaryotic promoters is one of the most important problems in DNA sequence analysis, because it helps locate genes. A promoter is a short subsequence upstream of a transcription start site (TSS) in a DNA sequence, so predicting the position of a promoter approximately locates the TSS and helps guide biological experiments. Most proposed prediction algorithms rely on search strategies such as signal search, content search, or CpG island search, and their performance is still limited by low sensitivity and a high false-positive rate. Promoter classification based on Markov chains has been shown to be effective, but its parameters, such as the transition probabilities, are estimated directly by counting over labeled training sequences. This paper introduces semi-supervised learning into promoter sequence analysis to improve classification accuracy by combining labeled and unlabeled sequences, and derives the maximum likelihood estimation formulas for the transition probabilities and other parameters. In the simulation experiments, each long genomic sequence is truncated into short segments, which are mixed with the labeled data and then classified according to the estimated probabilities. Comparison with several known prediction algorithms shows that semi-supervised learning of promoter sequences based on the EM algorithm is effective when the number of labeled samples is small, and that its F1 value is higher than that of prediction based on labeled samples alone.
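The general idea in the abstract can be illustrated with a minimal sketch. This is not the authors' derivation or code: it assumes a two-class setup (promoter vs. non-promoter), a first-order Markov chain per class, add-one smoothing, and sequences that are non-empty strings over A, C, G, T. All function names (`estimate`, `posterior`, `semi_supervised_em`) are hypothetical. Labeled sequences keep fixed class weights, while unlabeled segments receive soft class weights in the E-step; the M-step re-estimates the transition probabilities from the weighted counts.

```python
import math

ALPHABET = "ACGT"  # assumption: sequences contain only these symbols


def estimate(seqs, weights, pseudo=1.0):
    """Weighted maximum likelihood estimate of a first-order Markov chain.

    Each sequence contributes its counts scaled by its weight; `pseudo`
    is an add-one style smoothing constant (an assumption of this sketch).
    """
    init = {a: pseudo for a in ALPHABET}
    trans = {a: {b: pseudo for b in ALPHABET} for a in ALPHABET}
    for s, w in zip(seqs, weights):
        init[s[0]] += w
        for a, b in zip(s, s[1:]):
            trans[a][b] += w
    z = sum(init.values())
    init = {a: v / z for a, v in init.items()}
    for a in ALPHABET:
        z = sum(trans[a].values())
        trans[a] = {b: v / z for b, v in trans[a].items()}
    return init, trans


def log_lik(s, model):
    """Log-likelihood of sequence s under one Markov chain."""
    init, trans = model
    ll = math.log(init[s[0]])
    for a, b in zip(s, s[1:]):
        ll += math.log(trans[a][b])
    return ll


def posterior(s, models, priors):
    """Class posteriors [P(non-promoter | s), P(promoter | s)]."""
    logs = [math.log(p) + log_lik(s, m) for m, p in zip(models, priors)]
    top = max(logs)  # subtract the maximum for numerical stability
    exps = [math.exp(v - top) for v in logs]
    z = sum(exps)
    return [e / z for e in exps]


def semi_supervised_em(labeled, unlabeled, n_iter=20):
    """labeled: list of (sequence, label), label 1 = promoter, 0 = other."""
    lab_seqs = [s for s, _ in labeled]
    fixed = [[1.0 - y, float(y)] for _, y in labeled]
    seqs = lab_seqs + list(unlabeled)
    # Initialise both class models from the labeled data only.
    models = [estimate(lab_seqs, [r[c] for r in fixed]) for c in (0, 1)]
    priors = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: soft class weights for the unlabeled segments.
        soft = [posterior(s, models, priors) for s in unlabeled]
        resp = fixed + soft
        # M-step: re-estimate chains and (smoothed) class priors.
        models = [estimate(seqs, [r[c] for r in resp]) for c in (0, 1)]
        totals = [sum(r[c] for r in resp) for c in (0, 1)]
        priors = [(t + 1.0) / (len(resp) + 2.0) for t in totals]
    return models, priors


# Toy usage with made-up sequences (not data from the paper).
toy_labeled = [("CGCGCGCGCG", 1), ("ATATATATAT", 0)]
toy_unlabeled = ["GCGCGCGC", "TATATATA"]
toy_models, toy_priors = semi_supervised_em(toy_labeled, toy_unlabeled, n_iter=5)
print(posterior("CGCGCG", toy_models, toy_priors))
```

Initialising from the labeled data alone and then folding in the unlabeled segments is one common way to run EM in this setting; the paper's own update formulas may differ in detail.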

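Authors: Wang Lihong (王立宏), Zhao Xianjia (赵宪佳), Wu Shuanhu (武栓虎)

Affiliations: School of Computer Science and Technology, Yantai University; International College, Qingdao University

Source: Journal of Computer Research and Development (《计算机研究与发展》), indexed in EI, CSCD, and the PKU Core Journals list; 2009, No. 11, pp. 1942-1948 (7 pages)

Funding: National Natural Science Foundation of China (60772028); Natural Science Foundation of Shandong Province (Y2006G22, Y2008G08)

Keywords: Markov model; maximum likelihood estimation; promoter recognition; transition probability; semi-supervised learning

Classification: TP391.4 [Automation and Computer Technology: Computer Application Technology]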
