Abstract
Motif search plays an important role in gene finding and in understanding gene regulatory relationships, and it is one of the most challenging problems in bioinformatics. This paper presents three data partitioning methods for the PMSP algorithm and, building on them, proposes a MapReduce-based motif search algorithm (PMSPMR). Experimental results on a Hadoop cluster for problems of varying difficulty demonstrate that PMSPMR has good scalability. In particular, for motif search problem instances of high difficulty, the speedup of PMSPMR approaches the number of nodes in the Hadoop cluster. In addition, in experiments on real biological data, PMSPMR identifies known transcriptional regulatory motifs in eukaryotes and in actual promoter sequences extracted from Saccharomyces cerevisiae, which demonstrates the effectiveness of the algorithm.
Source
China Sciencepaper (《中国科技论文》)
CAS
PKU Core Journals (北大核心)
2012, No. 7, pp. 487-494, 502 (9 pages in total)
Funding
National Natural Science Foundation of China (61173025)
Specialized Research Fund for the Doctoral Program of Higher Education (20100203110010)
Keywords
motif finding
data partitioning
scalability