Abstract
This paper addresses very-large-scale sequence alignment computation in bioinformatics. It resolves the memory, I/O, and computation-time bottlenecks that the existing e-PCR software package encounters when aligning primers against wheat gene sequences for amplification. Using data and task partitioning, the most critical step, the alignment of primers against template sequences, is parallelized at large scale. The implementation uses MPI with a master-slave communication scheme and is further optimized for task reduction, load balancing, fault tolerance, and concurrent execution of multiple jobs. The program ultimately ran in parallel on the order of a thousand cores on a 100-Tflops-class supercomputer, completing in tens of days a wheat sequence amplification alignment computation that was originally expected to take several years.
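The data and task partitioning described in the abstract can be illustrated with a minimal sketch. It is not the authors' code: the paper uses MPI in a master-slave arrangement on a supercomputer, whereas this sketch substitutes a thread pool from the Python standard library so that it runs on a single machine. The function names (`split`, `make_tasks`, `align`, `run_task`, `run_all`) and the batch sizes are hypothetical, and `align` is a plain substring test standing in for the real e-PCR primer/template matching.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import product


def split(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def make_tasks(primers, templates, primer_batch, template_batch):
    """Pair every primer chunk with every template chunk.

    Each pair is an independent task, so a worker only ever needs one
    chunk of each input in memory rather than the full data set.
    """
    return list(product(split(primers, primer_batch),
                        split(templates, template_batch)))


def align(primer, template):
    """Placeholder for the real comparison (assumption: exact substring match)."""
    return primer in template


def run_task(task):
    """Worker body: compare each primer in the chunk with each template in the chunk."""
    primer_chunk, template_chunk = task
    return [(p, t) for p in primer_chunk for t in template_chunk if align(p, t)]


def run_all(tasks, n_workers=4):
    """Stand-in for the master: idle workers pull the next pending task from the pool."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = pool.map(run_task, tasks)  # results come back in task order
        return [hit for task_hits in results for hit in task_hits]


if __name__ == "__main__":
    primers = ["ACG", "TTT", "GGA"]
    templates = ["AACGT", "CCGGA", "TTTTT", "CATCAT"]
    tasks = make_tasks(primers, templates, primer_batch=2, template_batch=2)
    print(len(tasks), "tasks")
    print(run_all(tasks, n_workers=2))
```

Because workers take tasks as they become free, chunks that happen to be slow do not stall the others, which is one simple route to the load balancing the abstract mentions. In an MPI version, the pool would presumably be replaced by a master rank that sends task descriptors to worker ranks and collects their results.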
Source
Journal of Computer Applications (《计算机应用》)
CSCD
Peking University Core Journal
2011, Issue A02, pp. 32-35 (4 pages)
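Funding
National 863 Program of China (2006AA01A116)
Chinese Academy of Sciences "11th Five-Year Plan" Informatization Special Project (INFO-115-B01)
Knowledge Innovation Program of the Chinese Academy of Sciences (CNIC_QN_10004)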
Keywords
parallel computing
bioinformatics
molecular marker
sequence alignment
task partitioning
e-PCR