Abstract
To address the long running time of short-read mapping in bioinformatics analysis, a distributed computing model was designed on the Spark platform using RDD datasets and the HDFS distributed file system. Following a divide-and-conquer strategy, a large computing job is split into multiple non-overlapping small tasks that execute in parallel on a distributed cluster. Data distribution is implemented by a partitioning algorithm that divides the data equally by position offset; short reads are processed one by one by encapsulating them in RDD datasets; and sequence alignment is performed by passing the alignment algorithm to the RDD Map function. The model makes a serial alignment algorithm scalable on a distributed cluster and significantly reduces computation time, and its output is compatible with downstream bioinformatics analysis. Experimental results show that the model has good stability and scalability and achieves an excellent speedup on the Spark cluster.
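The partition-then-map idea in the abstract can be illustrated with a minimal single-machine sketch. This is not the authors' implementation: it assumes one read per line (real FASTQ records span several lines), and the names `split_by_offset`, `align`, and `map_reads` are hypothetical. The `align` function is a toy exact-match placeholder for a real serial aligner, and the plain Python `map` only stands in for the RDD Map step described above.

```python
from typing import List, Tuple


def split_by_offset(data: bytes, n_parts: int) -> List[Tuple[int, int]]:
    """Cut `data` into at most n_parts non-overlapping byte ranges.

    Cut points start at equal offsets and are then moved forward to just
    past the next newline, so no read is split across two partitions.
    """
    size = len(data)
    cuts = [0]
    for i in range(1, n_parts):
        pos = size * i // n_parts          # equal-offset candidate
        nl = data.find(b"\n", pos)         # next record boundary
        cut = size if nl == -1 else nl + 1
        if cut > cuts[-1]:                 # skip empty partitions
            cuts.append(cut)
    if cuts[-1] < size:
        cuts.append(size)
    return list(zip(cuts[:-1], cuts[1:]))


def align(read: str, reference: str) -> Tuple[str, int]:
    """Toy placeholder aligner: exact-match position in reference, or -1."""
    return read, reference.find(read)


def map_reads(data: bytes, reference: str, n_parts: int) -> List[Tuple[str, int]]:
    """Apply `align` to every read, partition by partition."""
    results: List[Tuple[str, int]] = []
    for start, end in split_by_offset(data, n_parts):
        reads = data[start:end].decode("ascii").split()
        # Stand-in for the per-partition map step on a cluster.
        results.extend(map(lambda r: align(r, reference), reads))
    return results
```

Because each byte range ends on a record boundary, the partitions can be processed independently and their results concatenated in order, which is the property the divide-and-conquer strategy relies on.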
Authors
FENG Xiao-long (冯晓龙); GAO Jing (高静)
College of Computer and Information Engineering, Inner Mongolia Agricultural University, Hohhot, Inner Mongolia 010018, China
Source
Computer Simulation (《计算机仿真》)
Indexed in the Peking University Core Journals list (北大核心)
2020, No. 2, pp. 231-236 (6 pages)
Funding
National Natural Science Foundation of China (61462070).
Keywords
Gene sequence alignment
Short reads mapping
Distributed computing
Parallel computing