Abstract
The rapid development of RNA-seq technology generates massive volumes of data, which pose a serious challenge to the execution efficiency of existing read mapping algorithms. To address this, two MapReduce-based spaced seed indexing algorithms are proposed: PSeqMap, which does not span splice sites, and PJuncSeqMap, which spans splice sites, together with a load-balancing solution. Both algorithms use the MapReduce framework to parallelize spaced seed indexing. Experimental results on an Arabidopsis gene dataset show that the proposed algorithms make full use of the cluster's storage and computing capacity and process massive genetic data efficiently.
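The abstract does not give the details of PSeqMap or PJuncSeqMap, so the following is only a minimal single-process sketch of the general idea of spaced seed indexing expressed as a map step and a reduce step. The seed pattern `MASK`, the function names, and the toy sequences are all assumptions made for illustration; splice-site handling and load balancing are not modeled.

```python
from collections import defaultdict

# Hypothetical spaced-seed pattern: '1' = position is compared, '0' = ignored.
MASK = "1101011"

def seed_key(window, mask=MASK):
    """Keep only the characters at the mask's '1' positions."""
    return "".join(c for c, m in zip(window, mask) if m == "1")

def map_reference(name, sequence, mask=MASK):
    """Map step: emit (spaced-seed key, (reference name, offset)) pairs."""
    span = len(mask)
    for offset in range(len(sequence) - span + 1):
        yield seed_key(sequence[offset:offset + span], mask), (name, offset)

def reduce_index(pairs):
    """Reduce step: group all positions that share the same key."""
    index = defaultdict(list)
    for key, position in pairs:
        index[key].append(position)
    return dict(index)

def candidate_hits(read, index, mask=MASK):
    """Look up the read's first window; return candidate reference positions."""
    if len(read) < len(mask):
        return []
    return index.get(seed_key(read[:len(mask)], mask), [])

# Toy data (assumed): one short reference and one read that differs from it
# only at an ignored ('0') position of the mask.
reference = {"chr_toy": "ACGTACGTTAGC"}
pairs = [p for name, seq in reference.items() for p in map_reference(name, seq)]
index = reduce_index(pairs)
hits = candidate_hits("ACTTACG", index)
```

Because the read's mismatch falls on a position the mask ignores, the lookup still returns the reference offset as a candidate, which is the property that makes spaced seeds tolerant of isolated mismatches. In an actual MapReduce deployment the map and reduce functions would run on separate workers, with the framework shuffling pairs by key.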
Source
《模式识别与人工智能》
EI
CSCD
Peking University Core Journals (北大核心)
2014, No. 3, pp. 206-212 (7 pages)
Pattern Recognition and Artificial Intelligence
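Funding
Supported by:
National Natural Science Foundation of China (No. 61272222, 61003116)
Key and Major Special Program of the Natural Science Foundation of Jiangsu Province (No. BK2011005)
Natural Science Foundation of Jiangsu Province (No. BK2011782)
Graduate Research and Innovation Program of Jiangsu Higher Education Institutions (No. CXLX12_0415)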