Abstract
To address the high error rate of third-generation sequencing data and to improve the accuracy of genome assembly, an assembly and error-correction algorithm called SuperLLEC is designed for PacBio long-read data, using 10X Genomics linked-read sequencing data. Wtdbg2 is used to assemble the PacBio long reads of a genome into scaffolds. Bowtie2 is then used to align the linked reads to these scaffolds, and the scaffolds are further assembled according to the linked-read barcodes. For each mismatched alignment site, Fisher's exact test is used to predict whether the site is a single nucleotide polymorphism (SNP) or a base miscalled by PacBio sequencing. Comparison experiments on long-read and linked-read data from three groups of human cells show that SuperLLEC noticeably improves genome assembly accuracy, NG50 length, and the accuracy of SNP site prediction.
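The abstract names Fisher's exact test as the statistic applied at mismatched alignment sites but does not give the contingency table. The sketch below only illustrates the test itself on a 2x2 table of read counts. The table layout (site of interest versus background sites, mismatching versus matching reads) and the function name `fisher_exact_two_sided` are assumptions made for illustration, not details taken from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]].

    Hypothetical layout (an assumption, not from the paper):
        row 1 = reads at the site of interest, row 2 = reads at background sites
        col 1 = reads that mismatch the scaffold, col 2 = reads that match it
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table at least as extreme as the observed
    # one; the small tolerance guards against floating-point ties.
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


if __name__ == "__main__":
    # Illustrative counts only: 8 mismatching and 2 matching reads at the site,
    # versus 1 mismatching and 5 matching reads in the background.
    print(fisher_exact_two_sided(8, 2, 1, 5))
```

A small p-value indicates that the mismatch rate at the site differs from the background rate. How SuperLLEC turns such a p-value into an SNP or sequencing-error call is not stated in the abstract, so no decision threshold is shown here.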
Authors
CUI Yaxuan; ZHANG Shaoqiang (College of Computer Information and Engineering, Tianjin Normal University, Tianjin 300387, China)
Source
Computer Engineering and Applications (《计算机工程与应用》)
Indexed in: CSCD; Peking University Core Journals (北大核心)
2022, Issue 3, pp. 201-206 (6 pages)
Funding
National Natural Science Foundation of China (61572358)
Key Project of the Natural Science Foundation of Tianjin (19JCZDJC35100)
Keywords
linked-reads
long-reads
scaffolds
assembly
error correction
Fisher’s exact test