SuperLLEC: New Assembly and Error Correction Algorithm for Long Reads and Linked-Reads

Abstract: To address the high error rate of third-generation sequencing data and improve the accuracy of genome assembly, SuperLLEC, an assembly and error correction algorithm for PacBio long-read data, is designed on the basis of 10X Genomics linked-read sequencing data. The algorithm uses Wtdbg2 to assemble the PacBio long reads of a genome into scaffolds, uses Bowtie2 to align the linked-reads to these scaffolds, and then further assembles the scaffolds according to the linked-read barcodes. For each mismatched alignment site, Fisher's exact test is used to predict whether the site is a single nucleotide polymorphism (SNP) or a base miscalled by PacBio sequencing. Comparison experiments on long-read and linked-read data from three groups of human cells show that SuperLLEC improves genome assembly accuracy, increases NG50 length, and improves the precision of SNP site prediction.
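To make the mismatch-classification step more concrete, the following is a minimal Python sketch of Fisher's exact test applied to one mismatched site. The abstract does not specify how the contingency table is built, so everything about the table is an assumption made for illustration only: rows are two hypothetical barcode-derived read groups (for example, two haplotypes), columns are counts of linked-reads supporting the scaffold base versus the alternative base. The function names, the table layout, and the 0.05 threshold are all hypothetical and are not taken from the paper.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, row2 = a + b, c + d
    col1 = a + c
    total = row1 + row2
    denom = comb(total, col1)

    def prob(x):
        # Hypergeometric probability of seeing x in the top-left cell
        # when all row and column totals are held fixed.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    p_observed = prob(a)
    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    # Sum the probabilities of every table that is no more likely than
    # the observed one (small relative tolerance for float comparison).
    return sum(
        p
        for p in (prob(x) for x in range(lowest, highest + 1))
        if p <= p_observed * (1 + 1e-9)
    )


def classify_mismatch_site(g1_ref, g1_alt, g2_ref, g2_alt, alpha=0.05):
    """Hypothetical decision rule (an assumption, not the paper's method).

    g1_* and g2_* are linked-read counts in two barcode-derived groups that
    support the scaffold base (ref) or the alternative base (alt). If the
    allele is significantly associated with the group, the site is called a
    SNP; otherwise the mismatch is attributed to a sequencing error.
    """
    p_value = fisher_exact_two_sided(g1_ref, g1_alt, g2_ref, g2_alt)
    return "SNP" if p_value < alpha else "sequencing error"


if __name__ == "__main__":
    # Alleles split cleanly between the two groups.
    print(classify_mismatch_site(10, 0, 0, 10))
    # Every linked-read disagrees with the scaffold base in both groups.
    print(classify_mismatch_site(0, 10, 0, 10))
```

The test is implemented with `math.comb` from the standard library so that the sketch has no external dependencies; in practice a library routine such as `scipy.stats.fisher_exact` would be the usual choice.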

Authors: CUI Yaxuan; ZHANG Shaoqiang (College of Computer Information and Engineering, Tianjin Normal University, Tianjin 300387, China)

Affiliation: College of Computer and Information Engineering, Tianjin Normal University

Source: Computer Engineering and Applications (计算机工程与应用), CSCD, Peking University Core Journal, 2022, Issue 3, pp. 201-206 (6 pages)

Funding: National Natural Science Foundation of China (61572358); Key Project of the Natural Science Foundation of Tianjin (19JCZDJC35100).

Keywords: linked-reads; long reads; scaffolds; assembly; error correction; Fisher's exact test

Classification: TP391 [Automation and Computer Technology: Computer Application Technology]
