Multiple interleaved RS codes for data storage using up to Mb-scale synthetic DNA in living cells

Abstract: Synthetic DNA, as a potential digital data storage medium, has a high storage density and a very long usable lifetime, and is expected to become an important option for future massive data storage. However, the synthesis, assembly and sequencing of DNA often introduce multiple types of base errors, so raw DNA storage does not satisfy the reliability requirements of data storage, while reliability-enhanced coding schemes usually sacrifice logical coding density by adding redundancy. To address this problem, a high-efficiency encoding method for DNA data storage using large synthetic DNA fragments in Saccharomyces cerevisiae is proposed.
For encoding, data DNA chunks are constructed by interleaving the codewords of multiple Reed-Solomon (RS) codes with a very high code rate, and these chunks are embedded alternately with autonomously replicating sequences (ARSs) to form a yeast artificial chromosome sequence. For readout, high-throughput sequencing is combined with de novo short-read assembly based on de Bruijn graphs, ARS-guided contig combination, and erasure/error correction to achieve reliable data recovery. The error correction capability is fully exploited in two ways: interleaving spreads large missing fractions into random erasures across all the RS codewords, and an RS code can correct more erasures than errors. We designed and simulated a 2.5 Mb ring chromosome and successfully recovered the original data without error from 20× high-throughput sequencing reads. The simulated sequencing data were generated using the ART simulation software, which had been trained on real sequencing data from a 254,886 bp artificial chromosome previously constructed for data storage. All the processes, including large DNA chunk assembly, DNA replication, extraction and high-throughput sequencing, are viewed together as the DNA storage channel in the information-theoretic sense. We provide an efficient encoding scheme that matches the codes to this DNA storage channel, following the information theory paradigm. The logical density of the data DNA chunks is 1.973 bit/bp, and the overall logical density still reaches 1.947 bit/bp when the biological units (ARSs and vector backbones) are included. The demonstrated design process supports DNA coding schemes for lengths from Kb up to Mb, which provides flexible pre-experiment verification and evaluation for wet experiments on large-fragment DNA data storage.
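The role of interleaving described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the code length `N`, dimension `K`, interleaving depth `DEPTH` and the burst position and length are hypothetical values chosen only for illustration, and no RS decoding is performed. The sketch only counts how many symbols of each codeword are erased when one contiguous stretch of the stored sequence is missing, with and without a simple column-wise block interleaver.

```python
# Minimal sketch (assumed parameters, not taken from the paper):
# a column-wise block interleaver turns one long missing stretch into
# a few scattered erasures per codeword.

def interleave(codewords):
    """Read equal-length codewords (rows) out column by column."""
    n = len(codewords[0])
    assert all(len(cw) == n for cw in codewords)
    return [cw[j] for j in range(n) for cw in codewords]

def deinterleave(stream, depth):
    """Inverse of interleave(): rebuild the `depth` codewords."""
    n, rem = divmod(len(stream), depth)
    assert rem == 0
    return [[stream[j * depth + i] for j in range(n)] for i in range(depth)]

def erase(stream, start, length):
    """Return a copy with a contiguous burst replaced by None (erasure)."""
    damaged = list(stream)
    damaged[start:start + length] = [None] * length
    return damaged

# Hypothetical parameters: RS-like code over bytes with N - K = 4 parity
# symbols, i.e. at most 4 erasures per codeword are assumed correctable.
N, K, DEPTH = 255, 251, 100
BURST_START, BURST_LEN = 5000, 350

# Placeholder symbols stand in for real RS codewords.
codewords = [[(i * N + j) % 256 for j in range(N)] for i in range(DEPTH)]

# With interleaving: the burst is spread across all codewords.
stream = interleave(codewords)
assert deinterleave(stream, DEPTH) == codewords
damaged = erase(stream, BURST_START, BURST_LEN)
interleaved_counts = [cw.count(None) for cw in deinterleave(damaged, DEPTH)]

# Without interleaving: codewords are simply concatenated.
flat = [s for cw in codewords for s in cw]
flat_damaged = erase(flat, BURST_START, BURST_LEN)
plain_counts = [flat_damaged[i * N:(i + 1) * N].count(None) for i in range(DEPTH)]

print("worst codeword, interleaved:", max(interleaved_counts))
print("worst codeword, concatenated:", max(plain_counts))
print("correctable with interleaving:", max(interleaved_counts) <= N - K)
```

Under these assumed values the interleaved layout leaves every codeword within its erasure budget, while the concatenated layout concentrates the loss in one or two codewords. The depth needed in practice depends on the longest missing stretch expected after assembly.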
Authors: CHEN Weigang (陈为刚); GE Qi (葛奇); WANG Panpan (王盼盼); HAN Mingzhe (韩明哲); GUO Jian (郭健) (School of Microelectronics, Tianjin University, Tianjin 300072, China; Frontiers Science Center for Synthetic Biology (MOE), Tianjin University, Tianjin 300072, China; School of Chemical Engineering and Technology, Tianjin University, Tianjin 300072, China)
Source: Synthetic Biology Journal (《合成生物学》), CSCD, 2021, No. 3, pp. 428-443 (16 pages)
Keywords: DNA data storage; Reed-Solomon (RS) codes; interleaving; autonomously replicating sequence; contig