Funding: the National Natural Science Foundation of China (11771393, 11632015); the Natural Science Foundation of Zhejiang Province, China (LZ14A010002).
Abstract: K-mers can be used to describe biological sequences, and the k-mer distribution is a tool for solving sequence analysis problems in bioinformatics. A k-mer vector can represent the k-mer distribution of a biological sequence. Problems such as similarity calculation or sequence assembly can then be described in the k-mer vector space. This helps to identify new features of long-standing sequence-based problems in bioinformatics and to develop new algorithms using concepts and methods from linear space theory. In this study, we define the k-mer vector space for generalized biological sequences and explain the meaning of the corresponding vector operations in the biological context. We present the vector/matrix form of several common sequence-based problems, including read quantification, sequence assembly, and pattern detection, and discuss the advantages and disadvantages of this formulation. We also implement a tool for the sequence assembly problem based on k-mer vector methods, which shows the practicability and convenience of this algorithm design strategy.
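As a rough illustration of the k-mer vector idea described in the abstract, the sketch below counts every k-mer of a sequence into a fixed-order vector and compares two sequences by cosine similarity. The function names, the DNA alphabet, and the choice of cosine similarity are assumptions made for this example; the abstract does not specify the paper's own definitions or similarity measure.

```python
from collections import Counter
from itertools import product
import math


def kmer_vector(seq, k, alphabet="ACGT"):
    """Return the k-mer count vector of seq (hypothetical helper).

    Coordinates follow the lexicographic order of all len(alphabet)**k
    possible k-mers, so vectors of different sequences are comparable.
    """
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    return [counts["".join(p)] for p in product(alphabet, repeat=k)]


def cosine_similarity(u, v):
    """Cosine of the angle between two count vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Toy sequences; real use would pick k and the alphabet to suit the data.
u = kmer_vector("ACGTACGT", 2)
v = kmer_vector("ACGTTGCA", 2)
similarity = cosine_similarity(u, v)
```

A sequence of length n contributes n - k + 1 overlapping k-mers, so the entries of each vector sum to that number. The vector length grows as 4^k for DNA, which is why a dense list like this is only practical for small k.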
Funding: STI 2030-Major Projects (2023ZD04069); the National Natural Science Foundation of China (grant no. 32125030); the Pinduoduo-China Agricultural University Research Fund (PC2023A01003); the Major Program of the National Agricultural Science and Technology of China (NK20220601).
Abstract: Wheat is a staple food for more than 35% of the world's population, with wheat flour used to make hundreds of baked goods. Superior end-use quality is a major breeding target; however, improving it is especially time-consuming and expensive. Furthermore, genes encoding seed-storage proteins (SSPs) form multigene families and are repetitive, with gaps commonplace in several genome assemblies. To overcome these barriers and efficiently identify superior wheat SSP alleles, we developed "PanSK" (Pan-SSP k-mer) for genotype-to-phenotype prediction based on an SSP-based pangenome resource. PanSK uses 29-mer sequences that represent each SSP gene at the pangenomic level to reveal untapped diversity across landraces and modern cultivars. Genome-wide association studies with k-mers identified 23 SSP genes associated with end-use quality that represent novel targets for improvement. We evaluated the effect of rye secalin genes on end-use quality and found that removal of ω-secalins from 1BL/1RS wheat translocation lines is associated with enhanced end-use quality. Finally, using machine-learning-based prediction inspired by PanSK, we predicted the quality phenotypes with high accuracy from genotypes alone. This study provides an effective approach for genome design based on SSP genes, enabling the breeding of wheat varieties with superior processing capabilities and improved end-use quality.
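One way to picture k-mers that "represent" a gene is to keep only the k-mers found in that gene and in no other, then score how many of them appear in a sample. The sketch below does this on toy sequences with k = 3. It is an illustrative assumption, not PanSK's actual procedure: the abstract states only that 29-mers are used, and the function names, the gene-specific selection rule, and the scoring are hypothetical.

```python
def kmers(seq, k):
    """Set of all overlapping k-mers in seq (forward strand only)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def gene_specific_kmers(genes, k):
    """Map each gene name to the k-mers found in that gene and no other."""
    all_sets = {name: kmers(seq, k) for name, seq in genes.items()}
    specific = {}
    for name, own in all_sets.items():
        others = set().union(*(s for n, s in all_sets.items() if n != name))
        specific[name] = own - others
    return specific


def presence_profile(sample_seq, markers, k):
    """Fraction of each gene's marker k-mers that occur in the sample."""
    sample = kmers(sample_seq, k)
    return {
        name: (len(ks & sample) / len(ks) if ks else 0.0)
        for name, ks in markers.items()
    }


# Toy data with k = 3; the abstract reports k = 29 on real SSP genes.
genes = {"geneA": "AAACCC", "geneB": "CCCGGG"}
markers = gene_specific_kmers(genes, 3)
profile = presence_profile("TTAAACCTT", markers, 3)
```

Per-gene scores of this kind could serve as genotype features for association tests or a predictive model. Real sequencing data would also need reverse-complement handling and tolerance for sequencing errors, both of which this sketch omits.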