Funding: National Natural Science Foundation of China, Grant/Award Number: 62203236; Fundamental Research Funds for the Central Universities, Nankai University, Grant/Award Number: 63231137.
Abstract: Recent advances in single-cell chromatin accessibility sequencing (scCAS) technologies have yielded new insights into epigenomic heterogeneity and have increased the need for automatic cell type annotation. However, existing automatic annotation methods for scCAS data fail to incorporate reference data and neglect novel cell types, which exist only in the test set. Here, we propose RAINBOW, a reference-guided automatic annotation method based on a contrastive learning framework, which is capable of effectively identifying novel cell types in the test set. By utilizing contrastive learning and incorporating reference data, RAINBOW can effectively characterize the heterogeneity of cell types, thereby facilitating more accurate annotation. With extensive experiments on multiple scCAS datasets, we show the advantages of RAINBOW over state-of-the-art methods in annotating both known and novel cell types. We also verify the effectiveness of incorporating reference data during the training process. In addition, we demonstrate the robustness of RAINBOW to data sparsity and to the number of cell types. Furthermore, RAINBOW provides superior performance on newly sequenced data and can reveal biological implications in downstream analyses. All the results demonstrate the superior performance of RAINBOW in cell type annotation for scCAS data. We anticipate that RAINBOW will offer essential guidance and great assistance in scCAS data analysis. The source code is available on GitHub (BioX-NKU/RAINBOW).
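The abstract does not state which contrastive objective RAINBOW uses. As a rough illustration of the general idea only, the sketch below implements a generic supervised contrastive loss, in which labelled reference cells of the same type are pulled together in embedding space and cells of different types are pushed apart. The function name, the `temperature` parameter, and the list-of-lists embedding format are all assumptions made for this example and are not taken from the RAINBOW code.

```python
import math


def supervised_contrastive_loss(embeddings, labels, temperature=0.5):
    """Generic supervised contrastive loss (hypothetical helper, not RAINBOW's code).

    embeddings: list of equal-length float vectors, one per cell.
    labels: list of cell type labels, one per cell.
    """
    # L2-normalise each embedding so dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec)) or 1.0
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, anchors = 0.0, 0
    for i in range(n):
        # Positives are the other cells that share the anchor's label.
        positives = [j for j in range(n) if j != i and labels[j] == labels[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled similarity of the anchor to every other cell.
        sims = {
            j: sum(a * b for a, b in zip(normed[i], normed[j])) / temperature
            for j in range(n)
            if j != i
        }
        log_denominator = math.log(sum(math.exp(s) for s in sims.values()))
        # Average negative log-probability of picking a positive.
        total += -sum(sims[j] - log_denominator for j in positives) / len(positives)
        anchors += 1
    return total / anchors if anchors else 0.0
```

When embeddings of the same cell type lie close together the loss is small, and it grows when cells carrying the same label are scattered among cells of other types, which is the behaviour a contrastive annotation model is trained to avoid.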
Funding: Supported by the National Natural Science Foundation of China (Grant No. 62203236) and the Fundamental Research Funds for the Central Universities, Nankai University (63231137).
Abstract: 1 Introduction. Recent advances in single-cell RNA sequencing (scRNA-seq) have enabled the study of how individual cells respond to various external perturbations, such as drug stimulation, at the gene expression level [1]. Precisely inferring perturbation responses allows us to explore how and why individual tumor cells evade cancer treatment, greatly advancing personalized medicine research and deepening our understanding of biological mechanisms [2]. However, considering the high cost of sequencing and the complexity of obtaining perturbed samples, computational methods that predict cellular responses to perturbations hold great potential [2].
Funding: Supported by the National Key Research and Development Program of China (2021YFF1200902, 2023YFF1204802), the National Natural Science Foundation of China (62203236, 62273194), and the Fundamental Research Funds for the Central Universities, Nankai University (63231137).
Abstract: Recent advances in single-cell sequencing technologies have significant implications for understanding cellular heterogeneity, developmental biology, and disease mechanisms. To fully exploit the potential of these data, numerous tools have been proposed for upstream and downstream analyses. In the single-cell RNA sequencing (scRNA-seq) community, scRNA-tools (Zappia et al., 2018) was proposed to help researchers navigate the plethora of tools by category. Since its inception, scRNA-tools has been widely used, and its updated version, with over 1000 collected tools, further reveals trends in the field (Zappia and Theis, 2021), providing valuable guidance in selecting tools for analyses.
Funding: Partially supported by the National Key R&D Program of China (Grant No. 2018YFC0910404), the National Natural Science Foundation of China (Grant Nos. 61873141, 61721003, 61573207, 71871019, 71471016, 71531013, and 71729001), and the Tsinghua-Fuzhou Institute for Data Technology, China.
Abstract: Establishing a landscape of enhancers across human cells is crucial to deciphering the mechanisms of gene regulation, cell differentiation, and disease development. High-throughput experimental approaches, which have successfully reported enhancers in typical cell lines, are still too costly and time-consuming for the systematic identification of enhancers specific to different cell lines. Existing computational methods, which predict regulatory elements purely from DNA sequences, lack the power of cell line-specific screening. Recent studies have suggested that the chromatin accessibility of a DNA segment is closely related to its potential regulatory function, and thus may provide useful information for identifying regulatory elements. Motivated by this understanding, we integrate DNA sequences and chromatin accessibility data to accurately predict enhancers in a cell line-specific manner. We propose DeepCAPE, a deep convolutional neural network that predicts enhancers by integrating DNA sequences and DNase-seq data. Benefiting from a well-designed feature extraction mechanism and a skip connection strategy, our model not only consistently outperforms existing methods in the imbalanced classification of cell line-specific enhancers against background sequences, but also adapts to datasets of different sizes. In addition, with the adoption of an autoencoder, our model is capable of making cross-cell line predictions. We further visualize the kernels of the first convolutional layer and show that the identified sequence signatures match known motifs. Finally, we demonstrate the potential of our model to explain the functional implications of putative disease-associated genetic variants and to discriminate disease-related enhancers. The source code and a detailed tutorial of DeepCAPE are freely available at https://github.com/ShengquanChen/DeepCAPE.
Funding: This work has been supported by the National Key Research and Development Program of China (No. 2018YFC0910404), the National Natural Science Foundation of China (Nos. 61873141, 61721003, 61573207, 71871019, and 71471016), and the Tsinghua-Fuzhou Institute for Data Technology.
Abstract: Background: Transcription factors are among the most important regulators in the transcriptional process. Nevertheless, the functional interpretation of transcription factors remains a major challenge due to the poor performance of methods that relate regulatory regions to genes. Epigenetic information, such as chromatin accessibility, contains genome-wide knowledge about transcriptional regulation and thus may shed light on the functional interpretation of transcription factors. Methods: We propose EpiFIT (Epigenetic based Functional Interpretation of Transcription factors), a tool to infer the functions of transcription factors from ChIP-seq data. Briefly, we adopt a variable distance rule to establish associations between regulatory regions and nearby genes. The associations are then filtered to ensure that the remaining regions and associated genes are co-open. Finally, GO enrichment is applied to all related genes, and a ranked list of GO terms is provided as the functional interpretation. Results: We first examined the correlation of chromatin openness between regulatory regions and associated genes. This correlation helps EpiFIT purify regulatory region-gene associations. By evaluating EpiFIT on a set of real data, we demonstrated that EpiFIT outperforms other existing methods in precisely interpreting transcription factor functions. We further verify the efficiency of openness in interpretation and the ability of EpiFIT to build distal region-gene associations. Conclusion: EpiFIT is a powerful tool for interpreting transcription factor functions. We believe EpiFIT will facilitate the functional interpretation of other regulatory elements, and thus open a new door to understanding regulatory mechanisms. Availability: The application is freely accessible at bioinfo.au.tsinghua.edu.cn/openness/EpiFIT/.
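The three steps named in the Methods (distance-based association, co-openness filtering, GO enrichment ranking) can be pictured with the minimal sketch below. It is not EpiFIT's implementation: the abstract gives no details of the variable distance rule, the openness measure, or the enrichment statistic, so the per-gene window, the `threshold` parameter, the use of a hypergeometric test, and every function name here are assumptions chosen for illustration.

```python
from math import comb


def associate_regions(regions, genes):
    """Pair each region with genes whose (hypothetical) per-gene window covers it.

    regions: {region_id: genomic position}
    genes: {gene_name: (tss_position, window_size)}; a gene-specific window
    stands in for the unspecified "variable distance rule".
    """
    pairs = []
    for region_id, position in regions.items():
        for gene, (tss, window) in genes.items():
            if abs(position - tss) <= window:
                pairs.append((region_id, gene))
    return pairs


def keep_co_open(pairs, region_openness, gene_openness, threshold=1.0):
    """Keep pairs where both region and gene reach an assumed openness threshold."""
    return [
        (region_id, gene)
        for region_id, gene in pairs
        if region_openness.get(region_id, 0.0) >= threshold
        and gene_openness.get(gene, 0.0) >= threshold
    ]


def hypergeom_pvalue(k, population, annotated, drawn):
    """Upper-tail hypergeometric probability P(X >= k)."""
    upper = min(annotated, drawn)
    hits = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return hits / comb(population, drawn)


def rank_go_terms(selected_genes, go_annotations, background_size):
    """Rank GO terms by enrichment p-value, most significant first."""
    selected = set(selected_genes)
    ranked = []
    for term, members in go_annotations.items():
        members = set(members)
        overlap = len(selected & members)
        p = hypergeom_pvalue(overlap, background_size, len(members), len(selected))
        ranked.append((term, p))
    return sorted(ranked, key=lambda item: item[1])
```

In use, the genes surviving `keep_co_open` would be passed to `rank_go_terms` together with a GO annotation table and the size of the background gene set; the top of the returned list plays the role of the ranked functional interpretation described in the abstract.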