Funding: This work was supported by the National Center for Tumor Diseases (NCT) Molecular Precision Oncology Program and the NCT Donations against Cancer Program, Germany.
Abstract: Functional enrichment analysis, or gene set enrichment analysis, is a basic bioinformatics method that evaluates the biological importance of a list of genes of interest. However, it may produce a long list of significant terms with highly redundant information that is difficult to summarize. Current tools that simplify enrichment results by clustering them into groups either still produce redundancy between clusters or do not retain consistent term similarities within clusters. We propose a new method named binary cut for clustering similarity matrices of functional terms. Through comprehensive benchmarks on both simulated and real-world datasets, we demonstrated that binary cut could efficiently cluster functional terms into groups where terms showed consistent similarities within groups and were mutually exclusive between groups. We compared binary cut clustering on the similarity matrices obtained from different similarity measures and found that semantic similarity worked well with binary cut, while similarity matrices based on gene overlap showed less consistent patterns. We implemented the binary cut algorithm in the R package simplifyEnrichment, which additionally provides functionalities for visualizing, summarizing, and comparing the clustering. The simplifyEnrichment package and the documentation are available at https://bioconductor.org/packages/simplifyEnrichment/.
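To make the phrase "similarity matrices based on gene overlap" concrete, the following is a minimal sketch of one common overlap measure, the Jaccard coefficient, computed between the gene sets annotated to each term. It is an illustration only: the choice of Jaccard, the function names, and the toy term and gene identifiers are assumptions, and this is not the simplifyEnrichment implementation or the binary cut algorithm itself.

```python
from itertools import combinations


def jaccard(genes_a, genes_b):
    """Jaccard coefficient: shared genes divided by all genes in either set."""
    a, b = set(genes_a), set(genes_b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_similarity_matrix(term_genes):
    """Build a symmetric term-by-term similarity matrix from gene sets.

    term_genes maps a term name to the collection of genes annotated to it.
    Returns the sorted term names and a nested list of similarities.
    """
    terms = sorted(term_genes)
    n = len(terms)
    # A term is identical to itself, so the diagonal is 1.0.
    sim = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    for i, j in combinations(range(n), 2):
        value = jaccard(term_genes[terms[i]], term_genes[terms[j]])
        sim[i][j] = value
        sim[j][i] = value
    return terms, sim


# Hypothetical toy input: term and gene names are made up for illustration.
toy_terms = {
    "term_A": {"g1", "g2", "g3"},
    "term_B": {"g2", "g3", "g4"},
    "term_C": {"g7", "g8"},
}
terms, sim = overlap_similarity_matrix(toy_terms)
```

A matrix like this would be the input to a clustering step; the abstract reports that matrices built from semantic similarity gave more consistent patterns than overlap-based ones.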
Funding: This work was supported by the National Cancer Institute (NCI), National Institutes of Health (NIH), USA (Grant Nos. CA162218 awarded to SL and HZ, CA105274 awarded to LHK, and CA195565 awarded to LHK and CBA); by the NCI (Grant No. P30CA016056 awarded to Roswell Park Comprehensive Cancer Center, involving the use of the DBBR, Genomic, Bioinformatics, and Biostatistics Shared Resources); and by the Breast Cancer Research Foundation, USA.
Abstract: As next-generation sequencing (NGS) technology has become widely used to identify genetic causal variants for various diseases and traits, a number of packages for checking NGS data quality have sprung up in the public domain. In addition to the quality of sequencing data, sample quality issues, such as gender mismatch, abnormal inbreeding coefficient, cryptic relatedness, and population outliers, can also have a fundamental impact on downstream analysis. However, there is a lack of tools specialized in identifying problematic samples from NGS data, often due to the limitation of sample size and variant counts. We developed SeqSQC, a Bioconductor package, to automate and accelerate sample cleaning in NGS data of any scale. SeqSQC is designed for efficient data storage and access, and is equipped with interactive plots for intuitive data visualization to expedite the identification of problematic samples. SeqSQC is available at http://bioconductor.org/packages/SeqSQC.
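As an illustration of one of the sample-level checks named above, the sketch below computes a per-sample inbreeding coefficient with the standard heterozygosity-based estimator, F = 1 - (observed heterozygous calls) / (expected heterozygous calls under Hardy-Weinberg equilibrium). This is a textbook estimator shown under stated assumptions, not SeqSQC's own code: the function name, the genotype encoding (0, 1, or 2 alternate alleles, `None` for a missing call), and the toy values are all hypothetical.

```python
def inbreeding_coefficient(genotypes, alt_freqs):
    """Estimate F = 1 - observed_het / expected_het for one sample.

    genotypes: alternate-allele counts per variant (0, 1, 2), None if missing.
    alt_freqs: population alternate-allele frequency for each variant.
    """
    if len(genotypes) != len(alt_freqs):
        raise ValueError("genotypes and alt_freqs must have the same length")
    observed_het = 0
    expected_het = 0.0
    for genotype, p in zip(genotypes, alt_freqs):
        if genotype is None:
            continue  # skip missing calls in both counts
        if genotype == 1:
            observed_het += 1
        # Hardy-Weinberg probability of a heterozygous call at this variant.
        expected_het += 2.0 * p * (1.0 - p)
    if expected_het == 0.0:
        raise ValueError("no informative variants (expected heterozygosity is 0)")
    return 1.0 - observed_het / expected_het


# Hypothetical toy data: four variants, each with allele frequency 0.5.
freqs = [0.5, 0.5, 0.5, 0.5]
fully_homozygous = inbreeding_coefficient([0, 2, 0, 2], freqs)
as_expected = inbreeding_coefficient([1, 1, 0, 2], freqs)
```

Values far above zero indicate an excess of homozygous calls, and values far below zero indicate an excess of heterozygous calls; what counts as "abnormal" depends on thresholds that the abstract does not specify.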