Funding: National Natural Science Foundation of China (62072376 and U1811262); Guangdong Provincial Basic and Applied Research Fund Project (2022A1515010144); Innovation Capability Support Program of Shaanxi (2022KJXX-75); Fundamental Research Funds for the Central Universities (D5000230056).
Abstract: The potential to identify individuals at high disease risk solely from genotype data has garnered significant interest. Although widely applied, traditional polygenic risk scoring methods fall short, as they are built on additive models that fail to capture the intricate associations among single nucleotide polymorphisms (SNPs). This is a limitation, as genetic diseases often arise from complex interactions between multiple SNPs. To address this challenge, we developed DeepRisk, a biological knowledge-driven deep learning method for modeling these complex, nonlinear associations among SNPs, to provide a more effective method for scoring the risk of common diseases with genome-wide genotype data. Evaluations demonstrated that DeepRisk outperforms existing PRS-based methods in identifying individuals at high risk for four common diseases: Alzheimer's disease, inflammatory bowel disease, type 2 diabetes, and breast cancer.
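For orientation, the additive model that the abstract contrasts with DeepRisk can be sketched as a weighted sum of allele dosages. This is a minimal illustration of a conventional additive polygenic risk score, not the DeepRisk method; the function name `additive_prs` and the toy dosages and effect sizes are assumptions made up for the example.

```python
from typing import List, Sequence


def additive_prs(dosages: Sequence[Sequence[float]],
                 effect_sizes: Sequence[float]) -> List[float]:
    """Return one additive score per individual.

    dosages[i][k] is the number of risk alleles (0, 1 or 2) that
    individual i carries at SNP k; effect_sizes[k] is the per-allele
    weight of SNP k (hypothetical values here).
    """
    scores = []
    for person in dosages:
        if len(person) != len(effect_sizes):
            raise ValueError("each individual needs one dosage per SNP")
        # Each SNP contributes independently: no interaction terms.
        scores.append(sum(d * b for d, b in zip(person, effect_sizes)))
    return scores


# Toy data (assumed, for illustration only): 3 individuals, 3 SNPs.
toy_dosages = [
    [0, 1, 2],
    [2, 0, 1],
    [1, 1, 0],
]
toy_betas = [0.10, -0.05, 0.20]
scores = additive_prs(toy_dosages, toy_betas)
```

Because every SNP enters the sum independently, a score of this form cannot represent an effect that appears only when two variants occur together, which is the limitation the abstract describes.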
Funding: The authors thank the anonymous referees for their useful comments, which greatly improved the quality of the paper. This work was supported in part by the National Basic Research Program 973 of China (2012CB316203), the Natural Science Foundation of China (Grant Nos. 61033007, 61272121, 61332014, 61572367, 61332006, 61472321, and 61502390), the National High Technology Research and Development Program 863 of China (2015AA015307), the Fundamental Research Funds for the Central Universities (3102015JSJ0011, 3102014JSJ0005, and 3102014JSJ0013), and the Graduate Starting Seed Fund of Northwestern Polytechnical University (Z2012128).
Abstract: The order-preserving submatrix (OPSM) has become important in modelling biologically meaningful subspace clusters, capturing the general tendency of gene expression across a subset of conditions. With advances in microarray and analysis techniques, large volumes of gene expression data and OPSM mining results are being produced. An OPSM query can efficiently retrieve relevant OPSMs from this large body of OPSM datasets. However, improving the relevance of OPSM queries remains difficult in real-life exploratory data analysis. First, it is hard to capture subjective aspects of interestingness, e.g., the analyst's expectations given her or his domain knowledge. Second, even when these expectations can be specified declaratively, it is still challenging to use them during the computation of OPSM queries. To the best of our knowledge, existing methods mainly focus on batch OPSM mining, while few works address OPSM queries. To solve these problems, the paper proposes two constrained OPSM query methods, which exploit user-defined constraints to search for relevant results using two kinds of indices. Extensive experiments on real datasets demonstrate that queries based on the multi-dimension index (cIndex) and the enumerating sequence index (esIndex) perform better than brute-force search.
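To make the order-preserving idea concrete, the sketch below checks which genes (rows) have strictly increasing expression along a given ordering of conditions (columns). It is a plain linear scan in the spirit of the brute-force baseline, not the cIndex or esIndex method; the function names and the toy matrix are assumptions introduced for illustration, and strict increase is one common way to define the pattern.

```python
from typing import List, Sequence


def follows_order(row: Sequence[float], column_order: Sequence[int]) -> bool:
    """True if the row's values strictly increase along column_order."""
    values = [row[c] for c in column_order]
    return all(a < b for a, b in zip(values, values[1:]))


def rows_supporting(matrix: Sequence[Sequence[float]],
                    column_order: Sequence[int]) -> List[int]:
    """Indices of rows that preserve the given column order.

    Together with column_order, the returned rows form an
    order-preserving submatrix under the strict-increase definition.
    """
    return [i for i, row in enumerate(matrix)
            if follows_order(row, column_order)]


# Toy expression matrix (assumed): 3 genes x 4 conditions.
expression = [
    [0.2, 1.5, 0.9, 3.0],
    [1.0, 4.0, 2.0, 5.0],
    [2.5, 0.3, 1.1, 0.7],
]
# Query: which genes satisfy condition 0 < condition 2 < condition 1?
hits = rows_supporting(expression, [0, 2, 1])
```

A scan like this touches every row for every query, which is the cost an index-based query is meant to avoid.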
Funding: Supported in part by the Nanyang Technological University Startup (M4081842.110), the Singapore Ministry of Education Academic Research Fund (Tier 1 RG126/16, RG31/18), and the National Natural Science Foundation of China (Grant Nos. 61702421, 61332014, 61772426).
Abstract: The packing of genomic DNA from the double helix into highly ordered hierarchical assemblies has a great impact on chromosome flexibility, dynamics and function. The open and accessible regions of chromosomes are primary binding positions for regulatory elements and are crucial to nuclear processes and biological functions. Motivated by the success of the flexibility-rigidity index (FRI) in biomolecular flexibility analysis and drug design, we propose an FRI-based model for quantitatively characterizing chromosome flexibility. Based on Hi-C data, a flexibility index can be evaluated for each locus. Physically, flexibility is tightly related to packing density: highly compacted regions are usually more rigid, while loosely packed regions are more flexible. Indeed, a strong correlation is found between our flexibility index and DNase and ATAC values, which are measurements of chromosome accessibility. In addition, genome regions with higher chromosome flexibility have a higher chance of being bound by transcription factors. Recently, the Gaussian network model (GNM) has been applied to analyze chromosome accessibility, and a mobility profile has been proposed to characterize chromosome flexibility. Compared with GNM, our FRI is slightly more accurate (1% to 2% increase) and significantly more efficient in both computational time and cost. For Hi-C data at 5 kb resolution, the flexibility evaluation takes FRI only a few minutes on a single-core processor. In contrast, GNM requires 1.5 hours on 10 CPUs. Moreover, interchromosome interactions can be easily incorporated into the flexibility evaluation, further enhancing the accuracy of our FRI. In contrast, including interchromosome information in GNM significantly increases the size of its Laplacian (or Kirchhoff) matrix, making the computation extremely challenging for the current GNM. The software and supplementary document are available at https://github.com/jiajiepeng/FRI_chrFle.
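The relation between packing density and flexibility can be illustrated with the generic FRI construction: a rigidity index that sums a decaying kernel over distances to all other loci, and a flexibility index taken as its reciprocal. The sketch below is a hedged illustration, not the authors' implementation. It assumes a precomputed locus-to-locus distance matrix (how distances are derived from Hi-C contacts is not stated in the abstract), a generalized exponential kernel, unit weights, exclusion of the self term, and hypothetical parameter names `eta` and `kappa`.

```python
import math
from typing import List, Sequence


def flexibility_index(distances: Sequence[Sequence[float]],
                      eta: float = 1.0,
                      kappa: float = 2.0) -> List[float]:
    """Flexibility index per locus from a symmetric distance matrix.

    rigidity_i    = sum over j != i of exp(-(d_ij / eta) ** kappa)
    flexibility_i = 1 / rigidity_i
    Kernel choice, unit weights and parameters are assumptions.
    """
    n = len(distances)
    flexibility = []
    for i in range(n):
        rigidity = sum(
            math.exp(-((distances[i][j] / eta) ** kappa))
            for j in range(n) if j != i
        )
        # A locus with no neighbours in range is treated as fully flexible.
        flexibility.append(1.0 / rigidity if rigidity > 0 else math.inf)
    return flexibility


# Toy distances (assumed): loci 0-2 are tightly packed, locus 3 is remote.
toy_distances = [
    [0.0, 1.0, 1.0, 4.0],
    [1.0, 0.0, 1.0, 4.0],
    [1.0, 1.0, 0.0, 4.0],
    [4.0, 4.0, 4.0, 0.0],
]
flex = flexibility_index(toy_distances, eta=2.0)
```

In this toy case the remote locus receives a much larger flexibility index than the three tightly packed loci. The computation needs only a pass over the distance matrix and no matrix decomposition, in line with the efficiency argument made in the abstract.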
Funding: This work was supported by the National Natural Science Foundation of China (Nos. 61602386 and 61332014), the Natural Science Foundation of Shaanxi Province (No. 2017JQ6008), and the top university visiting foundation for excellent youth scholars of Northwestern Polytechnical University.
Abstract: Background: One of the most important and challenging issues in biomedicine and genomics is how to identify disease-related genes. Datasets from high-throughput biotechnologies have been widely used to address this issue from various perspectives, e.g., epigenomics, genomics, transcriptomics, proteomics, and metabolomics. At the genomic level, copy number variations (CNVs) have been recognized as critical genetic variations that contribute significantly to genomic diversity. They have been associated with both common and complex diseases, and thus have a large influence on a variety of Mendelian and somatic genetic disorders. Results: In this review, based on a variety of complex diseases, we give an overview of the critical role of CNVs in identifying disease-related genes, and discuss in detail the different high-throughput and sequencing methods applied to CNV detection. Some limitations and challenges concerning CNVs are also highlighted. Conclusions: Reliable detection of CNVs will not only allow driver mutations to be discriminated for various diseases, but will also help to develop personalized medicine when integrated with other genomic features.