Abstract: Gastrointestinal (GI) cancers are a diverse set of diseases affecting many organs. The five most frequent GI cancer types are esophageal cancer, gastric cancer (GC), liver cancer, pancreatic cancer, and colorectal cancer (CRC); together, they account for 5 million new cases and 3.5 million deaths annually. We provide information about molecular changes crucial to tumorigenesis, tumor behavior, and prognosis. During the formation of cancer cells, the genomic changes seen in GC and CRC are microsatellite instability with multiple chromosomal arrangements. The genomically stable subtype is observed in GC and pancreatic cancer. Besides these genomic subtypes, CRC shows epigenetic modification (hypermethylation) associated with a poor prognosis. Pathway information highlights functions shared by GI cancers, such as apoptosis; focal adhesion; and the p21-activated kinase, phosphoinositide 3-kinase/Akt, transforming growth factor beta, and Toll-like receptor signaling pathways. These pathways are involved in survival, cell proliferation, and cell motility. The immune response and inflammation are also essential elements of the shared functions. We also retrieved protein-protein interaction information from the STRING database and found that the proteins Akt1, catenin beta 1 (CTNNB1), E1A binding protein P300 (EP300), tumor protein p53 (TP53), and TP53 binding protein 1 (TP53BP1) are central nodes in the network. The protein expression of these genes is associated with overall survival in some GI cancers: low TP53BP1 expression in CRC, high EP300 expression in esophageal cancer, and increased Akt1/TP53 expression or low CTNNB1 expression in GC are associated with a poor prognosis. The Kaplan-Meier plotter database also confirmed the association between expression of the five central genes and GC survival rates. In conclusion, GI cancers are very diverse at the molecular level. However, the shared mutations and protein pathways might be used to better understand these cancers and to reveal diagnostic/prognostic markers or drug targets.
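The idea of a "central node" in a protein-protein interaction network can be illustrated with degree centrality. What follows is a minimal sketch, not the authors' analysis: the abstract does not say which centrality measure was used, the edge list is a made-up toy network with placeholder node names (it is not STRING data), and `degree_centrality` is a hypothetical helper.

```python
from collections import defaultdict


def degree_centrality(edges):
    """Return {node: degree / (n - 1)} for an undirected edge list."""
    neighbors = defaultdict(set)
    for a, b in edges:
        if a == b:
            continue  # ignore self-loops
        neighbors[a].add(b)
        neighbors[b].add(a)
    n = len(neighbors)
    if n < 2:
        return {node: 0.0 for node in neighbors}
    return {node: len(nbrs) / (n - 1) for node, nbrs in neighbors.items()}


# Toy network with placeholder protein names (illustrative only).
toy_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P4", "P5")]
centrality = degree_centrality(toy_edges)

# Rank nodes from most to least connected.
hubs = sorted(centrality, key=centrality.get, reverse=True)
print(hubs[0], centrality[hubs[0]])
```

In this toy graph the node with the most interaction partners ranks first; other measures (betweenness, closeness) could rank nodes differently.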
Abstract: Explainable artificial intelligence aims to interpret how machine learning models make decisions, and many model explainers have been developed in the computer vision field. However, the applicability of these model explainers to biological data is still poorly understood. In this study, we comprehensively evaluated multiple explainers by interpreting pre-trained models that predict tissue types from transcriptomic data and by identifying, for each sample, the top contributing genes with the greatest impact on model prediction. To improve the reproducibility and interpretability of the results generated by model explainers, we proposed a series of optimization strategies for each explainer on two model architectures: multilayer perceptron (MLP) and convolutional neural network (CNN). We observed three groups of explainer and model architecture combinations with high reproducibility. Group II, which contains three model explainers on aggregated MLP models, identified top contributing genes in different tissues that exhibited tissue-specific manifestation and were potential cancer biomarkers. In summary, our work provides novel insights and guidance for exploring biological mechanisms using explainable machine learning models.
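One possible reading of "top contributing genes" on "aggregated" models is to average per-gene attribution scores over several trained models and rank genes by mean absolute score. The sketch below assumes that reading; the abstract does not give the aggregation rule, and the function name, gene names, and scores are all hypothetical.

```python
def top_contributing_genes(attributions, k):
    """Average per-gene attribution scores across models; return top-k genes.

    attributions: list of {gene: score} dicts, one per model, for one sample.
    """
    if not attributions:
        return []
    genes = set().union(*attributions)
    mean_abs = {
        g: sum(abs(a.get(g, 0.0)) for a in attributions) / len(attributions)
        for g in genes
    }
    # Sort by descending importance; break ties by gene name for determinism.
    ranked = sorted(mean_abs, key=lambda g: (-mean_abs[g], g))
    return ranked[:k]


# Hypothetical attribution scores from two models for a single sample.
scores_per_model = [
    {"geneA": 0.9, "geneB": -0.1, "geneC": 0.3},
    {"geneA": 0.7, "geneB": 0.2, "geneC": -0.5},
]
print(top_contributing_genes(scores_per_model, k=2))
```

Averaging over models is one way to reduce run-to-run variation in the ranking, which is the reproducibility concern the abstract raises.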
Abstract: Omics data provide an essential means for molecular biology and systems biology to capture the systematic properties of the inner activities of cells. One of the hardest problems biological researchers face is finding methods to discover biomarkers for tracking the progress of diseases such as cancer. Feature selection methods have therefore been widely used for biomarker discovery. However, omics data usually contain a large number of features but a small number of samples, and some omics data are distributed over a wide range, which makes it difficult for feature selection methods to handle omics data. To overcome these problems, we present a computing method called localized statistic of abundance distribution based on Gaussian window (LSADBGW) to test the significance of a feature. Experiments on three datasets, including gene and protein datasets, showed the accuracy and efficiency of LSADBGW for feature selection.
Abstract: This review comprehensively explores the core applications of artificial intelligence (AI) in genomics and bioinformatics and analyzes how AI drives scientific innovation in these fields. In genomics and bioinformatics, AI is propelling a deeper understanding of complex genetic mechanisms and the development of innovative therapeutic approaches. The precision of AI in genomic sequence analysis, coupled with breakthroughs in precise gene editing such as AI-designed gene editors, significantly enhances our comprehension of gene functions and disease associations. Moreover, AI’s capabilities in disease prediction, which assess individual disease risks through genomic data analysis, provide robust support for personalized medicine. AI applications extend beyond gene identification, gene expression pattern prediction, and genomic structural variant analysis to key areas such as epigenetics, multi-omics data integration, genetic disease diagnosis, evolutionary genomics, and non-coding RNA function prediction. Despite challenges including data privacy, algorithm transparency, and bioethical issues, AI is expected to continue revolutionizing genomics and bioinformatics, ushering in a new era of personalized medicine and precision treatments.
Abstract: Objective: To study Pseudomonas aeruginosa gene chip data at the omics level, aiming to elucidate the co-expression network characteristics of the virulence genes exoS and exoU of P. aeruginosa in the lower respiratory tract from the perspective of molecular biology and to identify their key regulatory genes. Methods: From March 2016 to May 2018, 312 patients with P. aeruginosa infection of the lower respiratory tract who were admitted to the Department of Respiratory Medicine of Baogang Hospital and received follow-up treatment in the hospital were selected as subjects by cluster sampling. Alveolar lavage fluid and sputum collected from these patients were used as biological specimens. P. aeruginosa genes were detected with oligonucleotide probes for pre-processing of the chip data. Eight common antibiotics against Gram-negative bacteria (ceftazidime, gentamicin, piperacillin, amikacin, ciprofloxacin, levofloxacin, doripenem, and ticarcillin) were selected to determine the drug resistance of the specimens. The MCODE algorithm was used to construct a co-expression network model of drug-resistance genes centered on exoS/exoU. Results: The expression level of exoS/exoU in the drug-resistance group was significantly higher than that in the non-resistance group (p<0.05). In the alveolar lavage fluid specimens from the drug-resistance group, the top 5 differentially expressed genes, from high to low, were RAC1, ITGB1, ITGB5, CRK, and IGF1R. In the sputum specimens, the top 5 differentially expressed genes were RAC1, CRK, IGF1R, ITGB1, and ITGB5. In the alveolar lavage fluid specimens, only RAC1 was positively correlated with the expression of exoS and exoU (p<0.05). In the sputum specimens, RAC1, ITGB1, ITGB5, CRK, and IGF1R were all positively correlated with the expression of exoS and exoU (p<0.05). The co-expression network comprised exoS, exoU, RAC1, ITGB1, ITGB5, CRK, CAMK2D, RHOA, FLNA, IGF1R, TGFBR2, and FOS. Among them, RAC1 had the highest regulatory ability score (72.00) and the largest number of regulatory genes (6), followed by ITGB1, ITGB5, and CRK. Conclusions: High expression of exoS and exoU in sputum specimens suggests that P. aeruginosa is more likely to become resistant to antibiotics; RAC1, ITGB1, ITGB5, and CRK may be key genes regulating the expression of exoS and exoU.
Funding: This research was funded by the National Key R&D Program of China (2022YFC2106000), the National Natural Science Foundation of China (32300529, 32201242, 12326611), the Tianjin Synthetic Biotechnology Innovation Capacity Improvement Projects (TSBICIP-PTJS-001, TSBICIP-PTJJ-007), the Major Program of Haihe Laboratory of Synthetic Biology (22HHSWSS00021), and the Strategic Priority Research Program of the Chinese Academy of Sciences (XDC0120201).
Abstract: Proteins play a pivotal role in coordinating the functions of organisms, essentially governing their traits, as the dynamic arrangement of diverse amino acids leads to a multitude of folded configurations within peptide chains. Despite dynamic changes in the amino acid composition of an individual protein (referred to as AAP) and great variance in protein expression levels under different conditions, our study, which uses transcriptomics data from four model organisms, uncovers surprising stability in the overall amino acid composition of the total cellular proteins (referred to as AACell). Although this value may vary between species, we observed no significant differences among distinct strains of the same species. This indicates that organisms enforce system-level constraints to maintain a consistent AACell, even amid fluctuations in AAP and protein expression. Further exploration of this phenomenon promises insights into the intricate mechanisms orchestrating cellular protein expression and adaptation to varying environmental challenges.
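The distinction between AAP and AACell can be made concrete with a small calculation. The sketch below assumes that AACell is an expression-weighted sum of per-protein amino acid counts, normalized to fractions; the abstract does not state the exact formula, and the sequences, expression values, and function names are illustrative only.

```python
from collections import Counter


def aap(sequence):
    """Amino acid composition of one protein as fractions (AAP)."""
    counts = Counter(sequence)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {aa: c / total for aa, c in counts.items()}


def aacell(sequences, expression):
    """Expression-weighted composition over all proteins (assumed AACell)."""
    weighted = Counter()
    for name, seq in sequences.items():
        level = expression.get(name, 0.0)  # unexpressed proteins contribute nothing
        for aa, c in Counter(seq).items():
            weighted[aa] += level * c
    total = sum(weighted.values())
    if total == 0:
        return {}
    return {aa: w / total for aa, w in weighted.items()}


# Toy sequences and expression levels (hypothetical values).
seqs = {"p1": "AAK", "p2": "KKKL"}
expr = {"p1": 2.0, "p2": 1.0}
print(aap(seqs["p1"]))
print(aacell(seqs, expr))
```

Under this assumed definition, changing the expression levels shifts AACell even when every individual AAP is fixed, which is why a stable AACell across conditions would be a non-trivial observation.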
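Abstract: Life-omics big data is an important fundamental and strategic national resource, of great significance for supporting basic research and applied innovation in the life sciences, promoting the innovative development of the bioeconomy, and safeguarding national security. As data volumes continue to grow, the problem of managing life-omics big data securely has become increasingly prominent. Addressing major strategic needs in China's population health and sustainable social development, the National Genomics Data Center (NGDC) has established a research system for the submission and storage, security management, open sharing, and integrative mining of life and health big data, and has put in place a series of data security management policies and measures. This paper focuses on security management across the full life cycle of life-omics big data, discusses a security management framework for such data, comprehensively analyzes the security management issues involved in data submission, storage, management, and sharing, and summarizes NGDC's achievements in this area. Finally, the paper looks ahead to future directions for life-omics big data security management, including improving the data grading and classification system, advancing technologies for graded data security management, and strengthening off-site disaster recovery, with the aim of achieving secure management and sustainable development of life-omics big data.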
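Abstract: Association analysis between genes and phenotypes is important for revealing the intrinsic genetic associations of organisms. Random walk algorithms can integrate multi-omics data, aggregate label information from first-order or higher-order neighbors, and complete missing association information between nodes in a network, which improves the accuracy of association prediction and helps uncover potential genetic associations between genes and phenotypes. However, existing random walk algorithms usually treat every node equally and ignore differences in node importance, so that unimportant nodes propagate excessively and model performance degrades. To address this, this paper proposes an individualized random walk algorithm based on multi-omics data fusion (individual Multiple Random Walks, iMRW). On a multi-omics heterogeneous network built from gene, miRNA, and phenotype nodes, iMRW designs an individualized multiple random walk strategy based on network topology, assigning different walk step lengths to nodes of different importance. It combines Gaussian interaction profile kernel similarity with random walks to complete the information on nodes and on associations between nodes, and finally fuses gene-phenotype association matrices from multiple sources to obtain an accurate gene-phenotype association prediction matrix. Under different experimental settings, comparisons with mainstream algorithms consistently show that iMRW achieves better prediction performance. Experimental analyses of maize photosynthetic capacity and starch content phenotypes further confirm the practicality and effectiveness of iMRW in identifying potential gene-phenotype associations.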
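The idea of giving nodes of different importance different walk lengths can be sketched with a plain random walk with restart. This is a simplified illustration and not the iMRW algorithm: it omits the Gaussian interaction profile kernel similarity and the matrix fusion step, it assumes node degree as the importance measure (the abstract only says the strategy is based on network topology), and the graph, node names, and function names are hypothetical.

```python
def random_walk_with_restart(adj, seed, steps, restart=0.5):
    """Run `steps` iterations of a random walk with restart from `seed`.

    adj: {node: set of neighbor nodes} for an undirected graph.
    Returns {node: visiting probability} after the given number of steps.
    """
    prob = {node: 0.0 for node in adj}
    prob[seed] = 1.0
    for _ in range(steps):
        nxt = {node: 0.0 for node in adj}
        for node, p in prob.items():
            if p == 0.0:
                continue
            if not adj[node]:
                nxt[node] += (1 - restart) * p  # isolated node keeps its mass
                continue
            share = (1 - restart) * p / len(adj[node])
            for nb in adj[node]:
                nxt[nb] += share
        nxt[seed] += restart  # restart mass returns to the seed
        prob = nxt
    return prob


def steps_by_degree(adj, max_steps=3):
    """Assumed importance rule: higher-degree nodes walk more steps."""
    top = max((len(nbrs) for nbrs in adj.values()), default=0) or 1
    return {n: max(1, (max_steps * len(nbrs)) // top) for n, nbrs in adj.items()}


# Toy heterogeneous network: genes (g*), a miRNA (m1), a phenotype (ph1).
adj = {
    "g1": {"m1", "ph1"},
    "g2": {"m1"},
    "m1": {"g1", "g2"},
    "ph1": {"g1"},
}
steps = steps_by_degree(adj)
scores = {s: random_walk_with_restart(adj, s, steps[s]) for s in adj}
print(steps)
print(scores["g2"])
```

Here the low-degree node `g2` spreads its score for only one step, so it influences only its direct neighbor, while higher-degree nodes propagate further; this mirrors the stated goal of limiting propagation from less important nodes.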