Abstract
This study uses the breast cancer samples in The Cancer Genome Atlas (TCGA) database to examine, at the genome-wide level, how gene expression changes as patients progress from normal tissue to stage I breast cancer. The aim is to identify signature genes closely associated with the onset of breast cancer and to build a pattern recognition classifier for it, providing theoretical support for prevention and early diagnosis. A signature-gene screening method that combines correlation analysis, the t-test, and confidence intervals yielded 336 genes whose expression differs significantly with breast cancer onset. Models built with machine learning methods reached a classification accuracy of 98% or higher, which exceeds the accuracy reported in previous breast cancer studies. KEGG (Kyoto Encyclopedia of Genes and Genomes) pathway analysis identified 8 pathways significantly associated with these genes (P<0.05), and GO (Gene Ontology) enrichment analysis identified 18 significantly associated functions (P<0.05). A brief functional analysis of some of the genes mapped to the 8 pathways showed that they are closely related at the level of regulation. This indicates that the identified signature genes play an important role in the onset of breast cancer, which matters both for understanding its pathogenesis and for its early diagnosis.
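The abstract names the t-test as one ingredient of the gene screening step but gives no procedure. The sketch below is only an illustration of that single ingredient, not the authors' pipeline: it scores each gene with Welch's t statistic between normal and tumor samples and keeps genes whose absolute score exceeds a cut-off. The function names, the toy expression values, the gene labels, and the threshold are all assumptions made for the example; the correlation and confidence-interval steps are not shown.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent samples (unequal variances)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se


def select_genes(normal, tumor, threshold):
    """Return gene names with |t| above the threshold, largest |t| first.

    normal, tumor: dicts mapping gene name -> list of expression values.
    threshold: hypothetical cut-off on |t|, not a value from the paper.
    """
    scores = {g: welch_t(tumor[g], normal[g]) for g in normal if g in tumor}
    hits = [g for g, t in scores.items() if abs(t) > threshold]
    return sorted(hits, key=lambda g: abs(scores[g]), reverse=True)


# Toy data, made up purely for illustration.
normal = {
    "GENE_A": [5.1, 4.9, 5.0, 5.2],
    "GENE_B": [2.0, 2.1, 1.9, 2.0],
    "GENE_C": [7.0, 6.5, 7.4, 6.9],
}
tumor = {
    "GENE_A": [8.0, 8.3, 7.9, 8.1],
    "GENE_B": [2.1, 1.9, 2.0, 2.05],
    "GENE_C": [7.1, 6.6, 7.2, 7.0],
}

selected = select_genes(normal, tumor, threshold=4.0)
print(selected)
```

In this toy data only GENE_A shifts clearly between the two groups, so it is the only gene that passes the cut-off. A real analysis would convert the statistic to a P value and correct for testing thousands of genes at once.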
Source
Progress in Biochemistry and Biophysics (《生物化学与生物物理进展》)
SCIE
CAS
CSCD
PKU Core (北大核心)
2017, No. 11, pp. 1016-1025 (10 pages)
Funding
National Natural Science Foundation of China (11572014)
Major Research Promotion Program in the Field of Intelligent Manufacturing (01500054631751)
Keywords
breast cancer, gene expression, pattern recognition, tumor prediction, early diagnosis