Abstract
Smoking is a major risk factor for lung cancer. To lay a foundation for understanding the disease mechanism of lung adenocarcinoma in never smokers, we used genome-wide gene methylation values (ME) and bioinformatics methods to build a pattern recognition model that classifies current versus never smokers, and used it to identify methylation signature genes. Methylation microarray data are ultra-high-dimensional with small sample sizes, high noise, and high correlation, and the bulk of the genome can swamp the few dozen true signature genes. To address this, we applied, for the first time, an iterative multi-step screening method: rather than relying on a single selection criterion, we screened genes according to their significance test performance, the relationship between their methylation and expression levels, their biological function, and their contribution to the current/never smoker classification. Using 127 lung adenocarcinoma samples from the TCGA database as the training set and 64 EDRN lung adenocarcinoma samples as an independent validation set, we identified 48 smoking-related methylation signature genes. Using only the methylation values of these 48 genes, the classification accuracy was 87.5% on the TCGA training set (sensitivity SN = 87.2%, specificity SP = 87.8%) and 76.4% on the EDRN validation set (SN = 80.2%, SP = 73.6%). A cross-study comparison showed that 17 of the 48 genes have already been reported in other studies as important in cancer development, and our further analysis demonstrated the importance of their methylation. Ingenuity Pathway Analysis (IPA) and Kyoto Encyclopedia of Genes and Genomes (KEGG) pathway analysis, at the gene regulatory network and pathway levels, indicated that the signature genes are closely associated with cancer development, biological function, and cell development.
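The multi-criteria screening and the reported SN/SP figures can be illustrated with a minimal sketch. It is not the authors' implementation: the thresholds `t_min` and `r_max`, the use of Welch's t statistic, the requirement of a negative methylation/expression correlation, the `annotated` gene set, and the toy gene names and values are all assumptions made for illustration. The iteration and the classification-importance step described in the abstract are omitted.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic for two independent groups (assumed significance test)."""
    se = sqrt(variance(a) / len(a) + variance(b) / len(b))
    return (mean(a) - mean(b)) / se if se else 0.0


def pearson(x, y):
    """Pearson correlation coefficient; returns 0.0 if either input is constant."""
    mx, my = mean(x), mean(y)
    num = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    den = sqrt(sum((xi - mx) ** 2 for xi in x) * sum((yi - my) ** 2 for yi in y))
    return num / den if den else 0.0


def screen_genes(meth, expr, labels, annotated, t_min=2.0, r_max=-0.3):
    """Keep genes passing three successive filters (thresholds are hypothetical)."""
    kept = []
    for gene, values in meth.items():
        current = [v for v, y in zip(values, labels) if y == 1]
        never = [v for v, y in zip(values, labels) if y == 0]
        # Filter 1: methylation differs between current and never smokers.
        if abs(welch_t(current, never)) < t_min:
            continue
        # Filter 2: methylation is negatively correlated with expression (assumed criterion).
        if pearson(values, expr[gene]) > r_max:
            continue
        # Filter 3: gene carries a relevant functional annotation (hypothetical set).
        if gene not in annotated:
            continue
        kept.append(gene)
    return kept


def sensitivity_specificity(y_true, y_pred):
    """Return (sensitivity, specificity, accuracy) with label 1 as the positive class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp), (tp + tn) / len(y_true)


# Toy data: 1 = current smoker, 0 = never smoker (values are invented).
labels = [1, 1, 1, 0, 0, 0]
meth = {
    "GENE_A": [0.80, 0.85, 0.90, 0.20, 0.25, 0.30],
    "GENE_B": [0.50, 0.52, 0.48, 0.51, 0.49, 0.50],
    "GENE_C": [0.10, 0.15, 0.12, 0.70, 0.75, 0.72],
}
expr = {
    "GENE_A": [1.0, 0.9, 0.8, 3.0, 2.9, 2.8],
    "GENE_B": [2.0, 2.1, 1.9, 2.0, 2.2, 1.8],
    "GENE_C": [1.0, 1.1, 1.0, 3.0, 3.2, 3.1],
}
annotated = {"GENE_A", "GENE_C"}

selected = screen_genes(meth, expr, labels, annotated)
print(selected)
```

In this toy run, `GENE_B` fails the significance filter and `GENE_C` fails the correlation filter, so only `GENE_A` survives. Applying several independent criteria in sequence is what keeps a single noisy statistic from dominating the selection.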
Source
Chinese Journal of Biomedical Engineering (《中国生物医学工程学报》)
CAS
CSCD
PKU Core (北大核心)
2016, No. 3, pp. 301-309 (9 pages)
Funding
National Natural Science Foundation of China (31271351)
Keywords
lung adenocarcinoma
methylation values
smoke exposure
pattern recognition
classification