Abstract
Objective: To use the 27K methylation data and clinical follow-up data for colorectal cancer in The Cancer Genome Atlas (TCGA) to identify factors associated with poor prognosis of colorectal cancer and to establish a colorectal cancer diagnostic model.
Methods: From December 2020 to September 2021, 27K methylation sequencing data and related clinical data for 207 colorectal cancer cases were downloaded from the TCGA website. Differentially methylated sites were screened with the R package edgeR. ROC analysis and stepwise regression analysis were then performed on these sites in SPSS to select the sites significant for the diagnosis of colorectal cancer. Mathematical models based on several DNA methylation sites were built using support vector machines, neural networks, and other methods, and the independent dataset GSE131013 from the GEO database was used to evaluate the performance of the diagnostic models. In addition, Kaplan-Meier (KM) univariate analysis and Cox multivariate analysis were applied to the clinical variables and methylation sites to identify factors related to poor prognosis of colorectal cancer.
Results: Six sites with diagnostic potential for colorectal cancer were identified: cg00240432, cg06744574, cg08090772, cg13577076, cg17872757, and cg24446548. An artificial neural network (ANN) model, a logistic regression model, and a support vector machine (SVM) model were built on these six DNA methylation sites. Under 10-fold cross-validation, the mean accuracies of the three models were 99.0%, 98.0%, and 99.5%, and the missed diagnosis rates were 1.0%, 2.0%, and 0.5%, respectively. When validated on the independent dataset from the GEO database, the accuracies of the three models were 92.9%, 85.8%, and 91.2%, respectively. KM survival analysis found that cg24446548 hypermethylation and advanced colorectal cancer (stages Ⅲ and Ⅳ) were associated with poor prognosis (P<0.05). Cox multivariate analysis found that tumor stage had a significant effect on survival (P<0.05).
Conclusion: The screened methylation sites have the potential to diagnose colorectal cancer. Of the three models built on the screened methylation sites, the ANN and SVM models performed better in classification and prediction. In colorectal cancer patients, hypermethylation at cg24446548 and advanced tumor stage (stages Ⅲ and Ⅳ) predict poor prognosis.
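The 10-fold cross-validation figures reported in the abstract (accuracy and missed diagnosis rate) can be illustrated with a minimal sketch. It is not the authors' pipeline. The data are synthetic, the 100/100 tumor/normal split and the beta-value means are hypothetical, and a simple nearest-centroid classifier stands in for the ANN, logistic regression, and SVM models so that the example needs only the Python standard library. "Missed diagnosis rate" is taken here to mean the false-negative rate, FN / (TP + FN), which is an assumption about the paper's definition.

```python
import random


def stratified_folds(labels, k=10, seed=0):
    """Split sample indices into k folds, keeping class ratios similar per fold."""
    rng = random.Random(seed)
    folds = [[] for _ in range(k)]
    for cls in sorted(set(labels)):
        idx = [i for i, lab in enumerate(labels) if lab == cls]
        rng.shuffle(idx)
        for pos, i in enumerate(idx):
            folds[pos % k].append(i)
    return folds


def fit_centroids(X, y):
    """Nearest-centroid 'training': the per-class mean of every feature."""
    centroids = {}
    for cls in set(y):
        rows = [x for x, lab in zip(X, y) if lab == cls]
        centroids[cls] = [sum(col) / len(rows) for col in zip(*rows)]
    return centroids


def predict(centroids, x):
    """Return the class whose centroid is closest (squared Euclidean distance)."""
    def sq_dist(cls):
        return sum((a - b) ** 2 for a, b in zip(x, centroids[cls]))
    return min(centroids, key=sq_dist)


def cross_validate(X, y, k=10, seed=0):
    """Pool held-out predictions over k folds; return (accuracy, miss rate)."""
    tp = tn = fp = fn = 0
    for test_idx in stratified_folds(y, k, seed):
        held_out = set(test_idx)
        train = [i for i in range(len(y)) if i not in held_out]
        centroids = fit_centroids([X[i] for i in train], [y[i] for i in train])
        for i in test_idx:
            pred = predict(centroids, X[i])
            if y[i] == 1:          # true tumor sample
                tp += pred == 1
                fn += pred == 0    # a missed diagnosis
            else:                  # true normal sample
                tn += pred == 0
                fp += pred == 1
    accuracy = (tp + tn) / (tp + tn + fp + fn)
    miss_rate = fn / (tp + fn)     # assumed definition: false-negative rate
    return accuracy, miss_rate


def make_synthetic(n_per_class=100, n_sites=6, seed=1):
    """Hypothetical beta values at 6 sites: tumor (label 1) high, normal (label 0) low."""
    rng = random.Random(seed)
    X, y = [], []
    for label, mean in ((1, 0.7), (0, 0.2)):
        for _ in range(n_per_class):
            X.append([min(1.0, max(0.0, rng.gauss(mean, 0.1)))
                      for _ in range(n_sites)])
            y.append(label)
    return X, y


if __name__ == "__main__":
    X, y = make_synthetic()
    acc, miss = cross_validate(X, y)
    print(f"accuracy={acc:.3f}, missed diagnosis rate={miss:.3f}")
```

Because every sample is held out exactly once, pooling the counts across folds gives the same accuracy as averaging per-fold accuracies when the folds are of equal size. With real data, the feature matrix would hold the measured methylation values at the six selected sites, and the classifier would be replaced by the model under evaluation.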
Authors
薛春萌
高洁
李嘉乐
李荣佳
刘畅
梁建伟
XUE Chunmeng; GAO Jie; LI Jiale; LI Rongjia; LIU Chang; LIANG Jianwei (Health Management, the First Affiliated Hospital of Shandong First Medical University (Shandong Qianfoshan Hospital), Shandong Health Checkup Engineering Laboratory, Jinan, Shandong Province, 250000 China; School of Basic Medicine, Shandong First Medical University, Jinan, Shandong Province, 250000 China; Department of General Surgery, Tai'an Central Hospital, Tai'an, Shandong Province, 271000 China)
Source
Systems Medicine (《系统医学》)
2022, Issue 15, pp. 39-45 (7 pages)
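Funding
Shandong Province Medical and Health Science and Technology Development Program (202011000331)
National Undergraduate Innovation and Entrepreneurship Training Program (S202010439019)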
Keywords
Colorectal cancer
Machine learning
Methylation
10-fold cross-validation