Abstract
Objective To build an AdaBoost classifier that estimates the likelihood that a non-coding mutation in liver cancer is disease-related, and to use it to identify harmful mutations in non-coding regions. Methods A total of 13,108 disease-related non-coding mutations from the Human Gene Mutation Database (HGMD) served as the case group, and neutral single nucleotide polymorphisms (SNPs) served as controls. An AdaBoost classifier was built on regulatory annotations of non-coding regions, including conserved regions, evolutionarily conserved RNA structures, highly expressed genes, DNase I hypersensitive sites, transcription factor binding sites (TFBSs), histone modifications, and early-replicating genes, and the value of each annotation for predicting harmful non-coding mutations was analyzed. A receiver operating characteristic (ROC) curve was plotted from the predicted probabilities and the area under the ROC curve (AUCROC) was calculated. The model was validated on disease-associated variants from genome-wide association studies (GWAS) and from the ClinVar database. Results Ranked from most to least important for identifying disease-related mutations, the factors were conserved regions, early-replicating genes, untranslated regions (UTRs), promoters, highly expressed regions, H3K36me3, and conserved TFBSs, among others. The ROC curve built from the predicted probabilities of the AdaBoost classifier had an AUCROC of 0.90. The mean scores of GWAS and ClinVar disease-associated variants were significantly higher than those of neutral SNPs (P<0.05). Conclusion The AdaBoost classifier helps to evaluate the likelihood of harmful mutations in non-coding regions in liver cancer and is a highly accurate prediction tool.
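The workflow in the Methods (boosting over binary annotation features, then scoring by ROC area) can be illustrated with a minimal sketch. This is not the authors' implementation: the data are synthetic, and the feature names, overlap rates, number of boosting rounds, and one-feature decision stumps are all assumptions chosen for illustration.

```python
import math
import random

# Hypothetical binary annotations per variant (names are illustrative only).
FEATURES = ["conserved_region", "early_replicating", "utr", "promoter"]

def make_toy_data(n, seed=0):
    """Synthetic variants: label +1 (disease-related) or -1 (neutral SNP)."""
    rng = random.Random(seed)
    rows, labels = [], []
    for _ in range(n):
        y = rng.choice([1, -1])
        # Assumed rates: disease-related variants overlap annotations more often.
        p = 0.7 if y == 1 else 0.3
        rows.append([1 if rng.random() < p else 0 for _ in FEATURES])
        labels.append(y)
    return rows, labels

def fit_adaboost(rows, labels, rounds=10):
    """Discrete AdaBoost with one-feature decision stumps."""
    n = len(rows)
    weights = [1.0 / n] * n
    model = []  # list of (alpha, feature_index, polarity)
    for _ in range(rounds):
        best = None
        for j in range(len(rows[0])):
            for polarity in (1, -1):
                # Stump predicts `polarity` if the feature is present, else -polarity.
                err = sum(w for w, x, y in zip(weights, rows, labels)
                          if (polarity if x[j] else -polarity) != y)
                if best is None or err < best[0]:
                    best = (err, j, polarity)
        err, j, polarity = best
        err = min(max(err, 1e-10), 1 - 1e-10)  # keep the log finite
        alpha = 0.5 * math.log((1 - err) / err)
        model.append((alpha, j, polarity))
        # Up-weight misclassified variants, then renormalise.
        weights = [w * math.exp(-alpha * y * (polarity if x[j] else -polarity))
                   for w, x, y in zip(weights, rows, labels)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def score(model, x):
    """Weighted vote; larger means more likely disease-related."""
    return sum(a * (p if x[j] else -p) for a, j, p in model)

def auc(scores, labels):
    """AUC as the probability that a positive outranks a negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == -1]
    wins = sum(1.0 if p > q else 0.5 if p == q else 0.0 for p in pos for q in neg)
    return wins / (len(pos) * len(neg))

train_x, train_y = make_toy_data(400, seed=1)
test_x, test_y = make_toy_data(200, seed=2)
model = fit_adaboost(train_x, train_y)
test_auc = auc([score(model, x) for x in test_x], test_y)
print(f"held-out AUC on toy data: {test_auc:.2f}")
```

The sum of the learned stump weights per feature gives a rough importance ranking, comparable in spirit to the factor ranking reported in the Results; the AUC printed here reflects only the toy data and says nothing about the 0.90 reported in the study.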
Source
《上海交通大学学报(医学版)》
CAS
CSCD
Peking University Core Journals (北大核心)
2015, Issue 6, pp. 819-823 (5 pages)
Journal of Shanghai Jiao Tong University (Medical Science)