Abstract
Objective: To evaluate an oral squamous cell carcinoma (OSCC) diagnostic model built by applying principal component analysis (PCA) to differentially expressed genes (DEGs) in OSCC, and to provide a reference for clinical diagnosis and treatment.

Methods: RNA-seq expression data for OSCC and normal control samples were obtained from The Cancer Genome Atlas (TCGA) database. The expression data were normalized and analyzed for differential expression in R to identify DEGs. Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analyses were performed on the DEGs to identify their main biological characteristics. The DEG expression data were then randomly split into a training set (70%) and a test set (30%). PCA was applied to the training set, and the principal components (PCs) related to the diagnosis of OSCC were extracted to construct a PCA model. Receiver operating characteristic (ROC) curves were drawn for the PCA model in the training set and the test set, and the area under the curve (AUC) was calculated to evaluate the accuracy of the model in diagnosing OSCC.

Results: RNA-seq expression data were obtained from TCGA for 330 OSCC samples and 32 normal control samples. With false discovery rate (FDR) < 0.001 and |log2 fold change| (|log2FC|) > 4 as thresholds, 159 downregulated and 248 upregulated DEGs were identified. These DEGs were mainly enriched in cellular components such as intermediate filaments and the melanosome membrane and in pigment- and saliva-related biological processes, and were mainly involved in pathways such as salivary secretion and tyrosine metabolism (P.adjust < 0.05 and Q < 0.05). The DEGs were proposed as tumor markers for diagnosing OSCC. PCA of the training set showed that the variance contribution rates of the first three principal components, PC1, PC2 and PC3, were 0.873, 0.100 and 0.023, respectively, giving a cumulative contribution rate of 0.996. These three components included submaxillary gland androgen regulated protein 3B (SMR3B), proline rich 27 (PRR27), histatin 3 (HTN3), statherin (STATH), cystatin D (CST5), BPI fold containing family A member 2 (BPIFA2), proline rich protein HaeIII subfamily 2 (PRH2), keratin 35 (KRT35), histatin 1 (HTN1) and amylase alpha 1B (AMY1B). The PCA diagnostic model of OSCC was then constructed by combining the eigenvectors of the three components. The ROC curves showed that the AUC of the PCA model was 0.852 in the training set and 0.844 in the test set, both higher than those of the individual genes.

Conclusion: The OSCC diagnostic model constructed with PCA and DEGs, based on the expression levels of SMR3B, PRR27, HTN3, STATH, CST5, BPIFA2, PRH2, KRT35, HTN1 and AMY1B, shows a high diagnostic advantage. This study provides a theoretical basis for the early genetic diagnosis of OSCC and for the application of PCA models in clinical diagnosis.
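The workflow in the abstract (threshold-based DEG selection, a 70/30 split, PCA fitted on the training set, and ROC/AUC evaluation) can be illustrated with a minimal Python sketch. It is not the authors' code: they worked in R on TCGA data, whereas everything below is synthetic. The gene names `G1` to `G6`, the sample counts, and the helper names are hypothetical. The abstract does not state how the three PCs are combined into one score, so the sketch assumes a variance-ratio-weighted sum of PC scores, with each axis oriented so that tumors score higher in the training set. The split is stratified by class (also an assumption) so that the small normal group appears in both sets.

```python
import math
import random

def select_degs(stats, fdr_max=0.001, lfc_min=4.0):
    """Keep genes with FDR < fdr_max and |log2FC| > lfc_min (the abstract's cutoffs)."""
    return [g for g, (lfc, fdr) in stats.items() if fdr < fdr_max and abs(lfc) > lfc_min]

def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def pca_fit(rows, n_comp, iters=500, seed=0):
    """Return (mean, axes, variance_ratios) for the leading principal components."""
    start = random.Random(seed)
    n, p = len(rows), len(rows[0])
    mean = [sum(r[j] for r in rows) / n for j in range(p)]
    cen = [[r[j] - mean[j] for j in range(p)] for r in rows]
    cov = [[sum(c[i] * c[j] for c in cen) / (n - 1) for j in range(p)] for i in range(p)]
    total = sum(cov[i][i] for i in range(p))
    axes, ratios = [], []
    for _comp in range(n_comp):
        v = [start.gauss(0, 1) for _ in range(p)]
        for _it in range(iters):  # power iteration toward the leading eigenvector
            w = matvec(cov, v)
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        lam = sum(a * b for a, b in zip(v, matvec(cov, v)))  # Rayleigh quotient
        axes.append(v)
        ratios.append(lam / total)
        # Deflation: remove the component just found before searching for the next.
        cov = [[cov[i][j] - lam * v[i] * v[j] for j in range(p)] for i in range(p)]
    return mean, axes, ratios

def pc_scores(row, mean, axes):
    cen = [x - m for x, m in zip(row, mean)]
    return [sum(a * c for a, c in zip(axis, cen)) for axis in axes]

def auc(scores, labels):
    """Probability that a tumor outscores a normal sample (ties count one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((a > b) + 0.5 * (a == b) for a in pos for b in neg)
    return wins / (len(pos) * len(neg))

rng = random.Random(42)
# Hypothetical summary statistics, gene -> (log2FC, FDR); not values from the paper.
stats = {"G1": (-6.2, 1e-8), "G2": (-5.1, 1e-6), "G3": (4.6, 1e-5),
         "G4": (4.3, 1e-4), "G5": (1.2, 1e-9), "G6": (5.0, 0.02)}
degs = select_degs(stats)

def sample(is_tumor):
    # Synthetic log-expression: baseline 8, shifted by log2FC in tumors, unit noise.
    return [8 + (stats[g][0] if is_tumor else 0) + rng.gauss(0, 1) for g in degs]

def split(rows, frac=0.7):
    rows = rows[:]
    rng.shuffle(rows)
    cut = round(frac * len(rows))
    return rows[:cut], rows[cut:]

tum_tr, tum_te = split([sample(True) for _ in range(60)])
nor_tr, nor_te = split([sample(False) for _ in range(20)])
train_x, train_y = tum_tr + nor_tr, [1] * len(tum_tr) + [0] * len(nor_tr)
test_x, test_y = tum_te + nor_te, [1] * len(tum_te) + [0] * len(nor_te)

mean, axes, ratios = pca_fit(train_x, n_comp=3)  # fitted on the training set only

# PC signs are arbitrary: flip each axis so tumors score higher in the training set.
train_pcs = [pc_scores(r, mean, axes) for r in train_x]
for k in range(len(axes)):
    tum = [s[k] for s, y in zip(train_pcs, train_y) if y == 1]
    nor = [s[k] for s, y in zip(train_pcs, train_y) if y == 0]
    if sum(tum) / len(tum) < sum(nor) / len(nor):
        axes[k] = [-a for a in axes[k]]

def composite(row):
    # Assumed scoring rule: PC scores weighted by their share of explained variance.
    return sum(r * s for r, s in zip(ratios, pc_scores(row, mean, axes)))

train_auc = auc([composite(r) for r in train_x], train_y)
test_auc = auc([composite(r) for r in test_x], test_y)
print(degs, [round(r, 3) for r in ratios], train_auc, test_auc)
```

Because the synthetic classes are separated far more cleanly than real tumor and normal tissue, the AUC values printed here will be much higher than the 0.852 and 0.844 reported in the abstract; the sketch only shows how the steps connect. The test-set score reuses the training-set mean, axes, and variance ratios, so no test data influences the model.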
Authors
温凌杜
王子弘
张国明
赖茜
杨宏宇
WEN Lingdu; WANG Zihong; ZHANG Guoming; LAI Xi; YANG Hongyu (Graduate School of Guangzhou Medical University, Guangzhou 510000, China; Department of Stomatology, Shenzhen Baoan Hospital of Traditional Chinese Medicine (Group), Shenzhen 518000, China; Department of Stomatology, Shenzhen Baoan Maternity and Child Health Hospital, Shenzhen 518000, China; Department of Stomatology, Peking University Shenzhen Hospital, Shenzhen 518000, China)
Source
《口腔疾病防治》
2022, Issue 4, pp. 251-257 (7 pages)
Journal of Prevention and Treatment for Stomatological Diseases
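Funding
Natural Science Foundation of Guangdong Province (2019A1515011911)
Guangdong Provincial High-Level Clinical Key Specialty Project (SZGSP008)
Sanming Project of Medicine in Shenzhen (SZSM201512036)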
Keywords
oral squamous cell carcinoma
differentially expressed genes
tumor markers
early diagnosis
genetic diagnosis
principal component analysis
diagnostic model
bioinformatics