Abstract
[Objective] To construct machine-learning classifiers of genome-wide DNA replication timing in Arabidopsis thaliana and to identify epigenetic modifications associated with replication timing, providing a basis for further research on the epigenetic regulation of the DNA replication timing program. [Method] Genome-wide DNA replication timing data, epigenetic modification features (ChIP-Seq) and chromatin accessibility data (DNase-Seq) of Arabidopsis thaliana were collected. First, t-SNE was used to reduce the dimensionality of the epigenetic modification features as a preliminary assessment of how predictable early versus late replication is, and Pearson correlation coefficients were calculated pairwise between the epigenetic features and the replication timing signals. Then three classifiers, random forest (RF), multinomial logistic regression and support vector machine (SVM), were built to model DNA replication timing, using 80% of the data for model building and the remaining 20% for validation. Ten-fold cross-validation and the area under the ROC curve (AUC) were used as evaluation measures. [Result] All three classifiers predicted DNA replication timing well, with average AUCs above 0.8 in the early, middle and late phases of DNA replication. Early-replication signals were positively correlated with RNA polymerase Ⅱ binding and chromatin accessibility signals, whereas late-replication signals were negatively correlated with them. Among the epigenetic features, H3.1, H3.3, H2AW, H4K16ac, H3K36me3 and H3K4me3 may be closely related to DNA replication timing. [Conclusion] DNA replication timing in the Arabidopsis thaliana genome can be accurately predicted from epigenetic modifications, with late replication predicted most accurately. Histone variants and epigenetic modifications closely related to DNA replication timing were identified.
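The two evaluation quantities named in the abstract, the Pearson correlation coefficient and the one-vs-rest AUC, together with an 80%/20% split, can be sketched as follows. This is a minimal illustration and not the authors' code: the data are synthetic, the single "accessibility-like" signal stands in for a trained classifier's score, and all function and variable names are hypothetical.

```python
import random
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(vx * vy)


def auc_one_vs_rest(scores, labels, positive):
    """AUC for one class against all others.

    Computed as the probability that a randomly chosen positive example
    scores higher than a randomly chosen negative one (ties count 0.5).
    """
    pos = [s for s, l in zip(scores, labels) if l == positive]
    neg = [s for s, l in zip(scores, labels) if l != positive]
    wins = sum((p > q) + 0.5 * (p == q) for p in pos for q in neg)
    return wins / (len(pos) * len(neg))


def split_80_20(rows, seed=0):
    """Shuffle a copy of rows and return (80% for modelling, 20% held out)."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    cut = int(0.8 * len(rows))
    return rows[:cut], rows[cut:]


if __name__ == "__main__":
    # Synthetic genomic windows (assumption): one signal value per window,
    # drawn with a higher mean for earlier-replicating classes.
    rng = random.Random(42)
    rows = [
        (rng.gauss(mean, 0.7), label)
        for label, mean in (("early", 2.0), ("middle", 1.0), ("late", 0.0))
        for _ in range(100)
    ]
    train, held_out = split_80_20(rows)
    signal = [value for value, _ in held_out]
    labels = [label for _, label in held_out]

    # Correlation between the signal and an early-replication indicator.
    is_early = [1.0 if l == "early" else 0.0 for l in labels]
    print("Pearson r (signal vs. early):", pearson(signal, is_early))

    # One-vs-rest AUC, using the raw signal as the score for "early"
    # and its negation as the score for "late".
    print("AUC early:", auc_one_vs_rest(signal, labels, "early"))
    print("AUC late:", auc_one_vs_rest([-s for s in signal], labels, "late"))
```

In a full analysis the scores would be class probabilities from a fitted random forest, multinomial logistic regression or SVM rather than a single raw feature, and the AUC would be averaged over the cross-validation folds.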
Authors
LI Ye (李椰); LI Dongwei (李东维); LI Zhaohong (李昭宏); YANG Ruolin (杨若林)
College of Life Sciences, Northwest A&F University, Yangling, Shaanxi 712100, China
Source
《西北农林科技大学学报(自然科学版)》
CSCD
Peking University Core Journals (北大核心)
2021, Issue 4, pp. 133-141 (9 pages)
Journal of Northwest A&F University (Natural Science Edition)
Funding
Shaanxi Province "Hundred Talents Program" project (SXBR8025).
Keywords
Arabidopsis thaliana
DNA replication timing program
epigenetic modifications
machine learning