Research on Prediction Method of N6-methylation Sites Based on Ensemble Learning
Abstract N6-methyladenine (6mA) is the methylation of the nitrogen atom at position 6 of adenine. It plays an important role in maintaining normal cellular transcriptional activity, DNA damage repair, chromatin remodeling, genetic imprinting, embryonic development, and tumorigenesis. Identifying 6mA sites experimentally is time-consuming and costly. In recent years, several machine learning-based methods for 6mA site prediction have been proposed, but they rely heavily on a single learning model, which limits their generalization ability and prediction accuracy. Ensemble learning combines the strengths of multiple prediction models and offers better generalization and predictive performance. To further improve 6mA site prediction, a stacking ensemble-based model called Stack6mAPred is proposed. It consists of two layers of classifiers: the first layer integrates three mainstream classifiers, namely Naive Bayes, support vector machine (SVM), and LightGBM, and the second layer uses a logistic regression (LR) classifier. Stack6mAPred encodes experimentally identified 6mA sequences and non-6mA sequences into feature vectors using five sequence features, including enhanced nucleotide composition, and applies the XGBoost (extreme gradient boosting) algorithm for feature selection, removing redundant features from the high-dimensional representation. In five-fold cross-validation on the benchmark rice datasets, Stack6mAPred outperforms MM-6mAPred, the best-performing existing method, on five common evaluation metrics: sensitivity, specificity, accuracy, MCC (Matthews correlation coefficient), and AUC (area under the ROC curve). The improvements are 1.7%, 1.36%, 1.72%, 0.06, and 0.031 respectively (the English-language version of the abstract gives 0.032 for AUC).
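The five evaluation metrics named in the abstract have standard definitions in terms of confusion-matrix counts and classifier scores. The following is a minimal sketch of those definitions, not the authors' code: the function names `binary_metrics` and `auc` and all example numbers are hypothetical, and it assumes both classes are present so that no denominator for sensitivity or specificity is zero.

```python
import math


def binary_metrics(tp, fn, tn, fp):
    """Compute sensitivity, specificity, accuracy and MCC from counts.

    tp/fn: 6mA sites predicted correctly / missed.
    tn/fp: non-6mA sites predicted correctly / wrongly flagged.
    Assumes tp + fn > 0 and tn + fp > 0 (illustrative sketch only).
    """
    sn = tp / (tp + fn)                    # sensitivity (recall on positives)
    sp = tn / (tn + fp)                    # specificity (recall on negatives)
    acc = (tp + tn) / (tp + fn + tn + fp)  # overall accuracy
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # MCC is conventionally set to 0 when the denominator vanishes.
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"Sn": sn, "Sp": sp, "Acc": acc, "MCC": mcc}


def auc(pos_scores, neg_scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half (pairwise rank formulation)."""
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


if __name__ == "__main__":
    # Hypothetical counts and scores, chosen only to exercise the functions.
    print(binary_metrics(tp=80, fn=20, tn=90, fp=10))
    print(auc([0.9, 0.8, 0.4], [0.7, 0.3, 0.2]))
```

Sensitivity, specificity, and accuracy are proportions, which is why the abstract reports their gains in percent, while MCC and AUC are reported as absolute differences. The pairwise AUC loop is quadratic in the number of samples and is meant for illustration rather than for large benchmark sets.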
Authors ZHAO Yuan-yuan (赵媛媛); CHEN Jin-xiang (陈进祥); LI Fu-yi (李富义); WU Hao (吴昊); LIU Quan-zhong (刘全中). Affiliations: School of Information Engineering, Northwest A&F University, Yangling 712100, China; Monash Centre for Data Science, Monash University, Melbourne VIC 3800, Australia; Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne VIC 3800, Australia
Source Computer Technology and Development (《计算机技术与发展》), 2021, No. 3, pp. 149-156 (8 pages)
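Funding National Natural Science Foundation of China, General Program (61972322); Ministry of Education Humanities and Social Sciences interdisciplinary project (18YJCZH190); Fundamental Research Funds, Frontier and Interdisciplinary Science Research Project (2452019180); Fundamental Research Funds for the Central Universities (2452017342); Doctoral Research Start-up Fund (2452017019).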
Keywords N6-methyladenine (6mA); stacking ensemble learning; extreme gradient boosting (XGBoost); LightGBM; support vector machine