A long non-coding RNA prediction method based on random forest
Abstract To improve the accuracy of long non-coding RNA (lncRNA) prediction, a prediction method based on the random forest algorithm is proposed. The training dataset is derived from internationally used gene annotation and genome sequence data. Features are selected first: the ratios of 14 trinucleotide sequences (ACG, CCG, CGA, CGC, CGG, CGT, CTA, GCG, GGG, GTA, TAA, TAC, TAG, TCG) to the transcript length, the standard deviation of stop codon counts across the three reading frames, GC content, protein-coding potential (CDS, CDS length, and ratio of CDS to transcript), transcript length, exon count, average exon length, and conservation score (average PhastCons score of the transcript). The random forest algorithm is then applied to the feature dataset for model training, and the over-fitting problem encountered when implementing other algorithms is resolved. Results of 10-fold cross-validation show that the random forest method outperforms other methods, including K-nearest neighbors (K-NN), Naive Bayes, and Bayesian network, in terms of true positive rate, precision, recall, F score, and AUC (area under the curve).
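The sequence-derived features named in the abstract (trinucleotide ratios, the standard deviation of stop codon counts over the three reading frames, and GC content) can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and the choices to count overlapping trinucleotides, to scan only the forward strand, and to use the population standard deviation are assumptions, since the abstract does not specify them.

```python
from statistics import pstdev

# The 14 trinucleotides listed in the abstract.
TRINUCLEOTIDES = [
    "ACG", "CCG", "CGA", "CGC", "CGG", "CGT", "CTA",
    "GCG", "GGG", "GTA", "TAA", "TAC", "TAG", "TCG",
]
# Standard stop codons in DNA notation.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def trinucleotide_ratios(seq):
    """Count of each listed trinucleotide divided by transcript length.

    Assumption: occurrences are counted with a sliding (overlapping) window.
    """
    seq = seq.upper()
    n = len(seq)
    counts = dict.fromkeys(TRINUCLEOTIDES, 0)
    for i in range(n - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
    return {t: (c / n if n else 0.0) for t, c in counts.items()}


def stop_codon_std(seq):
    """Standard deviation of stop codon counts over the 3 forward frames.

    Assumption: population standard deviation (statistics.pstdev).
    """
    seq = seq.upper()
    per_frame = []
    for frame in range(3):
        # Step through complete codons only, starting at the frame offset.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        per_frame.append(sum(codon in STOP_CODONS for codon in codons))
    return pstdev(per_frame)


def gc_content(seq):
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Usage on a toy transcript (not real data).
toy = "ATGTAACGGTAG"
features = trinucleotide_ratios(toy)
features["stop_codon_std"] = stop_codon_std(toy)
features["gc_content"] = gc_content(toy)
```

The resulting feature dictionary, extended with the annotation-derived features (coding potential, transcript length, exon statistics, conservation score), is the kind of per-transcript row that a random forest classifier would be trained on.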
Source Journal of Yangzhou University: Natural Science Edition (indexed in CAS; Peking University Core Journal), 2016, No. 4, pp. 50-53 (4 pages)
Funding National Natural Science Foundation of China (61301220); Jiangsu Province "Six Talent Peaks" High-Level Talent Project, seventh batch (2010-DZXX-149)
Keywords long non-coding RNA; random forest; gene prediction