Abstract
To improve the accuracy of long non-coding RNA (lncRNA) prediction, a method based on the random forest algorithm is proposed. The training dataset is derived from internationally used gene annotation and genome sequence data. Features are selected first: the ratios of 14 trinucleotide sequences (ACG, CCG, CGA, CGC, CGG, CGT, CTA, GCG, GGG, GTA, TAA, TAC, TAG, TCG) to the transcript length, the standard deviation of stop codon counts across the three reading frames, GC content, protein-coding potential (CDS, CDS length, and ratio of CDS to transcript), transcript length, exon count, average exon length, and a conservation score (the average PhastCons score of the transcript). The random forest algorithm is then applied to the feature-annotated dataset for model training, and the over-fitting problem encountered when implementing other algorithms is addressed. Results of 10-fold cross-validation show that the random forest method outperforms other methods, including K-nearest neighbors (K-NN), Naive Bayes, and Bayesian network, in terms of true positive rate, precision, recall, F score, and AUC (area under the curve).
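The sequence-derived features listed in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. The function names are hypothetical, and three choices are assumptions the abstract does not specify: trinucleotides are counted with overlapping windows, the population standard deviation is used, and only the forward strand is scanned. Features that need external data (protein-coding potential, exon structure, PhastCons scores) are left out.

```python
from statistics import pstdev

# The 14 trinucleotides listed in the abstract.
TRIMERS = ["ACG", "CCG", "CGA", "CGC", "CGG", "CGT", "CTA",
           "GCG", "GGG", "GTA", "TAA", "TAC", "TAG", "TCG"]
STOP_CODONS = {"TAA", "TAG", "TGA"}


def trimer_ratios(seq):
    """Overlapping count of each listed trimer divided by transcript length.

    Assumption: overlapping windows; the abstract does not say how to count.
    """
    seq = seq.upper()
    n = len(seq)
    counts = {t: 0 for t in TRIMERS}
    for i in range(n - 2):
        tri = seq[i:i + 3]
        if tri in counts:
            counts[tri] += 1
    return {t: (counts[t] / n if n else 0.0) for t in TRIMERS}


def stop_codon_std(seq):
    """Standard deviation of stop codon counts over the three reading frames.

    Assumption: population standard deviation, forward strand only.
    """
    seq = seq.upper()
    per_frame = []
    for frame in range(3):
        # Step through complete codons starting at this frame offset.
        codons = (seq[i:i + 3] for i in range(frame, len(seq) - 2, 3))
        per_frame.append(sum(1 for c in codons if c in STOP_CODONS))
    return pstdev(per_frame)


def gc_content(seq):
    """Fraction of G and C bases in the transcript."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


def sequence_features(seq):
    """Collect the sequence-only features into one dictionary."""
    feats = trimer_ratios(seq)
    feats["stop_codon_std"] = stop_codon_std(seq)
    feats["gc_content"] = gc_content(seq)
    feats["transcript_length"] = len(seq)
    return feats
```

Calling `sequence_features` on each transcript yields a 17-value vector. Combined with the remaining features, such vectors could be passed to any random forest implementation and evaluated with 10-fold cross-validation. The abstract does not name the software used.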
Source
Journal of Yangzhou University: Natural Science Edition (《扬州大学学报(自然科学版)》)
CAS
PKU Core (北大核心)
2016, No. 4, pp. 50-53 (4 pages)
Funding
National Natural Science Foundation of China (61301220)
Jiangsu Province "Six Talent Peaks" Seventh Batch High-Level Talent Project (2010-DZXX-149)