Abstract
Traditional stochastic grammar models for RNA secondary structure prediction require a large number of related sequence samples, which limits their practical use. To exploit the large volume of unlabeled RNA sequences for structure prediction, we incorporate semi-supervised learning into the stochastic grammar model, training the predictor on a small set of labeled RNA samples together with a large set of unlabeled ones. We design a semi-supervised prediction model based on the EM algorithm that uses a generative stochastic context-free grammar (SCFG) model as the classifier: training labels the unlabeled RNA sequences, the newly labeled sequences are merged step by step into the labeled sample set, the proportion of labeled to unlabeled samples can be adjusted, and the model finally outputs the structure label sequence. Experiments on several sequence sets mixing labeled and unlabeled RNA sequences show that the method makes effective use of unlabeled sequence data, greatly reduces the number of labeled sequence samples required, and improves prediction accuracy. We also measure how the number of unlabeled sequences mixed into the training set affects prediction performance.
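The training loop summarized in the abstract can be illustrated with a minimal sketch. It is not the paper's implementation: the SCFG classifier is replaced by a much simpler stand-in, a position-independent generative tagger over a hypothetical two-tag alphabet (P for paired, U for unpaired), and the step-by-step merging of newly labeled sequences is approximated by soft EM counts. The names `m_step`, `posterior`, `semi_supervised_em`, `tag_sequence` and the weight `lam` are assumptions introduced here; `lam` plays the role of the adjustable proportion between labeled and unlabeled samples.

```python
TAGS = ("P", "U")   # hypothetical tags: paired / unpaired
BASES = "ACGU"


def m_step(weighted_obs, alpha=1.0):
    """Estimate P(tag) and P(base | tag) from (base, tag, weight) triples.

    alpha is an add-alpha smoothing constant (an assumption of this sketch).
    """
    tag_w = {t: alpha * len(BASES) for t in TAGS}
    emit_w = {t: {b: alpha for b in BASES} for t in TAGS}
    for base, tag, w in weighted_obs:
        tag_w[tag] += w
        emit_w[tag][base] += w
    total = sum(tag_w.values())
    prior = {t: tag_w[t] / total for t in TAGS}
    emit = {t: {b: emit_w[t][b] / tag_w[t] for b in BASES} for t in TAGS}
    return prior, emit


def posterior(base, prior, emit):
    """E-step for one position: P(tag | base) under the current model."""
    joint = {t: prior[t] * emit[t][base] for t in TAGS}
    z = sum(joint.values())
    return {t: joint[t] / z for t in TAGS}


def semi_supervised_em(labeled, unlabeled, lam=0.5, iters=10):
    """Semi-supervised EM: labeled counts have weight 1, unlabeled ones lam."""
    labeled_obs = [(b, t, 1.0)
                   for seq, tags in labeled
                   for b, t in zip(seq, tags)]
    # Initialise the generative model from the labeled samples only.
    prior, emit = m_step(labeled_obs)
    for _ in range(iters):
        # E-step: soft-label every position of every unlabeled sequence.
        soft = []
        for seq in unlabeled:
            for b in seq:
                post = posterior(b, prior, emit)
                soft.extend((b, t, lam * post[t]) for t in TAGS)
        # M-step: re-estimate from labeled plus down-weighted soft counts.
        prior, emit = m_step(labeled_obs + soft)
    return prior, emit


def tag_sequence(seq, prior, emit):
    """Output the most probable tag for each position of seq."""
    out = []
    for b in seq:
        post = posterior(b, prior, emit)
        out.append(max(TAGS, key=post.get))
    return "".join(out)
```

For example, calling `semi_supervised_em([("GGGGAAAACCCC", "PPPPUUUUPPPP")], ["GGCCAAAAGGCC", "GCGCAAUAGCGC"])` and passing the result to `tag_sequence("GCAAGC", prior, emit)` yields a tag string of the same length as the input. Setting `lam` to 0 recovers purely supervised training, while larger values give the unlabeled sequences more influence. Because the stand-in model ignores base pairing, it only shows the control flow of the EM loop and says nothing about the accuracy reported in the paper.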
Source
《计算机与应用化学》 (Computers and Applied Chemistry)
CAS
CSCD
PKU Core (北大核心)
2013, Issue 9, pp. 1038-1042 (5 pages)
Funding
Natural Science Foundation of Hunan Province (12JJ4058)
Hengyang Normal University Research Fund (09A36)
Keywords
semi-supervised learning
RNA
secondary structure
stochastic grammar model