Gene splice site identification based on BERT and CNN
Abstract With the development of high-throughput sequencing technology, massive genome sequence data provide a basis for understanding genome structure. Splice site identification is an essential part of genomics research: it plays a vital role in gene discovery and in determining gene structure, and it helps in understanding how gene traits are expressed. To address the insufficient ability of existing models to extract high-dimensional features from deoxyribonucleic acid (DNA) sequences, a splice site prediction model named BERT-splice was constructed, consisting of BERT (Bidirectional Encoder Representations from Transformers) combined with a parallel Convolutional Neural Network (CNN). First, a DNA language model was trained with the BERT pre-training method to extract contextual, dynamically associated features of DNA sequences and to map those features into a high-dimensional matrix. Second, human reference genome hg19 sequence data were mapped into high-dimensional matrices by the DNA language model and used as input to a parallel CNN classifier for further training. Finally, the splice site prediction model was built on this basis. Experimental results show that BERT-splice achieves a prediction accuracy of 96.55% on the donor set of DNA splice sites and 95.80% on the acceptor set, improvements of 1.55% and 1.72%, respectively, over BERT-RCNN, a prediction model built from BERT and a Recurrent Convolutional Neural Network (RCNN). In addition, when tested on five complete human gene sequences, the model's average False Positive Rate (FPR) for donor/acceptor splice sites is 4.74%. These results verify the effectiveness of BERT-splice for gene splice site prediction.
Authors ZUO Min (左敏); WANG Hong (王虹); YAN Wenjing (颜文婧); ZHANG Qingchuan (张青川) (National Engineering Research Centre for Agri-Product Quality Traceability, Beijing Technology and Business University, Beijing 100048, China; School of E-Business and Logistics, Beijing Technology and Business University, Beijing 100048, China)
Source Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2023, No. 10, pp. 3309-3314 (6 pages)
Funding National Natural Science Foundation of China (61873027).
Keywords splice site identification; Bidirectional Encoder Representations from Transformers (BERT); Convolutional Neural Network (CNN); deep learning; DeoxyriboNucleic Acid (DNA)
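
The abstract describes feeding a high-dimensional matrix representation of a DNA sequence into a parallel CNN classifier. The following is a minimal sketch of that general idea only, not the authors' implementation: several convolution branches with different kernel widths read the same input matrix, each is max-pooled, and the pooled values are concatenated into one feature vector. Everything specific here is an assumption made for illustration: the helper names (`toy_embed`, `random_branches`, `parallel_cnn_features`), the kernel widths 3/5/7, the ReLU and max-pooling choices, and the random weights. In particular, `toy_embed` is a fixed per-nucleotide lookup standing in for the BERT DNA language model, so unlike BERT it is not context-dependent.

```python
import random


def conv1d_relu_maxpool(matrix, kernel, bias=0.0):
    """Slide one kernel (width x dim) along a (length x dim) matrix,
    apply ReLU, and return the maximum activation over all positions."""
    width = len(kernel)
    best = 0.0  # ReLU output is never below 0, so 0 is a safe starting max
    for start in range(len(matrix) - width + 1):
        s = bias
        for i in range(width):
            s += sum(x * w for x, w in zip(matrix[start + i], kernel[i]))
        best = max(best, s)  # ReLU folded into the running maximum
    return best


def parallel_cnn_features(matrix, branches):
    """branches maps kernel width -> list of kernels of that width.
    Every branch reads the same input; pooled outputs are concatenated."""
    features = []
    for width in sorted(branches):
        for kernel in branches[width]:
            features.append(conv1d_relu_maxpool(matrix, kernel))
    return features


def toy_embed(seq, dim=4, seed=0):
    """Hypothetical stand-in for the DNA language model: one fixed random
    vector per nucleotide (NOT contextual, unlike a real BERT encoder)."""
    rng = random.Random(seed)
    table = {b: [rng.uniform(-1, 1) for _ in range(dim)] for b in "ACGT"}
    return [table[b] for b in seq]


def random_branches(widths=(3, 5, 7), n_kernels=2, dim=4, seed=1):
    """Untrained random kernels, only to make the sketch executable."""
    rng = random.Random(seed)
    return {
        w: [[[rng.uniform(-1, 1) for _ in range(dim)] for _ in range(w)]
            for _ in range(n_kernels)]
        for w in widths
    }


matrix = toy_embed("CAGGTAAGTCTGAC")          # 14 positions x 4 dimensions
feats = parallel_cnn_features(matrix, random_branches())
print(len(feats))  # 3 widths x 2 kernels = 6 pooled features
```

In a trained model, the concatenated feature vector would pass through a classification layer that outputs splice site versus non-splice site. Using several kernel widths in parallel lets the classifier respond to sequence motifs of different lengths around a candidate site.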