High-accuracy Splice Site Prediction Based on Statistical Difference Table and Weighted Voting

Abstract: High-accuracy, machine-learning-based recognition of splice sites is key to eukaryotic genome annotation. In this paper, we use the chi-square test to determine the sequence window length, construct a chi-square statistical difference table to extract positional features, and combine these with dinucleotide frequencies to characterize sequences. Because positive and negative splice site samples are highly imbalanced, we build 10 support vector machine (SVM) classifiers, each trained on a balanced set of positive and negative samples, and combine their decisions by weighted voting, which effectively addresses the imbalanced classification problem. Independent tests on the HS³D dataset show prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than those of the compared methods. Positional features based on the chi-square statistical difference table effectively characterize DNA sequences and show promise for signal site recognition in molecular sequences.
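The building blocks named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not define how the statistical difference table is built, how the balanced training sets are drawn, or how the votes are weighted, so the per-position Pearson chi-square statistic, the random undersampling, and all function names below (`position_chi_square`, `dinucleotide_counts`, `balanced_subsets`, `weighted_vote`) are assumptions chosen for illustration. Sequences are assumed to be aligned, of equal length, and to contain only A, C, G, and T.

```python
import random
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCLEOTIDES = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT


def position_chi_square(pos_seqs, neg_seqs):
    """Pearson chi-square statistic at each aligned position.

    Each position yields a 2 x 4 contingency table (class x base).
    Assumes every sequence has the same length and uses only A/C/G/T.
    """
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    total = n_pos + n_neg
    stats = []
    for i in range(len(pos_seqs[0])):
        pos_counts = Counter(s[i] for s in pos_seqs)
        neg_counts = Counter(s[i] for s in neg_seqs)
        chi2 = 0.0
        for base in BASES:
            col = pos_counts[base] + neg_counts[base]
            if col == 0:
                continue  # base never observed here; contributes nothing
            for observed, row in ((pos_counts[base], n_pos), (neg_counts[base], n_neg)):
                expected = row * col / total
                chi2 += (observed - expected) ** 2 / expected
        stats.append(chi2)
    return stats


def dinucleotide_counts(seq):
    """Counts of the 16 overlapping dinucleotides, in DINUCLEOTIDES order."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [counts[d] for d in DINUCLEOTIDES]


def balanced_subsets(positives, negatives, n_subsets=10, seed=0):
    """Pair all positives with n_subsets random, equal-sized negative samples.

    Random undersampling is an assumption; requires len(negatives) >= len(positives).
    """
    rng = random.Random(seed)
    return [(positives, rng.sample(negatives, len(positives))) for _ in range(n_subsets)]


def weighted_vote(predictions, weights):
    """Combine +1/-1 predictions; the sign of the weighted sum decides the class."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1
```

In such a setup, positions with a large chi-square value would be candidates for the sequence window, each balanced pair would train one classifier (for example an SVM from a library of the reader's choice), and the weights passed to `weighted_vote` could be each classifier's validation accuracy. All three choices are assumptions rather than details taken from the paper.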
Authors: ZENG Ying; CHEN Yuan; YUAN Zhe-Ming (Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha 410128, China; Orient Science & Technology College, Hunan Agricultural University, Changsha 410128, China)
Source: Progress in Biochemistry and Biophysics, 2019, No. 5, pp. 496-503 (8 pages). Indexed in SCIE, CAS, CSCD, and the Peking University Core Journals list.
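Funding: Supported by the National Natural Science Foundation of China (61701177), the Natural Science Foundation of Hunan Province (2018JJ3225), and the Scientific Research Project of the Hunan Provincial Department of Education (17A096).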
Keywords: splice site; positional features; chi-square statistical difference table; weighted voting; support vector machine (SVM)