Abstract
High-accuracy splice site recognition based on machine learning is key to eukaryotic genome annotation. In this paper, we use the chi-square test to determine the sequence window length, construct a chi-square statistical difference table to extract positional features, and combine these with dinucleotide frequencies to characterize sequences. To address the severe imbalance between positive and negative splice site samples, we build 10 support vector machine (SVM) classifiers, each trained on a balanced set of positive and negative samples, and combine them by weighted voting, which effectively handles the imbalanced classification problem. Independent tests on the HS³D dataset show prediction accuracies of 93.39% for donor sites and 90.46% for acceptor sites, clearly higher than those of the reference methods. Positional features based on the chi-square statistical difference table characterize DNA sequences effectively and are promising for signal site recognition in molecular sequences.
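The feature pipeline summarized in the abstract can be illustrated with a minimal sketch. The abstract does not define the chi-square statistical difference table, so the signed per-position 2x2 chi-square statistic used here is an assumption, as are all function names; the SVM training step is omitted, and only the weighted vote over classifier outputs is shown.

```python
from collections import Counter
from itertools import product

BASES = "ACGT"
DINUCS = ["".join(p) for p in product(BASES, repeat=2)]  # AA, AC, ..., TT

def dinucleotide_counts(seq):
    """Counts of the 16 overlapping dinucleotides in seq, in DINUCS order."""
    counts = Counter(seq[i:i + 2] for i in range(len(seq) - 1))
    return [counts[d] for d in DINUCS]

def chi2_2x2(a, b, c, d):
    """Pearson chi-square of the 2x2 table [[a, b], [c, d]], no continuity correction."""
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return 0.0 if denom == 0 else n * (a * d - b * c) ** 2 / denom

def signed_chi2_table(pos_seqs, neg_seqs):
    """Assumed stand-in for the paper's table: table[i][base] is the chi-square
    statistic for `base` at position i, positive if the base is relatively
    more frequent in the positive set and negative otherwise."""
    n_pos, n_neg = len(pos_seqs), len(neg_seqs)
    table = []
    for i in range(len(pos_seqs[0])):
        row = {}
        for base in BASES:
            a = sum(s[i] == base for s in pos_seqs)  # positives with base at i
            c = sum(s[i] == base for s in neg_seqs)  # negatives with base at i
            stat = chi2_2x2(a, n_pos - a, c, n_neg - c)
            sign = 1.0 if a * n_neg >= c * n_pos else -1.0
            row[base] = sign * stat
        table.append(row)
    return table

def positional_features(seq, table):
    """Look up each base of seq in the table; unknown symbols map to 0."""
    return [table[i].get(base, 0.0) for i, base in enumerate(seq)]

def weighted_vote(predictions, weights):
    """Combine +1/-1 predictions from several classifiers by weighted sum."""
    score = sum(w * p for p, w in zip(predictions, weights))
    return 1 if score >= 0 else -1
```

In a full pipeline, one would presumably concatenate `positional_features` and `dinucleotide_counts` into a single vector per sequence window, train each of the 10 classifiers on all positive samples plus an equally sized subset of negatives, and pass their outputs to `weighted_vote`; how the paper sets the voting weights is not stated in the abstract.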
Authors
曾莹
陈渊
袁哲明
ZENG Ying; CHEN Yuan; YUAN Zhe-Ming (Hunan Engineering & Technology Research Center for Agricultural Big Data Analysis & Decision-making, Hunan Agricultural University, Changsha 410128, China; Orient Science & Technology College, Hunan Agricultural University, Changsha 410128, China)
Source
《生物化学与生物物理进展》
SCIE
CAS
CSCD
PKU Core (Peking University Core Journals)
2019, Issue 5, pp. 496-503 (8 pages)
Progress In Biochemistry and Biophysics
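Funding
National Natural Science Foundation of China (61701177)
Natural Science Foundation of Hunan Province (2018JJ3225)
Scientific Research Project of the Education Department of Hunan Province (17A096)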
Keywords
splice site
positional features
chi-square statistical difference table
weighted voting
support vector machine (SVM)