Abstract
Glycosylation is a major post-translational modification of proteins. Because no fixed sequence pattern for O-glycosylation is known, accurately identifying O-glycosylation sites is a challenging problem for machine learning. Using the Steentoft dataset, the largest human O-glycosylation site dataset to date, this paper proposes for the first time a position-based chi-square difference table feature (χ^2-pos) and combines it with the pseudo position-specific scoring matrix (PsePSSM), which captures sequence evolutionary information, and the undirected composition of k-spaced amino acid pairs (Undirected-CKSAAP) to represent protein sequences. Five support vector machine classifiers were built, each trained on balanced positive and negative samples, and their outputs were combined by weighted voting. In independent testing, the accuracy, Matthews correlation coefficient, and area under the ROC curve reached 89.62%, 0.79, and 0.96, respectively, clearly better than previously reported results. The combination of the three features χ^2-pos, PsePSSM, and Undirected-CKSAAP has broad application prospects for predicting protein modification sites such as glycosylation and phosphorylation sites.
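To make the Undirected-CKSAAP feature more concrete, the following is a minimal Python sketch of one plausible reading of it: for each gap k, count residue pairs separated by k positions while treating the pairs AB and BA as the same pair, which gives 210 unordered pairs per gap instead of 400 ordered ones. The function name `undirected_cksaap`, the `k_max` parameter, the normalization by the number of valid pairs, and the skipping of non-standard residues are all assumptions made for illustration; the abstract does not state the window length, the range of k, or the normalization the authors used.

```python
from itertools import combinations_with_replacement

# The 20 standard amino acids in alphabetical order (one-letter codes).
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_SET = set(AMINO_ACIDS)

# All 210 unordered pairs, e.g. "AA", "AC", ..., "YY".
# Each pair string is alphabetically sorted because AMINO_ACIDS is sorted.
PAIRS = ["".join(p) for p in combinations_with_replacement(AMINO_ACIDS, 2)]
PAIR_INDEX = {pair: i for i, pair in enumerate(PAIRS)}


def undirected_cksaap(window, k_max=2):
    """Return a list of 210 * (k_max + 1) pair frequencies for a sequence window.

    Hypothetical helper: k_max and the normalization are illustrative choices,
    not values taken from the paper.
    """
    features = []
    for k in range(k_max + 1):
        counts = [0] * len(PAIRS)
        total = 0
        # Residues at positions i and i + k + 1 are separated by k residues.
        for i in range(len(window) - k - 1):
            a, b = window[i], window[i + k + 1]
            if a not in AA_SET or b not in AA_SET:
                continue  # skip padding or non-standard residues (assumption)
            key = "".join(sorted((a, b)))  # "CA" and "AC" map to the same pair
            counts[PAIR_INDEX[key]] += 1
            total += 1
        # Normalize counts to frequencies; an all-zero block if no valid pairs.
        features.extend(c / total if total else 0.0 for c in counts)
    return features
```

Under these assumptions, a window such as `"ACA"` with `k_max=1` yields a 420-dimensional vector in which the `"AC"` entry of the k = 0 block and the `"AA"` entry of the k = 1 block are both 1.0. Merging the two orientations of a pair roughly halves the dimensionality compared with ordered CKSAAP.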
Source
Progress in Biochemistry and Biophysics (《生物化学与生物物理进展》)
SCIE
CAS
CSCD
PKU Core (北大核心)
2016, Issue 7, pp. 691-698 (8 pages)
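Funding
Supported by the Specialized Research Fund for the Doctoral Program of Higher Education (20124320110002)
Hunan Provincial Natural Science Foundation (14JJ2082)
Changsha Science and Technology Program (K1406018-21)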
Keywords
O-glycosylation site prediction
chi-square score difference table
pseudo position-specific scoring matrix
undirected composition of k-spaced amino acid pairs
weighted voting