Abstract
A fused feature representation that combines pseudo amino acid composition (PseAA) and the position-specific scoring matrix (PSSM) is first adopted to represent protein sequences. This representation converts each protein sequence into a feature vector that retains most of the original sequence information, but the resulting vectors are high-dimensional, which complicates the prediction of protein subcellular localization. In addition, large numbers of labeled protein subcellular localization samples are currently difficult to obtain. To address these problems, a semi-supervised dimensionality reduction algorithm, Semi-Supervised Maximum Variance Projections (SS-MVP), is introduced to reduce the dimensionality of the feature vectors while extracting information useful for classification from both labeled and unlabeled samples. A Support Vector Machine (SVM) is then applied to the reduced samples to predict the protein subcellular localization type. Experimental results show that this approach both simplifies the prediction system for protein subcellular localization and improves its classification performance.
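The abstract does not give the exact feature definitions, so the following is only a minimal sketch of how a fused PseAA and PSSM feature vector might be assembled. Several choices here are assumptions rather than the authors' settings: Kyte-Doolittle hydropathy is used as the single physicochemical property, the parameters `lam` and `w` are arbitrary, and the PSSM is reduced to 20 values by logistic squashing and column averaging. The function names `pseaa`, `pssm_profile`, and `fused_features` are hypothetical. The SS-MVP and SVM stages are not shown.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Kyte-Doolittle hydropathy values (an assumed property choice for this sketch).
HYDROPATHY = {
    "A": 1.8, "C": 2.5, "D": -3.5, "E": -3.5, "F": 2.8,
    "G": -0.4, "H": -3.2, "I": 4.5, "K": -3.9, "L": 3.8,
    "M": 1.9, "N": -3.5, "P": -1.6, "Q": -3.5, "R": -4.5,
    "S": -0.8, "T": -0.7, "V": 4.2, "W": -0.9, "Y": -1.3,
}


def _standardize(values):
    """Rescale property values to zero mean and unit variance over the 20 residues."""
    mean = sum(values.values()) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values.values()) / len(values))
    return {k: (v - mean) / std for k, v in values.items()}


H = _standardize(HYDROPATHY)


def pseaa(seq, lam=2, w=0.05):
    """Return a (20 + lam)-dimensional pseudo amino acid composition vector."""
    seq = [r for r in seq.upper() if r in H]  # drop non-standard residues
    n = len(seq)
    if n <= lam:
        raise ValueError("sequence must be longer than lam")
    # First 20 components: plain amino acid frequencies.
    freqs = [seq.count(a) / n for a in AMINO_ACIDS]
    # Next lam components: sequence-order correlation factors at gaps 1..lam.
    thetas = [
        sum((H[seq[i]] - H[seq[i + k]]) ** 2 for i in range(n - k)) / (n - k)
        for k in range(1, lam + 1)
    ]
    denom = sum(freqs) + w * sum(thetas)
    return [f / denom for f in freqs] + [w * t / denom for t in thetas]


def pssm_profile(pssm):
    """Squash PSSM scores with a logistic function, then average each column."""
    n = len(pssm)
    return [
        sum(1.0 / (1.0 + math.exp(-row[j])) for row in pssm) / n
        for j in range(20)
    ]


def fused_features(seq, pssm, lam=2, w=0.05):
    """Concatenate the PseAA vector and the PSSM profile into one feature vector."""
    return pseaa(seq, lam, w) + pssm_profile(pssm)


if __name__ == "__main__":
    seq = "MKTAYIAKQR"
    toy_pssm = [[0] * 20 for _ in seq]  # placeholder scores, not real PSI-BLAST output
    print(len(fused_features(seq, toy_pssm)))
```

Concatenation is the simplest way to fuse the two descriptors. It also shows why the dimensionality grows with each added descriptor, which is the motivation the abstract gives for the reduction step.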
Source
Journal of Shanghai Polytechnic University (《上海第二工业大学学报》)
2015, Issue 3, pp. 260-265 (6 pages)
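Funding
Supported by:
National Natural Science Foundation of China (No. 61301249, No. 61272036)
Natural Science Foundation of Shanghai (No. 15ZR1417000)
Shanghai Municipal Education Commission Excellent Young Teachers Program (No. ZZegd14001)
Shanghai Polytechnic University Fund (No. EGD14XQD13)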
Keywords
protein subcellular localization
prediction
semi-supervised
dimension reduction