Abstract
With the development of automated protein sequencing that incorporates new technologies such as laser analysis, protein sequences have become increasingly easy to obtain, and predicting protein structure from sequence has become an important research problem. Protein contact map prediction is an intermediate step in protein tertiary structure prediction and a typical classification problem with an extremely imbalanced dataset: residue pairs not in contact far outnumber pairs in contact. Unlike problems such as text classification, protein contact map prediction has low feature dimensionality, so the dataset cannot be optimized through feature selection. To reduce the size of the majority class effectively, a data under-sampling preprocessing method combined with clustering is proposed, bringing the distributions of the contact and non-contact classes closer to balance. Experiments show that a support vector machine can classify the data effectively on the resulting protein dataset.
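The idea of cluster-based under-sampling can be illustrated with a minimal sketch. The abstract does not state which clustering algorithm or selection rule is used, so the choices here are assumptions: plain k-means on the majority (non-contact) class, keeping the one real sample nearest each centroid. All function names and the toy data are hypothetical.

```python
import random


def sq_dist(a, b):
    """Squared Euclidean distance between two equal-length tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean(points):
    """Component-wise mean of a non-empty list of tuples."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))


def kmeans(points, k, iters=20, seed=0):
    """Plain k-means; returns (centroids, clusters) after the last update."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initialise from k distinct samples
    clusters = []
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda j: sq_dist(p, centroids[j]))
            clusters[nearest].append(p)
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[j]
                     for j, c in enumerate(clusters)]
    return centroids, clusters


def undersample_majority(majority, n_keep, iters=20, seed=0):
    """Assumed rule: keep, per cluster, the real sample nearest the centroid.

    May return fewer than n_keep samples if some clusters end up empty.
    """
    k = min(n_keep, len(majority))
    centroids, clusters = kmeans(majority, k, iters, seed)
    return [min(members, key=lambda p: sq_dist(p, c))
            for c, members in zip(centroids, clusters) if members]


# Toy data: 200 majority (non-contact) vs. 10 minority (contact) samples.
rng = random.Random(42)
majority = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(200)]
minority = [(rng.gauss(3, 0.5), rng.gauss(3, 0.5)) for _ in range(10)]

kept = undersample_majority(majority, n_keep=len(minority))
# Label 0 = non-contact, 1 = contact; this set would then go to a classifier.
balanced = [(p, 0) for p in kept] + [(p, 1) for p in minority]
```

Keeping real samples rather than synthetic centroids means every retained point is an actual residue pair. The resulting `balanced` list is what a support vector machine, or any other classifier, would be trained on.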
Source
Laser Journal (《激光杂志》)
PKU Core Journal (Peking University core journal list)
2015, No. 6, pp. 114-117 (4 pages)
Funding
Natural Science Foundation Project of the Chongqing Science and Technology Commission (cstc2011jj A10054)
Keywords
Laser
Protein contact map prediction
Imbalanced dataset
Under-sampling
Clustering