Abstract
Graph clustering can give good results for protein classification, provided that the complex relationships among proteins are first converted into a suitable similarity network that serves as the input to the clustering. This paper presents a method for constructing such similarity networks based on BLAST searches. Starting from a target protein sequence, several rounds of BLAST search progressively extract from the database the sequences that are directly or indirectly related to the target, and these sequences form a correlation set. The similarity relations among the sequences in the correlation set constitute the similarity network, which serves as the classification basis for the graph clustering algorithm. The method was tested on a selected set of proteins from the Pfam database that are difficult to classify correctly using direct similarity relations alone. The results showed that such similarity networks, in conjunction with our graph clustering algorithm, yielded accurate family assignments for most of the proteins in this hard test set, namely 83.2%.
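The multi-round construction of the correlation set can be illustrated with a minimal sketch. It is not the authors' implementation: the `search` callable is a hypothetical stand-in for one BLAST query against the database, and the function name, the `rounds` and `min_score` parameters, and the toy data are all assumptions made for illustration.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Set, Tuple

# Hypothetical interface: search(seq_id) yields (hit_id, score) pairs,
# standing in for one BLAST query against the sequence database.
SearchFn = Callable[[str], Iterable[Tuple[str, float]]]


def build_similarity_network(
    target: str,
    search: SearchFn,
    rounds: int = 3,
    min_score: float = 0.0,
) -> Tuple[Set[str], Dict[FrozenSet[str], float]]:
    """Expand a correlation set from `target` over several search rounds."""
    correlation_set: Set[str] = {target}
    frontier: Set[str] = {target}
    edges: Dict[FrozenSet[str], float] = {}

    for _ in range(rounds):
        next_frontier: Set[str] = set()
        for query in frontier:
            for hit, score in search(query):
                if hit == query or score < min_score:
                    continue  # skip self-hits and weak hits
                key = frozenset((query, hit))
                # Keep the best score seen for an undirected pair.
                edges[key] = max(score, edges.get(key, score))
                if hit not in correlation_set:
                    correlation_set.add(hit)
                    next_frontier.add(hit)
        if not next_frontier:
            break  # no new sequences were found; stop early
        frontier = next_frontier

    return correlation_set, edges


# Toy data (assumed, not from the paper): P3 is reachable from P0 only indirectly.
toy_db = {
    "P0": [("P1", 90.0)],
    "P1": [("P0", 90.0), ("P2", 60.0)],
    "P2": [("P1", 60.0), ("P3", 40.0)],
    "P3": [("P2", 40.0)],
}


def toy_search(seq_id: str) -> Iterable[Tuple[str, float]]:
    return toy_db.get(seq_id, [])


members, edges = build_similarity_network("P0", toy_search, rounds=2)
```

In this sketch, edges are recorded only for sequences that were actually used as queries, so sequences added in the final round contribute no relations among themselves; a fuller version would need one more pass over them. The resulting weighted edge set is the kind of input a graph clustering algorithm could consume.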
Source
Industrial Microbiology (《工业微生物》)
CAS
CSCD
2015, No. 3, pp. 53-57 (5 pages)
Keywords
similarity networks
graph clustering
protein family classification