Abstract: Biological sequence databases often contain redundant sequences, which make reliable statistical analysis difficult. Removing redundant sequences to identify the real protein families and their representatives in a large sequence dataset is therefore an important task in bioinformatics. The problem of removing redundant protein sequences can be modeled as finding a maximum independent set in a graph, which is an NP-hard problem. This paper presents a novel program, FastCluster, based on graph theory. The algorithm improves on Hobohm and Sander's algorithm for generating non-redundant protein sequence sets. FastCluster uses BLAST to determine the similarity between two sequences in order to obtain a better similarity measure. Its performance is compared with that of Hobohm and Sander's algorithm; the comparison shows that FastCluster produces a reasonable non-redundant protein set and supports similarity cut-offs from 0.0 to 1.0. The proposed algorithm generates a larger maximal non-redundant (independent) protein set that is closer to the true result (the maximum independent set of the graph), which means that all the protein families are clustered. This makes FastCluster a valuable tool for removing redundant protein sequences.
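The graph model mentioned in the abstract can be illustrated with a small sketch. It assumes that two sequences are joined by an edge whenever their pairwise similarity is at or above the cut-off, so that a non-redundant set is an independent set of the graph. The similarity scores below are made-up toy values rather than BLAST output, the function names are hypothetical, and the lowest-degree-first greedy rule is a generic heuristic chosen for illustration; it is not FastCluster's procedure and not necessarily the rule used by Hobohm and Sander.

```python
def build_redundancy_graph(similarity, cutoff):
    """Join two sequences by an edge when their similarity is >= cutoff."""
    ids = sorted({seq for pair in similarity for seq in pair})
    graph = {seq: set() for seq in ids}
    for (a, b), score in similarity.items():
        if score >= cutoff:
            graph[a].add(b)
            graph[b].add(a)
    return graph


def greedy_independent_set(graph):
    """Return a maximal (not necessarily maximum) independent set.

    Repeatedly keeps the remaining vertex with the fewest remaining
    neighbours, then discards that vertex and all of its neighbours.
    """
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}
    chosen = []
    while remaining:
        # Ties are broken by sequence identifier to keep the result deterministic.
        v = min(remaining, key=lambda x: (len(remaining[x]), x))
        chosen.append(v)
        removed = remaining[v] | {v}
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return sorted(chosen)


# Hypothetical pairwise similarities (toy values, not real BLAST scores).
toy_similarity = {
    ("A", "B"): 0.9,
    ("B", "C"): 0.8,
    ("C", "D"): 0.2,
    ("A", "C"): 0.3,
}

graph = build_redundancy_graph(toy_similarity, cutoff=0.5)
representatives = greedy_independent_set(graph)
print(representatives)
```

With a cut-off of 0.5 only the pairs A-B and B-C count as redundant, so the sketch keeps A, C, and D and drops B. Because the underlying problem is NP-hard, a greedy rule like this guarantees only a maximal set, which may be smaller than the maximum independent set on larger graphs.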
Funding: Supported by the National Natural Science Foundation of China (Nos. 10971248, 11101057), the Anhui Provincial Natural Science Foundation (No. 10040606Q45), and the Postdoctoral Science Foundation of Jiangsu Province (No. 1102095C).
Abstract: An L(2, 1)-labelling of a graph G is a function f from the vertex set V(G) to the set of all nonnegative integers such that |f(u) - f(v)| ≥ 2 if d_G(u, v) = 1 and |f(u) - f(v)| ≥ 1 if d_G(u, v) = 2. The L(2, 1)-labelling problem is to find the smallest number, denoted by λ(G), such that there exists an L(2, 1)-labelling function with no label greater than it. In this paper, we study this problem for trees. Our results improve the result of Wang [The L(2, 1)-labelling of trees, Discrete Appl. Math. 154 (2006) 598-603].
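The definition above can be checked mechanically on small graphs. The following sketch tests the two distance conditions using breadth-first search and finds the smallest admissible largest label by exhaustive search. The function names are hypothetical, and the exhaustive search is only an illustration for tiny trees; it is not the method of the paper or of Wang.

```python
from collections import deque
from itertools import product


def distances_from(graph, source):
    """Breadth-first search distances from source in an unweighted graph."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        v = queue.popleft()
        for w in graph[v]:
            if w not in dist:
                dist[w] = dist[v] + 1
                queue.append(w)
    return dist


def is_l21_labelling(graph, f):
    """Check that labels differ by >= 2 at distance 1 and by >= 1 at distance 2."""
    for u in graph:
        for v, d in distances_from(graph, u).items():
            gap = abs(f[u] - f[v])
            if d == 1 and gap < 2:
                return False
            if d == 2 and gap < 1:
                return False
    return True


def smallest_span(graph):
    """Smallest k such that a valid labelling uses only labels 0..k.

    Tries every assignment, so it is exponential in the number of vertices.
    """
    vertices = sorted(graph)
    k = 0
    while True:
        for labels in product(range(k + 1), repeat=len(vertices)):
            if is_l21_labelling(graph, dict(zip(vertices, labels))):
                return k
        k += 1


# A star with centre "c" and three leaves (a small tree used as a toy example).
star = {"c": {"x", "y", "z"}, "x": {"c"}, "y": {"c"}, "z": {"c"}}
print(smallest_span(star))
```

For this star the centre can take label 0 and the leaves labels 2, 3, and 4, and no labelling with largest label 3 exists, so the search returns 4.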