摘要
As explored by biologists, there is a real and emerging need to identify co-regulated gene clusters, which include both positive and negative regulated gene clusters. However, the existing pattern-based and tendency-based clustering approaches are only designed for finding positive regulated gene clusters. In this paper, a new subspace clustering model called g-Cluster is proposed for gene expression data. The proposed model has the following advantages: 1) find both positive and negative co-regulated genes in a shot, 2) get away from the restriction of magnitude transformation relationship among co-regulated genes, and 3) guarantee quality of clusters and significance of regulations using a novel similarity measurement gCode and a user-specified regulation threshold δ, respectively. No previous work measures up to the task which has been set. Moreover, MDL technique is introduced to avoid insignificant g-Clusters generated. A tree structure, namely GS-tree, is also designed, and two algorithms combined with efficient pruning and optimization strategies to identify all qualified g-Clusters. Extensive experiments are conducted on real and synthetic datasets. The experimental results show that 1) the algorithm is able to find an amount of co-regulated gene clusters missed by previous models, which are potentially of high biological significance, and 2) the algorithms are effective and efficient, and outperform the existing approaches.
As explored by biologists, there is a real and emerging need to identify co-regulated gene clusters, which include both positive and negative regulated gene clusters. However, the existing pattern-based and tendency-based clustering approaches are only designed for finding positive regulated gene clusters. In this paper, a new subspace clustering model called g-Cluster is proposed for gene expression data. The proposed model has the following advantages: 1) find both positive and negative co-regulated genes in a shot, 2) get away from the restriction of magnitude transformation relationship among co-regulated genes, and 3) guarantee quality of clusters and significance of regulations using a novel similarity measurement gCode and a user-specified regulation threshold δ, respectively. No previous work measures up to the task which has been set. Moreover, MDL technique is introduced to avoid insignificant g-Clusters generated. A tree structure, namely GS-tree, is also designed, and two algorithms combined with efficient pruning and optimization strategies to identify all qualified g-Clusters. Extensive experiments are conducted on real and synthetic datasets. The experimental results show that 1) the algorithm is able to find an amount of co-regulated gene clusters missed by previous models, which are potentially of high biological significance, and 2) the algorithms are effective and efficient, and outperform the existing approaches.
基金
This work is supported by the National Grand Fundamental Research 973 Program of China (Grant No. 2006CB303103) and the National Natural Science Foundation of China under Grants No. 60573089, No. 60273079 and No. 60473074.