[Objective] This paper aimed to provide a new method for genetic data clustering by analyzing the clustering performance of a genetic data clustering algorithm based on the minimum coding length. [Method] Genetic data clustering was treated as the clustering of high-dimensional mixed data. After preprocessing, the dimensionality of the genetic data was reduced by principal component analysis, after which the data presented a Gaussian-like distribution. Data with this distribution could be clustered effectively through lossy data compression, using a simple clustering algorithm that achieves its best clustering result when the length of the code needed to encode the clustered genes reaches its minimum. This algorithm and traditional clustering algorithms were applied to genetic data from yeast and Arabidopsis, and the effectiveness of the algorithm was verified through internal evaluation and functional evaluation of the gene clusters. [Result] The new algorithm clustered the data better than the traditional clustering algorithms, and it also avoided the problems of subjective determination of clustering data and sensitivity to the initial cluster centers. [Conclusion] This study provides a new method for clustering genetic data.
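The pipeline described in the Method part (PCA reduction followed by clustering that minimizes a lossy coding length) can be illustrated with a minimal Python sketch. The abstract does not state the coding length formula or the search strategy, so everything below is an assumption: the sketch uses the zero-mean lossy coding length for Gaussian-like data, L(W) = (m + n)/2 · log2 det(I + n/(ε²m) · WᵀW) for m samples in n dimensions, adds m · (−log2(m/N)) bits per cluster for membership, and merges clusters greedily in pairs while the total length keeps decreasing. The function names (`pca_reduce`, `coding_length`, `cluster_min_coding_length`) and the distortion parameter `eps` are hypothetical and are not taken from the paper.

```python
import numpy as np


def pca_reduce(X, n_components):
    """Project the rows of X onto the top principal components (via SVD)."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T


def coding_length(W, eps):
    """Assumed zero-mean lossy coding length (in bits) of the rows of W."""
    m, n = W.shape
    mat = np.eye(n) + (n / (eps ** 2 * m)) * (W.T @ W)
    _, logdet = np.linalg.slogdet(mat)  # natural log of the determinant
    return 0.5 * (m + n) * logdet / np.log(2)


def segment_cost(W, total, eps):
    """Bits to code one cluster plus the membership of its samples."""
    m = W.shape[0]
    return coding_length(W, eps) + m * (-np.log2(m / total))


def cluster_min_coding_length(X, eps):
    """Greedy pairwise merging: stop when no merge shortens the total code."""
    total = X.shape[0]
    clusters = [[i] for i in range(total)]  # start with singletons
    costs = [segment_cost(X[c], total, eps) for c in clusters]
    while len(clusters) > 1:
        best_gain, best_pair, best_cost = 0.0, None, None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                merged = segment_cost(X[clusters[i] + clusters[j]], total, eps)
                gain = costs[i] + costs[j] - merged
                if gain > best_gain:
                    best_gain, best_pair, best_cost = gain, (i, j), merged
        if best_pair is None:  # no merge reduces the coding length
            break
        i, j = best_pair
        clusters[i] = clusters[i] + clusters[j]
        costs[i] = best_cost
        del clusters[j], costs[j]
    labels = np.empty(total, dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels


if __name__ == "__main__":
    # Synthetic stand-in for a preprocessed expression matrix (not real data).
    rng = np.random.default_rng(0)
    X = rng.normal(size=(20, 8))
    Z = pca_reduce(X, n_components=3)
    print(cluster_min_coding_length(Z, eps=0.5))
```

In this sketch the number of clusters is not supplied by the user and no initial cluster centers are chosen; the only tuning parameter is the assumed distortion `eps`, which is in line with the advantages the abstract claims over traditional algorithms. The exhaustive pair search is cubic in the number of samples, so it is only suitable for small illustrative inputs.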