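Abstract
This study examines the K-means algorithm, which is widely used in cluster analysis of gene chip (microarray) data, as applied to two common types of gene chip data. The results show that different types of gene chip data call for different preprocessing methods and different similarity measures. For time-series datasets, log transformation combined with covariance as the similarity measure gave the best results. For non-time-series datasets, log transformation was the best preprocessing, and Euclidean distance, squared Euclidean distance, and Mahalanobis distance all performed well as similarity measures.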
The effects of different similarity measures and data preprocessing methods on K-means clustering of different types of gene expression data were studied. The results showed that different preprocessing methods led to significantly different outcomes under different similarity measures. For the time-course gene expression dataset, the best results were obtained with log transformation as the preprocessing step and covariance as the similarity measure. For the other datasets, log transformation was also the best preprocessing step, and three measures (Euclidean distance, squared Euclidean distance and Manhattan distance) led to better results.
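The comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the logarithm base, how covariance is used as a similarity inside K-means, or the initialization. The sketch assumes a base-2 logarithm, treats negative sample covariance as a dissimilarity, and uses random initial centroids. All function names (`log2_transform`, `neg_covariance`, `kmeans`) and the toy data are hypothetical.

```python
import math
import random


def log2_transform(rows):
    """Log2-transform strictly positive expression values (base 2 is an assumption)."""
    return [[math.log2(v) for v in row] for row in rows]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def squared_euclidean(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def neg_covariance(a, b):
    """Negative sample covariance, so a higher covariance counts as 'closer'.

    Assumption: the abstract does not say how covariance enters the algorithm.
    """
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    return -sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / (n - 1)


def kmeans(rows, k, dissimilarity, n_iter=100, seed=0):
    """Lloyd-style K-means with a pluggable dissimilarity function."""
    rng = random.Random(seed)
    centroids = [list(r) for r in rng.sample(rows, k)]
    labels = None
    for _ in range(n_iter):
        # Assignment step: each profile goes to its least dissimilar centroid.
        new_labels = [
            min(range(k), key=lambda j: dissimilarity(row, centroids[j]))
            for row in rows
        ]
        if new_labels == labels:
            break  # assignments are stable, so the centroids would not change
        labels = new_labels
        # Update step: each centroid becomes the mean profile of its members.
        for j in range(k):
            members = [r for r, lab in zip(rows, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centroids


# Hypothetical toy profiles: two rising and two falling genes over four time points.
toy = [
    [1.0, 2.0, 4.0, 8.0],
    [1.1, 2.1, 4.2, 7.9],
    [8.0, 4.0, 2.0, 1.0],
    [7.8, 4.1, 2.1, 1.1],
]
logged = log2_transform(toy)
results = {}
for name, metric in [
    ("euclidean", euclidean),
    ("squared_euclidean", squared_euclidean),
    ("covariance", neg_covariance),
]:
    results[name], _ = kmeans(logged, k=2, dissimilarity=metric)
    print(name, results[name])
```

Keeping the dissimilarity as a function argument makes it possible to swap measures while holding preprocessing and initialization fixed, which is the kind of controlled comparison the abstract describes. Covariance is not a true distance, so the mean-based centroid update here is only a heuristic.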
Source
《畜牧兽医学报》
CAS
CSCD
PKU Core Journal (北大核心)
2009, Issue 2, pp. 180-184 (5 pages)
ACTA VETERINARIA ET ZOOTECHNICA SINICA
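Funding
National Natural Science Foundation of China (30771534)
Ministry of Education "Program for Changjiang Scholars and Innovative Research Team in University"
Team project "Molecular Mechanisms of Disease-Resistance Nutrition in Pigs" (IRT0555-6)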