Abstract
Cross-species sequence comparison has been widely used to predict gene function, but a growing number of experiments show that sequence similarity alone does not guarantee functional similarity. To determine a gene's function precisely, one must examine not only its sequence but also its expression, since changes in gene expression often accompany changes in gene function. Clustering gene expression profiles makes co-expressed genes and their patterns directly visible, which is an important step in studying gene function. Because gene expression in biological tissue is complex and microarray technology keeps advancing, expression datasets are growing exponentially, and clustering analysis has run into a serious bottleneck: excessive time and space complexity. Based on the characteristics of gene expression profile data, this paper proposes a Parallel Hierarchical Clustering Algorithm (PHCA) for hierarchical clustering of expression profiles. The design focuses on the load-balancing problem of the parallelization, and the algorithm is implemented as a parallel program on MPI. Performance analysis of the parallel program indicates that PHCA substantially reduces the time and space complexity of hierarchical clustering.
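The abstract does not say how PHCA balances load, so the following is only an illustrative sketch of why load balance matters in this setting, not the authors' scheme. It assumes that the dominant cost is the pairwise distance matrix between n expression profiles, that only the upper triangle is computed (row i holds n - 1 - i pairs), and that rows are divided among p workers. The function names `block_partition`, `folded_partition`, and `pair_counts` are hypothetical.

```python
def block_partition(n, p):
    """Naive split: each worker gets a contiguous block of rows."""
    size = -(-n // p)  # ceiling division
    return [list(range(r * size, min(n, (r + 1) * size))) for r in range(p)]


def folded_partition(n, p):
    """Balanced split (an assumption, not taken from the paper):
    pair long row i with short row n-1-i and deal the pairs round-robin.
    Each full pair costs (n-1-i) + i = n-1 distance computations."""
    parts = [[] for _ in range(p)]
    lo, hi, k = 0, n - 1, 0
    while lo <= hi:
        parts[k % p].append(lo)
        if hi != lo:  # middle row of an odd n has no partner
            parts[k % p].append(hi)
        lo += 1
        hi -= 1
        k += 1
    return parts


def pair_counts(parts, n):
    """Number of upper-triangle distance pairs assigned to each worker."""
    return [sum(n - 1 - i for i in rows) for rows in parts]


if __name__ == "__main__":
    n, p = 1000, 4
    print("block :", pair_counts(block_partition(n, p), n))
    print("folded:", pair_counts(folded_partition(n, p), n))
```

For n = 1000 and p = 4, the contiguous split assigns 218625 pairs to the first worker and 31125 to the last, while the folded split assigns 124875 pairs to every worker. In an MPI program, each process would take the row set that matches its rank.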
Source
Chinese Journal of Computers (《计算机学报》)
EI
CSCD
PKU Core (北大核心)
2007, Issue 2, pp. 311-316 (6 pages)
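Funding
National Natural Science Foundation of China (60533020, 60673064)
Ministry of Science and Technology of China project "Construction of Grid Computing Application Systems for Astronomy, Bioinformatics and Computational Chemistry" (2005DKA64002)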
Keywords
clustering analysis
gene expression patterns
hierarchical clustering
load balance