Abstract
Clustering is an important technique in gene expression data analysis: it groups genes with similar expression profiles in order to explore unknown gene functions. In recent years, RNA-seq technology has been widely adopted to measure gene expression levels, producing large volumes of read data that make clustering analysis of gene expression feasible. Because reads are non-uniformly distributed, read counts are usually modeled with the negative binomial distribution. Existing negative-binomial-based algorithms and traditional clustering algorithms both model the read counts directly; they do not fully account for the various sources of noise in the experiment or for the uncertainty in the measured gene expression levels, or they give insufficient consideration to the uncertainty of the cluster centers. This study proposes the PUseqClust (propagating uncertainty into RNA-seq clustering) framework for clustering RNA-seq read data. The framework first uses the PGSeq model to simulate the stochastic process of read generation. The Laplace method is then used to account for the correlation between gene expression levels across multiple conditions and replicates, which yields the uncertainty of the expression estimates. Finally, a Student's t mixture model is used to cluster the gene expression data. Experimental results show that the proposed method obtains more biologically meaningful clustering results than other methods.
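As a toy illustration of how a Laplace approximation yields an uncertainty alongside a point estimate, the sketch below fits the log expression rate of one gene from replicate read counts. It is not PGSeq or PUseqClust: it assumes a plain Poisson likelihood with a flat prior on the log rate, and the function name `laplace_log_rate` is hypothetical.

```python
import math

def laplace_log_rate(counts, iters=50, tol=1e-10):
    """Laplace approximation for theta = log(rate).

    Assumption (not from the paper): counts are i.i.d. Poisson(exp(theta))
    with a flat prior on theta. Returns the mode and the approximate
    variance (inverse of the negative second derivative at the mode).
    """
    n = len(counts)
    total = sum(counts)
    if n == 0 or total <= 0:
        raise ValueError("need at least one positive count")

    # Log-likelihood up to a constant: total * theta - n * exp(theta).
    theta = math.log(total / n + 1.0)  # start slightly above the mode
    for _ in range(iters):
        grad = total - n * math.exp(theta)   # first derivative
        hess = -n * math.exp(theta)          # second derivative (always < 0)
        step = grad / hess
        theta -= step                        # Newton update
        if abs(step) < tol:
            break

    # Gaussian approximation around the mode: variance = -1 / hess.
    variance = 1.0 / (n * math.exp(theta))
    return theta, variance

theta_hat, var_hat = laplace_log_rate([10, 12, 8])
print(theta_hat, var_hat)
```

In this simplified setting the mode is the log of the mean count and the variance is the reciprocal of the total count, so genes with few reads receive a wider uncertainty. A clustering step that propagates uncertainty could then weight each gene by such a variance rather than treating all estimates as equally reliable.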
Authors
SHI Xian-Feng (石险峰)
LIU Xue-Jun (刘学军)
ZHANG Li (张礼)
College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 211106, China; College of Information Science and Technology, Nanjing Forestry University, Nanjing 210037, China
Source
Journal of Software (《软件学报》)
EI
CSCD
Peking University Core Journal (北大核心)
2019, Issue 9, pp. 2857-2868 (12 pages)
Funding
National Natural Science Foundation of China (61170152)
Aeronautical Science Foundation of China (20151452021)
Keywords
RNA-seq
clustering analysis
negative binomial distribution
Laplace method
mixture Student's t distribution