Abstract
Gene expression data are a specific type of big data in biology. Although gene expression values are ordinary real numbers, their similarity is measured not by Euclidean distance but by whether the expression values rise and fall together. Current gene Bayesian networks use gene expression level values as the node random variables and therefore do not capture this subspace-pattern similarity. To address this, a Bayesian network disease Classification algorithm based on Gene Association analysis (BCGA) was proposed to learn Bayesian networks from class-labeled disease sample-gene expression data and to predict the class of new disease samples. Firstly, the disease samples were discretized and filtered to select genes, and the dimension-reduced gene expression values were sorted and replaced with gene column subscripts. Secondly, each gene column subscript sequence was decomposed into a set of atomic sequences of length 2, in which a frequent atomic sequence corresponds to an association between a pair of genes. Finally, causal relationships were measured by gene association entropy and used for Bayesian network structure learning. Parameter learning in BCGA also became straightforward: the conditional probability distribution of a gene node was obtained simply by counting the occurrences of the atomic sequences of that gene and of its parent-node genes. Experimental results on multiple tumor and non-tumor gene expression datasets show that, compared with existing algorithms of the same kind, BCGA significantly improves disease classification accuracy and effectively shortens analysis time. In addition, BCGA uses gene association entropy in place of conditional independence and gene atomic sequences in place of gene expression values, which allows it to fit gene expression data better.
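The first two steps described in the abstract (replacing sorted expression values with gene column subscripts, then decomposing each subscript sequence into length-2 atomic sequences) can be illustrated with a minimal Python sketch. It rests on assumptions that the abstract does not confirm: an atomic sequence is taken to be any ordered pair of genes (a, b) where a precedes b in the sorted order, "frequent" is taken to mean a simple minimum occurrence count, and the function names and toy data are hypothetical. The discretization and filtering step and the gene association entropy are left out because the abstract does not define them.

```python
from collections import Counter
from itertools import combinations


def rank_sequence(expression_row):
    """Return gene column indices ordered by ascending expression value."""
    return sorted(range(len(expression_row)), key=lambda j: expression_row[j])


def atomic_sequences(index_sequence):
    """Return all ordered pairs (a, b) with gene a before gene b.

    Assumption: every ordered pair counts, not only adjacent ones.
    """
    return list(combinations(index_sequence, 2))


def frequent_atomic_sequences(samples, min_count):
    """Count atomic sequences over all samples and keep the frequent ones.

    Assumption: 'frequent' means occurring in at least min_count samples.
    """
    counts = Counter()
    for row in samples:
        counts.update(atomic_sequences(rank_sequence(row)))
    return {pair: n for pair, n in counts.items() if n >= min_count}


# Toy data (hypothetical): 3 samples x 3 genes.
samples = [
    [0.2, 1.5, 0.9],
    [0.1, 2.0, 1.1],
    [0.8, 0.3, 0.5],
]
frequent = frequent_atomic_sequences(samples, min_count=2)
print(frequent)
```

In this toy data the first two samples share the ordering gene 0 < gene 2 < gene 1 even though their absolute values differ, so the pairs drawn from that ordering are the ones that survive the count threshold. This is the sense in which ordering-based similarity differs from a Euclidean comparison of raw values.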
Authors
李志杰
廖旭红
李元香
李青蓝
LI Zhijie; LIAO Xuhong; LI Yuanxiang; LI Qinglan (School of Information Science and Engineering, Hunan Institute of Science and Technology, Yueyang, Hunan 414006, China; School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19019, USA)
Source
《计算机应用》
CSCD
PKU Core Journal (北大核心)
2024, Issue 11, pp. 3449-3458 (10 pages)
Journal of Computer Applications
Funding
National Natural Science Foundation of China (61672391)
Natural Science Foundation of Hunan Province (2019JJ40111).
Keywords
gene expression data
frequent atomic sequence
gene association entropy
gene sequence Bayesian network
disease classification