Disease sample classification algorithm by Bayesian network with gene association analysis


Abstract: Gene expression data are a specific type of big data in biology. Although expression values are ordinary real numbers, their similarity is measured not by Euclidean distance but by whether the expression values rise and fall together. Existing gene Bayesian networks use gene expression levels as the node random variables and therefore do not capture the similarity of such subspace patterns. To address this, a Bayesian network disease Classification algorithm based on Gene Association analysis (BCGA) is proposed, which learns a Bayesian network from class-labeled disease sample-gene expression data and predicts the class of new disease samples. First, the disease samples are discretized and filtered to select genes, and the dimension-reduced gene expression values are sorted and replaced with their gene column subscripts. Second, each gene column subscript sequence is decomposed into a set of atomic sequences of length 2; the frequent atomic sequences in this set correspond to associations between pairs of genes. Finally, causal relationships are measured by gene association entropy and used for Bayesian network structure learning. Parameter learning in BCGA also becomes simple: the conditional probability distribution of a gene node is obtained by counting how often the atomic sequences of that gene and of its parent-node genes occur. Experimental results on multiple tumor and non-tumor gene expression datasets show that, compared with existing algorithms of the same kind, BCGA significantly improves disease classification accuracy and effectively shortens analysis time. In addition, because BCGA uses gene association entropy instead of conditional independence and gene atomic sequences instead of gene expression values, it fits gene expression data better.
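The first two steps described in the abstract (replacing sorted expression values with gene column subscripts, then decomposing each subscript sequence into length-2 atomic sequences and keeping the frequent ones) can be sketched as follows. This is a minimal illustration, not the authors' implementation. The function names, the toy data, and the `min_support` threshold are hypothetical, and it assumes that an atomic sequence is any ordered pair of genes in which the first gene ranks before the second in a sample's ascending order. The abstract does not say whether only adjacent pairs are intended. The discretization and filtering step and the gene association entropy are not shown, because the abstract does not define them.

```python
from collections import Counter
from itertools import combinations


def to_index_sequence(expression_row):
    """Sort one sample's expression values in ascending order and
    return the gene column indices in that order (ties keep column order)."""
    return sorted(range(len(expression_row)), key=lambda j: expression_row[j])


def atomic_sequences(index_sequence):
    """Assumption: an atomic sequence is an ordered pair (a, b) such that
    gene a comes before gene b in the sorted index sequence."""
    return list(combinations(index_sequence, 2))


def frequent_atomic_sequences(samples, min_support):
    """Count atomic sequences over all samples and keep those whose
    relative frequency (support) is at least min_support."""
    counts = Counter()
    for row in samples:
        counts.update(atomic_sequences(to_index_sequence(row)))
    n = len(samples)
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}


# Toy data (hypothetical): 4 samples x 3 genes.
samples = [
    [0.2, 1.5, 0.9],
    [0.1, 2.0, 1.1],
    [1.3, 0.4, 0.8],
    [0.3, 1.9, 0.5],
]
frequent = frequent_atomic_sequences(samples, min_support=0.75)
print(frequent)
```

In this toy data, three of the four samples rank the genes in the order 0, 2, 1, so the pairs (0, 2), (0, 1), and (2, 1) each reach a support of 0.75. Under the stated assumption, such frequent pairs are the candidate gene associations that later steps would score.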

Authors: LI Zhijie; LIAO Xuhong; LI Yuanxiang; LI Qinglan (School of Information Science and Engineering, Hunan Institute of Science and Technology, Yueyang, Hunan 414006, China; School of Computer Science, Wuhan University, Wuhan, Hubei 430072, China; Perelman School of Medicine, University of Pennsylvania, Philadelphia, Pennsylvania 19019, USA)

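Affiliations: School of Information Science and Engineering, Hunan Institute of Science and Technology; School of Computer Science, Wuhan University; Perelman School of Medicine, University of Pennsylvania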

Source: Journal of Computer Applications (《计算机应用》), indexed in CSCD and the Peking University Core Journals list, 2024, No. 11, pp. 3449-3458 (10 pages)

Funding: National Natural Science Foundation of China (61672391); Natural Science Foundation of Hunan Province (2019JJ40111).

Keywords: gene expression data; frequent atomic sequence; gene association entropy; gene sequence Bayesian network; disease classification

Classification number: TP181 [Automation and Computer Technology: Control Theory and Control Engineering]


