Abstract
Mining latent patterns in biological data through feature extraction is one of the basic problems in bioinformatics. A 24-dimensional feature vector is constructed from three classes of features of a DNA sequence: base transition probabilities, base contents, and base position ratios. The vector is applied to similarity comparisons of the complete coding sequences of the β-globin genes of 11 species and of the whole mitochondrial genomes of 18 eutherian mammals; the resulting phylogenetic trees are consistent with the known evolutionary relationships. In addition, combining the feature vector with a support vector machine classifier identifies the essential genes of 28 bacterial strains with an average AUC of 0.808, higher than that of some other identification methods. The experimental results indicate that the transition probabilities, contents, and position ratios of the basic constituent elements of biological sequences can serve as alternative tools for related classification problems in bioinformatics.
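The 24 dimensions are consistent with 16 base-to-base transition probabilities, 4 base contents, and 4 base position ratios. The following is a minimal sketch of such a vector, not the authors' implementation. The abstract does not give the exact formulas, so the definitions here are assumptions: transition probabilities are estimated from adjacent base pairs, and the position ratio of a base is taken as the sum of its 1-based positions divided by the sum of all positions. The names `feature_vector` and `euclidean_distance` are hypothetical, and Euclidean distance is only one possible choice for the similarity comparison.

```python
from itertools import product
from math import sqrt

BASES = "ACGT"


def feature_vector(seq):
    """Return a 24-D list: 16 transition probabilities, 4 contents, 4 position ratios."""
    # Assumption: symbols other than A, C, G, T are simply dropped.
    seq = [b for b in seq.upper() if b in BASES]
    n = len(seq)
    if n < 2:
        raise ValueError("need at least two unambiguous bases")

    # 16 first-order transition probabilities P(y | x) from adjacent pairs.
    pair_counts = {x + y: 0 for x, y in product(BASES, repeat=2)}
    for x, y in zip(seq, seq[1:]):
        pair_counts[x + y] += 1
    from_counts = {x: sum(pair_counts[x + y] for y in BASES) for x in BASES}
    transitions = [
        pair_counts[x + y] / from_counts[x] if from_counts[x] else 0.0
        for x, y in product(BASES, repeat=2)
    ]

    # 4 base contents: relative frequency of each base.
    contents = [seq.count(b) / n for b in BASES]

    # 4 position ratios (assumed definition): sum of the 1-based positions
    # of a base divided by the sum of all positions, n(n+1)/2.
    total_positions = n * (n + 1) / 2
    position_ratios = [
        sum(i for i, s in enumerate(seq, start=1) if s == b) / total_positions
        for b in BASES
    ]

    return transitions + contents + position_ratios


def euclidean_distance(u, v):
    """Euclidean distance between two feature vectors of equal length."""
    return sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))
```

Computing `euclidean_distance` for every pair of sequences yields a distance matrix, which is the kind of input a distance-based tree-building method would take. The same vectors could also be passed as rows to a support vector machine classifier for the essential-gene task.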
Authors
李玉双
魏东
吕艳芬
LI Yushuang; WEI Dong; LU Yanfen (School of Sciences, Yanshan University, Qinhuangdao, Hebei 066004, China)
Source
《燕山大学学报》
CAS
PKU Core (Peking University Core Journals)
2018, No. 1, pp. 59-66, 74 (9 pages in total)
Journal of Yanshan University
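Funding
Hebei Province Youth Top-notch Talent Program for Higher Education Institutions (BJ2014060)
Yanshan University "Xinrui Project" Talent Support Program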
Keywords
transition probability
feature vector
phylogenetic tree
essential gene
support vector machine