Differential Expression Gene Selection Algorithms for Unbalanced Gene Datasets
Abstract Classification accuracy is unsuitable for evaluating the quality of a feature subset selected from unbalanced data, so this paper proposes the F2-measure (F2 for short) criterion, which favors feature subsets that are better able to recognize cancer patients. Because the mutual information used in mRMR (minimal Redundancy-Maximal Relevance) is biased toward features with many values, the paper proposes the normalized mutual information SU (Symmetrical Uncertainty). Within the framework of maximizing AUC (Area Under an ROC Curve), the feature-label relevance and the feature-feature correlation take values on different scales, so a normalized feature weight is proposed to put them on a common scale. To speed up feature selection, a preselection step that combines SU and AUC reduces the number of candidate features.
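The normalization behind SU can be illustrated with a short sketch. It uses the standard definition SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)) for discrete variables; the function names `entropy` and `symmetrical_uncertainty` are illustrative choices, not names from the paper, and the paper's own discretization of gene expression values is not described in the abstract and is not reproduced here.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetrical_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1].

    Hypothetical helper for illustration; x and y are equal-length
    sequences of discrete (already discretized) values.
    """
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0  # both variables are constant; define SU as 0
    joint = entropy(list(zip(x, y)))   # H(X, Y)
    mutual_info = hx + hy - joint      # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)


# A binary feature and a four-valued feature that both determine the label.
label = [0, 0, 1, 1]
binary_feature = [0, 0, 1, 1]
many_valued_feature = [0, 1, 2, 3]

su_binary = symmetrical_uncertainty(binary_feature, label)
su_many = symmetrical_uncertainty(many_valued_feature, label)
```

In this toy case both features share exactly 1 bit of mutual information with the label, so raw mutual information cannot separate them. SU gives 1 for the binary feature and 2/3 for the four-valued one, because the four-valued feature's larger entropy enters the denominator.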
Dynamic Weighted Sequential Forward Search (DWSFS) and Dynamic Weighted Sequential Forward Floating Search (DWSFFS) are proposed to obtain feature subsets that are both small and strongly discriminative, using F2 and AUC together as the criterion for evaluating the importance of a feature. Building on the maximizing-AUC and mRMR frameworks and incorporating these innovations, we design 16 feature selection algorithms. Mean results over 50 repeated experiments on 7 classical binary unbalanced gene expression datasets and 3 multi-class unbalanced (or approximately balanced) gene expression datasets show that the gene subsets selected by the proposed algorithms have strong classification power, and that F2, SU, the normalized gene weight, gene preselection, DWSFS, and DWSFFS are all effective for selecting differentially expressed genes from unbalanced gene datasets. The results also show that SU is better than Spearman's rank correlation coefficient (RCC) at measuring redundancy between genes, and that weighting a gene by its relevance to the class labels minus its redundancy with other genes outperforms weighting it by relevance divided by redundancy. In experimental comparisons, the proposed algorithms outperform existing classical gene selection algorithms.
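The difference and quotient weighting schemes compared in the abstract can be sketched with a plain greedy sequential forward search. This is a minimal illustration under stated assumptions, not the paper's DWSFS or DWSFFS: the dynamic weighting, the floating step, and the F2/AUC criterion are not specified in the abstract and are omitted. The names `forward_select` and `scheme` are hypothetical, relevance and redundancy are both measured with SU on discrete features, and redundancy is taken as the mean SU with the genes already selected.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def su(x, y):
    """Symmetrical uncertainty: 2 * I(X; Y) / (H(X) + H(Y))."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        return 0.0
    return 2.0 * (hx + hy - entropy(list(zip(x, y)))) / (hx + hy)


def forward_select(features, labels, k, scheme="difference"):
    """Greedy sequential forward search over discrete features.

    features: dict mapping a feature name to its list of values.
    scheme:   "difference" scores relevance - redundancy,
              "quotient"   scores relevance / redundancy.
    Hypothetical helper for illustration only.
    """
    relevance = {name: su(col, labels) for name, col in features.items()}
    selected = []
    candidates = set(features)
    while candidates and len(selected) < k:
        def score(name):
            if not selected:
                return relevance[name]  # first pick: relevance only
            redundancy = sum(
                su(features[name], features[s]) for s in selected
            ) / len(selected)
            if scheme == "difference":
                return relevance[name] - redundancy
            # small constant avoids division by zero for zero redundancy
            return relevance[name] / (redundancy + 1e-12)

        # sorted() makes tie-breaking deterministic (alphabetical order)
        best = max(sorted(candidates), key=score)
        selected.append(best)
        candidates.remove(best)
    return selected


# Toy data: g2 duplicates g1, while g3 carries complementary information.
labels = [0, 0, 1, 1, 2, 2, 3, 3]
genes = {
    "g1": [0, 0, 0, 0, 1, 1, 1, 1],
    "g2": [0, 0, 0, 0, 1, 1, 1, 1],
    "g3": [0, 0, 1, 1, 0, 0, 1, 1],
}
chosen = forward_select(genes, labels, k=2, scheme="difference")
```

On this toy data all three genes are equally relevant to the labels, so the redundancy term decides the second pick: the duplicate `g2` is penalized and `g3` is chosen under either scheme. The two schemes can rank candidates differently on real data, which is the comparison the abstract reports.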
Authors XIE Juan-Ying (谢娟英); WANG Ming-Zhao (王明钊); ZHOU Ying (周颖); GAO Hong-Chao (高红超); XU Sheng-Quan (许升全) (School of Computer Science, Shaanxi Normal University, Xi'an 710119; College of Life Science, Shaanxi Normal University, Xi'an 710119; School of Cyber Security, University of Chinese Academy of Sciences, Beijing 100093)
Source Chinese Journal of Computers (《计算机学报》), 2019, Issue 6, pp. 1232-1251 (20 pages). Indexed in EI, CSCD, and the Peking University Core Journals list.
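Funding Supported by the National Natural Science Foundation of China (61673251), the National Key Research and Development Program of China (2016YFC0901900), the Scientific and Technological Achievements Transformation and Cultivation Project (GK201806013), the Fundamental Research Funds for the Central Universities (GK201701006), and the Graduate Training Innovation Fund (2015CXS028, 2016CSY009).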
Keywords gene selection; AUC; mutual information; mRMR; unbalanced datasets