Abstract
Omics data are typically high-dimensional, have small sample sizes, and show imbalanced class distributions. To address these problems, a model selection algorithm based on particle swarm optimization (PSO) is proposed for omics data classification. Each particle encodes a candidate combination of a sample balancing (data sampling) model, a feature selection model, a classification model, and the corresponding hyperparameters. The swarm is optimized toward the best classification performance on the omics data: particle velocities and positions are updated iteratively until a stopping criterion is met, and the best combination of models and hyperparameters is then output. Experimental results on eight real-world omics datasets show that the proposed algorithm avoids the subjective bias introduced by manual model selection and improves both the best classification performance and its stability.
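The particle encoding and the iterative velocity/position update described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the candidate pools (`SAMPLERS`, `SELECTORS`, `CLASSIFIERS`), the single `n_features` hyperparameter, the `toy_fitness` scoring function, and the PSO constants (`w`, `c1`, `c2`) are all hypothetical placeholders, since the abstract does not name them. In practice the fitness would be a cross-validated classification score of the decoded pipeline.

```python
import random

# Hypothetical candidate pools; the abstract does not list the actual models.
SAMPLERS = ["none", "oversample", "undersample"]
SELECTORS = ["filter_a", "filter_b", "embedded_c"]
CLASSIFIERS = ["clf_a", "clf_b", "clf_c"]


def pick(options, u):
    """Map a coordinate u in [0, 1] to one entry of a discrete option list."""
    return options[min(int(u * len(options)), len(options) - 1)]


def decode(position):
    """Decode a particle position in [0, 1]^4 into a pipeline configuration."""
    return {
        "sampler": pick(SAMPLERS, position[0]),
        "selector": pick(SELECTORS, position[1]),
        "classifier": pick(CLASSIFIERS, position[2]),
        # Assumed single hyperparameter: number of selected features, 5..100.
        "n_features": 5 + int(round(position[3] * 95)),
    }


def toy_fitness(config):
    """Made-up stand-in for cross-validated classification performance."""
    score = 0.5
    score += 0.2 if config["sampler"] == "oversample" else 0.0
    score += 0.1 if config["selector"] == "filter_b" else 0.0
    score += 0.1 if config["classifier"] == "clf_c" else 0.0
    score -= abs(config["n_features"] - 30) / 1000.0
    return score


def pso_select(fitness, dim=4, n_particles=20, n_iter=50,
               w=0.7, c1=1.5, c2=1.5, seed=0):
    """Standard global-best PSO over [0, 1]^dim, maximizing `fitness`."""
    rng = random.Random(seed)
    pos = [[rng.random() for _ in range(dim)] for _ in range(n_particles)]
    vel = [[0.0] * dim for _ in range(n_particles)]
    pbest = [p[:] for p in pos]
    pbest_fit = [fitness(decode(p)) for p in pos]
    g = max(range(n_particles), key=lambda i: pbest_fit[i])
    gbest, gbest_fit = pbest[g][:], pbest_fit[g]

    for _ in range(n_iter):  # stopping criterion: fixed iteration budget
        for i in range(n_particles):
            for d in range(dim):
                r1, r2 = rng.random(), rng.random()
                # Velocity: inertia + pull toward personal and global bests.
                vel[i][d] = (w * vel[i][d]
                             + c1 * r1 * (pbest[i][d] - pos[i][d])
                             + c2 * r2 * (gbest[d] - pos[i][d]))
                # Position update, clipped to the unit interval.
                pos[i][d] = min(1.0, max(0.0, pos[i][d] + vel[i][d]))
            f = fitness(decode(pos[i]))
            if f > pbest_fit[i]:
                pbest[i], pbest_fit[i] = pos[i][:], f
                if f > gbest_fit:
                    gbest, gbest_fit = pos[i][:], f
    return decode(gbest), gbest_fit


if __name__ == "__main__":
    best_config, best_score = pso_select(toy_fitness)
    print(best_config, best_score)
```

Encoding the discrete model choices as continuous coordinates in [0, 1] is one common way to let a standard PSO update handle categorical options; the paper may use a different encoding.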
Source
Journal of Shenzhen University (Science and Engineering) (《深圳大学学报(理工版)》)
Indexed in: EI, CAS, CSCD, Peking University Core Journals (北大核心)
2016, No. 3, pp. 264-271 (8 pages)
Funding
National Natural Science Foundation of China (61171125, 61471246)
Keywords
omics dataset
particle swarm optimization
data sampling
feature selection
classification model
model selection
data mining