Abstract
To address the clustering of incomplete gene expression data, an algorithm is proposed that jointly optimizes missing value imputation and clustering within the multi-objective NSGA-Ⅱ framework. The algorithm identifies the neighbor genes of each incomplete gene by Euclidean distance and uses the nearest-neighbor interval of each missing value as a constraint. A mixed encoding incorporates both missing value imputation and cluster center optimization into the NSGA-Ⅱ evolutionary process. Because the statistical information of the dataset and the clustering results jointly inform the imputation, both the imputation accuracy and the clustering performance on incomplete gene expression data improve. Experimental results on multiple gene expression datasets show that the proposed algorithm yields imputed values closer to the true expression values and more compact clusters, and that the clustering results are statistically significant.
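The nearest-neighbor interval constraint mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not give the exact definitions, so the following are assumptions: distances between incomplete rows use a partial Euclidean distance over jointly observed features, and the interval for a missing entry is the minimum and maximum of that feature over the `q` nearest rows. The names `partial_distance`, `nearest_neighbor_intervals`, and `q` are hypothetical.

```python
import numpy as np

def partial_distance(a, b):
    """Euclidean distance over jointly observed features, rescaled to full length."""
    both = ~np.isnan(a) & ~np.isnan(b)
    if not both.any():
        return np.inf  # no shared observed feature: treat as infinitely far
    d2 = np.sum((a[both] - b[both]) ** 2)
    return np.sqrt(d2 * a.size / both.sum())

def nearest_neighbor_intervals(X, q=2):
    """Map each missing (row, col) to (min, max) of that column over the q nearest rows."""
    intervals = {}
    for i, j in zip(*np.where(np.isnan(X))):
        # Candidate neighbors: other rows with an observed value in column j.
        cand = [k for k in range(X.shape[0]) if k != i and not np.isnan(X[k, j])]
        if not cand:
            continue  # no neighbor can bound this entry
        cand.sort(key=lambda k: partial_distance(X[i], X[k]))
        vals = X[cand[:q], j]
        intervals[(int(i), int(j))] = (float(vals.min()), float(vals.max()))
    return intervals

# Toy expression matrix (rows = genes, columns = samples); NaN marks a missing value.
X = np.array([
    [1.0, 2.0, np.nan],
    [1.1, 2.1, 3.0],
    [0.9, 1.9, 3.4],
    [5.0, 6.0, 9.0],
])
intervals = nearest_neighbor_intervals(X, q=2)
```

In this toy matrix the missing entry of the first gene is bounded by its two closest genes, giving the interval [3.0, 3.4]. In a scheme like the one the abstract describes, such an interval would bound the imputed value that the evolutionary search is allowed to propose.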
Authors
CHANG Qiaozhen (常巧珍)
CAO Junzhe (曹隽喆)
GU Hong (顾宏)
LI Dan (李丹)
School of Control Science and Engineering, Dalian University of Technology, Dalian 116024, China
Source
Journal of Dalian University of Technology
CAS
CSCD
PKU Core Journals
2021, No. 4, pp. 416-423 (8 pages)
Funding
Supported by the National Natural Science Foundation of China (81872247).
Keywords
gene expression data
missing value
multi-objective clustering
nearest neighbor rule