Abstract
Single nucleotide polymorphism (SNP) data are an important resource for research on genetic pathology. Such data are high-dimensional with few samples, contain substantial noise and redundancy, and exhibit linkage disequilibrium between SNP loci, so dimensionality reduction is required. This paper proposes K-MSU, an improved K-Center algorithm. K-Center is used for dimensionality reduction, and symmetric uncertainty is introduced into its distance measure to account for linkage disequilibrium among SNPs. Because random selection of the initial cluster centers in K-Center can strongly affect the clustering result, a density method based on information gain is used to select the initial centers instead. Experiments on clinical data provided by a hospital show that K-MSU achieves higher classification accuracy and better overall performance in SNP selection.
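As a rough illustration of the symmetric uncertainty measure mentioned in the abstract, the sketch below computes the standard quantity SU(X, Y) = 2 I(X; Y) / (H(X) + H(Y)) for two discrete SNP columns. The function names, the 0/1/2 genotype encoding, and the conversion to a distance as 1 - SU are assumptions made for illustration only; the abstract does not state how K-MSU defines its distance.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (in bits) of a sequence of discrete values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def symmetric_uncertainty(x, y):
    """SU(X, Y) = 2 * I(X; Y) / (H(X) + H(Y)), a value in [0, 1]."""
    hx, hy = entropy(x), entropy(y)
    if hx + hy == 0:
        # Both columns are constant; treating them as fully redundant
        # is an assumption of this sketch.
        return 1.0
    hxy = entropy(list(zip(x, y)))   # joint entropy H(X, Y)
    mutual_info = hx + hy - hxy      # I(X; Y)
    return 2.0 * mutual_info / (hx + hy)


def su_distance(x, y):
    """Hypothetical distance: strongly associated SNPs are close together."""
    return 1.0 - symmetric_uncertainty(x, y)


# Two toy SNP columns with genotypes encoded as 0, 1, 2 (assumed encoding).
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [2, 0, 1, 2, 0, 1]   # a relabelling of snp_a, so fully dependent
print(su_distance(snp_a, snp_b))
```

Because symmetric uncertainty is normalised by the two marginal entropies, it is not biased toward SNPs with more distinct genotype values, which is one common reason for preferring it over raw mutual information.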
Authors
Cao Limin (曹莉敏)
Zhou Conghua (周从华)
School of Computer Science and Telecommunication Engineering, Jiangsu University, Zhenjiang 212013, Jiangsu, China
Source
Computer Applications and Software (《计算机应用与软件》)
Peking University Core Journal (北大核心)
2020, No. 9, pp. 227-234 (8 pages)
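Funding
Jiangsu Province Key Research and Development Program (Social Development) projects (BE2016630, BE2017628)
Wuxi Municipal Health and Family Planning Commission research project (Z201603)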