Abstract
Gene expression datasets combine high dimensionality with small sample sizes, which makes classification computationally expensive and complex; selecting important features and a suitable classification algorithm is an important way to address this problem. Because the important gene loci that affect the length of the rapeseed flowering period differ between environments, a principal component and collaboration representation-based classification (PC_CRC) method is proposed, building on rapeseed gene data that has undergone preliminary dimensionality reduction, to classify rapeseed flowering periods across multiple environments. In the first step, the distance correlation (DC) method screens important gene loci from the full set of loci, significant interaction effects are then screened from among those loci, and the selected dataset T is divided into a training set T1 and a test set T2. In the second step, a new, class-balanced training set T3 is obtained by simple random sampling of T1, and T1 is trained on T3 with the collaboration representation classification (CRC) method to select the optimal number N of principal components for classifying the rapeseed flowering period. Finally, N principal components are selected from T, and the final classification result is obtained with the classification method of the second step. Through dimensionality reduction and sparse representation, PC_CRC can effectively avoid overfitting and achieve more accurate classification. Experimental results show that the proposed PC_CRC method achieves an average classification accuracy of 79.34% on rapeseed gene expression datasets from 10 environments, and outperforms machine learning methods such as decision trees, support vector machines, and random forests in 8 of those environments.
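The three steps described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: every function name (`distance_correlation`, `screen_loci`, `pca_scores`, `crc_predict`, `balanced_subsample`, `pc_crc`) and every parameter (`k`, `lam`, `candidates`) is hypothetical; CRC is assumed to take the common regularised least-squares form; class labels are treated as numeric when computing distance correlation; the interaction-effect screening is omitted; and the data are synthetic.

```python
import numpy as np


def distance_correlation(x, y):
    """Sample distance correlation between two 1-D arrays."""
    x = np.asarray(x, dtype=float).reshape(-1, 1)
    y = np.asarray(y, dtype=float).reshape(-1, 1)
    a, b = np.abs(x - x.T), np.abs(y - y.T)  # pairwise distance matrices
    # Double-centre each distance matrix.
    A = a - a.mean(axis=0) - a.mean(axis=1, keepdims=True) + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1, keepdims=True) + b.mean()
    denom = np.sqrt((A * A).mean() * (B * B).mean())
    return 0.0 if denom == 0 else float(np.sqrt(max((A * B).mean(), 0.0) / denom))


def screen_loci(X, y, k):
    """Indices of the k columns of X with the largest distance correlation to y."""
    scores = np.array([distance_correlation(X[:, j], y) for j in range(X.shape[1])])
    return np.argsort(scores)[::-1][:k]


def pca_scores(X, n):
    """Project the rows of X onto their first n principal components."""
    Xc = X - X.mean(axis=0)
    _, _, vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ vt[:n].T


def crc_predict(D, d_labels, Q, lam=0.01):
    """Assumed CRC form: code each query over all dictionary rows with ridge
    regression, then pick the class with the smallest normalised residual."""
    Dn = D / (np.linalg.norm(D, axis=1, keepdims=True) + 1e-12)
    Qn = Q / (np.linalg.norm(Q, axis=1, keepdims=True) + 1e-12)
    coef = np.linalg.solve(Dn @ Dn.T + lam * np.eye(len(Dn)), Dn @ Qn.T)
    classes = np.unique(d_labels)
    scores = []
    for c in classes:
        m = d_labels == c
        resid = np.linalg.norm(Qn - coef[m].T @ Dn[m], axis=1)
        scores.append(resid / (np.linalg.norm(coef[m], axis=0) + 1e-12))
    return classes[np.argmin(np.vstack(scores), axis=0)]


def balanced_subsample(y, rng):
    """Positions of a class-balanced simple random sample (smallest class size)."""
    classes, counts = np.unique(y, return_counts=True)
    picks = [rng.choice(np.flatnonzero(y == c), size=counts.min(), replace=False)
             for c in classes]
    return np.concatenate(picks)


def pc_crc(X, y, train, test, candidates, rng, lam=0.01):
    """Choose N on T1 with T3 as the CRC dictionary, then classify T2."""
    y1 = y[train]
    loc = balanced_subsample(y1, rng)  # T3, as positions within T1
    best_n, best_acc = candidates[0], -1.0
    for n in candidates:
        Z1 = pca_scores(X[train], n)
        acc = np.mean(crc_predict(Z1[loc], y1[loc], Z1, lam) == y1)
        if acc > best_acc:
            best_n, best_acc = n, acc
    Z = pca_scores(X, best_n)  # N principal components of the whole set T
    return best_n, crc_predict(Z[train][loc], y1[loc], Z[test], lam)


# Synthetic stand-in for one environment: 90 samples, 40 "loci", 3 classes.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 30)
X = rng.normal(size=(90, 40))
X[:, :3] += 3.0 * np.eye(3)[y]  # only the first three columns carry signal
keep = screen_loci(X, y, k=10)  # step 1: DC screening on T
perm = rng.permutation(90)
train, test = perm[:60], perm[60:]  # T1 and T2
best_n, preds = pc_crc(X[:, keep], y, train, test, [1, 2, 3, 5], rng)
print(best_n, np.mean(preds == y[test]))
```

In this sketch the balanced subsample plays the role of T3 and serves as the CRC dictionary, while the candidate list for N is an arbitrary choice made for illustration; the abstract does not state how the candidates were enumerated.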
Authors
ZHANG Zhipeng
ZHANG Liyi
(School of Information Management, Wuhan University, Wuhan 430072, China; Wuhan Qingchuan University, Wuhan 430204, China)
Source
Engineering Journal of Wuhan University (《武汉大学学报(工学版)》)
CAS
CSCD
Peking University Core Journals
2024, Issue 3, pp. 380-387 (8 pages)
Funding
National Natural Science Foundation of China (Grant No. 71874126).
Keywords
distance correlation method
gene selection
principal component
collaboration representation