Abstract
This paper proposes a data mining method for tumor gene expression profiles based on the support vector machine (SVM). First, informative genes are extracted from leukemia, colon cancer, and lung cancer data using the signal-to-noise ratio method to generate informative gene subsets. An SVM classification model is then trained on these subsets to classify the samples. Experimental results show that only four informative genes are needed for acute leukemia and for colon cancer to reach 100% classification accuracy under 10-fold cross-validation. Finally, to exclude noise genes effectively and select informative genes with higher classification accuracy, a multi-scale wavelet threshold method is applied to denoise the lung cancer data; after denoising, only five informative genes are needed to reach 96.61% classification accuracy.
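The gene-ranking step described in the abstract can be sketched as follows. The abstract does not state the exact formula, so this sketch assumes the commonly used two-class signal-to-noise statistic, (mean of class A minus mean of class B) divided by (standard deviation of class A plus standard deviation of class B). The function names `snr_scores` and `top_genes` and the toy data are hypothetical and are not taken from the paper.

```python
from statistics import mean, stdev


def snr_scores(samples, labels):
    """Return one signal-to-noise score per gene.

    samples: list of expression vectors (one list of floats per sample)
    labels:  list of 0/1 class labels, same length as samples
    Assumed formula: (mean_0 - mean_1) / (std_0 + std_1).
    """
    n_genes = len(samples[0])
    scores = []
    for g in range(n_genes):
        a = [s[g] for s, y in zip(samples, labels) if y == 0]
        b = [s[g] for s, y in zip(samples, labels) if y == 1]
        spread = stdev(a) + stdev(b)
        # Assumption: a gene that is constant in both classes gets score 0.
        scores.append((mean(a) - mean(b)) / spread if spread > 0 else 0.0)
    return scores


def top_genes(samples, labels, k):
    """Return the indices of the k genes with the largest absolute score."""
    scores = snr_scores(samples, labels)
    ranked = sorted(range(len(scores)), key=lambda g: abs(scores[g]), reverse=True)
    return ranked[:k]


# Toy data, made up for illustration: 6 samples x 3 genes.
# Gene 0 separates the classes, gene 1 does not, gene 2 is mostly noise.
toy_samples = [
    [1.0, 5.0, 2.0],
    [1.2, 5.1, 9.0],
    [0.9, 4.9, 4.0],
    [3.0, 5.0, 3.0],
    [3.1, 5.2, 8.0],
    [2.9, 4.8, 5.0],
]
toy_labels = [0, 0, 0, 1, 1, 1]
selected = top_genes(toy_samples, toy_labels, k=1)
```

In a full pipeline, the selected gene indices would define the feature subset passed to an SVM classifier and scored with 10-fold cross-validation. That step is omitted here because the abstract does not give the kernel or parameter settings.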
Source
《重庆理工大学学报(自然科学)》
CAS
2016, No. 6, pp. 102-108 (7 pages)
Journal of Chongqing University of Technology:Natural Science
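Funding
National Natural Science Foundation of China (41204115)
Natural Science Foundation of Shandong Province (ZR2013AM007, ZR2014FL021)
Science and Technology Program of Higher Education Institutions of Shandong Province (J13LI54)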
Keywords
gene expression profile
cancer classification
informative gene
signal-to-noise ratio
support vector machine