Abstract
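Tumor diagnosis based on gene expression profiles is expected to become a fast and effective diagnostic method in clinical medicine. However, gene expression data are very high-dimensional, contain few samples, and are noisy, which makes extracting tumor-related informative genes a challenging task. After analyzing the methods currently used for tumor classification, this paper proposes a hybrid feature extraction method that combines gene feature scoring with principal component analysis. Experiments show that the method extracts classification features effectively and greatly reduces the dimensionality of gene expression data while maintaining high tumor recognition accuracy, which substantially improves classifier performance. Two tumor-related gene expression datasets were used to validate the hybrid method. Classification experiments with support vector machines show that the method achieves high cross-validation accuracy and that the classification results can be visualized. The cross-validation accuracy reaches 95.16% on the colon cancer tissue sample set and 100% on the acute leukemia tissue sample set.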
Tumor diagnosis based on gene expression profiles is expected to become a fast and effective clinical method in the near future. Although DNA microarray experiments provide a huge amount of gene expression data, only a few genes are actually related to tumors. Moreover, it is difficult to extract tumor-related genes from gene expression profiles because the data are high-dimensional, the sample sets are small, and the profiles contain much noise and redundancy. In this paper we propose a novel feature extraction approach that projects high-dimensional data onto a lower-dimensional feature space, which improves the SVM-based classification performance on gene expression data. To validate the proposed approach, we examined two gene expression datasets (the colon dataset and the leukemia dataset) using SVM classifiers with different parameters. Experimental results show that SVM performs best when classifying gene expression data with the principal components extracted from the top-ranked genes selected by the gene ranking method. Using SVM classifiers, a cross-validation accuracy of 95.16% was achieved on the colon dataset and 100% on the leukemia dataset. Another advantage of the proposed method is that the sample classification results can be visualized as 2D or 3D scatter plots.
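The feature-extraction step described in the abstract (rank genes by a score, keep the top-ranked ones, then project them onto a few principal components) can be illustrated with a minimal sketch. Several things here are assumptions, not details from the paper: the abstract does not say which gene score is used, so a signal-to-noise score is substituted; the data are synthetic rather than the colon or leukemia datasets; the function names (`snr_scores`, `select_top_genes`, `pca_project`) are hypothetical; and PCA is approximated with power iteration so that the example needs only the standard library. The SVM and cross-validation stages are left out.

```python
import random
from statistics import mean, stdev


def snr_scores(X, y):
    """Assumed gene score: |mean0 - mean1| / (sd0 + sd1), one value per gene."""
    scores = []
    for g in range(len(X[0])):
        a = [row[g] for row, label in zip(X, y) if label == 0]
        b = [row[g] for row, label in zip(X, y) if label == 1]
        scores.append(abs(mean(a) - mean(b)) / (stdev(a) + stdev(b) + 1e-12))
    return scores


def select_top_genes(X, y, k):
    """Return the indices of the k highest-scoring genes."""
    scores = snr_scores(X, y)
    return sorted(range(len(scores)), key=lambda g: -scores[g])[:k]


def pca_project(X, n_components=2, iters=500):
    """Project the rows of X onto approximate leading principal components.

    Each component is found by power iteration on the sample covariance
    matrix, which is then deflated before the next component is computed.
    """
    n, d = len(X), len(X[0])
    mu = [mean(col) for col in zip(*X)]
    Z = [[v - m for v, m in zip(row, mu)] for row in X]  # centered data
    C = [[sum(z[i] * z[j] for z in Z) / (n - 1) for j in range(d)]
         for i in range(d)]
    comps = []
    for _ in range(n_components):
        v = [1.0 / (i + 1) for i in range(d)]  # deterministic start vector
        for _ in range(iters):
            w = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
            norm = sum(x * x for x in w) ** 0.5
            v = [x / norm for x in w]
        Cv = [sum(C[i][j] * v[j] for j in range(d)) for i in range(d)]
        lam = sum(v[i] * Cv[i] for i in range(d))  # eigenvalue estimate
        comps.append(v)
        C = [[C[i][j] - lam * v[i] * v[j] for j in range(d)]
             for i in range(d)]
    return [[sum(z[i] * c[i] for i in range(d)) for c in comps] for z in Z]


# Synthetic stand-in for a microarray matrix (samples x genes); not real data.
random.seed(0)
n_per_class, n_genes = 10, 50
informative = {3, 11, 19, 27, 42}  # genes whose mean shifts in class 1
y = [0] * n_per_class + [1] * n_per_class
X = [[random.gauss(4.0 if (label == 1 and g in informative) else 0.0, 1.0)
      for g in range(n_genes)]
     for label in y]

selected = select_top_genes(X, y, k=5)
X_sel = [[row[g] for g in selected] for row in X]
proj = pca_project(X_sel, n_components=2)

# Each sample is now a 2D point that could be plotted or passed to a classifier.
for label, (pc1, pc2) in zip(y, proj):
    print(label, round(pc1, 2), round(pc2, 2))
```

The two-column output corresponds to the 2D scatter-plot view mentioned in the abstract; in a fuller pipeline these projected features would be the input to an SVM evaluated by cross-validation.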
Source
《计算机工程与科学》
CSCD
2007, No. 9, pp. 84-90 (7 pages)
Computer Engineering & Science
Funding
Supported by the National Natural Science Foundation of China (60233020)
Keywords
support vector machine (SVM)
gene expression profile
tumor classification
principal component analysis