Abstract
Objective To propose a classification method based on affinity propagation clustering and semi-supervised learning for high-dimensional, redundant SELDI protein mass spectrometry data. Methods A t-test was first applied to the mass spectrometry data for preliminary dimensionality reduction. Affinity propagation clustering was then used to reduce the dimensionality further. Finally, a semi-supervised learning algorithm propagated labels so that information from both labeled and unlabeled samples was used for classification. Results The method achieved classification accuracies of 99.15% and 96.75% on the public ovarian cancer dataset OC-WCX2b and the public prostate cancer dataset PC-H4, respectively. On the clinical breast cancer dataset BC-WCX2a from Zhejiang Cancer Hospital, it achieved a classification accuracy of 95.18% and a sensitivity of 100%. Conclusion The semi-supervised learning method based on affinity propagation clustering makes effective use of the information in unlabeled mass spectrometry samples. Compared with classical supervised learning algorithms, it gives better classification performance and is more practical.
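The three stages named in the abstract can be sketched as follows. This is a minimal illustration on synthetic data, not the authors' implementation. The abstract does not give the details, so several points are assumptions: features are ranked by the absolute Welch t-statistic and the top `top_k` are kept, affinity propagation clusters the kept features (not the samples) and only the exemplar features are retained, and the semi-supervised step uses the closed-form label spreading of the local and global consistency method with an RBF kernel. All function names and parameter values (`top_k`, `damping`, `alpha`, `gamma`) are hypothetical.

```python
import numpy as np

def welch_t(X, y):
    """Absolute Welch t-statistic of every feature between classes 0 and 1."""
    a, b = X[y == 0], X[y == 1]
    se = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    return np.abs(a.mean(axis=0) - b.mean(axis=0)) / (se + 1e-12)

def affinity_propagation(S, damping=0.9, n_iter=200):
    """Return exemplar indices for similarity matrix S (diagonal = preference)."""
    n = S.shape[0]
    idx = np.arange(n)
    R = np.zeros((n, n))  # responsibilities r(i, k)
    A = np.zeros((n, n))  # availabilities a(i, k)
    for _ in range(n_iter):
        # Responsibility update: compare each candidate with the best competitor.
        AS = A + S
        top = AS.argmax(axis=1)
        first = AS[idx, top]
        AS[idx, top] = -np.inf
        second = AS.max(axis=1)
        R_new = S - first[:, None]
        R_new[idx, top] = S[idx, top] - second
        R = damping * R + (1 - damping) * R_new
        # Availability update: sum positive responsibilities per candidate.
        Rp = np.maximum(R, 0)
        Rp[idx, idx] = R[idx, idx]
        A_new = Rp.sum(axis=0)[None, :] - Rp
        diag = A_new[idx, idx].copy()
        A_new = np.minimum(A_new, 0)
        A_new[idx, idx] = diag
        A = damping * A + (1 - damping) * A_new
    return np.flatnonzero(np.diag(A + R) > 0)

def label_spreading(X, y, alpha=0.9, gamma=None):
    """Closed-form label spreading; y holds class ids, -1 marks unlabeled."""
    n = len(X)
    gamma = 1.0 / X.shape[1] if gamma is None else gamma  # assumed kernel width
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    W = np.exp(-gamma * d2)
    np.fill_diagonal(W, 0.0)
    deg = W.sum(axis=1)
    S = W / np.sqrt(np.outer(deg, deg))  # symmetric normalization
    classes = np.unique(y[y >= 0])
    Y = (y[:, None] == classes[None, :]).astype(float)  # unlabeled rows are zero
    F = np.linalg.solve(np.eye(n) - alpha * S, Y)
    return classes[F.argmax(axis=1)]

def classify(X, y, top_k=10):
    """t-test filter, then exemplar features, then label spreading."""
    lab = y >= 0
    Z = (X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12)
    # Stage 1: rank features on the labeled samples only.
    keep = np.argsort(welch_t(Z[lab], y[lab]))[-top_k:]
    # Stage 2: cluster the kept features; similarity = negative squared distance.
    F = Z[:, keep].T
    S = -((F[:, None, :] - F[None, :, :]) ** 2).sum(axis=-1)
    S[np.diag_indices_from(S)] = np.median(S[~np.eye(len(S), dtype=bool)])
    ex = affinity_propagation(S)
    if ex.size == 0:  # fall back to all kept features if no exemplar emerges
        ex = np.arange(len(keep))
    feats = keep[ex]
    # Stage 3: propagate labels over all samples, labeled and unlabeled.
    return label_spreading(Z[:, feats], y), feats

# Synthetic stand-in for spectra: 60 samples, 200 features, 10 informative.
rng = np.random.default_rng(0)
n, p = 60, 200
y_true = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y_true == 1, :10] += 4.0
y = np.full(n, -1)
labeled = np.r_[0:10, 30:40]  # 10 labeled samples per class
y[labeled] = y_true[labeled]

pred, feats = classify(X, y)
acc = (pred[y == -1] == y_true[y == -1]).mean()
print("exemplar features:", feats, "accuracy on unlabeled samples:", acc)
```

In this sketch the t-test uses only the labeled samples, while the clustering and label spreading steps also use the unlabeled ones, which is where the semi-supervised setting would be expected to help.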
Source
Space Medicine & Medical Engineering (《航天医学与医学工程》)
CAS
CSCD
Peking University Core Journals (北大核心)
2014, No. 5, pp. 367-372 (6 pages)
Funding
National Natural Science Foundation of China (60801054, 61205200)
Zhejiang Provincial Natural Science Foundation (LY12F01005)
Keywords
proteomic mass spectrometry
cluster analysis
semi-supervised learning
feature extraction