Abstract
AIM: To analyze classification methods commonly used in data mining and compare their performance when applied in a computer-aided diagnosis system.
METHODS: Two hundred cases of solitary pulmonary nodules (135 malignant, all peripheral lung cancers, and 65 benign), confirmed by pathology after surgery or needle biopsy at Beijing Friendship Hospital and Beijing Institute of Tuberculosis and Thoracic Tumor between June 1998 and December 2004, were collected. Two clinical features (age and the presence or absence of blood-streaked sputum) and five thin-section CT signs were recorded and quantified for each nodule. The 200 valid samples were randomly assigned, using random numbers, to a training set and a test set at a ratio of 7:3. Diagnostic classifiers were built with Fisher linear discriminant analysis, logistic regression, a decision tree and a neural network, and each was validated on the test samples. Sensitivity and specificity were used to evaluate the accuracy of the classifiers, and the ROC curve and the area under it were used to compare their overall diagnostic performance.
RESULTS: (1) On the 60 test cases, the sensitivities of the four classifiers were 84.6%, 87.2%, 87.2% and 87.2%, and their specificities were 85.7%, 81.0%, 76.2% and 81.0%, respectively. (2) The areas under the ROC curve for the four classifiers were 0.918, 0.918, 0.939 and 0.942; the difference in area between any two methods was not statistically significant (P = 0.8982, 0.1576, 0.3495, 0.2857, 0.4319 and 0.9868).
CONCLUSION: The methods were compared on three aspects: classification accuracy, understandability of the classifier, and usefulness as guidance for diagnosis. Logistic regression and the neural network have higher diagnostic classification accuracy; discriminant analysis, logistic regression and the decision tree yield more understandable models; and the neural network based on the BP algorithm gives better guidance for actual diagnosis. All of these methods can be used in a computer-aided diagnosis system.
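The reported percentages can be checked with a few lines of arithmetic. The sketch below is only an illustration: the abstract gives the test-set size (60) and the percentages, but not the underlying counts, so the class sizes (39 malignant, 21 benign) and the per-method counts of correctly classified cases are assumptions chosen because they reproduce the reported figures. The method labels are shorthand for the four classifiers named in the abstract.

```python
from itertools import combinations

# Test-set size reported in the abstract.
N_TEST = 60
# ASSUMPTION: class sizes inferred from the reported percentages,
# not stated in the source.
N_MALIGNANT = 39
N_BENIGN = 21

METHODS = [
    "Fisher discriminant analysis",
    "Logistic regression",
    "Decision tree",
    "BP neural network",
]

# ASSUMPTION: (true positives, true negatives) per method, chosen so that
# the resulting percentages match those reported in the abstract.
ASSUMED_COUNTS = {
    "Fisher discriminant analysis": (33, 18),
    "Logistic regression": (34, 17),
    "Decision tree": (34, 16),
    "BP neural network": (34, 17),
}


def sensitivity(true_pos, n_pos):
    """Fraction of malignant cases classified as malignant."""
    return true_pos / n_pos


def specificity(true_neg, n_neg):
    """Fraction of benign cases classified as benign."""
    return true_neg / n_neg


def as_percent(fraction):
    """Convert a fraction to a percentage rounded to one decimal place."""
    return round(100 * fraction, 1)


for name in METHODS:
    tp, tn = ASSUMED_COUNTS[name]
    sens = as_percent(sensitivity(tp, N_MALIGNANT))
    spec = as_percent(specificity(tn, N_BENIGN))
    print(f"{name}: sensitivity {sens}%, specificity {spec}%")

# Four methods give six unordered pairs, which matches the six P values
# reported for the pairwise comparison of areas under the ROC curve.
pairs = list(combinations(METHODS, 2))
print(len(pairs), "pairwise comparisons")
```

The abstract does not say which P value belongs to which pair, so the sketch only counts the pairs and does not assign P values to them.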
Source
《中国组织工程研究与临床康复》
CAS
CSCD
Peking University Core Journals (北大核心)
2007, Issue 5, pp. 879-881, 885 (4 pages)
Journal of Clinical Rehabilitative Tissue Engineering Research
Funding
Capital Medical University Basic-Clinical Cooperation Project (2003JL03)