Abstract
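Objective To establish a preprocessing method for data mining of medical expenses that converts the dependent variable (a continuous variable with a skewed distribution) into a categorical variable, so as to obtain more scientifically sound results. Methods A total of 115 patients from a survey of medical expenses for hepatitis A in Guangdong Province were studied. Median-based classification and K-means clustering were each used as the preprocessing method to classify medical expenses, a dependent variable with a skewed distribution; a support vector machine model was then built and used to analyze the factors influencing medical expenses. The optimal preprocessing method was determined by comparing prediction accuracy, model gain, and the influencing factors selected. Results The median total hospitalization expense of the 115 hepatitis A patients was 2 744.69 yuan, with a skewed distribution. When the dependent variable was classified by the median, the support vector machine model identified seven variables with the greatest influence on medical expenses (the top three were hospital level, gender, and disease type). When cluster analysis was used for preprocessing, seven variables were likewise identified (the top three were hospital level, length of stay, and payment method). Compared with median classification, preprocessing with cluster analysis raised the prediction accuracy of the support vector machine model from 91.30% to 97.39%; the gain chart rose steeply to 100.00% and then gradually flattened, indicating better model gain; and the selected influencing factors were more scientifically reasonable and consistent with practice. Conclusion Cluster analysis is an excellent preprocessing method for data mining and has good applicability.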
Objective To establish a preprocessing method for medical expense research that transforms a continuous dependent variable into a categorical variable in order to obtain more reasonable results. Methods Data on 115 patients were obtained from a survey of medical costs for patients with hepatitis A in Guangdong Province. Median classification and K-means clustering were each used as the preprocessing method to classify medical expenses, a dependent variable with a skewed distribution. A support vector machine model was then established to analyze the factors influencing medical expenses. The optimal preprocessing method was determined by comparing forecasting accuracy, model gain, and the influencing factors selected. Results The median hospitalization expense of the 115 patients was 2 774.69 yuan, with a skewed distribution. With the dependent variable classified by the median, the support vector machine model showed that seven variables had the greatest impact on medical costs (the top three were hospital level, gender, and disease type). With cluster analysis as the preprocessing method, seven variables again had the greatest impact on medical expenses (the top three were hospital level, days of hospitalization, and payment method). Compared with median classification, cluster analysis yielded higher forecasting accuracy (97.39% versus 91.30%), better model gain (the gain chart rose steeply to 100% and then gradually flattened), and more reasonable and practical influencing factors. Conclusion Cluster analysis is a good preprocessing method for data mining and showed good applicability.
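The two preprocessing options compared in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the cost values are synthetic, the number of clusters (k = 2) is an assumption the abstract does not state, the function names `median_split` and `kmeans_1d` are hypothetical, and the support vector machine step is omitted. The sketch only shows how a median cut and a one-dimensional K-means can assign different class labels to the same skewed cost variable.

```python
import random
import statistics


def median_split(values):
    """Label a value 1 if it exceeds the sample median, else 0."""
    med = statistics.median(values)
    return [1 if v > med else 0 for v in values]


def kmeans_1d(values, k=2, max_iter=100):
    """Plain Lloyd's algorithm on one-dimensional data.

    Initial centers are taken at evenly spaced positions of the sorted
    data, so the result is deterministic. k=2 is an assumption here.
    """
    ordered = sorted(values)
    n = len(ordered)
    centers = [ordered[(2 * i + 1) * n // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(max_iter):
        # Assignment step: nearest center (ties go to the lower index).
        labels = [min(range(k), key=lambda j: abs(v - centers[j])) for v in values]
        # Update step: each center becomes the mean of its members.
        new_centers = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            new_centers.append(sum(members) / len(members) if members else centers[j])
        if new_centers == centers:
            break
        centers = new_centers
    return labels, centers


# Tiny hand-checkable case: one extreme cost among small ones.
toy = [1, 2, 3, 4, 100]
toy_median_labels = median_split(toy)
toy_cluster_labels, toy_centers = kmeans_1d(toy)

# Synthetic right-skewed "costs" (log-normal), NOT the study data.
rng = random.Random(0)
costs = [rng.lognormvariate(7.9, 0.8) for _ in range(115)]
median_labels = median_split(costs)
cluster_labels, cluster_centers = kmeans_1d(costs)
print("high-cost cases by median split:", sum(median_labels))
print("high-cost cases by K-means:", sum(cluster_labels))
```

On the toy input, the median cut places 4 and 100 in the same high-cost class, while K-means isolates 100. This is the kind of difference in class boundaries that would then feed into the downstream classifier.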
Source
South China Journal of Preventive Medicine (《华南预防医学》)
2012, No. 1, pp. 18-22 (5 pages)
Funding
Guangdong Provincial Medical Research Fund project (A2009071)