Funding: This work was supported by the MOE Project of Key Research Institute of Humanities and Social Sciences at Universities under Grant No. 14JJD630008.
Abstract: This study aims to contribute to disease detection by analyzing a medical examination dataset with 123,968 samples. Based on association rule mining and related medical knowledge, six models were constructed to predict hyperuricemia prevalence and to investigate its risk factors. Among the models compared, Lasso logistic regression, traditional logistic regression, and random forest show excellent prediction performance, and their results can be interpreted. The PCA logistic regression model also works well, but it is not analytical. KNN's prediction performance is relatively poor, although dimensionality reduction significantly improves its AUC. SVC performs worst, and it is extremely inefficient at processing a large, high-dimensional dataset. The risk factors of hyperuricemia mainly belong to four categories: obesity-related factors, renal function factors, liver function factors, and myeloproliferative disease-related factors. Random forest, Lasso regression, and logistic regression all treat serum creatinine, BMI, triglyceride, fatty liver, and age as key predictive variables. The models also show that serum urea, serum alanine aminotransferase, negative urobilinogen, red blood cell count, white blood cell count, and pH are significantly correlated with the risk.
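The abstract compares models by AUC without showing how that figure is obtained. As a minimal sketch, AUC can be computed from predicted scores with the rank-based (Mann-Whitney) formula below. The function name `roc_auc`, the labels, and the two score lists are hypothetical toy values chosen for illustration; they are not taken from the study's data or results.

```python
def roc_auc(labels, scores):
    """AUC: probability that a random positive case is scored above
    a random negative case, with ties counted as one half."""
    pairs = sorted(zip(scores, labels))
    n_pos = sum(1 for _, y in pairs if y == 1)
    n_neg = len(pairs) - n_pos
    if n_pos == 0 or n_neg == 0:
        raise ValueError("AUC needs at least one positive and one negative label")

    rank_sum_pos = 0.0
    i = 0
    while i < len(pairs):
        # Find the run of tied scores starting at position i.
        j = i
        while j < len(pairs) and pairs[j][0] == pairs[i][0]:
            j += 1
        # Tied scores share the average of ranks i+1 .. j (1-based).
        avg_rank = (i + 1 + j) / 2.0
        positives_in_run = sum(1 for k in range(i, j) if pairs[k][1] == 1)
        rank_sum_pos += avg_rank * positives_in_run
        i = j

    # Mann-Whitney U statistic for the positive class, normalised to [0, 1].
    u_stat = rank_sum_pos - n_pos * (n_pos + 1) / 2.0
    return u_stat / (n_pos * n_neg)


if __name__ == "__main__":
    # Toy labels (1 = hyperuricemia) and scores from two hypothetical models.
    y_true = [0, 0, 1, 1]
    model_a_scores = [0.1, 0.4, 0.35, 0.8]
    model_b_scores = [0.1, 0.2, 0.3, 0.4]
    print(roc_auc(y_true, model_a_scores))  # 0.75
    print(roc_auc(y_true, model_b_scores))  # 1.0
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect separation of cases from non-cases, which is the scale on which a comparison such as the one in the abstract would be read.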