
Issues in the Mining of Heart Failure Datasets

Abstract: This paper investigates the characteristics of a clinical dataset, using a combination of feature selection and classification methods to handle missing values and to understand the underlying statistical characteristics of a typical clinical dataset. A large clinical dataset typically presents challenges such as missing values, high dimensionality, and unbalanced classes, all of which complicate the use of feature selection and classification algorithms. With most clinical datasets, an initial exploration is carried out, and attributes with more than a certain percentage of missing values are eliminated. Prognostic and diagnostic models are then developed with the help of missing value imputation, feature selection, and classification algorithms. The paper reaches two main conclusions: 1) Despite the nature of clinical datasets and their large size, the method used for missing value imputation does not affect the final performance. What is crucial is that the dataset accurately represents the clinical problem; the choice of imputation method is not critical for developing classifiers and prognostic/diagnostic models. 2) Supervised learning proves more suitable for mining clinical data than unsupervised methods. It is also shown that non-parametric classifiers such as decision trees give better results than parametric classifiers such as radial basis function networks (RBFNs).
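The preprocessing steps named in the abstract (dropping attributes with too many missing values, then imputing the rest) can be sketched as follows. This is only an illustration under stated assumptions, not the authors' implementation: the 0.5 threshold, the column names, the sample records, and the use of mean imputation are all hypothetical, since the abstract specifies neither a cut-off percentage nor a particular imputation method.

```python
from statistics import mean


def missing_fraction(rows, column):
    """Fraction of records whose value for `column` is missing (None)."""
    return sum(1 for row in rows if row.get(column) is None) / len(rows)


def drop_sparse_columns(rows, max_missing=0.5):
    """Keep only columns whose missing fraction is at most `max_missing`.

    The 0.5 default is an assumed value chosen for illustration.
    """
    columns = list(rows[0].keys())
    kept = [c for c in columns if missing_fraction(rows, c) <= max_missing]
    return [{c: row.get(c) for c in kept} for row in rows]


def mean_impute(rows):
    """Replace each missing value with the mean of the observed values
    in the same column. Columns with no observed values are left as-is."""
    columns = list(rows[0].keys())
    means = {}
    for c in columns:
        observed = [row[c] for row in rows if row[c] is not None]
        if observed:
            means[c] = mean(observed)
    return [
        {c: (means.get(c) if row[c] is None else row[c]) for c in columns}
        for row in rows
    ]


# Hypothetical records; None marks a missing measurement.
records = [
    {"age": 71, "ef": 35, "bnp": None},
    {"age": 64, "ef": None, "bnp": None},
    {"age": None, "ef": 50, "bnp": None},
    {"age": 58, "ef": 41, "bnp": 420},
]

filtered = drop_sparse_columns(records, max_missing=0.5)  # "bnp" is 75% missing
imputed = mean_impute(filtered)
```

Swapping `mean_impute` for another imputation routine and comparing the downstream classifier's performance would be one way to probe the abstract's first conclusion on a given dataset.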
Source: International Journal of Automation and Computing (English edition; indexed in EI and CSCD), 2014, Issue 2, pp. 162-179 (18 pages)
Keywords: heart failure, clinical dataset, classification, clustering, missing values, feature selection