Abstract
This study uses an empirical analysis to quantify the downstream effects of data pre-processing choices on analysis. Bootstrap data simulation is used to measure the bias-variance decomposition of an empirical risk function, mean square error (MSE). Results of the risk function decomposition are used to measure the effects of model development choices on model bias, variance, and irreducible error. Measurements of bias and variance are then applied as diagnostic procedures for model pre-processing and development. Best-performing model-normalization-data structure combinations were identified to illustrate the downstream effects of these model development choices. In addition, results from the simulations were verified and extended to additional data characteristics (imbalanced, sparse) by testing on benchmark datasets available from the UCI Machine Learning Repository. Normalization results on benchmark data were consistent with those found using simulations, while also illustrating that more complex and/or non-linear models provide better performance on datasets with additional complexities. Finally, applying the findings from the simulation experiments to previously tested applications led to equivalent or improved results with less model development overhead and processing time.
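For reference, the bias-variance decomposition of MSE referred to above takes the standard textbook form (the notation here is ours, not drawn from the paper): for a model \(\hat{f}\) fit to noisy data \(y = f(x) + \varepsilon\) with \(\operatorname{Var}(\varepsilon) = \sigma^2\),

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(f(x) - \mathbb{E}[\hat{f}(x)]\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[\big(\hat{f}(x) - \mathbb{E}[\hat{f}(x)]\big)^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible error}}
```

A minimal sketch of how this decomposition can be estimated with bootstrap resampling follows; the toy data-generating function, the choice of LinearRegression, and all parameter values are illustrative assumptions, not the paper's actual experimental setup:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)

# Toy setup (assumption): known true function plus Gaussian noise
def f(x):
    return np.sin(x)

sigma = 0.3
x_train = rng.uniform(0.0, 2.0 * np.pi, 100)
y_train = f(x_train) + rng.normal(0.0, sigma, x_train.size)
x_test = np.linspace(0.0, 2.0 * np.pi, 50)

# Refit the model on each bootstrap resample of the training data
# and collect its predictions at the test points
n_boot = 200
preds = np.empty((n_boot, x_test.size))
for b in range(n_boot):
    idx = rng.integers(0, x_train.size, x_train.size)  # sample rows with replacement
    model = LinearRegression().fit(x_train[idx, None], y_train[idx])
    preds[b] = model.predict(x_test[:, None])

# Decompose expected MSE at each test point: bias^2 + variance + sigma^2
bias_sq = (preds.mean(axis=0) - f(x_test)) ** 2
variance = preds.var(axis=0)
print(f"mean bias^2   = {bias_sq.mean():.3f}")
print(f"mean variance = {variance.mean():.3f}")
print(f"irreducible   = {sigma ** 2:.3f}")
```

A linear fit to a sinusoidal signal will show substantial bias and modest variance; swapping in a more flexible model shifts the balance toward variance, which is the kind of trade-off the study's diagnostic procedures are designed to surface.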
Authors
Jessica M. Rudd, Herman “Gene” Ray
Analytics and Data Science Institute, College of Software and Computer Engineering, Kennesaw State University, Kennesaw, GA, USA