An Empirical Study of Downstream Analysis Effects of Model Pre-Processing Choices

Abstract This study uses an empirical analysis to quantify the downstream analysis effects of data pre-processing choices. Bootstrap data simulation is used to measure the bias-variance decomposition of an empirical risk function, mean squared error (MSE). Results of the risk function decomposition are used to measure the effects of model development choices on model bias, variance, and irreducible error. Measurements of bias and variance are then applied as diagnostic procedures for model pre-processing and development. Best performing model-normalization-data structure combinations were found to illustrate the downstream analysis effects of these model development choices. In addition, results found from simulations were verified and expanded to include additional data characteristics (imbalanced, sparse) by testing on benchmark datasets available from the UCI Machine Learning Library. Normalization results on benchmark data were consistent with those found using simulations, while also illustrating that more complex and/or non-linear models provide better performance on datasets with additional complexities. Finally, applying the findings from simulation experiments to previously tested applications led to equivalent or improved results with less model development overhead and processing time.
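
The bootstrap bias-variance measurement described in the abstract can be illustrated with a minimal sketch. This is not the authors' procedure: the signal `true_fn`, the polynomial model, the sample sizes, and the noise level are all hypothetical choices made for illustration. Because the data are simulated, the noiseless signal is known, so squared bias and variance can be estimated from predictions across bootstrap refits, and the irreducible error is the known noise variance.

```python
import numpy as np


def true_fn(x):
    # Hypothetical ground-truth signal; an assumption, not taken from the study.
    return np.sin(2.0 * x)


def bootstrap_bias_variance(degree, n_train=200, n_boot=300, noise_sd=0.3, seed=0):
    """Estimate bias^2, variance, and noise for a polynomial fit via bootstrap."""
    rng = np.random.default_rng(seed)

    # One simulated training set: signal plus Gaussian noise.
    x_train = rng.uniform(-1.0, 1.0, n_train)
    y_train = true_fn(x_train) + rng.normal(0.0, noise_sd, n_train)

    # Fixed evaluation points and their noiseless targets.
    x_test = np.linspace(-0.9, 0.9, 50)
    f_test = true_fn(x_test)

    # Refit the model on each bootstrap resample and store its predictions.
    preds = np.empty((n_boot, x_test.size))
    for b in range(n_boot):
        idx = rng.integers(0, n_train, n_train)  # sample rows with replacement
        coefs = np.polyfit(x_train[idx], y_train[idx], degree)
        preds[b] = np.polyval(coefs, x_test)

    mean_pred = preds.mean(axis=0)
    bias_sq = float(np.mean((mean_pred - f_test) ** 2))
    variance = float(np.mean(preds.var(axis=0)))
    noise = noise_sd ** 2  # irreducible error, known only because data are simulated

    # MSE of the bootstrap predictions against the noiseless signal.
    mse_vs_signal = float(np.mean((preds - f_test) ** 2))

    return {
        "bias_sq": bias_sq,
        "variance": variance,
        "noise": noise,
        "mse_vs_signal": mse_vs_signal,
        "expected_mse": bias_sq + variance + noise,
    }


if __name__ == "__main__":
    for degree in (1, 5):
        print(degree, bootstrap_bias_variance(degree))
```

In this sketch `mse_vs_signal` equals `bias_sq + variance` up to floating-point error, and adding the noise variance gives the expected MSE against noisy observations. The same loop could in principle be rerun with different normalization steps applied before fitting, to see how each choice shifts the bias and variance terms.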
Authors Jessica M. Rudd; Herman “Gene” Ray (Analytics and Data Science Institute, College of Software and Computer Engineering, Kennesaw State University, Kennesaw, GA, USA)
Source Open Journal of Statistics, 2020, Issue 5, pp. 735-809 (75 pages)
Keywords Empirical Analysis; Bias-Variance Decomposition; Mean Squared Error; Downstream Analysis Effects; Empirical Risk