Abstract
Given the multidimensional, sparse, and dynamic characteristics of the big-data generation process, a big data set is not equivalent to the statistical population; even for a static big data set, random sampling retains indispensable methodological value for parameter estimation and population inference. In large-scale data analysis, one frequently needs to reduce dimensionality and computational cost yet does not know how to sample the data. This paper therefore proposes a basic strategy for applying uniform sampling in big data mining and evaluates it numerically on simulated data and a medical cardiotocography (fetal heart rate and uterine contraction monitoring) data set. The results show that uniform sampling outperforms the methods commonly used in the existing literature in reducing the error rates of decision trees, AdaBoost, bagging, and random forests. This strategy offers a reference for data mining methods oriented toward big data and provides evidence for the effectiveness of sampling in big data analysis.
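A minimal sketch of the idea described above, not the authors' actual experimental setup: draw a uniform random subsample (every record has the same inclusion probability) from a large data set, then compare the held-out error rates of decision tree, AdaBoost, bagging, and random forest classifiers on the subsample. The synthetic data, subsample size, and scikit-learn estimators are assumptions for illustration; the paper's cardiotocography data and tuning are not reproduced.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import (AdaBoostClassifier, BaggingClassifier,
                              RandomForestClassifier)
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)

# Simulated "large" data set (placeholder sizes, not the paper's data).
X, y = make_classification(n_samples=100_000, n_features=20,
                           n_informative=10, random_state=0)

# Uniform sampling: each record is included with equal probability.
n_sub = 5_000
idx = rng.choice(len(X), size=n_sub, replace=False)
X_sub, y_sub = X[idx], y[idx]

# Split the subsample so error rates are measured on held-out data.
X_tr, X_te, y_tr, y_te = train_test_split(X_sub, y_sub, test_size=0.3,
                                          random_state=0)

classifiers = {
    "decision tree": DecisionTreeClassifier(random_state=0),
    "adaboost": AdaBoostClassifier(random_state=0),
    "bagging": BaggingClassifier(random_state=0),
    "random forest": RandomForestClassifier(random_state=0),
}

for name, clf in classifiers.items():
    clf.fit(X_tr, y_tr)
    err = 1.0 - clf.score(X_te, y_te)  # misclassification rate
    print(f"{name}: error rate = {err:.4f}")
```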
Source
《统计与信息论坛》
CSSCI
Peking University Core Journal (北大核心)
2015, No. 4, pp. 3-6 (4 pages)
Journal of Statistics and Information
Funding
National Natural Science Foundation of China project "Linkage Study of Homogeneity Tests in Family Sequence Data" (31470070)
Natural Science Foundation of Shanxi Province project "Integration of Genotype Patterns in Genomic Selection" (2014011030-4)
Shanxi Province research funding project for returned overseas scholars "Genomic Selection Based on Statistical Learning Theory" (2013-72)