Abstract
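Given massive data, how to select a representative sample for statistical modeling, so as to reveal the regularities behind the data and then forecast and assess economic and social development, is a central question in statistics. This study answers it with a deterministic sampling method, which can effectively avoid the loss incurred by traditional probability sampling, so that representative sample points are selected into the sample as far as possible and the sample reflects the population more fully. The study focuses on optimal sample selection for the generalized additive model. By comparing the gap between the full-sample and subsample estimates, it finds that the sample must satisfy a certain orthogonality condition in order to recover the statistical features of the population to the greatest extent. Based on this orthogonality condition, a greedy quasi-optimal sample selection method is given. Extensive simulated and real data confirm that deterministic sampling outperforms traditional probability sampling; the method can be extended to the generalized varying coefficient model and is suitable for the large microdata sets produced by economic statistics and government statistics.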
Big data has opened a new era, bringing an explosion of information and great changes in thinking, but it also poses new difficulties and challenges for data modeling and prediction. How to select a representative sample from big data, so as to reveal the objective rules behind the data and make valid predictions, has long been a focus of statistics. This paper addresses the question with a deterministic sampling method. The essential idea is to find a condition that directly affects estimation accuracy, sort the sample points according to this condition, and select the important sample points in turn. This approach can effectively avoid the loss resulting from traditional probabilistic sampling, so that sample points with significant effects are more likely to be selected and the sample gives a more comprehensive picture of the population. The optimal sampling method is studied for the generalized additive model, which is used to model complex data structures. The sample is found to need to satisfy a certain orthogonality condition in order to recover the main features of the original dataset. On this basis, a quasi-optimal sample selection algorithm is presented. The algorithm adopts a greedy strategy and sequentially selects the samples that maximize the orthogonal index. The result may be only locally optimal, as with the K-means clustering algorithm. Both simulated and real data show that the proposed method performs better than traditional probabilistic sampling methods, including simple random sampling and leverage sampling: it has lower in-sample fitting error and lower in-sample prediction error. The proposed method can be extended to the generalized varying coefficient model and is suitable for analyzing large microdata sets in economic statistics and government statistics. Future research can study two questions. (1) Under the current sample selection condition, is there a better orthogonal index that improves the quality of sample selection? Because the current orthogonal index requires at least P samples, the proposed algorithm obtains only quasi-optimal results. If the orthogonal index could be constructed on fewer samples, the sample selection would be closer to the globally optimal choice. (2) Are there new sample selection conditions that lead to more accurate estimation and prediction? The sample selection condition in this paper is obtained by comparing the estimation results of the full sample and the subsample. In principle, the sample selection condition changes with the estimation method, loss function, or sampling design, which in turn affects the estimation and prediction results, so finding the sample selection condition with optimal prediction performance is a problem worth studying.
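The greedy strategy described in the abstract can be illustrated with a minimal sketch. The abstract does not define the orthogonal index, so the code below substitutes a stand-in criterion, the determinant of the subsample Gram matrix, which likewise needs at least p rows before it is nonzero. The function name `greedy_subsample`, the design matrix `X` (for an additive model its columns would be basis-function evaluations), and the two-phase seeding are all assumptions made for illustration, not the authors' algorithm.

```python
import numpy as np


def greedy_subsample(X, m):
    """Pick m row indices of X (n x p, with p <= m <= n) greedily.

    Stand-in criterion (an assumption, not the paper's orthogonal index):
    grow det(X_S^T X_S) as much as possible with each added row.
    """
    X = np.asarray(X, dtype=float)
    n, p = X.shape
    if not p <= m <= n:
        raise ValueError("need p <= m <= n")

    chosen = []
    available = np.ones(n, dtype=bool)

    # Phase 1: p seed rows. Repeatedly take the row with the largest
    # component orthogonal to the rows already chosen (Gram-Schmidt style).
    R = X.copy()
    for _ in range(p):
        norms = np.where(available, np.einsum("ij,ij->i", R, R), -np.inf)
        i = int(np.argmax(norms))
        if norms[i] <= 1e-12:
            raise ValueError("X does not have full column rank")
        chosen.append(i)
        available[i] = False
        q = R[i] / np.sqrt(norms[i])
        R = R - np.outer(R @ q, q)  # remove the direction just covered

    # Phase 2: by the matrix determinant lemma,
    # det(M + x x^T) = det(M) * (1 + x^T M^{-1} x),
    # so the best next row is the one maximizing x^T M^{-1} x.
    M_inv = np.linalg.inv(X[chosen].T @ X[chosen])
    while len(chosen) < m:
        gain = np.where(available, np.einsum("ij,jk,ik->i", X, M_inv, X), -np.inf)
        i = int(np.argmax(gain))
        chosen.append(i)
        available[i] = False
        v = M_inv @ X[i]
        M_inv -= np.outer(v, v) / (1.0 + X[i] @ v)  # Sherman-Morrison update
    return np.array(chosen)


# Usage on synthetic data (hypothetical sizes).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
idx_demo = greedy_subsample(X_demo, 20)
```

Like the algorithm in the abstract, this selection is deterministic and only locally optimal: each step takes the best available row given the rows already chosen, with no backtracking.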
Authors
秦磊
叶玲珑
谢邦昌
QIN Lei; YE Ling-long; SHIA Ben-chang (School of Statistics, University of International Business and Economics, Beijing 100029, China; School of Public Affairs, Xiamen University, Xiamen 361005, China; College of Management, Fu Jen Catholic University, Taiwan 242062, China)
Source
《统计与信息论坛》
CSSCI
PKU Core Journal (北大核心)
2022, No. 10, pp. 16-24 (9 pages)
Journal of Statistics and Information
Funding
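Supported by the Fundamental Research Funds for the Central Universities at the University of International Business and Economics, project "Research on Monitoring and Early Warning of Major Infectious Diseases under Big Data" (20YQ12), and by the University of International Business and Economics Huiyuan Outstanding Young Scholars Program (20JQ07).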
Keywords
big data
deterministic sampling
generalized additive model
quasi-optimal sample