Abstract
Against the backdrop of big data, traditional survey sampling techniques need to be improved to cope with changes in data structure. Leverage importance sampling, which uses leverage scores as inclusion probabilities, raises the chance that high-leverage sample points are selected, but it also raises the risk that outliers enter the sampled subset and pull the sampling estimate away from the true value. To reduce the influence of outliers in big data and improve the robustness of estimates based on sampled subsets, this paper proposes a two-stage Leverage importance sampling method based on threshold self-selection. In the first stage, the method identifies a robust subset through ordered clustering of sample distances, so that the samples passed to second-stage sampling are more representative. In the second stage, a robust sampling estimate is obtained from the robust subset. Simulation results show that the proposed method improves the accuracy of linear regression coefficient estimates and is applicable to drift-type, fluctuation-type, and mixed-type outliers. In the empirical analysis, the proposed method attains relatively small mean squared errors of the predicted values on three case datasets, effectively reducing the influence of outliers.
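The two-stage idea can be illustrated with a small NumPy sketch. This is not the authors' algorithm: the abstract does not specify the distance measure or the threshold self-selection rule, so the sketch assumes (hypothetically) that the "sample distance" is the absolute residual from a preliminary least-squares fit and that the threshold is chosen by a two-segment ordered clustering of the sorted distances. The function names `leverage_scores`, `leverage_subsample_fit`, `two_segment_split`, and `two_stage_fit` are illustrative only.

```python
import numpy as np

def leverage_scores(X):
    # Leverage scores are the diagonal of the hat matrix X (X'X)^{-1} X',
    # computed here from the thin QR factorization (assumes full column rank).
    Q, _ = np.linalg.qr(X)
    return np.sum(Q ** 2, axis=1)

def leverage_subsample_fit(X, y, r, rng):
    # Draw r rows with probability proportional to leverage, then solve
    # weighted least squares with weights 1 / probability.
    h = leverage_scores(X)
    prob = h / h.sum()
    idx = rng.choice(len(y), size=r, replace=True, p=prob)
    sw = np.sqrt(1.0 / prob[idx])
    beta, *_ = np.linalg.lstsq(X[idx] * sw[:, None], y[idx] * sw, rcond=None)
    return beta

def two_segment_split(d_sorted):
    # Ordered clustering into two contiguous segments: return the size of the
    # left segment that minimizes the total within-segment sum of squares.
    n = len(d_sorted)
    c1, c2 = np.cumsum(d_sorted), np.cumsum(d_sorted ** 2)
    k = np.arange(1, n)
    ss_left = c2[k - 1] - c1[k - 1] ** 2 / k
    ss_right = (c2[-1] - c2[k - 1]) - (c1[-1] - c1[k - 1]) ** 2 / (n - k)
    return k[np.argmin(ss_left + ss_right)]

def two_stage_fit(X, y, r, rng):
    # Stage 1 (assumed distance): absolute residuals from a preliminary fit;
    # the cut between "robust" and "suspect" points is picked from the data.
    beta0, *_ = np.linalg.lstsq(X, y, rcond=None)
    d = np.abs(y - X @ beta0)
    order = np.argsort(d)
    k = two_segment_split(d[order])
    keep = np.zeros(len(y), dtype=bool)
    keep[order[:k]] = True
    # Stage 2: leverage importance sampling restricted to the robust subset.
    return leverage_subsample_fit(X[keep], y[keep], r, rng), keep

# Synthetic demo with drift-type outliers (all values are made up).
rng = np.random.default_rng(0)
n = 2000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 2.0, -1.0])
y = X @ beta_true + rng.normal(size=n)
outliers = rng.choice(n, size=100, replace=False)
y[outliers] += 30.0  # shift 5% of the responses

beta_plain = leverage_subsample_fit(X, y, 500, rng)
beta_two, keep = two_stage_fit(X, y, 500, rng)
err_plain = np.linalg.norm(beta_plain - beta_true)
err_two = np.linalg.norm(beta_two - beta_true)
print(err_plain, err_two)
```

A two-segment split always discards some points, even when the data contain no outliers, so a practical rule would need a check for that case. The sketch only shows why screening before leverage sampling can help when outliers are well separated.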
Authors
贺建风
何韩吉
He Jianfeng; He Hanji
Source
《统计研究》
CSSCI
Peking University Core Journal (北大核心)
2024, Issue 10, pp. 149-160 (12 pages)
Statistical Research
Funding
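National Social Science Fund of China, General Project "Research on Random Sampling Techniques and Model-Assisted Estimation Methods in the Context of Big Data" (19BTJ022)
National Statistical Science Research Major Project "Research on the Improvement and Application of Sampling Survey Methods in the Context of Big Data" (2020LD02)
National Statistical Science Research Selected Project "Research on Multi-Source Data Fusion Inference in the Context of Big Data" (2023LY010)
South China University of Technology, Central Universities Philosophy and Social Sciences Innovation Team Project "Big Data Statistical Surveys and Econometric Analysis" (CXTD202405).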