Two-Stage Leverage Importance Sampling Method in the Context of Big Data

Abstract: In the big data setting, traditional survey sampling techniques need to be adapted to changes in data structure. Leverage importance sampling, which uses leverage scores as inclusion probabilities, raises the probability that high-leverage points are selected, but it also raises the risk that outliers enter the subsample, causing the sampling estimate to deviate from the true value. To reduce the influence of outliers in big data and improve the robustness of subsample-based estimation, this paper proposes a two-stage Leverage importance sampling method based on self-selected thresholds. In the first stage, the method identifies a robust subset through ordered clustering of sample distances, so that the samples passed to the second stage are more representative. In the second stage, a robust sampling estimate is obtained from that robust subset. Simulation results show that the proposed method improves the accuracy of linear regression coefficient estimates and is applicable to drift-type, fluctuation-type, and mixed-type outliers. In the empirical analysis, the proposed method attains a relatively small mean squared error of predicted values on three case datasets, effectively reducing the influence of outliers.
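The two stages described in the abstract can be illustrated with a minimal sketch. The abstract does not specify the distance measure, the threshold rule, or the estimator, so everything below is an assumption made for illustration: the model is simple linear regression with an intercept, the "sample distance" is the absolute residual from a pilot least-squares fit, the ordered clustering is a two-group split of the sorted distances that minimizes the within-group sum of squares, and the function names (`leverage_scores`, `ordered_two_split`, `two_stage_fit`, and so on) are hypothetical. It is not the authors' algorithm.

```python
import random


def leverage_scores(x):
    """Leverage for simple linear regression with intercept:
    h_i = 1/n + (x_i - mean)^2 / sum_j (x_j - mean)^2."""
    n = len(x)
    mean = sum(x) / n
    sxx = sum((v - mean) ** 2 for v in x)
    return [1.0 / n + (v - mean) ** 2 / sxx for v in x]


def weighted_fit(x, y, w):
    """Weighted least squares for y = a + b*x; returns (a, b).
    Fails with ZeroDivisionError if all x values are identical."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxy = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    sxx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    b = sxy / sxx
    return my - b * mx, b


def leverage_sample_fit(x, y, r, rng):
    """Draw r points with replacement, with probability proportional to
    leverage, then fit with inverse-probability weights 1/pi_i."""
    h = leverage_scores(x)
    total = sum(h)
    pi = [v / total for v in h]
    idx = rng.choices(range(len(x)), weights=pi, k=r)
    return weighted_fit(
        [x[i] for i in idx],
        [y[i] for i in idx],
        [1.0 / pi[i] for i in idx],
    )


def ordered_two_split(d_sorted):
    """Assumed stand-in for ordered clustering: return k such that splitting
    the sorted distances into d_sorted[:k] and d_sorted[k:] minimizes the
    total within-group sum of squares."""
    n = len(d_sorted)
    ps, ps2 = [0.0], [0.0]  # prefix sums of values and of squared values
    for v in d_sorted:
        ps.append(ps[-1] + v)
        ps2.append(ps2[-1] + v * v)

    def ss(i, j):
        # Sum of squared deviations from the mean for d_sorted[i:j].
        s = ps[j] - ps[i]
        return (ps2[j] - ps2[i]) - s * s / (j - i)

    return min(range(1, n), key=lambda k: ss(0, k) + ss(k, n))


def two_stage_fit(x, y, r, rng):
    """Stage 1: keep the low-distance group as the robust subset.
    Stage 2: leverage importance sampling within that subset."""
    a0, b0 = weighted_fit(x, y, [1.0] * len(x))  # pilot fit (assumption)
    dist = [abs(yi - a0 - b0 * xi) for xi, yi in zip(x, y)]
    order = sorted(range(len(x)), key=dist.__getitem__)
    k = ordered_two_split([dist[i] for i in order])
    keep = order[:k]
    return leverage_sample_fit(
        [x[i] for i in keep], [y[i] for i in keep], r, rng
    )
```

Calling `two_stage_fit(x, y, r, random.Random(seed))` returns an intercept and slope estimated from `r` draws. Note that this simplified split always discards the high-distance group, even when the data contain no outliers, and the pilot fit is itself sensitive to outliers; a self-selected threshold such as the one the paper proposes would need to address both points.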

Authors: He Jianfeng (贺建风); He Hanji (何韩吉)

Affiliation: Department of Quantitative Economics, School of Economics and Finance, South China University of Technology

Source: Statistical Research (《统计研究》), CSSCI, Peking University Core Journal, 2024, Issue 10, pp. 149-160 (12 pages)

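Funding: National Social Science Fund of China General Project "Research on Random Sampling Techniques and Model-Assisted Estimation Methods in the Context of Big Data" (19BTJ022); National Statistical Science Research Major Project "Research on the Improvement and Application of Sampling Survey Methods in the Context of Big Data" (2020LD02); National Statistical Science Research Selected Project "Research on Multi-Source Data Fusion Inference in the Context of Big Data" (2023LY010); South China University of Technology Central Universities Philosophy and Social Sciences Innovation Team Project "Big Data Statistical Surveys and Econometric Analysis" (CXTD202405).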

Keywords: large-scale data; linear model; ordered clustering; Leverage importance sampling

Classification code: O212 [Science: Probability Theory and Mathematical Statistics]

