Sequential Testing Method and Its Application in Big Data (大数据序贯检验方法及其应用)


Abstract: Testing whether distributions are consistent is a fundamental topic in statistics and is widely applied across many fields. With the advent of the big data era, scientific domains now collect and store data that are large in scale, diverse in type, complex in structure, and rapidly updated. Constrained by data scale and storage mode, traditional distribution consistency tests face significant challenges in processing and analyzing such data. The divide-and-conquer strategy is currently the main approach to this problem: within a distributed framework, the results computed on each node's data are integrated to obtain the final result. For large-scale distribution consistency testing, however, integrating the test results from all nodes is inefficient, and when the distributions differ markedly it often increases the cost of testing. Building on the idea of sequential testing, this paper optimizes the existing divide-and-conquer strategy by appropriately setting the "error region" of the testing problem and proposes a distributed sequential testing method. Rather than integrating the data from all nodes, the method sequentially compares the test statistic with a predetermined threshold and adjusts subsequent decisions in real time according to the node data collected so far, so that a correct test decision can be reached without using all node data. Simulation experiments and a real-data analysis show that, compared with existing divide-and-conquer testing methods, the proposed method maintains the test level and power while improving the computational efficiency of distributed testing, providing methodological support for reducing the high cost of large-scale data testing in fields such as clinical trials and industrial inspection.
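The abstract does not spell out the test statistic, the thresholds, or the "error region", so the following is only a minimal sketch of the general idea of visiting nodes one at a time and stopping early, not the authors' procedure. Everything in it is an assumption made for illustration: each node runs a two-sample Kolmogorov-Smirnov test (with an asymptotic p-value approximation), the p-values seen so far are combined with Fisher's method, and a conservative Bonferroni boundary `alpha / K` is used at each of the at most `K` looks. The function names `ks_pvalue`, `fisher_combined_pvalue`, and `sequential_distributed_test` are hypothetical. Unlike the method described above, this sketch only stops early to reject.

```python
import math
import random


def ks_pvalue(x, y):
    """Two-sample Kolmogorov-Smirnov test; returns an asymptotic p-value."""
    xs, ys = sorted(x), sorted(y)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    # Walk both sorted samples and track the largest gap between the ECDFs.
    while i < n and j < m:
        v = min(xs[i], ys[j])
        while i < n and xs[i] <= v:
            i += 1
        while j < m and ys[j] <= v:
            j += 1
        d = max(d, abs(i / n - j / m))
    en = math.sqrt(n * m / (n + m))
    lam = (en + 0.12 + 0.11 / en) * d  # common small-sample correction
    if lam < 0.2:
        return 1.0  # the series below converges poorly here; p is ~1
    s = sum((-1) ** (k - 1) * math.exp(-2.0 * k * k * lam * lam)
            for k in range(1, 101))
    return min(1.0, max(0.0, 2.0 * s))


def fisher_combined_pvalue(pvals):
    """Fisher's method for independent p-values (chi-square with 2k d.f.)."""
    k = len(pvals)
    half = -sum(math.log(max(p, 1e-300)) for p in pvals)  # statistic / 2
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    # Closed-form survival function for an even number of degrees of freedom.
    return min(1.0, math.exp(-half) * total)


def sequential_distributed_test(nodes, alpha=0.05):
    """nodes: sequence of (x, y) sample pairs, one pair per node.

    Returns (decision, nodes_used). Assumed stopping rule: reject as soon as
    the combined p-value drops below alpha / K (Bonferroni over K looks).
    """
    nodes = list(nodes)
    boundary = alpha / len(nodes)
    pvals = []
    for used, (x, y) in enumerate(nodes, start=1):
        pvals.append(ks_pvalue(x, y))
        if fisher_combined_pvalue(pvals) < boundary:
            return "reject", used
    return "do not reject", len(nodes)


if __name__ == "__main__":
    random.seed(0)
    # Simulated nodes: the two samples differ by a mean shift of 0.5.
    demo = [([random.gauss(0.0, 1.0) for _ in range(500)],
             [random.gauss(0.5, 1.0) for _ in range(500)]) for _ in range(20)]
    print(sequential_distributed_test(demo))
```

The Bonferroni boundary is deliberately crude: it keeps the overall type I error at or below `alpha` without any modelling of the dependence between successive looks, at the price of some power. A less conservative boundary would be a natural refinement, but which one the paper uses cannot be read off the abstract.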

Authors: TIAN Zixuan (田梓璇); XIE Xiaoyue (谢小月)

Affiliations: School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China; Institute of Mathematics and Systems Science, University of Chinese Academy of Sciences, Beijing 100190, China; School of Equipment Management and Unmanned Aerial Vehicle Engineering, Air Force Engineering University, Xi'an 710038, China

Source: Journal of Statistics and Information (《统计与信息论坛》), Peking University Core Journal, 2024, No. 9, pp. 13-22 (10 pages)

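Funding: Natural Science Basic Research Program of Shaanxi Province, project "Research on Distribution Consistency Testing Methods for Large-Scale Data" (2023-JC-QN-0059).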

Keywords: divide-and-conquer strategy; big data; sequential testing; distributed framework

Classification number: O212.1 [Science: Probability Theory and Mathematical Statistics]

