Sequential Testing Method and Its Application in Big Data (大数据序贯检验方法及其应用)
Abstract: Testing whether distributions are consistent is a fundamental topic in statistics and is widely applied in many fields. With the advent of the big data era, however, scientific domains now collect and store data that are large in scale, diverse in type, complex in structure, and rapidly updated. Constrained by data scale and storage mode, traditional distribution consistency tests face significant challenges in processing and analyzing such data. At present, the divide-and-conquer strategy is the main approach to this problem: a distributed framework computes a result on each node's data, and the per-node results are integrated to obtain the final result. For large-scale distribution consistency testing, integrating the test results from all nodes is inefficient, and it often increases the cost of testing, especially when the data distributions differ markedly. Building on the idea of sequential testing, this paper therefore proposes a distributed sequential testing method that optimizes the existing divide-and-conquer strategy by appropriately setting the "error region" of the testing problem. Rather than integrating the data from all nodes, the method sequentially compares the test statistic with a predetermined threshold and adjusts subsequent decisions in real time according to the node data collected so far, so that a correct test decision can be reached without using all node data. Simulation experiments and case studies show that, compared with existing divide-and-conquer testing methods, the proposed method maintains the test level and power while improving the computational efficiency of distributed testing, providing methodological support for reducing the high cost of testing large-scale data in fields such as clinical trials and industrial inspection.
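The early-stopping idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the paper's test statistic or its "error region", so everything below is an assumption made for illustration: each node is assumed to hold one block from each of the two samples, the per-node test is a two-sample Kolmogorov-Smirnov test with its asymptotic p-value, evidence is pooled with Fisher's method, and the stopping boundary is a conservative Bonferroni split of the level across the possible looks. The function names are hypothetical.

```python
import math


def ks_statistic(x, y):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the empirical CDFs."""
    x, y = sorted(x), sorted(y)
    n, m = len(x), len(y)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        v = min(x[i], y[j])
        # Step past every observation equal to v in both samples (handles ties).
        while i < n and x[i] == v:
            i += 1
        while j < m and y[j] == v:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d


def ks_pvalue(d, n, m):
    """Asymptotic p-value from the Kolmogorov distribution."""
    lam = math.sqrt(n * m / (n + m)) * d
    if lam < 0.2:
        # The alternating series converges slowly here; the p-value is essentially 1.
        return 1.0
    s = sum((-1) ** (j - 1) * math.exp(-2.0 * j * j * lam * lam) for j in range(1, 101))
    # Clamp away from 0 so the logarithm in Fisher's method stays finite.
    return min(1.0, max(1e-300, 2.0 * s))


def fisher_combined_pvalue(pvalues):
    """Fisher's method: -2 * sum(log p) is chi-square with 2k degrees of freedom under H0."""
    k = len(pvalues)
    half = -sum(math.log(p) for p in pvalues)  # half of the chi-square statistic
    # Closed-form chi-square survival function for even degrees of freedom 2k.
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return min(1.0, math.exp(-half) * total)


def sequential_distributed_test(node_data, alpha=0.05):
    """Visit nodes one at a time and stop as soon as the pooled evidence rejects H0.

    node_data: list of (x_block, y_block) pairs, one pair per node (assumed layout).
    Returns (decision, number_of_nodes_used).
    """
    n_nodes = len(node_data)
    boundary = alpha / n_nodes  # Bonferroni split over at most n_nodes looks (assumption)
    pvalues = []
    for used, (x_block, y_block) in enumerate(node_data, start=1):
        d = ks_statistic(x_block, y_block)
        pvalues.append(ks_pvalue(d, len(x_block), len(y_block)))
        if fisher_combined_pvalue(pvalues) < boundary:
            return "reject", used  # early stop: remaining nodes are never touched
    return "accept", n_nodes
```

When the two distributions differ strongly, a call such as `sequential_distributed_test(node_data)` is expected to stop after the first few nodes, which is the cost saving the abstract describes. This simple boundary only allows early stopping in favor of rejection and is more conservative than a tailored error region would be.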
Authors: TIAN Zixuan (田梓璇); XIE Xiaoyue (谢小月) (School of Mathematical Sciences, University of Chinese Academy of Sciences, Beijing 100190, China; Institute of Mathematics and Systems Science, University of Chinese Academy of Sciences, Beijing 100190, China; School of Equipment Management and Unmanned Aerial Vehicle Engineering, Air Force Engineering University, Xi'an 710038, China)
Source: Journal of Statistics and Information (《统计与信息论坛》), PKU Core Journal, 2024, No. 9, pp. 13-22 (10 pages)
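Funding: Natural Science Basic Research Program of Shaanxi Province, project "Research on Distribution Consistency Testing Methods for Large-Scale Data" (2023-JC-QN-0059).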
Keywords: divide-and-conquer strategy; big data; sequential testing; distributed framework