Abstract
Objective: Multiple testing is a common problem in the analysis of high-dimensional omics data; handled improperly, it leads to either low statistical power or an inflated false discovery rate (FDR). In recent years, FDR has matured into a control criterion in its own right, giving rise to a series of related theories and methods.
Methods: We first introduce the principles and assumptions of a family of adaptive FDR control methods. These build on the classical linear step-up procedure of Benjamini and Hochberg and refine it by adaptively estimating m₀, the number of true null hypotheses, from the sample data. The specific methods include iterative, quantile (median or fixed value), multi-stage, threshold-function adjustment, and m₀ plug-in approaches, among others. We then apply these methods to two case studies: one on CT image features of lung cancer patients and one on serum protein expression in COVID-19 patients.
Results: In both applications, the adaptive FDR control procedures reduced the number of positive findings relative to the uncorrected results and, compared with Bonferroni correction, largely retained the proportion of positive findings. However, the protein expression example showed that these procedures alone cannot address the root cause of result instability (discovery by chance). In split-sample validation, the stability of the selected results improved to some extent when the target control level was lowered appropriately and the results of the individual procedures were evaluated together.
Conclusion: Because adaptive FDR control procedures rely on a sample-based estimate of m₀ and on specific assumptions about dependence structure, their results can be affected by the complex dependence in high-dimensional, small-sample omics data. We therefore recommend applying the methods jointly and evaluating their results comprehensively. When the proportion of positive findings is high, the classical linear step-up procedure remains a good conservative choice owing to its simplicity, robustness, and effectiveness in controlling FDR.
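The core idea described in the abstract, running the Benjamini-Hochberg linear step-up procedure with an estimate of m₀ in place of the total number of tests m, can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names are hypothetical, and the m₀ estimator shown (a fixed-threshold estimator with an assumed tuning value `lam=0.5`) is only one common choice, not necessarily any of the specific estimators compared in the paper. FDR control for adaptive procedures of this kind is usually established under independence or specific dependence assumptions.

```python
import numpy as np

def bh_reject(pvals, alpha=0.05, m0=None):
    """Linear step-up (Benjamini-Hochberg) procedure.

    With m0=None this is the classical procedure (thresholds alpha*i/m).
    Passing an estimate of m0 gives an adaptive variant (alpha*i/m0).
    Returns a boolean array marking the rejected hypotheses.
    """
    p = np.asarray(pvals, dtype=float)
    m = p.size
    m0 = m if m0 is None else m0
    order = np.argsort(p)                        # indices of ascending p-values
    thresholds = alpha * np.arange(1, m + 1) / m0
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.nonzero(below)[0].max()           # largest i with p_(i) <= threshold_i
        reject[order[:k + 1]] = True             # reject all hypotheses up to p_(k)
    return reject

def estimate_m0(pvals, lam=0.5):
    """Assumed estimator: (#{p > lam} + 1) / (1 - lam), capped at m."""
    p = np.asarray(pvals, dtype=float)
    return min(p.size, (np.sum(p > lam) + 1) / (1.0 - lam))

def adaptive_bh_reject(pvals, alpha=0.05, lam=0.5):
    """Adaptive BH: plug the estimated m0 into the step-up thresholds."""
    return bh_reject(pvals, alpha, m0=estimate_m0(pvals, lam))

# Illustrative p-values (made up for this sketch, not data from the paper)
pvals = [0.9, 0.001, 0.04, 0.002, 0.6, 0.003, 0.028, 0.004]
n_classical = int(bh_reject(pvals).sum())
n_adaptive = int(adaptive_bh_reject(pvals).sum())
```

Because the estimated m₀ never exceeds m, the adaptive thresholds are at least as large as the classical ones, so the adaptive variant rejects at least as many hypotheses at the same nominal level. The gain depends on how well m₀ is estimated, which is where small samples and complex dependence can cause trouble.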
Authors
王子兴
薛芳
姜晶梅
Wang Zixing; Xue Fang; Jiang Jingmei (Institute of Basic Medical Sciences, Chinese Academy of Medical Sciences / School of Basic Medicine, Peking Union Medical College, Beijing 100005)
Source
《中国卫生统计》
CSCD
Peking University Core Journals (北大核心)
2023, Issue 1, pp. 68-73 (6 pages)
Chinese Journal of Health Statistics
Funding
CAMS Innovation Fund for Medical Sciences (2017-I2M-1-009)
Fundamental Research Funds for the Central Universities (3332021038)