The false discovery proportion (FDP) is a useful measure of the abundance of false positives when a large number of hypotheses are tested simultaneously. Methods for controlling the expected value of the FDP, namely the false discovery rate (FDR), have become widely used. An accurate prediction interval for the FDP is highly desirable in such applications. Some degree of dependence among test statistics exists in almost all applications involving multiple testing, so methods for constructing tight prediction intervals for the FDP that take account of this dependence are of great practical importance. This paper derives a formula for the variance of the FDP and uses it to obtain an upper prediction interval for the FDP, under some semi-parametric assumptions on dependence among test statistics. Simulation studies indicate that the proposed formula-based prediction interval has good coverage probability under commonly assumed weak dependence. The prediction interval is generally more accurate than those obtained from existing methods. In addition, a permutation-based upper prediction interval for the FDP is provided, which can be useful when dependence is strong and the number of tests is not too large. The proposed prediction intervals are illustrated using a prostate cancer dataset.
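To make the two quantities concrete, here is a minimal sketch that computes the FDP for one set of rejections and approximates the FDR as the average FDP over repeated simulations. The function names, the use of the Benjamini-Hochberg step-up rule to choose rejections, and all simulation settings (200 independent one-sided z-tests, 40 true signals with mean shift 3) are assumptions made for illustration; none of them come from the paper, and the sketch does not reproduce its prediction interval.

```python
import random
from statistics import NormalDist


def false_discovery_proportion(rejected, is_null):
    """FDP = (# rejected true nulls) / max(# rejections, 1)."""
    n_rejected = sum(rejected)
    n_false = sum(1 for r, h0 in zip(rejected, is_null) if r and h0)
    return n_false / max(n_rejected, 1)


def benjamini_hochberg(pvalues, alpha):
    """Return booleans marking the hypotheses rejected by the BH step-up rule."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    k = 0  # largest rank whose p-value is below its BH critical value
    for rank, i in enumerate(order, start=1):
        if pvalues[i] <= alpha * rank / m:
            k = rank
    rejected = [False] * m
    for i in order[:k]:
        rejected[i] = True
    return rejected


def simulate_fdp(m=200, n_alt=40, shift=3.0, alpha=0.1, rng=None):
    """One simulated experiment with independent one-sided z-tests (assumed setup)."""
    rng = rng or random.Random()
    is_null = [i >= n_alt for i in range(m)]
    std_normal = NormalDist()
    pvalues = []
    for h0 in is_null:
        z = rng.gauss(0.0 if h0 else shift, 1.0)
        pvalues.append(1.0 - std_normal.cdf(z))
    rejected = benjamini_hochberg(pvalues, alpha)
    return false_discovery_proportion(rejected, is_null)


rng = random.Random(0)
fdps = [simulate_fdp(rng=rng) for _ in range(500)]
fdr_estimate = sum(fdps) / len(fdps)  # Monte Carlo estimate of E[FDP]
```

The individual values in `fdps` vary from run to run even though their mean is controlled, which is the motivation for an upper prediction interval on the FDP itself.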
Many QTL mapping methods have been developed in the past two decades. Statistically, the best method should have high detection power but a low false discovery rate (FDR). Power and FDR cannot be derived theoretically for most QTL mapping methods, but they can be properly evaluated using computer simulations. In this paper, we used four genetic models (two for independent loci and two for linked loci) to illustrate power and FDR estimation for interval mapping (IM) and inclusive composite interval mapping (ICIM). For each model, we simulated 1000 populations, each of 200 doubled haploids. A support interval (SI) was first defined to indicate to which predefined QTL a significant QTL belonged. Power was calculated by counting the number of simulation runs with significant peaks higher than the logarithm of odds (LOD) threshold in the SI. Quantitative trait loci not identified in any SI were viewed as false positives. The FDR is the rate at which QTLs are identified as significant when they are actually non-significant. Simulation results allowed us to estimate the power and FDR of IM and ICIM for two independent and two linkage genetic models. Our estimates allowed us to readily compare the efficiencies of different statistical methods for QTL mapping, including the ability to separate linkage, under a wide range of genetic models. We used IM and ICIM as examples of how to estimate power and FDR, but the principles shown in this paper can be used for power analysis and comparison of any other QTL mapping methods, especially those based on interval tests.
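The counting scheme described above can be sketched as follows. This is an illustrative tally only: the function name `summarize_runs`, the symmetric support interval defined by a half-width around each predefined QTL position, and the choice of denominator for the FDR (all significant peaks pooled over runs) are assumptions, not details taken from the paper. Each run is assumed to list only peaks that already exceed the LOD threshold, and support intervals are assumed not to overlap.

```python
def summarize_runs(runs, true_positions, half_width):
    """Estimate per-QTL power and an overall FDR from simulated detections.

    runs: list of lists; each inner list holds the positions (in cM) of the
          significant peaks found in one simulated population.
    true_positions: positions of the predefined QTLs.
    half_width: assumed half-width of the support interval around each QTL.
    """
    hits = [0] * len(true_positions)  # runs in which each QTL was detected
    false_positives = 0
    total_detected = 0
    for peaks in runs:
        found = [False] * len(true_positions)
        for pos in peaks:
            total_detected += 1
            match = None
            for j, q in enumerate(true_positions):
                if abs(pos - q) <= half_width:
                    match = j
                    break
            if match is None:
                false_positives += 1  # peak lies outside every support interval
            else:
                found[match] = True
        for j, was_found in enumerate(found):
            hits[j] += was_found
    power = [h / len(runs) for h in hits]
    fdr = false_positives / max(total_detected, 1)
    return power, fdr


# Toy input: 4 simulated populations, two predefined QTLs at 25 cM and 75 cM.
runs = [[25.0, 80.0], [27.0], [60.0], []]
power, fdr = summarize_runs(runs, true_positions=[25.0, 75.0], half_width=5.0)
```

In the toy input, the peak at 60 cM falls outside both support intervals and is the only false positive among four detected peaks.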
The traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have some inherent drawbacks. For example, the theoretical null might fail because of improper assumptions on the sample distribution. Here, we propose a null distribution-free approach to FDR control for multiple hypothesis testing in the case-control study. This approach, named the target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which is not required to be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulation demonstrates that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. Evaluation is also made on two real datasets, including an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
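The idea of estimating false target discoveries from competing decoys can be sketched generically as below. This is a common competition-based rule, shown under stated assumptions, and is not necessarily the exact procedure proved in the paper: each test is assumed to come with one target score and one decoy score (the latter computed on a permuted sample), ties are counted as decoy wins, and the "+1" in the numerator is an assumed conservative correction. The function name `target_decoy_select` is hypothetical.

```python
def target_decoy_select(target_scores, decoy_scores, alpha):
    """Return indices of tests reported as discoveries at nominal level alpha.

    Each test keeps the larger of its target and decoy score and is labeled by
    the winner. Walking down the ranked list, the number of decoy wins serves
    as an estimate of the number of false target wins.
    """
    labeled = []
    for j, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            labeled.append((t, True, j))   # target wins the competition
        else:
            labeled.append((d, False, j))  # decoy wins (ties count as decoy)
    labeled.sort(key=lambda item: item[0], reverse=True)

    best_k = 0
    n_targets = 0
    n_decoys = 0
    for k, (_, is_target, _) in enumerate(labeled, start=1):
        if is_target:
            n_targets += 1
        else:
            n_decoys += 1
        # Estimated FDP among the top-k winners
        if (1 + n_decoys) / max(n_targets, 1) <= alpha:
            best_k = k
    return sorted(j for _, is_target, j in labeled[:best_k] if is_target)


target_scores = [9.0, 8.0, 7.0, 6.0, 5.0, 1.0]
decoy_scores = [1.0, 2.0, 1.5, 0.5, 1.2, 4.0]
selected = target_decoy_select(target_scores, decoy_scores, alpha=0.25)
```

No null distribution appears anywhere in the sketch; only the ordering of the winning scores and their target or decoy labels are used.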
In the era of big data, high-dimensional data often arrive in streams, making timely and accurate decisions necessary. It has become particularly important to rapidly and sequentially identify individuals whose behavior deviates from the norm. Aiming at identifying as many irregular behavioral patterns as possible, the authors develop a large-scale dynamic testing system in the framework of false discovery rate (FDR) control. By fully exploiting the sequential feature of datastreams, the authors propose a screening-assisted procedure that filters streams and then tests only the streams that pass the filter at each time point. A data-driven optimal screening threshold is derived, giving the new method an edge over existing methods. Under some mild conditions on the dependence structure of datastreams, the FDR is shown to be strongly controlled and the suggested approach for determining screening thresholds is asymptotically optimal. Simulation studies show that the proposed method is both accurate and powerful, and a real-data example is used for illustrative purposes.
This paper focuses on the support recovery of the Gaussian graphical model (GGM) with false discovery rate (FDR) control. The symmetrized data aggregation (SDA) technique, which involves sample splitting, data screening and information pooling, is applied in a node-based manner. A matrix of test statistics with a symmetry property is constructed, and a data-driven threshold is chosen to control the FDR for the support recovery of the GGM. The proposed method is shown to control the FDR asymptotically under some mild conditions. Extensive simulation studies and a real-data example demonstrate that it yields better FDR control while offering reasonable power in most cases.
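A data-driven threshold of the kind mentioned above can be sketched as follows, assuming a symmetric matrix `W` of edge statistics whose null entries are roughly symmetric about zero, so that large negative values estimate how many large positive values are false. The specific rule (smallest threshold t with (1 + #{W <= -t}) / max(#{W >= t}, 1) <= alpha), the function names, and the toy matrix are assumptions for illustration; the paper's construction of the statistics through sample splitting and screening is not reproduced here.

```python
def symmetric_threshold(stats, alpha):
    """Smallest t > 0 whose estimated FDP is at most alpha (inf if none)."""
    candidates = sorted({abs(w) for w in stats if w != 0})
    for t in candidates:
        n_neg = sum(1 for w in stats if w <= -t)  # proxy for false discoveries
        n_pos = sum(1 for w in stats if w >= t)   # discoveries at threshold t
        if (1 + n_neg) / max(n_pos, 1) <= alpha:
            return t
    return float("inf")


def select_edges(W, alpha):
    """Select edges (i, j), i < j, whose statistic reaches the threshold."""
    p = len(W)
    edges = [(i, j) for i in range(p) for j in range(i + 1, p)]
    t = symmetric_threshold([W[i][j] for i, j in edges], alpha)
    return [(i, j) for i, j in edges if W[i][j] >= t]


# Toy symmetric matrix of edge statistics for a 4-node graph (made-up values).
W = [
    [0.0, 5.0, 4.0, -0.5],
    [5.0, 0.0, 3.5, 0.3],
    [4.0, 3.5, 0.0, 6.0],
    [-0.5, 0.3, 6.0, 0.0],
]
edges = select_edges(W, alpha=0.25)
```

Only the upper triangle is scanned because the matrix is symmetric, so each candidate edge is counted once.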
Funding: supported by the National Basic Research Program of China (2011CB100100) and the National Natural Science Foundation of China (31000540).
Funding: supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014), and the K. C. Wong Education Foundation.
Funding: supported by the National Natural Science Foundation of China under Grant Nos. 11771332, 11771220, 11671178, 11925106, 11971247; the National Science Foundation of Tianjin under Grant Nos. 18JCJQJC46000, 18ZXZNGX00140; and the 111 Project B20016. Mushtaq was also supported by the Fundamental Research Funds for the Central Universities.
Funding: supported partially by the China National Key R&D Program under Grant Nos. 2019YFC1908502, 2022YFA1003703, 2022YFA1003802, and 2022YFA1003803, and the National Natural Science Foundation of China under Grant Nos. 11925106, 12231011, 11931001, and 11971247.