Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 11771332, 11771220, 11671178, 11925106, 11971247; the National Science Foundation of Tianjin under Grant Nos. 18JCJQJC46000, 18ZXZNGX00140; and the 111 Project B20016. Mushtaq was also supported by the Fundamental Research Funds for the Central Universities.
Abstract: In the era of big data, high-dimensional data always arrive in streams, making timely and accurate decisions necessary. It has become particularly important to rapidly and sequentially identify individuals whose behavior deviates from the norm. Aiming to identify as many irregular behavioral patterns as possible, the authors develop a large-scale dynamic testing system in the framework of false discovery rate (FDR) control. By fully exploiting the sequential nature of data streams, the authors propose a screening-assisted procedure that filters streams and then tests only the streams that pass the filter at each time point. A data-driven optimal screening threshold is derived, giving the new method an edge over existing methods. Under some mild conditions on the dependence structure of the data streams, the FDR is shown to be strongly controlled and the suggested approach for determining screening thresholds is asymptotically optimal. Simulation studies show that the proposed method is both accurate and powerful, and a real-data example is used for illustrative purposes.
Funding: Supported partially by the China National Key R&D Program under Grant Nos. 2019YFC1908502, 2022YFA1003703, 2022YFA1003802, and 2022YFA1003803, and the National Natural Science Foundation of China under Grant Nos. 11925106, 12231011, 11931001, and 11971247.
Abstract: This paper focuses on the support recovery of the Gaussian graphical model (GGM) with false discovery rate (FDR) control. The graceful symmetrized data aggregation (SDA) technique, which involves sample splitting, data screening and information pooling, is applied in a node-based way. A matrix of test statistics with a symmetry property is constructed and a data-driven threshold is chosen to control the FDR for the support recovery of the GGM. The proposed method is shown to control the FDR asymptotically under some mild conditions. Extensive simulation studies and a real-data example demonstrate that it yields better FDR control while offering reasonable power in most cases.
Funding: Supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014), and the K. C. Wong Education Foundation.
Abstract: The traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including the theoretical, permutation-based and empirical ones, have some inherent drawbacks. For example, the theoretical null might fail because of improper assumptions on the sample distribution. Here, we propose a null-distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach, named the target-decoy procedure, simply builds on the ordering of tests by some statistic or score, the null distribution of which is not required to be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. Evaluation is also made on two real datasets, an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
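The competition idea described in this abstract can be illustrated with a generic target-decoy sketch. This is not the authors' exact procedure: the function name target_decoy_select, the "+1" correction in the estimate, and the handling of ties are assumptions, and the decoy scores are taken as given rather than built from sample permutations.

```python
def target_decoy_select(target_scores, decoy_scores, alpha=0.05):
    """Generic target-decoy competition (illustrative sketch, not the paper's code).

    target_scores[i] is the score of test i on the original samples;
    decoy_scores[i] is the score of its decoy (assumed to be supplied).
    Returns the sorted indices of the tests reported as discoveries.
    """
    # Competition: each test keeps only the larger of its two scores.
    winners = []
    for i, (t, d) in enumerate(zip(target_scores, decoy_scores)):
        if t > d:
            winners.append((t, True, i))
        elif d > t:
            winners.append((d, False, i))
        # Ties are dropped here; this is an assumption of the sketch.

    # Order the winners from the highest score down.
    winners.sort(key=lambda w: w[0], reverse=True)

    # Walk down the list and keep the longest prefix whose estimated
    # false discovery proportion, (decoys + 1) / targets, is <= alpha.
    best_k = 0
    n_target = n_decoy = 0
    for k, (_, is_target, _) in enumerate(winners, start=1):
        if is_target:
            n_target += 1
        else:
            n_decoy += 1
        if (n_decoy + 1) / max(n_target, 1) <= alpha:
            best_k = k

    return sorted(i for _, is_target, i in winners[:best_k] if is_target)
```

No null distribution enters the computation: the number of decoy winners above the cutoff stands in for the unknown number of false target discoveries.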
Funding: Financial support of the Austrian Ministry for Transport, Innovation and Technology (BMVIT) and the Austrian Science Fund (FWF) via the project TRP46-B19. Part of the study was conducted using a travel grant provided by the European Science Foundation (ESF).
Abstract: Longevity is regarded as the most important functional trait in cattle breeding, with high economic value yet low heritability. In order to identify genomic regions associated with longevity, a genome-wide association study was performed using data from 4887 Fleckvieh bulls and 33,556 SNPs after quality control. Single-SNP regression was used to identify important SNPs, with eigenvectors included as a correction for population structure. SNPs selected with a false discovery rate threshold of 0.05 and with the local false discovery rate identified genomic regions associated with longevity, which were subsequently cross-checked against the National Center for Biotechnology Information (NCBI) database to identify interesting genes in cattle and their homologous forms in other species. The most notable genes were SYT10 located on chromosome 5, ADAMTS3 on chromosome 6, NTRK2 on chromosome 8 and SNTG1 on chromosome 14 of the cattle genome. Several of the genes found have previously been associated with cattle fertility. Poor fertility is an important culling reason and thereby affects longevity in cattle. Several signals were located in regions sparse in described genes, which suggests that there might be several other unidentified genetic pathways for this important trait.
Funding: Supported by the HarvestPlus Challenge Program of CGIAR, the Special Funds for EU Collaboration from the Ministry of Science and Technology of China (Project no. 1113), and the Seventh Framework Programme of European Commission (Project no. 266045).
Abstract: Epistasis is a commonly observed genetic phenomenon and an important source of variation of complex traits, which could maintain additive variance and therefore assure long-term genetic gain in breeding. Inclusive composite interval mapping (ICIM) is able to identify epistatic quantitative trait loci (QTLs) regardless of whether the two interacting QTLs have any additive effects. In this article, we conducted a simulation study to evaluate the detection power and false discovery rate (FDR) of ICIM epistatic mapping, considering F2 and doubled haploid (DH) populations, different F2 segregation ratios and population sizes. Results indicated that estimates of QTL locations and effects were unbiased, and that the detection power of epistatic mapping was largely affected by population size, heritability of epistasis, and the amount and distribution of genetic effects. When the same logarithm of odds (LOD) threshold was used, the detection power of QTL was higher in the F2 population than in the DH population; meanwhile, the FDR in F2 was also higher than that in DH. Increasing marker density from 10 cM to 5 cM led to similar detection power but higher FDR. In simulated populations, ICIM achieved better mapping results than multiple interval mapping (MIM) in the estimation of QTL positions and effects. Finally, we present epistatic mapping results of ICIM in one actual population in rice (Oryza sativa L.).
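As a rough illustration of how detection power and FDR can be tallied across simulated populations, the sketch below pools counts over runs. The function name power_and_fdr, the representation of QTLs as set elements, and the exact-match rule for a "true" detection are all assumptions; a real mapping study would typically count a detection as true when it falls within some interval around a simulated QTL.

```python
def power_and_fdr(runs):
    """Pooled empirical power and FDR over simulation runs (illustrative sketch).

    runs is a list of (detected, true) pairs of sets, one pair per
    simulated population. A detection counts as true only if it is an
    exact member of the true set (a simplifying assumption).
    """
    total_true = total_detected = total_hits = 0
    for detected, true in runs:
        total_hits += len(detected & true)   # correctly detected QTLs
        total_detected += len(detected)      # all reported QTLs
        total_true += len(true)              # all simulated QTLs

    # Power: share of simulated QTLs that were detected.
    power = total_hits / total_true if total_true else 0.0
    # FDR: share of reported QTLs that are false positives.
    fdr = (total_detected - total_hits) / total_detected if total_detected else 0.0
    return power, fdr
```

Pooling counts over runs is one possible convention; averaging the per-run false discovery proportion is another and can give a different number.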
Funding: Supported by the National Natural Science Foundation of China (Nos. 11471030, 11471035, 71201160).
Abstract: This study applies a bootstrap method for controlling the false discovery rate (FDR) when performing pairwise comparisons of normal means. Because the test statistics in pairwise comparisons are dependent, many conventional multiple testing procedures cannot be employed directly, and some modified procedures that control the FDR with dependent test statistics are too conservative. In this paper, we use bootstrap and goodness-of-fit methods to produce independent p-values for pairwise comparisons. Based on these independent p-values, many procedures can be used, and two typical FDR-controlling procedures are applied here. An example is provided to illustrate the proposed approach. Extensive simulations show satisfactory FDR control and power performance of our approach. In addition, the proposed approach can be easily extended to more than two normal or non-normal, balanced or unbalanced cases.
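The abstract does not name the two FDR-controlling procedures it applies. As an assumption, the sketch below shows the Benjamini-Hochberg step-up procedure, a standard choice once independent p-values are available; the function name benjamini_hochberg and the example p-values are illustrative only.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns the sorted indices of the hypotheses that are rejected
    at FDR level alpha.
    """
    m = len(p_values)
    # Indices of the hypotheses, ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k with p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    return sorted(order[:k])


# Hypothetical p-values from ten pairwise comparisons.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
rejected = benjamini_hochberg(p, alpha=0.05)
```

Here only the first two comparisons are rejected: 0.001 and 0.008 fall below their step-up cutoffs of 0.005 and 0.01, while every later p-value exceeds its cutoff.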