Funding: Supported by the National Natural Science Foundation of China under Grant Nos. 11771332, 11771220, 11671178, 11925106, and 11971247; the National Science Foundation of Tianjin under Grant Nos. 18JCJQJC46000 and 18ZXZNGX00140; and the 111 Project B20016. Mushtaq was also supported by the Fundamental Research Funds for the Central Universities.
Abstract: In the era of big data, high-dimensional data always arrive in streams, making timely and accurate decisions necessary. It has become particularly important to rapidly and sequentially identify individuals whose behavior deviates from the norm. Aiming to identify as many irregular behavioral patterns as possible, the authors develop a large-scale dynamic testing system in the framework of false discovery rate (FDR) control. By fully exploiting the sequential nature of data streams, the authors propose a screening-assisted procedure that filters streams and then tests only the streams that pass the filter at each time point. A data-driven optimal screening threshold is derived, giving the new method an edge over existing methods. Under some mild conditions on the dependence structure of the data streams, the FDR is shown to be strongly controlled, and the suggested approach for determining screening thresholds is shown to be asymptotically optimal. Simulation studies show that the proposed method is both accurate and powerful, and a real-data example is used for illustration.
Funding: Supported in part by the China National Key R&D Program under Grant Nos. 2019YFC1908502, 2022YFA1003703, 2022YFA1003802, and 2022YFA1003803, and by the National Natural Science Foundation of China under Grant Nos. 11925106, 12231011, 11931001, and 11971247.
Abstract: This paper focuses on the support recovery of the Gaussian graphical model (GGM) with false discovery rate (FDR) control. The symmetrized data aggregation (SDA) technique, which involves sample splitting, data screening and information pooling, is applied in a node-based way. A matrix of test statistics with a symmetry property is constructed, and a data-driven threshold is chosen to control the FDR for the support recovery of the GGM. The proposed method is shown to control the FDR asymptotically under some mild conditions. Extensive simulation studies and a real-data example demonstrate that it yields better FDR control while offering reasonable power in most cases.
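The abstract does not state the threshold rule itself. As a hedged illustration only, a common form of data-driven threshold for statistics that are symmetric about zero under the null can be written as follows. The symbols $W_j$ (test statistic for the $j$-th candidate edge), $\alpha$ (target FDR level) and $L$ (chosen threshold) are assumptions introduced here, not notation taken from the paper.

```latex
% Assumed generic symmetry-based threshold; not quoted from the paper.
% The count of large negative statistics serves as an estimate of the
% number of false discoveries among the large positive ones.
L = \inf\left\{ t > 0 :
      \frac{1 + \#\{ j : W_j \le -t \}}
           {\max\bigl( \#\{ j : W_j \ge t \},\, 1 \bigr)} \le \alpha
    \right\},
\qquad
\text{declare edge } j \text{ present if } W_j \ge L .
```

The idea behind this form is that null statistics fall on either side of zero with equal probability, so the left tail gives a conservative count of nulls in the right tail.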
Funding: Supported by the National Key R&D Program of China (No. 2018YFB0704304), the National Natural Science Foundation of China (Nos. 32070668, 62002231, 61832003, 61433014), and the K. C. Wong Education Foundation.
Abstract: Traditional approaches to false discovery rate (FDR) control in multiple hypothesis testing are usually based on the null distribution of a test statistic. However, all types of null distributions, including theoretical, permutation-based and empirical ones, have inherent drawbacks. For example, the theoretical null might fail because of improper assumptions about the sample distribution. Here, we propose a null-distribution-free approach to FDR control for multiple hypothesis testing in case-control studies. This approach, named the target-decoy procedure, builds only on the ordering of tests by some statistic or score, the null distribution of which need not be known. Competitive decoy tests are constructed from permutations of the original samples and are used to estimate the number of false target discoveries. We prove that this approach controls the FDR when the score function is symmetric and the scores are independent between different tests. Simulations demonstrate that it is more stable and powerful than two popular traditional approaches, even in the presence of dependency. The approach is also evaluated on two real datasets, an Arabidopsis genomics dataset and a COVID-19 proteomics dataset.
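The competition between targets and decoys described above can be sketched as follows. This is a minimal sketch of the generic target-decoy competition rule, assuming each test already has one target score and one decoy score; it is not the authors' exact procedure. The function name `target_decoy_select`, its parameters, and the "+1" correction in the estimate are assumptions made for illustration, and the construction of decoy scores from permuted samples is left out.

```python
import numpy as np

def target_decoy_select(target_scores, decoy_scores, alpha=0.05):
    """Generic target-decoy competition (illustrative, hypothetical helper).

    Each test keeps the larger of its two scores. Decoy wins among the
    top-ranked tests are used to estimate the false target discoveries.
    Returns the sorted indices of the selected tests.
    """
    target_scores = np.asarray(target_scores, dtype=float)
    decoy_scores = np.asarray(decoy_scores, dtype=float)

    # A test counts as a target win only if the target strictly beats
    # the decoy; ties are counted as decoy wins, which is conservative.
    target_wins = target_scores > decoy_scores
    winning_score = np.maximum(target_scores, decoy_scores)

    # Rank all tests by their winning score, largest first.
    order = np.argsort(-winning_score, kind="stable")
    wins_sorted = target_wins[order]

    # Running counts of target wins and decoy wins down the ranking.
    n_targets = np.cumsum(wins_sorted)
    n_decoys = np.cumsum(~wins_sorted)

    # Estimated FDR among the top-k tests (assumed "+1" correction).
    fdr_hat = (n_decoys + 1) / np.maximum(n_targets, 1)

    passing = np.nonzero(fdr_hat <= alpha)[0]
    if passing.size == 0:
        return np.array([], dtype=int)

    # Use the deepest cutoff that still meets the level, then report
    # only the target wins above that cutoff.
    k = passing[-1] + 1
    top = order[:k]
    return np.sort(top[wins_sorted[:k]])
```

Only the ordering of the scores matters in this sketch: no p-values or null distribution enter the computation, which matches the motivation given in the abstract.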
Abstract: Discovery rates for all metals, including gold, are declining, the cost per significant discovery is increasing sharply, and the economic situation of the industry is one of low base rate. The current hierarchical structure of the exploration and mining industry makes this situation difficult to redress. Economic geologists can do little to influence the required changes to the overall structure and philosophy of an industry driven by business rather than geological principles. However, it should be possible to follow the lead of the oil industry and improve the success rate of greenfield exploration, which is necessary for the next group of lower-exploration-spend significant mineral deposit discoveries. Here we promote the concept that mineral explorers need to carefully consider the scale at which their exploration targets are viewed. It is necessary to assess the potential of drill targets at the terrane to province to district scale, rather than at the deposit scale, where most current economic geology research and conceptual thinking is concentrated. If orogenic, IRGD, Carlin-style and IOCG gold-rich systems are viewed at the deposit scale, they appear quite different in terms of conventionally adopted research parameters. However, recent models for these deposit styles show increasingly similar source-region parameters when viewed at the lithosphere scale, suggesting common tectonic settings. Only by assessing individual targets in their tectonic context can they be more reliably ranked in terms of their potential to provide a significant drill discovery. Targets adjacent to craton margins, other lithosphere boundaries, and suture zones are clearly favoured for all of these gold deposit styles, and such exploration could lead to the incidental discovery of major deposits of other metals sited along the same tectonic boundaries.
Funding: Financial support was provided by the Austrian Ministry for Transport, Innovation and Technology (BMVIT) and the Austrian Science Fund (FWF) via the project TRP46-B19. Part of the study was conducted using a travel grant provided by the European Science Foundation (ESF).
Abstract: Longevity is regarded as the most important functional trait in cattle breeding, with high economic value yet low heritability. To identify genomic regions associated with longevity, a genome-wide association study was performed using data from 4887 Fleckvieh bulls and 33,556 SNPs after quality control. Single-SNP regression was used to identify important SNPs, with eigenvectors included as a correction for population structure. SNPs selected with a false discovery rate threshold of 0.05 and with the local false discovery rate identified genomic regions associated with longevity, which were subsequently cross-checked against the National Center for Biotechnology Information (NCBI) database to identify interesting genes in cattle and their homologues in other species. The most notable genes were SYT10 on chromosome 5, ADAMTS3 on chromosome 6, NTRK2 on chromosome 8 and SNTG1 on chromosome 14 of the cattle genome. Several of the genes found have previously been associated with cattle fertility. Poor fertility is an important reason for culling and thereby affects longevity in cattle. Several signals were located in regions with few described genes, which suggests that there might be several other unidentified genetic pathways for this important trait.
Funding: Project (61472026) supported by the National Natural Science Foundation of China; Project (2014J410081) supported by the Guangzhou Scientific Research Program, China.
Abstract: When detecting deletions in complex human genomes, split-read approaches using short reads generated with next-generation sequencing still face the challenge that either the false discovery rate is high or the sensitivity is low. To address this problem, an integrated strategy is proposed. It combines the fundamental theories of the three mainstream methods (read-pair approaches, split-read technologies and read-depth analysis) with modern machine learning algorithms, using feature extraction as a bridge. Compared with state-of-the-art split-read methods for deletion detection at both low and high sequence coverage, the machine-learning-aided strategy balances sensitivity and false discovery rate well and produces a call set that is both more sensitive and more precise at single-base-pair resolution. Thus, users no longer need to rely on prior experience to make an unnecessary trade-off beforehand and adjust parameters repeatedly. Modern machine learning models can therefore play an important role in the field of structural variation prediction.
Abstract: The paper discusses the generalization of the constrained Bayesian method (CBM) to arbitrary loss functions and its application to testing directional hypotheses. The problem is stated in terms of false and true discovery rates. One more criterion for assessing the quality of directional hypothesis tests, the Type III error rate, is considered. The relationship among the discovery rates and the Type III error rate in CBM is examined. The advantage of CBM in comparison with Bayes and frequentist methods is proved theoretically and demonstrated by an example.
Funding: Supported by the HarvestPlus Challenge Program of CGIAR, the Special Funds for EU Collaboration from the Ministry of Science and Technology of China (Project no. 1113), and the Seventh Framework Programme of the European Commission (Project no. 266045).
Abstract: Epistasis is a commonly observed genetic phenomenon and an important source of variation in complex traits, which could maintain additive variance and therefore assure long-term genetic gain in breeding. Inclusive composite interval mapping (ICIM) is able to identify epistatic quantitative trait loci (QTLs) regardless of whether the two interacting QTLs have any additive effects. In this article, we conducted a simulation study to evaluate the detection power and false discovery rate (FDR) of ICIM epistatic mapping, considering F2 and doubled haploid (DH) populations, different F2 segregation ratios and different population sizes. Results indicated that estimates of QTL locations and effects were unbiased, and that the detection power of epistatic mapping was largely affected by population size, heritability of epistasis, and the amount and distribution of genetic effects. When the same logarithm of odds (LOD) threshold was used, the detection power of QTL was higher in the F2 population than in the DH population; meanwhile, the FDR in F2 was also higher than that in DH. Increasing marker density from 10 cM to 5 cM led to similar detection power but higher FDR. In simulated populations, ICIM achieved better mapping results than multiple interval mapping (MIM) in the estimation of QTL positions and effects. Finally, we present epistatic mapping results of ICIM in one actual population in rice (Oryza sativa L.).
Funding: Supported by the National Natural Science Foundation of China (Nos. 11471030, 11471035, 71201160).
Abstract: This study applies a bootstrap method for controlling the false discovery rate (FDR) when performing pairwise comparisons of normal means. Because the test statistics in pairwise comparisons are dependent, many conventional multiple testing procedures cannot be employed directly, and some modified procedures that control the FDR with dependent test statistics are too conservative. In this paper, we use bootstrap and goodness-of-fit methods to produce independent p-values for pairwise comparisons. Based on these independent p-values, many procedures can be used, and two typical FDR-controlling procedures are applied here. An example is provided to illustrate the proposed approach. Extensive simulations show the satisfactory FDR control and power performance of our approach. In addition, the proposed approach can be easily extended to more than two normal or non-normal populations, in balanced or unbalanced cases.
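The abstract does not name the two FDR-controlling procedures it applies. As an assumed example of what can be run once independent p-values are available, here is a minimal sketch of the Benjamini-Hochberg step-up procedure, a standard rule that is valid under independence. The function name `benjamini_hochberg` and its parameters are illustrative, and the bootstrap step that produces the p-values is not shown.

```python
import numpy as np

def benjamini_hochberg(p_values, alpha=0.05):
    """Benjamini-Hochberg step-up procedure (illustrative sketch).

    Returns a boolean array, in the original order of the p-values,
    marking which hypotheses are rejected at FDR level alpha.
    """
    p = np.asarray(p_values, dtype=float)
    m = p.size
    reject = np.zeros(m, dtype=bool)
    if m == 0:
        return reject

    # Sort the p-values and compare the i-th smallest with alpha * i / m.
    order = np.argsort(p)
    thresholds = alpha * np.arange(1, m + 1) / m
    below = p[order] <= thresholds

    if below.any():
        # Step-up rule: find the largest rank that meets its threshold
        # and reject every hypothesis ranked at or below it.
        k = np.nonzero(below)[0][-1]
        reject[order[: k + 1]] = True
    return reject
```

For instance, `benjamini_hochberg([0.2, 0.001, 0.5, 0.008])` rejects the second and fourth hypotheses at the default level. Because the rule is step-up, a hypothesis can be rejected even when its own p-value exceeds its rank threshold, provided a larger p-value meets the threshold at a higher rank.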