Abstract
Single-cell RNA-seq data analysis generally requires quality control, normalization, selection of highly variable genes, dimensionality reduction, and clustering. Among these steps, the downstream analyses, including dimensionality reduction and clustering, are sensitive to the selection of highly variable genes. Although an increasing number of tools for selecting highly variable genes have been developed, an evaluation of their performance and a general strategy are lacking. Here, we compare the performance of nine commonly used methods for selecting variable genes, using single-cell RNA-seq data from hematopoietic stem/progenitor cells and mature blood cells, and find that SCHS outperforms the other methods in terms of reproducibility and accuracy. However, this method is biased toward selecting highly expressed genes. We further propose a new strategy, SIEVE (SIngle-cEll Variable gEnes), which uses multiple rounds of random sampling to minimize stochastic noise and identify a robust set of variable genes. Moreover, SIEVE recovers lowly expressed genes as variable genes and substantially improves the accuracy of single-cell classification, especially for the methods with lower reproducibility. The SIEVE software is freely available at https://github.com/YinanZhang522/SIEVE.
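The resampling idea described above can be illustrated with a minimal sketch. This is not the SIEVE software or its API: the function names, the variance-to-mean ranking used as a stand-in selection method, and the parameters `n_rounds`, `fraction`, `n_top`, and `min_frequency` are all assumptions made for illustration. The sketch repeatedly draws a random subset of cells, selects variable genes within each subset, and keeps only the genes that are selected in most rounds.

```python
import numpy as np


def top_variable_genes(counts, n_top):
    """Return indices of the n_top genes ranked by variance-to-mean ratio.

    Stand-in for any variable-gene selection method (an assumption of this
    sketch). counts is a cells x genes array.
    """
    mean = counts.mean(axis=0)
    var = counts.var(axis=0)
    # Genes with zero mean get dispersion 0 so they are never preferred.
    dispersion = np.where(mean > 0, var / np.maximum(mean, 1e-12), 0.0)
    return np.argsort(dispersion)[::-1][:n_top]


def resampled_variable_genes(counts, n_rounds=50, fraction=0.7, n_top=100,
                             min_frequency=0.8, seed=0):
    """Keep genes that are selected in at least min_frequency of the rounds.

    All parameter names and defaults are hypothetical.
    """
    rng = np.random.default_rng(seed)
    n_cells, n_genes = counts.shape
    n_sub = max(2, int(round(fraction * n_cells)))
    hits = np.zeros(n_genes, dtype=int)
    for _ in range(n_rounds):
        # Draw a random subset of cells without replacement.
        cells = rng.choice(n_cells, size=n_sub, replace=False)
        # Count each gene selected in this round once.
        hits[top_variable_genes(counts[cells], n_top)] += 1
    frequency = hits / n_rounds
    return np.flatnonzero(frequency >= min_frequency), frequency
```

Calling `resampled_variable_genes(counts)` on a cells-by-genes count matrix returns the indices of the recurrently selected genes together with each gene's selection frequency. Genes picked up only in a few subsamples, which are more likely to reflect sampling noise, fall below the frequency threshold and are dropped.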
Funding
This work has been supported by the National Natural Science Foundation of China (82022002, 81900117, 81890993, 81890990, 32000803),
the National Key Research and Development Program of China (2018YFA0107804),
the CAMS Initiative for Innovative Medicine (2017-I2M-1-015, 2017-I2M-3-009, 2019-I2M-2-001),
and the Fundamental Research Funds for the Central Research Institutes (2020-RC310-005).