Robust integration of multiple single-cell RNA sequencing datasets using a single reference space

Abstract

In many biological applications of single-cell RNA sequencing (scRNA-seq), an integrated analysis of data from multiple batches or studies is necessary. Current methods typically achieve integration using shared cell types or covariance correlation between datasets, which can distort biological signals. Here we introduce an algorithm that uses the gene eigenvectors from a reference dataset to establish a global frame for integration. Using simulated and real datasets, we demonstrate that this approach, called Reference Principal Component Integration (RPCI), consistently outperforms other methods by multiple metrics, with clear advantages in preserving genuine cross-sample gene expression differences in matching cell types, such as those present in cells at distinct developmental stages or in perturbed versus control studies. Moreover, RPCI maintains this robust performance when multiple datasets are integrated. Finally, we applied RPCI to scRNA-seq data for mouse gut endoderm development and revealed the temporal emergence of genetic programs that help establish the anterior–posterior axis in visceral endoderm.
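
The core idea in the abstract, using gene eigenvectors from one reference dataset as a shared frame, can be illustrated with a minimal Python sketch. This is not the RISC implementation: the function names `reference_pcs` and `project`, the toy data, the number of components, and the choice to center every dataset on the reference gene means are all assumptions made for illustration.

```python
import numpy as np

def reference_pcs(ref, n_pcs):
    """Gene means and top gene eigenvectors of a cells x genes reference matrix."""
    means = ref.mean(axis=0)
    centered = ref - means
    # Rows of vt are right singular vectors, i.e. eigenvectors of the
    # gene-by-gene covariance of the reference.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return means, vt[:n_pcs].T  # loadings: genes x n_pcs

def project(data, means, loadings):
    """Embed any cells x genes matrix in the reference PC space."""
    # Assumption: every dataset is centered on the reference gene means.
    return (data - means) @ loadings

# Toy data (hypothetical): two cell types, one reference and one shifted batch.
rng = np.random.default_rng(0)
n_genes = 50
centers = 3 * rng.normal(size=(2, n_genes))
ref = np.vstack([c + rng.normal(size=(100, n_genes)) for c in centers])
batch_shift = rng.normal(size=n_genes)
target = np.vstack([c + batch_shift + rng.normal(size=(80, n_genes)) for c in centers])

means, loadings = reference_pcs(ref, n_pcs=5)
ref_emb = project(ref, means, loadings)
target_emb = project(target, means, loadings)
print(ref_emb.shape, target_emb.shape)
```

Because the loadings are computed once from the reference and reused for every other dataset, all cells end up with coordinates in a single space. The sketch stops at projection and does not attempt any of the downstream steps of the published method.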

Data availability

All scRNA-seq datasets in this study were published previously, and their availabilities are described in Supplementary Table 2.

Code availability

RISC is implemented as an R package and is freely available on GitHub (https://github.com/bioinfoDZ/RISC). Code for the analysis (and related source data) is provided on Code Ocean (https://codeocean.com/capsule/9098032).

Acknowledgements

We thank all the research groups that generated and shared the scRNA-seq data used in this study. We thank the members of the Zheng lab for valuable discussions, software testing and comments on the manuscript. We also acknowledge funding support from the National Institutes of Health (grants HL133120 to D.Z. and B.Z., HL153920 to D.Z., HD092944 to D.Z. and B.Z., and HD070454 to D.Z.).

Author information

Affiliations

  1. Department of Genetics, Albert Einstein College of Medicine, Bronx, NY, USA

    Yang Liu, Tao Wang, Bin Zhou & Deyou Zheng

  2. Department of Epidemiology and Population Health, Albert Einstein College of Medicine, Bronx, NY, USA

    Tao Wang

  3. Department of Pediatrics, Albert Einstein College of Medicine, Bronx, NY, USA

    Bin Zhou

  4. Department of Medicine (Cardiology), Albert Einstein College of Medicine, Bronx, NY, USA

    Bin Zhou

  5. Department of Neurology, Albert Einstein College of Medicine, Bronx, NY, USA

    Deyou Zheng

  6. Department of Neuroscience, Albert Einstein College of Medicine, Bronx, NY, USA

    Deyou Zheng

Contributions

Y.L. and D.Z. conceived the algorithm and analysis. Y.L. performed the analyses. T.W. and B.Z. contributed to the methods or discussions. D.Z. supervised the study. All authors wrote the manuscript.

Corresponding author

Correspondence to Deyou Zheng.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Notes, Supplementary Methods and Supplementary Figs. 1–21.

Supplementary Tables 1 and 2

Supplementary Table 1. Metric scores from pairwise integration and full integration of the semi-simulated data. Supplementary Table 2. List of real scRNA-seq datasets.

About this article

Cite this article

Liu, Y., Wang, T., Zhou, B. et al. Robust integration of multiple single-cell RNA sequencing datasets using a single reference space. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00859-x
