Iterative single-cell multi-omic integration using online learning

Abstract

Integrating large single-cell gene expression, chromatin accessibility and DNA methylation datasets requires general and scalable computational approaches. Here we describe online integrative non-negative matrix factorization (iNMF), an algorithm for integrating large, diverse and continually arriving single-cell datasets. Our approach scales to arbitrarily large numbers of cells using fixed memory, iteratively incorporates new datasets as they are generated and allows many users to simultaneously analyze a single copy of a large dataset by streaming it over the internet. Iterative data addition can also be used to map new data to a reference dataset. Comparisons with previous methods indicate that the improvements in efficiency do not sacrifice dataset alignment and cluster preservation performance. We demonstrate the effectiveness of online iNMF by integrating more than 1 million cells on a standard laptop, integrating large single-cell RNA sequencing and spatial transcriptomic datasets, and iteratively constructing a single-cell multi-omic atlas of the mouse motor cortex.
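
As a rough illustration of how an online factorization can run in fixed memory, the sketch below streams mini-batches of cells through a plain non-negative matrix factorization. It is a simplification, not the full iNMF objective, and it omits dataset-specific factors. It follows the general online matrix factorization idea of Mairal et al. (reference 6): only the factor matrix W and two small summary matrices A and B are kept between batches. The function name `online_nmf`, its parameters and the multiplicative update rules are assumptions chosen for brevity and are not taken from the LIGER implementation.

```python
import numpy as np

def online_nmf(batches, n_features, k=5, n_inner=20, seed=0):
    """Toy online NMF: each mini-batch X (cells x features) ~= H @ W, with H, W >= 0.

    Hypothetical helper for illustration only; not the LIGER/rliger API.
    """
    rng = np.random.default_rng(seed)
    eps = 1e-10                        # guards against division by zero
    W = rng.random((k, n_features))    # shared factors, kept across batches
    A = np.zeros((k, k))               # running sum of H^T H
    B = np.zeros((k, n_features))      # running sum of H^T X
    for X in batches:
        # 1) With W fixed, fit non-negative loadings H for this batch only.
        H = rng.random((X.shape[0], k))
        for _ in range(n_inner):
            H *= (X @ W.T) / (H @ (W @ W.T) + eps)
        # 2) Fold the batch into the fixed-size summaries; X and H can then be discarded.
        A += H.T @ H
        B += H.T @ X
        # 3) Refine W from the summaries alone (multiplicative update keeps W >= 0).
        for _ in range(n_inner):
            W *= B / (A @ W + eps)
    return W, A, B


# Usage: stream 10 batches of 200 synthetic "cells" with 50 features each.
rng = np.random.default_rng(1)
W_true = rng.random((5, 50))
stream = (rng.random((200, 5)) @ W_true for _ in range(10))
W, A, B = online_nmf(stream, n_features=50, k=5)
print(W.shape, A.shape, B.shape)
```

Because A is k × k and B is k × features, the state carried between batches does not grow with the number of cells processed, which is the property that lets a streamed dataset of any size be handled with constant memory in this toy setting.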

Data availability

• Human PBMC from Kang et al.⁹ (GSE96583) distributed by SeuratData

• Human pancreatic islet cells from Grün et al.¹⁰ (GSE81076), Muraro et al.¹¹ (GSE85241), Lawlor et al.¹² (GSE86469), Baron et al.¹³ (GSE84133) and Segerstolpe et al.¹⁴ (E-MTAB-5061) distributed by SeuratData

• Adult mouse brain cells from Saunders et al.⁷ (http://dropviz.org/)

• Mouse Organogenesis Cell Atlas from Cao et al.¹⁸ (https://oncoscape.v3.sttrcancer.org/atlas.gs.washington.edu.mouse.rna/downloads)

• Mouse hippocampus cells from Rodriques et al.¹⁹ (https://singlecell.broadinstitute.org/single_cell/study/SCP354/slide-seq-study#study-download)

• Mouse hippocampus cells from Yao et al.²² (http://data.nemoarchive.org/biccn/grant/zeng/zeng/transcriptome/scell/10X/processed/YaoHippo2020/)

• Mouse hypothalamic pre-optic region data from Moffitt et al.²³ (https://datadryad.org/stash/dataset/doi:10.5061/dryad.8t8s248 and GSE113576)

• Mouse primary motor cortex cells from Yao et al.²⁷ (https://assets.nemoarchive.org/dat-ch1nqb7)

Code availability

An R implementation of LIGER is available from the Comprehensive R Archive Network at https://cran.r-project.org/package=rliger and on GitHub at https://github.com/welch-lab/liger, along with detailed installation instructions. Tutorials demonstrating package functionality, including online learning for Scenario 1, Scenario 2 and Scenario 3, are available on the GitHub page.

References

  1. Ye, Z. & Sarkar, C. A. Towards a quantitative understanding of cell identity. Trends Cell Biol. 28, 1030–1048 (2018).

  2. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902 (2019).

  3. Stuart, T. & Satija, R. Integrative single-cell analysis. Nat. Rev. Genet. 20, 257–272 (2019).

  4. Korsunsky, I. et al. Fast, sensitive and accurate integration of single-cell data with Harmony. Nat. Methods 16, 1289–1296 (2019).

  5. Welch, J. D. et al. Single-cell multi-omic integration compares and contrasts features of brain cell identity. Cell 177, 1873–1887 (2019).

  6. Mairal, J., Bach, F., Ponce, J. & Sapiro, G. Online learning for matrix factorization and sparse coding. J. Mach. Learn. Res. 11, 19–60 (2010).

  7. Saunders, A. et al. Molecular diversity and specializations among the cells of the adult mouse brain. Cell 174, 1015–1030 (2018).

  8. Tran, H. T. N. et al. A benchmark of batch-effect correction methods for single-cell RNA sequencing data. Genome Biol. 21, 12 (2020).

  9. Kang, H. M. et al. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat. Biotechnol. 36, 89–94 (2018).

  10. Grün, D. et al. De novo prediction of stem cell identity using single-cell transcriptome data. Cell Stem Cell 19, 266–277 (2016).

  11. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Syst. 3, 385–394 (2016).

  12. Lawlor, N. et al. Single-cell transcriptomes identify human islet cell signatures and reveal cell-type-specific expression changes in type 2 diabetes. Genome Res. 27, 208–222 (2017).

  13. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Syst. 3, 346–360 (2016).

  14. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

  15. Toda, T., Parylak, S. L., Linker, S. B. & Gage, F. H. The role of adult hippocampal neurogenesis in brain health and disease. Mol. Psychiatry 24, 67–87 (2019).

  16. Ernst, A. et al. Neurogenesis in the striatum of the adult human brain. Cell 156, 1072–1083 (2014).

  17. Zeisel, A. et al. Molecular architecture of the mouse nervous system. Cell 174, 999–1014 (2018).

  18. Cao, J. et al. The single-cell transcriptional landscape of mammalian organogenesis. Nature 566, 496–502 (2019).

  19. Rodriques, S. G. et al. Slide-seq: a scalable technology for measuring genome-wide expression at high spatial resolution. Science 363, 1463–1467 (2019).

  20. Stickels, R. R. et al. Highly sensitive spatial transcriptomics at near-cellular resolution with Slide-seqV2. Nat. Biotechnol. 39, 313–319 (2021).

  21. Chen, K. H., Boettiger, A. N., Moffitt, J. R., Wang, S. & Zhuang, X. RNA imaging. Spatially resolved, highly multiplexed RNA profiling in single cells. Science 348, aaa6090 (2015).

  22. Yao, Z. et al. A taxonomy of transcriptomic cell types across the isocortex and hippocampal formation. Preprint at bioRxiv https://doi.org/10.1101/2020.03.30.015214 (2020).

  23. Moffitt, J. R. et al. Molecular, spatial, and functional single-cell profiling of the hypothalamic preoptic region. Science 362, eaau5324 (2018).

  24. Ecker, J. R. et al. The BRAIN Initiative Cell Census Consortium: lessons learned toward generating a comprehensive brain cell atlas. Neuron 96, 542–557 (2017).

  25. HuBMAP Consortium. The human body at cellular resolution: the NIH Human Biomolecular Atlas Program. Nature 574, 187–192 (2019).

  26. Regev, A. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  27. Yao, Z. et al. An integrated transcriptomic and epigenomic atlas of mouse primary motor cortex cell types. Preprint at bioRxiv https://doi.org/10.1101/2020.02.29.970558 (2020).

  28. Tasic, B. et al. Adult mouse cortical cell taxonomy revealed by single cell transcriptomics. Nat. Neurosci. 19, 335–346 (2016).

  29. Butler, A., Hoffman, P., Smibert, P., Papalexi, E. & Satija, R. Integrating single-cell transcriptomic data across different conditions, technologies, and species. Nat. Biotechnol. 36, 411–420 (2018).

  30. Büttner, M., Miao, Z., Wolf, F. A., Teichmann, S. A. & Theis, F. J. A test metric for assessing single-cell RNA-seq batch correction. Nat. Methods 16, 43–49 (2019).

  31. Hubert, L. & Arabie, P. Comparing partitions. J. Classification 2, 193–218 (1985).

  32. Rand, W. M. Objective criteria for the evaluation of clustering methods. J. Am. Stat. Assoc. 66, 846–850 (1971).

  33. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

Acknowledgements

This work was supported by National Institutes of Health grants R01AI149669-01, R01HG010883-01 and RF1MH123199 (to J.D.W.) and 5U19MH114831 (to J.R.E.). J.R.E. is an Investigator of the Howard Hughes Medical Institute.

Author information

Author notes

  1. Chongyuan Luo

    Present address: Department of Human Genetics, University of California, Los Angeles, Los Angeles, CA, USA

Affiliations

  1. Department of Computational Medicine and Bioinformatics, University of Michigan, Ann Arbor, MI, USA

    Chao Gao, Jialin Liu, April R. Kriebel & Joshua D. Welch

  2. Center for Epigenomics, Department of Cellular and Molecular Medicine, University of California, San Diego, School of Medicine, La Jolla, CA, USA

    Sebastian Preissl & Bing Ren

  3. Genomic Analysis Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Chongyuan Luo, Rosa Castanon, Justin Sandoval, Angeline Rivkin, Joseph R. Nery & Joseph R. Ecker

  4. Howard Hughes Medical Institute, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Chongyuan Luo & Joseph R. Ecker

  5. Computational Neurobiology Laboratory, The Salk Institute for Biological Studies, La Jolla, CA, USA

    Margarita M. Behrens

  6. Department of Computer Science and Engineering, University of Michigan, Ann Arbor, MI, USA

    Joshua D. Welch

Contributions

S.P., C.L., R.C., J.S., A.R., J.R.N., M.M.B., J.R.E. and B.R. generated the snATAC-seq and snmC-seq data. J.D.W. conceived the idea of online iNMF. C.G. and J.D.W. developed and implemented the online iNMF algorithm. C.G., J.L., A.R.K. and J.D.W. carried out data analyses. C.G., J.L., A.R.K. and J.D.W. wrote the paper. All authors read and approved the final manuscript.

Corresponding author

Correspondence to Joshua D. Welch.

Ethics declarations

Competing interests

A patent application on LIGER has been submitted by the Broad Institute and the General Hospital Corporation with J.D.W. listed as an inventor. The remaining authors declare no competing financial interests.

Additional information

Peer review information: Nature Biotechnology thanks Samantha A. Morris and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 Convergence behavior for online iNMF and batch iNMF algorithms on scRNA-seq data from the adult mouse brain, human PBMC and human pancreas.

The online iNMF algorithm exhibits faster convergence and better objective minimization after a fixed amount of training time. The advantage of the online algorithm in convergence speed is more apparent for larger datasets. a–c, Adult mouse brain (n = 691,962 cells, nine individual datasets). d–f, Human PBMC (n = 13,999 cells, two individual datasets). g–i, Human pancreas (n = 14,890 cells, eight individual datasets). Center lines of box plots show the median; box limits, upper and lower quartiles; whiskers, 1.5× interquartile range; points, outliers.

Extended Data Fig. 2 Online and batch iNMF yield highly similar UMAP visualizations.

We performed online iNMF and batch iNMF on data from mouse cortex (n = 255,353 cells), human PBMC (n = 13,999 cells), and human pancreas (n = 14,890 cells). Online iNMF and batch iNMF produce very similar visualizations, suggesting that the approaches give very similar dataset alignment and cluster preservation. We subsequently confirmed this qualitative observation using quantitative metrics.

Extended Data Fig. 3 Benchmarking integration across data modalities (RNA+ATAC).

We integrated 5,000 cells from the snRNA-seq dataset and 5,000 cells from the snATAC-seq dataset of the MOp data collection using four different methods. The cells are shown in two-dimensional UMAP space and colored by dataset.

Extended Data Fig. 4 Performing online iNMF in three scenarios produces similar results.

These analyses were carried out separately to integrate eight MOp datasets (scRNA-seq, snRNA-seq, snATAC-seq and snmC-seq, n = 408,885) using online iNMF in scenario 1 (a), scenario 2 (b), and scenario 3 (c). The results are visualized in UMAP coordinates and the cells are colored by the cell type annotations from Fig. 6.

About this article

Cite this article

Gao, C., Liu, J., Kriebel, A.R. et al. Iterative single-cell multi-omic integration using online learning. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00867-x
