Modular, efficient and constant-memory single-cell RNA-seq preprocessing

Abstract

We describe a workflow for preprocessing single-cell RNA-sequencing data that balances efficiency and accuracy. Our workflow is based on the kallisto and bustools programs; it is near-optimal in speed and has a constant memory requirement, which makes it scalable to arbitrarily large datasets. The workflow is modular, and we demonstrate its flexibility by showing how it can be used for RNA velocity analyses.

Data availability

A diverse set of 20 datasets was compiled for the purpose of benchmarking preprocessing workflows. Datasets produced and distributed by 10x Genomics were downloaded from the 10x Genomics data downloads page: https://support.10xgenomics.com/single-cell-gene-expression/datasets. Six v3 chemistry datasets and two v2 chemistry datasets were downloaded and processed (Supplementary Table 3). Another 12 datasets were obtained from either the SRA or the European Nucleotide Archive; all were produced with 10x Genomics v2 chemistry. For six of the datasets (SRR6956073, SRR6998058, SRR7299563, SRR8206317, SRR8327928 and SRR8524760), the BAM files were downloaded and the Cell Ranger utility bamtofastq was run to produce FASTQ files for preprocessing from Cell Ranger–structured BAM files. FASTQ files were downloaded directly for the datasets E-MTAB-7320, SRR8257100, SRR8513910, SRR8599150 (available at https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R1_001.fastq.gz and https://github.com/bustools/getting_started/releases/download/getting_started/SRR8599150_S1_L001_R2_001.fastq.gz), SRR8611943 and SRR8639063.

Details of all datasets and their accession numbers can be found in Supplementary Table 3. All genome annotations and reference transcriptomes can be found at https://doi.org/10.22002/D1.1876.
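As an illustration of the direct FASTQ download described above, the following minimal Python sketch fetches the two SRR8599150 files from the URLs given in this section. The helper names `local_name` and `download` and the `fastq` output directory are assumptions made for this example and are not part of the published workflow.

```python
import shutil
from pathlib import Path
from urllib.parse import urlparse
from urllib.request import urlopen

# URLs copied from the data availability statement.
BASE = "https://github.com/bustools/getting_started/releases/download/getting_started/"
FASTQ_URLS = [
    BASE + "SRR8599150_S1_L001_R1_001.fastq.gz",
    BASE + "SRR8599150_S1_L001_R2_001.fastq.gz",
]


def local_name(url: str) -> str:
    """Return the file-name component of a URL (hypothetical helper)."""
    return Path(urlparse(url).path).name


def download(url: str, dest_dir: Path) -> Path:
    """Stream one file into dest_dir, skipping it if it is already present.

    The data are written to a temporary '.part' file first and renamed on
    success, so an interrupted transfer is not mistaken for a complete file.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / local_name(url)
    if target.exists():
        return target
    partial = target.with_name(target.name + ".part")
    with urlopen(url) as response, open(partial, "wb") as out:
        shutil.copyfileobj(response, out)  # copies in chunks, not all at once
    partial.replace(target)
    return target
```

Calling `download(url, Path("fastq"))` for each entry of `FASTQ_URLS` would place both read files in a local `fastq` directory. The datasets distributed as BAM files require the additional Cell Ranger `bamtofastq` conversion step, which is not shown here.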

Code availability

The software versions used for the results in the paper were: Alevin v0.13.1, bustools v0.39.1, Cell Ranger v3.0.0, DropletUtils v1.6.1, kallisto v0.46.0, Python v3.7, R v3.5.2, Scanpy v1.4.1, scvelo v0.1.17, Seurat v3.0, snakemake v5.3.0, STARsolo v2.7.0e, velocyto v0.17.17, wc v8.22 (GNU coreutils) and zcat v1.5 (gzip). All programs were run with default options unless otherwise specified. The code to reproduce the findings of this paper is available at https://github.com/pachterlab/MBLGLMBHGP_2021/, kallisto is available at https://github.com/pachterlab/kallisto/ and bustools is available at https://github.com/BUStools/bustools/. Documentation and tutorials for using the kallisto bustools scRNA-seq workflow are available at http://pachterlab.github.io/kallistobustools.
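To give a sense of how the modular steps fit together, the sketch below assembles the command lines for a basic kallisto and bustools run: pseudoalignment, barcode correction, sorting and gene-level counting. It is a hedged outline rather than the authors' exact pipeline. The file names (`index.idx`, `10xv2_whitelist.txt`, `transcripts_to_genes.txt`) and the Python helper functions are hypothetical, and the flags follow the public getting-started tutorial as recalled here, so they should be checked against `kallisto bus --help` and `bustools` usage output for the installed versions.

```python
import shlex
from typing import List


def kallisto_bus_cmd(index: str, out_dir: str, technology: str,
                     fastqs: List[str], threads: int = 4) -> List[str]:
    """Build the pseudoalignment command that writes a BUS file to out_dir."""
    return ["kallisto", "bus", "-i", index, "-o", out_dir,
            "-x", technology, "-t", str(threads), *fastqs]


def bustools_cmds(out_dir: str, whitelist: str, t2g: str,
                  threads: int = 4) -> List[List[str]]:
    """Build the correct -> sort -> count commands applied to the BUS file."""
    bus = f"{out_dir}/output.bus"                  # written by kallisto bus
    corrected = f"{out_dir}/output.corrected.bus"  # hypothetical file name
    sorted_bus = f"{out_dir}/output.sorted.bus"    # hypothetical file name
    return [
        # Correct barcodes against the whitelist.
        ["bustools", "correct", "-w", whitelist, "-o", corrected, bus],
        # Sort records so identical barcode/UMI/set entries are adjacent.
        ["bustools", "sort", "-t", str(threads), "-o", sorted_bus, corrected],
        # Collapse UMIs and write a gene count matrix.
        ["bustools", "count", "-o", f"{out_dir}/genecount/genes",
         "-g", t2g, "-e", f"{out_dir}/matrix.ec",
         "-t", f"{out_dir}/transcripts.txt", "--genecounts", sorted_bus],
    ]


def render(cmd: List[str]) -> str:
    """Quote a command for display or for pasting into a shell."""
    return " ".join(shlex.quote(part) for part in cmd)


# Assumed inputs for illustration only.
pipeline = [
    kallisto_bus_cmd("index.idx", "out", "10xv2",
                     ["SRR8599150_S1_L001_R1_001.fastq.gz",
                      "SRR8599150_S1_L001_R2_001.fastq.gz"]),
    *bustools_cmds("out", "10xv2_whitelist.txt", "transcripts_to_genes.txt"),
]
for cmd in pipeline:
    print(render(cmd))
```

The commands are only printed, not executed; each list could be passed to `subprocess.run(cmd, check=True)` once the paths have been verified. Keeping every step as a separate command mirrors the modularity the paper emphasizes, since an individual stage can be swapped or extended without touching the others.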

Acknowledgements

We thank V. Ntranos and V. Svensson for helpful suggestions and comments. We thank J. Farrell for the D. rerio gene annotation used to process SRR6956073, J. Schiefelbein for the A. thaliana gene annotation used to process SRR8257100, J. Fear for the D. melanogaster gene annotation used to process SRR8513910, and J. Kim and Q. Zhu for the C. elegans gene annotation used to process SRR8611943. The benchmarking work was made possible, in part, thanks to support from the Beckman Institute Caltech Bioinformatics Resource Center. A.S.B. and L.P. were funded in part by NIH U19MH114830.

Author information

Author notes

  1. These authors contributed equally: Páll Melsted, A. Sina Booeshaghi.

Affiliations

  1. Faculty of Industrial Engineering, Mechanical Engineering and Computer Science, University of Iceland, Reykjavik, Iceland

    Páll Melsted

  2. Department of Mechanical Engineering, California Institute of Technology, Pasadena, CA, USA

    A. Sina Booeshaghi

  3. Department of Computing and Mathematical Sciences, California Institute of Technology, Pasadena, CA, USA

    Lauren Liu, Kristján Eldjárn Hjörleifsson & Lior Pachter

  4. Division of Biology and Biological Engineering, California Institute of Technology, Pasadena, CA, USA

    Fan Gao, Lambda Lu, Eduardo da Veiga Beltrame & Lior Pachter

  5. Bioinformatics Resource Center, Beckman Institute, California Institute of Technology, Pasadena, CA, USA

    Fan Gao

  6. Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, Cambridge, MA, USA

    Kyung Hoi (Joseph) Min

  7. Department of Genome Science, University of Washington, Seattle, WA, USA

    Jase Gehring

Contributions

P.M., A.S.B., L. Liu and L.P. developed the algorithms for bustools and P.M., A.S.B. and L. Liu wrote the software. A.S.B. conceived of and performed the UMI and barcode calculations motivating the algorithms. F.G. implemented and performed the benchmarking procedure, and curated indices for the datasets. A.S.B. and E.d.V.B. designed and produced the comparisons between Cell Ranger and kallisto bustools. L. Lu investigated in detail the performance of different workflows on the “10k mouse neuron” data and produced the analysis of that dataset. A.S.B. designed the RNA velocity workflow and performed the RNA velocity analyses. K.H.M. contributed to the development of the reproducible workflow. K.E.H. developed and investigated the effect of reference transcriptome sequences for pseudoalignment. J.G. interpreted results and helped to supervise the research. A.S.B. planned, organized and prepared figures. A.S.B., E.d.V.B., P.M. and L.P. planned the manuscript. A.S.B. and L.P. wrote the manuscript.

Corresponding author

Correspondence to Lior Pachter.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

About this article

Cite this article

Melsted, P., Booeshaghi, A.S., Liu, L. et al. Modular, efficient and constant-memory single-cell RNA-seq preprocessing. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00870-2
