Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID

Abstract

Because of the stochasticity associated with high-throughput single-cell sequencing, current methods for exploring cell-type diversity rely on clustering-based computational approaches in which heterogeneity is characterized at the cell-subpopulation level rather than at full single-cell resolution. Here we present Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell sequencing data. We applied Cell-ID to data from multiple human and mouse samples, including blood cells, pancreatic islets and airway, intestinal and olfactory epithelium, as well as to comprehensive mouse cell atlas datasets. We demonstrate that Cell-ID signatures are reproducible across different donors, tissues of origin, species and single-cell omics technologies, and can be used for automatic cell-type annotation and cell matching across datasets. Cell-ID improves biological interpretation at the individual-cell level, enabling the discovery of previously uncharacterized rare cell types or cell states. Cell-ID is distributed as an open-source R software package.

Data availability

All single-cell datasets used in this paper are publicly available (Supplementary Table 7). scRNA-seq datasets for human blood cells profiled by Cite-Seq (ref. 17) and Reap-Seq (ref. 18) were downloaded from the Gene Expression Omnibus (GEO; accession numbers GSE100866 and GSE100501, respectively). Cell-type labels for these two datasets were obtained following the Multimodal Analysis vignette of the Seurat (ref. 33) R package (https://satijalab.org/seurat/multimodal_vignette.html). Pancreas scRNA-seq datasets from Baron (ref. 20), Muraro (ref. 22) and Segerstolpe (ref. 21), as well as their associated cell-type annotations, were downloaded via the scRNAseq (ref. 59) R package as SingleCellExperiment R objects. The Plasschaert (ref. 23) mouse and human and Montoro (ref. 24) mouse airway epithelium scRNA-seq datasets and their annotations were downloaded from GEO (GSE102580 and GSE103354). The Haber (ref. 34) intestinal epithelium scRNA-seq dataset was downloaded from GEO (GSE92332). Olfactory epithelium scRNA-seq datasets from Fletcher (ref. 36) and Wu (ref. 35) were downloaded from GEO (GSE95601 and GSE120199), and their cell-type annotations were obtained from the associated online repositories: https://github.com/rufletch/p63-HBC-diff and https://www.stowers.org/research/publications/odr for Fletcher and Wu, respectively. Tabula Muris (ref. 39) 10X and Smart-seq mouse scRNA-seq datasets were downloaded from https://tabula-muris.ds.czbiohub.org/. Gene activity score matrices for the mouse sci-ATAC-seq atlas datasets from Cusanovich (ref. 40) were obtained from http://atlas.gs.washington.edu/mouse-atac/data/ as provided by the authors; these scores result from the aggregation of information across all differentially accessible chromatin sites linked to a target gene.
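For readers who want to retrieve the GEO-hosted datasets programmatically, the accession numbers listed above can be collected in a small lookup table. The sketch below is illustrative only: the grouping labels, the `GEO_ACCESSIONS` name and the `geo_url` helper are hypothetical, and the URL pattern is the public GEO accession viewer address, which is an assumption not stated in the text.

```python
import re

# Accession numbers exactly as listed in the Data availability statement.
# The dictionary keys are informal labels chosen for this sketch.
GEO_ACCESSIONS = {
    "Cite-Seq human blood": ["GSE100866"],
    "Reap-Seq human blood": ["GSE100501"],
    "Plasschaert and Montoro airway epithelium": ["GSE102580", "GSE103354"],
    "Haber intestinal epithelium": ["GSE92332"],
    "Fletcher and Wu olfactory epithelium": ["GSE95601", "GSE120199"],
}

# Assumed base address of the GEO accession viewer (not given in the text).
GEO_BASE = "https://www.ncbi.nlm.nih.gov/geo/query/acc.cgi?acc="


def geo_url(accession):
    """Return the GEO landing-page URL for a series accession such as 'GSE92332'."""
    if not re.fullmatch(r"GSE\d+", accession):
        raise ValueError(f"not a GEO series accession: {accession!r}")
    return GEO_BASE + accession


if __name__ == "__main__":
    for label, accessions in GEO_ACCESSIONS.items():
        for acc in accessions:
            print(f"{label}: {geo_url(acc)}")
```

The pancreas, Tabula Muris and sci-ATAC-seq datasets are not in this table because the statement points to other sources for them.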

Code availability

Cell-ID is implemented as an R package and is available on GitHub (https://github.com/RausellLab/CelliD) under the GPL-3 open-source license. Complete documentation is provided with step-by-step procedures for MCA dimensionality reduction, per-cell gene signature extraction, cell-type prediction, label transfer across datasets and functional enrichment analysis. A development version of the Cell-ID software is also available in Bioconductor (devel branch 3.13): https://bioconductor.org/packages/CelliD. In addition, R scripts to reproduce all figures in the paper are available in a dedicated GitHub repository (https://github.com/RausellLab/CellIDPaperScript).
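To make the cell-type prediction step more concrete, the following sketch scores one cell's gene signature against reference marker lists with a hypergeometric over-representation test and keeps the best-scoring label. It is not the package's code: the choice of test is an assumption (the text above does not name the statistic), the function names are hypothetical, and the gene lists and background size are toy values.

```python
from math import comb


def hypergeom_sf(k, population, successes, draws):
    """P(X >= k) for X ~ Hypergeometric(population, successes, draws)."""
    upper = min(successes, draws)
    total = comb(population, draws)
    hits = sum(
        comb(successes, i) * comb(population - successes, draws - i)
        for i in range(k, upper + 1)
    )
    return hits / total


def predict_cell_type(signature, marker_sets, n_background):
    """Score a per-cell signature against each marker set.

    Assumes every signature and marker gene belongs to a background of
    n_background genes. Returns the label with the smallest p-value and
    the full dictionary of p-values.
    """
    sig = set(signature)
    pvalues = {}
    for cell_type, markers in marker_sets.items():
        markers = set(markers)
        overlap = len(sig & markers)
        pvalues[cell_type] = hypergeom_sf(overlap, n_background, len(markers), len(sig))
    best = min(pvalues, key=pvalues.get)
    return best, pvalues


# Toy inputs for illustration only (not taken from the paper).
toy_signature = ["INS", "IAPP", "MAFA", "GCG", "ACTB"]
toy_markers = {
    "beta": ["INS", "IAPP", "MAFA", "PDX1"],
    "alpha": ["GCG", "ARX", "TTR", "IRX2"],
}
best, pvalues = predict_cell_type(toy_signature, toy_markers, n_background=1000)
print(best, pvalues)
```

In practice the p-values would be corrected for multiple testing across cell types, and a cell with no significant match would be left unassigned; both steps are omitted here for brevity.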

References

  1. Teichmann, S. et al. The Human Cell Atlas. eLife 6, e27041 (2017).

  2. National Institutes of Health. The Human BioMolecular Atlas Program: HuBMAP NIH Common Fund Program https://commonfund.nih.gov/HuBMAP (2021).

  3. The LifeTime Initiative LifeTime FET Flagship https://lifetime-fetflagship.eu/ (2021).

  4. Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 31 (2020).

  5. Sun, S., Zhu, J., Ma, Y. & Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20, 269 (2019).

  6. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).

  7. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).

  8. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).

  9. Greenacre, M. J. Theory and Applications of Correspondence Analysis (Academic Press, 1984).

  10. Greenacre, M. & Blasius, J. (eds). Multiple Correspondence Analysis and Related Methods (Chapman & Hall/CRC, 2006).

  11. Aşan, Z. & Greenacre, M. Biplots of fuzzy coded data. Fuzzy Set. Syst. 183, 57–71 (2011).

  12. Rausell, A., Juan, D., Pazos, F. & Valencia, A. Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc. Natl Acad. Sci. USA 107, 1995–2000 (2010).

  13. Gabriel, K. R. The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453–467 (1971).

  14. Greenacre, M. Biplots in Practice Ch. 8, 79–88 (Foundation BBVA, Rubes Editorial, 2010).

  15. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).

  16. Aran, D., Hu, Z. & Butte, A. J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220 (2017).

  17. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).

  18. Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).

  19. Zhang et al. SCINA: semi-supervised analysis of single cells in silico. Genes 10, 531 (2019).

  20. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems 3, 346–360 (2016).

  21. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).

  22. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Systems 3, 385–394.e3 (2016).

  23. Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).

  24. Montoro, D. T. et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature 560, 319–324 (2018).

  25. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359 (2018).

  26. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).

  27. De Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).

  28. Lieberman, Y., Rokach, L. & Shay, T. CaSTLe–classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).

  29. Boufea, K., Seth, S. & Batada, N. N. scID uses discriminant analysis to identify transcriptionally equivalent cell types across single-cell RNA-seq data with batch effect. iScience 23, 100914 (2020).

  30. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Systems 9, 207–213.e2 (2019).

  31. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. ScPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).

  32. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).

  33. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).

  34. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).

  35. Wu, Y. et al. A population of navigator neurons is essential for olfactory map formation during the critical period. Neuron 100, 1066–1082.e6 (2018).

  36. Fletcher, R. B. et al. Deconstructing olfactory stem cell trajectories at single-cell resolution. Cell Stem Cell 20, 817–830.e8 (2017).

  37. Ualiyeva, S. et al. Airway brush cells generate cysteinyl leukotrienes through the ATP sensor P2Y2. Science Immunol. 5, eaax7224 (2020).

  38. Bankova, L. G. et al. The cysteinyl leukotriene 3 receptor regulates expansion of IL-25–producing airway brush cells leading to type 2 inflammation. Science Immunol. 3, eaat9453 (2018).

  39. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).

  40. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).

  41. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).

  42. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).

  43. Liberzon, A. et al. The molecular signatures database hallmark gene set collection. Cell Systems 1, 417–425 (2015).

  44. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261 (2004).

  45. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).

  46. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, 457–462 (2015).

  47. Slenter, D. N. et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 46, D661–D667 (2018).

  48. Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).

  49. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Preprint at bioRxiv https://doi.org/10.1101/2020.10.12.335331 (2020).

  50. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).

  51. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).

  52. Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).

  53. Lebart, L., Morineau, A. & Warwick, K. M. Multivariate Descriptive Statistical Analysis. Correspondence Analysis and Related Techniques for Large Matrices (John Wiley & Sons, 1984).

  54. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. Royal Stat. Soc. B. (Methodological) 57, 289–300 (1995).

  55. Pagès, J. Multiple Factor Analysis by Example Using R (CRC Press, 2014).

  56. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).

  57. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 14, 128 (2013).

  58. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).

  59. Risso, D. & Cole, M. scRNAseq: Collection of public single-cell RNA-Seq datasets. R package v.2.4.0 http://bioconductor.org/packages/scRNAseq/ (Bioconductor, 2020).

Acknowledgements

We thank the Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis for helpful discussions and support. The Laboratory of Clinical Bioinformatics was partly supported by the French National Research Agency (ANR) ‘Investissements d’Avenir’ Program (grant no. ANR-10-IAHU-01). The Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis were partly supported by Christian Dior Couture, Dior. We also thank G. Fuentes from The Visual Thinker LLP for the creation of the illustrations in Figs. 1–4.

Author information

Affiliations

  1. Clinical Bioinformatics Laboratory, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France

    Akira Cortal, Loredana Martignetti & Antonio Rausell

  2. Laboratory of Human Lymphohematopoiesis, Université de Paris, INSERM UMR1163, Imagine Institute, Paris, France

    Emmanuelle Six

  3. Molecular Genetics Service, AP-HP, Necker Hospital for Sick Children, Paris, France

    Antonio Rausell

Contributions

A.C. and A.R. conceived and designed the research. A.C. performed the research. A.C. and L.M. contributed materials and analysis tools. A.C. and A.R. analyzed data. A.C., E.S. and A.R. interpreted results. A.C., E.S. and A.R. wrote the paper. All authors read and approved the final draft of the paper.

Corresponding author

Correspondence to Antonio Rausell.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cortal, A., Martignetti, L., Six, E. et al. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat. Biotechnol. (2021). https://doi.org/10.1038/s41587-021-00896-6