Because of the stochasticity associated with high-throughput single-cell sequencing, current methods for exploring cell-type diversity rely on clustering-based computational approaches in which heterogeneity is characterized at the level of cell subpopulations rather than at full single-cell resolution. Here we present Cell-ID, a clustering-free multivariate statistical method for the robust extraction of per-cell gene signatures from single-cell sequencing data. We applied Cell-ID to data from multiple human and mouse samples, including blood cells, pancreatic islets, and airway, intestinal and olfactory epithelium, as well as to comprehensive mouse cell atlas datasets. We demonstrate that Cell-ID signatures are reproducible across different donors, tissues of origin, species and single-cell omics technologies, and can be used for automatic cell-type annotation and cell matching across datasets. Cell-ID improves biological interpretation at the individual-cell level, enabling the discovery of previously uncharacterized rare cell types or cell states. Cell-ID is distributed as an open-source R software package.
All single-cell datasets used in this paper are publicly available (Supplementary Table 7). scRNA-seq datasets for human blood cells profiled by CITE-seq (ref. 17) and REAP-seq (ref. 18) were downloaded from the Gene Expression Omnibus (GEO; accession numbers GSE100866 and GSE100501, respectively). Cell-type labels for these two datasets were obtained by following the Multimodal Analysis vignette of the Seurat R package (ref. 33; https://satijalab.org/seurat/multimodal_vignette.html). Pancreas scRNA-seq datasets from Baron (ref. 20), Muraro (ref. 22) and Segerstolpe (ref. 21), together with their associated cell-type annotations, were downloaded via the scRNAseq R package (ref. 59) as SingleCellExperiment R objects. The Plasschaert (ref. 23) mouse and human and Montoro (ref. 24) mouse airway epithelium scRNA-seq datasets and their annotations were downloaded from GEO (GSE102580 and GSE103354). The Haber (ref. 34) intestinal epithelium scRNA-seq dataset was downloaded from GEO (GSE92332). Olfactory epithelium scRNA-seq datasets from Fletcher (ref. 36) and Wu (ref. 35) were downloaded from GEO (GSE95601 and GSE120199), and their cell-type annotations were obtained from the associated online repositories: https://github.com/rufletch/p63-HBC-diff for Fletcher and https://www.stowers.org/research/publications/odr for Wu. Tabula Muris (ref. 39) 10X and Smart-seq mouse scRNA-seq datasets were downloaded from https://tabula-muris.ds.czbiohub.org/. Gene activity score matrices from the mouse sci-ATAC-seq atlas of Cusanovich (ref. 40) were obtained from http://atlas.gs.washington.edu/mouse-atac/data/ as provided by the authors; these scores result from aggregating information across all differentially accessible chromatin sites linked to a target gene.
Cell-ID is implemented as an R package and is available on GitHub (https://github.com/RausellLab/CelliD) under the GPL-3 open-source license. Complete documentation is provided, with step-by-step procedures for MCA dimensionality reduction, per-cell gene signature extraction, cell-type prediction, label transfer across datasets and functional enrichment analysis. A development version of the Cell-ID software is also available on Bioconductor (devel branch 3.13): https://bioconductor.org/packages/CelliD. In addition, R scripts to reproduce all figures in the paper are available in a dedicated GitHub repository (https://github.com/RausellLab/CellIDPaperScript).
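To make the signature extraction and cell-type prediction steps more concrete, the following is a minimal Python sketch of the general idea, not the CelliD R API. It assumes that cells and genes have already been placed in a shared low-dimensional space (in the package this would come from the MCA step; here the 2D coordinates are hard-coded toy values). It takes the genes nearest to a cell as that cell's signature and scores the signature against marker lists with a hypergeometric test. All function names, the toy coordinates, the marker lists and the `alpha` threshold are hypothetical, and no multiple-testing correction is applied.

```python
from math import comb, dist


def per_cell_signature(cell_coords, gene_coords, n_genes):
    """Return the n_genes genes closest (Euclidean) to the cell in the shared space."""
    ranked = sorted(gene_coords, key=lambda g: dist(cell_coords, gene_coords[g]))
    return ranked[:n_genes]


def hypergeom_pvalue(overlap, signature_size, marker_size, universe_size):
    """P(X >= overlap) for a signature drawn without replacement from the gene universe."""
    upper = min(signature_size, marker_size)
    total = comb(universe_size, signature_size)
    tail = sum(
        comb(marker_size, k) * comb(universe_size - marker_size, signature_size - k)
        for k in range(overlap, upper + 1)
    )
    return tail / total


def predict_cell_type(signature, marker_sets, universe_size, alpha=0.05):
    """Pick the marker set with the smallest p-value, or 'unassigned' if none passes alpha."""
    pvals = {
        cell_type: hypergeom_pvalue(
            len(set(signature) & set(markers)),
            len(signature),
            len(markers),
            universe_size,
        )
        for cell_type, markers in marker_sets.items()
    }
    best = min(pvals, key=pvals.get)
    label = best if pvals[best] < alpha else "unassigned"
    return label, pvals


# Toy, hard-coded coordinates (assumption: stand-ins for MCA output).
gene_coords = {
    "CD3E": (1.0, 0.1), "CD3D": (0.9, 0.2), "IL7R": (0.8, 0.0),
    "MS4A1": (-1.0, 0.5), "CD79A": (-0.9, 0.6),
    "LYZ": (0.0, -1.0), "CD14": (0.1, -0.9), "GNLY": (0.5, 0.9),
}
# Hypothetical marker lists, for illustration only.
marker_sets = {
    "T cell": ["CD3E", "CD3D", "IL7R"],
    "B cell": ["MS4A1", "CD79A"],
    "Monocyte": ["LYZ", "CD14"],
}

cell = (0.95, 0.1)
signature = per_cell_signature(cell, gene_coords, n_genes=3)
label, pvals = predict_cell_type(signature, marker_sets, universe_size=len(gene_coords))
print(signature, label)
```

Because each cell is scored on its own signature, no clustering step is involved. With a realistic gene universe of thousands of genes, the p-values would be far smaller than in this eight-gene toy example, and a multiple-testing correction across cells and marker sets would be advisable.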
1. Teichmann, S. et al. The Human Cell Atlas. eLife 6, e27041 (2017).
2. National Institutes of Health. The Human BioMolecular Atlas Program: HuBMAP NIH Common Fund Program https://commonfund.nih.gov/HuBMAP (2021).
3. The LifeTime Initiative LifeTime FET Flagship https://lifetime-fetflagship.eu/ (2021).
4. Lähnemann, D. et al. Eleven grand challenges in single-cell data science. Genome Biol. 21, 31 (2020).
5. Sun, S., Zhu, J., Ma, Y. & Zhou, X. Accuracy, robustness and scalability of dimensionality reduction methods for single-cell RNA-seq analysis. Genome Biol. 20, 269 (2019).
6. Becht, E. et al. Dimensionality reduction for visualizing single-cell data using UMAP. Nat. Biotechnol. 37, 38–44 (2019).
7. Duò, A., Robinson, M. D. & Soneson, C. A systematic performance evaluation of clustering methods for single-cell RNA-seq data. F1000Res. 7, 1141 (2018).
8. Kiselev, V. Y., Andrews, T. S. & Hemberg, M. Challenges in unsupervised clustering of single-cell RNA-seq data. Nat. Rev. Genet. 20, 273–282 (2019).
9. Greenacre, M. J. Theory and Applications of Correspondence Analysis (Academic Press, 1984).
10. Greenacre, M. & Blasius, J. (eds). Multiple Correspondence Analysis and Related Methods (Chapman & Hall/CRC, 2006).
11. Aşan, Z. & Greenacre, M. Biplots of fuzzy coded data. Fuzzy Set. Syst. 183, 57–71 (2011).
12. Rausell, A., Juan, D., Pazos, F. & Valencia, A. Protein interactions and ligand binding: from protein subfamilies to functional specificity. Proc. Natl Acad. Sci. USA 107, 1995–2000 (2010).
13. Gabriel, K. R. The biplot graphic display of matrices with application to principal component analysis. Biometrika 58, 453–467 (1971).
14. Greenacre, M. Biplots in Practice Ch. 8, 79–88 (Foundation BBVA, Rubes Editorial, 2010).
15. Aibar, S. et al. SCENIC: single-cell regulatory network inference and clustering. Nat. Methods 14, 1083–1086 (2017).
16. Aran, D., Hu, Z. & Butte, A. J. xCell: digitally portraying the tissue cellular heterogeneity landscape. Genome Biol. 18, 220 (2017).
17. Stoeckius, M. et al. Simultaneous epitope and transcriptome measurement in single cells. Nat. Methods 14, 865–868 (2017).
18. Peterson, V. M. et al. Multiplexed quantification of proteins and transcripts in single cells. Nat. Biotechnol. 35, 936–939 (2017).
19. Zhang, Z. et al. SCINA: semi-supervised analysis of single cells in silico. Genes 10, 531 (2019).
20. Baron, M. et al. A single-cell transcriptomic map of the human and mouse pancreas reveals inter- and intra-cell population structure. Cell Systems 3, 346–360 (2016).
21. Segerstolpe, Å. et al. Single-cell transcriptome profiling of human pancreatic islets in health and type 2 diabetes. Cell Metab. 24, 593–607 (2016).
22. Muraro, M. J. et al. A single-cell transcriptome atlas of the human pancreas. Cell Systems 3, 385–394.e3 (2016).
23. Plasschaert, L. W. et al. A single-cell atlas of the airway epithelium reveals the CFTR-rich pulmonary ionocyte. Nature 560, 377–381 (2018).
24. Montoro, D. T. et al. A revised airway epithelial hierarchy includes CFTR-expressing ionocytes. Nature 560, 319–324 (2018).
25. Kiselev, V. Y., Yiu, A. & Hemberg, M. scmap: projection of single-cell RNA-seq data across data sets. Nat. Methods 15, 359–362 (2018).
26. Haghverdi, L., Lun, A. T. L., Morgan, M. D. & Marioni, J. C. Batch effects in single-cell RNA-sequencing data are corrected by matching mutual nearest neighbors. Nat. Biotechnol. 36, 421–427 (2018).
27. De Kanter, J. K., Lijnzaad, P., Candelli, T., Margaritis, T. & Holstege, F. C. P. CHETAH: a selective, hierarchical cell type identification method for single-cell RNA sequencing. Nucleic Acids Res. 47, e95 (2019).
28. Lieberman, Y., Rokach, L. & Shay, T. CaSTLe – classification of single cells by transfer learning: harnessing the power of publicly available single cell RNA sequencing experiments to annotate new experiments. PLoS ONE 13, e0205499 (2018).
29. Boufea, K., Seth, S. & Batada, N. N. scID uses discriminant analysis to identify transcriptionally equivalent cell types across single-cell RNA-seq data with batch effect. iScience 23, 100914 (2020).
30. Tan, Y. & Cahan, P. SingleCellNet: a computational tool to classify single cell RNA-seq data across platforms and across species. Cell Systems 9, 207–213.e2 (2019).
31. Alquicira-Hernandez, J., Sathe, A., Ji, H. P., Nguyen, Q. & Powell, J. E. scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data. Genome Biol. 20, 264 (2019).
32. Aran, D. et al. Reference-based analysis of lung single-cell sequencing reveals a transitional profibrotic macrophage. Nat. Immunol. 20, 163–172 (2019).
33. Stuart, T. et al. Comprehensive integration of single-cell data. Cell 177, 1888–1902.e21 (2019).
34. Haber, A. L. et al. A single-cell survey of the small intestinal epithelium. Nature 551, 333–339 (2017).
35. Wu, Y. et al. A population of navigator neurons is essential for olfactory map formation during the critical period. Neuron 100, 1066–1082.e6 (2018).
36. Fletcher, R. B. et al. Deconstructing olfactory stem cell trajectories at single-cell resolution. Cell Stem Cell 20, 817–830.e8 (2017).
37. Ualiyeva, S. et al. Airway brush cells generate cysteinyl leukotrienes through the ATP sensor P2Y2. Science Immunol. 5, eaax7224 (2020).
38. Bankova, L. G. et al. The cysteinyl leukotriene 3 receptor regulates expansion of IL-25–producing airway brush cells leading to type 2 inflammation. Science Immunol. 3, eaat9453 (2018).
39. Schaum, N. et al. Single-cell transcriptomics of 20 mouse organs creates a Tabula Muris. Nature 562, 367–372 (2018).
40. Cusanovich, D. A. et al. A single-cell atlas of in vivo mammalian chromatin accessibility. Cell 174, 1309–1324.e18 (2018).
41. Franzén, O., Gan, L.-M. & Björkegren, J. L. M. PanglaoDB: a web server for exploration of mouse and human single-cell RNA sequencing data. Database 2019, baz046 (2019).
42. Zhang, X. et al. CellMarker: a manually curated resource of cell markers in human and mouse. Nucleic Acids Res. 47, D721–D728 (2019).
43. Liberzon, A. et al. The Molecular Signatures Database hallmark gene set collection. Cell Systems 1, 417–425 (2015).
44. Gene Ontology Consortium. The Gene Ontology (GO) database and informatics resource. Nucleic Acids Res. 32, D258–D261 (2004).
45. Jassal, B. et al. The reactome pathway knowledgebase. Nucleic Acids Res. 48, D498–D503 (2020).
46. Kanehisa, M., Sato, Y., Kawashima, M., Furumichi, M. & Tanabe, M. KEGG as a reference resource for gene and protein annotation. Nucleic Acids Res. 44, D457–D462 (2016).
47. Slenter, D. N. et al. WikiPathways: a multifaceted pathway database bridging metabolomics to other omics research. Nucleic Acids Res. 46, D661–D667 (2018).
48. Efremova, M. & Teichmann, S. A. Computational methods for single-cell omics across modalities. Nat. Methods 17, 14–17 (2020).
49. Hao, Y. et al. Integrated analysis of multimodal single-cell data. Preprint at bioRxiv https://doi.org/10.1101/2020.10.12.335331 (2020).
50. Argelaguet, R. et al. MOFA+: a statistical framework for comprehensive integration of multi-modal single-cell data. Genome Biol. 21, 111 (2020).
51. Zerbino, D. R. et al. Ensembl 2018. Nucleic Acids Res. 46, D754–D761 (2018).
52. Durinck, S., Spellman, P. T., Birney, E. & Huber, W. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat. Protoc. 4, 1184–1191 (2009).
53. Lebart, L., Morineau, A. & Warwick, K. M. Multivariate Descriptive Statistical Analysis: Correspondence Analysis and Related Techniques for Large Matrices (John Wiley & Sons, 1984).
54. Benjamini, Y. & Hochberg, Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. J. R. Stat. Soc. B (Methodological) 57, 289–300 (1995).
55. Pagès, J. Multiple Factor Analysis by Example Using R (CRC Press, 2014).
56. Zappia, L., Phipson, B. & Oshlack, A. Splatter: simulation of single-cell RNA sequencing data. Genome Biol. 18, 174 (2017).
57. Chen, E. Y. et al. Enrichr: interactive and collaborative HTML5 gene list enrichment analysis tool. BMC Bioinform. 14, 128 (2013).
58. Kobak, D. & Berens, P. The art of using t-SNE for single-cell transcriptomics. Nat. Commun. 10, 5416 (2019).
59. Risso, D. & Cole, M. scRNAseq: Collection of public single-cell RNA-Seq datasets. R package v.2.4.0 http://bioconductor.org/packages/scRNAseq/ (Bioconductor, 2020).
We thank the Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis for helpful discussions and support. The Laboratory of Clinical Bioinformatics was partly supported by the French National Research Agency (ANR) ‘Investissements d’Avenir’ Program (grant no. ANR-10-IAHU-01). The Laboratory of Clinical Bioinformatics and the Laboratory of Human Lymphohematopoiesis were partly supported by Christian Dior Couture, Dior. We also thank G. Fuentes from The Visual Thinker LLP for the creation of the illustrations in Figs. 1–4.
The authors declare no competing interests.
Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Cortal, A., Martignetti, L., Six, E. et al. Gene signature extraction and cell identity recognition at the single-cell level with Cell-ID. Nat. Biotechnol. (2021). https://doi.org/10.1038/s41587-021-00896-6