Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra

Abstract

Metabolomics using nontargeted tandem mass spectrometry can detect thousands of molecules in a biological sample. However, structural molecule annotation is limited to structures present in libraries or databases, restricting analysis and interpretation of experimental data. Here we describe CANOPUS (class assignment and ontology prediction using mass spectrometry), a computational tool for systematic compound class annotation. CANOPUS uses a deep neural network to predict 2,497 compound classes from fragmentation spectra, including all biologically relevant classes. CANOPUS explicitly targets compounds for which neither spectral nor structural reference data are available and predicts classes lacking tandem mass spectrometry training data. In evaluation using reference data, CANOPUS reached very high prediction performance (average accuracy of 99.7% in cross-validation) and outperformed four baseline methods. We demonstrate the broad utility of CANOPUS in three applications: investigating the effect of microbial colonization in the mouse digestive system, analyzing the chemodiversity of different Euphorbia plants and supporting the discovery of a marine natural product, each revealing biological insights at the compound class level.

Data availability

Input mzML/mzXML files are available at MassIVE (https://massive.ucsd.edu/) with the accession nos. MSV000079949 (mice data) and MSV000081082 (Euphorbia data). The mass spectrometry data for Rivularia sp. cyanobacteria were deposited at MassIVE (accession no. MSV000085578). The spectra for rivulariapeptolide 1155 were annotated in the GNPS spectral library (accession nos. CCMSLIB00005723986 and CCMSLIB00005723388). The structure database with ClassyFire annotations, the publicly available part of the evaluation data and the Cytoscape files for network visualization can be downloaded from https://bio.informatik.uni-jena.de/data/ and https://doi.org/10.6084/m9.figshare.13073051. Source data are provided with this paper.

Code availability

CANOPUS is part of SIRIUS software and can be downloaded from https://bio.informatik.uni-jena.de/software/canopus/. The source code of CANOPUS is available at https://github.com/boecker-lab/sirius-libs. The scripts for analysis and visualization of CANOPUS results are available at https://github.com/kaibioinfo/canopus_treemap.

References

  1. Horai, H. et al. MassBank: a public repository for sharing mass spectral data for life sciences. J. Mass Spectrom. 45, 703–714 (2010).
  2. Wang, M. et al. Sharing and community curation of mass spectrometry data with Global Natural Products Social Molecular Networking. Nat. Biotechnol. 34, 828–837 (2016).
  3. Guijas, C. et al. METLIN: a technology platform for identifying knowns and unknowns. Anal. Chem. 90, 3156–3164 (2018).
  4. Kind, T. et al. Identification of small molecules using accurate mass MS/MS search. Mass Spectrom. Rev. 37, 513–532 (2018).
  5. Allen, F., Greiner, R. & Wishart, D. Competitive fragmentation modeling of ESI–MS/MS spectra for putative metabolite identification. Metabolomics 11, 98–110 (2015).
  6. Brouard, C. et al. Fast metabolite identification with Input Output Kernel Regression. Bioinformatics 32, i28–i36 (2016).
  7. Dührkop, K., Shen, H., Meusel, M., Rousu, J. & Böcker, S. Searching molecular structure databases with tandem mass spectra using CSI:FingerID. Proc. Natl Acad. Sci. USA 112, 12580–12585 (2015).
  8. Ridder, L. et al. Automatic chemical structure annotation of an LC-MSn based metabolic profile from green tea. Anal. Chem. 85, 6033–6040 (2013).
  9. Ruttkies, C., Schymanski, E. L., Wolf, S., Hollender, J. & Neumann, S. MetFrag relaunched: incorporating strategies beyond in silico fragmentation. J. Cheminform. 8, 3 (2016).
  10. Tsugawa, H. et al. Hydrogen rearrangement rules: computational MS/MS fragmentation and structure elucidation using MS-FINDER Software. Anal. Chem. 88, 7946–7958 (2016).
  11. Schymanski, E. L. et al. Critical assessment of small molecule identification 2016: automated methods. J. Cheminform. 9, 22 (2017).
  12. Blaženović, I., Kind, T., Ji, J. & Fiehn, O. Software tools and approaches for compound identification of LC-MS/MS data in metabolomics. Metabolites 8, 31 (2018).
  13. Silva, R. R., Dorrestein, P. C. & Quinn, R. A. Illuminating the dark matter in metabolomics. Proc. Natl Acad. Sci. USA 112, 12549–12550 (2015).
  14. Tsugawa, H. Advances in computational metabolomics and databases deepen the understanding of metabolisms. Curr. Opin. Biotechnol. 54, 10–17 (2018).
  15. Montenegro-Burke, J. R., Guijas, C. & Siuzdak, G. METLIN: a tandem mass spectral library of standards. Methods Mol. Biol. 2104, 149–163 (2020).
  16. Vinaixa, M. et al. Mass spectral databases for LC/MS- and GC/MS-based metabolomics: state of the field and future prospects. Trends Anal. Chem. 78, 23–35 (2016).
  17. Aksenov, A. A., Silva, R., Knight, R., Lopes, N. P. & Dorrestein, P. C. Global chemical analysis of biology by mass spectrometry. Nat. Rev. Chem. 1, 0054 (2017).
  18. Frainay, C. et al. Mind the gap: mapping mass spectral databases in genome-scale metabolic networks reveals poorly covered areas. Metabolites 8, 51 (2018).
  19. Venkataraghavan, R., McLafferty, F. W. & Lear, G. E. Computer-aided interpretation of mass spectra. Org. Mass Spectrom. 2, 1–15 (1969).
  20. Curry, B. & Rumelhart, D. E. MSnet: a neural network that classifies mass spectra. Tetrahedron Comput. Methodol. 3, 213–237 (1990).
  21. Werther, W., Lohninger, H., Stancl, F. & Varmuza, K. Classification of mass spectra: a comparison of yes/no classification methods for the recognition of simple structural properties. Chemom. Intell. Lab. Syst. 22, 63–76 (1994).
  22. Heinonen, M., Shen, H., Zamboni, N. & Rousu, J. Metabolite identification and molecular fingerprint prediction via machine learning. Bioinformatics 28, 2333–2341 (2012).
  23. Hastings, J. et al. The ChEBI reference database and ontology for biologically relevant chemistry: enhancements for 2013. Nucleic Acids Res. 41, D456–D463 (2013).
  24. Rogers, F. B. Communications to the editor. Bull. Med. Libr. Assoc. 51, 114–116 (1963).
  25. Djoumbou Feunang, Y. et al. ClassyFire: automated chemical classification with a comprehensive, computable taxonomy. J. Cheminform. 8, 61 (2016).
  26. Blaženović, I. et al. Structure annotation of all mass spectra in untargeted metabolomics. Anal. Chem. 91, 2155–2162 (2019).
  27. Ernst, M. et al. Assessing specialized metabolite diversity in the cosmopolitan plant genus Euphorbia L. Front. Plant Sci. 10, 846 (2019).
  28. Tsugawa, H. et al. A cheminformatics approach to characterize metabolomes in stable-isotope-labeled organisms. Nat. Methods 16, 295–298 (2019).
  29. Barupal, D. K. & Fiehn, O. Chemical Similarity Enrichment Analysis (ChemRICH) as alternative to biochemical pathway mapping for metabolomic datasets. Sci. Rep. 7, 14567 (2017).
  30. Rasche, F. et al. Identifying the unknowns by aligning fragmentation trees. Anal. Chem. 84, 3417–3426 (2012).
  31. Treutler, H. et al. Discovering regulated metabolite families in untargeted metabolomics studies. Anal. Chem. 88, 8082–8090 (2016).
  32. Ernst, M. et al. MolNetEnhancer: enhanced molecular networks by integrating metabolome mining and annotation tools. Metabolites 9, 144 (2019).
  33. Lowry, S. R. et al. Comparison of various K-nearest neighbor voting schemes with the self-training interpretive and retrieval system for identifying molecular substructures from mass spectral data. Anal. Chem. 49, 1720–1722 (1977).
  34. Askenazi, M. & Linial, M. ARISTO: ontological classification of small molecules by electron ionization-mass spectrometry. Nucleic Acids Res. 39, W505–W510 (2011).
  35. Peters, K. et al. Chemical diversity and classification of secondary metabolites in nine bryophyte species. Metabolites 9, 222 (2019).
  36. LeCun, Y., Bengio, Y. & Hinton, G. Deep learning. Nature 521, 436–444 (2015).
  37. Matthews, B. W. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochim. Biophys. Acta 405, 442–451 (1975).
  38. Wolf, S., Schmidt, S., Müller-Hannemann, M. & Neumann, S. In silico fragmentation for computer assisted identification of metabolite mass spectra. BMC Bioinformatics 11, 148 (2010).
  39. Dührkop, K. et al. SIRIUS 4: a rapid tool for turning tandem mass spectra into metabolite structure information. Nat. Methods 16, 299–302 (2019).
  40. Cooper, B. T. et al. Hybrid search: a method for identifying metabolites absent from tandem mass spectrometry libraries. Anal. Chem. 91, 13924–13932 (2019).
  41. Allard, P.-M. et al. Integration of molecular networking and in-silico MS/MS fragmentation for natural products dereplication. Anal. Chem. 88, 3317–3323 (2016).
  42. Silva, R. R. et al. Propagating annotations of molecular networks using in silico fragmentation. PLoS Comput. Biol. 14, e1006089 (2018).
  43. Fox Ramos, A. E. et al. CANPA: computer-assisted natural products anticipation. Anal. Chem. 91, 11247–11252 (2019).
  44. Quinn, R. A. et al. Global chemical effects of the microbiome include new bile-acid conjugations. Nature 579, 123–129 (2020).
  45. Minamida, K. et al. Production of equol from daidzein by Gram-positive rod-shaped bacterium isolated from rat intestine. J. Biosci. Bioeng. 102, 247–250 (2006).
  46. Quinn, R. A. et al. Molecular networking as a drug discovery, drug metabolism, and precision medicine strategy. Trends Pharmacol. Sci. 38, 143–154 (2017).
  47. Shannon, P. et al. Cytoscape: a software environment for integrated models of biomolecular interaction networks. Genome Res. 13, 2498–2504 (2003).
  48. Hooft, J. J. J., Wandy, J., Barrett, M. P., Burgess, K. E. V. & Rogers, S. Topic modeling for untargeted substructure exploration in metabolomics. Proc. Natl Acad. Sci. USA 113, 13738–13743 (2016).
  49. Vasas, A. & Hohmann, J. Euphorbia diterpenes: isolation, structure, biological activity, and synthesis (2008–2012). Chem. Rev. 114, 8579–8612 (2014).
  50. Yang, M. et al. Studies on the fragmentation pathways of ingenol esters isolated from Euphorbia esula using IT-MSn and Q-TOF-MS/MS methods in electrospray ionization mode. Int. J. Mass Spectrom. 323–324, 55–62 (2012).
  51. Riina, R. et al. A worldwide molecular phylogeny and classification of the leafy spurges, Euphorbia subgenus Esula (Euphorbiaceae). TAXON 62, 316–342 (2013).
  52. Horn, J. W. et al. Phylogenetics and the evolution of major structural characters in the giant genus Euphorbia L. (Euphorbiaceae). Mol. Phylogenet. Evol. 63, 305–326 (2012).
  53. Horn, J. W. et al. Evolutionary bursts in Euphorbia (Euphorbiaceae) are linked with photosynthetic pathway. Evolution 68, 3485–3504 (2014).
  54. Peirson, J. A., Bruyns, P. V., Riina, R., Morawetz, J. J. & Berry, P. E. A molecular phylogeny and classification of the largely succulent and mainly African Euphorbia subg. Athymalus (Euphorbiaceae). TAXON 62, 1178–1199 (2013).
  55. Dorsey, B. L. et al. Phylogenetics, morphological evolution, and classification of Euphorbia subgenus Euphorbia. TAXON 62, 291–315 (2013).
  56. Yang, Y. et al. Molecular phylogenetics and classification of Euphorbia subgenus Chamaesyce (Euphorbiaceae). TAXON 61, 764–789 (2012).
  57. Pluskal, T., Castillo, S., Villar-Briones, A. & Oresic, M. MZmine 2: modular framework for processing, visualizing, and analyzing mass spectrometry-based molecular profile data. BMC Bioinformatics 11, 395 (2010).
  58. Nothias, L.-F. et al. Feature-based molecular networking in the GNPS analysis environment. Nat. Methods 17, 905–908 (2020).
  59. Schmid, R. et al. Ion identity molecular networking in the GNPS Environment. Preprint at bioRxiv https://doi.org/10.1101/2020.05.11.088948 (2020).
  60. Röst, H. L. et al. OpenMS: a flexible open-source software platform for mass spectrometry data analysis. Nat. Methods 13, 741–748 (2016).
  61. Benton, H. P., Wong, D. M., Trauger, S. A. & Siuzdak, G. XCMS2: processing tandem mass spectrometry data for metabolite identification and structural characterization. Anal. Chem. 80, 6382–6389 (2008).
  62. Shinbo, Y. et al. in Plant Metabolomics Vol. 57 (eds Saito, K. et al.) 165–181 (Springer, 2006).
  63. Wishart, D. S. et al. HMDB 4.0: the human metabolome database for 2018. Nucleic Acids Res. 46, D608–D617 (2018).
  64. Kanehisa, M., Goto, S., Kawashima, S. & Nakaya, A. The KEGG databases at GenomeNet. Nucleic Acids Res. 30, 42–46 (2002).
  65. Bobach, C., Böhme, T., Laube, U., Püschel, A. & Weber, L. Automated compound classification using a chemical ontology. J. Cheminform. 4, 40 (2012).
  66. Klekota, J. & Roth, F. P. Chemical substructures that enrich for biological activity. Bioinformatics 24, 2518–2525 (2008).
  67. Rogers, D. & Hahn, M. Extended-connectivity fingerprints. J. Chem. Inf. Model. 50, 742–754 (2010).
  68. Willighagen, E. L. et al. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminform. 9, 33 (2017).
  69. Hähnke, V. D., Kim, S. & Bolton, E. E. PubChem chemical structure standardization. J. Cheminform. 10, 36 (2018).
  70. Rogers, D. J. & Tanimoto, T. T. A computer program for classifying plants. Science 132, 1115–1118 (1960).
  71. Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I. & Salakhutdinov, R. Dropout: a simple way to prevent neural networks from overfitting. J. Mach. Learn. Res. 15, 1929–1958 (2014).
  72. Kingma, D. P. & Ba, J. Adam: a method for stochastic optimization. Preprint at https://arxiv.org/abs/1412.6980 (2014).
  73. Abadi, M. N. et al. in 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI 16) (eds Keeton, K. & Roscoe, T.) 265–283 (USENIX, 2016).
  74. Platt, J. C. Advances in Large Margin Classifiers (MIT Press, 2000).
  75. Böcker, S. & Dührkop, K. Fragmentation trees reloaded. J. Cheminform. 8, 5 (2016).
  76. Ludwig, M. et al. Database-independent molecular formula annotation using Gibbs sampling through ZODIAC. Nat. Mach. Intell. 2, 629–641 (2020).
  77. Moorthy, A. S., Wallace, W. E., Kearsley, A. J., Tchekhovskoi, D. V. & Stein, S. E. Combining fragment-ion and neutral-loss matching during mass spectral library searching: a new general purpose algorithm applicable to illicit drug identification. Anal. Chem. 89, 13261–13268 (2017).
  78. Mann, H. B. & Whitney, D. R. On a test of whether one of two random variables is stochastically larger than the other. Ann. Math. Stat. 18, 50–60 (1947).
  79. Virtanen, P. et al. SciPy 1.0: fundamental algorithms for scientific computing in Python. Nat. Methods 17, 261–272 (2020).

Acknowledgements

We thank Deutsche Forschungsgemeinschaft for providing financial support (no. BO 1910/20 to S.B., K.D. and M.L. and no. PE 2600/1 to D.P.), and the Academy of Finland (no. 310107/MACOME to J.R.). P.C.D., R.R. and W.H.G. were supported by the Gordon and Betty Moore Foundation (no. GBMF7622) and by the US National Institutes of Health (NIH; no. R01 GM107550). P.C.D. was supported by NIH grants nos. P41 GM103484 and R03 CA211211. L.-F.N. was supported by NIH grant no. R01 GM107550 and by the European Union’s Horizon 2020 program (MSCA-GF, no. 704786). We thank F. Kuhlmann and Agilent Technologies, Inc. for providing data used in the evaluation of CANOPUS. We thank Y. Djoumbou Feunang, D. Arndt and D. Wishart for providing ClassyFire annotations for a database of molecular structures. We thank K. Alexander, E. Caro-Diaz and B. Naman for assistance with the collection of Rivularia sp. Further, we thank S. Whitner and K. Joosten for 16S ribosomal DNA analysis. We thank M. Ernst for valuable discussions on the Euphorbia plant study, and J. van der Hooft and S. Rogers for feedback on the manuscript.

Author information

Affiliations

  1. Chair for Bioinformatics, Friedrich-Schiller University, Jena, Germany

    Kai Dührkop, Markus Fleischauer, Marcus Ludwig, Martin A. Hoffmann & Sebastian Böcker

  2. Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA

    Louis-Félix Nothias, Daniel Petras & Pieter C. Dorrestein

  3. Center for Marine Biotechnology and Biomedicine, Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, USA

    Raphael Reher & William H. Gerwick

  4. International Max Planck Research School ‘Exploration of Ecological Interactions with Molecular and Chemical Techniques’, Max Planck Institute for Chemical Ecology, Jena, Germany

    Martin A. Hoffmann

  5. Scripps Institution of Oceanography, University of California, San Diego, La Jolla, CA, USA

    Daniel Petras

  6. Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA

    William H. Gerwick

  7. Helsinki Institute for Information Technology, Department of Computer Science, Aalto University, Espoo, Finland

    Juho Rousu

Contributions

K.D., J.R. and S.B. designed the research. K.D. and S.B. developed the computational method. K.D. implemented the computational method with contributions from M.L., M.F. and M.A.H. M.F. integrated CANOPUS into SIRIUS v.4.4. K.D., L.-F.N. and P.C.D. applied and evaluated the method in the mouse and Euphorbia studies. R.R. isolated rivulariapeptolide 1155 and applied CANOPUS (on mass spectrometry data collected and analyzed by D.P. and R.R. and supervised by W.H.G.) and one-/two-dimensional NMR analysis for its structural elucidation. K.D., S.B., L.-F.N. and R.R. wrote the manuscript, in concert with all authors.

Corresponding author

Correspondence to Sebastian Böcker.

Ethics declarations

Competing interests

S.B., K.D., M.L., M.F. and M.A.H. are cofounders of Bright Giant GmbH. P.C.D. is scientific advisor for Sirenas LLC.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 CANOPUS performance sunburst plot.

Matthews correlation coefficient (MCC) for the 782 of 2,497 compound classes with at least 50 positive examples in the SVM training dataset. A darker green coloring corresponds to better prediction performance for the class. The size of each slice is chosen such that all classes fit into the figure and has no further meaning. Inner slices represent parent classes of outer slices.
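For readers unfamiliar with the metric, the MCC used in this figure can be computed from the four confusion-matrix counts of a single binary class predictor. The following is a minimal sketch, not the evaluation code of the paper; the function name and the example counts are hypothetical, and returning 0.0 for an undefined denominator is a common convention assumed here.

```python
import math


def matthews_corrcoef(tp: int, fp: int, tn: int, fn: int) -> float:
    """Matthews correlation coefficient from confusion-matrix counts.

    Returns 0.0 when the denominator is zero (assumed convention),
    which happens if any row or column of the confusion matrix is empty.
    """
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0
    return (tp * tn - fp * fn) / denom


if __name__ == "__main__":
    # Hypothetical counts for one compound class with 55 positive examples.
    print(matthews_corrcoef(tp=45, fp=5, tn=940, fn=10))
```

Unlike accuracy, the MCC stays informative for rare classes, because a predictor that always answers "no" receives a score of 0 rather than a score close to 1.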

Extended Data Fig. 2 Effect of removing a subclass from the MS/MS training data.

a–c, Regular evaluation setup: classes and subclasses are distributed into cross-validation folds, ensuring that methods are never evaluated on the same MS/MS data or structures they were trained on. d–f, We remove all flavonoid glycosides (the subclass) from the MS/MS training data (d), and then evaluate the predictor for glycosides (the class) on these removed MS/MS spectra (e). A perfect method would still classify all flavonoid glycoside MS/MS spectra as glycosides (f). CANOPUS exhibits only a small drop (98% to 97%) in correct classifications (c,f). In contrast, direct prediction performed mostly on par with CANOPUS before removing flavonoid glycosides from the MS/MS training data (c), but misses almost all of them (8%) afterwards (f). We were able to attribute this to the presence of isoflavonoid glycosides in the training data; these do not belong to the flavonoid class, but have highly similar structures and MS/MS spectra, except for the presence of a sugar residue. We observed that direct prediction in (d–f) uses the presence of a sugar residue to infer that an MS/MS spectrum is not a glycoside. In contrast, CANOPUS does not fall for this ‘bait’; heterogeneous training allows us to integrate the substantially more comprehensive structure data in its predictions.
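The structure-disjoint requirement in (a–c) can be illustrated with a small sketch. It assumes each spectrum comes with a structure key (hypothetical identifiers such as "STRUCT_A") and assigns whole structures to folds with a stable hash, so all spectra of one structure share a fold. This is an illustration of the idea only and not the fold assignment used in the paper.

```python
import zlib
from collections import defaultdict
from typing import Dict, List


def assign_folds(spectrum_to_structure: Dict[str, str],
                 n_folds: int = 5) -> Dict[int, List[str]]:
    """Group spectra into folds so that no structure spans two folds.

    The fold index is derived from a CRC32 hash of the structure key,
    which is deterministic across runs (unlike Python's built-in hash).
    """
    folds: Dict[int, List[str]] = defaultdict(list)
    for spectrum_id, structure_key in spectrum_to_structure.items():
        fold = zlib.crc32(structure_key.encode("utf-8")) % n_folds
        folds[fold].append(spectrum_id)
    return dict(folds)


if __name__ == "__main__":
    # Hypothetical spectra; s1 and s2 were measured from the same structure.
    mapping = {"s1": "STRUCT_A", "s2": "STRUCT_A",
               "s3": "STRUCT_B", "s4": "STRUCT_C"}
    print(assign_folds(mapping, n_folds=3))
```

Hash-based assignment does not balance fold sizes; a real evaluation would likely balance folds and classes explicitly.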

Extended Data Fig. 3 Relative number of compounds annotated at varying ClassyFire class levels in the mice study (a) and the Euphorbia plant study (b).

The ClassyFire ChemOnt ontology is organized as a tree, where the Kingdom is either Organic compounds or Inorganic compounds. Superclasses such as Lipids and lipid-like molecules or Benzenoids are children of a Kingdom. Flavonoids and Steroids and steroid derivatives are examples of the Class level, while Flavonoid glycosides and Bile acids, alcohols, and derivatives are examples of subclasses. There can be up to 11 levels in the ontology. c, ClassyFire classes of compounds in the biological databases. We observe a similar distribution of class levels as for the two biological datasets, indicating that CANOPUS comprehensively classifies compounds at all possible compound class levels.
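The tree organization described above can be sketched with a parent map. The excerpt below is a tiny hand-typed subset of the ontology for illustration only; the dictionary, function names and fallback level label are assumptions and not part of ClassyFire or CANOPUS.

```python
from typing import Dict, List, Optional

# Hand-typed toy excerpt of the tree: child -> parent (None marks a Kingdom).
PARENT: Dict[str, Optional[str]] = {
    "Organic compounds": None,
    "Phenylpropanoids and polyketides": "Organic compounds",
    "Flavonoids": "Phenylpropanoids and polyketides",
    "Flavonoid glycosides": "Flavonoids",
    "Lipids and lipid-like molecules": "Organic compounds",
    "Steroids and steroid derivatives": "Lipids and lipid-like molecules",
}

LEVEL_NAMES = ["Kingdom", "Superclass", "Class", "Subclass"]


def lineage(node: str) -> List[str]:
    """Path from the Kingdom down to the given class."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path[::-1]


def level_name(node: str) -> str:
    """Named level of a class; deeper levels get a generic label."""
    depth = len(lineage(node)) - 1
    if depth < len(LEVEL_NAMES):
        return LEVEL_NAMES[depth]
    return f"Level {depth + 1}"


if __name__ == "__main__":
    print(lineage("Flavonoid glycosides"))
    print(level_name("Steroids and steroid derivatives"))
```

Because every class has exactly one parent in this representation, a compound annotated with a subclass is implicitly a member of every class on its lineage.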

Extended Data Fig. 4 Molecular network and compound class annotations (single class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; displayed compound classes were manually selected. When a compound is annotated with multiple classes, the class with the larger structural pattern is selected. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.
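The edge rule in this caption (connect two nodes if their spectral similarity is 0.7 or higher) can be sketched as follows. The caption does not state which similarity measure is used, so the plain cosine on pre-binned intensity vectors below is a stand-in assumption, and the node names and vectors are hypothetical.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equally binned intensity vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_edges(spectra: Dict[str, Sequence[float]],
                threshold: float = 0.7) -> List[Tuple[str, str]]:
    """Return node pairs whose similarity reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(sorted(spectra), 2)
        if cosine(spectra[a], spectra[b]) >= threshold
    ]


if __name__ == "__main__":
    # Hypothetical binned spectra: n1 and n2 share peaks, n3 does not.
    toy = {
        "n1": [1.0, 0.0, 1.0, 0.0],
        "n2": [1.0, 0.0, 0.8, 0.0],
        "n3": [0.0, 1.0, 0.0, 1.0],
    }
    print(build_edges(toy))
```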

Extended Data Fig. 5 Molecular network and compound class annotations (multiple class annotations) for the mice digestive system.

Node colors indicate the compound class annotated by CANOPUS; compound classes are the same as in Supplementary Fig. 4. Compounds belonging to multiple classes are displayed as multicolored nodes. Nodes are connected by an edge if the spectral similarity is 0.7 or higher.

Extended Data Fig. 6 Number of compounds detected for each Euphorbia subgenus.

Orange bars indicate the number of compounds detected here, black ticks indicate the number of compounds reported in the original study. Higher numbers of detected features are not a measure of quality for the two methods, but depend mainly on the preprocessing executed before compound classification.

Extended Data Fig. 7 Number of compounds annotated as diterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of diterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of diterpenoids in the original study by Ernst et al.

Extended Data Fig. 8 Number of compounds annotated as triterpenoids in different species of Euphorbia.

Left: absolute number of compounds. Right: relative number of compounds, that is, number of triterpenoids divided by total number of compounds in each species. Black ticks in the left figure mark the reported number of triterpenoids in the original study by Ernst et al.

Extended Data Fig. 9 Number of diterpenoids in different species of Euphorbia.

Black bars show the number of diterpenoids that have a benzoic acid ester (a), fatty acid ester (b) or two carboxylic acids (c).

About this article

Cite this article

Dührkop, K., Nothias, L.-F., Fleischauer, M. et al. Systematic classification of unknown metabolites using high-resolution fragmentation mass spectra. Nat. Biotechnol. (2020). https://doi.org/10.1038/s41587-020-0740-8
