Auto-deconvolution and molecular networking of gas chromatography–mass spectrometry data

Abstract

We engineered a machine learning approach, MSHub, to enable auto-deconvolution of gas chromatography–mass spectrometry (GC–MS) data. We then designed workflows to enable the community to store, process, share, annotate, compare and perform molecular networking of GC–MS data within the Global Natural Product Social (GNPS) Molecular Networking analysis platform. MSHub/GNPS performs auto-deconvolution of compound fragmentation patterns via unsupervised non-negative matrix factorization and quantifies the reproducibility of fragmentation patterns across samples.
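To make the factorization idea concrete, the following is a minimal sketch of non-negative matrix factorization applied to a toy GC–MS-like matrix (scans × m/z channels), using the classic multiplicative update rules. It is not the MSHub implementation: the function names, the synthetic two-compound data and all parameter values are assumptions chosen for illustration, and plain Python lists are used only to keep the sketch dependency-free.

```python
import math
import random


def transpose(a):
    """Return the transpose of a matrix stored as a list of rows."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    b_cols = transpose(b)
    return [[sum(x * y for x, y in zip(row, col)) for col in b_cols] for row in a]


def frobenius_error(x, w, h):
    """Frobenius norm of the residual X - W @ H."""
    approx = matmul(w, h)
    return math.sqrt(sum((xv - av) ** 2
                         for x_row, a_row in zip(x, approx)
                         for xv, av in zip(x_row, a_row)))


def scale(m, num, den, eps):
    """Element-wise multiplicative update: m * num / (den + eps)."""
    return [[mv * nv / (dv + eps) for mv, nv, dv in zip(mr, nr, dr)]
            for mr, nr, dr in zip(m, num, den)]


def nmf(x, rank, n_iter=200, seed=0, eps=1e-9):
    """Factor a non-negative matrix X (scans x m/z) as W @ H.

    W (scans x rank) plays the role of elution profiles and
    H (rank x m/z) the role of fragmentation patterns.
    """
    rng = random.Random(seed)
    w = [[rng.random() for _ in range(rank)] for _ in x]
    h = [[rng.random() for _ in x[0]] for _ in range(rank)]
    for _ in range(n_iter):
        wt = transpose(w)
        h = scale(h, matmul(wt, x), matmul(matmul(wt, w), h), eps)
        ht = transpose(h)
        w = scale(w, matmul(x, ht), matmul(w, matmul(h, ht)), eps)
    return w, h


def elution_profile(n_scans, apex, width):
    """Gaussian-shaped chromatographic peak (illustrative assumption)."""
    return [math.exp(-0.5 * ((scan - apex) / width) ** 2) for scan in range(n_scans)]


# Synthetic data: two co-eluting compounds with distinct toy spectra.
profiles = transpose([elution_profile(30, 12, 3.0), elution_profile(30, 16, 3.0)])
spectra = [[100.0, 0.0, 40.0, 0.0, 10.0, 0.0],
           [0.0, 80.0, 30.0, 0.0, 0.0, 50.0]]
X = matmul(profiles, spectra)

W0, H0 = nmf(X, rank=2, n_iter=0)   # random starting point (same seed)
W, H = nmf(X, rank=2, n_iter=200)   # after multiplicative updates
print("initial error:", frobenius_error(X, W0, H0))
print("final error:  ", frobenius_error(X, W, H))
```

Because the updates only multiply non-negative quantities, both factors stay non-negative throughout, which is what allows the rows of `H` to be read as spectra. For real data, an optimized library routine would be the more practical choice.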

Data availability

All of the data used in the preparation of this manuscript are publicly available at the MassIVE repository at the University of California, San Diego Center for Computational Mass Spectrometry website (https://massive.ucsd.edu). The data set accession numbers are: #1 (MSV000084033), #2 (MSV000085136), #3 (MSV000084034), #4 (MSV000084036), #5 (MSV000084032), #6 (MSV000084038), #7 (MSV000084042), #8 (MSV000084039), #9 (MSV000084040), #10 (MSV000084037), #11 (MSV000084211), #12 (MSV000083598), #13 (MSV000080892), #14 (MSV000080892), #15 (MSV000080892), #16 (MSV000084337), #17 (MSV000083658), #18 (MSV000083743), #19 (MSV000084226), #20 (MSV000083859), #21 (MSV000083294), #22 (MSV000084349), #23 (MSV000081340), #24 (MSV000084348), #25 (MSV000084378), #26 (MSV000084338), #27 (MSV000084339), #28 (MSV000081161), #29 (MSV000084350), #30 (MSV000084377), #31 (MSV000084145), #32 (MSV000084144), #33 (MSV000084146), #34 (MSV000084379), #35 (MSV000084380), #36 (MSV000084276), #37 (MSV000084277) and #38 (MSV000084212).

All of the GNPS analysis jobs for all of the studies are summarized in Supplementary Table 1.

Code availability

The source code of the MSHub software, including low- and high-resolution data processing versions, is available online at GitHub (version used in GNPS: https://github.com/CCMS-UCSD/GNPS_Workflows/tree/master/mshub-gc/tools/mshub-gc/proc) and at Bitbucket (standalone version in the MSHub developers’ repository, both high and low resolution: https://bitbucket.org/iAnalytica/mshub_process/src/master/). Scripts used to parse, filter and organize data and to generate the plots in the manuscript are available online at GitHub (https://github.com/bittremieux/GNPS_GC_fig). The script for merging individual .mgf files into a single file for creating a global network is available at GitHub (https://github.com/bittremieux/GNPS_GC/blob/master/src/merge_mgf.py).
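As an illustration of what merging .mgf files involves, the sketch below concatenates the `BEGIN IONS` ... `END IONS` blocks of several MGF documents and renumbers their `SCANS` fields so that identifiers stay unique in the merged output. It is a hypothetical example and not the authors' `merge_mgf.py`; the function names and the toy spectra are assumptions, and it works on in-memory strings rather than files to stay self-contained.

```python
def split_mgf_spectra(text):
    """Yield each BEGIN IONS ... END IONS block as a list of stripped lines."""
    block = None
    for line in text.splitlines():
        stripped = line.strip()
        if stripped == "BEGIN IONS":
            block = [stripped]
        elif stripped == "END IONS" and block is not None:
            block.append(stripped)
            yield block
            block = None
        elif block is not None and stripped:
            block.append(stripped)


def merge_mgf(texts):
    """Concatenate spectra from several MGF documents, renumbering SCANS."""
    merged = []
    scan = 0
    for text in texts:
        for block in split_mgf_spectra(text):
            scan += 1
            # Drop the per-file scan number and assign a global one.
            body = [line for line in block[1:-1] if not line.startswith("SCANS=")]
            merged.append("\n".join(["BEGIN IONS", f"SCANS={scan}", *body, "END IONS"]))
    return "\n\n".join(merged) + "\n"


# Toy inputs standing in for the contents of two .mgf files.
file_a = "BEGIN IONS\nSCANS=1\n73.0 100.0\n147.0 40.0\nEND IONS\n"
file_b = ("BEGIN IONS\nSCANS=1\n57.0 80.0\nEND IONS\n"
          "BEGIN IONS\nSCANS=2\n91.0 55.0\nEND IONS\n")

merged = merge_mgf([file_a, file_b])
print(merged)
```

Renumbering is one simple way to avoid identifier collisions between files; a real merge may also need to preserve which file each spectrum came from.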

The three-dimensional model, the feature table with coordinates used for the mapping and the snapshots shown in Fig. 4a–d are available at https://github.com/aaksenov1/Human-volatilome-3D-mapping-. The GC–MS-adapted MolNetEnhancer code with an example Jupyter notebook can be found at https://github.com/madeleineernst/pyMolNetEnhancer. Source data are provided with this paper.

Acknowledgements

The conversion of the data from different repositories was supported by grant R03 CA211211 on reuse of metabolomics data, to build enabling chemical analysis tools for the ocean symbiosis program, and the development of a user-friendly interface for GC–MS analysis was supported by the Gordon and Betty Moore Foundation through grant GBMF7622. The University of California, San Diego Center for Microbiome Innovation supported the campus-wide seed grant awards for data collection that enabled the development of some of this infrastructure. P.C.D. was supported by the National Science Foundation (grant no. IOS-1656475) and the National Institutes of Health (NIH) (grant nos. U19 AG063744 01, P41 GM103484, R03 CA211211 and R01 GM107550). K.V. and I.L. are very grateful for the support of the Vodafone Foundation as part of the DRUGS/DreamLab project. The MSHub platform development was supported by NIH/NIAAA grant (R21 AA028432) on integrated machine learning for mass spectrometry data in liver disease, Intelligify Limited and Vodafone Foundation’s DRUGS/CORONA-AI projects on network machine learning for drug repositioning and discovery of hyperfoods with antiviral/anticancer molecules. M.E. was supported by the University of Corsica. L.F.N. was supported by the NIH (R01 GM107550) and the European Union’s Horizon 2020 Research and Innovation Programme (MSCA-GF, 704786). A.B. was supported by the National Institute of Justice Award (2015-DN-BX-K047). Additional support for data acquisition and data storage was provided by the Center for Computational Mass Spectrometry (P41 GM103484). The collection of data from the HomeChem Project was supported by the Sloan Foundation. G.B.H., S.L.F.D., I.L., K.V. and I.B. are grateful for the support of the OG cancer breath analysis study by the National Institute for Health Research London In Vitro Diagnostic Co-operative and the NIHR Imperial Biomedical Research Centre, the Rosetrees and Stonegate Trusts and the Imperial College Charity.
D.V. acknowledges support from ERC-Consolidator grant 724228 (LEMAN). I.B. acknowledges the contribution of Q. Wen and M. Colavita in the production of the training video. C. Callewaert was supported by the Research Foundation Flanders, with support from the industrial research fund of Ghent University. W.B. was supported by the Research Foundation Flanders. A.A.O. acknowledges the support of the Fulbright Commission and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET-Argentina). The work of R.L. and P.L.B. on data set 30 was supported by Metaboscope, part of the ‘Platform 3A’, funded by the European Regional Development Fund, the French Ministry of Research, Higher Education and Innovation, the Provence-Alpes-Côte d’Azur region, the Departmental Council of Vaucluse and the Urban Community of Avignon. S.A. and A.R.F. acknowledge the PlantaSYST project by the European Union’s Horizon 2020 Research and Innovation Programme (SGA-CSA nos. 664621 and 739582 under FPA no. 664620). V.V. acknowledges support from the National Institute on Alcohol Abuse and Alcoholism award R24AA022057. A portion of the mass spectra in the public reference library was produced within the framework of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS and with the support of the RUDN University Program 5-100. R.S.B. acknowledges support of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS. L.N.K. acknowledges support of the RUDN University Program 5-100. I.M. acknowledges support of the Israel Science Foundation (project no. 1947/19) and European Research Council under the European Union’s Horizon 2020 Research and Innovation Programme (project no. 640384). J.S. has been supported by NIH/NIAMS R03AR072182, the Colton Center for Autoimmunity, the Rheumatology Research Foundation, the Riley Family Foundation and the Snyder Family Foundation. J. Manasson acknowledges support from the 2017 Group for Research and Assessment of Psoriasis and Psoriatic Arthritis Pilot Research Grant and NIH/NIAMS T32AR069515. R.G. is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship. J.J.J.v.d.H. acknowledges support from an ASDI eScience grant (ASDI.2017.030) from the Netherlands eScience Center-NLeSC. B.A. was supported by the National Science Foundation through the Graduate Research Fellowship Program. GC–MS analyses for collection of the MSV000083743 data set were supported by the Pacific Northwest National Laboratory, Laboratory-Directed Research and Development Program, and were contributed by the Microbiomes in Transition Initiative; data were collected in the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the Department of Energy (DOE) Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory (PNNL). PNNL is operated by the Battelle Memorial Institute for the DOE under contract DEAC05-76RLO1830. M. Guma and R.C. acknowledge the support of the Krupp Endowed Fund grant. R.C. was also funded by T32AR064194-07. The authors are grateful to R. da Silva for his contribution to developing the first prototype of the EI data network and his continuous assistance with further development and testing of the infrastructure. The authors are also grateful to M. Vance and D. Farmer, who organized the sampling for the HomeChem Indoor Chemistry Project (https://indoorchem.org/projects/homechem/) that allowed the collection of samples for the MSV000083598 data set. B. Ross assisted with collecting data for the MSV000084348 data set.
GC–MS analyses for collection of the MSV000084211 and MSV000084212 data sets were supported by N757 Doctorados Nacionales and project EXT-2016-69-1713 from the Departamento Administrativo de Ciencia, Tecnología e Innovación (COLCIENCIAS), the seed project INV-2019-67-1747 and the FAPA project of Chiara Carazzone from the Faculty of Science at Universidad de los Andes and the grant FP80740-064-2016 of COLCIENCIAS. The authors are grateful to L. M. Garzón, P. Palacios, M. Gonzalez and J. Hernandez for their contributions to collecting the samples and to J. Oswaldo Turizo for designing and manufacturing the amphibian electrical stimulator. A.S. and X.D. acknowledge support from National Cancer Institute award U01CA235507. The authors are grateful to S. Neuman for feedback regarding the XCMS deconvolution tool.

Author information

Author notes

  1. James T. Morton

    Present address: Center for Computational Biology, Flatiron Institute, Simons Foundation, New York, NY, USA

  2. These authors contributed equally: Alexander A. Aksenov, Ivan Laponogov.

Affiliations

  1. Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA

    Alexander A. Aksenov, Zheng Zhang, Wout Bittremieux, Louis Felix Nothias, Mélissa Nothias-Esposito, Katherine N. Maloney, Alexey V. Melnik, Kenneth L. Jones II, Kathleen Dorrestein, Morgan Panitchpakdi, Madeleine Ernst, Justin J. J. van der Hooft, Amina Bouslimani, Daniel Petras, Mingxun Wang & Pieter C. Dorrestein

  2. Collaborative Mass Spectrometry Innovation Center, Skaggs School of Pharmacy and Pharmaceutical Sciences, University of California, San Diego, La Jolla, CA, USA

    Alexander A. Aksenov, Wout Bittremieux, Louis Felix Nothias, Mélissa Nothias-Esposito, Kathleen Dorrestein, Amina Bouslimani, Daniel Petras, Mingxun Wang & Pieter C. Dorrestein

  3. Department of Surgery and Cancer, Imperial College London, South Kensington Campus, London, UK

    Ivan Laponogov, Sophie L. F. Doran, Ilaria Belluomo, George B. Hanna & Kirill Veselkov

  4. Intelligify Limited, London, UK

    Dennis Veselkov

  5. Department of Computing, Imperial College, South Kensington Campus, London, UK

    Dennis Veselkov

  6. Department of Computer Science, University of Antwerp, Antwerp, Belgium

    Wout Bittremieux

  7. Department of Chemistry, Point Loma Nazarene University, San Diego, CA, USA

    Katherine N. Maloney

  8. Center for Precision Medicine, Department of Internal Medicine, Section of Molecular Medicine, Wake Forest School of Medicine, Winston-Salem, NC, USA

    Biswapriya B. Misra

  9. Department of Bioinformatics and Genomics, University of North Carolina at Charlotte, Charlotte, NC, USA

    Aleksandr Smirnov & Xiuxia Du

  10. Section for Clinical Mass Spectrometry, Department of Congenital Disorders, Danish Center for Neonatal Screening, Statens Serum Institut, Copenhagen, Denmark

    Madeleine Ernst

  11. Bioinformatics Group, Wageningen University, Wageningen, the Netherlands

    Justin J. J. van der Hooft

  12. Department of Chemistry, Universidad de los Andes, Bogotá, Colombia

    Mabel Gonzalez & Chiara Carazzone

  13. Department of Biological Sciences, Universidad de los Andes, Bogotá, Colombia

    Adolfo Amézquita

  14. Center for Microbial Ecology and Technology, Ghent, Belgium

    Chris Callewaert

  15. Department of Pediatrics, University of California, San Diego, La Jolla, CA, USA

    Chris Callewaert, James T. Morton, Rob Knight & Pieter C. Dorrestein

  16. Department of Biochemistry and Molecular Biology, Michigan State University, East Lansing, MI, USA

    Robert A. Quinn

  17. IRNASUS, Universidad Católica de Córdoba, CONICET, Facultad de Ciencias Agropecuarias, Córdoba, Argentina

    Andrea Albarracín Orio

  18. Universidad Nacional de Córdoba, Facultad de Ciencias Químicas, Departamento de Química Biológica Ranwel Caputto, Córdoba, Argentina

    Andrea M. Smania

  19. CONICET, Universidad Nacional de Córdoba, Centro de Investigaciones en Química Biológica de Córdoba (CIQUIBIC), Córdoba, Argentina

    Andrea M. Smania

  20. Biological Sciences Division, Pacific Northwest National Laboratory, Richland, WA, USA

    Sneha P. Couvillion, Meagan C. Burnet, Carrie D. Nicora, Erika Zink & Thomas O. Metz

  21. LECO Corporation, St. Joseph, MI, USA

    Viatcheslav Artaev & Elizabeth Humston-Fulmer

  22. Department of Chemistry and the National Institute for Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel

    Rachel Gregor & Michael M. Meijler

  23. Department of Life Sciences and the National Institute for Biotechnology in the Negev, Ben-Gurion University of the Negev, Beer-Sheva, Israel

    Itzhak Mizrahi & Stav Eyal

  24. Division of Biological Sciences, University of California, San Diego, La Jolla, CA, USA

    Brooke Anderson & Rachel Dutton

  25. UMR Qualisud, Université d’Avignon et des Pays du Vaucluse, Agrosciences, Avignon, France

    Raphaël Lugan & Pauline Le Boulch

  26. Laboratoire d’Etude des Résidus et Contaminants dans les Aliments (LABERCA), Oniris, INRAe, Nantes, France

    Yann Guitton, Stephanie Prevost, Audrey Poirier, Gaud Dervilly & Bruno Le Bizec

  27. The French Associates Institute for Agriculture and Biotechnology of Dryland, The Jacob Blaustein Institutes for Desert Research, Ben Gurion University of the Negev, Sede Boqer Campus, Beer Sheva, Israel

    Aaron Fait, Noga Sikron Persi, Chao Song & Kelem Gashu

  28. Division of Rheumatology, Department of Medicine, University of California, San Diego, La Jolla, CA, USA

    Roxana Coras & Monica Guma

  29. Division of Rheumatology, Department of Medicine, New York University School of Medicine, New York, NY, USA

    Julia Manasson & Jose U. Scher

  30. Department of Environmental Medicine and Public Health, Icahn School of Medicine at Mount Sinai, New York, NY, USA

    Dinesh Kumar Barupal

  31. Max Planck Institute for Molecular Plant Physiology, Potsdam-Golm, Germany

    Saleh Alseekh & Alisdair R. Fernie

  32. Center of Plant Systems Biology and Biotechnology (CPSBB), Plovdiv, Bulgaria

    Saleh Alseekh & Alisdair R. Fernie

  33. Department of Colorectal Surgery, Royal Free Hospital NHS Foundation Trust, Hampstead, London, UK

    Reza Mirnezami

  34. Department of Environmental Health Sciences, Yale School of Public Health, Yale University, New Haven, CT, USA

    Vasilis Vasiliou

  35. Institute of Inorganic and Analytical Chemistry, University of Münster, Münster, Germany

    Robin Schmid

  36. A.V. Topchiev Institute of Petrochemical Synthesis RAS, Moscow, Russian Federation

    Roman S. Borisov

  37. Peoples’ Friendship University of Russia (RUDN University), Moscow, Russian Federation

    Larisa N. Kulikova

  38. UCSD Center for Microbiome Innovation, University of California, San Diego, La Jolla, CA, USA

    Rob Knight & Pieter C. Dorrestein

  39. Department of Bioengineering, University of California, San Diego, La Jolla, CA, USA

    Rob Knight

  40. Department of Computer Science & Engineering, University of California, San Diego, La Jolla, CA, USA

    Rob Knight

Contributions

P.C.D., A.A.A., M.W. and L.F.N. developed the concept of GNPS for GC–MS data. K.V. designed and supervised MSHub platform development. I.L., D.V., V.V. and K.V. developed the MSHub platform. M.W., Z.Z. and A.A.A. developed the workflows. A.A.A., Z.Z., M.W., B.B.M. and R.S.B. performed infrastructure testing and benchmarking. A.A.A. and Z.Z. assessed EI-based molecular networking. W.B. generated plots for MSHub algorithm performance testing and benchmarking against existing deconvolution tools. Z.Z., A.A. and M.E. generated molecular network plots. M.E. and J.J.J.v.d.H. adapted the MolNetEnhancer workflow for GC–MS molecular networks. A.S., X.D., A.A.A. and B.B.M. conducted comparative testing of MSHub with existing deconvolution tools. A.A.A., A.V.M., M.P., K.L.J. and K.D. conducted three-dimensional skin volatilome mapping studies. S.L.F.D., I.B. and G.B.H. conducted the breath analysis study for the detection of esophageal and gastric cancers. A.A.A., Z.Z., M.P. and M.W. converted and added public libraries to GNPS. A.A.A., A.V.M., S.L.F.D., C. Callewaert, B.B.M., M. Gonzalez, C. Carazzone, A.A., J.T.M., R.A.Q., A.B., A.A.O., D.P., A.M.S., S.P.C., T.O.M., M.C.B., C.D.N., E.Z., V.A., E.H.-F., R.G., M.M.M., I.M., S.E., P.L.B., B.A., R.D., R.L., Y.G., S.P., A.P., G.D., B.L.B., A.F., N.S.P., K.G., C.S., R.C., M. Guma, J. Manasson, J.U.S., D.K.B., S.A. and A.R.F. generated GC–MS data. R.S.B., L.N.K., M.P. and A.A.A. assembled the initial version of the public reference spectra library. R.S. created the MZmine export module for GNPS GC–MS input files and RI markers file export. A.A.A., R.S., I.B., A.A.O., A.M.S., B.A., M. Gonzalez, K.N.M. and R.S.B. produced training videos. M.N.-E., A.A.A., M. Gonzalez, B.B.M., A.S. and L.F.N. wrote and compiled tutorials and documentation. P.C.D., A.A.A., W.B., K.V., R.M. and R.K. wrote the paper.

Corresponding authors

Correspondence to Pieter C. Dorrestein or Kirill Veselkov.

Ethics declarations

Competing interests

P.C.D. is a scientific advisor for Sirenas, Galileo and Cybele. P.C.D. is a scientific advisor and cofounder of Enveda and Ometa; this has been approved by UC San Diego. M.W. is a consultant for Sirenas and the founder of Ometa Labs. A.A.A. is a consultant for Ometa Labs.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Aksenov, A.A., Laponogov, I., Zhang, Z. et al. Auto-deconvolution and molecular networking of gas chromatography–mass spectrometry data. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0700-3
