We engineered a machine learning approach, MSHub, to enable auto-deconvolution of gas chromatography–mass spectrometry (GC–MS) data. We then designed workflows to enable the community to store, process, share, annotate, compare and perform molecular networking of GC–MS data within the Global Natural Products Social Molecular Networking (GNPS) analysis platform. MSHub/GNPS performs auto-deconvolution of compound fragmentation patterns via unsupervised non-negative matrix factorization and quantifies the reproducibility of fragmentation patterns across samples.
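To illustrate the idea of deconvolution by non-negative matrix factorization, the following is a minimal sketch on synthetic data. It is not the MSHub implementation: the function name `nmf`, the multiplicative-update scheme, the toy elution profiles and the toy spectra are all assumptions chosen for illustration. A scans × m/z intensity matrix `X` is factored into non-negative elution profiles `W` and non-negative fragmentation patterns `H`.

```python
import numpy as np

def nmf(X, n_components, n_iter=1000, seed=0, eps=1e-9):
    """Factor non-negative X (scans x m/z) as W @ H using multiplicative
    updates for the Frobenius norm (illustrative, not MSHub's code)."""
    rng = np.random.default_rng(seed)
    n_scans, n_mz = X.shape
    W = rng.random((n_scans, n_components)) + eps  # elution profiles
    H = rng.random((n_components, n_mz)) + eps     # fragmentation patterns
    for _ in range(n_iter):
        # Each update multiplies by a non-negative ratio, so W and H stay >= 0.
        H *= (W.T @ X) / (W.T @ W @ H + eps)
        W *= (X @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Hypothetical toy data: two co-eluting compounds with Gaussian elution profiles.
scans = np.arange(60)
profiles = np.stack(
    [np.exp(-0.5 * ((scans - 25) / 4.0) ** 2),
     np.exp(-0.5 * ((scans - 32) / 4.0) ** 2)],
    axis=1,
)  # shape (60, 2)
spectra = np.array(
    [[100.0, 0.0, 40.0, 0.0, 10.0, 5.0],
     [0.0, 80.0, 0.0, 30.0, 10.0, 0.0]]
)  # shape (2, 6): intensity per m/z channel
X = profiles @ spectra  # overlapping signal as the instrument would record it

W, H = nmf(X, n_components=2)
rel_error = np.linalg.norm(X - W @ H) / np.linalg.norm(X)
print(f"relative reconstruction error: {rel_error:.4f}")
```

Each row of `H` can be read as a candidate fragmentation pattern and each column of `W` as its elution profile. The factors are recovered only up to scaling and ordering, so the rows of `H` need not appear in the same order as the rows of `spectra`.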
All of the data used in the preparation of this manuscript are publicly available at the MassIVE repository at the University of California, San Diego Center for Computational Mass Spectrometry website (https://massive.ucsd.edu). The data set accession numbers are: #1 (MSV000084033), #2 (MSV000085136), #3 (MSV000084034), #4 (MSV000084036), #5 (MSV000084032), #6 (MSV000084038), #7 (MSV000084042), #8 (MSV000084039), #9 (MSV000084040), #10 (MSV000084037), #11 (MSV000084211), #12 (MSV000083598), #13 (MSV000080892), #14 (MSV000080892), #15 (MSV000080892), #16 (MSV000084337), #17 (MSV000083658), #18 (MSV000083743), #19 (MSV000084226), #20 (MSV000083859), #21 (MSV000083294), #22 (MSV000084349), #23 (MSV000081340), #24 (MSV000084348), #25 (MSV000084378), #26 (MSV000084338), #27 (MSV000084339), #28 (MSV000081161), #29 (MSV000084350), #30 (MSV000084377), #31 (MSV000084145), #32 (MSV000084144), #33 (MSV000084146), #34 (MSV000084379), #35 (MSV000084380), #36 (MSV000084276), #37 (MSV000084277) and #38 (MSV000084212).
All of the GNPS analysis jobs for all of the studies are summarized in Supplementary Table 1.
The source code of the MSHub software, including low- and high-resolution data processing versions, is available online at GitHub (version used in GNPS) (https://github.com/CCMS-UCSD/GNPS_Workflows/tree/master/mshub-gc/tools/mshub-gc/proc) and at Bitbucket (standalone version in the MSHub developers’ repository, both high and low resolution: https://bitbucket.org/iAnalytica/mshub_process/src/master/). Scripts used to parse, filter and organize data and to generate the plots in the manuscript are available online at GitHub (https://github.com/bittremieux/GNPS_GC_fig). A script for merging individual .mgf files into a single file for creating a global network is available at GitHub (https://github.com/bittremieux/GNPS_GC/blob/master/src/merge_mgf.py).
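As an illustration of what merging .mgf files involves, the sketch below concatenates several files and counts the spectra written. It is not the authors' `merge_mgf.py`: the function name `merge_mgf`, the file names and the toy spectra are hypothetical, and the linked script may do more (for example, renumbering or annotating spectra). The sketch assumes only that each spectrum is delimited by `BEGIN IONS` and `END IONS` lines.

```python
import tempfile
from pathlib import Path

def merge_mgf(in_paths, out_path):
    """Concatenate MGF files into out_path and return the number of spectra
    (BEGIN IONS blocks) written. Illustrative sketch only."""
    n_spectra = 0
    with open(out_path, "w", encoding="utf-8") as out:
        for path in in_paths:
            with open(path, encoding="utf-8") as src:
                for line in src:
                    line = line.rstrip("\r\n")
                    if line.strip() == "BEGIN IONS":
                        n_spectra += 1
                    # Always terminate lines so blocks from different files
                    # cannot run together, even without a trailing newline.
                    out.write(line + "\n")
            out.write("\n")  # blank line between input files
    return n_spectra

# Demonstration on two hypothetical single-spectrum files.
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    first, second = tmp / "a.mgf", tmp / "b.mgf"
    first.write_text("BEGIN IONS\nSCANS=1\n73.0 100.0\n147.0 35.0\nEND IONS\n",
                     encoding="utf-8")
    # The second file deliberately lacks a trailing newline.
    second.write_text("BEGIN IONS\nSCANS=1\n57.0 80.0\n71.0 20.0\nEND IONS",
                      encoding="utf-8")
    merged = tmp / "merged.mgf"
    n_merged = merge_mgf([first, second], merged)
    merged_text = merged.read_text(encoding="utf-8")

print(n_merged)
```

Because both toy inputs carry `SCANS=1`, the merged file contains duplicate scan identifiers. A production script would likely need to reassign them before networking.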
The three-dimensional model, the feature table with coordinates used for the mapping and the snapshots shown in Fig. 4a–d are available at https://github.com/aaksenov1/Human-volatilome-3D-mapping-. The GC–MS-adapted MolNetEnhancer code with an example Jupyter notebook can be found at https://github.com/madeleineernst/pyMolNetEnhancer. Source data are provided with this paper.
The conversion of the data from different repositories was supported by grant R03 CA211211 on reuse of metabolomics data, to build enabling chemical analysis tools for the ocean symbiosis program, and the development of a user-friendly interface for GC–MS analysis was supported by the Gordon and Betty Moore Foundation through grant GBMF7622. The University of California, San Diego Center for Microbiome Innovation supported the campus-wide seed grant awards for data collection that enabled the development of some of this infrastructure. P.C.D. was supported by the National Science Foundation (grant no. IOS-1656475) and the National Institutes of Health (NIH) (grant nos. U19 AG063744 01, P41 GM103484, R03 CA211211 and R01 GM107550). K.V. and I.L. are very grateful for the support of the Vodafone Foundation as part of the DRUGS/DreamLab project. The MSHub platform development was supported by NIH/NIAAA grant (R21 AA028432) on integrated machine learning for mass spectrometry data in liver disease, Intelligify Limited and Vodafone Foundation’s DRUGS/CORONA-AI projects on network machine learning for drug repositioning and discovery of hyperfoods with antiviral/anticancer molecules. M.E. was supported by the University of Corsica. L.F.N. was supported by the NIH (R01 GM107550) and the European Union’s Horizon 2020 Research and Innovation Programme (MSCA-GF, 704786). A.B. was supported by the National Institute of Justice Award (2015-DN-BX-K047). Additional support for data acquisition and data storage was provided by the Center for Computational Mass Spectrometry (P41 GM103484). The collection of data from the HomeChem Project was supported by the Sloan Foundation. G.B.H., S.L.F.D., I.L., K.V. and I.B. are grateful for the support of the OG cancer breath analysis study by the National Institute for Health Research London Invitro Diagnostic Co-operative and the NIHR Imperial Biomedical Research Centre, the Rosetrees and Stonegate Trusts and the Imperial College Charity. 
D.V. acknowledges support from ERC-Consolidator grant 724228 (LEMAN). I.B. acknowledges the contribution of Q. Wen and M. Colavita in the production of the training video. C. Callewaert was supported by the Research Foundation Flanders, with support from the industrial research fund of Ghent University. W.B. was supported by the Research Foundation Flanders. A.A.O. acknowledges the support of the Fulbright Commission and Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET-Argentina). The work of R.L. and P.L.B. on data set 30 was supported by Metaboscope, part of the ‘Platform 3A’, funded by the European Regional Development Fund, the French Ministry of Research, Higher Education and Innovation, the Provence-Alpes-Côte d’Azur region, the Departmental Council of Vaucluse and the Urban Community of Avignon. S.A. and A.R.F. acknowledge the PlantaSYST project by the European Union’s Horizon 2020 Research and Innovation Programme (SGA-CSA nos. 664621 and 739582 under FPA no. 664620). V.V. acknowledges support from the National Institute on Alcohol Abuse and Alcoholism award R24AA022057. M. Guma and R.C. acknowledge the support of the Krupp Endowed Fund grant. R.C. was also funded by T32AR064194-07. A portion of mass spectra in the public reference library was produced within the framework of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS and with the support of the RUDN University Program 5-100. R.S.B. acknowledges support of the State Task for the Topchiev Institute of Petrochemical Synthesis RAS. L.N.K. acknowledges support of the RUDN University Program 5-100. I.M. acknowledges support of the Israel Science Foundation (project no. 1947/19) and European Research Council under the European Union’s Horizon 2020 Research and Innovation Programme (project no. 640384). J.S. has been supported by NIH/NIAMS R03AR072182, the Colton Center for Autoimmunity, the Rheumatology Research Foundation, the Riley Family Foundation and the Snyder Family Foundation. J. Manasson acknowledges support from the 2017 Group for Research and Assessment of Psoriasis and Psoriatic Arthritis Pilot Research Grant and NIH/NIAMS T32AR069515. R.G. is grateful to the Azrieli Foundation for the award of an Azrieli Fellowship. J.J.J.v.d.H. acknowledges support from an ASDI eScience grant (ASDI.2017.030) from the Netherlands eScience Center-NLeSC. B.A. was supported by the National Science Foundation through the Graduate Research Fellowship Program. GC–MS analyses for collection of the MSV000083743 data set were supported by the Pacific Northwest National Laboratory, Laboratory-Directed Research and Development Program, and were contributed by the Microbiomes in Transition Initiative; data were collected in the Environmental Molecular Sciences Laboratory, a national scientific user facility sponsored by the Department of Energy (DOE) Office of Biological and Environmental Research and located at the Pacific Northwest National Laboratory (PNNL). PNNL is operated by the Battelle Memorial Institute for the DOE under contract DEAC05-76RLO1830. The authors are grateful to R. da Silva for his contribution to developing the first prototype of the EI data network and his continuous assistance with further development and testing of the infrastructure. The authors are also grateful to M. Vance and D. Farmer, who organized the sampling for the HomeChem Indoor Chemistry Project (https://indoorchem.org/projects/homechem/) that allowed the collection of samples for the MSV000083598 data set. B. Ross assisted with collecting data for the MSV000084348 data set.
GC–MS analyses for collection of the MSV000084211 and MSV000084212 data sets were supported by N757 Doctorados Nacionales and project EXT-2016-69-1713 from the Departamento Administrativo de Ciencia, Tecnología e Innovación (COLCIENCIAS), the seed project INV-2019-67-1747 and the FAPA project of Chiara Carazzone from the Faculty of Science at Universidad de los Andes and the grant FP80740-064-2016 of COLCIENCIAS. The authors are grateful to L. M. Garzón, P. Palacios, M. Gonzalez and J. Hernandez for their contributions to collecting the samples and to J. Oswaldo Turizo for designing and manufacturing the amphibian electrical stimulator. A.S. and X.D. acknowledge support from National Cancer Institute award U01CA235507. The authors are grateful to S. Neuman for feedback regarding the XCMS deconvolution tool.
P.C.D. is a scientific advisor for Sirenas, Galileo and Cybele, and a scientific advisor and cofounder of Enveda and Ometa; this has been approved by UC San Diego. M.W. is a consultant for Sirenas and the founder of Ometa Labs. A.A.A. is a consultant for Ometa Labs.
Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Cite this article
Aksenov, A.A., Laponogov, I., Zhang, Z. et al. Auto-deconvolution and molecular networking of gas chromatography–mass spectrometry data. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0700-3