Improved metagenome binning and assembly using deep variational autoencoders

Abstract

Despite recent advances in metagenomic binning, reconstruction of microbial species from metagenomics data remains challenging. Here we develop variational autoencoders for metagenomic binning (VAMB), a program that uses deep variational autoencoders to encode sequence coabundance and k-mer distribution information before clustering. We show that a variational autoencoder is able to integrate these two distinct data types without any previous knowledge of the datasets. VAMB outperforms existing state-of-the-art binners, reconstructing 29–98% and 45% more near-complete (NC) genomes on simulated and real data, respectively. Furthermore, VAMB is able to separate closely related strains up to 99.5% average nucleotide identity (ANI), and reconstructed 255 and 91 NC Bacteroides vulgatus and Bacteroides dorei sample-specific genomes as two distinct clusters from a dataset of 1,000 human gut microbiome samples. We use 2,606 NC bins from this dataset to show that species of the human gut microbiome have different geographical distribution patterns. VAMB can be run on standard hardware and is freely available at https://github.com/RasmussenLab/vamb.
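The two input data types named in the abstract, k-mer distribution and coabundance across samples, can be illustrated with a small sketch of how a per-contig feature vector might be assembled before it is passed to an encoder. This is not VAMB's implementation. The function names are hypothetical, the sketch counts all 256 raw tetranucleotides without merging reverse complements, and it scales per-sample depths to sum to one as an assumed normalization. The variational autoencoder and the clustering step are omitted.

```python
from itertools import product


def kmer_frequencies(seq, k=4):
    """Return normalized k-mer frequencies in lexicographic k-mer order."""
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = dict.fromkeys(kmers, 0)
    seq = seq.upper()
    for i in range(len(seq) - k + 1):
        kmer = seq[i:i + k]
        if kmer in counts:  # windows containing ambiguous bases are skipped
            counts[kmer] += 1
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


def normalize_abundance(depths):
    """Scale per-sample depths of one contig so that they sum to one."""
    total = sum(depths)
    if total == 0:
        return [0.0] * len(depths)
    return [d / total for d in depths]


def contig_features(seq, depths, k=4):
    """Concatenate k-mer frequencies and normalized abundances (illustrative only)."""
    return kmer_frequencies(seq, k) + normalize_abundance(depths)


# Hypothetical contig with mean read depths in two samples.
features = contig_features("ACGTACGTAC", [2.0, 6.0])
print(len(features))  # 256 tetranucleotide frequencies + 2 abundance values
```

Concatenating the two blocks into one vector mirrors the idea of letting a single model see both signals at once, rather than combining two separately computed distances.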

Data availability

The sequence data used in this study are publicly available from either the respective studies or ENA. The semisynthetic MetaHIT dataset was downloaded from https://portal.nersc.gov/dna/RD/Metagenome_RD/MetaBAT/Files/ as the files depth.txt.gz and assembly-filtered.fa.gz. The simulated CAMI High and CAMI2 datasets were downloaded from https://data.cami-challenge.org/participate from ‘Toy Test Dataset High_Complexity’ and ‘2nd CAMI Toy Human Microbiome Project Dataset’, respectively. The de novo assemblies of the Almeida dataset were obtained through personal communication with A. Almeida and R. D. Finn, and the reads were downloaded from ENA as specified in their publication. The data and results of binning the MetaHIT, CAMI2 and Almeida datasets, as well as the source data for Figs. 1–3, are available on figshare at https://figshare.com/projects/VAMB/72677. A CodeOcean capsule of VAMB v.3.0.1, including the six training and test datasets for reproducing benchmarking results, is available from https://doi.org/10.24433/CO.2518623.v1. Source data are provided with this paper.

Code availability

All code can be found on GitHub at https://github.com/RasmussenLab/vamb and is freely available under the permissive MIT license. All analyses were performed using VAMB v.3.0.1. The code is also available as a CodeOcean capsule at https://doi.org/10.24433/CO.2518623.v1.

Acknowledgements

We thank A. Almeida and R. D. Finn for sharing de novo assemblies of the 1,000 gut microbiome samples that we used for benchmarking VAMB. We thank C. Titus Brown for his source code contribution to the VAMB software package. J.N.N., J.J., R.L.A., L.J.J. and S.R. were supported by the Novo Nordisk Foundation (grant NNF14CC0001). S.R. was supported by the Jorck Foundation Research Award.

Author information

Affiliations

  1. Department of Health Technology, Technical University of Denmark, Lyngby, Denmark

    Jakob Nybo Nissen & Jose Juan Almagro Armenteros

  2. Novo Nordisk Foundation Center for Protein Research, Faculty of Health and Medical Sciences, University of Copenhagen, Copenhagen, Denmark

    Jakob Nybo Nissen, Joachim Johansen, Rosa Lundbye Allesøe, Lars Juhl Jensen & Simon Rasmussen

  3. Bioinformatics Centre, Department of Biology, University of Copenhagen, Copenhagen, Denmark

    Casper Kaae Sønderby, Christopher Heje Grønbech & Ole Winther

  4. Department of Applied Mathematics and Computer Science, Technical University of Denmark, Lyngby, Denmark

    Christopher Heje Grønbech & Ole Winther

  5. Clinical-Microbiomics A/S, Copenhagen, Denmark

    Henrik Bjørn Nielsen

  6. National Food Institute, Technical University of Denmark, Lyngby, Denmark

    Thomas Nordahl Petersen

  7. Center for Genomic Medicine, Copenhagen University Hospital, Copenhagen, Denmark

    Ole Winther

Contributions

S.R. conceived the study and guided the analysis. J.N.N., S.R., J.J. and R.L.A. performed the analyses. J.N.N. wrote the software. C.K.S., J.J.A.A., C.H.G., T.N.P., L.J.J., H.B.N. and O.W. provided guidance and input for the analysis. J.N.N., L.J.J. and S.R. wrote the manuscript with contributions from all coauthors. All authors read and approved the final version of the manuscript.

Corresponding author

Correspondence to Simon Rasmussen.

Ethics declarations

Competing interests

H.B.N. is employed at Clinical-Microbiomics A/S. The remaining authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Nissen, J.N., Johansen, J., Allesøe, R.L. et al. Improved metagenome binning and assembly using deep variational autoencoders. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-020-00777-4
