A unified haplotype-based method for accurate and comprehensive variant calling

Abstract

Almost all haplotype-based variant callers were designed specifically for detecting common germline variation in diploid populations, and give suboptimal results in other scenarios. Here we present Octopus, a variant caller that uses a polymorphic Bayesian genotyping model capable of modeling sequencing data from a range of experimental designs within a unified haplotype-aware framework. Octopus combines sequencing reads and prior information to phase-call genotypes of arbitrary ploidy, including those with somatic mutations. We show that Octopus accurately calls germline variants in individuals, including single nucleotide variants, indels and small complex replacements such as microinversions. Using a synthetic tumor data set derived from clean sequencing data from a sample with known germline haplotypes and observed mutations in a large cohort of tumor samples, we show that Octopus is more sensitive to low-frequency somatic variation, yet calls considerably fewer false positives than other methods. Octopus also outputs realigned evidence BAM files to aid validation and interpretation.
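To make the idea of Bayesian genotyping over haplotypes at arbitrary ploidy concrete, the following is a minimal sketch of a textbook genotype posterior calculation. It is not Octopus's actual model, which is not described on this page. The function names (`logsumexp`, `genotype_posteriors`), the per-read haplotype log-likelihood input, the optional `log_prior` callback and the assumption that each read is drawn uniformly from the haplotype copies in a genotype are all illustrative assumptions.

```python
from itertools import combinations_with_replacement
from math import exp, log


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    peak = max(values)
    return peak + log(sum(exp(v - peak) for v in values))


def genotype_posteriors(read_log_liks, haplotypes, ploidy, log_prior=None):
    """Posterior over unordered genotypes (multisets of haplotypes).

    read_log_liks: one dict per read, mapping haplotype -> log P(read | haplotype).
    log_prior: hypothetical callback returning log P(genotype); flat prior if None.
    """
    genotypes = list(combinations_with_replacement(haplotypes, ploidy))
    log_joint = {}
    for genotype in genotypes:
        total = log_prior(genotype) if log_prior else 0.0
        for read in read_log_liks:
            # Assumption: a read comes from each haplotype copy with probability 1/ploidy.
            total += logsumexp([read[h] for h in genotype]) - log(ploidy)
        log_joint[genotype] = total
    norm = logsumexp(list(log_joint.values()))
    return {g: exp(lp - norm) for g, lp in log_joint.items()}


# Toy data: five reads that fit "ref" well and five that fit "alt" well.
ref_read = {"ref": log(0.99), "alt": log(0.01)}
alt_read = {"ref": log(0.01), "alt": log(0.99)}
reads = [ref_read] * 5 + [alt_read] * 5

diploid = genotype_posteriors(reads, ["ref", "alt"], ploidy=2)
triploid = genotype_posteriors(reads, ["ref", "alt"], ploidy=3)
best = max(diploid, key=diploid.get)
```

With balanced read support the heterozygous diploid genotype receives the highest posterior in this toy setup. Changing `ploidy` only changes the set of candidate genotypes that is enumerated, which is the sense in which a single framework can cover different ploidies.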

Data availability

All germline data used in this manuscript are publicly available from GIAB, Precision FDA and ENA. Links are provided in Supplementary Note 1. Trio data from the WGS500 project are available from the European Nucleotide Archive under accession no. PRJEB9151 (samples AW_SC_4654, AW_SC_4655 and AW_SC_4659). The synthetic-tumor data have been deposited in the Sequence Read Archive under BioProject accession no. PRJNA694520. The corresponding truth sets have been deposited to figshare (https://doi.org/10.6084/m9.figshare.13902212).

Code availability

Octopus source code and documentation are freely available under the MIT licence from https://github.com/luntergroup/octopus. Custom code used for data analysis is available from https://github.com/luntergroup/octopus-paper.

Acknowledgements

This work was supported by the Wellcome Trust Genomic Medicine and Statistics PhD Program (grant no. 203735/Z/16/Z to D.P.C.). The computational aspects of this research were supported by the Wellcome Trust Core Award (grant no. 203141/Z/16/Z) and the NIHR Oxford BRC. The views expressed are those of the author(s) and not necessarily those of the NHS, the NIHR or the Department of Health.

Author information

Affiliations

  1. MRC Weatherall Institute of Molecular Medicine, University of Oxford, Oxford, UK

    Daniel P. Cooke & Gerton Lunter

  2. Manchester Cancer Research Centre, University of Manchester, Manchester, UK

    David C. Wedge

  3. Department of Epidemiology, University Medical Centre Groningen, Groningen, the Netherlands

    Gerton Lunter

Contributions

D.P.C. and G.L. designed the algorithm and wrote the manuscript. D.P.C. implemented the algorithm and performed the evaluation. D.C.W. provided data for the synthetic tumors and critically reviewed the manuscript.

Corresponding author

Correspondence to Daniel P. Cooke.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Biotechnology thanks Federico Abascal and the other, anonymous, reviewer(s) for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

About this article

Cite this article

Cooke, D.P., Wedge, D.C. & Lunter, G. A unified haplotype-based method for accurate and comprehensive variant calling. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00861-3
