A pan-cancer landscape of somatic mutations in non-unique regions of the human genome

Abstract

A substantial fraction of the human genome displays high sequence similarity with at least one other genomic sequence, posing a challenge for the identification of somatic mutations from short-read sequencing data. Here we annotate genomic variants in 2,658 cancers from the Pan-Cancer Analysis of Whole Genomes (PCAWG) cohort with links to similar sites across the human genome. We train a machine learning model to use signals distributed over multiple genomic sites to call somatic events in non-unique regions and validate the data against linked-read sequencing in an independent dataset. Using this approach, we uncover previously hidden mutations in ~1,700 coding sequences and in thousands of regulatory elements, including in known cancer genes, immunoglobulins and highly mutated gene families. Mutations in non-unique regions are consistent with mutations in unique regions in terms of mutation burden and substitution profiles. The analysis provides a systematic summary of the mutation events in non-unique regions at a genome-wide scale across multiple human cancers.
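As a rough illustration of how signals distributed over multiple genomic sites might be combined, the following minimal sketch pools read counts from a candidate site and the sites linked to it before computing allele fractions. It is not the model trained in the study: the `SiteCounts` structure, the function names and the choice of features are hypothetical and only show the kind of pooled input a classifier could receive.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class SiteCounts:
    """Read counts at one genomic site (hypothetical structure)."""
    tumor_alt: int
    tumor_depth: int
    normal_alt: int
    normal_depth: int


def pooled_vaf(sites: List[SiteCounts], sample: str) -> float:
    """Variant allele fraction for `sample` ('tumor' or 'normal'),
    computed from counts summed over all given sites."""
    alt = sum(getattr(s, f"{sample}_alt") for s in sites)
    depth = sum(getattr(s, f"{sample}_depth") for s in sites)
    return alt / depth if depth else 0.0


def pooled_features(candidate: SiteCounts,
                    linked: List[SiteCounts]) -> Dict[str, float]:
    """Illustrative features that pool evidence across a candidate site
    and its linked similar sites (assumed feature set, not the study's)."""
    sites = [candidate, *linked]
    return {
        "n_sites": float(len(sites)),
        "tumor_vaf": pooled_vaf(sites, "tumor"),
        "normal_vaf": pooled_vaf(sites, "normal"),
    }


# Example: alternate reads are split between two similar sites.
features = pooled_features(
    SiteCounts(tumor_alt=3, tumor_depth=30, normal_alt=0, normal_depth=30),
    [SiteCounts(tumor_alt=5, tumor_depth=30, normal_alt=0, normal_depth=30)],
)
```

In this toy example each site alone carries only a few supporting reads, while the pooled tumor allele fraction reflects the evidence from both sites and the pooled normal fraction stays at zero.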

Data availability

The PCAWG dataset is available through the ICGC data portal, https://dcc.icgc.org/pcawg. Somatic mutations called in this study are available at https://www.synapse.org/#!Synapse:syn22297877.

Acknowledgements

This work is supported by The Francis Crick Institute, which receives its core funding from Cancer Research UK (FC001202), the UK Medical Research Council (FC001202) and the Wellcome Trust (FC001202). M.T. was supported as a postdoctoral fellow by the European Union’s Horizon 2020 research and innovation program (Marie Skłodowska-Curie grant agreement 747852-SIOMICS) and is a postdoctoral researcher of the F.R.S.-FNRS. J.D. is a postdoctoral fellow of the Research Foundation, Flanders (FWO). A.M.F. is an NIHR senior investigator and is supported by the National Institute for Health Research, UCLH Biomedical Research Centre and the CRUK Experimental Cancer Centre. P.V.L. is a Winton Group Leader in recognition of the Winton Charitable Foundation’s support toward the establishment of The Francis Crick Institute. T.K. would like to thank D. Smedley. This project was enabled through the Crick Scientific Computing STP and through access to the MRC eMedLab Medical Bioinformatics infrastructure, supported by the UK Medical Research Council (grant number MR/L016311/1). The Bone Cancer Research Trust funded sample biobanking.

Author information

Author notes

  1. These authors contributed equally: Peter Van Loo, Tomasz Konopka.

Affiliations

  1. The Francis Crick Institute, London, UK

    Maxime Tarabichi, Jonas Demeulemeester, Annelien Verfaillie, Peter Van Loo & Tomasz Konopka

  2. Institute for Interdisciplinary Research, Université Libre de Bruxelles, Brussels, Belgium

    Maxime Tarabichi

  3. Department of Human Genetics, KU Leuven, Leuven, Belgium

    Jonas Demeulemeester

  4. Research Department of Pathology, Cancer Institute, University College London, London, UK

    Adrienne M. Flanagan

  5. Department of Cellular and Molecular Pathology, Royal National Orthopaedic Hospital NHS Trust, Stanmore, UK

    Adrienne M. Flanagan

  6. William Harvey Research Institute, Queen Mary University of London, London, UK

    Tomasz Konopka

Contributions

All authors edited and approved the final manuscript. M.T. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses and performed data visualization. J.D. performed bioinformatics analyses of linked-read data. A.V. generated short-read and linked-read data. A.M.F. provided tumor samples and performed pathology assessments. P.V.L. wrote the first draft of the paper, designed experiments and supervised the study jointly with T.K. T.K. wrote the first draft of the paper, designed experiments, performed statistical analyses, performed bioinformatics analyses, performed data visualization and supervised the study jointly with P.V.L.

Corresponding authors

Correspondence to Maxime Tarabichi, Peter Van Loo or Tomasz Konopka.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Peer review information: Nature Biotechnology thanks the anonymous reviewers for their contribution to the peer review of this work.

Publisher’s note: Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Supplementary information

Supplementary Information

Supplementary Figs. 1–28.

Supplementary Table 1

Summary of z-scores and histology specificity for all genomic regions. The table contains summary statistics for all genes in the annotation set. Counts, z-scores and entropy scores are provided based on non-hypermutated samples.
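To make the table's columns more concrete, here is a minimal sketch of how a z-score and an entropy-based histology-specificity score could be computed from mutation counts. The exact definitions used for Supplementary Table 1 are not given here, so both functions and their inputs are assumptions: the sketch uses a plain population z-score and the Shannon entropy of a region's mutation counts across histologies, where low entropy indicates mutations concentrated in few histologies.

```python
import math
from typing import List, Sequence


def shannon_entropy(counts: Sequence[float]) -> float:
    """Shannon entropy in bits of a vector of non-negative counts.
    Returns 0.0 when all counts fall in one bin or the vector sums to zero."""
    total = sum(counts)
    if total == 0:
        return 0.0
    probs = [c / total for c in counts if c > 0]
    return -sum(p * math.log2(p) for p in probs)


def z_scores(values: Sequence[float]) -> List[float]:
    """Population z-scores: (value - mean) / standard deviation.
    Returns zeros when the values have no spread."""
    n = len(values)
    mean = sum(values) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return [(v - mean) / sd if sd else 0.0 for v in values]


# Hypothetical mutation counts for one region across four histologies.
specific = shannon_entropy([4, 0, 0, 0])   # all mutations in one histology
uniform = shannon_entropy([1, 1, 1, 1])    # evenly spread across histologies
```

Under these assumed definitions, a region mutated in a single histology has the lowest possible entropy, and a region mutated evenly across histologies has the highest.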

About this article

Cite this article

Tarabichi, M., Demeulemeester, J., Verfaillie, A. et al. A pan-cancer landscape of somatic mutations in non-unique regions of the human genome. Nat Biotechnol (2021). https://doi.org/10.1038/s41587-021-00971-y
