MPL resolves genetic linkage in fitness inference from complex evolutionary histories

Abstract

Genetic linkage causes the fate of new mutations in a population to be contingent on the genetic background on which they appear. This makes it challenging to identify how individual mutations affect fitness. To overcome this challenge, we developed marginal path likelihood (MPL), a method to infer selection from evolutionary histories that resolves genetic linkage. Validation on real and simulated data sets shows that MPL is fast and accurate, outperforming existing inference approaches. We found that resolving linkage is crucial for accurately quantifying selection in complex evolving populations, which we demonstrate through a quantitative analysis of intrahost HIV-1 evolution using multiple patient data sets. Linkage effects generated by variants that sweep rapidly through the population are particularly strong, extending far across the genome. Taken together, our results argue for the importance of resolving linkage in studies of natural selection.

Data availability

Raw data used in our analysis are available in the GitHub repository located at https://github.com/bartonlab/paper-MPL-inference. Source data are provided with this paper.

Code availability

Code used in our analysis is available in the GitHub repository located at https://github.com/bartonlab/paper-MPL-inference. The repository also contains Jupyter notebooks that can be run to reproduce the results presented here. The source code is shared under the GPL-3.0 license (https://github.com/bartonlab/paper-MPL-inference/blob/master/LICENSE-GPL). An executable version is also provided on Code Ocean at https://codeocean.com/capsule/3400567/tree (ref. 30), distributed under the GPL-3.0 license (https://opensource.org/licenses/gpl-license/).

References

  1. Bignell, G. R. et al. Signatures of mutation and selection in the cancer genome. Nature 463, 893–898 (2010).
  2. Greaves, M. & Maley, C. C. Clonal evolution in cancer. Nature 481, 306–313 (2012).
  3. Burrell, R. A., McGranahan, N., Bartek, J. & Swanton, C. The causes and consequences of genetic heterogeneity in cancer evolution. Nature 501, 338–345 (2013).
  4. Nik-Zainal, S. et al. The life history of 21 breast cancers. Cell 149, 994–1007 (2012).
  5. Landau, D. A. et al. Evolution and impact of subclonal mutations in chronic lymphocytic leukemia. Cell 152, 714–726 (2013).
  6. Łuksza, M. et al. A neoantigen fitness model predicts tumour response to checkpoint blockade immunotherapy. Nature 551, 517–520 (2017).
  7. McMichael, A. J., Borrow, P., Tomaras, G. D., Goonetilleke, N. & Haynes, B. F. The immune response during acute HIV-1 infection: clues for vaccine development. Nat. Rev. Immunol. 10, 11–23 (2010).
  8. Allen, T. M. et al. Selective escape from CD8+ T-cell responses represents a major driving force of human immunodeficiency virus type 1 (HIV-1) sequence diversity and reveals constraints on HIV-1 evolution. J. Virol. 79, 13239–13249 (2005).
  9. Zanini, F. et al. Population genomics of intrapatient HIV-1 evolution. eLife 4, e11282 (2015).
  10. Strelkowa, N. & Lässig, M. Clonal interference in the evolution of influenza. Genetics 192, 671–682 (2012).
  11. Łuksza, M. & Lässig, M. A predictive fitness model for influenza. Nature 507, 57–61 (2014).
  12. Muller, H. J. The relation of recombination to mutational advance. Mut. Res. 1, 2–9 (1964).
  13. Smith, J. M. & Haigh, J. The hitch-hiking effect of a favourable gene. Genet. Res. 23, 23–35 (1974).
  14. Hegreness, M., Shoresh, N., Hartl, D. & Kishony, R. An equivalence principle for the incorporation of favorable mutations in asexual populations. Science 311, 1615–1617 (2006).
  15. Lang, G. I. et al. Pervasive genetic hitchhiking and clonal interference in forty evolving yeast populations. Nature 500, 571–574 (2013).
  16. Tenaillon, O. et al. Tempo and mode of genome evolution in a 50,000-generation experiment. Nature 536, 165–170 (2016).
  17. Levy, S. F. et al. Quantitative evolutionary dynamics using high-resolution lineage tracking. Nature 519, 181–186 (2015).
  18. Bollback, J. P., York, T. L. & Nielsen, R. Estimation of $2N_e s$ from temporal allele frequency data. Genetics 179, 497–502 (2008).
  19. Malaspinas, A.-S., Malaspinas, O., Evans, S. N. & Slatkin, M. Estimating allele age and selection coefficient from time-serial data. Genetics 192, 599–607 (2012).
  20. Mathieson, I. & McVean, G. Estimating selection coefficients in spatially structured populations from time series data of allele frequencies. Genetics 193, 973–984 (2013).
  21. Feder, A. F., Kryazhimskiy, S. & Plotkin, J. B. Identifying signatures of selection in genetic time series. Genetics 196, 509–522 (2014).
  22. Lacerda, M. & Seoighe, C. Population genetics inference for longitudinally-sampled mutants under strong selection. Genetics 198, 1237–1250 (2014).
  23. Foll, M., Shim, H. & Jensen, J. D. WFABC: a Wright–Fisher ABC–based approach for inferring effective population sizes and selection coefficients from time-sampled data. Mol. Ecol. Resour. 15, 87–98 (2015).
  24. Ferrer-Admetlla, A., Leuenberger, C., Jensen, J. D. & Wegmann, D. An approximate Markov model for the Wright–Fisher diffusion and its application to time series data. Genetics 203, 831–846 (2016).
  25. Taus, T., Futschik, A. & Schlötterer, C. Quantifying selection with Pool-Seq time series data. Mol. Biol. Evol. 34, 3023–3034 (2017).
  26. Illingworth, C. J. R. & Mustonen, V. Distinguishing driver and passenger mutations in an evolutionary history categorized by interference. Genetics 189, 989–1000 (2011).
  27. Illingworth, C. J. R., Fischer, A. & Mustonen, V. Identifying selection in the within-host evolution of influenza using viral sequence data. PLoS Comput. Biol. 10, e1003755 (2014).
  28. Terhorst, J., Schlötterer, C. & Song, Y. S. Multi-locus analysis of genomic time series data from experimental evolution. PLoS Genet. 11, e1005069 (2015).
  29. Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. GitHub https://github.com/bartonlab/paper-MPL-inference (2020).
  30. Sohail, M. S., Louie, R. H. Y., McKay, M. R. & Barton, J. P. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Code Ocean https://doi.org/10.24433/CO.1795728.v1 (2020).
  31. Mustonen, V. & Lässig, M. Fitness flux and ubiquity of adaptive evolution. Proc. Natl Acad. Sci. USA 107, 4248–4253 (2010).
  32. Illingworth, C. J. R., Parts, L., Schiffels, S., Liti, G. & Mustonen, V. Quantifying selection acting on a complex trait using allele frequency time series data. Mol. Biol. Evol. 29, 1187–1197 (2011).
  33. Schraiber, J. G. A path integral formulation of the Wright–Fisher process with genic selection. Theor. Popul. Biol. 92, 30–35 (2014).
  34. Ewens, W. J. Mathematical Population Genetics 1: Theoretical Introduction (Springer Science & Business Media, 2012).
  35. Iranmehr, A., Akbari, A., Schlötterer, C. & Bafna, V. CLEAR: Composition of likelihoods for evolve and resequence experiments. Genetics 206, 1011–1023 (2017).
  36. Liu, M. K. P. et al. Vertical T cell immunodominance and epitope entropy determine HIV-1 escape. J. Clin. Invest. 123, 380–393 (2013).
  37. Moore, P. L. et al. Multiple pathways of escape from HIV broadly cross-neutralizing V2-dependent antibodies. J. Virol. 87, 4882–4894 (2013).
  38. Doria-Rose, N. A. et al. Developmental pathway for potent V1V2-directed HIV-neutralizing antibodies. Nature 509, 55–62 (2014).
  39. Liu, Y. et al. Selection on the human immunodeficiency virus type 1 proteome following primary infection. J. Virol. 80, 9519–9529 (2006).
  40. Neher, R. A. & Leitner, T. Recombination rate and selection strength in HIV intra-patient evolution. PLoS Comput. Biol. 6, e1000660 (2010).
  41. Batorsky, R. et al. Estimate of effective recombination rate and average selection coefficient for HIV in chronic infection. Proc. Natl Acad. Sci. USA 108, 5661–5666 (2011).
  42. Wang, S. et al. Manipulating the selection forces during affinity maturation to generate cross-reactive HIV antibodies. Cell 160, 785–797 (2015).
  43. Liao, H.-X. et al. Co-evolution of a broadly neutralizing HIV-1 antibody and founder virus. Nature 496, 469–476 (2013).
  44. Ganusov, V. V. et al. Fitness costs and diversity of the cytotoxic T lymphocyte (CTL) response determine the rate of CTL escape during acute and chronic phases of HIV Infection. J. Virol. 85, 10518–10528 (2011).
  45. Ganusov, V. V., Neher, R. A. & Perelson, A. S. Mathematical modeling of escape of HIV from cytotoxic T lymphocyte responses. J. Stat. Mech.: Theory Exp. 2013, P01010 (2013).
  46. Kessinger, T., Perelson, A. & Neher, R. Inferring HIV escape rates from multi-locus genotype data. Front. Immunol. 4, 252 (2013).
  47. Pandit, A. & de Boer, R. J. Reliable reconstruction of HIV-1 whole genome haplotypes reveals clonal interference and genetic hitchhiking among immune escape variants. Retrovirology 11, 11–56 (2014).
  48. Leviyang, S. & Ganusov, V. V. Broad CTL response in early HIV infection drives multiple concurrent CTL escapes. PLoS Comput. Biol. 11, e1004492 (2015).
  49. Beerenwinkel, N., Günthard, H. F., Roth, V. & Metzner, K. J. Challenges and opportunities in estimating viral genetic diversity from next-generation sequencing data. Front. Microbiol. 3, 329 (2012).
  50. Turajlic, S., Sottoriva, A., Graham, T. & Swanton, C. Resolving genetic heterogeneity in cancer. Nat. Rev. Genet. 20, 404–416 (2019).
  51. Good, B. H., McDonald, M. J., Barrick, J. E., Lenski, R. E. & Desai, M. M. The dynamics of molecular evolution over 60,000 generations. Nature 551, 45–50 (2017).
  52. Kouyos, R. D., Althaus, C. L. & Bonhoeffer, S. Stochastic or deterministic: what is the effective population size of HIV-1? Trends Microbiol. 14, 507–511 (2006).
  53. Cocco, S., Feinauer, C., Figliuzzi, M., Monasson, R. & Weigt, M. Inverse statistical physics of protein sequences: a key issues review. Rep. Prog. Phys. 81, 032601 (2018).
  54. Socolich, M. et al. Evolutionary information for specifying a protein fold. Nature 437, 512–518 (2005).
  55. Weigt, M., White, R. A., Szurmant, H., Hoch, J. A. & Hwa, T. Identification of direct residue contacts in protein–protein interaction by message passing. Proc. Natl Acad. Sci. USA 106, 67–72 (2009).
  56. Morcos, F. et al. Direct-coupling analysis of residue coevolution captures native contacts across many protein families. Proc. Natl Acad. Sci. USA 108, E1293–E1301 (2011).
  57. Russ, W. P., Lowery, D. M., Mishra, P., Yaffe, M. B. & Ranganathan, R. Natural-like function in artificial WW domains. Nature 437, 579–583 (2005).
  58. Ferguson, A. L. et al. Translating HIV sequences into quantitative fitness landscapes predicts viral vulnerabilities for rational immunogen design. Immunity 38, 606–617 (2013).
  59. Mann, J. K. et al. The fitness landscape of HIV-1 gag: advanced modeling approaches and validation of model predictions by in vitro testing. PLoS Comput. Biol. 10, e1003776 (2014).
  60. Figliuzzi, M., Jacquier, H., Schug, A., Tenaillon, O. & Weigt, M. Coevolutionary landscape inference and the context-dependence of mutations in beta-lactamase TEM-1. Mol. Biol. Evol. 33, 268–280 (2015).
  61. Barton, J. P. et al. Relative rate and location of intra-host HIV evolution to evade cellular immunity are predictable. Nat. Commun. 7, 11660 (2016).
  62. Hopf, T. A. et al. Mutation effects predicted from sequence co-variation. Nat. Biotechnol. 35, 128–135 (2017).
  63. Louie, R. H. Y., Kaczorowski, K. J., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Fitness landscape of the human immunodeficiency virus envelope protein that is targeted by antibodies. Proc. Natl Acad. Sci. USA 115, E564–E573 (2018).
  64. Quadeer, A. A., Louie, R. H. Y. & McKay, M. R. Identifying immunologically-vulnerable regions of the HCV E2 glycoprotein and broadly neutralizing antibodies that target them. Nat. Commun. 10, 2073 (2019).
  65. Quadeer, A. A., Barton, J. P., Chakraborty, A. K. & McKay, M. R. Deconvolving mutational patterns of poliovirus outbreaks reveals its intrinsic fitness landscape. Nat. Commun. 11, 377 (2020).
  66. Kimura, M. Diffusion models in population genetics. J. Appl. Probab. 1, 177–232 (1964).
  67. Tataru, P., Bataillon, T. & Hobolth, A. Inference under a Wright-Fisher model using an accurate beta approximation. Genetics 201, 1133–1141 (2015).
  68. He, Z., Beaumont, M. & Yu, F. Effects of the ordering of natural selection and population regulation mechanisms on Wright-Fisher models. G3: Genes, Genomes, Genetics 7, 2095–2106 (2017).
  69. Tataru, P., Simonsen, M., Bataillon, T. & Hobolth, A. Statistical inference in the Wright-Fisher model using allele frequency data. Syst. Biol. 66, e30–e46 (2017).
  70. Risken, H. The Fokker–Planck Equation: Methods of Solution and Applications 2nd edn (Springer, 1989).
  71. Gaschen, B., Kuiken, C., Korber, B. & Foley, B. Retrieval and on-the-fly alignment of sequence fragments from the HIV database. Bioinformatics 17, 415–418 (2001).
  72. Korber, B. et al. in Human Retroviruses and AIDS (eds Korber, B. et al.) 102–111 (Los Alamos National Laboratory, 1998).
  73. Zanini, F., Puller, V., Brodin, J., Albert, J. & Neher, R. A. In vivo mutation rates and the landscape of fitness costs of HIV-1. Virus Evol. 3, vex003 (2017).

Acknowledgements

We thank A.K. Chakraborty, C.J.R. Illingworth, B. Lee and J.G. Schraiber for helpful discussions and comments on the manuscript. The work of M.S.S., R.H.Y.L. and M.R.M. was supported by the Hong Kong Research Grants Council under grant number 16234716. M.S.S. and M.R.M. were also supported by the Hong Kong Research Grants Council under grant number 16201620, while R.H.Y.L. was also supported by Australia’s National Health and Medical Research Council under grant number APP1121643. The work of J.P.B. reported in this publication was supported by the National Institute of General Medical Sciences of the National Institutes of Health under award R35GM138233.

Author information

Author notes

  1. These authors contributed equally: Muhammad Saqib Sohail, Raymond H. Y. Louie.

Affiliations

  1. Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology, Hong Kong, China

    Muhammad Saqib Sohail, Raymond H. Y. Louie & Matthew R. McKay

  2. Institute for Advanced Study, Hong Kong University of Science and Technology, Hong Kong, China

    Raymond H. Y. Louie

  3. The Kirby Institute, University of New South Wales, Sydney, New South Wales, Australia

    Raymond H. Y. Louie

  4. School of Medical Sciences, University of New South Wales, Sydney, New South Wales, Australia

    Raymond H. Y. Louie

  5. Department of Chemical and Biological Engineering, Hong Kong University of Science and Technology, Hong Kong, China

    Matthew R. McKay

  6. Department of Physics and Astronomy, University of California, Riverside, Riverside, CA, USA

    John P. Barton

Contributions

All authors designed research, developed methods, analyzed data, interpreted results and wrote the paper.

Corresponding authors

Correspondence to Matthew R. McKay or John P. Barton.

Ethics declarations

Competing interests

The authors declare no competing interests.

Additional information

Publisher’s note Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Extended data

Extended Data Fig. 1 MPL accurately recovers selection coefficients from complex simulated evolutionary trajectories.

a, Trajectories of mutant allele frequencies over time exhibit complex dynamics in a Wright–Fisher (WF) simulation with a simple fitness landscape. b, Separate views of individual trajectories for beneficial, neutral, and deleterious mutants (left panel) and inferred selection coefficients (right panel) for a single simulation run. Note that many neutral mutations exhibit temporal variation similar to beneficial or deleterious mutations. MPL estimates the underlying selection coefficients used to generate these trajectories, presented as mean values ± one theoretical standard deviation, and distinguishes between beneficial, neutral, and deleterious mutations, using Eq. (11). Dashed lines mark the true selection coefficients. c, Distributions of selection coefficient estimates across n = 100 replicate simulations with identical parameters in the special case of perfect sampling. MPL is also robust to finite sampling constraints, accurately classifying beneficial (d) and deleterious (e) mutants even when the number of sequences sampled per time point $n_s$ is low, and the spacing between time samples Δt is large. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants with s = 0.025, 30 neutral mutants with s = 0, and ten deleterious mutants with s = −0.025. Mutation probability $\mu = 10^{-3}$, population size $N = 10^3$. Initial population composed of approximately equal numbers of three random founder sequences, evolved over T = 400 generations.
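
The simulation setup described in this caption can be illustrated with a short sketch. This is not the authors' implementation (their code is in the repository listed under Code availability); it is a minimal Wright–Fisher model under stated assumptions: additive fitness 1 + Σ s_i g_i, no recombination, and genotypes stored as integer bitmasks. The function name `simulate_wf` and its arguments are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_wf(s, N=1000, mu=1e-3, T=400, n_founders=3, seed=0):
    """Wright-Fisher sketch with selection and mutation (no recombination).

    s: list of per-locus selection coefficients (length L).
    Genotypes are integers; bit i set means locus i carries the mutant allele.
    Returns T + 1 lists of mutant allele frequencies, one per generation.
    Requires 0 < mu < 1.
    """
    rng = random.Random(seed)
    L = len(s)
    founders = [rng.getrandbits(L) for _ in range(n_founders)]
    pop = [founders[k % n_founders] for k in range(N)]  # roughly equal shares
    log_q = math.log(1.0 - mu)

    def fitness(g):
        # Assumption of this sketch: additive fitness over mutant loci.
        return 1.0 + sum(s[i] for i in range(L) if (g >> i) & 1)

    def frequencies(pop):
        counts = [0] * L
        for g, n in Counter(pop).items():
            for i in range(L):
                if (g >> i) & 1:
                    counts[i] += n
        return [c / N for c in counts]

    traj = [frequencies(pop)]
    for _ in range(T):
        # Selection and drift: resample N parents in proportion to fitness.
        w = {g: fitness(g) for g in set(pop)}
        pop = rng.choices(pop, weights=[w[g] for g in pop], k=N)
        # Mutation: each of the N * L sites flips with probability mu.
        # Geometric gaps between flipped sites avoid one draw per site.
        pos = -1
        while True:
            pos += 1 + int(math.log(1.0 - rng.random()) / log_q)
            if pos >= N * L:
                break
            pop[pos // L] ^= 1 << (pos % L)
        traj.append(frequencies(pop))
    return traj
```

With the caption's parameters, a call would look like `simulate_wf([0.025] * 10 + [0.0] * 30 + [-0.025] * 10)`; in pure Python this takes a few seconds.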

Extended Data Fig. 2 MPL improves selection inference for simulated data sets.

In Fig. 2, we showed the performance of MPL and existing methods on simulated test data, averaged over n = 100 replicate simulations with identical parameters. Here we show the improvement of MPL over existing methods for the classification of beneficial (a) and deleterious (b) mutations, and for the error in the estimated selection coefficients (c), for each individual simulation. Selection is more difficult to infer in some simulated data sets, but results from MPL show better agreement with the true parameters in the vast majority of simulations. Simulation parameters. L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (with s = 0.1 for complex, s = 0.025 for simple), 30 neutral mutants (s = 0 for both scenarios), and ten deleterious mutants (s = −0.1 for complex, s = −0.025 for simple). Mutation probability $\mu = 10^{-4}$, population size $N = 10^3$. For the complex case, the initial population is composed of equal numbers of five random founder sequences, evolved over T = 310 generations. Recorded trajectory used for inference begins at generation 10. For the simple case, the initial population begins with all WT sequences, evolved over T = 1000 generations.

Extended Data Fig. 3 MPL performs well in the presence of recombination.

a, Classification performance of MPL is robust to variation in per locus recombination probability, r. Results are shown for n = 100 independent Monte-Carlo runs. The lower and upper edges of each box correspond to the 25th and 75th percentiles, the bar corresponds to the median, and the top and bottom whiskers show the maximum and minimum values within 1.5× the interquartile range from the box. Linkage effects in the data decrease as the recombination probability increases. As a measure of the linkage disequilibrium in the data, we plot the histograms (b) of the covariance ($x_{ij} - x_i x_j$) of mutant allele frequencies integrated over time (300 generations) for a range of recombination probabilities. The number of mutant pairs with strong pairwise covariance values decreases with increasing values of r, indicating lower linkage disequilibrium. Simulation parameters. Same as those of the simple scenario used in Fig. 2, that is, L = 50 loci with two alleles at each locus (mutant and WT): ten beneficial mutants (s = 0.025), 30 neutral mutants (s = 0), and ten deleterious mutants (s = −0.025). Mutation probability $\mu = 10^{-4}$, population size $N = 10^3$, $r \in \{0, 10^{-5}, 10^{-4}, 10^{-3}\}$. The initial population begins with all WT sequences, evolved over T = 300 generations.
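
The time-integrated covariance plotted in panel b can be computed along the following lines. This is a sketch under assumptions: sequences are lists of 0/1 values (1 = mutant), and the time integral uses the trapezoid rule between sampled time points, which may differ from the integration scheme in the authors' code. The function names are illustrative.

```python
def covariance(seqs):
    """Pairwise covariance x_ij - x_i * x_j of mutant alleles in one sample.

    seqs: list of equal-length 0/1 sequences observed at one time point.
    """
    n, L = len(seqs), len(seqs[0])
    x = [sum(seq[i] for seq in seqs) / n for i in range(L)]
    return [[sum(seq[i] * seq[j] for seq in seqs) / n - x[i] * x[j]
             for j in range(L)]
            for i in range(L)]

def integrated_covariance(samples, times):
    """Integrate the covariance matrix over time (trapezoid rule, an assumption).

    samples: list of samples (one per time point), each a list of 0/1 sequences.
    times: matching list of sampling times in generations.
    """
    L = len(samples[0][0])
    covs = [covariance(seqs) for seqs in samples]
    C = [[0.0] * L for _ in range(L)]
    for k in range(len(times) - 1):
        dt = times[k + 1] - times[k]
        for i in range(L):
            for j in range(L):
                C[i][j] += 0.5 * dt * (covs[k][i][j] + covs[k + 1][i][j])
    return C
```

Off-diagonal entries far from zero indicate pairs of mutants that are persistently found together (positive) or on competing backgrounds (negative).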

Extended Data Fig. 4 Performance of MPL on data with HIV-1-like sampling profiles.

a, The number of sequences per time point ns are drawn from a binomial distribution with n = 1000 and p = 0.0139, with the same mean as that of the HIV data. b, The time between samples is drawn from a mixture of two gamma distributions f(x;k,θ), where k and θ are the shape and scale parameters. The mixture distribution has the form w1 × (f(x;k1,θ1) + m1) + w2 × (f((k2θ2 + m2x);k2,θ2) + m2) where m1 = 0, m2 = 120, are constants added to shift the mean, k1 = 3.5, k2 = 3, θ1 = 8.4, θ2 = 2, while w1 = 0.87, and w2 = 0.13 are the mixing weights. The parameters were chosen to mimic the distribution of the time between samples of the HIV data analyzed in the manuscript (Supplementary Table 1). c, The number of generations used for inference is also drawn from a mixture of two gamma distributions, having the form given above and with parameters k1 = 5.5, k2 = 15, θ1 = 7.2, θ2 = 8, m1 = 5, m2 = 143, w1 = 0.21, and w2 = 0.79. The parameters were chosen to mimic the distribution of the trajectory lengths of the HIV data analyzed in the manuscript (Supplementary Table 1). d, A typical sampled trajectory of allele frequencies: beneficial (red), deleterious (blue) and neutral (gray). Dashed lines indicate the sampling time-points. e, The AUROC performance of identifying beneficial and deleterious selection coefficients under perfect and heterogeneous sampling scenarios. Results are evaluated for those sites that are polymorphic in the heterogeneous sampling case. Results are shown for n = 100 independent Monte-Carlo runs. The lower and upper edge of the boxplot correspond to the 25th to 75th percentiles, the bar corresponds to the median while the top and bottom whiskers show the maximum and minimum value within 1.5× the interquartile range from the boxplot. 
Simulation parameters: population size N = 1000, L = 50 loci with two alleles at each locus (mutant and WT), ten beneficial mutants with selection coefficients s uniformly distributed over the range [0.075, 0.125], 30 neutral mutants with s = 0, and ten deleterious mutants with selection coefficients uniformly distributed over the range [−0.125, −0.075], mutation probability per site per generation $\mu = 10^{-4}$, and recombination probability per site per generation $r = 10^{-4}$.
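
The finite-sampling step in panel a can be sketched as follows, using the binomial parameters given in the caption (n = 1000, p = 0.0139, so the mean number of sequences per time point is 13.9). The helper names are hypothetical, and drawing sequences without replacement is an assumption of this sketch. The gamma-mixture distributions of panels b and c are not reproduced here.

```python
import random

def draw_sample_size(rng, n=1000, p=0.0139):
    """Binomial(n, p) draw as a sum of n Bernoulli trials; mean is n * p."""
    return sum(rng.random() < p for _ in range(n))

def sampled_frequencies(population, n_s, rng):
    """Mutant allele frequencies estimated from n_s sequences.

    population: list of equal-length 0/1 sequences (1 = mutant).
    Sequences are drawn without replacement; n_s must be at least 1.
    """
    sample = rng.sample(population, n_s)
    L = len(sample[0])
    return [sum(seq[i] for seq in sample) / n_s for i in range(L)]
```

A draw of zero sequences is possible in principle, though very unlikely with these parameters, so a caller would need to redraw or skip that time point.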

Extended Data Fig. 5 Most genetic variants have little effect on inferred selection at other sites, but a small minority have strong effects.

After computing the pairwise effects $\Delta \hat{s}_{ij}$ of each variant i on the inferred selection coefficient for each other variant j, referred to as the target, we summed the absolute value of the $\Delta \hat{s}_{ij}$ values over all target variants j to quantify the influence of each variant i on selection at other sites. One histogram is shown for each sequencing region, for each individual. For the vast majority of variants, the total effect on selection at other sites is near zero. However, a small minority have strong effects. We defined a variant to be ‘highly influential’ if the sum of the absolute values of the $\Delta \hat{s}_{ij}$ over all targets j was larger than 0.4 (=40%).
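
The 'highly influential' criterion in this caption amounts to a row sum with a threshold. A minimal sketch, assuming the pairwise effects are already available as a square nested list in which entry [i][j] is the effect of variant i on target j (the function name and this matrix layout are illustrative):

```python
def highly_influential(delta_s, threshold=0.4):
    """Total influence of each variant and the indices exceeding the threshold.

    delta_s[i][j]: effect of variant i on the inferred selection coefficient
    of target variant j. The diagonal (i == j) is excluded because the
    caption sums over *other* variants only.
    """
    n = len(delta_s)
    influence = [sum(abs(delta_s[i][j]) for j in range(n) if j != i)
                 for i in range(n)]
    flagged = [i for i, total in enumerate(influence) if total > threshold]
    return influence, flagged
```

For example, a variant with effects of 0.3 and −0.2 on two targets has total influence 0.5 and is flagged, whereas one with effects of 0.01 and 0 is not.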

Extended Data Fig. 6 Variants that strongly influence inferred selection at other sites often act across large genomic distances.

Plot of all linkage effects on inferred selection coefficients $\Delta \hat{s}_{ij}$ for which $|\Delta \hat{s}_{ij}| > 0.004$. One plot is shown for each sequencing region, for each individual. These strong effects of linkage on inferred selection coefficients can act at long range across the genome. Approximately 40% of highly influential variants, characterized by strong effects on inferred selection at other sites, lie within identified CD8+ T cell epitopes. The 5′ region for individual CH607 is not shown because no $\Delta \hat{s}_{ij}$ values are larger than the cutoff.

Extended Data Fig. 7 For most variants, effects on inferred selection coefficients for other variants, and linkage disequilibrium, are stronger at smaller genomic distances.

a, Histogram of the absolute value of linkage effects on inferred selection coefficients for other variants $|\Delta \hat{s}_{ij}|$, divided into subgroups based on the distance along the genome between variant i and target variant j. Consistent with intuition, the large effects on inferred selection coefficients occur most frequently for different variants that occur at the same site on the genome (that is, distance equal to zero). ‘Interactions’ between such variants are necessarily perfectly competitive because only a single nucleotide is allowed at each position in the genetic sequence. For most variants, stronger linkage effects on inferred selection coefficients are more frequently observed for other variants within a distance of ten base pairs (bp). Large linkage effects for pairs of variants within a distance of 30 bp, the approximate length of a linear T cell epitope, occur appreciably more frequently than for pairs of variants at greater genomic distances. However, there is little difference in the distribution of linkage effect sizes for pairs of variants that are between 31 bp and 100 bp apart compared to pairs of variants that are more than 100 bp apart. Nonetheless, some strong linkage effects on inferred selection are observed at long genomic distances (see Fig. 4 and Supplementary Fig. 5). b, Linkage disequilibrium, measured by the absolute value of the off-diagonal entries of the integrated allele frequency covariance matrix, $C_{\mathrm{int}}$. Like the $|\Delta \hat{s}_{ij}|$, linkage decays along with the distance between variants along the genome. However, we note that linkage disequilibrium values in general appear to be more long-ranged.

Extended Data Fig. 8 Estimates of selection coefficients in a simple example of clonal interference.

a, Two escape mutations arise in the TW10 epitope targeted by individual CH58 and compete for dominance. b, MPL infers that both TW10 escape variants are positively selected. Estimates based only on the trajectories of individual variants infer substantial positive selection solely for the 1514A variant that fixes. The magnitude of selection inferred with the independent model is also smaller than that inferred by MPL. c, Inferred selection in the HIV-1 5′ half-genome sequence for CH58. Inferred selection coefficients are plotted in tracks. Coefficients of transmitted/founder nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the TW10 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.

Extended Data Fig. 9 Estimates of selection coefficients in a complex example of clonal interference.

a, Multiple escape variants for the Nef epitope EV11, targeted by individual CH131, interfere with one another over the course of nearly one year. Here we have omitted the trajectories for transient variants with a deletion at sites 8988a-8988c, which are insertions with respect to the HXB2 reference sequence. b, MPL infers that all nonsynonymous EV11 escape variants are positively selected. Variants 9000C and 9006T are both synonymous, and are inferred to be nearly neutral by MPL. As in previous examples, inferences using only the trajectories of individual variants infer substantial positive selection solely for variants that are polymorphic at the final time point, or where the transmitted/founder (TF) allele at the same site appears strongly selected against. In the latter case, positive selection is inferred because all selection coefficients are normalized such that the selection coefficient for the TF variant is zero. This is why the independent model infers 8988T to be beneficial despite its low frequency at the final time point. Note that the independent model also infers the synonymous mutation 9000C to be beneficial. c, Inferred selection in the HIV-1 3′ half-genome sequence for CH131. Inferred selection coefficients are plotted in tracks. Coefficients of TF nucleotides are normalized to zero. Tick marks denote polymorphic sites. Inner links, shown for sites connected to the EV11 epitope, have widths proportional to matrix elements of the inverse of the integrated covariance. Linked sites affect selection estimates within the epitope.

Extended Data Fig. 10 Inferred selection coefficients across patients using different conventions for data processing.

Inferred selection coefficients are highly similar following different choices for processing the sequence data. Pearson $R^2$ values between inferred selection coefficients range from 0.97 to 1.00, with an average of 0.99. Data processing conventions. Reference: current data processing conventions. Max Δt = 200/400: remove time points that are more than 200/400 days beyond the last included time point (reference: 300 days). Max gap freq. = 80%/99%: remove sites where >80%/99% of observed variants are gaps (reference: 95%). Max gap num. = 50/500: remove sequences with >50/500 gaps in excess of subtype consensus (reference: 200). Min seqs. = 2/6: remove time points with <2/6 available sequences (reference: 4). Remove ambiguous: remove sequences that contain ambiguous nucleotides if any other nucleotide variation is observed at the same site. LTR, long terminal repeat.
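
Two of the processing conventions listed above can be sketched in code, using the reference thresholds from the caption (minimum of 4 sequences per time point, maximum gap of 300 days, maximum gap frequency of 95%). This is an illustration, not the authors' pipeline: the data layout (a dict from sampling day to aligned sequence strings), the gap character "-", the order in which the filters are applied, and the function names are all assumptions.

```python
def filter_time_points(samples, min_seqs=4, max_dt=300):
    """Keep time points with enough sequences and no overly long gap.

    samples: dict mapping sampling day -> list of aligned sequences.
    A time point is dropped if it has fewer than min_seqs sequences, or if
    it lies more than max_dt days after the last time point that was kept.
    """
    kept = {}
    last = None
    for day in sorted(samples):
        seqs = samples[day]
        if len(seqs) < min_seqs:
            continue
        if last is not None and day - last > max_dt:
            continue
        kept[day] = seqs
        last = day
    return kept

def sites_to_keep(seqs, max_gap_freq=0.95, gap="-"):
    """Indices of alignment columns whose gap frequency is at most max_gap_freq."""
    n = len(seqs)
    return [i for i in range(len(seqs[0]))
            if sum(seq[i] == gap for seq in seqs) / n <= max_gap_freq]
```

Because every later time point is even further from the last kept one, the Δt rule effectively truncates the series at the first gap longer than `max_dt`.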

About this article

Cite this article

Sohail, M.S., Louie, R.H.Y., McKay, M.R. et al. MPL resolves genetic linkage in fitness inference from complex evolutionary histories. Nat Biotechnol (2020). https://doi.org/10.1038/s41587-020-0737-3
