---
title: "Allele frequency change and selection"
author: "Matheus Januario, Andressa Viol, and Daniel Rabosky"
date: "Jan 2024"
output: rmarkdown::html_vignette
vignette: >
 %\VignetteIndexEntry{popgen_selection}
 %\VignetteEngine{knitr::rmarkdown}
 %\VignetteEncoding{UTF-8}
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
options(rmarkdown.html_vignette.check_title = FALSE)
```

# Learning objectives

1. Breeding effective population size
2. Mutation-drift equilibrium
3. Selection
4. The case of the peppered moths

First, we load our packages:
```{r}
library(evolved)
```

# Introduction

We already saw in other popgen vignettes that due to sampling only, allele frequencies are predicted to change with time in a finite population -- a process we call genetic drift. Still, many things _other than drift_ also change allele frequency. We will work with them today.

# Breeding effective population size

Following Wright (1931), the breeding effective size is calculated as:

\begin{equation}
N_e = \frac{4N_m N_f}{N_m+N_f}
\end{equation}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 1:\\

What ratio of $N_f$ to $N_m$ maximizes $N_e$, relative to the census size? Explain or show your work.\\

Hint 1: Try plotting $N_e$ as a function of $\frac{N_f}{N_m}$. You can create "fake data" for this and plot it with R graphics.\\

Hint 2: Alternatively, you can also try to solve it analytically by considering that $N_{census} = N_f + N_m$.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 1:\\

What happens to the above equation under extreme reproductive skew (i.e. one male monopolizes and mates with all females)? Write an equation that represents this. Explain your work.\\

Then, consider what is the limit of this equation, meaning: what is the effective population size when the number of females goes to infinity?

\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 2:\\

In birds, offspring sex is determined by their chromosomal system. Females are the heterogametic sex, having a pair of dissimilar ZW chromosomes, and males have two similar ZZ chromosomes. If the effective size for an autosomal marker is $N_e$, what is the effective size of a Z-linked sequence (a.k.a. a marker), and what is the effective size of a W-linked marker?
\end{minipage}}

# Mutation-drift equilibrium

\fbox{\begin{minipage}{48em}
Question 2:\\

Qualitatively, what is the relationship between the number of new alleles produced per generation and the mutation rate? Assume all mutations produce a unique allele and also an effectively infinite number of possible alleles.\\

Hint: Don't worry about numeric values, but describe the relationship as precisely as you can -- in other words, think about the shape of the relationship.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 3:\\

A population with 100,000,000 individuals that was previously at mutation-drift equilibrium undergoes a severe population contraction to only 1,000 individuals. This new census size remains until you stop sampling that population. Assuming that the population conforms to Wright-Fisher assumptions (except during the bottleneck), plot the heterozygosity through time if the mutation rate is $\mu = 10^{-5}$. Assume census size is equal to effective population size during the two phases of population size. Do not forget to plot the initial heterozygosity. 
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 4:\\

Consider the heterozygosity after the bottleneck. Answer using at maximum two sentences: why would the temporal behavior of heterozygosity, as modeled in question 3, be stable if we saw in the drift vignette that the heterozygosity is expected to decay with time?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 5:\\

In a population, the probability of picking two alleles that are identical-by-state is 0.75. Assuming mutation-drift equilibrium, what is the effective size of the population?\\

Hint: We are expecting an equation as the answer. You can keep the $\mu$ term as a constant in your answer.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 3:\\ 

Using R, plot 2 qualitatively different temporal patterns of census size that could result in the same overall $N_e$. Don't forget to add code showing that the two effective sizes are the same.
\end{minipage}}

# Selection

\fbox{\begin{minipage}{48em}
Question 6:\\

Consider a population of a diploid organism, initially at HWE frequencies where allele $B$ has frequency $p$. At some point, the environment changes. Although population size is constant, the adult $b$ homozygotes have ninety percent of the survivorship that $B$ homozygotes and heterozygotes have.\\

What is the relative fitness of each genotype?
\end{minipage}}

Using the function `NatSelSim()` (from `evolved`) you can explore how selection occurs in a bi-allelic gene of a diploid organism, given a set of arguments: `p0` is the initial frequency of the focal allele (labeled $A_1$ in the output plots), `w11` is the fitness of the homozygote for the focal allele, and `w12` is the heterozygote fitness. The function also requires the number of generations simulated (`n.gen`).

\fbox{\begin{minipage}{48em}
Question 7:\\ 

If $q = 0.9$ before the environmental change discussed in question 6, and considering the next 100 generations, answer:\\

(A) What command line would simulate the above scenario, using the function NatSelSim()?\\Hint: There are two ways to describe this situation in the function; make sure you are clear which allele you are assigning as $A_1$. It may be helpful to plot the scenario from both perspectives ($B$ as focal allele and $b$ as focal allele).\\ 

(B) Include the simulated plot in your lab. Briefly describe the behavior of:\\

(I) $q$ through each generation of selection, including comments on the rate of change.\\
(II) the frequency of genotypes through each generation of selection\\
(III) the mean population fitness at each generation of selection\\
\\
Hint: Comment a bit on the role of the heterozygote in the simulations. 
\end{minipage}}

The question above describes a situation where the dominant allele has higher fitness. But there are many other possibilities. Tweak the parameters to explore how selection behaves in different situations.

\fbox{\begin{minipage}{48em}
Question 8:

Describe mathematically or verbally which values for parameters w11, w12, and w22 would simulate a situation with:\\

(A) Heterozygote disadvantage\\
(B) A recessive lethal genetic disease\\
(C) A codominant disadvantageous allele\\
(D) Heterozygote advantage\\

Note, this question does not require the actual simulations to be run.

\end{minipage}}

Now, simulate all those biological scenarios. As we said above, before running the code, try to come up with expectations about how the system will behave -- this is fundamental for you to build intuition on how selection works. Try to also vary the initial frequency -- why is this important?

Do your results correspond to your hypotheses?

# The case of the peppered moths

For a long time, melanic individuals of _Biston betularia_ (the peppered moth) were considered to be rare. The first dark morph was deposited in a museum only in 1811 (Berry, 1990), 53 years after the species was first described. The first live specimen was caught 37 years after that by R.S. Edleston in 1848. Only 16 years later, however, Edleston would say that the dark (a.k.a. _carbonaria_) morph was the most common morph that he caught in his garden (Berry, 1990). 

At the time many hypotheses were drawn to explain this phenomenon (Cook & Turner, 2020), but a very good hypothesis is that light-colored morphs were able to camouflage easier in lichenous tree barks (see figure 1), while the dark morph was easily seen by birds on this background. However, after the industrial revolution, many of the lichen died, exposing the darker tree bark underneath and supposedly changing which morph was most visible to birds.

```{r, echo=F, fig.cap="Light and carbonaria morphs of the peppered moth on the same lichenous tree. Image from wikicommons.", out.width=500}
knitr::include_graphics('2_morphs.jpeg')
```

Almost a century ago, J. B. S. Haldane performed some calculations on peppered moth morphs to estimate the selection coefficient ($s$) that natural populations face (Haldane, 1924). Below, you will use a similar approach to also estimate $s$, departing from the same data and assumptions that Haldane used.

Haldane made his estimate using analytic calculations, but we can approach his results by using a procedure similar to the one below:

(I) Multiply the frequency of each genotype in the population by its relative fitness, obtaining the frequency of genotypes that will actually reproduce.
(II) Get the new allele frequency _after_ selection. Then, assume those genotypes mate randomly, which will lead to offspring genotypes that follow frequencies expected by HWE and the new allelic frequency (the one after selection).
(III) Repeat steps I-II until selection for that characteristic is over (i.e., $s = 0$) or there is no genetic variation anymore. Note that this is no different than what the function `NatSelSim()` does "under its hood" (which you can check by typing the function name without the parenthesis in the R console).

We need some data to help us calculate our estimates. Drawing from the literature, Haldane assumed that the **carbonaria morph alelle had a frequency of roughly 1% in 1848 but that its frequency had increased to 99% after 50 generations**.

Taking Haldane's assumptions for granted, and using the procedure explained above, we are going to estimate the intensity of selection favoring the dark morph. To do that:

**(1)** Assume that the _carbonaria_ allele is dominant, and begin with the allele frequency given above for the _carbonaria_ allele before the industrial revolution.

**(2)** Then, use step I-III above to compute the final frequency of the allele after 50 generations of selection. After each simulation, your value of $s$ will be fixed. 

**(3)** Do many simulations across a range of relative fitness values for the light morph allele. 

**(4)** Using the final $p$ frequencies from all simulations, you will generate a "profile plot", where your x-axis should be the relative fitness of light morph, and your y-axis should be the final frequency of the carbonaria allele after 50 generations.

Below, we designed some questions that will guide you through this procedure.

\fbox{\begin{minipage}{48em}
Question 9:\\

If the dark morph has 1.1x higher fitness than the light morph, what are the relative fitnesses of each genotype? \\

Graduate students should note that this is an important idea to understand because the literature will report fitness advantages sometimes as relative advantage and sometimes as relative disadvantage. So, being able to convert between them is an important skill.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 10:\\

Without executing any code, answer: will the model result be different if you use advantage of the dark morph or disadvantage of the light morph as the basis for your calculation?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 4:\\

Based on the information we drew from the literature, Haldane (1924) concluded that the selective advantage of the dark allele "must be fifty percent greater than that of the recessives". Do you agree with that? Why?\\

Hint: Consider how you would make a plot that shows how the selective advantage covaries with the final frequency of the carbonaria allele. Take another look at point (4) above.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Challenge (extra credit question) 5:\\

If the carbonaria allele is recessive, does your above answer change? If so, how?
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 11:\\

What would happen to the peppered moth population if one of the morphs had, for a very long time, an absolute (i.e. not relative) fitness of 0.8? Could this occur given what we know about the peppered moth population changes over time? Justify your answer.

Hint: Remember the population growth curves you made in lab 01.
\end{minipage}}

\fbox{\begin{minipage}{48em}
Question 12:\\

Now, integrating results from all the labs up to now, discuss why selection is expected to be more efficient in promoting evolutionary change in populations with larger effective size.
\end{minipage}}

Since Haldane did his approximation, subsequent research has greatly improved our knowledge of peppered moth evolution. The fact that we have few good temporal records from the period in which melanic individuals were increasing in frequency obscures many details, but several possibilities were considered for the observed change, from linkage disequilibrium of melanic alleles to heterozygote advantage (Cook & Turner, 2020). Still, Haldane's point is very clear: **selection can be really strong in natural populations!** However, even though the melanic morph had very high frequency at the early 20th century, this allele was _not_ fixed. Influx of alleles via mating with migrants from elsewhere were not considered a good hypothesis at the time but, much later, DNA-based evidence in favor of that hypothesis was found (Cook & Turner, 2020). Are you surprised that natural populations, even in such "simple" scenarios, are _that_ complex?

# References

Berry, R. J. (1990). Industrial melanism and peppered moths (Biston betularia (L.)). Biological Journal of the Linnean Society, 39(4), 301-322.

Cook, L. M., & Turner, J. R. (2020). Fifty per cent and all that: what Haldane actually said. Biological Journal of the Linnean Society, 129(3), 765-771.

Edleston, R. S., 1864. Amptydasis betularia. The Entomologist, 2: 150

Haldane JBS. 1924. A mathematical theory of natural
and artificial selection. Transactions of the Cambridge
Philosophical Society 23: 19–41.

Wright, S. (1931). Evolution in Mendelian populations. Genetics, 16(2), 97.