Sampling alleles, genetic drift and effective population size

Learning objectives

  1. Genetic drift: qualitative expectation
  2. Genetic drift: building intuition
  3. Dissecting the variability in outcomes
  4. Genetic drift and heterozygosity decay
  5. Effective population size
  6. Historical note

For this tutorial, you must install the plot3D R package.

install.packages("plot3D")

Then, you must load it, as well as our package:

library(evolved)
library(plot3D)

Introduction

In previous labs, we introduced a simple way of applying probability rules to allele change through a small number of generations. In this lab, we will extrapolate qualitatively and quantitatively from those insights, and see what happens to genetic diversity over time.

Genetic drift: qualitative expectation

As you may know, Peter Buri did an experiment on genetic drift in 1956. He started with many flask populations of Drosophila melanogaster and applied a very simple method to keep track of genetic drift. In his own words (Buri, 1956):

“The method consists in observations on the changes in gene frequency between generations in each of a large number of small cultures in which the size of the breeding population is controlled. Each initial culture founds a line in which each successive generation is initiated with a random sample of uniform size taken from among the flies of the preceding generation. The form is that of a fairly large experimental population divided into a number of very small completely isolated subunits.”

Thus, the experiment involves sampling adults at random from a flask and placing them in a new flask where they will mate. Their offspring reach adulthood and are then themselves sampled, thus producing a new generation. All the while, we keep track of relative frequencies of genotypes through time.

Now, imagine you are in Buri’s lab and will run his experiment using 1000 flasks (he tracked 212 in total). When individuals from a flask become able to reproduce, you randomly and uniformly (i.e. all individuals are equally likely to be sampled) pick 8 virgin females from the same flask and generation to mate only once with 8 random males from the same flask and generation. Each mating is successful and generates many viable offspring.

Assume males and females do not diverge in genotypic frequencies. Recall also that Drosophila melanogaster is a diploid organism.

If you are not familiar with inserting an image into an RMarkdown file:

First, make sure your image is inside the current working directory, i.e. inside the reference folder that R will use when you knit the RMarkdown file. Then, you can use the following syntax, replacing file_name.jpg with the specific name of your image.

![](file_name.jpg)

Genetic drift: building intuition

Now we will do a computer simulation of what Buri did. We will use the rbinom() function to draw allele counts from a population of a diploid organism through time. The function does this by applying the following algorithm:

  1. Start with a certain probability p of sampling a specific allele (e.g., A1) based on that allele’s frequency in the population.
  2. Generate the next generation by taking random draws from a binomial distribution given probability p of sampling copies of allele A1.
  3. Recalculate the new allele frequency. You have just observed drift!

Remember that the binomial distribution applies to binary processes and their number of “successes”. Our binary process here is the sampling of alleles – as the organisms are diploid, each of their 2 alleles will either be A1 or A2, with no exceptions. Our number of “successes” is the total number of A1 alleles (i.e., the allele frequency, p) in the next flask.

So, a single flask in the first generation of our experiment can be simulated as:

N <- 32 #population size
n_alleles <- 2*N 
p_gen0 <- 0.25 #Frequency of allele A1 in the first gen
p_gen1 <- rbinom(1, n_alleles, p_gen0) / n_alleles
p_gen1
## [1] 0.328125

Now, to do the next generation, we just have to redo this process, but this time using p1, the allele frequency in generation 1, as the new probability of sampling A1 alleles:

p_gen2 <- rbinom(1, n_alleles, p_gen1) / n_alleles
p_gen2
## [1] 0.34375

What would the third generation be then?

p_gen3 <- rbinom(1, n_alleles, p_gen2) / n_alleles
p_gen3
## [1] 0.25

The fourth would be:

p_gen4 <- rbinom(1, n_alleles, p_gen3) / n_alleles
p_gen4
## [1] 0.25

The fifth would be:

p_gen5 <- rbinom(1, n_alleles, p_gen4) / n_alleles
p_gen5
## [1] 0.296875

What is happening with the code and with the result at each generation?

Let’s plot all results we have up to now. Note that your results will be different because you are generating different random numbers than us here.

generations <- seq(from = 0, to = 5, by = 1)
p_through_time <- c(p_gen0, p_gen1, p_gen2, p_gen3, p_gen4, p_gen5)
plot(generations, p_through_time, type="l", lwd = 2, col = "darkorchid3",
     ylab = "p", xlab = "generations", las = 1)

Repeat the above lines of code a few times to get a feeling for what is happening. Do all simulations behave the same way? Can you identify different “archetypes” of dynamics?

Instead of tediously copying and pasting code in the console, we can run large numbers of simulations quite easily using the function WFDriftSim():

#We will simulate five flasks through 10 generations:
WFDriftSim(Ne = 32, n.gen = 10, p0 = 0.5, n.sim = 5)

Run the above chunk of code above many times to get a feeling for it. How does changing Ne, n.gen, p0, and n.sim affect your simulations?

A function to calculate if a simulation’s alleles have fixed.

# R will not return anything in the console when you run this, but 
# after running you should have this function in your R session
# as an object. Note that if you name another object as "isFixed"
# the function will be overwritten.

isFixed <- function(p, tol = 0.00000000001){
    
    if (p <= tol | p >= (1 - tol)){
        return(TRUE)
    } else{
        return(FALSE)
    }

}

# Assuming your current allele frequency is stored in the p_t object, the
# line below uses the function created above to test if p_t is equal to 
# zero OR equal to one
isFixed(p_t)
# The function will return `TRUE` if p_t is equal to zero or one.

Dissecting the variability in outcomes

If our simulations contain many independent flasks (i.e. when n.sim > 40), it is hard to keep track of the whole history of allele frequency change in each flask. This is when the histogram to the right of the line plot in the output of WFDriftSim becomes particularly helpful. Take a look at the histogram section of the output plot after typing the following into your console:

WFDriftSim(Ne = 32, n.gen = 10, p0 = 0.5, n.sim = 100)


Genetic drift and heterozygosity decay

In lecture, we learned an analytic formula to calculate heterozygosity decay over time due to drift.

Example on how to store (and handle) your simulation results:

data = WFDriftSim(Ne = 8, n.gen = 50, p0 = 0.5, n.sim = 50,
                  print.data = T, plot.type = "none")

#then, if we want to see the 2nd to 5th generations of 
# the simulations number 14, 17 and 18, we type:
sims <- c(14,17,18)
gens <- 3:6
subset_data <- data[sims, gens]
subset_data
##         gen2   gen3   gen4   gen5
## sim14 0.6250 0.6875 0.7500 0.9375
## sim17 0.5625 0.5625 0.4375 0.5625
## sim18 0.3750 0.5000 0.5000 0.4375
#and we can also plot those results:

#first opening an empty plot:
plot(NA, ylab = "Alleleic frequency", xlab = "Generation",
     xlim = c(2, 5), ylim = c(0, 1))

#creating a set of colors to paint our lines
cols <- rainbow(n = nrow(subset_data))

#then we finally add the lines to our plot:
for (i in 1:nrow(subset_data)) {
  lines(x = gens - 1, y = subset_data[i,], col = cols[i])
}

Notice that the actual generation number does not match the column numbers!


Effective population size

We know that in real populations, the census size might be different from the effective population size.

Some food for thought:

Assuming an equal number of females and males, an equal number of generations, and bottlenecks of equal magnitude (equally drastic changes in census size), which of the options below has the greatest influence on Ne given equation (1)? Why?

  1. Having a large census size prior to the bottleneck
  2. Experiencing a bottleneck that is very brief
  3. Having a large census size after the bottleneck

A nice in-depth discussion of what the effective population size means can be found in Waples (2022).


Historical note

Kimura’s (1955) diffusion approximation for the distribution of allele frequencies in populations under drift is both elegant and difficult. Understanding his derivation requires some fairly sophisticated mathematics. However, there is a much easier way to generate and visualize the expectation for the distribution of allele frequencies across populations under drift, if we assume the population size is reasonably small. The solution described below involves a Markov chain model of drift.

Fundamentally, the probability model we will use is a binomial sampling model. This is nothing more than the “coin-toss” distribution: what is the probability of getting (say) k heads out of n coin tosses, given that the probability of heads on any given toss is p? More formally we denote this probability by P(k|n; p).

Under a binomial sampling model, we can estimate the probability P(k|n; p) using the binomial probability, or:

Let’s say we have a population with alleles A and a, with respective frequencies p and 1 − p. The relationship between the binomial model and genetic drift is that, for a finite population, we can use this to compute the exact probability that we will sample k alleles of a given type in a given generation, given an initial frequency p. So, for example, consider Buri’s genetic drift data, where 2N = 32, and where the initial allele frequencies were equal (p = 0.5). Given this starting configuration at time t0 (e.g., n = 32, p = 0.5), we can use the binomial probability to compute the exact probability of sampling k alleles in generation t1.

I’m going to switch notation here. For shorthand, I’m going to use the expression Pi, k to denote the probability that a population with k type A alleles has i copies after one generation of drift (sampling). This necessarily depends on a particular population size. For example, P13, 16 is the probability that a given generation of random mating in a Wright-Fisher population gives 13 A alleles given that the population currently has 16 such alleles, but obviously, this probability is critically dependent on the population size. For the Buri data, with 2N = 32, the probability P13, 16 corresponds to binomial parameters p = 0.5 and n = 32. But this probability would obviously differ for any other n.

Now, imagine forming a matrix of 2N + 1 columns and 2N + 1 rows. Here, each element (i, k) gives the probability – the transition probability – associated with sampling i alleles of type A in generation t + 1 given that you are currently at k A alleles in generation t. The matrix has 2N + 1 rows and columns because we also have an extra row and column to account for the state of 0 alleles.

$$\begin{bmatrix} P_{0,0} & P_{0,1} & P_{0,2} & \cdots & P_{0, n}\\ P_{1,0} & P_{1,1} & P_{1,2} & \cdots & P_{1, n}\\ P_{2,0} & P_{2,1} & P_{2,2} & \cdots & P_{2, n}\\ P_{3,0} & P_{3,1} & P_{3,2} & \cdots & P_{3, n}\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ P_{n,0} & P_{n,1} & P_{n,2} & \cdots & P_{n, n}\\ \end{bmatrix}$$

Of course, the probability of going from 0 A alleles to any number of alleles greater than 0 is also 0 (because 0 is an absorbing state – after the allele is lost in the population, it cannot increase in frequency). So the first column is necessarily zero, with the exception of P0, 0, which is 1 (if you have 0 A alleles at time t, you will have 0 A alleles at time t + 1 with probability 1). The same general idea holds for the last column: if you are at the maximum number of A alleles, e.g., n A alleles, then you can’t sample 2 in the following generation. In our example, with 2N = 32, both 0 and 32 are absorbing states. So the matrix is:

$$\begin{bmatrix} 1 & P_{0,1} & P_{0,2} & \cdots & 0\\ 0 & P_{1,1} & P_{1,2} & \cdots & 0\\ 0 & P_{2,1} & P_{2,2} & \cdots & 0\\ 0 & P_{3,1} & P_{3,2} & \cdots & 0\\ \vdots & \vdots & \vdots & \ddots & \vdots \\ 0 & P_{n,1} & P_{n,2} & \cdots & 1\\ \end{bmatrix}$$

Let’s call this matrix P. It is a transition matrix, and it has some really nice properties. This matrix includes the probabilities for all possible pairwise transitions between states. As such, the columns of the matrix must sum to 1. Once we’ve defined this matrix, we can fill in all of the individual elements with binomial probabilities. So, for example, assuming 2N = 32, we can fill in P3, 5 by first noting that the frequency of A at time t is $p = \frac{5}{32}$.

The amazing thing about the transition matrix P is that we can use it to compute the distribution of alleles after a generation of random mating, given some initial vector of allele counts N0. Here, N0 is a vector with the property that each element i of the vector gives the number of populations having i alleles. For example, N0[0] gives the number of initial populations with 0 A alleles. In the Buri data, with roughly 100 populations with equal allele frequencies, we would have N0[16] = 100 (100 populations, each with 16 A alleles at time t = 0) and all other elements of the vector equal to zero.

We can now construct a Markov chain, to compute the probability distribution of populations with a given allele count, as

where PN0 involves matrix multiplication of the column vector N0 by the square matrix P. N1 is the mean number of populations in each allele count category. For example, N1[7] is the number of populations in generation t = 1 that have 7 alleles (remember: the starting state was given by the vector N0). More generally, we have:

And we can see where this is going. It’s a recursive relationship whereby:

and

Etc etc. Thus, computing the distribution of allele frequencies can easily be done by iterating the equation given above, provided we have P and N0.

The R code below implements this model and uses it to compute the distribution of populations for each possible number of alleles, using Buri’s Drosophila study as a guide.

# A simple genetic drift simulator as a Markov process
N       <- 16
popsize <- 2*N

# Now we define a transition matrix, such that:
# Each element of the transition matrix is a transition probability
#   Specifically: element Pi,j is the probability of going from
#   j alleles to i alleles in a single sampling step 
#  

tmat <- matrix(0, nrow = popsize + 1, ncol = popsize + 1) # along rows: current allele count; along cols: future allele count

# Fill in elements with probabilities from the binomial distribution
# If you have j alleles in generation t-1, then this defines the probability
# of sampling (0,1....2N) alleles in the next generation, under a binomial 
# sampling process:

# Note also that first column and row are absorbing states of 0

# Here is the vector of frequencies:
probvec <- (1:popsize) / popsize  

# What is the probability of getting to 0 allele copies, given 
# that you have 1? This is P01, and the transition probability 
# would be computed using 1/2N (=1/32) as the sampling probability 
# for the allele. 
# Here, this is computed using the function dbinom, which is a shortcut 
#   to the analytical binomial probability:
dbinom(1, popsize, prob = 1/32)
## [1] 0.3737345
# Now we will set up the full matrix by iterating this over rows:

for (ii in 1:nrow(tmat)){
    tmat[ii, 2:ncol(tmat)] <- dbinom(ii - 1, size = popsize, prob = probvec)
}

# Note: we ignore the first column and let almost all elements equal zero, 
# because the probability of sampling k alleles given that you currently have zero
# is zero everywhere except the special case of P00, which is 1 
# (if you have zero now, you will have zero in the next gen with probability 1)
# To address this, we manually set element P00 equal to 1:
tmat[1,1] <- 1

#Now, all the columns should sum to 1:
colSums(tmat)
##  [1] 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1 1
# Pt <- tmat^2*P0 

# Initial vector of population frequencies:
# Like Buri, suppose we do an experiment with 100 populations
# each with equal allele frequencies (p = q = 0.5) initially.
p_init <- rep(0, popsize + 1)

# note, in this indexing:
#   p_init[1] is the number of populations with 0 alleles
#   p_init[2] is the number with 1 allele
#   p_init[popsize + 1] is the number with 2N alleles

# we want the initial population to have 2N/2 = N p alleles (for 50:50 proportion of alleles)
#  so we fill in the N+1 element of p_init vector with the 
#  number of populations

p_init[N+1] <- 100
#Now, we will introduce drift in our simulation
# after 1 generation of random mating in Wright-Fisher (WF) population, 
#  the distribution of populations is:
tmat %*% p_init
 
# and after 2 generations, it is:
tmat %*% (tmat %*% p_init)

# and after 3 generations, it is:
tmat %*% (tmat %*% (tmat %*% p_init))
# This is easy enough to iterate.
# Here, we will make a matrix to hold 
# the results, for up to ngen generations

ngen    <- 20
freqmat <- matrix(NA, nrow=popsize+1, ncol=ngen)

curr_pop <- p_init 

for (ii in 1:ngen){
    
    # recompute the new pop frequency distribution
    curr_pop <- tmat %*% curr_pop
    # store it:
    freqmat[,ii] <- curr_pop
}



plot3D::persp3D(x = c(0, probvec), z=freqmat, theta=45, phi=20, contour =F,
                xlab="Allele frequency", zlab="Number of populations",
                ylab="Generations")

# Plotting selected generations:
########## 
plot.new()
par(oma=c(1,1,1,1))
par(mfrow=c(2,2))
par(mar=c(2,2,2,0))

pfx <- function(gen){
    plot.new()
    plot.window(xlim=c(0,33), ylim=c(0, 15))
    axis(1, at=seq(-4, 36, by=4))
    axis(2, at=seq(-2, 16, by=2), las=1)
    lines(x=c(16,16), y=c(0, 18), lwd=5, col="gray60")
    title(main = paste("After", gen, ifelse(gen == 1, "gen", "gens")))
}

# Expectation after 1 generation of drift:
pfx(1)
points(x=0:32, y=freqmat[,1], pch=21, bg="red", cex=1.5)

# Expectation after 5 generations of drift
pfx(5)
points(x=0:32, y=freqmat[,5], pch=21, bg="red", cex=1.5)

# Expectation after 10 generations of drift
pfx(10)
points(x=0:32, y=freqmat[,10], pch=21, bg="red", cex=1.5)

# Expectation after 15 generations of drift
pfx(15)
points(x=0:32, y=freqmat[,15], pch=21, bg="red", cex=1.5)

The above visually resembles figure 6 in Buri (1956) – but please remember this is an analytical expectation.

A little more on the history of this mathematical model can be found in the initial parts of Tran et al (2013) – but note that the mathematics that is developed in the later parts of the paper may be challenging. The original references on the Wright-Fisher population model are Fisher (1922) and Wright (1931).


References:

Buri, P. (1956). Gene frequency in small populations of mutant Drosophila. Evolution, 367-402.

Fisher RA (1922) On the dominance ratio. Proc. R. Soc. Edinb 42:321–341

Kimura, M. (1955). Solution of a process of random genetic drift with a continuous model. Proceedings of the National Academy of Sciences of the United States of America, 41(3), 144.

Tran, T. D., Hofrichter, J., & Jost, J. (2013). An introduction to the mathematical structure of the Wright–Fisher model of population genetics. Theory in Biosciences, 132(2), 73-82. [good for the historical review, math can be challenging]

Waples, R. S. (2022). What is Ne, anyway?. Journal of Heredity, 113(4), 371–379.

Wright S (1931) Evolution in Mendelian populations. Genetics 16:97–159