qpWave and qpAdm

library(magrittr)
library(tidyverse)
library(admixtools)

qpWave and qpAdm are programs which can answer a number of questions about the relationships of different populations, such as:

Do populations form clades with respect to each other?
What is the minimum number of admixture waves that separates two groups of populations?
Can a population be modeled as a mix of a set of other populations, or would this miss an ancestry component?
What are the relative proportions of the different admixing populations?

The first two questions can be answered by qpWave. The latter two questions are about a specific target population, and they can be answered by qpAdm, an extension of qpWave.

This page describes how both programs use $f_4$-statistics to address these questions. The original description of qpWave and qpAdm which goes into more technical detail can be found in Supplementary Information section 10 in this paper.

qpWave

qpWave and f4

$f_4$-statistics can be used to test whether two pairs of populations ($L_1, L_2$ and $R_1, R_2$) form clades with respect to one another: If $f_4(L_1,L_2; R_1,R_2)$ is significantly different from 0, the two groups do not form a clade. If we have two groups with more than two populations each, and we want to know if the groups form two clades, we could compute all possible $f_4$-statistics and test if any of them deviate from zero. Except we would have to correct for multiple testing, which is not straight-forward in this case because the $f_4$-statistics are not independent of each other.

qpWave provides another way to test whether two groups of populations form two clades. We can empirically see that qpWave and $f_4$ are closely related, because in the special case of applying qpWave to only four populations, we get almost the same p-value as for an $f_4$-statistic of the four populations:

L = c('Altai_Neanderthal.DG', 'Vindija.DG')
R = c('Chimp.REF', 'Mbuti.DG')
f4(example_f2_blocks, L[1], L[2], R[1], R[2])$p

## [1] 0.3576872

qpwave(example_f2_blocks, L, R)$rankdrop$p

## ℹ Computing f4 stats...
## ℹ Computing number of admixture waves...

## [1] 0.3577113

Estimating the number of independent gene flows

Testing whether two pairs of populations form clades with each other is the same as asking whether there is only one gene flow event which separates the two pairs ($f_4 = 0$) or whether there must have been more than a single gene flow event between the two pairs ($f_4 != 0$).

In the graph on the left, $L1$ and $L2$ form a clade relative to $R1$ and $R2$, which means $L$ and $R$ are only connected through one gene flow event. On the right they do not form a clade, and they are connected through two independent gene flow events (one going through the root, and the other going through the edge n1b|n2b).

qpWave allows us to estimate a lower bound on the number of gene flow events that separate two groups of populations $L$ and $R$ of size $n_L$ and $n_R$.

The basic idea behind it is that we can use $f_4$-statistics of the form $f_4(L_i, L_j; R_m, R_n)$ to estimate the genetic drift that is shared between pairs of left populations and pairs of right populations. There are $\frac{n_L(n_L-1)}{2} \frac{n_R(n_R-1)}{2}$ $f_4$-statistics of that form, but not all of them contain unique information. In the case where $L$ and $R$ form two clades, all $f_4$-statistics $f_4(L_i, L_j; R_m, R_n)$ are expected to be zero. If there are two independent gene flow events, a few of these $f_4$-statistics will be enough to determine what the other $f_4$-statistics should be. More generally, we can construct a matrix $X$ of these $f_4$-statistics, and the rank of $X$ tells us what the minimum number of gene flows is between $L$ and $R$. A high rank of $X$ suggests many independent gene flows between $L$ and $R$.

In practice, $X$ is constructed by fixing one population in $L$ and one population in $R$ which results in a matrix of dimensions $(n_L - 1) \times (n_R - 1)$ with all $f_4$-statistics of the form $f_4(L_1, L_i; R_1, R_j)$. By convention, $n_R > n_L$, which means that the highest possible rank of $X$ is $n_L - 1$. A rank of $n_L - 1$ suggests that at least $n_L - 1$ independent gene flows have occurred between $L$ and $R$.

Estimating the rank of X

To estimate the rank of $X$ we test how well $X$ can be approximated by different low rank approximations. An approximation of $X$ with one rank less than full rank can be written as $AB$, where $A$ is $(n_L-1) \times (n_L-2)$, and $B$ is $(n_L-2) \times (n_R-1)$ and $AB$ should be as close as possible to $X$. In this case (one rank less than full rank), $X$ would have rank $n_L - 1$, and $AB$ would have rank $n_L - 2$. If this approximation works well, the residuals $E = X - AB$ will be very small. Large residuals suggest that the low rank approximation $AB$ doesn’t capture all the information in $X$ and that a simple model with only few admixture events doesn’t fit the data. How large these residuals can become before a model can be rejected depends on the standard errors of the $f_4$-statistics. If we have data from a large number of SNPs, these standard errors will be small, which gives us more power to reject models.

To get low rank approximations of $X$, we start with a singular value decompositions (SVD) of $X$, which gives us precursors of the matrices $A$ and $B$. Because we cannot ignore the covariance among the $f_4$-statistics ($Q$), $A$ and $B$ need to be further adjusted. The initial SVD estimates of $A$ and $B$ already minimize the squared sum of the residuals $E = X - AB$. What is actually minimized in the end is the quadratic form $E' Q^{-1} E$ ($E$ is a vector of length $(n_L - 1)(n_R - 1)$, and $Q$ is a square matrix of dimension $(n_L - 1)(n_R - 1) \times (n_L - 1)(n_R - 1)$). This quantity is closely related to the likelihood of the model, $L(A,B) = -\frac{1}{2} E'Q^{-1}E$. The difference between the log likelihoods of two models (same populations, but one with higher rank than the other) approximately follows a $\chi^2$ distribution with $x$ degrees of freedom. From this $\chi^2$ distribution we can compute a p-value that tests whether we can reject simple models with only one or a few admixture waves.

Let’s consider two extreme cases:

First, $L$ and $R$ are two groups of populations which form clades with respect to each other. This means that all $f_4$-statistics of the form $f_4(L_i, L_j; R_m, R_n)$ are $0$, and hence $X$ is just a matrix full of $0s$ and rank has rank $0$.

The opposite extreme is when $X$ (which has dimensions $(n_L - 1) \times (n_R - 1)$) has full rank ($(n_L - 1)$, because $n_L$ <= $n_R$). This means that there is no population in $L$ which can be modeled as a combination of other populations in $L$. Each population in $L$ shares some amount of drift with populations in $R$, which is not captured by other populations in $L$.

This second extreme case is important, because it forms the basis for qpAdm: Once we have identified two groups $L$ and $R$ with this property, we can add an additional population $T$ and test whether this property is retained or not.

qpAdm

Modelling an admixed population

The qpWave section above describes how we can use a matrix of $f_4$-statistics ($X$) to test the number of gene flows that separate $L$ and $R$. The idea behind qpAdm is that we can say something about an additional population $T$ if we follow this approach twice: Once for a model with $L$ and $R$, and once for a model with $L \cup T$ (the union of $L$ and $T$) and $R$. If we find a higher number of gene flows for the model with $L \cup T$ and $R$ than for the model with only $L$ and $R$, this suggests that there was some gene flow between $T$ and $R$. Thus $T$ cannot be modeled as a combination of populations in $L$. This may hint at the existence of a yet undiscovered “ghost” population which is related to a population in $R$, but not represented by any of the populations in $L$. On the other hand, if both models have the same rank (no extra gene flow between $T$ and $R$), then $T$ can be modeled as a combination of populations in $L$, and we can estimate the relative contribution of each population.

qpAdm therefore produces two types of output that are of main interest: A p-value which tests whether there is a connection between $R$ and $T$, other than through $L$, and (if the answer to the last question is no), admixture weights which can tell us in what proportions populations in $L$ have contributed to $T$.

The null hypothesis tested by qpAdm is that the populations in $L$ can account for all shared drift between $T$ and $R$. A significant p-value rejecting this null hypothesis indicates that there has been gene flow between $T$ and $R$ (in either direction) that is not accounted for by any population in $L$.

The admixture weights are estimated by computing the left null space of the matrix $A$ (they satisfy $w A = 0$). This is basically asking in what proportions the populations in $L$ need to be added up so that their $f_4$-statistics will be equal to the $f_4$-statistics involving $T$. These weights will also be computed when the model is rejected (the p-value is significant), but in that case they can’t be interpreted as admixture weights. It is important to keep in mind that the true source populations of $T$ are almost certainly not identical to any of the populations in $L$, and that any inferences about the mixture proportions of the true, unobserved source populations depend on an unverifiable assumption: Each left population needs to be strictly cladal with one of the source populations.

qpadm in practice

The theory above maps onto a single function call. With f-statistics in example_f2_blocks, we can ask whether the Pleistocene European Switzerland_Bichon.SG can be modeled as a mix of an Ust’-Ishim-like ancestor (Russia_Ust_Ishim.DG, an early modern Eurasian) plus Neanderthal admixture from Vindija.DG, with the remaining populations as outgroups:

res = qpadm(example_f2_blocks,
            target = "Switzerland_Bichon.SG",
            left   = c("Russia_Ust_Ishim.DG", "Vindija.DG"),
            right  = c("Chimp.REF", "Mbuti.DG",
                       "Altai_Neanderthal.DG", "Denisova.DG"),
            verbose = FALSE)

The most-frequently-consumed pieces of the return value are the admixture weights and the rank-by-rank fit table:

res$weights

## # A tibble: 2 × 5
##   target                left                  weight      se      z
##   <chr>                 <chr>                  <dbl>   <dbl>  <dbl>
## 1 Switzerland_Bichon.SG Russia_Ust_Ishim.DG  1.01    0.00407 247.  
## 2 Switzerland_Bichon.SG Vindija.DG          -0.00714 0.00407  -1.75

res$rankdrop

## # A tibble: 2 × 7
##   f4rank   dof    chisq      p dofdiff chisqdiff p_nested
##    <int> <int>    <dbl>  <dbl>   <int>     <dbl>    <dbl>
## 1      1     2     5.45 0.0655       4    52390.        0
## 2      0     6 52395.   0           NA       NA        NA

weights$weight are the estimated proportions of each source. se is the standard error; the row weight / se = z follows. For this fit, essentially all of Bichon’s ancestry is modeled as Ust’-Ishim-like, with a small (statistically negligible) Vindija contribution - consistent with the published biology that the Neanderthal trickle into Pleistocene Europeans is largely already carried by the Ust’-Ishim-like ancestor in this set.
rankdrop$p[1] is the p-value for the model at the auto-selected rank (the top row, length(left) - 1). A small p rejects the null hypothesis that the right populations are connected to target only through the sources. Here p ~ 0.07: the 2-source model is not rejected at alpha = 0.05.
rankdrop$p_nested[1] compares the full 2-source model to the simpler one-rank-lower model (the cladal alternative, where target is cladal with one of the sources). A small value rejects the cladal alternative in favor of the 2-source model. This is a structural test about model topology - distinct from the per-source weight z-scores. p_nested can reject the cladal alternative even when the second source’s weight is itself statistically small (as in this fit, where p_nested = 0 despite Vindija’s z = -1.75); the f4 patterns simply don’t fit the rank-0 topology.

When data is the prefix of a genotype file rather than precomputed f2-statistics, qpadm() reads from disk in parallel under any active future::plan() (see the parallelization vignette).

Diagnosing collinear right populations

qpadm() inverts a square matrix of f4 covariances. When two right populations are close to cladal with each other - for example, a sister-clade pair, or one right population too close to a source - that matrix becomes near-singular, the inversion falls back to a pseudo-inverse, and the returned standard errors look misleadingly tight. The fitted weights themselves remain interpretable; the uncertainty on them does not.

qpadm() surfaces this with two diagnostics on the returned list:

res$f4_var_rcond - the reciprocal condition number of the (fudged) f4 variance matrix. Values near machine epsilon (~1e-15) signal a degenerate variance matrix. A value below 1e-8 triggers an automatic warning.
res$f4_var_singular_loadings - when f4_var_rcond < 1e-8, this tibble surfaces which right populations are responsible. Mechanically: it measures linear dependence among the right populations’ own resampled f4 estimates (the same per-block leave-one-out estimates f4_var is built from) via their Gram matrix, then reports for each right population the L2 norm of its rows of the orthogonal projector onto the Gram’s near-null eigenspace (the square root of that projector’s diagonal). Because it uses the right populations’ signatures only, a singularity caused by the left (source) populations does not leak into the loadings; because it projects onto the whole near-null subspace rather than reading a single eigenvector, the loadings are identical across LAPACK/BLAS backends. Right pops at the top of the sorted tibble are the most-likely offenders - typically a sister-clade pair among the right set (two pops with similar high loadings); loadings that are all near zero mean the near-singularity comes from the left (the sources), not the right populations.

For pipelines that should fail loudly on degenerate inputs, the singular_threshold argument promotes the warning to an error:

# Error out (rather than warn) if rcond < 1e-12
qpadm(data, left, right, target, singular_threshold = 1e-12)

Multiple models: qpadm_multi and qpadm_sweep

qpadm_multi() runs a fixed list of (left, right, target) models, re-using f4-statistics where possible:

models = tibble::tibble(
  target = c("Switzerland_Bichon.SG", "Russia_Ust_Ishim.DG"),
  left   = list(c("Altai_Neanderthal.DG"),
                c("Altai_Neanderthal.DG", "Denisova.DG")),
  right  = list(c("Chimp.REF", "Mbuti.DG", "Vindija.DG"),
                c("Chimp.REF", "Mbuti.DG", "Vindija.DG")))
qpadm_multi(example_f2_blocks, models)

qpadm_sweep() covers the “many targets across a grid of source / outgroup configurations” question. Given vectors of targets, source_sets, and right_sets, it fits every combination in the Cartesian product and returns a flat tibble of p-values, weights, and feasible flags ready for filtering:

targets = c("Vindija.DG", "Switzerland_Bichon.SG")
sources = list(neanderthal_only     = c("Altai_Neanderthal.DG"),
               neanderthal_denisova = c("Altai_Neanderthal.DG", "Denisova.DG"))
rights  = list(distal  = c("Chimp.REF", "Mbuti.DG"),
               refined = c("Chimp.REF", "Mbuti.DG", "Russia_Ust_Ishim.DG"))
res = qpadm_sweep(example_f2_blocks, targets, sources, rights)
res %>% dplyr::arrange(target, p)

The same f4_var_rcond and f4_var_singular_loadings diagnostics described above are surfaced on qpadm_sweep()’s output (f4_var_rcond unconditionally; f4_var_singular_loadings as a list-column when full_results = TRUE), making it straightforward to flag or filter rank-deficient fits across an entire sweep. The loadings tibble fires per row when f4_var_rcond < 1e-8 (the auto-bar) and is NULL otherwise. Note that passing singular_threshold via ... to qpadm_sweep() is not the “mark and continue” path callers might expect; qpadm() raises an error when the threshold trips and the sweep aborts. For sweep-wide rank-deficiency triage, omit singular_threshold and filter on the returned f4_var_rcond column instead.

Both functions parallelize across models under future::plan(multisession) (or multicore on Linux). Be aware that the per-block layer inside each model also parallelizes when data is a genotype-file prefix, so two layers compose - usually you want only one of them parallel.

qpwave-style runs against genotype data

When you only care about the rank of $X$ and not about admixture weights for a specific target, the qpwave-on-genotype-data pattern is qpadm(..., target = NULL, return_f4 = TRUE):

# Equivalent to qpwave(geno_prefix, left, right), but uses qpadm's
# parallel per-block f4 computation when data is a genotype-file prefix.
qpadm("path/to/geno_prefix",
      left   = c("L1", "L2"),
      right  = c("R1", "R2", "R3"),
      target = NULL,
      return_f4 = TRUE)

This shape is useful in sweep-style workflows where the rank-only question is one of many you’re asking against the same genotype data - you get the same parallel f4 read for free.