qpWave and qpAdm are programs which can answer a number of questions about the relationships of different populations, such as:
The first two questions can be answered by qpWave. The latter two questions are about a specific target population, and they can be answered by qpAdm, an extension of qpWave.
This page describes how both programs use \(f_4\)-statistics to address these questions. The original description of qpWave and qpAdm which goes into more technical detail can be found in Supplementary Information section 10 in this paper.
\(f_4\)-statistics can be used to test whether two pairs of populations (\(L_1, L_2\) and \(R_1, R_2\)) form clades with respect to one another: If \(f_4(L_1,L_2; R_1,R_2)\) is significantly different from 0, the two groups do not form a clade. If we have two groups with more than two populations each, and we want to know if the groups form two clades, we could compute all possible \(f_4\)-statistics and test if any of them deviate from zero. Except we would have to correct for multiple testing, which is not straight-forward in this case because the \(f_4\)-statistics are not independent of each other.
qpWave provides another way to test whether two groups of populations form two clades. We can empirically see that qpWave and \(f_4\) are closely related, because in the special case of applying qpWave to only four populations, we get almost the same p-value as for an \(f_4\)-statistic of the four populations:
L = c('Altai_Neanderthal.DG', 'Vindija.DG')
R = c('Chimp.REF', 'Mbuti.DG')
f4(example_f2_blocks, L[1], L[2], R[1], R[2])$p## [1] 0.3576872
## ℹ Computing f4 stats...
## ℹ Computing number of admixture waves...
## [1] 0.3577113
Testing whether two pairs of populations form clades with each other is the same as asking whether there is only one gene flow event which separates the two pairs (\(f_4 = 0\)) or whether there must have been more than a single gene flow event between the two pairs (\(f_4 != 0\)).
In the graph on the left, \(L1\) and
\(L2\) form a clade relative to \(R1\) and \(R2\), which means \(L\) and \(R\) are only connected through one gene
flow event. On the right they do not form a clade, and they are
connected through two independent gene flow events (one going through
the root, and the other going through the edge
n1b|n2b).
qpWave allows us to estimate a lower bound on the number of gene flow events that separate two groups of populations \(L\) and \(R\) of size \(n_L\) and \(n_R\).
The basic idea behind it is that we can use \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) to estimate the genetic drift that is shared between pairs of left populations and pairs of right populations. There are \(\frac{n_L(n_L-1)}{2} \frac{n_R(n_R-1)}{2}\) \(f_4\)-statistics of that form, but not all of them contain unique information. In the case where \(L\) and \(R\) form two clades, all \(f_4\)-statistics \(f_4(L_i, L_j; R_m, R_n)\) are expected to be zero. If there are two independent gene flow events, a few of these \(f_4\)-statistics will be enough to determine what the other \(f_4\)-statistics should be. More generally, we can construct a matrix \(X\) of these \(f_4\)-statistics, and the rank of \(X\) tells us what the minimum number of gene flows is between \(L\) and \(R\). A high rank of \(X\) suggests many independent gene flows between \(L\) and \(R\).
In practice, \(X\) is constructed by fixing one population in \(L\) and one population in \(R\) which results in a matrix of dimensions \((n_L - 1) \times (n_R - 1)\) with all \(f_4\)-statistics of the form \(f_4(L_1, L_i; R_1, R_j)\). By convention, \(n_R > n_L\), which means that the highest possible rank of \(X\) is \(n_L - 1\). A rank of \(n_L - 1\) suggests that at least \(n_L - 1\) independent gene flows have occurred between \(L\) and \(R\).
To estimate the rank of \(X\) we test how well \(X\) can be approximated by different low rank approximations. An approximation of \(X\) with one rank less than full rank can be written as \(AB\), where \(A\) is \((n_L-1) \times (n_L-2)\), and \(B\) is \((n_L-2) \times (n_R-1)\) and \(AB\) should be as close as possible to \(X\). In this case (one rank less than full rank), \(X\) would have rank \(n_L - 1\), and \(AB\) would have rank \(n_L - 2\). If this approximation works well, the residuals \(E = X - AB\) will be very small. Large residuals suggest that the low rank approximation \(AB\) doesn’t capture all the information in \(X\) and that a simple model with only few admixture events doesn’t fit the data. How large these residuals can become before a model can be rejected depends on the standard errors of the \(f_4\)-statistics. If we have data from a large number of SNPs, these standard errors will be small, which gives us more power to reject models.
To get low rank approximations of \(X\), we start with a singular value decompositions (SVD) of \(X\), which gives us precursors of the matrices \(A\) and \(B\). Because we cannot ignore the covariance among the \(f_4\)-statistics (\(Q\)), \(A\) and \(B\) need to be further adjusted. The initial SVD estimates of \(A\) and \(B\) already minimize the squared sum of the residuals \(E = X - AB\). What is actually minimized in the end is the quadratic form \(E' Q^{-1} E\) (\(E\) is a vector of length \((n_L - 1)(n_R - 1)\), and \(Q\) is a square matrix of dimension \((n_L - 1)(n_R - 1) \times (n_L - 1)(n_R - 1)\)). This quantity is closely related to the likelihood of the model, \(L(A,B) = -\frac{1}{2} E'Q^{-1}E\). The difference between the log likelihoods of two models (same populations, but one with higher rank than the other) approximately follows a \(\chi^2\) distribution with \(x\) degrees of freedom. From this \(\chi^2\) distribution we can compute a p-value that tests whether we can reject simple models with only one or a few admixture waves.
Let’s consider two extreme cases:
First, \(L\) and \(R\) are two groups of populations which form clades with respect to each other. This means that all \(f_4\)-statistics of the form \(f_4(L_i, L_j; R_m, R_n)\) are \(0\), and hence \(X\) is just a matrix full of \(0s\) and rank has rank \(0\).
The opposite extreme is when \(X\) (which has dimensions \((n_L - 1) \times (n_R - 1)\)) has full rank (\((n_L - 1)\), because \(n_L\) <= \(n_R\)). This means that there is no population in \(L\) which can be modeled as a combination of other populations in \(L\). Each population in \(L\) shares some amount of drift with populations in \(R\), which is not captured by other populations in \(L\).
This second extreme case is important, because it forms the basis for qpAdm: Once we have identified two groups \(L\) and \(R\) with this property, we can add an additional population \(T\) and test whether this property is retained or not.
The qpWave section above describes how we can use a matrix of \(f_4\)-statistics (\(X\)) to test the number of gene flows that separate \(L\) and \(R\). The idea behind qpAdm is that we can say something about an additional population \(T\) if we follow this approach twice: Once for a model with \(L\) and \(R\), and once for a model with \(L \cup T\) (the union of \(L\) and \(T\)) and \(R\). If we find a higher number of gene flows for the model with \(L \cup T\) and \(R\) than for the model with only \(L\) and \(R\), this suggests that there was some gene flow between \(T\) and \(R\). Thus \(T\) cannot be modeled as a combination of populations in \(L\). This may hint at the existence of a yet undiscovered “ghost” population which is related to a population in \(R\), but not represented by any of the populations in \(L\). On the other hand, if both models have the same rank (no extra gene flow between \(T\) and \(R\)), then \(T\) can be modeled as a combination of populations in \(L\), and we can estimate the relative contribution of each population.
qpAdm therefore produces two types of output that are of main interest: A p-value which tests whether there is a connection between \(R\) and \(T\), other than through \(L\), and (if the answer to the last question is no), admixture weights which can tell us in what proportions populations in \(L\) have contributed to \(T\).
The null hypothesis tested by qpAdm is that the populations in \(L\) can account for all shared drift between \(T\) and \(R\). A significant p-value rejecting this null hypothesis indicates that there has been gene flow between \(T\) and \(R\) (in either direction) that is not accounted for by any population in \(L\).
The admixture weights are estimated by computing the left null space of the matrix \(A\) (they satisfy \(w A = 0\)). This is basically asking in what proportions the populations in \(L\) need to be added up so that their \(f_4\)-statistics will be equal to the \(f_4\)-statistics involving \(T\). These weights will also be computed when the model is rejected (the p-value is significant), but in that case they can’t be interpreted as admixture weights. It is important to keep in mind that the true source populations of \(T\) are almost certainly not identical to any of the populations in \(L\), and that any inferences about the mixture proportions of the true, unobserved source populations depend on an unverifiable assumption: Each left population needs to be strictly cladal with one of the source populations.
The theory above maps onto a single function call. With f-statistics
in example_f2_blocks, we can ask whether the Pleistocene
European Switzerland_Bichon.SG can be modeled as a mix of
an Ust’-Ishim-like ancestor (Russia_Ust_Ishim.DG, an early
modern Eurasian) plus Neanderthal admixture from
Vindija.DG, with the remaining populations as
outgroups:
res = qpadm(example_f2_blocks,
target = "Switzerland_Bichon.SG",
left = c("Russia_Ust_Ishim.DG", "Vindija.DG"),
right = c("Chimp.REF", "Mbuti.DG",
"Altai_Neanderthal.DG", "Denisova.DG"),
verbose = FALSE)The most-frequently-consumed pieces of the return value are the admixture weights and the rank-by-rank fit table:
## # A tibble: 2 × 5
## target left weight se z
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 Switzerland_Bichon.SG Russia_Ust_Ishim.DG 1.01 0.00407 247.
## 2 Switzerland_Bichon.SG Vindija.DG -0.00714 0.00407 -1.75
## # A tibble: 2 × 7
## f4rank dof chisq p dofdiff chisqdiff p_nested
## <int> <int> <dbl> <dbl> <int> <dbl> <dbl>
## 1 1 2 5.45 0.0655 4 52390. 0
## 2 0 6 52395. 0 NA NA NA
weights$weight are the estimated proportions of each
source. se is the standard error; the row
weight / se = z follows. For this fit, essentially all of
Bichon’s ancestry is modeled as Ust’-Ishim-like, with a small
(statistically negligible) Vindija contribution — consistent with the
published biology that the Neanderthal trickle into Pleistocene
Europeans is largely already carried by the Ust’-Ishim-like ancestor in
this set.rankdrop$p[1] is the p-value for the model at
the auto-selected rank (the top row, length(left) - 1). A
small p rejects the null hypothesis that the right populations
are connected to target only through the sources. Here
p ≈ 0.07: the 2-source model is not rejected at α =
0.05.rankdrop$p_nested[1] compares the full 2-source model
to the simpler one-rank-lower model (the cladal alternative, where
target is cladal with one of the sources). A small value
rejects the cladal alternative in favor of the 2-source model. This is a
structural test about model topology — distinct from the per-source
weight z-scores. p_nested can reject the cladal alternative
even when the second source’s weight is itself statistically small (as
in this fit, where p_nested = 0 despite Vindija’s
z = −1.75); the f4 patterns simply don’t fit the rank-0
topology.When data is the prefix of a genotype file rather than
precomputed f2-statistics, qpadm() reads from disk in
parallel under any active future::plan() (see the parallelization vignette).
qpadm() inverts a square matrix of f4 covariances. When
two right populations are close to cladal with each other — for example,
a sister-clade pair, or one right population too close to a source —
that matrix becomes near-singular, the inversion falls back to a
pseudo-inverse, and the returned standard errors look misleadingly
tight. The fitted weights themselves remain interpretable; the
uncertainty on them does not.
qpadm() surfaces this with two diagnostics on the
returned list:
res$f4_var_rcond — the reciprocal condition number of
the (fudged) f4 variance matrix. Values near machine epsilon
(~1e-15) signal a degenerate variance matrix. A value below
1e-8 triggers an automatic warning.res$f4_var_singular_loadings — when
f4_var_rcond < 1e-8, this tibble surfaces which
right populations are responsible. Mechanically: it measures linear
dependence among the right populations’ own resampled f4 estimates (the
same per-block leave-one-out estimates f4_var is built
from) via their Gram matrix, then reports for each right population the
L2 norm of its rows of the orthogonal projector onto the Gram’s
near-null eigenspace (the square root of that projector’s diagonal).
Because it uses the right populations’ signatures only, a singularity
caused by the left (source) populations does not leak into the
loadings; because it projects onto the whole near-null subspace rather
than reading a single eigenvector, the loadings are identical across
LAPACK/BLAS backends. Right pops at the top of the sorted tibble are the
most-likely offenders — typically a sister-clade pair among the right
set (two pops with similar high loadings); loadings that are all near
zero mean the near-singularity comes from the left (the sources), not
the right populations.For pipelines that should fail loudly on degenerate inputs, the
singular_threshold argument promotes the warning to an
error:
qpadm_multi() runs a fixed list of
(left, right, target) models, re-using f4-statistics where
possible:
models = tibble::tibble(
target = c("Switzerland_Bichon.SG", "Russia_Ust_Ishim.DG"),
left = list(c("Altai_Neanderthal.DG"),
c("Altai_Neanderthal.DG", "Denisova.DG")),
right = list(c("Chimp.REF", "Mbuti.DG", "Vindija.DG"),
c("Chimp.REF", "Mbuti.DG", "Vindija.DG")))
qpadm_multi(example_f2_blocks, models)qpadm_sweep() covers the “many targets across a grid of
source / outgroup configurations” question. Given vectors of
targets, source_sets, and
right_sets, it fits every combination in the Cartesian
product and returns a flat tibble of p-values, weights, and
feasible flags ready for filtering:
targets = c("Vindija.DG", "Switzerland_Bichon.SG")
sources = list(neanderthal_only = c("Altai_Neanderthal.DG"),
neanderthal_denisova = c("Altai_Neanderthal.DG", "Denisova.DG"))
rights = list(distal = c("Chimp.REF", "Mbuti.DG"),
refined = c("Chimp.REF", "Mbuti.DG", "Russia_Ust_Ishim.DG"))
res = qpadm_sweep(example_f2_blocks, targets, sources, rights)
res %>% dplyr::arrange(target, p)The same f4_var_rcond and
f4_var_singular_loadings diagnostics described above are
surfaced on qpadm_sweep()’s output
(f4_var_rcond unconditionally;
f4_var_singular_loadings as a list-column when
full_results = TRUE), making it straightforward to flag or
filter rank-deficient fits across an entire sweep. The loadings tibble
fires per row when f4_var_rcond < 1e-8 (the auto-bar)
and is NULL otherwise. Note that passing
singular_threshold via ... to
qpadm_sweep() is not the “mark and continue” path callers
might expect; qpadm() raises an error when the threshold
trips and the sweep aborts. For sweep-wide rank-deficiency triage, omit
singular_threshold and filter on the returned
f4_var_rcond column instead.
Both functions parallelize across models under
future::plan(multisession) (or multicore on
Linux). Be aware that the per-block layer inside each model also
parallelizes when data is a genotype-file prefix, so two
layers compose — usually you want only one of them parallel.
When you only care about the rank of \(X\) and not about admixture weights for a
specific target, the qpwave-on-genotype-data pattern is
qpadm(..., target = NULL, return_f4 = TRUE):
# Equivalent to qpwave(geno_prefix, left, right), but uses qpadm's
# parallel per-block f4 computation when data is a genotype-file prefix.
qpadm("path/to/geno_prefix",
left = c("L1", "L2"),
right = c("R1", "R2", "R3"),
target = NULL,
return_f4 = TRUE)This shape is useful in sweep-style workflows where the rank-only question is one of many you’re asking against the same genotype data — you get the same parallel f4 read for free.