---
title: "ADMIXTOOLS 2 Tutorial"
author: "Robert Maier"
date: "2026-05-21"
output:
rmarkdown::html_vignette:
toc: true
toc_depth: 2
vignette: >
%\VignetteIndexEntry{ADMIXTOOLS 2 Tutorial}
%\VignetteEngine{knitr::rmarkdown}
\usepackage[utf8]{inputenc}
---
This tutorial gives an overview of the basic workflow for computing *f*-statistics, and for using *qpWave*, *qpAdm*, and *qpGraph*.
Documentation for each *ADMIXTOOLS 2* function can be found under *Reference*, and more detailed information about specific topics under *Articles*.
For the examples here and on the other pages, the following R packages need to be loaded.
```{r, results='hide', message=FALSE, warning=FALSE}
library(admixtools)
library(tidyverse)
```
This tutorial focuses on the R command line interface of *ADMIXTOOLS 2*. Some basic familiarity with R is helpful for using *ADMIXTOOLS 2*, but not required. If you are not familiar with R, have a look at the browser application. To launch it, open R and type the following command:
```{r, eval = FALSE}
admixtools::run_shiny_admixtools()
```
## Introduction
*ADMIXTOOLS* is a set of programs used to infer population histories from genetic data. The main use cases are:
* Finding out if a population is admixed between other populations ([$f_3$](admixtools.html#f3-and-qp3pop), [*qpAdm*](admixtools.html#qpwave-and-qpadm))
* Estimating admixture weights (*qpF4ratio*, [*qpAdm*](admixtools.html#qpwave-and-qpadm))
* Finding out if a set of populations forms a clade relative to another set ([$f_4$](admixtools.html#f3-and-qp3pop), [*qpWave*](admixtools.html#qpwave-and-qpadm))
* Estimating the number of admixture waves separating two sets of populations ([*qpWave*](admixtools.html#qpwave-and-qpadm))
* Fitting an admixture graph to a set of populations ([*qpGraph*](admixtools.html#qpgraph))
All of this is based on *f*-statistics ($f_2$, $f_3$, and $f_4$), and all *f*-statistics can be derived from $f_2$ statistics.
Because of this, *ADMIXTOOLS 2* divides the computations into two steps:
1. Computing $f_2$-statistics and storing them on disk. This can be slow since it accesses the genotype data.
2. Using $f_2$-statistics to fit models. This is fast because $f_2$-statistics are very compact compared to genotype data.
This page shows how standard *ADMIXTOOLS* analyses can be conducted in *ADMIXTOOLS 2*. In addition to that, *ADMIXTOOLS 2* introduces a range of new methods, mostly focused on admixture graphs, which are intended to make analyses simpler, faster, and most importantly, more robust. These methods focus on quantifying variability by resampling SNPs, automated exploration of graph topologies, and simulating data under admixture graphs. They are described [here](#graphs.html).
## *f*-statistics basics
$f_2$, $f_3$, and $f_4$ describe how populations are related to one another. All *ADMIXTOOLS* programs are based on these *f*-statistics. Here, we briefly define them and describe how to compute them. Two excellent papers on *f*-statistics can be found [here](https://www.genetics.org/content/192/3/1065) and [here](https://www.genetics.org/content/202/4/1485).
$f_2$ measures the amount of genetic drift that separates two populations. It is the expected squared difference in allele frequencies and can be estimated across $M$ SNPs as:
$$f_2(A,B) = \frac{1}{M} \sum_{j=1}^M(a_{j} - b_{j})^2$$
$A$ and $B$ are populations, and $a$ and $b$ are their allele frequencies at SNP $j$.
In practice, the estimation of $f_2$ is more complicated than shown here, because it needs to account for low sample counts, missing data, and differences in ploidy. This is described [here](fstats.html).
The real strength of *f*-statistics based methods comes from combining $f_2$-statistics into $f_3$-statistics and $f_4$-statistics.
$f_4$ measures the amount of drift that is shared between two population pairs. It is the covariance of allele frequency differences in each pair, and at the same time the sum of four $f_2$-statistics:
$$
\begin{equation}
\begin{aligned}
f_4(A, B; C, D) &= \frac{1}{M}\sum_{j=1}^M(a_{j} - b_{j})(c_{j} - d_{j}) \\ &= \frac{1}{2}(f_2(A, D) + f_2(B, C) - f_2(A, C) - f_2(B, D) )
\label{eq:f42}
\end{aligned}
\end{equation}
$$
If $A$ and $B$ form a clade relative to $C$ and $D$, there should not be any correlation in the allele frequency differences $A$ - $B$ and $C$ - $D$, so $f_4(A, B; C, D)$ should be zero. If $f_4(A, B; C, D)$ is significantly different from zero, it suggests that $A$ and $B$ are not a clade relative to $C$ and $D$.
Like $f_4$, $f_3$ is the covariance of allele frequency differences between two pairs of populations. The difference is that in $f_3$, one population is the same on both sides. $f_3$ is thus a special case of $f_4$.
$$
\begin{aligned}
f_3(A; B, C) &= f_4(A, B; A, C) \\ &= \frac{1}{2} (f_2(A, B) + f_2(A, C) - f_2(B, C))
\end{aligned}
$$
If $f_3(A; B, C)$ is negative, it means that the more similar allele frequencies are between $A$ and $B$, the more different they are between $A$ and $C$, which suggests that $A$ is admixed between $B$ and $C$, or populations close to them.
$f_3$ and $f_4$ partition genetic drift into a component that is specific to single populations and a component that is shared between two pairs of populations. By measuring only the *shared* drift between pairs of populations, we can make conclusive statements about their relationships. Methods such as PCA and *ADMIXTURE* do not exclude non-shared drift, and therefore don't produce results that can be interpreted as unambiguously. For example, if $f_3(A; B, C)$ is negative, this is conclusive evidence that $A$ is admixed between $B$ and $C$. In that case we would expect $A$ to fall in between $B$ and $C$ in PC space. However, the same PCA pattern could be the results of a range of other demographic histories (for example, $A$ could be ancestral to $B$ and $C$).
## *f*~2~ in *ADMIXTOOLS 2*
In *ADMIXTOOLS 2*, $f_2$-statistics are the foundation for all further analyses. They can be computed from genotype data and saved to disk with this command:
```{r, eval = FALSE}
prefix = '/path/to/geno'
my_f2_dir = '/store/f2data/here/'
extract_f2(prefix, my_f2_dir)
```
This will look for genotype files in *packedancestrymap*, *PLINK*, *PLINK 2* (PFILE), or *EIGENSTRAT* format — auto-detected from the file extensions at the prefix. It computes allele frequencies and blocked $f_4$-statistics for all pairs of populations defined in the `.ind`, `.fam`, or `.psam` file, and writes them to `my_f2_dir`. It is also possible to extract only a subset of the samples or populations by passing IDs to the `inds` and `pops` arguments in `extract_f2()`. To get a description of the arguments and to see examples of how to use it, type
```{r, eval = FALSE}
?extract_f2
```
By default, `extract_f2()` will be very cautious and exclude all SNPs which are missing in any population (`maxmiss = 0`). If you lose too many SNPs this way, you can either
1. limit the number of populations for which to extract $f_2$-statistics,
2. compute $f_3$- and $f_4$-statistics directly from genotype files, or
3. increase the `maxmiss` parameter (`maxmiss = 1` means no SNPs will be excluded).
The advantages and disadvantages of the different approaches are described [here](fstats.html#bias-due-to-missing-data). Briefly, when running `qpadm()` and `qpdstat()` it can be better to choose the safer but slower options 1 and 2, while for `qpgraph()`, which is not centered around hypothesis testing, it is usually fine choose option 3. Since the absolute difference in *f*-statistics between these approaches is usually small, it can also make sense to use option 3 for exploratory analyses, and confirm key results using options 1 or 2.
Once `extract_f2()` has finished, $f_2$-statistics for the populations of interest can be loaded using `f2_from_precomp()`:
```{r, echo = FALSE}
f2_blocks = example_f2_blocks
```
```{r, eval = FALSE}
f2_blocks = f2_from_precomp(my_f2_dir)
```
Or you can load only a subset of the populations:
```{r, eval = FALSE}
mypops = c('Denisova.DG', 'Altai_Neanderthal.DG', 'Vindija.DG')
f2_blocks = f2_from_precomp(my_f2_dir, pops = mypops)
```
If your data is so small that computing $f_2$-statistics doesn't take very long, you can skip writing the data to disk with `extract_f2()` and do everything in one step using `f2_from_geno()`:
```{r, eval = FALSE}
f2_blocks = f2_from_geno(my_f2_dir, pops = mypops)
```
`f2_blocks` is now a 3d-array with $f_2$-statistics for each population pair along dimensions 1 and 2, and each SNP block along the 3rd dimension.
```{r}
dim(f2_blocks)
```
The purpose of having separate estimates for each SNP block is to compute [jackknife or bootstrap standard errors](resampling.html) for *f*-statistics, and for any statistics derived from them.
`f2_blocks` can be used like this:
```{r, eval = FALSE}
f2_blocks[,,1] # f2-statistics of the 1st SNP block
apply(f2_blocks, 1:2, mean) # average across all blocks
f2_blocks[pop1, pop2, ] # f2(pop1, pop2) for all blocks
```
The names along the 3rd dimension contain the SNP block lengths:
```{r}
block_lengths = parse_number(dimnames(f2_blocks)[[3]])
head(block_lengths)
```
To see the total number of SNPs across all blocks, you can use `count_snps()`
```{r}
count_snps(f2_blocks)
```
If you want to try any of this without extracting and loading your own $f_2$-statistics, you can instead use `example_f2_blocks` which becomes available after running `library(admixtools)`.
[More information](fstats.html) on *f*-statistics in *ADMIXTOOLS 2*
## *f*~3~ and *qp3Pop*
There are three main uses of $f_3$-statistics:
1. Testing whether a population is admixed: If $f_3(A; B, C)$ is negative, this suggests that $A$ is admixed between a population related to $B$ and one related to $C$.
2. Estimating the relative divergence time for pairs of populations (outgroup $f_3$-statistics): Pairwise $F_{ST}$ and $f_2$ are simpler estimates of genetic distance or divergence time, but they are affected by differences in population size. If $O$ is an outgroup relative to all populations $i$ and $j$, then $f_3(O; i, j)$ will estimate the genetic distance between $O$ and the points of separation between $i$ and $j$ without being affected to drift that is specific to any population $i$ or $j$.
3. Fitting admixture graphs: $f_3$-statistics of the form $f_3(O; i, j)$ for an arbitrary population $O$, and all pairs of $i$ and $j$ are used in *qpGraph*, which is described below.
The original *ADMIXTOOLS* program for computing $f_3$-statistics is called *qp3Pop*. In *ADMIXTOOLS 2*, you can compute $f_3$-statistics like this:
```{r}
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
```
```{r, eval = FALSE}
qp3pop(f2_blocks, pop1, pop2, pop3)
```
Or, equivalently
```{r, eval = FALSE}
f3(f2_blocks, pop1, pop2, pop3)
```
```{r, echo = FALSE}
f3(example_f2_blocks, pop1, pop2, pop3)
```
This will compute $f_3$-statistics for all combinations of `pop1`, `pop2`, and `pop3`.
`f3(f2_blocks)` will compute all possible combinations (which can be a large number). If only `pop1` is supplied, all combinations of populations in `pop1` will be computed.
## *f*~4~ and *qpDstat*
The original *ADMIXTOOLS* program for computing $f_4$-statistics is called *qpDstat*. As the name suggests, it computes *D*-statistics by default. To get $f_4$-statistics instead, the `f4mode` argument needs to set to `YES`. In *ADMIXTOOLS 2*, almost everything starts with $f_2$-statistics, so the `qpdstat`/`f4` function computes $f_4$-statistics by default.
```{r}
pop4 = 'Switzerland_Bichon.SG'
```
```{r, eval = FALSE}
f4(f2_blocks, pop1, pop2, pop3, pop4)
qpdstat(f2_blocks, pop1, pop2, pop3, pop4)
# two names for the same function
```
```{r, echo = FALSE}
f4(example_f2_blocks, pop1, pop2, pop3, pop4, verbose = FALSE)
```
The differences between $f_4$-statistics and *D*-statistics are usually negligible. However, it is still possible to compute *D*-statistics in *ADMIXTOOLS 2*, by providing genotype data as the first argument, and setting `f4mode = FALSE`:
```{r, eval = FALSE}
prefix = '/path/to/geno'
f4(prefix, pop1, pop2, pop3, pop4, f4mode = FALSE)
```
Computing $f_4$- or *D*-statistics from genotype data directly is slower, but it has the advantage that it avoids any problems that may arise from large amounts of missing data. More on this [here](fstats.html).
## F~ST~
$F_{ST}$ is closely related to $f_2$, but unlike $f_2$, it doesn't function as a building block for other tools in *ADMIXTOOLS 2*. However, it is the most widely used metric to estimate the genetic distance between populations. Running `extract_f2()` will create files which don't only contain $f_2$ estimates for each population pair, but also separate $F_{ST}$ estimates. The function `fst()` can either read these pre-computed estimates, or compute them directly from genotype files:
```{r, eval = FALSE}
fst(my_f2_dir)
```
```{r, eval = FALSE}
fst(prefix, pop1 = "Altai_Neanderthal.DG", pop2 = c("Denisova.DG", "Vindija.DG"))
```
To estimate $F_{ST}$ without bias, we need at least two independent observations in each population. With pseudohaploid data, we only get one independent observation per sample, and so for populations consisting of only one pseudohaploid sample, $F_{ST}$ cannot be estimated without bias. If we want to ignore that bias and get estimates anyway, we can pretend the pseudohaploid samples are actually diploid using the option `adjust_pseudohaploid = FALSE`.
```{r, eval = FALSE}
fst(prefix, pop1 = "Altai_Neanderthal.DG", pop2 = c("Denisova.DG", "Vindija.DG"),
adjust_pseudohaploid = FALSE)
```
## *qpWave* and *qpAdm*
*qpWave* and *qpAdm* are two programs with different goals - *qpWave* is used for estimating the number of admixture events, and *qpAdm* is used for estimating admixture weights - but they perform almost the same computations. The key difference is that *qpWave* compares two sets of populations (`left` and `right`), while *qpAdm* is tests how a single `target` population (which can be one of the `left` populations) relates to `left` and `right`. In *ADMIXTOOLS 2*, both `qpadm()` and `qpwave()`require at least three arguments:
1. $f_2$-statistics
2. A set of left populations
3. A set of right populations
`qpadm()` additionally requires a `target` population as the 4th argument, which will be modeled as a mixture of `left` populations.
```{r}
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
pops = c(left, right, target)
```
Both functions will return $f_4$-statistics, and a data frame that shows how well the $f_4$-matrix can be approximated by lower rank matrices. The last line tests for rank 0, which is equivalent to testing whether the left populations form a clade with respect to the right populations.
```{r, eval = FALSE}
results = qpwave(f2_blocks, left, right)
```
```{r, echo = FALSE}
results = qpwave(example_f2_blocks, left, right, verbose = FALSE)
```
```{r}
results$f4
results$rankdrop
```
`qpadm()` will also compute admixture weights and nested models:
* **weights**: These are the admixture weights, or estimates of the relative contributions of the left population to the target population.
* **popdrop**: popdrop shows the fits of all models generated by dropping a specific subset of left populations, and will only be returned if a target population is specified.
* **f4**: The estimated $f_4$-statistics will now also include lines with fitted $f_4$-statistics, where the target population is in the first column, and a weighted sum of the left populations, `fit`, in the second column.
* **f4_var_rcond** / **f4_var_singular_loadings**: diagnostics for near-singular f4 covariance, indicating when collinear right populations are inflating the apparent precision of the weights. See the [qpWave and qpAdm vignette](qpadm.html) for how to interpret and gate on these.
```{r, eval = FALSE}
results = qpadm(f2_blocks, left, right, target)
```
```{r, echo = FALSE}
results = qpadm(example_f2_blocks, left, right, target, verbose = FALSE)
```
```{r}
results$weights
results$popdrop
```
### Running many models
There are several functions that can be used to run many *qpWave* or *qpAdm* models at the same time.
#### Pairwise cladality tests
`qpwave_pairs()` forms all pairs of `left` populations and tests whether they form a clade with respect to the `right` populations.
```{r, eval = FALSE}
qpwave_pairs(f2_blocks, left = c(target, left), right = right)
```
```{r, echo = FALSE}
qpwave_pairs(example_f2_blocks, left = c(target, left), right = right)
```
#### Rotating outgroups
`qpadm_rotate()` tests many `qpadm()` models at a time. For each model, the `leftright` populations will be split into two groups: The first group will be the `left` populations passed to `qpadm()`, while the second group will be added to `rightfix` and become the set of right populations. By default, this function will only compute p-values but not weights for each model (which makes it faster). If you want the full output for each model, set `full_results = TRUE`.
```{r, eval = FALSE}
qpadm_rotate(f2_blocks, leftright = pops[3:7], target = pops[1], rightfix = pops[1:2])
```
```{r, echo = FALSE}
qpadm_rotate(example_f2_blocks, leftright = pops[3:7], target = pops[1], rightfix = pops[1:2], verbose = FALSE)
```
#### Many qpadm models
Swapping some populations between the `left` and the `right` set is one common way to run multiple `qpadm()` models. There is also a more general function, in which you can specify any models that you want to run. This is faster than looping over several calls to the `qpadm()` function, in particular when reading data from a genotype matrix directly, because it re-uses f4-statistics.
To specify the `qpadm()` models you want to run, you need to make a data frame with columns `left`, `right`, and `target`, where each model is in a different row.
```{r, eval = FALSE}
models = tibble(
left = list(pops[1:2], pops[3]),
right = list(pops[4:6], pops[1:2]),
target = c(pops[7], pops[7]))
results = qpadm_multi('/my/geno/prefix', models)
```
```{r, echo = FALSE}
models = tibble(
left = list(pops[1:2], pops[3]),
right = list(pops[4:6], pops[1:2]),
target = c(pops[7], pops[7]))
results = qpadm_multi(example_f2_blocks, models, verbose = FALSE)
```
The output is a list where each item is the result from one model. The following command would combine the weights for all models into a new data frame:
```{r}
results %>% map('weights') %>% bind_rows(.id = 'model')
```
#### Sweeps over target × source × right
`qpadm_sweep()` covers the "many targets across a grid of source and outgroup configurations" question — the natural fit when you have a set of populations and want to score every plausible (target, source-set, right-set) combination at once. The grid is the Cartesian product of three vectors:
```{r, eval = FALSE}
targets = c(target1, target2, target3)
sources = list(model_A = c(pop1, pop2),
model_B = c(pop1, pop2, pop3))
rights = list(distal = c(out1, out2),
refined = c(out1, out2, out3))
res = qpadm_sweep(my_geno_or_f2, targets, sources, rights)
res %>% arrange(target, p)
```
The output is a flat tibble with one row per combination, with `p`, `chisq`, `feasible` columns ready for filtering. Both `qpadm_multi()` and `qpadm_sweep()` parallelize across models under `future::plan()` — see [the qpWave and qpAdm vignette](qpadm.html) for the deep dive and [the parallelization vignette](parallel.html) for the two-layers-compose caveat.
## *qpGraph*
Single $f_3$- and $f_4$-statistics can tell us how three or four populations are related to each other. *qpGraph* generalizes this concept to any number of populations. It takes estimated $f_3$-statistics and the topology of an admixture graph, finds the edges weights that minimize the difference between fitted and estimated $f_3$-statistics, and summarizes that difference in a likelihood score. A good model should fit all $f_3$-statistics, and have a score close to zero.
```{r, eval = FALSE}
qpg_results = qpgraph(f2_blocks, example_graph)
qpg_results$score
```
```{r, echo = FALSE}
qpg_results = qpgraph(example_f2_blocks, example_graph)
qpg_results$score
```
Here, `example_graph` is a specific graph included in this R package, but you can provide any other graph in one of three formats.
1. An [igraph](https://igraph.org/r/) object.
2. A two column matrix or data frame, where each row is an edge, the first column is the source of the edge, and the second column is the target. Additional columns labelled `lower` and `upper` can be used to constrain certain edges (`NA` = no constraint).
3. The location of a *qpGraph* graph file, which will be read and parsed.
The leaf nodes of this graph have to match the $f_2$-statistic population labels, and the graph has to be a valid admixture graph: a directed acyclic graph where each node has no more than two parents. If nodes with more than two children are present (polytomies or multifurcations) they will be split in a random order, and the new drift edges will be constrained to 0.
The output of `qpgraph()` is a list with several items:
* **edges**: A data frame with estimated edge weights. This includes normal edges as well as admixture edges.
* **score** The likelihood score. Smaller scores indicate a better fit.
* **f2** Estimated $f_2$-statistics and standard errors for all population pairs
* **f3** Estimated and fitted $f_3$-statistics for all population pairs with the outgroup. This includes residuals and z-scores.
* **opt** A data frame with details from the weight optimization, with one row per set of starting values.
* **ppinv** The inverse of the $f_3$-statistics covariance matrix
Optionally, fitted and estimated $f_4$-statistics are returned as **f4** and the worst residual z-score as **worst_residual** if `return_fstats` is set to `TRUE`. When `f2_blocks_test` is provided, an out-of-sample score is computed and returned as **score_test**.
The fitted graph can be plotted like this:
```{r}
plot_graph(qpg_results$edges)
```
or as an interactive plot if you want to know the names of the inner nodes:
```{r, warning = FALSE}
plotly_graph(qpg_results$edges)
```
### More about graphs
*ADMIXTOOLS 2* can automatically find well-fitting graphs, and it has several functions for exploring the topological neighborhood of a graph. These functions are described [here](graphs.html).
To get a better theoretical understanding of *f*-statistics and admixture graphs, I recommend [this paper](https://onlinelibrary.wiley.com/doi/abs/10.1111/1755-0998.13230).
## Comparing results between *ADMIXTOOLS* and *ADMIXTOOLS 2*
The results of *ADMIXTOOLS 2* should generally match those of *ADMIXTOOLS*. However, there are a few parameters where the default behavior in *ADMIXTOOLS 2* differs in subtle ways from that in *ADMIXTOOLS*:
* Selection of SNPs
Different *ADMIXTOOLS* programs select SNPs in slightly different ways.*qpDstat* selects different SNPs for each $f_4$-statistic (similar to `allsnps: YES` in *qpAdm* and *qpGraph*). The default in *qpAdm* and *qpGraph* is to select only those SNPs which are present in all populations in the tested model (`allsnps: NO`). In addition to that, some programs discard SNPs which have identical allele frequencies in all populations, while others do not. (The SNPs that remain are sometimes called "polymorphic" here. They are polymorphic not in the traditional meaning, but with regard to their allele frequencies.) While those SNPs are not informative in the sense that they do not shift $f_4$-statistics away from zero, they can still have a small impact on the magnitude of the estimate, and on the total number of SNPs used.
*ADMIXTOOLS 2* is set up in such a way that the default options are the same as in *ADMIXTOOLS* in most cases. However, missing data can [make things complicated](fstats.html#bias-due-to-missing-data-1). Most notably, in order to get the default behavior of the original *qpDstat*, the first argument to the `qpdstat()` function has to be the prefix of genotype files. Similarly, to get the `allsnps: YES` behavior of *qpAdm* and *qpGraph*, the first argument to the `qpadm()` and `qpgraph()` functions has to be the prefix of genotype files and in addition to that, the `allsnps` parameter in these functions should be set to `TRUE`.
With regard to the use of non-polymorphic/non-informative SNPs, *ADMIXTOOLS 2* includes them for the computation for $f_2$-statistics, but not for the computation of allele frequency products. This usually makes very little difference, but it makes it possible to imitate the default behavior of the original *ADMIXTOOLS* programs.
* Pseudohaploid data
In *ADMIXTOOLS*, it is recommended to use the option `inbreed: YES` when dealing with pseudohaploid data. This will ensure that when the low-sample-size correction factor for $f_2$-statistics is computed, each pseudohaploid sample contributes only one haplotype. However, this will not work for populations of only a single pseudohaploid sample. Which is not a big problem, because the default option `inbreed: NO` usually gives very similar results.
*ADMIXTOOLS 2* automatically detects which samples are diploid and which samples are pseudohaploid based on the first 1000 SNPs, and computes the correction factor appropriately. The exact default behavior of *ADMIXTOOLS* (`inbreed: NO`) can be recovered by setting `adjust_pseudohaploid = FALSE` in `extract_f2()`.
* $f_4$ vs D-statistics
In *ADMIXTOOLS*, *qpDstat* computes D-statistics by default `f4mode: YES` will compute instead compute $f_4$. This is reversed in *ADMIXTOOLS 2*: The default is `f4mode = TRUE`, and setting it to false will compute D-statistics. The reason for this is that `f4mode = FALSE` (computing D-statistics) is only possible when the first argument to the function is the prefix of genotype files.
To make it easier to compare *ADMIXTOOLS* to *ADMIXTOOLS 2*, there are wrapper function which call the original *ADMIXTOOLS* programs and read the results.
```{r, eval = FALSE}
binpath = '/home/np29/o2bin/'
env = 'export LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/n/app/openblas/0.2.19/lib/:/n/app/gsl/2.3/lib/;'
qp3pop_bin = paste0(env, binpath, 'qp3pop')
qpdstat_bin = paste0(env, binpath, 'qpDstat')
qpadm_bin = paste0(env, binpath, 'qpAdm')
qpgraph_bin = paste0(env, binpath, 'qpGraph')
#prefix = '/n/groups/reich/DAVID/V42/V42.1/v42.1'
prefix = '/n/groups/reich/robert/projects/admixprograms/v42.1_small'
outdir = 'write/files/here/'
qp3pop_wrapper(prefix, source1 = pop2, source2 = pop3, target = pop1,
qp3pop_bin, outdir)
qpdstat_wrapper(prefix, pop1, pop2, pop3, pop4,
qpdstat_bin, outdir)
qpadm_wrapper(prefix, left, right, target,
qpadm_bin, outdir)
qpgraph_wrapper(prefix, example_graph,
qpgraph_bin, outdir)
```
**Unless `outdir` is specified, calling these wrapper functions may overwrite files in the working directory!**
If you already have existing population, graph, or parameter files, you can run the wrapper functions like this (though the different programs will require different parfiles and popfiles):
```{r, eval = FALSE}
# using population or graph files
qp3pop_wrapper (prefix, 'popfile.txt', bin = qp3pop_bin)
qpdstat_wrapper(prefix, 'popfile.txt', bin = qpdstat_bin)
qpadm_wrapper(prefix, left = 'left.txt', right = 'right.txt', bin = qpadm_bin)
qpgraph_wrapper(prefix, 'graphfile.txt', bin = qpgraph_bin)
# using parameter files
qp3pop_wrapper (NULL, parfile = 'parfile.txt', bin = qp3pop_bin)
qpdstat_wrapper(NULL, parfile = 'parfile.txt', bin = qpdstat_bin)
qpadm_wrapper (NULL, parfile = 'parfile.txt', bin = qpadm_bin)
qpgraph_wrapper(NULL, 'graphfile.txt', parfile = 'parfile.txt', bin = qpgraph_bin)
```
The following function makes it easy to compare *qpGraph* or *qpAdm* results.
```{r, eval = FALSE}
qpgraph_ref_results = qpgraph_wrapper(prefix, example_graph, qpgraph_bin)
```
```{r, echo = FALSE}
qpgraph_ref_results = example_qpgraph_ref_results
```
```{r, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 7, out.width = "100%"}
plot_comparison(qpg_results, qpgraph_ref_results)
```
This is not only useful for comparing results between *ADMIXTOOLS* and *ADMIXTOOLS 2*, but also for comparing models which used different parameters.
The interactive `plotly_comparison()` makes it easier to identify outliers.
```{r, eval = FALSE}
qpg_results2 = qpgraph(f2_blocks, example_graph, lsqmode = TRUE)
```
```{r, echo = FALSE}
qpg_results2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
```
```{r, warning = FALSE, message = FALSE, fig.width = 7, fig.height = 9, out.width = "100%"}
plotly_comparison(qpg_results, qpg_results2)
```