Package 'admixtools' reference manual

Title:	Inferring demographic history from genetic data
Description:	admixtools is a set of methods used to infer demographic history from genetic data using f-statistics. It was originally implemented as a command line program in C. This is a new implementation of the programs qpDstat, qpAdm, qpGraph, and contains several new features. It is efficient because it precomputes f2-statistics which can be re-used for many analyses.
Authors:	Robert Maier [aut, cre], Nick Patterson [aut]
Maintainer:	Robert Maier <[email protected]>
License:	GPL-3
Version:	2.0.8
Built:	2025-02-28 22:25:26 UTC
Source:	https://github.com/uqrmaie1/admixtools

Tools for inferring demographic history from genetic data

Description

admixtools is a collection of several programs for calculating f-statistics and using them to make inferences about demographic history from genetic data Aside from the core functions qp3pop, qpdstat, qpadm, and qpgraph, there are also wrapper functions and parsing functions around the original ADMIXTOOLS software, as well as functions to read genotype files and precompute the data necessary to quickly compute f statistics.

Author(s)

Robert Maier <[email protected]>

Nick Patterson

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Compute count blocks and write them to disk

Description

This is intended for computing allele count data for a large number of individuals, too many to do everything in working memory. It assumes that the allele frequencies have already been computed and are stored in .rds files, split into consecutive blocks for a set of individuals.

Usage

afs_to_counts(
  genodir,
  outdir,
  chunk1,
  chunk2,
  overwrite = FALSE,
  verbose = TRUE
)
afs_to_counts(
  genodir,
  outdir,
  chunk1,
  chunk2,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

`genodir`	Directory with genotype files split in chunks created by `extract_afs` (with each individual its own population).
`outdir`	Directory where allele count data will be stored
`chunk1`	Index of the first chunk of individuals
`chunk2`	Index of the second chunk of individuals
`overwrite`	Overwrite existing files (default `FALSE`)
`verbose`	Print progress updates

Compute f2 blocks and write them to disk

Description

This is intended for computing f2-statistics for a large number of populations, too many to do everything in working memory. It assumes that the allele frequencies have already been computed and are stored in .rds files, split into consecutive blocks for a set of populations. This function calls write_f2, which takes a (sub-)chunk of pairwise f2-statistics, and writes one pair at a time to disk.

Usage

afs_to_f2(
  afdir,
  outdir,
  chunk1,
  chunk2,
  blgsize = 0.05,
  snpwt = NULL,
  overwrite = FALSE,
  type = "f2",
  poly_only = FALSE,
  snpdat = NULL,
  apply_corr = TRUE,
  verbose = TRUE
)
afs_to_f2(
  afdir,
  outdir,
  chunk1,
  chunk2,
  blgsize = 0.05,
  snpwt = NULL,
  overwrite = FALSE,
  type = "f2",
  poly_only = FALSE,
  snpdat = NULL,
  apply_corr = TRUE,
  verbose = TRUE
)

Arguments

`afdir`	Directory with allele frequency and counts `.rds` files created by `split_mat`
`outdir`	Directory where data will be stored
`chunk1`	Index of the first chunk of populations
`chunk2`	Index of the second chunk of populations
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`snpwt`	A vector of scaling factors applied to the f2-statistics for each SNP. The length has to match the number of SNPs.
`overwrite`	Overwrite existing files (default `FALSE`)
`verbose`	Print progress updates

Examples

## Not run: 
afdir = 'tmp_af_dir/'
f2dir = 'f2_dir'
extract_afs('path/to/packedancestrymap_prefix', afdir)
numchunks = length(list.files(afdir, 'afs.+rds'))
# numchunks should be the number of split allele frequency files
for(i in 1:numchunks) {
  for(j in i:numchunks) {
    afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = j)
  }
}

## End(Not run)
# Alternatively, the following code will do the same, while submitting each chunk as a separate job.
# (if \code{\link[future]{plan}} has been set up appropriately)
## Not run: 
furrr::future_map(1:numchunks, ~{i=.; map(i:numchunks, ~{
  afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = .)
  })})
  
## End(Not run)
## Not run: 
afdir = 'tmp_af_dir/'
f2dir = 'f2_dir'
extract_afs('path/to/packedancestrymap_prefix', afdir)
numchunks = length(list.files(afdir, 'afs.+rds'))
# numchunks should be the number of split allele frequency files
for(i in 1:numchunks) {
  for(j in i:numchunks) {
    afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = j)
  }
}

## End(Not run)
# Alternatively, the following code will do the same, while submitting each chunk as a separate job.
# (if \code{\link[future]{plan}} has been set up appropriately)
## Not run: 
furrr::future_map(1:numchunks, ~{i=.; map(i:numchunks, ~{
  afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = .)
  })})
  
## End(Not run)

Compute all pairwise f2 statistics

Description

This function takes a allele frequency data and computes blocked f2 statistics for all population pairs, which are written to outdir. $f2$ for each SNP is computed as $(p1 - p2)^2 - p1 (1 - p1)/(n1 - 1) - p2 (1 - p2)/(n2 - 1)$ , where $p1$ and $p2$ are allele frequencies in populations $1$ and $2$ , and $n1$ and $n2$ is the number of non-missing haplotypes in populations $1$ and $2$ . See details

Usage

afs_to_f2_blocks(
  afdat,
  maxmem = 8000,
  blgsize = 0.05,
  pops1 = NULL,
  pops2 = NULL,
  outpop = NULL,
  outdir = NULL,
  overwrite = FALSE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = apply_corr,
  n_cores = 1,
  verbose = TRUE
)
afs_to_f2_blocks(
  afdat,
  maxmem = 8000,
  blgsize = 0.05,
  pops1 = NULL,
  pops2 = NULL,
  outpop = NULL,
  outdir = NULL,
  overwrite = FALSE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = apply_corr,
  n_cores = 1,
  verbose = TRUE
)

Arguments

`afdat`	A list with three items with the same SNP in each row (generated by `packedancestrymap_to_afs` or `plink_to_afs`) `afs` A matrix of allele frequencies for all populations (columns) and SNPs (rows) `counts` A matrix of allele counts for all populations (columns) and SNPs (rows) `snpdat` A data frame with SNP metadata
`maxmem`	split up allele frequency data into blocks, if memory requirements exceed `maxmem` MB.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`outpop`	If specified, f2-statistics will be weighted by heterozygosity in this population
`outdir`	Directory into which to write f2 data (if `NULL`, data is returned instead)
`overwrite`	Should existing files be overwritten? Only relevant if `outdir` is not `NULL`
`poly_only`	Exclude sites with identical allele frequencies in all populations. Can be different for f2-statistics, allele frequency products, and fst. Should be a character vector of length three, with some subset of `c("f2", "ap", "fst")`
`verbose`	Print progress updates
`pop1`	`pops1` and `pops2` can be specified if only a subset of pairs should be computed.
`pop2`	`pops1` and `pops2` can be specified if only a subset of pairs should be computed.

Details

For each population pair, each of the $i = 1, \ldots, n$ resutling values ( $n$ is around 700 in practice) is the mean $f2$ estimate across all SNPs except the ones in block $i$ .

$- p1 (1 - p1)/(2 n1 - 1) - p2 (1 - p2)/(2 n2 - 1)$ is a correction term which makes the estimates unbiased at low sample sizes.

Examples

## Not run: 
afdat = plink_to_afs('/my/geno/prefix')
afs_to_f2_blocks(afdat, outdir = '/my/f2/data/')

## End(Not run)
## Not run: 
afdat = plink_to_afs('/my/geno/prefix')
afs_to_f2_blocks(afdat, outdir = '/my/f2/data/')

## End(Not run)

Convert agraph to igraph

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

agraph_to_igraph(agraph)
agraph_to_igraph(agraph)

Arguments

agraph

An admixture graph in agraph format

Value

An admixture graph in igraph format

Examples

## Not run: 
agraph_to_igraph(agraph)

## End(Not run)
## Not run: 
agraph_to_igraph(agraph)

## End(Not run)

Generate a list of bootstrap resampled arrays

Description

Generate a list of bootstrap resampled arrays

Usage

boo_list(arr, nboot = dim(arr)[3])
boo_list(arr, nboot = dim(arr)[3])

Arguments

`arr`	3d array with blocked estimates, with blocks in the 3rd dimension
`nboot`	Number of bootstrap iterations

Value

A list of bootstrap resampled arrays, with 3rd dimension equal to nboot

Compare the fit of two qpgraph models

Description

Takes the bootstrap score distribution of two fits on the same populations and tests whether the scores of one graph are significantly higher or lower than the scores of the other graph.

Usage

compare_fits(scores1, scores2)
compare_fits(scores1, scores2)

Arguments

`scores1`	Scores for the first graph. Each score should be for same model evaluated on different bootstrap samples of the SNP blocks. See `qpgraph_resample_multi`
`scores2`	Scores for the second graph, evaluated on the same bootstrap samples as the first graph

Value

A list with statistics comparing the two models

p_emp: The two-sided bootstrap p-value testing whether the scores of one model are higher or lower than the scores of the other model. It is two times the fraction of bootstrap replicates in which model 1 has a lower score than model 2 (or vice-vera, whichever is lower). This is simply a bootstrap test for testing whether some quantity (the the score difference of two models in this case) is significantly different from zero.
p_emp_nocorr: p_emp is truncated to be no less than the reciprocal of the number of bootstrap iterations (the length of the score vectors). p_emp_nocorr is not truncated and can be equal to 0.
ci_low: The 2.5% quantile of distribution of score differences
ci_high: The 97.5% quantile of distribution of score differences
The other items in this list are less important and can be ignored. In particular, p is not as reliable as p_emp because it assumes that the score differences follow a normal distribution.

Examples

## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score, fits[[2]]$score)

## End(Not run)
## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score, fits[[2]]$score)

## End(Not run)

Compare the fit of two qpgraph models

Description

Takes two data frames with model fits computed on two graphs for on the same populations and tests whether the scores of one graph are significantly better than the scores of the other

Usage

compare_fits2(fits1, fits2, boot = FALSE)
compare_fits2(fits1, fits2, boot = FALSE)

Arguments

`fits1`	The fits of the first graph
`fits2`	The fits of the second graph
`boot`	should match the `boot` parameter in `qpgraph_resample_snps` (`FALSE` by default)

Examples

## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fits1 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph1,
                              f2_blocks_test = example_f2_blocks[,,-train])
fits2 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph2,
                              f2_blocks_test = example_f2_blocks[,,-train])
compare_fits2(fit1, fit2)

## End(Not run)
## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fits1 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph1,
                              f2_blocks_test = example_f2_blocks[,,-train])
fits2 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph2,
                              f2_blocks_test = example_f2_blocks[,,-train])
compare_fits2(fit1, fit2)

## End(Not run)

Compare the fit of two qpgraph models

Description

Takes two data frames with model fits computed on two graphs for on the same populations and tests whether the scores of one graph are significantly better than the scores of the other.

Usage

compare_fits4(fit1, fit2, f2_blocks, f2_blocks_test, boot = FALSE, seed = NULL)
compare_fits4(fit1, fit2, f2_blocks, f2_blocks_test, boot = FALSE, seed = NULL)

Arguments

`fit1`	The fit of the first graph
`fit2`	The fit of the second graph
`f2_blocks`	f2 blocks used for fitting `fit1` and `fit2`. Used in combination with `f2_blocks_test` to compute f-statistics covariance matrix.
`f2_blocks_test`	f2 blocks which were not used for fitting `fit1` and `fit2`
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks. If bootstrap resampling is enabled, empirical p-values (`p_emp`) and 95 confidence intervals (`ci_low` and `ci_high`) will be reported.
`seed`	Random seed used if `boot` is `TRUE`. Does not need to match a seed used in fitting the models

Examples

## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fit1 = qpgraph(example_f2_blocks[,,train], graph1)
fit2 = qpgraph(example_f2_blocks[,,train], graph2)
compare_fits4(fit1, fit2, example_f2_blocks[,,train], example_f2_blocks[,,-train])

## End(Not run)
## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fit1 = qpgraph(example_f2_blocks[,,train], graph1)
fit2 = qpgraph(example_f2_blocks[,,train], graph2)
compare_fits4(fit1, fit2, example_f2_blocks[,,train], example_f2_blocks[,,-train])

## End(Not run)

Count SNPs in an f2-statistics array

Description

This function adds up all block lengths (number of SNPs in each SNP blocks), which are stored in the names along the third dimension of the array. When the f2-statistics were computed while setting maxmiss to values greater than 0, it is possible that not all f2-statistics are based on the same number of SNPs. In that case, the value returned by this function is merely an upper bound on the number of SNPs used.

Usage

count_snps(f2_blocks)
count_snps(f2_blocks)

Arguments

f2_blocks

A 3d array of per block f2-statistics

Value

The total number of SNPs across all blocks

Count zero-length edges

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

count_zero_edges(edges, epsilon = 1e-06)
count_zero_edges(edges, epsilon = 1e-06)

Arguments

`edges`	Edges data frame from fitted admixture graph
`epsilon`	Every edge with length

Value

The number of edges with length < epsilon

Examples

## Not run: 
fit = qpgraph(example_f2_blocks, example_igraph)
count_zero_edges(fit$edges)

## End(Not run)
## Not run: 
fit = qpgraph(example_f2_blocks, example_igraph)
count_zero_edges(fit$edges)

## End(Not run)

Find all trees within SPR distance of 1 of all graph component trees

Description

Returns all trees which can be reached through one iteration of subtree-prune-and-regraft on any graph component tree

Usage

decomposed_tree_neighbors(graph)
decomposed_tree_neighbors(graph)

Arguments

graph

An admixture graph

Value

A data frame with all trees

Delete an admixture edge

Description

Delete an admixture edge

Usage

delete_admix(graph, from = NULL, to = NULL)
delete_admix(graph, from = NULL, to = NULL)

Arguments

`graph`	An admixture graph
`from`	Edge source node
`to`	Edge target node

Value

Admixture graph with one deleted edge

Delete groups

Description

This function deletes data for groups created by group_samples

Usage

delete_groups(dir, groups = NULL, verbose = TRUE)
delete_groups(dir, groups = NULL, verbose = TRUE)

Arguments

`dir`	Directory with precomputed individual pair data
`groups`	Groups to delete. Defaults to all groups
`verbose`	Print progress updates

Value

Invisibly returns sample IDs in deleted groups as character vector

Examples

## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)
## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)

Remove population from graph

Description

Remove population from graph

Usage

delete_leaf(graph, leaf)
delete_leaf(graph, leaf)

Arguments

`graph`	An admixture graph
`leaf`	Population to be removed

Value

Admixture graph with removed population

Add two nodes before each admixture node

Description

This is used to revert simplify_graph.

Usage

desimplify_graph(graph)
desimplify_graph(graph)

Arguments

graph

An admixture graph

Examples

simple = simplify_graph(example_igraph)
desimple = desimplify_graph(simple)
plot_graph(simple)
plot_graph(desimple)
simple = simplify_graph(example_igraph)
desimple = desimplify_graph(simple)
plot_graph(simple)
plot_graph(desimple)

Convert data frame graph to igraph

Description

Convert data frame graph to igraph

Usage

edges_to_igraph(edges)
edges_to_igraph(edges)

Arguments

edges

An admixture graph as an edge list data frame

Value

An igraph object

Read allele frequencies from EIGENSTRAT files

Description

Read allele frequencies from EIGENSTRAT files

Usage

eigenstrat_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  numparts = 100,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)
eigenstrat_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  numparts = 100,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of EIGENSTRAT files (files have to end in `.geno`, `.ind`, `.snp`)
`inds`	Individuals from which to compute allele frequencies
`pops`	Populations from which to compute allele frequencies. If `NULL` (default), populations will be extracted from the third column in the `.ind` file. If population labels are provided, they should have the same length as `inds`, and will be matched to them by position
`numparts`	Number of parts into which the genotype file is split. Lowering this number can speed things up, but will take more memory.
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`verbose`	Print progress updates

Value

A list with three data frames: allele frequency data, allele counts, and SNP metadata

Examples

## Not run: 
afdat = eigenstrat_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)
## Not run: 
afdat = eigenstrat_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)

Turn per-block estimates into bootstrap estimates

Description

This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

est_to_boo(arr, nboot = dim(arr)[3], block_lengths = NULL)
est_to_boo(arr, nboot = dim(arr)[3], block_lengths = NULL)

Arguments

`arr`	3d array with blocked estimates, with blocks in the 3rd dimension.
`nboot`	Number of bootstrap iterations

Value

A 3d array with bootstrap estimates. The first two dimensions are equal to those of arr. The 3rd dimension is equal to nboot.

Turn per-block estimates into leave-one-out estimates

Description

This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

est_to_loo(arr, block_lengths = NULL)
est_to_loo(arr, block_lengths = NULL)

Arguments

`arr`	3d array with blocked estimates, with blocks in the 3rd dimension
`block_lengths`	Optional block lengths. If `NULL`, will be parsed from 3rd dimnames in blocks

Value

A 3d array with leave-one-out estimates for jackknife. Dimensions are equal to those of arr.

Data frame with sample annotations

Description

Data frame with sample annotations

Usage

example_anno
example_anno

Format

A data frame with sample annotations

Blocked f2-statistics for 7 populations

Description

Blocked f2-statistics for 7 populations

Usage

example_f2_blocks
example_f2_blocks

Format

A 3d array with populations along the first two dimensions, and SNP blocks along the 3rd dimension

Admixture graph for 7 populations

Description

Admixture graph for 7 populations

Usage

example_graph
example_graph

Format

A two column matrix where each row represents one edge

Admixture graph for 7 populations

Description

Admixture graph for 7 populations

Usage

example_igraph
example_igraph

Format

An igraph object

Data frame with one fitted admixture graph

Description

Data frame with one fitted admixture graph

Usage

example_opt
example_opt

Format

A data frame with a fitted admixture graph, generated by find_graphs_old

example_graph fitted using qpGraph

Description

example_graph fitted using qpGraph

Usage

example_qpgraph_ref_results
example_qpgraph_ref_results

Format

A list with parsed qpGraph output

Data frame with population triples

Description

Data frame with population triples

Usage

example_triples
example_triples

Format

A data frame with population triples, generated by summarize_triples

Compute and store blocked allele frequency data

Description

Prepare data for various ADMIXTOOLS 2 functions. Reads data from packedancestrymap or PLINK files, and computes allele frequencies for selected populations and stores it as .rds files in outdir.

Usage

extract_afs(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  cols_per_chunk = 10,
  numparts = 100,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  auto_only = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)
extract_afs(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  cols_per_chunk = 10,
  numparts = 100,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  auto_only = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`outdir`	Directory where data will be stored.
`inds`	Individuals for which data should be extracted
`pops`	Populations for which data should be extracted. If both `pops` and `inds` are provided, they should have the same length and will be matched by position. If only `pops` is provided, all individuals from the `.ind` or `.fam` file in those populations will be extracted. If only `inds` is provided, each indivdual will be assigned to its own population of the same name. If neither `pops` nor `inds` is provided, all individuals and populations in the `.ind` or `.fam` file will be extracted.
`cols_per_chunk`	Number of populations per chunk. Lowering this number will lower the memory requirements when running `afs_to_f2`, but more chunk pairs will have to be computed.
`numparts`	Number of parts in which genotype data will be read for computing allele frequencies
`maxmiss`	Discard SNPs which are missing in a fraction of populations higher than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`minac2`	Discard SNPs with allele count lower than 2 in any population (default `FALSE`). This option should be set to `TRUE` when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. `minac2 == 2` will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the `minac2` option discards SNPs with allele count lower than 2 in any population, the `qp3pop` function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
`outpop`	Keep only SNPs which are heterozygous in this population
`auto_only`	Keep only SNPs on chromosomes 1 to 22
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`poly_only`	Specify whether SNPs with identical allele frequencies in every population should be discarded (`poly_only = TRUE`), or whether they should be used (`poly_only = FALSE`). By default (`poly_only = c("f2")`), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` treats all samples as diploid and is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`verbose`	Print progress updates

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)
## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)

Compute and store blocked allele frequency data

Description

Prepare data for various ADMIXTOOLS 2 functions. Reads data from packedancestrymap or PLINK files, and computes allele frequencies for selected populations and stores it as .rds files in outdir.

Usage

extract_afs_simple(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)
extract_afs_simple(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`outdir`	Directory where data will be stored.
`inds`	Individuals for which data should be extracted
`pops`	Populations for which data should be extracted. If both `pops` and `inds` are provided, they should have the same length and will be matched by position. If only `pops` is provided, all individuals from the `.ind` or `.fam` file in those populations will be extracted. If only `inds` is provided, each indivdual will be assigned to its own population of the same name. If neither `pops` nor `inds` is provided, all individuals and populations in the `.ind` or `.fam` file will be extracted.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`cols_per_chunk`	Number of populations per chunk. Lowering this number will lower the memory requirements when running `afs_to_f2`, but more chunk pairs will have to be computed.
`maxmiss`	Discard SNPs which are missing in a fraction of populations higher than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`minac2`	Discard SNPs with allele count lower than 2 in any population (default `FALSE`). This option should be set to `TRUE` when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. `minac2 == 2` will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the `minac2` option discards SNPs with allele count lower than 2 in any population, the `qp3pop` function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
`outpop`	Keep only SNPs which are heterozygous in this population
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`poly_only`	Specify whether SNPs with identical allele frequencies in every population should be discarded (`poly_only = TRUE`), or whether they should be used (`poly_only = FALSE`). By default (`poly_only = c("f2")`), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` treats all samples as diploid and is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`verbose`	Print progress updates

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)
## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)

Extract and store data needed to compute blocked f2

Description

Prepare data for various ADMIXTOOLS 2 functions. This function reads data from genotype files, and extracts data required to compute blocked f-statistics for any sets of samples. The data consists of .rds files with total and alternative allele counts for each individual, and products of total and alternative allele counts for each pair. The function calls packedancestrymap_to_afs or plink_to_afs and afs_to_f2_blocks.

Usage

extract_counts(
  pref,
  outdir,
  inds = NULL,
  blgsize = 0.05,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  maxmem = 8000,
  overwrite = FALSE,
  format = NULL,
  cols_per_chunk = NULL,
  verbose = TRUE
)
extract_counts(
  pref,
  outdir,
  inds = NULL,
  blgsize = 0.05,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  maxmem = 8000,
  overwrite = FALSE,
  format = NULL,
  cols_per_chunk = NULL,
  verbose = TRUE
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`outdir`	Directory where data will be stored.
`inds`	Individuals for which data should be read. Defaults to all individuals
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`maxmiss`	Discard SNPs which are missing in a fraction of individuals greater than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`auto_only`	Keep only SNPs on chromosomes 1 to 22
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`maxmem`	Maximum amount of memory to be used. If the required amount of memory exceeds `maxmem`, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.
`overwrite`	Overwrite existing files in `outdir`
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`cols_per_chunk`	Number of genotype chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for `cols_per_chunk` in `extract_afs` is 10. Lower numbers will lower the memory requirement but increase the time it takes.
`verbose`	Print progress updates

Compute and store blocked f2 statistics

Description

This function prepares data for various other ADMIXTOOLS 2 functions. It reads data from genotype files, computes allele frequencies and blocked f2-statistics for selected populations, and writes the results to outdir.

Usage

extract_f2(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  cols_per_chunk = NULL,
  fst = TRUE,
  afprod = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  qpfstats = FALSE,
  n_cores = 1,
  verbose = TRUE,
  ...
)
extract_f2(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  cols_per_chunk = NULL,
  fst = TRUE,
  afprod = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  qpfstats = FALSE,
  n_cores = 1,
  verbose = TRUE,
  ...
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`outdir`	Directory where data will be stored.
`inds`	Individuals for which data should be extracted
`pops`	Populations for which data should be extracted. If both `pops` and `inds` are provided, they should have the same length and will be matched by position. If only `pops` is provided, all individuals from the `.ind` or `.fam` file in those populations will be extracted. If only `inds` is provided, each indivdual will be assigned to its own population of the same name. If neither `pops` nor `inds` is provided, all individuals and populations in the `.ind` or `.fam` file will be extracted.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`maxmem`	Maximum amount of memory to be used. If the required amount of memory exceeds `maxmem`, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.
`maxmiss`	Discard SNPs which are missing in a fraction of populations higher than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`minac2`	Discard SNPs with allele count lower than 2 in any population (default `FALSE`). This option should be set to `TRUE` when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. `minac2 == 2` will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the `minac2` option discards SNPs with allele count lower than 2 in any population, the `qp3pop` function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
`pops2`	If specified, only a pairs between `pops` and `pops2` will be computed
`outpop`	Keep only SNPs which are heterozygous in this population
`outpop_scale`	Scale f2-statistics by the inverse `outpop` heteroygosity (`1/(p(1-p))`). Providing `outpop` and setting `outpop_scale` to `TRUE` will give the same results as the original qpGraph* when the `outpop` parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`auto_only`	Keep only SNPs on chromosomes 1 to 22
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`overwrite`	Overwrite existing files in `outdir`
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` treats all samples as diploid and is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`cols_per_chunk`	Number of allele frequency chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for `cols_per_chunk` in `extract_afs` is 10. Lower numbers will lower the memory requirement but increase the time it takes.
`fst`	Write files with pairwise FST for every population pair. Setting this to FALSE can make `extract_f2` faster and will require less memory.
`afprod`	Write files with allele frequency products for every population pair. Setting this to FALSE can make `extract_f2` faster and will require less memory.
`poly_only`	Specify whether SNPs with identical allele frequencies in every population should be discarded (`poly_only = TRUE`), or whether they should be used (`poly_only = FALSE`). By default (`poly_only = c("f2")`), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
`apply_corr`	Apply small-sample-size correction when computing f2-statistics (default `TRUE`)
`qpfstats`	Compute smoothed f2-statistics (default `FALSE`). In the presence of large amounts of missing data, this option can be used to retain information from all SNPs while introducing less bias than setting `maxmiss` to values greater than 0. When setting `qpfstats = TRUE`, most other options to `extract_f2` will be ignored. See `qpfstats` for more information. Arguments to `qpfstats` can be passed via `...`
`n_cores`	Parallelize computation across `n_cores` cores via the `doParallel` package.
`verbose`	Print progress updates
`...`	Pass arguments to `qpfstats`

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)
## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)

Compute and store blocked f2 statistics

Description

extract_f2_large does the same as extract_f2, but it requires less memory and is slower. outdir has to be set in extract_f2_large.

Usage

extract_f2_large(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  snpblocks = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  verbose = TRUE
)
extract_f2_large(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  snpblocks = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`outdir`	Directory where data will be stored.
`inds`	Individuals for which data should be extracted
`pops`	Populations for which data should be extracted. If both `pops` and `inds` are provided, they should have the same length and will be matched by position. If only `pops` is provided, all individuals from the `.ind` or `.fam` file in those populations will be extracted. If only `inds` is provided, each indivdual will be assigned to its own population of the same name. If neither `pops` nor `inds` is provided, all individuals and populations in the `.ind` or `.fam` file will be extracted.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`cols_per_chunk`	Number of populations per chunk. Lowering this number will lower the memory requirements when running
`maxmiss`	Discard SNPs which are missing in a fraction of populations higher than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`minac2`	Discard SNPs with allele count lower than 2 in any population (default `FALSE`). This option should be set to `TRUE` when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. `minac2 == 2` will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the `minac2` option discards SNPs with allele count lower than 2 in any population, the `qp3pop` function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).
`outpop`	Keep only SNPs which are heterozygous in this population
`outpop_scale`	Scale f2-statistics by the inverse `outpop` heteroygosity (`1/(p(1-p))`). Providing `outpop` and setting `outpop_scale` to `TRUE` will give the same results as the original qpGraph* when the `outpop` parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`overwrite`	Overwrite existing files in `outdir`
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` treats all samples as diploid and is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`afprod`	Write files with allele frequency products for every population pair. Setting this to FALSE can make `extract_f2` faster and will require less memory.
`fst`	Write files with pairwise FST for every population pair. Setting this to FALSE can make `extract_f2` faster and will require less memory.
`poly_only`	Specify whether SNPs with identical allele frequencies in every population should be discarded (`poly_only = TRUE`), or whether they should be used (`poly_only = FALSE`). By default (`poly_only = c("f2")`), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
`apply_corr`	Apply small-sample-size correction when computing f2-statistics (default `TRUE`)
`verbose`	Print progress updates

Details

extract_f2_large requires less memory because it writes allele frequency data to disk, and doesn't store the allele frequency matrix for all populations and SNPs in memory. If you still run out of memory, reduce cols_per_chunk. This function is a wrapper around extract_afs and afs_to_f2, and is slower than extract_f2. It may be faster to call extract_afs and afs_to_f2 directly, parallelizing over the different calls to afs_to_f2.

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2_large(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)
## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2_large(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)

Copy f2-statistics

Description

Copy a subset of f2-statistics to a new directory

Usage

extract_f2_subset(from, to, pops, verbose = TRUE)
extract_f2_subset(from, to, pops, verbose = TRUE)

Arguments

`from`	Directory with f2-statistics
`to`	Target directory
`pops`	Populations to copy
`verbose`	Print progress updates

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_f2_subset(pref, outdir)

## End(Not run)
## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_f2_subset(pref, outdir)

## End(Not run)

Extract samples from PLINK files

Description

This function reads PLINK files, extracts a subset of samples, and writes new PLINK files using write_plink. It's probably slower than running the equivalent command in PLINK directly, but it can be useful to do this from within R. When inds or pops is provided, only a subset of samples will be extracted.

Usage

extract_samples(
  inpref,
  outpref,
  inds = NULL,
  pops = NULL,
  overwrite = FALSE,
  verbose = TRUE
)
extract_samples(
  inpref,
  outpref,
  inds = NULL,
  pops = NULL,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

`inpref`	Prefix of the PLINK input files
`outpref`	Prefix of the PLINK output files
`inds`	Individuals which should be extracted
`pops`	Populations which should be extracted. Can not be provided together with `inds`
`overwrite`	Set this to `TRUE` if `inpref == outpref` and you really want to overwrite the input files.
`verbose`	Print progress updates

Estimate f2 statistics

Description

Computes f2 statistics from f2 blocks of the form $f2(A, B)$

Usage

f2(
  data,
  pop1 = NULL,
  pop2 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  verbose = FALSE,
  ...
)
f2(
  data,
  pop1 = NULL,
  pop2 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  verbose = FALSE,
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pop1`	One of the following four: `NULL`: all possible population combinations will be returned A vector of population labels. All combinations with the other `pop` arguments will be returned A matrix with population combinations to be tested, with one population per column and one combination per row. Other `pop` arguments will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. Other `pop` arguments will be ignored.
`pop2`	A vector of population labels
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`sure`	The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.
`unique_only`	If `TRUE` (the default), redundant combinations will be excluded
`verbose`	Print progress updates
`...`	Additional arguments passed to `f2_from_geno` when `data` is a genotype prefix

Value

f2 returns a data frame with f2 statistics

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
f2(example_f2_blocks, pop1, pop2)
## Not run: 
f2(f2_dir, pop1, pop2)

## End(Not run)
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
f2(example_f2_blocks, pop1, pop2)
## Not run: 
f2(f2_dir, pop1, pop2)

## End(Not run)

Compute blocked f2 statistics

Description

Usage

f2_from_geno(
  pref,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  afprod = FALSE,
  fst = FALSE,
  poly_only = c("f2"),
  format = NULL,
  adjust_pseudohaploid = TRUE,
  remove_na = TRUE,
  apply_corr = TRUE,
  qpfstats = FALSE,
  verbose = TRUE,
  ...
)
f2_from_geno(
  pref,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  afprod = FALSE,
  fst = FALSE,
  poly_only = c("f2"),
  format = NULL,
  adjust_pseudohaploid = TRUE,
  remove_na = TRUE,
  apply_corr = TRUE,
  qpfstats = FALSE,
  verbose = TRUE,
  ...
)

Arguments

`pref`	Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in `.geno`, `.snp`, `.ind`, PLINK has to end in `.bed`, `.bim`, `.fam`
`inds`	Individuals for which data should be extracted
`pops`	Populations for which data should be extracted. If both `pops` and `inds` are provided, they should have the same length and will be matched by position. If only `pops` is provided, all individuals from the `.ind` or `.fam` file in those populations will be extracted. If only `inds` is provided, each indivdual will be assigned to its own population of the same name. If neither `pops` nor `inds` is provided, all individuals and populations in the `.ind` or `.fam` file will be extracted.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`maxmem`	Maximum amount of memory to be used. If the required amount of memory exceeds `maxmem`, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.
`maxmiss`	Discard SNPs which are missing in a fraction of populations higher than `maxmiss`
`minmaf`	Discard SNPs with minor allele frequency less than `minmaf`
`maxmaf`	Discard SNPs with minor allele frequency greater than than `maxmaf`
`pops2`	If specified, only a pairs between `pops` and `pops2` will be computed
`outpop`	Keep only SNPs which are heterozygous in this population
`outpop_scale`	Scale f2-statistics by the inverse `outpop` heteroygosity (`1/(p(1-p))`). Providing `outpop` and setting `outpop_scale` to `TRUE` will give the same results as the original qpGraph* when the `outpop` parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.
`transitions`	Set this to `FALSE` to exclude transition SNPs
`transversions`	Set this to `FALSE` to exclude transversion SNPs
`auto_only`	Keep only SNPs on chromosomes 1 to 22
`keepsnps`	SNP IDs of SNPs to keep. Overrides other SNP filtering options
`afprod`	Return negative average allele frequency products instead of f2-statistics. Setting `afprod = TRUE` will result in more precise f4-statistics when the original data had large amounts of missingness, and should be used in that case for `qpdstat` and `qpadm`. It can also be used for outgroup f3-statistics with a fixed outgroup (for example for `qpgraph`); values will be shifted by a constant amount compared to regular f3-statistics. This shift affects the fit of a graph only by small amounts, possibly less than bias in regular f3-statistics introduced by large amounts of missing data.
`fst`	Write files with pairwise FST for every population pair. Setting this to FALSE can make `extract_f2` faster and will require less memory.
`poly_only`	Specify whether SNPs with identical allele frequencies in every population should be discarded (`poly_only = TRUE`), or whether they should be used (`poly_only = FALSE`). By default (`poly_only = c("f2")`), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).
`format`	Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be `plink` to read `.bed`, `.bim`, `.fam` files, or `eigenstrat`, or `packedancestrymap` to read `.geno`, `.snp`, `.ind` files.
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` treats all samples as diploid and is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`apply_corr`	Apply small-sample-size correction when computing f2-statistics (default `TRUE`)
`qpfstats`	Compute smoothed f2-statistics (default `FALSE`). In the presence of large amounts of missing data, this option can be used to retain information from all SNPs while introducing less bias than setting `maxmiss` to values greater than 0. When setting `qpfstats = TRUE`, most other options to `extract_f2` will be ignored. See `qpfstats` for more information. Arguments to `qpfstats` can be passed via `...`
`verbose`	Print progress updates
`...`	Pass arguments to `qpfstats`

Value

A 3d array of f2-statistics (or scaled allele frequency products if afprod = TRUE)

Simulate an admixture graph in msprime

Description

This function generates an msprime simulation script, executes it in python, and turns the resulting genotype data into f2-statistics

Usage

f2_from_msprime(..., blgsize = 0.05, cleanup = TRUE, verbose = TRUE)
f2_from_msprime(..., blgsize = 0.05, cleanup = TRUE, verbose = TRUE)

Read blocked f2 statistics from disk

Description

Read blocked f2 statistics from disk

Usage

f2_from_precomp(
  dir,
  inds = NULL,
  pops = NULL,
  pops2 = NULL,
  afprod = FALSE,
  fst = FALSE,
  return_array = TRUE,
  apply_corr = TRUE,
  remove_na = TRUE,
  verbose = TRUE
)
f2_from_precomp(
  dir,
  inds = NULL,
  pops = NULL,
  pops2 = NULL,
  afprod = FALSE,
  fst = FALSE,
  return_array = TRUE,
  apply_corr = TRUE,
  remove_na = TRUE,
  verbose = TRUE
)

Arguments

`dir`	Directory with precomputed f2 statistics, or precomputed individual pair data
`inds`	Individuals for which data should be read. Defaults to all individuals, which may require a lot of memory.
`pops`	Populations for which data should be read. Defaults to all populations, which may require a lot of memory.
`pops2`	Optional second vector of populations. Useful if a f4 statistics of a few against many populations should be computed. `pops2` should not be specified in other cases, as most functions depend on f2-statistics for all population pairs in `pops`.
`afprod`	Return negative average allele frequency products instead of f2 estimates. This will result in more precise f4-statistics when the original data had large amounts of missingness, and should be used in that case for `qpdstat` and `qpadm`. It can also be used for outgroup f3-statistics with a fixed outgroup (for example for `qpgraph`); values will be shifted by a constant amount compared to regular f3-statistics. This shift affects the fit of a graph only by small amounts, possibly less than bias in regular f3-statistics introduced by large amounts of missing data. This option is currently ineffective when reading data extracted with `extract_counts`.
`return_array`	Return a 3d array (default). If false, a data frame will be returned.
`apply_corr`	Subtract the f2 correction factor. Setting this to `FALSE` can occasionally be useful
`remove_na`	Remove blocks with missing values
`verbose`	Print progress updates

Value

A 3d array of f2 statistics

Examples

## Not run: 
dir = 'my/f2/dir/'
f2_blocks = f2_from_precomp(dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)
## Not run: 
dir = 'my/f2/dir/'
f2_blocks = f2_from_precomp(dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)

Turn f2 data to f4 data

Description

Turn f2 data to f4 data

Usage

f2dat_f4dat(f2dat, popcomb = NULL)
f2dat_f4dat(f2dat, popcomb = NULL)

Arguments

f2dat

A data frame of f2-statistics with columns pop1, pop2, f2

Value

A data frame with f4-statistics

f3 from genotype data

Description

Compute per-block f3-statistics directly from genotype data

Usage

f3blockdat_from_geno(
  pref,
  popcombs,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  allsnps = FALSE,
  adjust_pseudohaploid = TRUE,
  poly_only = FALSE,
  apply_corr = TRUE,
  outgroupmode = FALSE,
  verbose = TRUE
)
f3blockdat_from_geno(
  pref,
  popcombs,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  allsnps = FALSE,
  adjust_pseudohaploid = TRUE,
  poly_only = FALSE,
  apply_corr = TRUE,
  outgroupmode = FALSE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of genotype files
`popcombs`	A data frame with one population combination per row, and columns `pop1`, `pop2`, `pop3`, `pop4`. If there is an additional integer column named `model` and `allsnps = FALSE`, only SNPs present in every population in any given model will be used to compute f4-statistics for that model.
`auto_only`	Use only chromosomes 1 to 22.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`block_lengths`	An optional vector with block lengths. If `NULL`, block lengths will be computed.
`allsnps`	Use all SNPs with allele frequency estimates in every population of any given population quadruple. If `FALSE` (the default) only SNPs which are present in all populations in `popcombs` (or any given model in it) will be used. Setting `allsnps = TRUE` in the presence of large amounts of missing data might lead to false positive results.
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`apply_corr`	With `apply_corr = FALSE`, no bias correction is performed. With `apply_corr = TRUE` (the default), a bias correction term based on the heterozygosity in the first population is subtracted from the f3 estimate. With `apply_corr = 2`, the bias correction term is calculated based on all 3 populations. This option is not generally recommended, and only exists to match how the f3-statistics are estimated in certain scenarios in the original qpGraph program.
`outgroupmode`	With `outgroupmode = FALSE`, estimates of f3 will be normalized by estimates of the heterozygosity of the target population. This is the default option if the first argument is the prefix of genotype data. If the first argument is an array of precomputed f2-statistics, then no normalization can be performed, which corresponds to `outgroupmode = TRUE`.
`verbose`	Print progress updates

Value

A data frame with per-block f4-statistics for each population quadruple.

Compute f4 from allele frequencies

Description

Compute f4 from allele frequencies

Usage

f4_from_afdat(afdat, popcombs)
f4_from_afdat(afdat, popcombs)

Arguments

`afdat`	A data frame with allele frequencies and SNP metadata. Can be grouped.
`popcombs`	A data frame with population combinations. Columns `pop1` to `pop4`

Examples

## Not run: 
# Compute f4 for all mutatation classes separately
afs = plink_to_afs('/my/geno/prefix', pops = c('p1', 'p2', 'p3', 'p4', 'p5'))
afdat = bind_cols(afs$snpfile, afs$afs %>% as_tibble()) %>%
        mutate(gr = paste0(pmin(A1, A2), pmax(A1, A2))) %>%
        group_by(gr)
popcombs = tibble(pop1 = c('p1', 'p5'), pop2 = 'p2', pop3 = 'p3', pop4 = 'p4')
out = f4_from_afdat(afdat, popcombs)
out %>% ggplot(aes(gr, est)) + geom_point() +
          geom_errorbar(aes(ymin = est - se, ymax = est + se)) +
          facet_wrap(~paste(pop1, pop2, pop3, pop4), scales = 'free')

## End(Not run)
## Not run: 
# Compute f4 for all mutatation classes separately
afs = plink_to_afs('/my/geno/prefix', pops = c('p1', 'p2', 'p3', 'p4', 'p5'))
afdat = bind_cols(afs$snpfile, afs$afs %>% as_tibble()) %>%
        mutate(gr = paste0(pmin(A1, A2), pmax(A1, A2))) %>%
        group_by(gr)
popcombs = tibble(pop1 = c('p1', 'p5'), pop2 = 'p2', pop3 = 'p3', pop4 = 'p4')
out = f4_from_afdat(afdat, popcombs)
out %>% ggplot(aes(gr, est)) + geom_point() +
          geom_errorbar(aes(ymin = est - se, ymax = est + se)) +
          facet_wrap(~paste(pop1, pop2, pop3, pop4), scales = 'free')

## End(Not run)

Get per-block f4-statistics

Description

This function turns per-block f2-statistics into per-block f4-statistics of the form ⁠f4(pop1, pop2; pop3, pop4)⁠

Usage

f4_from_f2(f2_data, pop1, pop2 = NULL, pop3 = NULL, pop4 = NULL)
f4_from_f2(f2_data, pop1, pop2 = NULL, pop3 = NULL, pop4 = NULL)

Arguments

`f2_data`	A 3d array with blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` Alternatively, a directory with precomputed data. See `extract_f2` and `extract_counts`.
`pop1`	Either the name(s) of the first population(s), or a four column matrix with the names of all four populations.
`pop2`	Population 2 (same length as `pop1`)
`pop3`	Population 3 (same length as `pop1`)
`pop4`	Population 4 (same length as `pop1`)

Value

A matrix of per-block f4-statistics (⁠popcomb x block⁠)

f4 from genotype data

Description

Compute per-block f4-statistics directly from genotype data

Usage

f4blockdat_from_geno(
  pref,
  popcombs = NULL,
  left = NULL,
  right = NULL,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  f4mode = TRUE,
  allsnps = FALSE,
  poly_only = FALSE,
  snpwt = NULL,
  keepsnps = NULL,
  verbose = TRUE
)
f4blockdat_from_geno(
  pref,
  popcombs = NULL,
  left = NULL,
  right = NULL,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  f4mode = TRUE,
  allsnps = FALSE,
  poly_only = FALSE,
  snpwt = NULL,
  keepsnps = NULL,
  verbose = TRUE
)

Arguments

`pref`	Prefix of genotype files
`popcombs`	A data frame with one population combination per row, and columns `pop1`, `pop2`, `pop3`, `pop4`. If there is an additional integer column named `model` and `allsnps = FALSE`, only SNPs present in every population in any given model will be used to compute f4-statistics for that model.
`left`	Populations on the left side of f4 (`pop1` and `pop2`). Can be provided together with `right` in place of `popcombs`.
`right`	Populations on the right side of f4 (`pop3` and `pop4`). Can be provided together with `left` in place of `popcombs`.
`auto_only`	Use only chromosomes 1 to 22.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`block_lengths`	An optional vector with block lengths. If `NULL`, block lengths will be computed.
`f4mode`	If `TRUE`: f4 is computed from allele frequencies `a`, `b`, `c`, and `d` as `(a-b)(c-d)`. if `FALSE`, D-statistics are computed instead, defined as `(a-b)(c-d) / ((a + b - 2ab) * (c + d - 2cd))`, which is the same as `(P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))`.
`allsnps`	Use all SNPs with allele frequency estimates in every population of any given population quadruple. If `FALSE` (the default) only SNPs which are present in all populations in `popcombs` (or any given model in it) will be used. Setting `allsnps = TRUE` in the presence of large amounts of missing data might lead to false positive results.
`poly_only`	Only keep SNPs with mean allele frequency not equal to 0 or 1 (default `FALSE`).
`snpwt`	A vector of SNP weights
`keepsnps`	A vector of SNP IDs to keep
`verbose`	Print progress updates

Value

A data frame with per-block f4-statistics for each population quadruple.

Turn f4 block data to 3d array

Description

Turn f4 block data to 3d array

Usage

f4blockdat_to_f4blocks(f4blockdat, remove_na = TRUE)
f4blockdat_to_f4blocks(f4blockdat, remove_na = TRUE)

Arguments

`f4blockdat`	f4 block data frame generated by `f4blockdat_from_geno`
`remove_na`	Remove blocks with missing values

Find admixture edges

Description

Find admixture edges

Usage

find_admixedges(graph)
find_admixedges(graph)

Arguments

graph

An admixture graph

Value

A data frame with columns from and to with admixture edges

Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

find_graphs(
  data,
  numadmix = 0,
  outpop = NULL,
  stop_gen = 100,
  stop_gen2 = 15,
  stop_score = 0,
  stop_sec = NULL,
  initgraph = NULL,
  numgraphs = 10,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, place_root_random, mutate_n),
  opt_worst_residual = FALSE,
  plusminus_generations = 5,
  return_searchtree = FALSE,
  admix_constraints = NULL,
  event_constraints = NULL,
  reject_f4z = 0,
  max_admix = numadmix,
  verbose = TRUE,
  ...
)
find_graphs(
  data,
  numadmix = 0,
  outpop = NULL,
  stop_gen = 100,
  stop_gen2 = 15,
  stop_score = 0,
  stop_sec = NULL,
  initgraph = NULL,
  numgraphs = 10,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, place_root_random, mutate_n),
  opt_worst_residual = FALSE,
  plusminus_generations = 5,
  return_searchtree = FALSE,
  admix_constraints = NULL,
  event_constraints = NULL,
  reject_f4z = 0,
  max_admix = numadmix,
  verbose = TRUE,
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` A directory which contains pre-computed f2-statistics The prefix of genotype files
`numadmix`	Number of admixture events within each graph. (Only relevant if `initgraph = NULL`)
`outpop`	Name of the outgroup population
`stop_gen`	Total number of generations after which to stop
`stop_gen2`	Number of generations without improvement after which to stop
`stop_score`	Stop once this score has been reached
`stop_sec`	Number of seconds after which to stop
`initgraph`	Graph to start with. If it is specified, `numadmix` and `outpop` will be inferred from this graph.
`numgraphs`	Number of graphs in each generation
`mutfuns`	Functions used to modify graphs. Defaults to the following: `spr_leaves`: Subtree prune and regraft leaves. Cuts a leaf node and attaches it to a random other edge in the graph. `spr_all`: Subtree prune and regraft. Cuts any edge and attaches the new orphan node to a random other edge in the graph, keeping the number of admixture nodes constant. `swap_leaves`: Swaps two leaf nodes. `move_admixedge_once`: Moves an admixture edge to a nearby location. `flipadmix_random`: Flips the direction of an admixture edge (if possible). `mutate_n`: Apply `n` of the mutation functions in this list to a graph (defaults to 2).
`opt_worst_residual`	Optimize for lowest worst residual instead of best score. `FALSE` by default, because the likelihood score is generally a better indicator of the quality of the model fit, and because optimizing for the lowest worst residual is slower (because f4-statistics need to be computed).
`plusminus_generations`	If the best score does not improve after `plusminus_generations` generations, another approach to improving the score will be attempted: A number of graphs with on additional admixture edge will be generated and evaluated. The resulting graph with the best score will be picked, and new graphs will be created by removing any one admixture edge (bringing the number back to what it was originally). The graph with the lowest score will then be selected. This often makes it possible to break out of local optima, but is slower than regular graph modifications. If the current number of admixture events is lower than `max_numadmix`, the last step (removing an admixture edge) will be skipped.
`return_searchtree`	Return the search tree in addition to the models. Output will be a list with three items: models, search tree, search tree as data frame
`admix_constraints`	A data frame with constraints on the number of admixture events for each population. See `satisfies_numadmix` As soon as one graph happens to satisfy these constraints, all subsequently generated graphs will be required to also satisfy them.
`event_constraints`	A data frame with constraints on the order of events in an admixture graph. See `satisfies_eventorder` As soon as one graph happens to satisfy these constraints, all subsequently generated graphs will be required to also satisfy them.
`reject_f4z`	If this is a number greater than zero, all f4-statistics with `abs(z) > reject_f4z` will be used to constrain the search space of admixture graphs: Any graphs in which f4-statistics greater than `reject_f4z` are expected to be zero will not be evaluated.
`max_admix`	Maximum number of admixture edges. By default, this number is equal to `numadmix`, or to the number of admixture edges in `initgraph`, so the number of admixture edges stays constant. Setting this to a higher number will lead to more admixture edges being added occasionally (see `plusminus_generations`). Graphs with additional admixture edges will only be accepted if they improve the score by 5% or more.
`verbose`	Print progress updates
`...`	Additional arguments passed to `qpgraph`

Value

A nested data frame with one model per line

Examples

## Not run: 
res = find_graphs(example_f2_blocks, numadmix = 2)
res %>% slice_min(score)

## End(Not run)
## Not run: 
# Start with a graph with 0 admixture events, increase up to 3, and stop after 10 generations of no improvement
pops = dimnames(example_f2_blocks)[[1]]
initgraph = random_admixturegraph(pops, 0, outpop = 'Chimp.REF')
res = find_graphs(example_f2_blocks, initgraph = initgraph, stop_gen2 = 10, max_admix = 3)
res %>% slice_min(score)

## End(Not run)
## Not run: 
res = find_graphs(example_f2_blocks, numadmix = 2)
res %>% slice_min(score)

## End(Not run)
## Not run: 
# Start with a graph with 0 admixture events, increase up to 3, and stop after 10 generations of no improvement
pops = dimnames(example_f2_blocks)[[1]]
initgraph = random_admixturegraph(pops, 0, outpop = 'Chimp.REF')
res = find_graphs(example_f2_blocks, initgraph = initgraph, stop_gen2 = 10, max_admix = 3)
res %>% slice_min(score)

## End(Not run)

Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations across numrep independent repeats to find well fitting admixturegraphs. It uses the function future_map to parallelize across the independent repeats. The function plan can be called to specify the details of the parallelization. This can be used to parallelize across cores or across nodes on a compute cluster. Setting numadmix to 0 will search for well fitting trees, which is much faster than searching for admixture graphs with many admixture nodes.

Usage

find_graphs_old(
  data,
  pops = NULL,
  outpop = NULL,
  numrep = 1,
  numgraphs = 50,
  numgen = 5,
  numsel = 5,
  numadmix = 0,
  numstart = 1,
  keep = c("all", "best", "last"),
  initgraphs = NULL,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, mutate_n),
  mutprobs = NULL,
  opt_worst_residual = FALSE,
  store_intermediate = NULL,
  parallel = TRUE,
  stop_after = NULL,
  verbose = TRUE,
  ...
)
find_graphs_old(
  data,
  pops = NULL,
  outpop = NULL,
  numrep = 1,
  numgraphs = 50,
  numgen = 5,
  numsel = 5,
  numadmix = 0,
  numstart = 1,
  keep = c("all", "best", "last"),
  initgraphs = NULL,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, mutate_n),
  mutprobs = NULL,
  opt_worst_residual = FALSE,
  store_intermediate = NULL,
  parallel = TRUE,
  stop_after = NULL,
  verbose = TRUE,
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` A directory which contains pre-computed f2-statistics The prefix of genotype files
`pops`	Populations for which to fit admixture graphs (default all)
`outpop`	An outgroup population which will split at the root from all other populations in all tested graphs. If one of the populations is know to be an outgroup, designating it as `outpop` will greatly reduce the search space compared to including it and not designating it as `outpop`.
`numrep`	Number of independent repetitions (each repetition can be run in parallel)
`numgraphs`	Number of graphs in each generation
`numgen`	Number of generations
`numsel`	Number of graphs which are selected in each generation. Should be less than `numgraphs`.
`numadmix`	Number of admixture events within each graph
`numstart`	Number of random initializations in each call to `qpgraph`. Defaults to 1, to speed up the graph optimization.
`keep`	Which models should be returned. One of `all`, `best`, `last` `all` (default): Return all evaluated graphs `best`: Return only the best fitting graph from each repeat and each generation `last`: Return all graphs from the last generation
`initgraphs`	Optional graph or list of igraphs to start with. If `NULL`, optimization will start with random graphs.
`mutfuns`	Functions used to modify graphs. Defaults to the following: `spr_leaves`: Subtree prune and regraft leaves. Cuts a leaf node and attaches it to a random other edge in the graph. `spr_all`: Subtree prune and regraft. Cuts any edge and attaches the new orphan node to a random other edge in the graph, keeping the number of admixture nodes constant. `swap_leaves`: Swaps two leaf nodes. `move_admixedge_once`: Moves an admixture edge to a nearby location. `flipadmix_random`: Flips the direction of an admixture edge (if possible). `mutate_n`: Apply `n` of the mutation functions in this list to a graph (defaults to 2). See examples for how to make new mutation functions.
`mutprobs`	Relative frequencies of each mutation function. `NULL` (default) means each mutation function is picked with equal probability A numeric vector of length equal to `mutfuns` defines the relative frequency of each mutation function A matrix of dimensions `numgen` x `length(mutfuns)` defines the relative frequency of each mutation function in each generation
`opt_worst_residual`	Optimize for lowest worst residual instead of best score. `FALSE` by default, because the likelihood score is generally a better indicator of the quality of the model fit. Optimizing for the lowest worst residual is also slower (because f4-statistics need to be computed).
`store_intermediate`	Path and prefix of files for intermediate results to `.rds`. Can be useful if `find_graphs_old` doesn't finish sucessfully.
`parallel`	Parallelize over repeats (if `numrep > 1`) or graphs (if `numrep == 1`) by replacing `map` with `future_map`. Will only be effective if `plan` has been set.
`stop_after`	Stop optimization after `stop_after` seconds (and after finishing the current generation).
`verbose`	Print progress updates
`...`	Additional arguments passed to `qpgraph`

Value

A nested data frame with one model per line

Examples

## Not run: 
find_graphs_old(example_f2_blocks, numrep = 200, numgraphs = 100,
            numgen = 20, numsel = 5, numadmix = 3)

## End(Not run)
## Not run: 
# Making new mutation functions by modifying or combining existing ones:
newfun1 = function(graph, ...) mutate_n(graph, 3, ...)
newfun2 = function(graph, ...) flipadmix_random(spr_leaves(graph, ...), ...)
find_graphs_old(f2_blocks, mutfuns = namedList(spr_leaves, newfun1, newfun2), mutprobs = c(0.2, 0.3, 0.5))

## End(Not run)
## Not run: 
find_graphs_old(example_f2_blocks, numrep = 200, numgraphs = 100,
            numgen = 20, numsel = 5, numadmix = 3)

## End(Not run)
## Not run: 
# Making new mutation functions by modifying or combining existing ones:
newfun1 = function(graph, ...) mutate_n(graph, 3, ...)
newfun2 = function(graph, ...) flipadmix_random(spr_leaves(graph, ...), ...)
find_graphs_old(f2_blocks, mutfuns = namedList(spr_leaves, newfun1, newfun2), mutprobs = c(0.2, 0.3, 0.5))

## End(Not run)

Find possible new edges

Description

Find possible new edges

Usage

find_newedges(graph, fix_outgroup = TRUE, all = TRUE)
find_newedges(graph, fix_outgroup = TRUE, all = TRUE)

Arguments

graph

An admixture graph

Value

A data frame with columns from and to. New edges which begin above from and end above to could be inserted

Find drift edges

Description

Find drift edges

Usage

find_normedges(graph, exclude_first = FALSE)
find_normedges(graph, exclude_first = FALSE)

Arguments

`graph`	An admixture graph
`exclude_first`	Do not return edge from root to outgroup

Value

A data frame with columns from and to with drift edges

Modify a graph flipping the direction of an admixture edge

Description

Modify a graph flipping the direction of an admixture edge

Usage

flipadmix_random(graph, fix_outgroup = TRUE)
flipadmix_random(graph, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`fix_outgroup`	Keep outgroup in place (has no effect here)

Value

A new admixture graph

Compute Fst

Description

This function computes pairwise Fst from genotype files, or from precomputed per-block Fst-statistics (not to be confused with per-block f2-statistics, which are used by most other functions). See details for how Fst is computed.

Usage

fst(data, pop1 = NULL, pop2 = NULL, boot = FALSE, verbose = FALSE, ...)
fst(data, pop1 = NULL, pop2 = NULL, boot = FALSE, verbose = FALSE, ...)

Arguments

`data`	Input data in one of three forms: The prefix of genotype files A directory with pre-computed Fst `.rds` files generated by `extract_f2` A 3d array of blocked Fst, output of `f2_from_precomp` with `fst = TRUE`. This is not recommended, since pre-computed Fst blocks are not used by any other functions, and they can be mixed up with the more commonly used pre-computed f2 blocks.
`pop1`	One of the following four: `NULL`: all possible population combinations will be returned A vector of population labels. All combinations with the other `pop` arguments will be returned A matrix with population combinations to be tested, with one population per column and one combination per row. Other `pop` arguments will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. Other `pop` arguments will be ignored.
`pop2`	A vector of population labels
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`verbose`	Print progress updates
`...`	Additional arguments passed to `f2_from_geno` when `data` is a genotype prefix

Details

The Hudson Fst estimator used here is described in the two publications below. For two populations with estimated allele frequency vectors p1 and p2, and allele count vectors n1 and n2, it is calculated as follows:

num = (p1 - p2)^2 - p1*(1-p1)/(n1-1) - p2*(1-p2)/(n2-1)
denom = p1 + p2 - 2*p1*p2
fst = mean(num)/mean(denom)

This is done independently for each SNP block, and is stored on disk for each population pair. Jackknifing or bootstrapping across these per-block estimates yields the overall estimates and standard errors.

References

Reich, D. (2009) Reconstructing Indian population history Nature

Bhatia, G. (2013) Estimating and interpreting Fst: the impact of rare variants Genome Research

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
fst(f2_dir, pop1, pop2)

## End(Not run)
## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
fst(f2_dir, pop1, pop2)

## End(Not run)

Generate all graphs

Description

This functions generates all possible admixture graphs with a set number of admixture events for a given set of leaf nodes. It's pretty slow, and may not terminate in reasonable time for more than 5 leaves and 2 admixture events. The function is similar to the all_graphs function in the admixturegraph package, but there are a few differences:

The function does not return graphs with fewer than nadmix admixture events
The function does not return most graphs which are unidentifiable and would have equal fits as simpler identifiable graphs (for example it does not return graphs where a node is expanded to a loop)
The function does not return duplicated graphs, as identified by the graph_hash function
The function generates unique graphs which are missing in the output of all_graphs

Usage

generate_all_graphs(leaves, nadmix = 0, verbose = TRUE)
generate_all_graphs(leaves, nadmix = 0, verbose = TRUE)

Arguments

`leaves`	The leaf nodes
`nadmix`	The number of admixture nodes
`verbose`	Print progress updates

Value

A list of graphs in igraph format

Examples

## Not run: 
graphs = generate_all_graphs(letters[1:4], 1)

## End(Not run)
## Not run: 
graphs = generate_all_graphs(letters[1:4], 1)

## End(Not run)

Generate all trees

Description

This functions generates all possible trees with for a given set of leaf nodes.

Usage

generate_all_trees(leaves)
generate_all_trees(leaves)

Arguments

leaves

The leaf nodes

Value

A list of trees in igraph format

Find LD-independent blocks

Description

A new block begins at the SNP after the first SNP which is not within blgsize of the start of the last block. dat needs to be ordered first by 'CHR', then by 'POS' or 'cm'

Usage

get_block_lengths(dat, blgsize = 0.05, cpp = TRUE, verbose = TRUE)
get_block_lengths(dat, blgsize = 0.05, cpp = TRUE, verbose = TRUE)

Arguments

`dat`	Data frame with columns 'CHR' and either 'POS' or 'cm'
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). If `blgsize` is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.
`cpp`	Should the faster C++ version be used?
`verbose`	Print progress updates

Value

A numeric vector where the ith element lists the number of SNPs in the ith block.

Examples

## Not run: 
prefix = 'path/to/packedancestrymap_prefix'
pops = c('pop1', 'pop2', 'pop3')
afdat = packedancestrymap_to_afs(prefix, pops = pops)
block_lengths = get_block_lengths(afdat)

## End(Not run)
## Not run: 
prefix = 'path/to/packedancestrymap_prefix'
pops = c('pop1', 'pop2', 'pop3')
afdat = packedancestrymap_to_afs(prefix, pops = pops)
block_lengths = get_block_lengths(afdat)

## End(Not run)

Turns f2_data into f2_blocks

Description

Turns f2_data into f2_blocks

Usage

get_f2(f2_data, pops = NULL, pops2 = NULL, afprod = FALSE, verbose = TRUE, ...)
get_f2(f2_data, pops = NULL, pops2 = NULL, afprod = FALSE, verbose = TRUE, ...)

Arguments

`f2_data`	f2 data as genotype file prefix, f2 directory, or f2 blocks
`pops`	Populations for which to extract f2-stats (defaults to all)
`pops2`	Optional second vector of populations. Can result in non-square array
`afprod`	Get allele frequency products
`verbose`	Print progress updates
`...`	Additional arguments passed to `f2_from_precomp` or `f2_from_geno`

Value

A 3d array with f2-statistics

Get the population names of a graph

Description

Get the population names of a graph

Usage

get_leafnames(graph)
get_leafnames(graph)

Arguments

graph

An admixture graph

Value

Population names

Get the outgroup from a graph (if it exists)

Description

Get the outgroup from a graph (if it exists)

Usage

get_outpop(graph)
get_outpop(graph)

Arguments

graph

An admixture graph

Value

Outgroup name

Get the root name

Description

Get the root name

Usage

get_rootname(graph)
get_rootname(graph)

Arguments

graph

An admixture graph

Value

Root name

Add a population to an admixture graph

Description

Add a population to an admixture graph

Usage

graph_addleaf(graph, pop)
graph_addleaf(graph, pop)

Arguments

`graph`	An admixture graph
`pop`	Population to add to the graph

Value

Admixture graph with the added population

Pairwise distance estimates for graphs

Description

Computes a distance estimate for each graph pair. Each graph is first summarized as a vector which counts for every leaf pair how many internal nodes reach that pair. The distance between two graphs is the Euclidean distance between the vectors of two graphs, and is scaled to fall between 0 and 1.

Usage

graph_distances(graphlist)
graph_distances(graphlist)

Arguments

graphlist

List of graphs

Value

A data frame with graph distances

Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

graph_equations(
  graph,
  substitute = TRUE,
  nam = c("a", "e", "f"),
  return_everything = FALSE
)
graph_equations(
  graph,
  substitute = TRUE,
  nam = c("a", "e", "f"),
  return_everything = FALSE
)

Arguments

`graph`	An admixture graph
`substitute`	Should edge names be represented by shorter symbols?
`nam`	Symbols used to shorten edge names

Value

A list with two data frames: equations holds the equtions for all f2-statistics; coding has the mapping from edge names to edge symbols, which is used when substitute = TRUE

Make a function representing a graph

Description

This function takes an igraph object and turns it into a function that takes edge weights as input, and outputs the expected f2-statistics.

Usage

graph_f2_function(
  graph,
  admix_default = 0.5,
  drift_default = 0.01,
  random_defaults = FALSE
)
graph_f2_function(
  graph,
  admix_default = 0.5,
  drift_default = 0.01,
  random_defaults = FALSE
)

Arguments

`graph`	An admixture graph
`admix_default`	The default weights for admixture edges
`drift_default`	The default weights for drift edges
`random_defaults`	Set default weights randomly for each edge between 0 and 1

Value

A function mapping edge weights to f2-statistics

Examples

## Not run: 
mygraph = graph_f2_function(example_igraph)
mygraph(N3N8 = 0.1, `N2N1|Vindija.DG` = 0.4)

## End(Not run)
## Not run: 
mygraph = graph_f2_function(example_igraph)
mygraph(N3N8 = 0.1, `N2N1|Vindija.DG` = 0.4)

## End(Not run)

Find all valid graphs which result from flipping one admixture edge

Description

Find all valid graphs which result from flipping one admixture edge

Usage

graph_flipadmix(graph)
graph_flipadmix(graph)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_flipadmix(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)
## Not run: 
newgraphs = graph_flipadmix(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Get unique hash of an admixture graph

Description

This can be used to check if two graphs are identical. For two graphs to be identical, the topology and leaf nodes have to match, but internal node names do not matter.

Usage

graph_hash(graph)
graph_hash(graph)

Arguments

graph

Admixture graph in igraph format

Value

A hash string of the admixture graph

Find all graphs which result from removing one admixture edge

Description

Find all graphs which result from removing one admixture edge

Usage

graph_minusone(graph, ntry = Inf)
graph_minusone(graph, ntry = Inf)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_minusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)
## Not run: 
newgraphs = graph_minusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all graphs which result from adding and removing one admixture edge

Description

Find all graphs which result from adding and removing one admixture edge

Usage

graph_minusplus(graph)
graph_minusplus(graph)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns source_from, source_to, dest_from, dest_to, and graph

Examples

## Not run: 
newgraphs = graph_minusplus(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)
## Not run: 
newgraphs = graph_minusplus(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all graphs which result from adding one admixture edge

Description

Find all graphs which result from adding one admixture edge

Usage

graph_plusone(graph, ntry = Inf)
graph_plusone(graph, ntry = Inf)

Arguments

`graph`	Admixture graph in `igraph` format
`ntry`	Specify this to return only a subset of all possible graphs with one more edge

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_plusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)
## Not run: 
newgraphs = graph_plusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all trees which are part of the admixture graph

Description

Find all trees which are part of the admixture graph

Usage

graph_splittrees(graph, return_admix = FALSE, simplify = TRUE)
graph_splittrees(graph, return_admix = FALSE, simplify = TRUE)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns name and graph

Examples

## Not run: 
trees = graph_splittrees(example_igraph)
# now evaluate the trees
trees %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)
## Not run: 
trees = graph_splittrees(example_igraph)
# now evaluate the trees
trees %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Simulate allele frequncies under an admixture graph

Description

This function performs a very crude simulation of allele frequencies under an admixture graph model

Usage

graph_to_afs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = FALSE
)
graph_to_afs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = FALSE
)

Arguments

`graph`	An admixture graph as igraph object, or as edge list data frame with a column `weight`, as returned by `qpgraph()$edges`
`nsnps`	Number of SNPs to simulate
`drift_default`	Default branch lengths. Ignored if `graph` is a data frame with weights
`admix_default`	Default admixture weights. Ignored if `graph` is a data frame with weights
`leaves_only`	Return allele frequencies for leaf nodes only

Value

A data frame with allele frequencies

Examples

## Not run: 
afs = graph_to_afs(example_igraph)

## End(Not run)
## Not run: 
afs = graph_to_afs(example_igraph)

## End(Not run)

Simulate PCs under an admixture graph

Description

This function simulates PCA of allele frequencies under an admixture graph model

Usage

graph_to_pcs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = TRUE
)
graph_to_pcs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = TRUE
)

Arguments

`graph`	An admixture graph as igraph object, or as edge list data frame with a column `weight`, as returned by `qpgraph()$edges`
`nsnps`	Number of SNPs to simulate
`drift_default`	Default branch lengths. Ignored if `graph` is a data frame with weights
`admix_default`	Default admixture weights. Ignored if `graph` is a data frame with weights
`leaves_only`	Return PCs for leaf nodes only

Value

A data frame with PCs for each population

Examples

## Not run: 
pcs = graph_to_pcs(example_igraph)
pcs %>% ggplot(aes(PC1, PC2, label = pop)) + geom_text() + geom_point()

## End(Not run)
## Not run: 
pcs = graph_to_pcs(example_igraph)
pcs %>% ggplot(aes(PC1, PC2, label = pop)) + geom_text() + geom_point()

## End(Not run)

Get all qpadm models for a graph

Description

This function tests which qpadm models should be valid for an admixture graph and a target population. By default, all models returned by qpadm_models are tested. For large graphs this will be too slow, and you may want test only some models by providing the models argument, or only a single model by providing the left and right arguments.

Usage

graph_to_qpadm(
  graph,
  target,
  left = NULL,
  right = NULL,
  models = NULL,
  weights = TRUE,
  f4dat = NULL,
  allpops = TRUE,
  more_right = TRUE,
  return_f4 = FALSE,
  eps = 1e-10
)
graph_to_qpadm(
  graph,
  target,
  left = NULL,
  right = NULL,
  models = NULL,
  weights = TRUE,
  f4dat = NULL,
  allpops = TRUE,
  more_right = TRUE,
  return_f4 = FALSE,
  eps = 1e-10
)

Arguments

`graph`	An admixture graph
`target`	Name of the target population.
`left`	Left populations (provide this optionally if you want to test only a single qpadm model)
`right`	Right populations (provide this optionally if you want to test only a single qpadm model)
`models`	A two column nested data frame with models to be evaluated, one model per row. The first column, `l`, should contain the left populations, the second column, `r`, should contain the right populations. The target population is provided separately in the `target` argument.
`weights`	Set this to `FALSE` to return only information on the ranks, not the weights, of each qpadm model. The ranks should depend only on the graph topology, while the weights and weight-validity (all weights for left populations between 0 and 1) can depend on the branch lengths of the graph. By default f4-statistics are based on equal branch lengths and admixture weights of 0.5. This can be overridden by providing `f4dat`.
`f4dat`	A data frame of f4-statistics which can be provided to override the default branch lengths.
`allpops`	Evaluate only models which use all populations in the admixture graph. See `qpadm_models`
`return_f4`	Include f4 statistic matrices in the results (default `FALSE`)
`eps`	Epsilon value close to zero which is used for determining which f4 matrix elements should be considered non-zero, and which weights are strictly between 0 and 1.

Details

Two validity criteria are tested for each qpadm model: Rank validity and weight validity. Rank validity means that the rank of the f4 statistic matrix for only left and right populations is the same as the rank of the f4 statistic matrix that includes the target population among the left populations. Weight validity means that the estimated admixture weights for all left populations are between 0 and 1.

Value

A data frame with one qpadm model per row and columns valid_rank and valid_weights indicating whether a model should be valid under the graph.

Note

An earlier version of this function tried to use the graph topology for identifying valid qpadm models, but this didn't work reliably. Christian Huber had the idea of using the ranks of expected f4 statistic matrices instead.

Examples

## Not run: 
graph2 = example_igraph %>% simplify_graph() %>%
  delete_admix('N3N0', 'N3N4') %>% delete_admix('N3N1', 'N3N8')
graph_to_qpadm(graph2, 'Mbuti.DG') %>% filter(valid_rank, valid_weights)

## End(Not run)
## Not run: 
graph2 = example_igraph %>% simplify_graph() %>%
  delete_admix('N3N0', 'N3N4') %>% delete_admix('N3N1', 'N3N8')
graph_to_qpadm(graph2, 'Mbuti.DG') %>% filter(valid_rank, valid_weights)

## End(Not run)

Return all graphs created from permuting a subclade

Description

generates new graphs from basegraph as follows:

generates all possible trees using addpops (which are not in basegraph)
attaches trees to connection_edge, which is defined by two nodes in basegraph
adds edges originating above each edge in source_node, to each node above addpops

Usage

graphmod_pavel(basegraph, addpops, connection_edge, source_nodes)
graphmod_pavel(basegraph, addpops, connection_edge, source_nodes)

Arguments

`basegraph`	an admixture graph as igraph object. (convert from edge list using `igraph::graph_from_edgelist`)
`addpops`	a vector of population labels which are not in `basegraph`. These populations should form a clade. All possible trees will be generated and those trees will be attached to `basegraph`.
`connection_edge`	edge in `basegraph` where the tree made from `addpops` should be attached
`source_nodes`	nodes in `basegraph`. edges above these nodes will be added and attached to all terminal edges leading to `addpops`

Examples

## Not run: 
graphlist = graphmod_pavel(example_igraph, addpops = c('pop1', 'pop2', 'pop3'),
                           connection_edge = c('N2N0', 'N1N'),
                           source_nodes = c('Denisova.DG', 'N2N2'))
results = tibble(graph = graphlist) %>%
  mutate(res = map(graph, ~qpgraph(example_f2_blocks, .))) %>%
  unnest_wider(res) %>%
  mutate(worstz = map_dbl(f3, ~max(abs(.$z))))

## End(Not run)
## Not run: 
graphlist = graphmod_pavel(example_igraph, addpops = c('pop1', 'pop2', 'pop3'),
                           connection_edge = c('N2N0', 'N1N'),
                           source_nodes = c('Denisova.DG', 'N2N2'))
results = tibble(graph = graphlist) %>%
  mutate(res = map(graph, ~qpgraph(example_f2_blocks, .))) %>%
  unnest_wider(res) %>%
  mutate(worstz = map_dbl(f3, ~max(abs(.$z))))

## End(Not run)

Group precomputed data

Description

Computing f2 statistics for populations consisting of many individuals can be slow and require a lot of memory. To speed this up, this function groups individuals into populations, computes allele counts and products of pairwise allele counts with all individuals and groups, and writes the data to disk. f2 statistics for a combination of grouped and ungrouped precomputed data can then be read using f2_from_precomp, replacing individual IDs of the grouped samples with the new group labels. All groupings are listed in ⁠{dir}/groups/{groupname}.rds⁠

Usage

group_samples(dir, inds, pops, overwrite = FALSE, verbose = TRUE)
group_samples(dir, inds, pops, overwrite = FALSE, verbose = TRUE)

Arguments

`dir`	Directory with precomputed individual pair data
`inds`	Individuals to group
`pops`	Group names, either length 1, or same length as `inds`
`overwrite`	Overwrite existing files in `outdir`
`verbose`	print progress updates

Examples

## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)
## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)

Convert igraph to agraph

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

igraph_to_agraph(igraph)
igraph_to_agraph(igraph)

Arguments

agraph

An admixture graph in igraph format

Value

An admixture graph in agraph format

Examples

## Not run: 
igraph_to_agraph(example_igraph)

## End(Not run)
## Not run: 
igraph_to_agraph(example_igraph)

## End(Not run)

Insert a single edge into graph

Description

Insert a single edge into graph

Usage

insert_admix(
  graph,
  source_from = NULL,
  source_to = NULL,
  dest_from = NULL,
  dest_to = NULL,
  substitute = FALSE,
  fix_outgroup = TRUE
)
insert_admix(
  graph,
  source_from = NULL,
  source_to = NULL,
  dest_from = NULL,
  dest_to = NULL,
  substitute = FALSE,
  fix_outgroup = TRUE
)

Arguments

`graph`	An admixture graph
`source_from`	Parent node of the source edge
`source_to`	Child node of the source edge
`dest_from`	Parent node of the destination edge
`dest_to`	Child node of the destination edge
`substitute`	Should another edge be inserted, if the one specified doesn't work?

Value

Adxmiture graph with inserted edge

Insert admixture edges into graph

Description

Insert admixture edges into graph

Usage

insert_admix_n(graph, n = 1, fix_outgroup = TRUE)
insert_admix_n(graph, n = 1, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`from`	List of nodes. New edges will originate above these nodes.
`to`	List of nodes. New edges will end above these nodes.
`substitute`	Should another edge be inserted, if the one specified doesn't work?

Value

Admixture graph with inserted edges

Insert admixture edges into graph

Description

Insert admixture edges into graph

Usage

insert_admix_old(
  graph,
  fromnodes,
  tonodes,
  substitute_missing = FALSE,
  allow_below_admix = FALSE
)
insert_admix_old(
  graph,
  fromnodes,
  tonodes,
  substitute_missing = FALSE,
  allow_below_admix = FALSE
)

Arguments

`graph`	An admixture graph
`fromnodes`	List of nodes. New edges will originate above these nodes.
`tonodes`	List of nodes. New edges will end above these nodes.
`substitute_missing`	If `TRUE`, an attempt will be made to insert random other edges if some of the provided edges could not be inserted.
`allow_below_admix`	Allow insertion of edges which begin or end directly underneath an admixture node

Value

Adxmiture graph with inserted edges

Add population to graph

Description

Add population to graph

Usage

insert_leaf(graph, leaf, from, to)
insert_leaf(graph, leaf, from, to)

Arguments

`graph`	An admixture graph
`leaf`	Population to be added
`from`	Source node of edge onto which `leaf` should be added
`to`	Target node of edge onto which `leaf` should be added

Value

Admixture graph with added population

Test if an admixture graph is valid

Description

Test if an admixture graph is valid

Usage

is_valid(graph)
is_valid(graph)

Arguments

graph

An admixture graph

Value

TRUE if graph is valid, otherwise FALSE

Find identical graphs

Description

Find identical graphs

Usage

isomorphism_classes(igraphlist)
isomorphism_classes(igraphlist)

Arguments

igraphlist

A list with admixture graphs

Value

An integer vector with isomorphism classes. Graphs with the same number have identical topology (but may have different labels).

Find identical graphs

Description

Find identical graphs

Usage

isomorphism_classes2(igraphlist)
isomorphism_classes2(igraphlist)

Arguments

igraphlist

A list with admixture graphs

Value

An integer vector with isomorphism classes. Graphs with the same number have identical topology and leaf labels (but may have different internal labels).

Joint site frequency spectrum

Description

This function computes the joint site frequency spectrum from genotype files or allele frequency and count matrices. The joint site frequency spectrum lists how often each combination of haplotypes is observed. The number of combinations is equal to the product of one plus the number of haplotypes in each population. For example, five populations with a single diploid individual each have 3^5 possible combinations.

Usage

joint_sfs(afs, pref = NULL)
joint_sfs(afs, pref = NULL)

Arguments

`afs`	A named list of length two where the first element (`afs`) contains the allele frequency matrix, and the second element (`counts`) contains the allele count matrix.
`pref`	Instead of `afs`, the prefix of genotype files can be provided.

Value

A data frame with the number of times each possible combination of allele counts is observed.

Examples

## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))
joint_sfs(dat)

## End(Not run)
## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))
joint_sfs(dat)

## End(Not run)

Estimate joint allele frequency spectrum

Description

Estimate joint allele frequency spectrum

Usage

joint_spectrum(afs)
joint_spectrum(afs)

Arguments

afs

A matrix or data frame of allele frequencies

Value

A data frame with columns pattern and proportion

Examples

## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))

# Spectrum across all SNPs
joint_spectrum(dat$afs)

# Stratify by allele frequency in one population
dat$afs %>% as_tibble %>% select(1:4) %>%
  group_by(grp = cut(pop1, 10)) %>%
  group_modify(joint_spectrum) %>%
  ungroup

# Stratify by mutation class
dat$afs %>% as_tibble %>% select(1:4) %>%
  mutate(mut = paste(dat$snpfile$A1, dat$snpfile$A2)) %>%
  group_by(mut) %>%
  group_modify(joint_spectrum) %>%
  ungroup

## End(Not run)
## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))

# Spectrum across all SNPs
joint_spectrum(dat$afs)

# Stratify by allele frequency in one population
dat$afs %>% as_tibble %>% select(1:4) %>%
  group_by(grp = cut(pop1, 10)) %>%
  group_modify(joint_spectrum) %>%
  ungroup

# Stratify by mutation class
dat$afs %>% as_tibble %>% select(1:4) %>%
  mutate(mut = paste(dat$snpfile$A1, dat$snpfile$A2)) %>%
  group_by(mut) %>%
  group_modify(joint_spectrum) %>%
  ungroup

## End(Not run)

Estimate admixture weights

Description

Models target as a mixture of left populations, and outgroup right populations. Uses Lazaridis method based non-negative least squares of f4 matrix.

Usage

lazadm(data, left, right, target, boot = FALSE, constrained = TRUE)
lazadm(data, left, right, target, boot = FALSE, constrained = TRUE)

Arguments

`data`	The input data in the form of: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` A directory with f2 statistics The prefix of a genotype file
`left`	Left populations (sources)
`right`	Right populations (outgroups)
`target`	Target population
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`constrained`	Constrain admixture weights to be non-negative

Value

lazadm returns a data frame with weights and standard errors for each left population

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 9)

Examples

target = 'Denisova.DG'
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
lazadm(example_f2_blocks, left, right, target)
lazadm(example_f2_blocks, left, right, target, constrained = FALSE)
target = 'Denisova.DG'
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
lazadm(example_f2_blocks, left, right, target)
lazadm(example_f2_blocks, left, right, target, constrained = FALSE)

Generate a list of leave-one-out arrays

Description

Generate a list of leave-one-out arrays

Usage

loo_list(arr)
loo_list(arr)

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension

Value

A list of leave-one-out arrays, each of which is 1 element shorter than arr along the 3rd dimension

Turn leave-one-out estimates to per-block estimates

Description

Inverse of est_to_loo This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

loo_to_est(arr, block_lengths = NULL)
loo_to_est(arr, block_lengths = NULL)

Arguments

`arr`	3d array with blocked estimates, with blocks in the 3rd dimension.
`block_lengths`	Optional block lengths. If `NULL`, will be parsed from 3rd dimnames in blocks

Value

A 3d array with leave-one-out estimates for jackknife. Dimensions are equal to those of arr.

Modify a graph by moving an admixture edge

Description

Modify a graph by moving an admixture edge

Usage

move_admixedge_once(graph, fix_outgroup = TRUE)
move_admixedge_once(graph, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`fix_outgroup`	Keep outgroup in place

Value

A new admixture graph

Simulate an admixture graph in msprime v1.x.

Description

This function generates an msprime simulation script, and optionally executes it in python. Unlike msprime_sim, this function can simulate continuous sequence (not independent SNPs) and multiple chromosomes.

Usage

msprime_genome(
  graph,
  outpref = "msprime_sim",
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  time = 1000,
  fix_leaf = FALSE,
  nchr = 1,
  recomb_rate_chr = 2e-08,
  seq_length = 1000,
  admix_default = 0.5,
  run = FALSE,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)
msprime_genome(
  graph,
  outpref = "msprime_sim",
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  time = 1000,
  fix_leaf = FALSE,
  nchr = 1,
  recomb_rate_chr = 2e-08,
  seq_length = 1000,
  admix_default = 0.5,
  run = FALSE,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)

Arguments

`graph`	A graph as an `igraph` object or edge list with columns ’from’ and ’to’. If it is an edge list with a column ’weight’ (derived possibly from a fitted graph), the admixture weights will be used. Otherwise, all admixture edges will have a weight of 0.5.
`outpref`	A prefix of output files.
`neff`	Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., `c('R'=100, 'A'=50, 'B'=50)`).
`ind_per_pop`	The number of diploid individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., `c('A'=10, 'B'=20)` to sample 10 and 20 diploid individuals from populations A and B, respectively).
`mutation_rate`	Mutation rate per site per generation. The default is `1.25e-8` per base pair per generation.
`time`	Either a scalar value (1000 generations by default) with the dates generated by `pseudo_dates`, or a named vector with dates for each graph node (in generations).
`fix_leaf`	A boolean value specifying if the dates of the leaf nodes will be fixed at time 0. If `TRUE`, all samples will be drawn at the end of the simulation (i.e., from “today”).
`nchr`	Number of chromosomes to simulate.
`recomb_rate_chr`	A float value specifying recombination rate along the chromosomes. The default is `2e-8` per base pair per generation.
`seq_length`	The sequence length of the chromosomes. If it is a scalar value, the sequence length will be constant for all chromosomes. Alternatively, it can be a vector with a length equal to the number of chromosomes (i.e., `c(100,50)` to simulate 2 chromosomes with the lengths of 100 and 50 base pairs).
`admix_default`	A float value specifying default admixture proportion for all admixture nodes. The default is `0.5`. If another value between 0 and 1 is specified, admixture weights for each admixture event will be (value, 1-value).
`run`	If `FALSE`, the function will terminate after writing the msprime script. If `TRUE`, it will try to execute the msprime script with the default python installation. If you want to use some other python installation, you can set `⁠run = /my/python⁠`.
`ghost_lineages`	A boolean value specifying whether ghost lineages will be allowed. If `TRUE`, admixture happens at the time points defined by the y-axis generated while plotting the graph by `plot_graph`. If `FALSE` (default), admixture occurs at the time of the previous split event.
`shorten_admixed_leaves`	If `TRUE` simulate the behavior of treemix where drift after admixture is not allowed

Value

The file name and path of the simulation script

Examples

results = qpgraph(example_f2_blocks, example_graph)
# Simulate 3 chromosomes whose lengths are 50, 100 and 100
msprime_genome(results$edges, nchr=3, seq_length=c(50, 100, 100))
results = qpgraph(example_f2_blocks, example_graph)
# Simulate 3 chromosomes whose lengths are 50, 100 and 100
msprime_genome(results$edges, nchr=3, seq_length=c(50, 100, 100))

Simulate an admixture graph in msprime v1.x

Description

This function generates an msprime simulation script, and optionally executes it in python. Unlike the msprime_genome function, this function simulates independent SNP sites.

Usage

msprime_sim(
  graph,
  outpref = "msprime_sim",
  nsnps = 1000,
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 0.001,
  time = 1000,
  fix_leaf = FALSE,
  admix_default = 0.5,
  run = FALSE,
  numcores = NULL,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)
msprime_sim(
  graph,
  outpref = "msprime_sim",
  nsnps = 1000,
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 0.001,
  time = 1000,
  fix_leaf = FALSE,
  admix_default = 0.5,
  run = FALSE,
  numcores = NULL,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)

Arguments

`graph`	A graph as an `igraph` object or edge list with columns ’from’ and ’to’. If it is an edge list with a column ’weight’ (derived possibly from a fitted graph), the admixture weights will be used. Otherwise, all admixture edges will have a weight of 0.5.
`outpref`	A prefix of output files.
`nsnps`	The number of SNPs to simulate. All SNPs will be simulated independently of each other.
`neff`	Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., `c('R'=100, 'A'=50, 'B'=50)`).
`ind_per_pop`	The number of diploid individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., `c('A'=10, 'B'=20)` to sample 10 and 20 diploid individuals from populations A and B, respectively).
`mutation_rate`	Mutation rate per site per generation. The default is set to a high value (0.001 per site per generation) to obtain more polymorphic SNPs in order to speed up the simulation.
`time`	Either a scalar value (1000 generations by default) with the dates generated by `pseudo_dates` function, or a named vector with dates for each graph nodes (in generations).
`fix_leaf`	A boolean value specifying if the dates of the leaf nodes will be fixed at time 0. If `TRUE`, all samples will be drawn at the end of the simulation (i.e., from “today”).
`admix_default`	A float value specifying default admixture proportion for all admixture nodes. The default is `0.5`. If another value between 0 and 1 is specified, admixture weights for each admixture event will be (value, 1-value).
`run`	If `FALSE`, the function will terminate after writing the msprime script. If `TRUE`, it will try to execute the msprime script with the default python installation. If you want to use some other python installation, you can set `⁠run = /my/python⁠`.
`numcores`	The number of cores to use when simulating data.
`ghost_lineages`	A boolean value specifying whether ghost lineages will be allowed. If `TRUE`, admixture happens at the time points defined by the y-axis generated while plotting the graph by `plot_graph`. If `FALSE` (default), admixture occurs at the time of the previous split event.
`shorten_admixed_leaves`	If `TRUE` simulate the behavior of treemix where drift after admixture is not allowed

Value

The file name and path of the simulation script

Examples

## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
msprime_sim(results$edges, nsnps=100)

## End(Not run)
## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
msprime_sim(results$edges, nsnps=100)

## End(Not run)

Modify a graph by applying n mutation functions

Description

Modify a graph by applying n mutation functions

Usage

mutate_n(
  graph,
  n = 2,
  funs = list(spr_all, spr_leaves, swap_leaves, move_admixedge_once, flipadmix_random,
    mutate_n),
  fix_outgroup = TRUE
)
mutate_n(
  graph,
  n = 2,
  funs = list(spr_all, spr_leaves, swap_leaves, move_admixedge_once, flipadmix_random,
    mutate_n),
  fix_outgroup = TRUE
)

Arguments

`graph`	An admixture graph
`n`	Number of functions to apply
`funs`	List of function from which to choose
`fix_outgroup`	Keep outgroup in place

Value

A new admixture graph

Turn a newick format tree to a matrix of edges

Description

Turn a newick format tree to a matrix of edges

Usage

newick_to_edges(newick, node = "R", edgemat = matrix(NA, 0, 2))
newick_to_edges(newick, node = "R", edgemat = matrix(NA, 0, 2))

Arguments

`newick`	Tree in newick format.
`node`	Root label of the tree.
`edgemat`	Used for recursive function calls.

Value

Tree as two column matrix of edges (adjacency list)

Examples

newick = random_newick(c('a', 'b', 'c', 'd'))
newick
newick_to_edges(newick)
newick = random_newick(c('a', 'b', 'c', 'd'))
newick
newick_to_edges(newick)

Count how often each node in graph occurs in other graphs

Description

Count how often each node in graph occurs in other graphs

Usage

node_counts(graph, graphlist)
node_counts(graph, graphlist)

Arguments

`graph`	An admixture graph
`graphlist`	List of graphs

Returns a signature of a graph consisting of the left and right descendent leaf nodes of each internal node (sorted and concatenated)

Description

Can be used to determine how often internal nodes occur in a list of other well fitting models

Usage

node_signature(graph)
node_signature(graph)

Arguments

graph

An admixture graph

Value

A graph signature as character vector

Examples

## Not run: 
sigs = example_winners %>% mutate(sig = map(igraph, node_signature)) %$%
         sig %>% unlist %>% table %>% c
node_signature(example_winners$igraph[[1]])

## End(Not run)
## Not run: 
sigs = example_winners %>% mutate(sig = map(igraph, node_signature)) %$%
         sig %>% unlist %>% table %>% c
node_signature(example_winners$igraph[[1]])

## End(Not run)

Count number of admixture nodes

Description

Count number of admixture nodes

Usage

numadmix(graph)
numadmix(graph)

Arguments

graph

An admixture graph

Value

Number of admixture nodes

Read allele frequencies from packedancestrymap files

Description

Read allele frequencies from packedancestrymap files

Usage

packedancestrymap_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  adjust_pseudohaploid = TRUE,
  first = 1,
  last = NULL,
  verbose = TRUE
)
packedancestrymap_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  adjust_pseudohaploid = TRUE,
  first = 1,
  last = NULL,
  verbose = TRUE
)

Arguments

`pref`	Prefix of packedancestrymap files (files have to end in `.geno`, `.ind`, `.snp`)
`inds`	Individuals from which to compute allele frequencies
`pops`	Populations from which to compute allele frequencies. If `NULL` (default), populations will be extracted from the third column in the `.ind` file. If population labels are provided, they should have the same length as `inds`, and will be matched to them by position
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`verbose`	Print progress updates

Value

A list with three data frames: allele frequency data, allele counts, and SNP metadata

Examples

## Not run: 
afdat = packedancestrymap_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)
## Not run: 
afdat = packedancestrymap_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)

Convert EIGENSTRAT or PACKEDANCESTRYMAP to PLINK

Description

This function converts EIGENSTRAT/PACKEDANCESTRYMAP format files to PLINK files, using write_plink. When inds or pops is provided, only a subset of samples will be extracted. This function can have a high memory footprint, because data for all SNPs will be read before writing the PLINK files.

Usage

packedancestrymap_to_plink(
  inpref,
  outpref,
  inds = NULL,
  pops = NULL,
  verbose = TRUE
)
packedancestrymap_to_plink(
  inpref,
  outpref,
  inds = NULL,
  pops = NULL,
  verbose = TRUE
)

Arguments

`inpref`	Prefix of the input files
`outpref`	Prefix of the PLINK output files
`inds`	Individuals which should be extracted
`pops`	Populations which should be extracted. Can not be provided together with `inds`
`verbose`	Print progress updates

Alias

eigenstrat_to_plink

Read graph in dot format

Description

Read graph in dot format

Usage

parse_dot(dotfile)
parse_dot(dotfile)

Arguments

dotfile

Name of a file with a dot formatted admixture graph

Examples

## Not run: 
graph = parse_dot("/my/graph.dot")

## End(Not run)
## Not run: 
graph = parse_dot("/my/graph.dot")

## End(Not run)

Read qp3Pop output file

Description

Read qp3Pop output file

Usage

parse_qp3pop_output(outfile)
parse_qp3pop_output(outfile)

Arguments

outfile

output file generated by qp3Pop.

Value

tibble with output data.

Read qpAdm output file

Description

Read qpAdm output file

Usage

parse_qpadm_output(outfile)
parse_qpadm_output(outfile)

Arguments

outfile

Output file generated by qpAdm.

Value

Data frame with output data.

Read qpDstat output file

Description

Read qpDstat output file

Usage

parse_qpdstat_output(outfile)
parse_qpdstat_output(outfile)

Arguments

outfile

output file generated by qpDstat.

Value

tibble with output data.

Read qpF4ratio output file

Description

Read qpF4ratio output file

Usage

parse_qpf4ratio_output(parfile)
parse_qpf4ratio_output(parfile)

Arguments

outfile

Output file generated by qpF4ratio

Value

Data frame with output data

Read qpGraph graph file

Description

Read qpGraph graph file

Usage

parse_qpgraph_graphfile(
  graphfile,
  split_multi = TRUE,
  igraph = FALSE,
  uselabels = FALSE
)
parse_qpgraph_graphfile(
  graphfile,
  split_multi = TRUE,
  igraph = FALSE,
  uselabels = FALSE
)

Arguments

`graphfile`	File with admixture graph in qpGraph format.
`split_multi`	Split multifurcations
`igraph`	Convert to igraph format
`uselabels`	Should labels or full names be used? Defaults to full names.

Value

Graph represented as two column edge matrix. Can have four columns if edges are locked

Read qpGraph output file

Description

Read qpGraph output file

Usage

parse_qpgraph_output(outfile, logfile = outfile)
parse_qpgraph_output(outfile, logfile = outfile)

Arguments

outfile

output file generated by qpGraph.

Value

list of output data.

Modify a graph by permuting leaf nodes

Description

Modify a graph by permuting leaf nodes

Usage

permute_leaves(graph, fix_outgroup = TRUE)
permute_leaves(graph, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`fix_outgroup`	Keep outgroup in place

Value

A new admixture graph

Modify a graph by changing the position of the root node

Description

Modify a graph by changing the position of the root node

Usage

place_root_random(graph, fix_outgroup = TRUE)
place_root_random(graph, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`fix_outgroup`	Keep outgroup in place

Value

A new admixture graph

Read allele frequencies from `PLINK` files

Description

Read allele frequencies from PLINK files

Usage

plink_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  adjust_pseudohaploid = TRUE,
  first = 1,
  last = NULL,
  numblocks = 1,
  poly_only = FALSE,
  verbose = TRUE
)
plink_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  adjust_pseudohaploid = TRUE,
  first = 1,
  last = NULL,
  numblocks = 1,
  poly_only = FALSE,
  verbose = TRUE
)

Arguments

`pref`	prefix of `PLINK` files (files have to end in `.bed`, `.bim`, `.fam`).
`inds`	Individuals from which to compute allele frequencies
`pops`	Populations from which to compute allele frequencies. If `NULL` (default), populations will be extracted from the third column in the `.ind` file. If population labels are provided, they should have the same length as `inds`, and will be matched to them by position
`adjust_pseudohaploid`	Genotypes of pseudohaploid samples are usually coded as `0` or `2`, even though only one allele is observed. `adjust_pseudohaploid` ensures that the observed allele count increases only by `1` for each pseudohaploid sample. If `TRUE` (default), samples that don't have any genotypes coded as `1` among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to `FALSE` is equivalent to the ADMIXTOOLS `inbreed: NO` option. Setting `adjust_pseudohaploid` to an integer `n` will check the first `n` SNPs instead of the first 1000 SNPs.
`numblocks`	Number of blocks in which to read genotype file. Setting this to a number greater than one is more memory efficient, but slower.
`poly_only`	Only keep SNPs with mean allele frequency not equal to 0 or 1 (default `FALSE`).
`verbose`	Print progress updates

Value

A list with three items: Allele frequency matrix, allele count matrix, and SNP meta data.

Examples

## Not run: 
afdat = plink_to_afs(prefix, pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)
## Not run: 
afdat = plink_to_afs(prefix, pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)

Compare two models

Description

Plot a comparison of two runs of qpgraph or two runs of qpadm

Usage

plot_comparison(out1, out2, name1 = NULL, name2 = NULL)
plot_comparison(out1, out2, name1 = NULL, name2 = NULL)

Arguments

`out1`	First model
`out2`	Second model
`name1`	Optional name for first model
`name2`	Optional name for second model

Value

A ggplot object.

Examples

fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plot_comparison(fit1, fit2)
fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plot_comparison(fit1, fit2)

Plot an admixture graph

Description

Plot an admixture graph

Usage

plot_graph(
  graph,
  fix = NULL,
  title = "",
  color = TRUE,
  textsize = 2.5,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  dates = NULL,
  neff = NULL,
  scale_y = FALSE,
  hide_weights = FALSE
)
plot_graph(
  graph,
  fix = NULL,
  title = "",
  color = TRUE,
  textsize = 2.5,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  dates = NULL,
  neff = NULL,
  scale_y = FALSE,
  hide_weights = FALSE
)

Arguments

`graph`	An admixture graph. If it's an edge list with a `label` column, those values will be displayed on the edges
`fix`	If `TRUE`, there will be an attempt to rearrange the nodes to minimize the number of intersecting edges. This can take very long for large graphs. By default this is only done for graphs with fewer than 10 leaves.
`title`	A plot title
`color`	Plot it in color or greyscale
`textsize`	Size of edge and node labels
`highlight_unidentifiable`	Highlight unidentifiable edges in red. Can be slow for large graphs. See `unidentifiable_edges`.
`pos`	An optional data frame with node coordinates (columns `node`, `x`, `y`)
`dates`	An optional named vector with dates (in generations) for each node to plot dates on the y-axis (e.g., `c('R'=1000, 'A'=0, 'B'=0)`). If this option is supplied, the y-axis will display dates in generations.
`neff`	An optional named vector with effective population sizes for each population (e.g., `c('R'=100, 'A'=100, 'B'=100)`). If this option is supplied, the effective population size of each population will be shown next to the corresponding edge.
`scale_y`	If `TRUE`, scale the y-axis according to `dates` vector. The default is `FALSE`.
`hide_weights`	A boolean value specifying if the drift values on the edges will be hidden. The default is `FALSE`.

Value

A ggplot object

Examples

plot_graph(example_graph)

# Plot a random simulation output. Show dates and population sizes on the plot
out = random_sim(nleaf=5, nadmix=1)
plot_graph(out$edges, dates=out$dates, neff=out$neff)
plot_graph(example_graph)

# Plot a random simulation output. Show dates and population sizes on the plot
out = random_sim(nleaf=5, nadmix=1)
plot_graph(out$edges, dates=out$dates, neff=out$neff)

Plot an admixture graph on a map

Description

Plot an admixture graph on a map

Usage

plot_graph_map(graph, leafcoords, shapedata = NULL)
plot_graph_map(graph, leafcoords, shapedata = NULL)

Arguments

`graph`	a two column matrix specifying the admixture graph. first column is source node, second column is target node. the first edge has to be root -> outgroup. admixture nodes are inferred as those nodes which are the targets of two different sources.
`leafcoords`	data frame with columns `group`, `lon`, `lat`
`shapedata`	shapedata

Value

a plotly object

Examples

plot_graph_map(example_igraph, example_anno)
plot_graph_map(example_igraph, example_anno)

Plot samples on a map

Description

Plot samples on a map

Usage

plot_map(
  leafcoords,
  map_layout = 1,
  color = "yearsbp",
  colorscale = "Portland",
  collog10 = TRUE
)
plot_map(
  leafcoords,
  map_layout = 1,
  color = "yearsbp",
  colorscale = "Portland",
  collog10 = TRUE
)

Arguments

`leafcoords`	data frame with columns `iid`, `lon`, `lat`
`map_layout`	1 or 2
`color`	color
`colorscale`	colorscale
`collog10`	collog10

Value

a plotly object.

Examples

## Not run: 
plot_map(example_anno, 1)

## End(Not run)
## Not run: 
plot_map(example_anno, 1)

## End(Not run)

Compare two models

Description

Plot a comparison of two runs of qpgraph or two runs of qpadm

Usage

plotly_comparison(out1, out2, name1 = NULL, name2 = NULL)
plotly_comparison(out1, out2, name1 = NULL, name2 = NULL)

Arguments

`out1`	First model
`out2`	Second model
`name1`	Optional name for first model
`name2`	Optional name for second model

Value

A plotly object.

Examples

fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plotly_comparison(fit1, fit2)
fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plotly_comparison(fit1, fit2)

Plot an admixture graph using plotly

Description

Plot an admixture graph using plotly

Usage

plotly_graph(
  graph,
  collapse_threshold = 0,
  fix = FALSE,
  print_highlow = FALSE,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  nudge_y = -0.1,
  annot = ""
)
plotly_graph(
  graph,
  collapse_threshold = 0,
  fix = FALSE,
  print_highlow = FALSE,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  nudge_y = -0.1,
  annot = ""
)

Arguments

`graph`	An admixture graph
`collapse_threshold`	Collapse nodes if they are separated by less than this distance (for fitted graphs)
`fix`	If `TRUE`, there will be an attempt to rearrange the nodes to minimize the number of intersecting edges. This can take very long for large graphs. By default this is only done for graphs with fewer than 10 leaves.
`highlight_unidentifiable`	Highlight unidentifiable edges in red. Can be slow for large graphs. See `unidentifiable_edges`.
`pos`	Optional data frame with node coordinates (columns `node`, `x`, `y`)

Value

A plotly object

Examples

plotly_graph(example_graph)
plotly_graph(example_graph)

Get pseudo dates for graph nodes

Description

This function assigns a date (in generations) to each node in an admixture graph and is used in msprime_sim and msprime_genome. The date of the node will correspond to the y-coordinate of that node used for plotting in plotly_graph, unless the fix_leaf option is set to TRUE, in which case the dates of all leaf nodes returned by get_leafnames will be set to 0.

Usage

pseudo_dates(graph, time = 1000, fix_leaf = FALSE)
pseudo_dates(graph, time = 1000, fix_leaf = FALSE)

Arguments

`graph`	An admixture graph
`time`	A scalar by which y-coordinate values will be multiplied to get dates
`fix_leaf`	A boolean specifying if the dates of the leaf nodes will be fixed at time 0 (i.e., at the most recent time). If `TRUE`, all samples will be drawn at the end of the simulation (i.e., from "today"). The default is `FALSE`

Value

A named vector with pseudo dates for each graph node

Estimate f3 statistics

Description

Computes f3 statistics of the form $f3(A; B, C)$ . This is generally equivalent to $(f2(A, B) + f2(A, C) - f2(B, C)) / 2$ and to $f4(A, B; A, C)$ . However, the exact estimates depend on the type of input data, selection of SNPs, and on the values of some of the arguments, which are described below.

Usage

qp3pop(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  verbose = FALSE,
  ...
)
qp3pop(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  verbose = FALSE,
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pop1`	One of the following four: `NULL`: all possible population combinations will be returned A vector of population labels. All combinations with the other `pop` arguments will be returned A matrix with population combinations to be tested, with one population per column and one combination per row. Other `pop` arguments will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. Other `pop` arguments will be ignored.
`pop2`	A vector of population labels
`pop3`	A vector of population labels
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`sure`	The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.
`unique_only`	If `TRUE` (the default), redundant combinations will be excluded
`verbose`	Print progress updates
`...`	Additional arguments passed to `f3blockdat_from_geno` if `data` is a genotype prefix, or to `get_f2` otherwise
`apply_corr`	With `apply_corr = FALSE`, no bias correction is performed. With `apply_corr = TRUE` (the default), a bias correction term based on the heterozygosity in the first population is subtracted from the f3 estimate. The option is ineffective if the first argument is an array of pre-computed f2-statistics. In that case, the bias correction can be toggled by the `apply_corr` parameter in `extract_f2` or `f2_from_geno`. The heterozygosity calculation used in `apply_corr = TRUE` requires at least two haploid samples or one diploid sample in the first population. Otherwise, `apply_corr = TRUE` will result in missing values.
`outgroupmode`	With `outgroupmode = FALSE`, estimates of f3 will be normalized by estimates of the heterozygosity of the target population. This is the default option if the first argument is the prefix of genotype data. If the first argument is an array of pre-computed f2-statistics, then no normalization can be performed, which corresponds to `outgroupmode = TRUE`. As with `apply_corr = TRUE`, the heterozygosity calculation used in `outgroupmode = FALSE` requires at least two pseudodiploid samples or one diploid sample in the first population. Otherwise, `outgroupmode = FALSE` will result in missing values.
`poly_only`	Only keep SNPs with mean allele frequency not equal to 0 or 1. Defaults to `FALSE`. Only effective if the first argument is the prefix of genotype data. If the first argument is an array of pre-computed f2-statistics, use the `poly_only` argument in `extract_f2` or `f2_from_geno`

Details

The default values of the parameters apply_corr, outgroupmode, poly_only depend on the first argument (data), and they can affect the estimated f3-statistics. When the data is the prefix of genotype data, the default parameters are generally the same as in the original qp3pop program. When data is an array of pre-computed f2-statistics, the parameters may be set while computing f2-statistics. If the first population is a single pseudodiploid sample, it is not possible to get unbiased estimates of f3. To get biased estimates for a single pseudodiploid sample from genotype data, set outgroupmode = TRUE and apply_corr = FALSE. Under this combination of parameters, ⁠f3(A; B, C)⁠ should also be identical to ⁠f4(A, B; A, C)⁠, since f4-statistics generally do not require any bias correction. See examples for more information.

Value

qp3pop returns a data frame with f3 statistics, with columns for populations 1 to 3, f3-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Alias

f3

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pop1, pop2, pop3)

pops = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pops)
qp3pop(example_f2_blocks, pops, unique_only = FALSE)
qp3pop(f2_dir, pop1, pop2, pop3)

# Below are three scenarios, and in each one `qp3pop()` and `qp3pop_wrapper()`
# should give the same or very similar estimates. Note that to compute `f3(A; B, C)`,
# `qp3pop_wrapper()`, and the original qp3pop program, expect populations to be in the order `B`, `C`, `A`.

prefix = '/path/to/geno/prefix'
qp3popbin = '/path/to/AdmixTools/bin/qp3Pop'
pops = dimnames(example_f2_blocks)[[1]]

# 1. target diploid, outgroupmode NO (this is the default when passing a geno file prefix)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = FALSE)
qp3pop(prefix, pops[1], pops[2], pops[3])
  # est, se, z match well; n is higher
qp3pop(prefix, pops[1], pops[2], pops[3], poly_only = TRUE)
  # est, se, z match less well; n is identical

# 2. target diploid, outgroupmode YES (this is the only option with precomputed f2-stats)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[1], pops[2], pops[3], outgroupmode = TRUE)
  # est, se, z match (except for factor 1000)
f2b = f2_from_geno(prefix, pops = pops[1:3], poly_only = FALSE)
qp3pop(f2b, pops[1], pops[2], pops[3])

# 3. target pseudodiploid (no heterozygotes means heterozygosity rate correction is not possible)
qp3pop_wrapper(prefix, pops[1], pops[3], pops[2], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[2], pops[1], pops[3], outgroupmode = TRUE, apply_corr = FALSE)
  # est, se, z match (except for factor 1000)

## End(Not run)
## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pop1, pop2, pop3)

pops = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pops)
qp3pop(example_f2_blocks, pops, unique_only = FALSE)
qp3pop(f2_dir, pop1, pop2, pop3)

# Below are three scenarios, and in each one `qp3pop()` and `qp3pop_wrapper()`
# should give the same or very similar estimates. Note that to compute `f3(A; B, C)`,
# `qp3pop_wrapper()`, and the original qp3pop program, expect populations to be in the order `B`, `C`, `A`.

prefix = '/path/to/geno/prefix'
qp3popbin = '/path/to/AdmixTools/bin/qp3Pop'
pops = dimnames(example_f2_blocks)[[1]]

# 1. target diploid, outgroupmode NO (this is the default when passing a geno file prefix)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = FALSE)
qp3pop(prefix, pops[1], pops[2], pops[3])
  # est, se, z match well; n is higher
qp3pop(prefix, pops[1], pops[2], pops[3], poly_only = TRUE)
  # est, se, z match less well; n is identical

# 2. target diploid, outgroupmode YES (this is the only option with precomputed f2-stats)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[1], pops[2], pops[3], outgroupmode = TRUE)
  # est, se, z match (except for factor 1000)
f2b = f2_from_geno(prefix, pops = pops[1:3], poly_only = FALSE)
qp3pop(f2b, pops[1], pops[2], pops[3])

# 3. target pseudodiploid (no heterozygotes means heterozygosity rate correction is not possible)
qp3pop_wrapper(prefix, pops[1], pops[3], pops[2], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[2], pops[1], pops[3], outgroupmode = TRUE, apply_corr = FALSE)
  # est, se, z match (except for factor 1000)

## End(Not run)

Wrapper function around the original qp3Pop program

Description

Computes f3 statistics of the form $f3(A; B, C)$ . Equivalent to $(f2(A, B) + f2(A, C) - f2(B, C)) / 2$ and to $f4(A, B; A, C)$ . Requires a working installation of qp3Pop, which will be called using system

Usage

qp3pop_wrapper(
  pref,
  source1,
  source2 = NULL,
  target = NULL,
  bin = "~np29/o2bin/qp3Pop",
  outdir = ".",
  parfile = NULL,
  inbreed = "NO",
  outgroupmode = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)
qp3pop_wrapper(
  pref,
  source1,
  source2 = NULL,
  target = NULL,
  bin = "~np29/o2bin/qp3Pop",
  outdir = ".",
  parfile = NULL,
  inbreed = "NO",
  outgroupmode = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

`pref`	Path to and prefix of the packedancestrymap genotype files
`source1`	One of the following four: `NULL`: Populations will be read from `poplistname` or `popfilename` specified in `parfile` A vector of population labels A data frame in which the first four columns specify the population triples to be tested. `source2`, `target` will be ignored. The location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. `source2` and `target` will be ignored.
`source2`	A vector of population labels
`target`	A vector of population labels
`bin`	Path to the qp3Pop binary file
`outdir`	Output directory. files `out`, `parfile`, `poplistname`, `popfilename` may be overwritten
`parfile`	qp3Pop parameter file. If this is specified, `source1`, `source2`, `target` will be ignored.
`inbreed`	inbreed
`outgroupmode`	outgroupmode
`printonly`	Should the command be printed or executed?
`env`	Export environmental variables. See examples.
`verbose`	Print progress updates

Value

If printonly, the qp3Pop command, otherwise a data frame with parsed qp3Pop output

Examples

## Not run: 
source1 = c('Altai_Neanderthal.DG', 'Vindija.DG')
source2 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
target = 'Denisova.DG'
qp3pop_wrapper('genotype_prefix', source1, source2, target,
  bin = 'path/to/qp3Pop')

## End(Not run)
## Not run: 
source1 = c('Altai_Neanderthal.DG', 'Vindija.DG')
source2 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
target = 'Denisova.DG'
qp3pop_wrapper('genotype_prefix', source1, source2, target,
  bin = 'path/to/qp3Pop')

## End(Not run)

Estimate admixture weights

Description

qpadm models a target population as a mixture of left (source) populations, given a set of right (outgroup) populations. It can be used to estimate whether the left populations explain all genetic variation in the target population, relative to the right populations, and to estimate admixture proportions of the left populations to the target population.

Usage

qpadm(
  data,
  left,
  right,
  target,
  f4blocks = NULL,
  fudge = 1e-04,
  fudge_twice = FALSE,
  boot = FALSE,
  getcov = TRUE,
  constrained = FALSE,
  return_f4 = FALSE,
  cpp = TRUE,
  verbose = TRUE,
  ...
)
qpadm(
  data,
  left,
  right,
  target,
  f4blocks = NULL,
  fudge = 1e-04,
  fudge_twice = FALSE,
  boot = FALSE,
  getcov = TRUE,
  constrained = FALSE,
  return_f4 = FALSE,
  cpp = TRUE,
  verbose = TRUE,
  ...
)

Arguments

`data`	The input data in the form of: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` A directory with f2 statistics The prefix of a genotype file
`left`	Left populations (sources)
`right`	Right populations (outgroups)
`target`	Target population
`f4blocks`	Instead of f2 blocks, f4 blocks can be supplied. This is used by `qpadm_multi`
`fudge`	Value added to diagonal matrix elements before inverting
`fudge_twice`	Setting this to `TRUE` should result in p-values that better match those in the original qpAdm program
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`getcov`	Compute weights covariance. Setting `getcov = FALSE` will speed up the computation.
`constrained`	Constrain admixture weights to be non-negative
`return_f4`	Return f4-statistics
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`verbose`	Print progress updates
`...`	If `data` is the prefix of genotype files, additional arguments will be passed to `f4blockdat_from_geno`

Value

qpadm returns a list with up to four data frames describing the model fit:

weights: A data frame with estimated admixture proportions where each row is a left population.
f4: A data frame with estimated and fitted f4-statistics
rankdrop: A data frame describing model fits with different ranks, including p-values for the overall fit and for nested models (comparing two models with rank difference of one). A model with L left populations and R right populations has an f4-matrix of dimensions (L-1)*(R-1). If no two left population form a clade with respect to all right populations, this model will have rank (L-1)*(R-1).
- f4rank: Tested rank
- dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)
- chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.
- p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)
- dofdiff: Difference in degrees of freedom between this model and the model with one less rank
- chisqdiff: Difference in chi-squared statistics
- p_nested: p-value testing whether the difference between two models of rank difference 1 is significant
popdrop: A data frame describing model fits with different populations. Note that all models with fewer populations use the same set of SNPs as the first model.
- pat: A binary code indicating which populations are present in this model. A 1 represents dropped populations. The full model is all zeros.
- wt: Number of populations dropped
- dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)
- chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.
- p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)
- f4rank: Tested rank
- feasible: A model is feasible if all weights fall between 0 and 1
- ⁠<population name>⁠: The weights for each population in this model

References

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 10)

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm(example_f2_blocks, left, right, target)
## Not run: 
# The original ADMIXTOOLS qpAadm program has an option called "allsnps"
# that selects different SNPs for each f4-statistic, which is
# useful when working with sparse genotype data.
# To get the same behavior in ADMIXTOOLS 2, supply the genotype data prefix
# and set `allsnps = TRUE`
qpadm("/my/geno/prefix", left, right, target, allsnps = TRUE)

## End(Not run)
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm(example_f2_blocks, left, right, target)
## Not run: 
# The original ADMIXTOOLS qpAadm program has an option called "allsnps"
# that selects different SNPs for each f4-statistic, which is
# useful when working with sparse genotype data.
# To get the same behavior in ADMIXTOOLS 2, supply the genotype data prefix
# and set `allsnps = TRUE`
qpadm("/my/geno/prefix", left, right, target, allsnps = TRUE)

## End(Not run)

Partition a list of populations into left and right populations

Description

Partition a list of populations into left and right populations

Usage

qpadm_models(pops, allpops = TRUE, more_right = TRUE)
qpadm_models(pops, allpops = TRUE, more_right = TRUE)

Arguments

`pops`	A vector of populations
`allpops`	Return only models which use all provided populations
`more_right`	Return only models with more right than left populations

Value

A data frame with one qpadm model per row

Examples

## Not run: 
graph_to_qpadm(get_leafnames(example_igraph))

## End(Not run)
## Not run: 
graph_to_qpadm(get_leafnames(example_igraph))

## End(Not run)

Return all valid qpAdm models for an admixturegraph

Description

For large admixturegraph, there may be a large number of valid qpAdm models!

Usage

qpadm_models_old(graph, add_outgroup = FALSE, nested = TRUE, abbr = -1)
qpadm_models_old(graph, add_outgroup = FALSE, nested = TRUE, abbr = -1)

Arguments

`graph`	An admixture graph as igraph object
`add_outgroup`	Should the graph outgroup be added to the qpAdm right populations? Technically this shouldn't be an informative outgroup for qpAdm.
`nested`	Should a nested data frame be returned (`TRUE`), or should populations be concatenated into strings (`FALSE`)?
`abbr`	Maximum number of characters to print for each population. The default (-1) doesn't abbreviate the names.

Examples

## Not run: 
qpadm_models_old(igraph2, add_outgroup = TRUE)
qpadm_models_old(igraph2, add_outgroup = TRUE) %>% slice(1) %$% list(target, left, right)

## End(Not run)
## Not run: 
qpadm_models_old(igraph2, add_outgroup = TRUE)
qpadm_models_old(igraph2, add_outgroup = TRUE) %>% slice(1) %$% list(target, left, right)

## End(Not run)

Run multiple qpadm models

Description

This function runs multiple qpadm models, re-using f4-statistics where possible. Supports parallel evaluation of models, which can be turned on with future::plan('multisession') or similar, and turned off with future::plan('sequential').

Usage

qpadm_multi(
  data,
  models,
  allsnps = FALSE,
  full_results = TRUE,
  verbose = TRUE,
  ...
)
qpadm_multi(
  data,
  models,
  allsnps = FALSE,
  full_results = TRUE,
  verbose = TRUE,
  ...
)

Arguments

`data`	The input data in the form of: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` A directory with f2 statistics The prefix of a genotype file
`models`	A nested list (or data frame) with qpadm models. It should consist of two or three other named lists (or columns) containing `left`, `right`, (and `target`) populations.
`allsnps`	Use all SNPs with allele frequency estimates in every population of any given population quadruple. If `FALSE` (the default) only SNPs which are present in all populations in `popcombs` (or any given model in it) will be used. When there are populations with lots of (non-randomly) missing data, `allsnps = TRUE` can lead to false positive results. This option only has an effect if `data` is the prefix of a genotype file. If `data` are f2-statistics, the behavior will be determined by the options that were used in computing the f2-statistics.
`verbose`	Print progress updates
`...`	Further arguments passed to `qpadm`

Value

A list where each element is the output of one qpadm model.

Examples

## Not run: 
# the following specifies two models: one with 2/3/1 and one with 1/2/1 left/right/target populations
models = tibble(
           left = list(c('pop1', 'pop2'), c('pop3')),
           right = list(c('pop5', 'pop6', 'pop7'), c('pop7', 'pop8')),
           target = c('pop10', 'pop10'))
results = qpadm_multi('/my/geno/prefix', models)

## End(Not run)
## Not run: 
# the following specifies two models: one with 2/3/1 and one with 1/2/1 left/right/target populations
models = tibble(
           left = list(c('pop1', 'pop2'), c('pop3')),
           right = list(c('pop5', 'pop6', 'pop7'), c('pop7', 'pop8')),
           target = c('pop10', 'pop10'))
results = qpadm_multi('/my/geno/prefix', models)

## End(Not run)

Faster version of `qpadm` with reduced output

Description

Models target as a mixture of left populations, given a set of outgroup right populations. Can be used to estimate admixture proportions, and to estimate the number of independent admixture events.

Usage

qpadm_p(
  f2_data,
  left,
  right,
  target = NULL,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  rnk = length(setdiff(left, target)) - 1,
  cpp = TRUE,
  weights = FALSE,
  f4blocks = NULL
)
qpadm_p(
  f2_data,
  left,
  right,
  target = NULL,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  rnk = length(setdiff(left, target)) - 1,
  cpp = TRUE,
  weights = FALSE,
  f4blocks = NULL
)

Arguments

`left`	Left populations (sources)
`right`	Right populations (outgroups)
`target`	Target population
`fudge`	Value added to diagonal matrix elements before inverting
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`constrained`	Constrain admixture weights to be non-negative
`rnk`	Rank of f4-matrix. Defaults to one less than full rank.
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`weights`	Return weights (default = `FALSE`)
`f4blocks`	Instead of f2 blocks, f4 blocks can be supplied. This is used by `qpadm_multi`

Value

Data frame with f4rank, dof, chisq, p, feasible

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_p(example_f2_blocks, left, right, target)
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_p(example_f2_blocks, left, right, target)

Compute p-values for many qpadm models

Description

This functions evaluates many qpadm models simultaneously by keeping the target population and the rightfix populations fixed, and distributing the leftright populations by keeping some in the set of left population and adding the remaining populations to the right populations. (See details for an example of how models are generated)

Usage

qpadm_rotate(
  f2_blocks,
  leftright,
  target,
  rightfix = NULL,
  full_results = FALSE,
  verbose = TRUE
)
qpadm_rotate(
  f2_blocks,
  leftright,
  target,
  rightfix = NULL,
  full_results = FALSE,
  verbose = TRUE
)

Arguments

`f2_blocks`	3d array of blocked f2 statistics, output of `f2_from_precomp`.
`leftright`	Populations which will be distributed between left and right
`target`	Target population
`rightfix`	Populations which will be on the right side in all models
`full_results`	Return all output items which are returned by `qpadm`. By default (`full_results = FALSE`), weights and several other statistics will not be computed for each model, making it faster and the output more readable. If `full_results = TRUE`, the output will be a nested data frame where each row is one `qpadm` model, and each column has one data frame item from the regular qpadm output (`weights`, `f4`, `rankdrop`, `popdrop`).
`verbose`	Print progress updates

Details

If leftright consists of the populations L1, L2, L3, L4; rightfix is the population R; and target is T, the following models will be genrated:

(left), (right), (target)
(L1), (L2, L3, L4, R), (T)
(L2), (L1, L3, L4, R), (T)
(L3), (L1, L2, L4, R), (T)
(L4), (L1, L2, L3, R), (T)
(L1, L2), (L3, L4, R), (T)
(L1, L3), (L2, L4, R), (T)
(L1, L4), (L2, L3, R), (T)
(L2, L3), (L1, L4, R), (T)
(L2, L4), (L1, L3, R), (T)
(L3, L4), (L1, L2, R), (T)

Value

A data frame with Chi-squared statistics and p-values for each population combination

Examples

## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
qpadm_rotate(example_f2_blocks, leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)
## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
qpadm_rotate(example_f2_blocks, leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)

Wrapper function around the original qpAdm program

Description

This requires a working installation of qpAdm, which will be called using system

Usage

qpadm_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)
qpadm_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

`pref`	Path to and prefix of the packedancestrymap genotype files
`left`	Left populations (or leftlist file)
`right`	Right populations (or rightlist file)
`target`	Target population
`bin`	Path to the qpAdm binary file
`outdir`	Output directory. files `out`, `parfile`, `leftlist`, `rightlist` will be overwritten
`parfile`	qpAdm parameter file
`allsnps`	allsnps
`blgsize`	blgsize
`fancyf4`	fancyf4
`f4mode`	f4mode
`inbreed`	inbreed
`printonly`	Should the command be printed or executed?
`env`	Export environmental variables. See examples.
`verbose`	Print progress updates

Value

If not printonly, a data frame with parsed qpAdm output

Examples

## Not run: 
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_wrapper('genotype_prefix', left, right, target,
  bin = 'path/to/qpAdm')

## End(Not run)
## Not run: 
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_wrapper('genotype_prefix', left, right, target,
  bin = 'path/to/qpAdm')

## End(Not run)

Estimate f4 statistics

Description

Computes f4-statistics of the form $f4(A, B; C, D)$ . For allele frequencies a, b, c, d, ⁠f4(A, B; C, D)⁠ is computed as the average of (a-b)*(c-d) over all SNPs in each SNP block. This is equivalent to $(f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2$ (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp or f2_from_geno, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).

Usage

qpdstat(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  comb = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  f4mode = TRUE,
  afprod = TRUE,
  cpp = TRUE,
  verbose = is.character(data),
  ...
)
qpdstat(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  comb = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  f4mode = TRUE,
  afprod = TRUE,
  cpp = TRUE,
  verbose = is.character(data),
  ...
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pop1`	One of the following four: `NULL`: all possible population combinations will be returned A vector of population labels. All combinations with the other `pop` arguments will be returned A matrix with population combinations to be tested, with one population per column and one combination per row. Other `pop` arguments will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. Other `pop` arguments will be ignored.
`pop2`	A vector of population labels
`pop3`	A vector of population labels
`pop4`	A vector of population labels
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`sure`	The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.
`unique_only`	If `TRUE` (the default), redundant combinations will be excluded
`comb`	Generate all combinations of `pop1`, `pop2`, `pop3`, `pop4`. If `FALSE`, `pop1`, `pop2`, `pop3`, `pop4` should all be vectors of the same length.
`blgsize`	SNP block size in Morgan. Default is 0.05 (5 cM). Only used when `data` is the prefix of genotype files
`block_lengths`	Vector with lengths of each jackknife block. `sum(block_lengths)` has to match the number of SNPs. only used when `data` is the prefix of genotype files
`f4mode`	Set this to `FALSE` to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as `(a-b)(c-d) / ((a + b - 2ab) (c + d - 2cd))`, which is the same as `(P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))`
`afprod`	Compute f4 from allele frequency products instead of f2 (default `TRUE`). Only used if `data` is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting `maxmiss` to values greater than 0. In all other cases it should not make a difference.
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`verbose`	Print progress updates
`...`	Additional arguments passed to `f4blockdat_from_geno` if `data` is a genotype file prefix or `f2_from_precomp` if `data` is a directory with f2-statistics

Details

f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.

Use the same SNPs for every f4-statistic
This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0 in extract_f2 or f2_from_geno).
Use different SNPs for each f4-statistic
This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0)
Use different SNPs for each f2-statistic
This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss parameter to a value greater than 0 (and not larger than 1) in extract_f2 or f2_from_geno. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE in f2_from_precomp. In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1 in extract_f2) to get an overview, and then spot-check individual results using more conservative options.

Value

qpdstat returns a data frame with f4 statistics, with columns for populations 1 to 4, f4-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Alias

f4

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
## Not run: 
qpdstat(f2_dir, pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)

## End(Not run)
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
## Not run: 
qpdstat(f2_dir, pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)

## End(Not run)

Wrapper function around the original qpDstat program

Description

This requires a working installation of qpDstat, which will be called using system

Usage

qpdstat_wrapper(
  pref,
  pop1,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  bin = "~np29/o2bin/qpDstat",
  outdir = ".",
  parfile = NULL,
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)
qpdstat_wrapper(
  pref,
  pop1,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  bin = "~np29/o2bin/qpDstat",
  outdir = ".",
  parfile = NULL,
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

`pref`	Path to and prefix of the packedancestrymap genotype files
`pop1`	One of the following four: `NULL`: populations will be read from `poplistname` or `popfilename` specified in `parfile` A vector of population labels A data frame in which the first four columns specify the population combinations to be tested. `pop2`, `pop3`, `pop4` will be ignored. the location of a file (`poplistname` or `popfilename`) which specifies the populations or population combinations to be tested. `pop2`, `pop3`, `pop4` will be ignored.
`pop2`	A vector of population labels
`pop3`	A vector of population labels
`pop4`	A vector of population labels
`bin`	Path to the qpDstat binary file
`outdir`	Output directory. files `out`, `parfile`, `poplistname`, `popfilename` may be overwritten
`parfile`	qpDstat parameter file. If this is specified, `pop`, `pop2`, `pop3`, and `pop4` will be ignored.
`f4mode`	f4mode
`inbreed`	inbreed
`printonly`	Should the command be printed or executed?
`env`	Export environmental variables. See examples.
`verbose`	Print progress updates

Value

If printonly, the qpDstat command, otherwise a data frame with parsed qpDstat output

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat_wrapper('genotype_prefix', pop1, pop2, pop3, pop4,
  bin = 'path/to/qpDstat', pref = 'path/to/packedancestrymap_prefix')

## End(Not run)
## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat_wrapper('genotype_prefix', pop1, pop2, pop3, pop4,
  bin = 'path/to/qpDstat', pref = 'path/to/packedancestrymap_prefix')

## End(Not run)

Estimate f4 differences

Description

Estimate f4 differences

Usage

qpf4diff(data, pops, boot = FALSE, verbose = FALSE)
qpf4diff(data, pops, boot = FALSE, verbose = FALSE)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pops`	A vector of 5 populations or a five column population matrix. The following ratios will be computed: `⁠f4(1, 2; 3, 4)/f4(1, 2; 5, 4)⁠`
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`verbose`	Print progress updates

Value

qpf4diff returns a data frame with f4 ratios

Estimate admixture proportions via f4 ratios

Description

Estimate admixture proportions via f4 ratios

Usage

qpf4ratio(data, pops, boot = FALSE, verbose = FALSE)
qpf4ratio(data, pops, boot = FALSE, verbose = FALSE)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `f2_from_geno` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pops`	A vector of 5 populations or a five column population matrix. The following ratios will be computed: `⁠f4(1, 2; 3, 4)/f4(1, 2; 5, 4)⁠`
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`verbose`	Print progress updates

Value

qpf4ratio returns a data frame with f4 ratios

Wrapper function around the original qpF4ratio program

Description

This requires a working installation of qpF4ratio, which will be called using system

Usage

qpf4ratio_wrapper(
  pref,
  pops,
  bin = "~np29/o2bin/qpF4ratio",
  outdir = ".",
  parfile = NULL,
  blgsize = 0.05,
  fancyf4 = "YES",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)
qpf4ratio_wrapper(
  pref,
  pops,
  bin = "~np29/o2bin/qpF4ratio",
  outdir = ".",
  parfile = NULL,
  blgsize = 0.05,
  fancyf4 = "YES",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

`pref`	Path to and prefix of the packedancestrymap genotype files
`pops`	A vector of five populations, or a 5 x n matrix with population names. For each line `alpha` will be computed as `⁠f4(1,2; 3,4)/f4(1,2; 5,4)⁠`
`bin`	Path to the qpF4ratio binary file
`outdir`	Output directory. files `out`, `parfile`, `poplistname`, `popfilename` may be overwritten
`parfile`	qpF4ratio parameter file. If this is specified, `pops` will be ignored.
`blgsize`	blgsize
`fancyf4`	fancyf4
`printonly`	Should the command be printed or executed?
`env`	Export environmental variables. See examples.
`verbose`	Print progress updates

Value

If printonly, the qpF4ratio command, otherwise a data frame with parsed qpF4ratio output

Examples

## Not run: 
pops = c('Denisova.DG', 'Altai_Neanderthal.DG', 'Vindija.DG', 'Chimp.REF', 'Mbuti.DG')
qpf4ratio_wrapper('genotype_prefix', pops, bin = 'path/to/qpDstat')

## End(Not run)
## Not run: 
pops = c('Denisova.DG', 'Altai_Neanderthal.DG', 'Vindija.DG', 'Chimp.REF', 'Mbuti.DG')
qpf4ratio_wrapper('genotype_prefix', pops, bin = 'path/to/qpDstat')

## End(Not run)

Get smoothed f2-statistics

Description

This function returns an array of (pseudo-) f2-statistics which are computed by taking into account other f2-, f3-, and f4-statistics. The advantage of doing that is that f3- and f4-statistics computed from these smoothed f2-statistics can be more accurate for populations with large amounts of missing data. The function uses SNPs which are missing in some populations in a manner which tends to introduce less bias than setting maxmiss to values greater than 0.

Usage

qpfstats(
  pref,
  pops,
  include_f2 = TRUE,
  include_f3 = TRUE,
  include_f4 = TRUE,
  verbose = TRUE
)
qpfstats(
  pref,
  pops,
  include_f2 = TRUE,
  include_f3 = TRUE,
  include_f4 = TRUE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of genotype files
`pops`	Populations for which to compute f2-statistics
`include_f2`	Should f2-statistics be used to get smoothed f2-statistics? If `include_f2` is a positive integer, it specifies how many randomly chosen f2-statistics should be used.
`include_f3`	Should f3-statistics be used to get smoothed f2-statistics? If `include_f3` is a positive integer, it specifies how many randomly chosen f3-statistics should be used.
`include_f4`	Should f4-statistics be used to get smoothed f2-statistics? If `include_f4` is a positive integer, it specifies how many randomly chosen f4-statistics should be used.

Value

A 3d-array of smoothed f2-statistics

Examples

## Not run: 
f2_blocks = qpfstats(geno_prefix, mypops)

## End(Not run)
## Not run: 
f2_blocks = qpfstats(geno_prefix, mypops)

## End(Not run)

Compute the fit of an admixture graph

Description

Computes the fit of a given admixturegraph from f2-statistics. Drift edge weights and admixture edges weights are optimized until the (negative) likelihood score is minimized. The likelihood score is based on the squared difference between estimated and fitted f3-statistics.

Usage

qpgraph(
  data,
  graph,
  lambdascale = 1,
  boot = FALSE,
  diag = 1e-04,
  diag_f3 = 1e-05,
  lsqmode = FALSE,
  numstart = 10,
  seed = NULL,
  cpp = TRUE,
  return_fstats = FALSE,
  return_pvalue = FALSE,
  f3precomp = NULL,
  f3basepop = NULL,
  constrained = TRUE,
  allsnps = FALSE,
  ppinv = NULL,
  f2_blocks_test = NULL,
  verbose = FALSE
)
qpgraph(
  data,
  graph,
  lambdascale = 1,
  boot = FALSE,
  diag = 1e-04,
  diag_f3 = 1e-05,
  lsqmode = FALSE,
  numstart = 10,
  seed = NULL,
  cpp = TRUE,
  return_fstats = FALSE,
  return_pvalue = FALSE,
  f3precomp = NULL,
  f3basepop = NULL,
  constrained = TRUE,
  allsnps = FALSE,
  ppinv = NULL,
  f2_blocks_test = NULL,
  verbose = FALSE
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`graph`	An admixture graph represented as a matrix of edges, an `igraph` object, or the path to a qpGraph graph file. Edges can be constrained by providing a matrix or data frame of edges with columns titled `lower` and `upper` with lower and upper bounds, respectively. By default, admixture edges are constrained to be between zero and one (with paired edges summing to one), and drift edges have a lower bound at zero.
`lambdascale`	Scales f2-statistics. This has no effect on the fit, but is used in the original qpGraph program to display branch weights on a scale that corresponds to FST distances.
`boot`	If `FALSE` (the default), each block will be left out at a time and the covariance matrix of f3 statistics will be computed using block-jackknife. Otherwise bootstrap resampling is performed `n` times, where `n` is either equal to `boot` if it is an integer, or equal to the number of blocks if `boot` is `TRUE`. The covariance matrix of f3 statistics will be computed using bootstrap resampling.
`diag`	Regularization term added to the diagonal elements of the covariance matrix of fitted branch lengths (after scaling by the matrix trace). Default is 0.0001.
`diag_f3`	Regularization term added to the diagonal elements of the covariance matrix of estimated f3 statistics (after scaling by the matrix trace). In the original qpGraph program, this is fixed at 0.00001.
`lsqmode`	Least-squares mode. If `TRUE`, the likelihood score will be computed using a diagonal matrix with `1/(sum(diag(f3_var)) * diag_f3)`, in place of the inverse f3-statistic covariance matrix. `lsqmode = 2` will use the identity matrix instead, which is equivalent to computing the score as the sum of squared residuals (`sum((f3_est-f3_fit)^2)`). Both of these options do not take the covariance of f3-statistics into account. This can lead to bias, but is more stable in cases where the inverse f3-statistics covariance matrix can not be estimated precisely (for example because the number of populations is large). An alternative to `lsqmode = TRUE` that doesn't completely ignore the covariance of f3-statistics is to increase `diag_f3`.
`numstart`	Number of random initializations of starting weights. Defaults to 10. Increasing this number will make the optimization slower, but reduce the risk of not finding the optimal weights. Check the `opt` output to see how much the optimization depends on the starting weights.
`seed`	Random seed for generating starting weights.
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`return_fstats`	Return estimated and fitted f2- and f4-statistics, as well as the worst f4-statistic residual Z-score. Defaults to `FALSE` because this can be slow.
`f3precomp`	Optional precomputed f3-statistics. This should be the output of `qpgraph_precompute_f3` and can be provided instead of `data`. This can speed things up if many graphs are evaluated using the same set of f3-statistics.
`f3basepop`	Optional f3-statistics base population. Inference will be based on f3-statistics of the form `⁠f3(f3basepop; i, j)⁠` for all population pairs `⁠(i, j)⁠`. Defaults to the outgroup population if the graph has one. This option is ignored if `f3precomp` is provided. Changing `f3basepop` should make very little difference.
`constrained`	Constrain estimated drift edge weights to be non-negative, and admixture edge weights to be between zero and one.
`allsnps`	Compute f3 from different SNPs for each population triplet (if data is missing for some SNPs and populations). This only has an effect when `data` is the prefix of genotype files.
`ppinv`	Optional inverse f3-statistics covariance matrix
`f2_blocks_test`	An optional 3d array of f2-statistics used for computing an out-of-sample score. This should contain only SNP blocks which are not part of `f2_blocks`. This allows to estimate the fit of a graph without overfitting and will not be used during the optimization step
`verbose`	Print progress updates

Value

qpgraph returns a list with data describing the model fit:

edges: A data frame where each row is an edge in the graph. For regular edges, the column weight is the estimated edge length, and for admixture edges, it is the estimated admixture weight.
score: The likelihood score of the fitted graph. Lower values correspond to better fits. The score is calculated as the inner product of the residuals (difference between estimated and fitted f3 statistics), weighted by the inverse of the f3 covariance matrix. See qpgraph_score
f2: Estimated and fitted f2 statistics (if return_fstats = TRUE). p-values and z-scores test the significance of the difference.
f3: Estimated and fitted f3 statistics. p-values and z-scores test the significance of the difference.
f4: Estimated and fitted f4 statistics (if return_fstats = TRUE). p-values and z-scores test the significance of the difference.
opt: A data frame with details of the weight-fitting step, including the randomly sampled starting weights. The column value contains the score for each set of starting weights. Columns starting with x denote initial weights, and columns starting with y denote fitted weights.
worst_residual: The highest absolute z-score of f4-statistics residuals (fitted - estimated f4); (returned if return_fstats = TRUE)

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Examples

out = qpgraph(example_f2_blocks, example_graph)
plot_graph(out$edges)
out = qpgraph(example_f2_blocks, example_graph)
plot_graph(out$edges)

Compute f3-statistics from f2-statistics.

Description

Takes a 3d array of f2 block jackknife estimates and computes f3-statistics between the first population $p1$ and all population pairs $i, j$ : $f3(p1; p_i, p_j)$

Usage

qpgraph_precompute_f3(
  data,
  pops,
  f3basepop = NULL,
  lambdascale = 1,
  boot = FALSE,
  seed = NULL,
  diag_f3 = 1e-05,
  lsqmode = FALSE
)
qpgraph_precompute_f3(
  data,
  pops,
  f3basepop = NULL,
  lambdascale = 1,
  boot = FALSE,
  seed = NULL,
  diag_f3 = 1e-05,
  lsqmode = FALSE
)

Arguments

`data`	Input data in one of three forms: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` (fastest option) A directory which contains pre-computed f2-statistics The prefix of genotype files (slowest option)
`pops`	Populations for which to compute f3-statistics
`f3basepop`	f3-statistics base population. If `NULL` (the default), the first population in `pops` will be used as the basis.
`lambdascale`	Scales f2-statistics. This has no effect on the fit, but is used in the original qpGraph program to display branch weights on a scale that corresponds to FST distances.
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`seed`	Random seed used if `boot` is `TRUE`.
`diag_f3`	Regularization term added to the diagonal elements of the covariance matrix of estimated f3 statistics (after scaling by the matrix trace). In the original qpGraph program, this is fixed at 0.00001.
`lsqmode`	Least-squares mode. If `TRUE`, the likelihood score will be computed using a diagonal matrix with `1/(sum(diag(f3_var)) * diag_f3)`, in place of the inverse f3-statistic covariance matrix. `lsqmode = 2` will use the identity matrix instead, which is equivalent to computing the score as the sum of squared residuals (`sum((f3_est-f3_fit)^2)`). Both of these options do not take the covariance of f3-statistics into account. This can lead to bias, but is more stable in cases where the inverse f3-statistics covariance matrix can not be estimated precisely (for example because the number of populations is large). An alternative to using `lsqmode = TRUE` which doesn't completely ignore the covariance of f3-statistics is to increase `diag_f3`.

Value

A list with four items

f3_est a matrix with f3-statistics for all population pairs with the output
ppinv a matrix with the inverse of the f3-statistic covariance matrix
f2out a data frame with f2 estimates
f3out a data frame with f3 estimates

Examples

pops = get_leafnames(example_igraph)
qpgraph_precompute_f3(example_f2_blocks, pops)$f3_est
## Not run: 
qpgraph_precompute_f3(f2_dir, pops)

## End(Not run)
pops = get_leafnames(example_igraph)
qpgraph_precompute_f3(example_f2_blocks, pops)$f3_est
## Not run: 
qpgraph_precompute_f3(f2_dir, pops)

## End(Not run)

Evaluate a qpgraph models many times

Description

This function is used in combination with compare_fits in order to test whether one graph has a significantly better fit than another. It creates bootstrap resampled SNP block training and test sets, and uses them to evaluate multiple graphs.

Usage

qpgraph_resample_multi(f2_blocks, graphlist, nboot, verbose = TRUE, ...)
qpgraph_resample_multi(f2_blocks, graphlist, nboot, verbose = TRUE, ...)

Arguments

`f2_blocks`	3d array of f2-statistics
`graphlist`	A list of admixture graphs
`nboot`	Number of bootstrap iterations
`verbose`	Print progress updates
`...`	Arguments passed to `qpgraph`

Value

A list of same length as graphlist with data frames with qpgraph results for each iteration of bootstrap resampled f2-statistics

Examples

## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

## End(Not run)
## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

## End(Not run)

Evaluate a qpgraph model many times

Description

This function is used in combination with compare_fits in order to test whether one graph has a significantly better fit than another. using a list of bootstrap resampled f2-statistics and corresponding test-set f2-statistics

Usage

qpgraph_resample_snps2(f2_blocks, graph, f2_blocks_test, verbose = TRUE, ...)
qpgraph_resample_snps2(f2_blocks, graph, f2_blocks_test, verbose = TRUE, ...)

Arguments

`f2_blocks`	A list of bootstrap resampled f2-statistics
`graph`	An admixture graph
`f2_blocks_test`	A list of f2-statistics from all blocks which were not used in the corresponding f2-array in `f2_blocks`
`verbose`	Print progress updates
`...`	Arguments passed to `qpgraph`

Value

A data frame with qpgraph results for each iteration of bootstrap resampled f2-statistics

Wrapper function around the original qpGraph program

Description

Wrapper function around the original qpGraph program

Usage

qpgraph_wrapper(
  pref,
  graph,
  bin = "~np29/o2bin/qpGraph",
  parfile = NULL,
  outdir = ".",
  printonly = FALSE,
  badsnps = NULL,
  lambdascale = -1,
  inbreed = "NO",
  diag = 1e-04,
  blgsize = 0.05,
  outpop = "NULL",
  loadf3 = NULL,
  lsqmode = "NO",
  fstdmode = "NO",
  hires = "NO",
  forcezmode = "NO",
  zthresh = 0,
  allsnps = "NO",
  oldallsnps = "NO",
  doanalysis = "YES",
  bigiter = 100,
  initmix = 0,
  env = "",
  verbose = TRUE
)
qpgraph_wrapper(
  pref,
  graph,
  bin = "~np29/o2bin/qpGraph",
  parfile = NULL,
  outdir = ".",
  printonly = FALSE,
  badsnps = NULL,
  lambdascale = -1,
  inbreed = "NO",
  diag = 1e-04,
  blgsize = 0.05,
  outpop = "NULL",
  loadf3 = NULL,
  lsqmode = "NO",
  fstdmode = "NO",
  hires = "NO",
  forcezmode = "NO",
  zthresh = 0,
  allsnps = "NO",
  oldallsnps = "NO",
  doanalysis = "YES",
  bigiter = 100,
  initmix = 0,
  env = "",
  verbose = TRUE
)

Arguments

`pref`	Prefix of the packedancestrymap format genotype files.
`graph`	An admixture graph or qpGraph graph file
`bin`	Location of the qpGraph binary
`parfile`	qpGraph parameter file
`outdir`	Output directory
`printonly`	Should output be executed or the command just be printed?
`badsnps`	badsnps
`lambdascale`	lambdascale
`inbreed`	inbreed
`diag`	diag
`blgsize`	blgsize
`outpop`	outgroup population
`loadf3`	loadf3
`lsqmode`	least-squares mode. sets the offdiagonal elements of the block-jackknife covariance matrix to zero.
`fstdmode`	fstdmode
`hires`	hires
`forcezmode`	forcezmode
`zthresh`	zthresh
`allsnps`	allsnps
`oldallsnps`	oldallsnps
`doanalysis`	doanalysis
`bigiter`	bigiter
`initmix`	initmix
`env`	Export environmental variables. See examples.
`verbose`	Print progress updates

Value

A list with parsed qpGraph output

edges: data frame
score: scalar
f2: data frame

Examples

## Not run: 
qpgraph_wrapper('genotype_prefix', example_graph,
                 bin = 'path/to/qpGraph')

## End(Not run)
## Not run: 
qpgraph_wrapper('genotype_prefix', example_graph,
                 bin = 'path/to/qpGraph')

## End(Not run)

Estimate admixture waves

Description

qpwave compares two sets of populations (left and right) to each other. It estimates a lower bound on the number of admixtue waves that went from left into right, by comparing a matrix of f4-statistics to low-rank approximations. For a rank of 0 this is equivalent to testing whether left and right form clades relative to each other.

Usage

qpwave(
  data,
  left,
  right,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  cpp = TRUE,
  verbose = TRUE
)
qpwave(
  data,
  left,
  right,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  cpp = TRUE,
  verbose = TRUE
)

Arguments

`data`	The input data in the form of: A 3d array of blocked f2 statistics, output of `f2_from_precomp` or `extract_f2` A directory with f2 statistics The prefix of a genotype file
`left`	Left populations (sources)
`right`	Right populations (outgroups)
`fudge`	Value added to diagonal matrix elements before inverting
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`constrained`	Constrain admixture weights to be non-negative
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.
`verbose`	Print progress updates

Value

qpwave returns a list with up to two data frames describing the model fit:

f4 A data frame with estimated f4-statistics
rankdrop: A data frame describing model fits with different ranks, including p-values for the overall fit and for nested models (comparing two models with rank difference of one). A model with L left populations and R right populations has an f4-matrix of dimensions (L-1)*(R-1). If no two left population form a clade with respect to all right populations, this model will have rank (L-1)*(R-1).
- f4rank: Tested rank
- dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)
- chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.
- p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)
- dofdiff: Difference in degrees of freedom between this model and the model with one less rank
- chisqdiff: Difference in chi-squared statistics
- p_nested: p-value testing whether the difference between two models of rank difference 1 is significant

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 10)

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
qpwave(example_f2_blocks, left, right)
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
qpwave(example_f2_blocks, left, right)

Compute all pairwise qpwave p-values

Description

For all pairs of left populations qpwave rank 0 Chi-squared statistics and p-values will be computed. This tests for each pair of left populations whether it forms a clade with respect to the right populations.

Usage

qpwave_pairs(f2_data, left, right)
qpwave_pairs(f2_data, left, right)

Arguments

`f2_data`	A 3d array of blocked f2 statistics, output of `f2_from_precomp`. Alternatively, a directory with f2 statistics.
`left`	Left populations
`right`	Right populations

Examples

## Not run: 
left = pops[5:7]
right = pops[1:4]
f2_blocks = f2_from_precomp('/my/f2/dir/', pops = left, pops2 = right, afprod = TRUE)
qpwave_pairs(f2_blocks, left, right)
# If f2-stats are too big to load them into memory,
# the following will read the data for each pair from disk:
qpwave_pairs('/my/f2/dir/', left, right)

## End(Not run)
## Not run: 
left = pops[5:7]
right = pops[1:4]
f2_blocks = f2_from_precomp('/my/f2/dir/', pops = left, pops2 = right, afprod = TRUE)
qpwave_pairs(f2_blocks, left, right)
# If f2-stats are too big to load them into memory,
# the following will read the data for each pair from disk:
qpwave_pairs('/my/f2/dir/', left, right)

## End(Not run)

Wrapper function around the original qpWave program

Description

Wrapper function around the original qpWave program

Usage

qpwave_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)
qpwave_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

`left`	Left populations (sources)
`right`	Right populations (outgroups)
`target`	Target population
`verbose`	Print progress updates

Generate a random admixture graph

Description

This function randomly generates an admixture graph for a given set of leaf nodes

Usage

random_admixturegraph(
  leaves,
  numadmix = 0,
  simple = TRUE,
  outpop = NULL,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL,
  ntry = 100
)
random_admixturegraph(
  leaves,
  numadmix = 0,
  simple = TRUE,
  outpop = NULL,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL,
  ntry = 100
)

Arguments

`leaves`	Names of the leaf nodes, or a number specifying how many leaf nodes there should be
`numadmix`	Number of admixture events
`simple`	Should edges leading to admixture nodes consist of separate admix edges and normal edges
`outpop`	Outgroup population

Examples

rand_graph = random_admixturegraph(10, numadmix = 5)
plot_graph(rand_graph)
rand_graph = random_admixturegraph(10, numadmix = 5)
plot_graph(rand_graph)

Get random dates for graph nodes

Description

This function assigns a date (in generations) to each node in an admixture graph and is used in random_sim. The dates are drawn from a uniform distribution with given lower and upper bounds, unless the fix_leaf option is set to TRUE, in which case the dates of all leaf nodes returned by get_leafnames will be set to 0.

Usage

random_dates(graph, min = 1000, max = 1000, fix_leaf = FALSE)
random_dates(graph, min = 1000, max = 1000, fix_leaf = FALSE)

Arguments

`graph`	An admixture graph
`min`	A lower limit of uniform distribution
`max`	An upper limit of uniform distribution
`fix_leaf`	A boolean specifying if the dates of the leaf nodes will be fixed at time 0 (i.e., at the most recent time). If `TRUE`, all samples will be drawn at the end of the simulation (i.e., from "today"). The default is `FALSE`.

Value

A named vector with random dates for each graph node

Generate a random binary graph

Description

Generate a random binary graph

Usage

random_newick(n, start = "", end = "", outpop = NULL)
random_newick(n, start = "", end = "", outpop = NULL)

Arguments

`n`	The number of terminal nodes, or a vector of population labels.
`start`	Prefix.
`end`	Postfix.
`outpop`	Outgroup (if `n` is a vector of labels).

Value

Tree in newick format.

Examples

random_newick(5)
random_newick(c('a', 'b', 'c', 'd')) # toplogy random, but pop order fixed
random_newick(sample(c('a', 'b', 'c', 'd'))) # toplogy and pop order random
random_newick(5)
random_newick(c('a', 'b', 'c', 'd')) # toplogy random, but pop order fixed
random_newick(sample(c('a', 'b', 'c', 'd'))) # toplogy and pop order random

Generate a random graph and simulate it in msprime v1.x

Description

This is basically a wrapper function around the msprime_genome that allows user to create a random graph and simulate it in msprime v1.x.

Usage

random_sim(
  nleaf,
  nadmix,
  outpref = "random_sim",
  max_depth = NULL,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  admix_weights = 0.5,
  neff = 1000,
  time = 1000,
  fix_leaf = FALSE,
  outpop = NULL,
  nchr = 1,
  recomb_rate = 2e-08,
  seq_length = 1000,
  ghost_lineages = TRUE,
  run = FALSE
)
random_sim(
  nleaf,
  nadmix,
  outpref = "random_sim",
  max_depth = NULL,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  admix_weights = 0.5,
  neff = 1000,
  time = 1000,
  fix_leaf = FALSE,
  outpop = NULL,
  nchr = 1,
  recomb_rate = 2e-08,
  seq_length = 1000,
  ghost_lineages = TRUE,
  run = FALSE
)

Arguments

`nleaf`	The number of leaf nodes
`nadmix`	The number of admixture events
`outpref`	A prefix of output files
`max_depth`	A constraint specifying the maximum time depth of the admixture graph (in generations)
`ind_per_pop`	The number of individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population.
`mutation_rate`	Mutation rate per site per generation. The default is `1.25e-8` per base pair per generation.
`admix_weights`	Admixture weights. If a float value (0 < value < 1), admixture weights for each admixture event will be (value, 1-value). Alternatively, it can be a range, i.e., `c(0.1, 0.4)` specifying lower and upper limits of a uniform distribution from which the admixture weight value will be drawn. By default, all admixture edges have a weight of 0.5.
`neff`	Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a range, i.e., `c(500, 1000)` specifying lower and upper limits of an uniform distribution from which values will be drawn
`time`	Time between nodes. Either a scalar value (1000 by default) with the dates generated by `pseudo_dates`, or a range, i.e., `c(500, 1000)` specifying lower and upper limits of a uniform distribution from which values will be drawn (see `random_dates`)
`fix_leaf`	A boolean specifying if the dates of the leaf nodes will be fixed at time 0. If `TRUE`, all samples will be drawn at the end of the simulation (i.e., from “today”).
`outpop`	A name of the (optional) outgroup population.
`nchr`	The number of chromosomes to simulate
`recomb_rate`	A float value specifying recombination rate along the chromosomes. The default is `2e-8` per base pair per generation.
`seq_length`	The sequence length of the chromosomes. If it is a scalar value, the sequence length will be constant for all chromosomes. Alternatively, it can be a vector with a length equal to number of chromosomes (i.e., `c(100,50)` to simulate 2 chromosomes with the lengths of 100 and 50 base pairs).
`ghost_lineages`	A boolean value specifying whether ghost lineages will be allowed. If `TRUE`, admixture happens at the time points defined by the y-axis generated while plotting the graph by `plot_graph` If `FALSE` (default), admixture occurs at the time of the previous split event
`run`	If `FALSE`, the function will terminate after writing the msprime script. If `TRUE`, it will try and execute the function with the default python installation. If you want to use some other python installation, you can set `⁠run = /my/python⁠`.

Value

A list with the path of simulation script, a data frame of graph edges, dates and population sizes:

out - A file name and path of the simulation script
edges - An edge dataframe with admixture weights
dates - A named vector with a date for each node
neffs - A named vector with an effective population size for each node

Examples

# Create simulation script that simulates 2 chromosomes that are 50base long
# where maximum depth of the tree is 5000 generations, and plot the output graph
## Not run: 
out = random_sim(nleaf=4, nadmix=0, max_depth=5000, nchr=2, seq_length=50)
plot_graph(out$edges, dates = out$dates, neff = out$neff, hide_weights = TRUE)

## End(Not run)
# Create simulation script that simulates 2 chromosomes that are 50base long
# where maximum depth of the tree is 5000 generations, and plot the output graph
## Not run: 
out = random_sim(nleaf=4, nadmix=0, max_depth=5000, nchr=2, seq_length=50)
plot_graph(out$edges, dates = out$dates, neff = out$neff, hide_weights = TRUE)

## End(Not run)

Read genotype data from EIGENSTRAT files

Description

Read genotype data from EIGENSTRAT files

Usage

read_eigenstrat(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)
read_eigenstrat(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of the packedancestrymap files
`inds`	Individuals for which data should be read. Defaults to all individuals
`pops`	Populations for which data should be read. Cannot be provided together with 'inds'
`first`	Index of first SNP to read
`last`	Index of last SNP to read
`transpose`	Transpose genotype matrix (default is `snps` x `individuals`)
`verbose`	Print progress updates

Value

A list with the genotype data matrix, the .ind file, and the .snp file

Examples

## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)
## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)

Read blocked f2 estimates from disk

Description

This function reads blocked f2 estimates (or allele frequency products) which were writtend to disk by write_f2 and returns them as a 3d array.

Usage

read_f2(
  f2_dir,
  pops = NULL,
  pops2 = NULL,
  type = "f2",
  counts = FALSE,
  remove_na = TRUE,
  verbose = FALSE
)
read_f2(
  f2_dir,
  pops = NULL,
  pops2 = NULL,
  type = "f2",
  counts = FALSE,
  remove_na = TRUE,
  verbose = FALSE
)

Arguments

`f2_dir`	Directory from which to read files
`pops`	Populations for which f2 statistics should be read. Defaults to all populations, which may require a lot of memory.
`pops2`	Specify this if you only want to read a subset of all population pairs. The resulting array will differ on 1st and 2nd dimension and will not work with all functions.
`counts`	Return allele counts instead of f2 estimates
`remove_na`	Remove blocks with missing values
`verbose`	Print progress updates
`afprod`	Return allele frequency products instead of f2 estimates

Value

A 3d array of block jackknife estimates

Examples

## Not run: 
read_f2(f2_dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)
## Not run: 
read_f2(f2_dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)

Read genotype data from packedancestrymap files

Description

Read genotype data from packedancestrymap files

Usage

read_packedancestrymap(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)
read_packedancestrymap(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)

Arguments

`pref`	Prefix of the packedancestrymap files
`inds`	Individuals for which data should be read. Defaults to all individuals
`pops`	Populations for which data should be read. Cannot be provided together with 'inds'
`first`	Index of first SNP to read
`last`	Index of last SNP to read
`transpose`	Transpose genotype matrix (default is `snps` x `individuals`)
`verbose`	Print progress updates

Value

A list with the genotype data matrix, the .ind file, and the .snp file

Examples

## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)
## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)

Read genotype data from `PLINK` files

Description

See genio for a dedicated R package for reading and writing PLINK files. This function is based on a similar function in the plink2R package.

Usage

read_plink(pref, inds = NULL, pops = NULL, verbose = FALSE)
read_plink(pref, inds = NULL, pops = NULL, verbose = FALSE)

Arguments

`pref`	Prefix of packedancestrymap files (files have to end in `.geno`, `.ind`, `.snp`)
`inds`	Individuals for which data should be read. Defaults to all individuals
`pops`	Populations for which data should be read. Cannot be provided together with 'inds'
`verbose`	Print progress updates

Value

A list with the genotype data matrix, the .ind file, and the .snp file

Examples

## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)
## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)

Run models while leaving out individuals

Description

These are wrapper functions around various qp functions, which will evaluate many models at once. The models are formed by leaving out every individual which belongs to a population with more than one sample, one at a time.

Usage

qp3pop_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpdstat_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpwave_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpadm_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpgraph_resample_inds(dir, inds, pops, verbose = TRUE, ...)
qp3pop_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpdstat_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpwave_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpadm_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpgraph_resample_inds(dir, inds, pops, verbose = TRUE, ...)

Arguments

`dir`	directory with precomputed data
`inds`	vector of individual names
`pops`	vector of population names. Should be the same length as `inds`
`verbose`	print progress updates
`...`	named arguments passed to the `qp` function.

Value

a nested data frame where each model is a row, and the columns are model parameters and model outputs

Examples

## Not run: 
res = qpadm_resample_inds
unnest(res, weights)

## End(Not run)
## Not run: 
res = qpadm_resample_inds
unnest(res, weights)

## End(Not run)

Run models while leaving out SNP blocks

Description

These are wrapper functions around various qp functions, which will evaluate a model on multiple subdivisions of the data. The subdivisions are formed by leaving out one or more SNP block at a time (see boot for details).

Usage

qp3pop_resample_snps(f2_blocks, boot = FALSE, ...)

qpdstat_resample_snps(f2_blocks, boot = FALSE, ...)

qpwave_resample_snps(f2_blocks, boot = FALSE, ...)

qpadm_resample_snps(f2_blocks, boot = FALSE, ...)

qpgraph_resample_snps(f2_blocks, boot = FALSE, ...)
qp3pop_resample_snps(f2_blocks, boot = FALSE, ...)

qpdstat_resample_snps(f2_blocks, boot = FALSE, ...)

qpwave_resample_snps(f2_blocks, boot = FALSE, ...)

qpadm_resample_snps(f2_blocks, boot = FALSE, ...)

qpgraph_resample_snps(f2_blocks, boot = FALSE, ...)

Arguments

`f2_blocks`	a 3d array of blocked f2 statistics
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`...`	named arguments which are passed to the `qp` function.
`verbose`	print progress updates

Value

a nested data frame where each model is a row, and the columns are model parameters and model outputs

Examples

## Not run: 
res = qpadm_resample_snps(example_f2_blocks, target = target, left = left, right = right)
unnest(res, weights)

## End(Not run)
## Not run: 
res = qpadm_resample_snps(example_f2_blocks, target = target, left = left, right = right)
unnest(res, weights)

## End(Not run)

Rotate populations between left and right

Description

This functions creates a data frame with population combinations which can be used as the input for qpadm_multi

Usage

rotate_models(leftright, target, rightfix = NULL)
rotate_models(leftright, target, rightfix = NULL)

Arguments

`leftright`	Populations which will be distributed between left and right
`target`	Target population
`rightfix`	Populations which will be on the right side in all models

Value

A data frame with Chi-squared statistics and p-values for each population combination

Examples

## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
rotate_models(leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)
## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
rotate_models(leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)

Launch ADMIXTOOLS 2 GUI

Description

Launch ADMIXTOOLS 2 GUI

Usage

run_shiny_admixtools()
run_shiny_admixtools()

Test constraints on a graph

Description

This function tests whether a graph satisfies a number of different types of constraints

Usage

satisfies_constraints(
  graph,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL
)
satisfies_constraints(
  graph,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL
)

Arguments

`graph`	An admixture graph
`nonzero_f4`	A data frame or matrix with four columns. One row for each f4-statistic which is observed to be significantly non-zero
`admix_constraints`	A data frame with columns `pop`, `min`, `max`
`event_order`	A data frame with columns `earlier1`, `earlier2`, `later1`, `later2`

Value

TRUE if all admixture constraints are satisfied, else FALSE

Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with all event orders listed in eventorder

Usage

satisfies_eventorder(graph, eventorder, strict = TRUE)
satisfies_eventorder(graph, eventorder, strict = TRUE)

Arguments

`graph`	An admixture graph
`eventorder`	A data frame with columns `earlier1`, `earlier2`, `later1`, `later2`, optionally `type`
`strict`	What to do in case some events are not determined by the graph. If `strict = TRUE` (the default), the function will only return `TRUE` if there are no ambiguous constraints. Otherwise, `TRUE` will be returned as long as no constraint is directly contradicted.

Details

Each row in eventorder represents a constraint that earlier1 and earlier2 split earlier than later1 and later2. If later2 is NA, later2 will be set to the parent node of later1. By default (type = 1), a constraint will be satisfied as long as there is any lineage in which a split between earlier1 and earlier2 is ancestral to a split between later1 and later2. type = 2 is stricter and requires that the earlier1, earlier2 split is ancestral to the later1, later2 split in all lineages. In graphs with multiple admixture events there can be multiple splits between earlier1, earlier2 and later1, later2, and many ways in which these splits can relate to each other. The current implementation only covers some of the many possible topological relationships.

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether the split between A and B is earlier than the split between C and D,
#   and whether the split between C and D is earlier than the terminal branch leading to E
constrain_events = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2,
  'A', 'B', 'C', 'D',
  'C', 'D', 'E', NA)
satisfies_eventorder(random_admixturegraph(5, 0), eventorder = constrain_events)

## End(Not run)
## Not run: 
# Test whether the split between A and B is earlier than the split between C and D,
#   and whether the split between C and D is earlier than the terminal branch leading to E
constrain_events = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2,
  'A', 'B', 'C', 'D',
  'C', 'D', 'E', NA)
satisfies_eventorder(random_admixturegraph(5, 0), eventorder = constrain_events)

## End(Not run)

Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with the f4-statistics listed in nonzero_f4 being non-zero

Usage

satisfies_nonzerof4(graph, nonzero_f4)
satisfies_nonzerof4(graph, nonzero_f4)

Arguments

`graph`	An admixture graph
`nonzero_f4`	A data frame or matrix with four columns. One row for each f4-statistic which is expected to be non-zero

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether f4(A,B; C,D) is expected to be non-zero in this graph:
nonzero_f4 = matrix(c('A', 'B', 'C', 'D'), 1)
satisfies_nonzerof4(random_admixturegraph(5, 2), nonzero_f4)

## End(Not run)
## Not run: 
# Test whether f4(A,B; C,D) is expected to be non-zero in this graph:
nonzero_f4 = matrix(c('A', 'B', 'C', 'D'), 1)
satisfies_nonzerof4(random_admixturegraph(5, 2), nonzero_f4)

## End(Not run)

Test admixture constraints on a graph

Description

This function returns TRUE if and only if the admixture graph satisfies all constraints on the number of admixture events for the populations in admix_constraints

Usage

satisfies_numadmix(graph, admix_constraints)
satisfies_numadmix(graph, admix_constraints)

Arguments

`graph`	An admixture graph
`admix_constraints`	A data frame with columns `pop`, `min`, `max`

Value

TRUE if all admixture constraints are satisfied, else FALSE

Examples

## Not run: 
# At least one admixture event for C, and none for D:
constrain_cd = tribble(
  ~pop, ~min, ~max,
  'C', 1, NA,
  'D', NA, 0)
satisfies_numadmix(random_admixturegraph(5, 2), constrain_cd)

## End(Not run)
## Not run: 
# At least one admixture event for C, and none for D:
constrain_cd = tribble(
  ~pop, ~min, ~max,
  'C', 1, NA,
  'D', NA, 0)
satisfies_numadmix(random_admixturegraph(5, 2), constrain_cd)

## End(Not run)

Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with the f4-statistics listed in nonzero_f4 being non-zero

Usage

satisfies_zerof4(graph, zero_f4)
satisfies_zerof4(graph, zero_f4)

Arguments

`graph`	An admixture graph
`zero_f4`	A data frame or matrix with four columns. One row for each f4-statistic which is expected to be zero

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether Chimp and Altai are a clade with respect to all populations X and Y:
# (whether f4("Chimp", "Altai"; X, Y) is 0 for all pairs of X and Y)
zero_f4 = expand_grid(pop1 = "Chimp", pop2 = "Altai", pop3 = X, pop4 = Y)
satisfies_zerof4(random_admixturegraph(5, 2), zero_f4)

## End(Not run)
## Not run: 
# Test whether Chimp and Altai are a clade with respect to all populations X and Y:
# (whether f4("Chimp", "Altai"; X, Y) is 0 for all pairs of X and Y)
zero_f4 = expand_grid(pop1 = "Chimp", pop2 = "Altai", pop3 = X, pop4 = Y)
satisfies_zerof4(random_admixturegraph(5, 2), zero_f4)

## End(Not run)

Return shortest unique prefixes

Description

Return shortest unique prefixes

Usage

shortest_unique_prefixes(strings, min_length = 1)
shortest_unique_prefixes(strings, min_length = 1)

Arguments

`strings`	A character vector
`min_length`	Minimum length of prefixes

Value

Character vector with shortest unique prefixes

Examples

strings = c("Abra", "Abracadabra", "Simsalabim")
shortest_unique_prefixes(strings)
strings = c("Abra", "Abracadabra", "Simsalabim")
shortest_unique_prefixes(strings)

Remove redundant edges

Description

Nodes with in and out degree of one will be removed.

Usage

simplify_graph(graph)
simplify_graph(graph)

Arguments

graph

an igraph object

Examples

simple = simplify_graph(example_igraph)
plot_graph(example_igraph)
plot_graph(simple)
simple = simplify_graph(example_igraph)
plot_graph(example_igraph)
plot_graph(simple)

Split a matrix into blocks

Description

This function splits a large matrix into smaller blocks with cols_per_chunk columns per block, and saves them as .rds files with prefix prefix

Usage

split_mat(mat, cols_per_chunk, prefix, overwrite = TRUE, verbose = TRUE)
split_mat(mat, cols_per_chunk, prefix, overwrite = TRUE, verbose = TRUE)

Arguments

`mat`	The matrix to be split
`cols_per_chunk`	Number of columns per block
`prefix`	Prefix of output files
`overwrite`	Overwrite existing files (default `TRUE`)
`verbose`	Print progress updates

Examples

## Not run: 
afdat = packedancestrymap_to_afs('path/to/packedancestrymap_prefix', allpopulations)
split_mat(afdat$afs, cols_per_chunk = 20, prefix = 'afdat_split_v42.1/afs')

## End(Not run)
## Not run: 
afdat = packedancestrymap_to_afs('path/to/packedancestrymap_prefix', allpopulations)
split_mat(afdat$afs, cols_per_chunk = 20, prefix = 'afdat_split_v42.1/afs')

## End(Not run)

Split nodes with more than two edges

Description

Split nodes with more than two edges

Usage

split_multifurcations(graph)
split_multifurcations(graph)

Arguments

graph

An admixture graph

Value

A new admixture graph in which no node has more than two incoming or outgoing edges

Modify a graph by regrafting a subcomponent

Description

Modify a graph by regrafting a subcomponent

Usage

spr_all(graph, fix_outgroup = TRUE)
spr_all(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

Value

A new admixture graph

Modify a graph by regrafting a leaf

Description

Modify a graph by regrafting a leaf

Usage

spr_leaves(graph, fix_outgroup = TRUE)
spr_leaves(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

Value

A new admixture graph

List leaf nodes for all internal nodes

Description

List leaf nodes for all internal nodes

Usage

summarize_descendants(graph)
summarize_descendants(graph)

Arguments

graph

An admixture graphs

Value

A data frame with columns from, to, id

Examples

## Not run: 
summarize_descendants(graph)

## End(Not run)
## Not run: 
summarize_descendants(graph)

## End(Not run)

List leaf nodes for all internal nodes in a list of graphs

Description

List leaf nodes for all internal nodes in a list of graphs

Usage

summarize_descendants_list(graphlist, rename = FALSE)
summarize_descendants_list(graphlist, rename = FALSE)

Arguments

`graphlist`	A list of admixture graphs
`rename`	If `FALSE` (default), the output will be a data frame indicating how often each node in each graph is observed in all other graphs. If `TRUE`, the output will be a list, where the inner nodes will be renamed to have these percentages as part of their name. `plot_graph` will print the percentages of graphs renamed in this way.

Value

A data frame with columns graph, from, n, frac

Examples

## Not run: 
summarize_descendants_list(graphlist)

## End(Not run)
## Not run: 
summarize_descendants_list(graphlist)

## End(Not run)

List population split events in a graph

Description

List population split events in a graph

Usage

summarize_eventorder(graph, unique_only = TRUE)
summarize_eventorder(graph, unique_only = TRUE)

Arguments

graph

An admixture graph

Value

A data frame with columns earlier1, earlier2, later1, later2

Examples

## Not run: 
summarize_eventorder(example_igraph)

## End(Not run)
## Not run: 
summarize_eventorder(example_igraph)

## End(Not run)

List population split events in a list of graphs

Description

List population split events in a list of graphs

Usage

summarize_eventorder_list(graphlist)
summarize_eventorder_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns earlier1, earlier2, later1, later2, n, frac

Examples

## Not run: 
summarize_eventorder(graphlist)

## End(Not run)
## Not run: 
summarize_eventorder(graphlist)

## End(Not run)

Summarize graph fits

Description

This function takes the output of qpgraph_resample_snps and creates a data frame with summaries of the estimated graph parameters. weight has the mean of all fits, while low and high have lower and upper quantiles.

Usage

summarize_fits(fits, q_low = 0.05, q_high = 0.95)
summarize_fits(fits, q_low = 0.05, q_high = 0.95)

Arguments

fits

A nested data frame where each line is the output of qpgraph()

Value

A data frame summarizing the fits

List number of admixture events for each population

Description

List number of admixture events for each population

Usage

summarize_numadmix(graph)
summarize_numadmix(graph)

Arguments

graph

An admixture graph

Value

A data frame with columns pop and nadmix

Examples

## Not run: 
summarize_numadmix(example_igraph)

## End(Not run)
## Not run: 
summarize_numadmix(example_igraph)

## End(Not run)

List number of admixture events for each population in a list of graphs

Description

List number of admixture events for each population in a list of graphs

Usage

summarize_numadmix_list(graphlist)
summarize_numadmix_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop, mean, mean_nonzero, min, max

Examples

## Not run: 
summarize_numadmix_list(graphlist)

## End(Not run)
## Not run: 
summarize_numadmix_list(graphlist)

## End(Not run)

Assign proxy populations to admixed populations

Description

Assign proxy populations to admixed populations

Usage

summarize_proxies(graph)
summarize_proxies(graph)

Arguments

graph

An admixture graph in igraph format, or as edge list with column weight

Value

A data frame with columns pop, nadmix, proxy, nproxies (and weight)

Examples

## Not run: 
summarize_proxies(example_igraph)

## End(Not run)
## Not run: 
summarize_proxies(example_igraph)

## End(Not run)

List proxy populations in graphlist

Description

List proxy populations in graphlist

Usage

summarize_proxies_list(graphlist)
summarize_proxies_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop, proxy, nadmix, nproxies, n, frac

Examples

## Not run: 
summarize_proxies_list(graphlist)

## End(Not run)
## Not run: 
summarize_proxies_list(graphlist)

## End(Not run)

Summarize triples across graphs

Description

This function summarizes topologies of population triples across graphs.

Usage

summarize_triples(graphs)
summarize_triples(graphs)

Arguments

graphs

A list of graphs

Value

A data frame with one line for each population triple and columns summarizing the relationship of each triple across graphs.

clade12: Fraction of graphs in which pop1 and pop2 form a clade
x12: Fraction of graphs in which pop1 admixes into pop2 at the exclusion of pop3
toptopo: A binary string representation of the most common topology. Digits represent x12, x21, x13, x31, x23, x32
toptopocnt: The number of graphs in which toptopo was observed
topos: The counts of all topologies

List clades in a graph

Description

List clades in a graph

Usage

summarize_zerof4(graph, eps = 1e-07)
summarize_zerof4(graph, eps = 1e-07)

Arguments

`graph`	An admixture graph
`eps`	Used to determine what counts as zero

Value

A data frame with all (non-redundant) zero f4-statistics

Examples

## Not run: 
summarize_zerof4(example_igraph)

## End(Not run)
## Not run: 
summarize_zerof4(example_igraph)

## End(Not run)

List clades in a list of graphs

Description

List clades in a list of graphs

Usage

summarize_zerof4_list(graphlist)
summarize_zerof4_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop1, pop2, pop3, pop4, n, frac

Examples

## Not run: 
summarize_zerof4_list(graphlist)

## End(Not run)
## Not run: 
summarize_zerof4_list(graphlist)

## End(Not run)

Modify a graph by swapping two leaf nodes

Description

Modify a graph by swapping two leaf nodes

Usage

swap_leaves(graph, fix_outgroup = TRUE)
swap_leaves(graph, fix_outgroup = TRUE)

Arguments

`graph`	An admixture graph
`fix_outgroup`	Keep outgroup in place

Value

A new admixture graph

Test if two sets of populations form two clades

Description

A thin wrapper around qpadm_p with rnk set to zero

Usage

test_cladality(f2_data, left, right, fudge = 1e-04, boot = FALSE, cpp = TRUE)
test_cladality(f2_data, left, right, fudge = 1e-04, boot = FALSE, cpp = TRUE)

Arguments

`left`	Left populations (sources)
`right`	Right populations (outgroups)
`fudge`	Value added to diagonal matrix elements before inverting
`boot`	If `FALSE` (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If `boot` is an integer, that number will specify the number of bootstrap resamplings. If `boot = TRUE`, the number of bootstrap resamplings will be equal to the number of SNP blocks.
`cpp`	Use C++ functions. Setting this to `FALSE` will be slower but can help with debugging.

Test if a tree is part of a graph

Description

This function tests whether a tree is part of a graph. This is useful for testing whether a Y-chromosome tree is consistent with an autosomal admixture graph. Leaf node names matter, but internal node names are ignored.

Usage

tree_in_graph(tree, graph)
tree_in_graph(tree, graph)

Arguments

`tree`	An admixture graph without admixture event
`graph`	An admixture graph

Value

TRUE if all admixture constraints are satisfied, else FALSE ## Not run: tree = graph_splittrees(example_igraph) tree_in_graph(tree, example_igraph) ## End(Not run)

Find all trees within SPR distance of 1

Description

Returns all trees which can be reached through one iteration of subtree-prune-and-regraft

Usage

tree_neighbors(tree)
tree_neighbors(tree)

Arguments

tree

A tree in igraph format

Value

A data frame with all trees

Find all unidentifiable edges

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

unidentifiable_edges(graph)
unidentifiable_edges(graph)

Arguments

graph

An admixture graph

Value

A data frame with all unidentifiable graph parameters

Convert graph to dot format

Description

Convert graph to dot format

Usage

write_dot(
  graph,
  outfile = stdout(),
  size1 = 7.5,
  size2 = 10,
  title = "",
  dot2pdf = FALSE
)
write_dot(
  graph,
  outfile = stdout(),
  size1 = 7.5,
  size2 = 10,
  title = "",
  dot2pdf = FALSE
)

Arguments

`graph`	Graph as igraph object or edge list (columns labelled 'from', 'to', 'weight')
`outfile`	Output file name

Examples

## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
write_dot(results$edges)

## End(Not run)
## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
write_dot(results$edges)

## End(Not run)

Write blocked f2 estimates to disk

Description

This function takes 3d arrays of blocked f2, allele frequency products, and counts, splits them by population pair, and writes each pair to a separate .rds file under {outdir}/{pop1}/{pop2}.rds.

Usage

write_f2(est_arr, count_arr, outdir, id = "f2", overwrite = FALSE)
write_f2(est_arr, count_arr, outdir, id = "f2", overwrite = FALSE)

Arguments

`est_arr`	A 3d array with blocked f2, allele frequency products, or fst estimates for each population pair. The first two dimensions of each array have to have population names.
`count_arr`	A 3d array of the same dimension with counts
`outdir`	Directory where data will be stored
`id`	Postfix showing the type of statistic ("f2", "ap", or "fst")
`overwrite`	Overwrite existing files in `outdir`

Examples

## Not run: 
write_f2(f2_arr, count_arr, outdir = 'path/to/f2stats/')

## End(Not run)
## Not run: 
write_f2(f2_arr, count_arr, outdir = 'path/to/f2stats/')

## End(Not run)

Package 'admixtools'

Help Index

Tools for inferring demographic history from genetic data

Description

Author(s)

References

See Also

Compute count blocks and write them to disk

Description

Usage

Arguments

See Also

Compute f2 blocks and write them to disk

Description

Usage

Arguments

See Also

Examples

Compute all pairwise f2 statistics

Description

Usage

Arguments

Details

Examples

Convert agraph to igraph

Description

Usage

Arguments

Value

Examples

Generate a list of bootstrap resampled arrays

Description

Usage

Arguments

Value

See Also

Compare the fit of two qpgraph models

Description

Usage

Arguments

Value

See Also

Examples

Compare the fit of two qpgraph models

Description

Usage

Arguments

Examples

Compare the fit of two qpgraph models

Description

Usage

Arguments

Examples

Count SNPs in an f2-statistics array

Description

Usage

Arguments

Value

See Also

Count zero-length edges

Description

Usage

Arguments

Value

Examples

Find all trees within SPR distance of 1 of all graph component trees

Description

Usage

Arguments

Value

Delete an admixture edge

Description

Usage

Arguments

Value

See Also

Delete groups

Description

Usage

Arguments