Package 'admixtools'

Title: Inferring demographic history from genetic data
Description: admixtools is a set of methods used to infer demographic history from genetic data using f-statistics. It was originally implemented as a command line program in C. This is a new implementation of the programs qpDstat, qpAdm, qpGraph, and contains several new features. It is efficient because it precomputes f2-statistics which can be re-used for many analyses.
Authors: Robert Maier [aut, cre], Nick Patterson [aut]
Maintainer: Robert Maier <[email protected]>
License: GPL-3
Version: 2.0.8
Built: 2024-10-20 17:16:58 UTC
Source: https://github.com/uqrmaie1/admixtools

Help Index


Tools for inferring demographic history from genetic data

Description

admixtools is a collection of several programs for calculating f-statistics and using them to make inferences about demographic history from genetic data Aside from the core functions qp3pop, qpdstat, qpadm, and qpgraph, there are also wrapper functions and parsing functions around the original ADMIXTOOLS software, as well as functions to read genotype files and precompute the data necessary to quickly compute f statistics.

Author(s)

Robert Maier <[email protected]>

Nick Patterson

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

See Also

https://uqrmaie1.github.io/admixtools/index.html


Compute count blocks and write them to disk

Description

This is intended for computing allele count data for a large number of individuals, too many to do everything in working memory. It assumes that the allele frequencies have already been computed and are stored in .rds files, split into consecutive blocks for a set of individuals.

Usage

afs_to_counts(
  genodir,
  outdir,
  chunk1,
  chunk2,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

genodir

Directory with genotype files split in chunks created by extract_afs (with each individual its own population).

outdir

Directory where allele count data will be stored

chunk1

Index of the first chunk of individuals

chunk2

Index of the second chunk of individuals

overwrite

Overwrite existing files (default FALSE)

verbose

Print progress updates

See Also

extract_counts Does the same thing in one step for smaller data.


Compute f2 blocks and write them to disk

Description

This is intended for computing f2-statistics for a large number of populations, too many to do everything in working memory. It assumes that the allele frequencies have already been computed and are stored in .rds files, split into consecutive blocks for a set of populations. This function calls write_f2, which takes a (sub-)chunk of pairwise f2-statistics, and writes one pair at a time to disk.

Usage

afs_to_f2(
  afdir,
  outdir,
  chunk1,
  chunk2,
  blgsize = 0.05,
  snpwt = NULL,
  overwrite = FALSE,
  type = "f2",
  poly_only = FALSE,
  snpdat = NULL,
  apply_corr = TRUE,
  verbose = TRUE
)

Arguments

afdir

Directory with allele frequency and counts .rds files created by split_mat

outdir

Directory where data will be stored

chunk1

Index of the first chunk of populations

chunk2

Index of the second chunk of populations

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

snpwt

A vector of scaling factors applied to the f2-statistics for each SNP. The length has to match the number of SNPs.

overwrite

Overwrite existing files (default FALSE)

verbose

Print progress updates

See Also

extract_f2 Does the same thing in one step for smaller data.

Examples

## Not run: 
afdir = 'tmp_af_dir/'
f2dir = 'f2_dir'
extract_afs('path/to/packedancestrymap_prefix', afdir)
numchunks = length(list.files(afdir, 'afs.+rds'))
# numchunks should be the number of split allele frequency files
for(i in 1:numchunks) {
  for(j in i:numchunks) {
    afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = j)
  }
}

## End(Not run)
# Alternatively, the following code will do the same, while submitting each chunk as a separate job.
# (if \code{\link[future]{plan}} has been set up appropriately)
## Not run: 
furrr::future_map(1:numchunks, ~{i=.; map(i:numchunks, ~{
  afs_to_f2(afdir, f2dir, chunk1 = i, chunk2 = .)
  })})
  
## End(Not run)

Compute all pairwise f2 statistics

Description

This function takes a allele frequency data and computes blocked f2 statistics for all population pairs, which are written to outdir. f2f2 for each SNP is computed as (p1p2)2p1(1p1)/(n11)p2(1p2)/(n21)(p1 - p2)^2 - p1 (1 - p1)/(n1 - 1) - p2 (1 - p2)/(n2 - 1), where p1p1 and p2p2 are allele frequencies in populations 11 and 22, and n1n1 and n2n2 is the number of non-missing haplotypes in populations 11 and 22. See details

Usage

afs_to_f2_blocks(
  afdat,
  maxmem = 8000,
  blgsize = 0.05,
  pops1 = NULL,
  pops2 = NULL,
  outpop = NULL,
  outdir = NULL,
  overwrite = FALSE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = apply_corr,
  n_cores = 1,
  verbose = TRUE
)

Arguments

afdat

A list with three items with the same SNP in each row (generated by packedancestrymap_to_afs or plink_to_afs)

  • afs A matrix of allele frequencies for all populations (columns) and SNPs (rows)

  • counts A matrix of allele counts for all populations (columns) and SNPs (rows)

  • snpdat A data frame with SNP metadata

maxmem

split up allele frequency data into blocks, if memory requirements exceed maxmem MB.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

outpop

If specified, f2-statistics will be weighted by heterozygosity in this population

outdir

Directory into which to write f2 data (if NULL, data is returned instead)

overwrite

Should existing files be overwritten? Only relevant if outdir is not NULL

poly_only

Exclude sites with identical allele frequencies in all populations. Can be different for f2-statistics, allele frequency products, and fst. Should be a character vector of length three, with some subset of c("f2", "ap", "fst")

verbose

Print progress updates

pop1

pops1 and pops2 can be specified if only a subset of pairs should be computed.

pop2

pops1 and pops2 can be specified if only a subset of pairs should be computed.

Details

For each population pair, each of the i=1,,ni = 1, \ldots, n resutling values (nn is around 700 in practice) is the mean f2f2 estimate across all SNPs except the ones in block ii.

p1(1p1)/(2n11)p2(1p2)/(2n21)- p1 (1 - p1)/(2 n1 - 1) - p2 (1 - p2)/(2 n2 - 1) is a correction term which makes the estimates unbiased at low sample sizes.

Examples

## Not run: 
afdat = plink_to_afs('/my/geno/prefix')
afs_to_f2_blocks(afdat, outdir = '/my/f2/data/')

## End(Not run)

Convert agraph to igraph

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

agraph_to_igraph(agraph)

Arguments

agraph

An admixture graph in agraph format

Value

An admixture graph in igraph format

Examples

## Not run: 
agraph_to_igraph(agraph)

## End(Not run)

Generate a list of bootstrap resampled arrays

Description

Generate a list of bootstrap resampled arrays

Usage

boo_list(arr, nboot = dim(arr)[3])

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension

nboot

Number of bootstrap iterations

Value

A list of bootstrap resampled arrays, with 3rd dimension equal to nboot

See Also

loo_list est_to_boo qpgraph_resample_snps2


Compare the fit of two qpgraph models

Description

Takes the bootstrap score distribution of two fits on the same populations and tests whether the scores of one graph are significantly higher or lower than the scores of the other graph.

Usage

compare_fits(scores1, scores2)

Arguments

scores1

Scores for the first graph. Each score should be for same model evaluated on different bootstrap samples of the SNP blocks. See qpgraph_resample_multi

scores2

Scores for the second graph, evaluated on the same bootstrap samples as the first graph

Value

A list with statistics comparing the two models

  • p_emp: The two-sided bootstrap p-value testing whether the scores of one model are higher or lower than the scores of the other model. It is two times the fraction of bootstrap replicates in which model 1 has a lower score than model 2 (or vice-vera, whichever is lower). This is simply a bootstrap test for testing whether some quantity (the the score difference of two models in this case) is significantly different from zero.

  • p_emp_nocorr: p_emp is truncated to be no less than the reciprocal of the number of bootstrap iterations (the length of the score vectors). p_emp_nocorr is not truncated and can be equal to 0.

  • ci_low: The 2.5% quantile of distribution of score differences

  • ci_high: The 97.5% quantile of distribution of score differences

  • The other items in this list are less important and can be ignored. In particular, p is not as reliable as p_emp because it assumes that the score differences follow a normal distribution.

See Also

qpgraph_resample_multi

Examples

## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score, fits[[2]]$score)

## End(Not run)

Compare the fit of two qpgraph models

Description

Takes two data frames with model fits computed on two graphs for on the same populations and tests whether the scores of one graph are significantly better than the scores of the other

Usage

compare_fits2(fits1, fits2, boot = FALSE)

Arguments

fits1

The fits of the first graph

fits2

The fits of the second graph

boot

should match the boot parameter in qpgraph_resample_snps (FALSE by default)

Examples

## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fits1 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph1,
                              f2_blocks_test = example_f2_blocks[,,-train])
fits2 = qpgraph_resample_snps(example_f2_blocks[,,train], graph = graph2,
                              f2_blocks_test = example_f2_blocks[,,-train])
compare_fits2(fit1, fit2)

## End(Not run)

Compare the fit of two qpgraph models

Description

Takes two data frames with model fits computed on two graphs for on the same populations and tests whether the scores of one graph are significantly better than the scores of the other.

Usage

compare_fits4(fit1, fit2, f2_blocks, f2_blocks_test, boot = FALSE, seed = NULL)

Arguments

fit1

The fit of the first graph

fit2

The fit of the second graph

f2_blocks

f2 blocks used for fitting fit1 and fit2. Used in combination with f2_blocks_test to compute f-statistics covariance matrix.

f2_blocks_test

f2 blocks which were not used for fitting fit1 and fit2

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks. If bootstrap resampling is enabled, empirical p-values (p_emp) and 95 confidence intervals (ci_low and ci_high) will be reported.

seed

Random seed used if boot is TRUE. Does not need to match a seed used in fitting the models

Examples

## Not run: 
nblocks = dim(example_f2_blocks)[3]
train = sample(1:nblocks, round(nblocks/2))
fit1 = qpgraph(example_f2_blocks[,,train], graph1)
fit2 = qpgraph(example_f2_blocks[,,train], graph2)
compare_fits4(fit1, fit2, example_f2_blocks[,,train], example_f2_blocks[,,-train])

## End(Not run)

Count SNPs in an f2-statistics array

Description

This function adds up all block lengths (number of SNPs in each SNP blocks), which are stored in the names along the third dimension of the array. When the f2-statistics were computed while setting maxmiss to values greater than 0, it is possible that not all f2-statistics are based on the same number of SNPs. In that case, the value returned by this function is merely an upper bound on the number of SNPs used.

Usage

count_snps(f2_blocks)

Arguments

f2_blocks

A 3d array of per block f2-statistics

Value

The total number of SNPs across all blocks

See Also

f2_from_geno, f2_from_precomp


Count zero-length edges

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

count_zero_edges(edges, epsilon = 1e-06)

Arguments

edges

Edges data frame from fitted admixture graph

epsilon

Every edge with length

Value

The number of edges with length < epsilon

Examples

## Not run: 
fit = qpgraph(example_f2_blocks, example_igraph)
count_zero_edges(fit$edges)

## End(Not run)

Find all trees within SPR distance of 1 of all graph component trees

Description

Returns all trees which can be reached through one iteration of subtree-prune-and-regraft on any graph component tree

Usage

decomposed_tree_neighbors(graph)

Arguments

graph

An admixture graph

Value

A data frame with all trees


Delete an admixture edge

Description

Delete an admixture edge

Usage

delete_admix(graph, from = NULL, to = NULL)

Arguments

graph

An admixture graph

from

Edge source node

to

Edge target node

Value

Admixture graph with one deleted edge

See Also

insert_admix


Delete groups

Description

This function deletes data for groups created by group_samples

Usage

delete_groups(dir, groups = NULL, verbose = TRUE)

Arguments

dir

Directory with precomputed individual pair data

groups

Groups to delete. Defaults to all groups

verbose

Print progress updates

Value

Invisibly returns sample IDs in deleted groups as character vector

See Also

group_samples

Examples

## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)

Remove population from graph

Description

Remove population from graph

Usage

delete_leaf(graph, leaf)

Arguments

graph

An admixture graph

leaf

Population to be removed

Value

Admixture graph with removed population

See Also

insert_leaf


Add two nodes before each admixture node

Description

This is used to revert simplify_graph.

Usage

desimplify_graph(graph)

Arguments

graph

An admixture graph

Examples

simple = simplify_graph(example_igraph)
desimple = desimplify_graph(simple)
plot_graph(simple)
plot_graph(desimple)

Convert data frame graph to igraph

Description

Convert data frame graph to igraph

Usage

edges_to_igraph(edges)

Arguments

edges

An admixture graph as an edge list data frame

Value

An igraph object


Read allele frequencies from EIGENSTRAT files

Description

Read allele frequencies from EIGENSTRAT files

Usage

eigenstrat_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  numparts = 100,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

pref

Prefix of EIGENSTRAT files (files have to end in .geno, .ind, .snp)

inds

Individuals from which to compute allele frequencies

pops

Populations from which to compute allele frequencies. If NULL (default), populations will be extracted from the third column in the .ind file. If population labels are provided, they should have the same length as inds, and will be matched to them by position

numparts

Number of parts into which the genotype file is split. Lowering this number can speed things up, but will take more memory.

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

verbose

Print progress updates

Value

A list with three data frames: allele frequency data, allele counts, and SNP metadata

Examples

## Not run: 
afdat = eigenstrat_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)

Turn per-block estimates into bootstrap estimates

Description

This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

est_to_boo(arr, nboot = dim(arr)[3], block_lengths = NULL)

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension.

nboot

Number of bootstrap iterations

Value

A 3d array with bootstrap estimates. The first two dimensions are equal to those of arr. The 3rd dimension is equal to nboot.

See Also

est_to_loo


Turn per-block estimates into leave-one-out estimates

Description

This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

est_to_loo(arr, block_lengths = NULL)

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension

block_lengths

Optional block lengths. If NULL, will be parsed from 3rd dimnames in blocks

Value

A 3d array with leave-one-out estimates for jackknife. Dimensions are equal to those of arr.

See Also

loo_to_est est_to_boo


Data frame with sample annotations

Description

Data frame with sample annotations

Usage

example_anno

Format

A data frame with sample annotations


Blocked f2-statistics for 7 populations

Description

Blocked f2-statistics for 7 populations

Usage

example_f2_blocks

Format

A 3d array with populations along the first two dimensions, and SNP blocks along the 3rd dimension


Admixture graph for 7 populations

Description

Admixture graph for 7 populations

Usage

example_graph

Format

A two column matrix where each row represents one edge


Admixture graph for 7 populations

Description

Admixture graph for 7 populations

Usage

example_igraph

Format

An igraph object


Data frame with one fitted admixture graph

Description

Data frame with one fitted admixture graph

Usage

example_opt

Format

A data frame with a fitted admixture graph, generated by find_graphs_old


example_graph fitted using qpGraph

Description

example_graph fitted using qpGraph

Usage

example_qpgraph_ref_results

Format

A list with parsed qpGraph output


Data frame with population triples

Description

Data frame with population triples

Usage

example_triples

Format

A data frame with population triples, generated by summarize_triples


Compute and store blocked allele frequency data

Description

Prepare data for various ADMIXTOOLS 2 functions. Reads data from packedancestrymap or PLINK files, and computes allele frequencies for selected populations and stores it as .rds files in outdir.

Usage

extract_afs(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  cols_per_chunk = 10,
  numparts = 100,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  auto_only = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

outdir

Directory where data will be stored.

inds

Individuals for which data should be extracted

pops

Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each indivdual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.

cols_per_chunk

Number of populations per chunk. Lowering this number will lower the memory requirements when running afs_to_f2, but more chunk pairs will have to be computed.

numparts

Number of parts in which genotype data will be read for computing allele frequencies

maxmiss

Discard SNPs which are missing in a fraction of populations higher than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

minac2

Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).

outpop

Keep only SNPs which are heterozygous in this population

auto_only

Keep only SNPs on chromosomes 1 to 22

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

poly_only

Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

verbose

Print progress updates

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)

Compute and store blocked allele frequency data

Description

Prepare data for various ADMIXTOOLS 2 functions. Reads data from packedancestrymap or PLINK files, and computes allele frequencies for selected populations and stores it as .rds files in outdir.

Usage

extract_afs_simple(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  format = NULL,
  poly_only = FALSE,
  adjust_pseudohaploid = TRUE,
  verbose = TRUE
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

outdir

Directory where data will be stored.

inds

Individuals for which data should be extracted

pops

Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each indivdual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

cols_per_chunk

Number of populations per chunk. Lowering this number will lower the memory requirements when running afs_to_f2, but more chunk pairs will have to be computed.

maxmiss

Discard SNPs which are missing in a fraction of populations higher than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

minac2

Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).

outpop

Keep only SNPs which are heterozygous in this population

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

poly_only

Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

verbose

Print progress updates

Value

SNP metadata (invisibly)

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_afs(pref, outdir)

## End(Not run)

Extract and store data needed to compute blocked f2

Description

Prepare data for various ADMIXTOOLS 2 functions. This function reads data from genotype files, and extracts data required to compute blocked f-statistics for any sets of samples. The data consists of .rds files with total and alternative allele counts for each individual, and products of total and alternative allele counts for each pair. The function calls packedancestrymap_to_afs or plink_to_afs and afs_to_f2_blocks.

Usage

extract_counts(
  pref,
  outdir,
  inds = NULL,
  blgsize = 0.05,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  maxmem = 8000,
  overwrite = FALSE,
  format = NULL,
  cols_per_chunk = NULL,
  verbose = TRUE
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

outdir

Directory where data will be stored.

inds

Individuals for which data should be read. Defaults to all individuals

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

maxmiss

Discard SNPs which are missing in a fraction of individuals greater than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

auto_only

Keep only SNPs on chromosomes 1 to 22

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

maxmem

Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.

overwrite

Overwrite existing files in outdir

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

cols_per_chunk

Number of genotype chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for cols_per_chunk in extract_afs is 10. Lower numbers will lower the memory requirement but increase the time it takes.

verbose

Print progress updates


Compute and store blocked f2 statistics

Description

This function prepares data for various other ADMIXTOOLS 2 functions. It reads data from genotype files, computes allele frequencies and blocked f2-statistics for selected populations, and writes the results to outdir.

Usage

extract_f2(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  cols_per_chunk = NULL,
  fst = TRUE,
  afprod = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  qpfstats = FALSE,
  n_cores = 1,
  verbose = TRUE,
  ...
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

outdir

Directory where data will be stored.

inds

Individuals for which data should be extracted

pops

Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each indivdual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

maxmem

Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.

maxmiss

Discard SNPs which are missing in a fraction of populations higher than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

minac2

Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).

pops2

If specified, only a pairs between pops and pops2 will be computed

outpop

Keep only SNPs which are heterozygous in this population

outpop_scale

Scale f2-statistics by the inverse outpop heteroygosity (1/(p*(1-p))). Providing outpop and setting outpop_scale to TRUE will give the same results as the original qpGraph when the outpop parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

auto_only

Keep only SNPs on chromosomes 1 to 22

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

overwrite

Overwrite existing files in outdir

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

cols_per_chunk

Number of allele frequency chunks to store on disk. Setting this to a positive integer makes the function slower, but requires less memory. The default value for cols_per_chunk in extract_afs is 10. Lower numbers will lower the memory requirement but increase the time it takes.

fst

Write files with pairwise FST for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.

afprod

Write files with allele frequency products for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.

poly_only

Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).

apply_corr

Apply small-sample-size correction when computing f2-statistics (default TRUE)

qpfstats

Compute smoothed f2-statistics (default FALSE). In the presence of large amounts of missing data, this option can be used to retain information from all SNPs while introducing less bias than setting maxmiss to values greater than 0. When setting qpfstats = TRUE, most other options to extract_f2 will be ignored. See qpfstats for more information. Arguments to qpfstats can be passed via ...

n_cores

Parallelize computation across n_cores cores via the doParallel package.

verbose

Print progress updates

...

Pass arguments to qpfstats

Value

SNP metadata (invisibly)

See Also

f2_from_precomp for reading the stored f2-statistics back into R, f2_from_geno to skip writting f2-statistics to disk and return them directly

Examples

## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)

Compute and store blocked f2 statistics

Description

extract_f2_large does the same as extract_f2, but it requires less memory and is slower. outdir has to be set in extract_f2_large.

Usage

extract_f2_large(
  pref,
  outdir,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  cols_per_chunk = 10,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  minac2 = FALSE,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  keepsnps = NULL,
  snpblocks = NULL,
  overwrite = FALSE,
  format = NULL,
  adjust_pseudohaploid = TRUE,
  afprod = TRUE,
  fst = TRUE,
  poly_only = c("f2"),
  apply_corr = TRUE,
  verbose = TRUE
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

outdir

Directory where data will be stored.

inds

Individuals for which data should be extracted

pops

Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each indivdual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

cols_per_chunk

Number of populations per chunk. Lowering this number will lower the memory requirements when running

maxmiss

Discard SNPs which are missing in a fraction of populations higher than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

minac2

Discard SNPs with allele count lower than 2 in any population (default FALSE). This option should be set to TRUE when computing f3-statistics where one population consists mostly of pseudohaploid samples. Otherwise heterozygosity estimates and thus f3-estimates can be biased. minac2 == 2 will discard SNPs with allele count lower than 2 in any non-singleton population (this option is experimental and is based on the hypothesis that using SNPs with allele count lower than 2 only leads to biases in non-singleton populations). While the minac2 option discards SNPs with allele count lower than 2 in any population, the qp3pop function will only discard SNPs with allele count lower than 2 in the first (target) population (when the first argument is the prefix of a genotype file).

outpop

Keep only SNPs which are heterozygous in this population

outpop_scale

Scale f2-statistics by the inverse outpop heteroygosity (1/(p*(1-p))). Providing outpop and setting outpop_scale to TRUE will give the same results as the original qpGraph when the outpop parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

overwrite

Overwrite existing files in outdir

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

afprod

Write files with allele frequency products for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.

fst

Write files with pairwise FST for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.

poly_only

Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).

apply_corr

Apply small-sample-size correction when computing f2-statistics (default TRUE)

verbose

Print progress updates

Details

extract_f2_large requires less memory because it writes allele frequency data to disk, and doesn't store the allele frequency matrix for all populations and SNPs in memory. If you still run out of memory, reduce cols_per_chunk. This function is a wrapper around extract_afs and afs_to_f2, and is slower than extract_f2. It may be faster to call extract_afs and afs_to_f2 directly, parallelizing over the different calls to afs_to_f2.

Value

SNP metadata (invisibly)

See Also

extract_f2

Examples

## Not run: 
pref = 'my/genofiles/prefix'
f2dir = 'my/f2dir/'
extract_f2_large(pref, f2dir, pops = c('popA', 'popB', 'popC'))

## End(Not run)

Copy f2-statistics

Description

Copy a subset of f2-statistics to a new directory

Usage

extract_f2_subset(from, to, pops, verbose = TRUE)

Arguments

from

Directory with f2-statistics

to

Target directory

pops

Populations to copy

verbose

Print progress updates

Examples

## Not run: 
pref = 'my/genofiles/prefix'
outdir = 'dir/for/afdata/'
extract_f2_subset(pref, outdir)

## End(Not run)

Extract samples from PLINK files

Description

This function reads PLINK files, extracts a subset of samples, and writes new PLINK files using write_plink. It's probably slower than running the equivalent command in PLINK directly, but it can be useful to do this from within R. When inds or pops is provided, only a subset of samples will be extracted.

Usage

extract_samples(
  inpref,
  outpref,
  inds = NULL,
  pops = NULL,
  overwrite = FALSE,
  verbose = TRUE
)

Arguments

inpref

Prefix of the PLINK input files

outpref

Prefix of the PLINK output files

inds

Individuals which should be extracted

pops

Populations which should be extracted. Can not be provided together with inds

overwrite

Set this to TRUE if inpref == outpref and you really want to overwrite the input files.

verbose

Print progress updates


Estimate f2 statistics

Description

Computes f2 statistics from f2 blocks of the form f2(A,B)f2(A, B)

Usage

f2(
  data,
  pop1 = NULL,
  pop2 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  verbose = FALSE,
  ...
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pop1

One of the following four:

  1. NULL: all possible population combinations will be returned

  2. A vector of population labels. All combinations with the other pop arguments will be returned

  3. A matrix with population combinations to be tested, with one population per column and one combination per row. Other pop arguments will be ignored.

  4. the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. Other pop arguments will be ignored.

pop2

A vector of population labels

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

sure

The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.

unique_only

If TRUE (the default), redundant combinations will be excluded

verbose

Print progress updates

...

Additional arguments passed to f2_from_geno when data is a genotype prefix

Value

f2 returns a data frame with f2 statistics

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
f2(example_f2_blocks, pop1, pop2)
## Not run: 
f2(f2_dir, pop1, pop2)

## End(Not run)

Compute blocked f2 statistics

Description

This function prepares data for various other ADMIXTOOLS 2 functions. It reads data from genotype files, computes allele frequencies and blocked f2-statistics for selected populations, and returns them as a 3d array.

Usage

f2_from_geno(
  pref,
  inds = NULL,
  pops = NULL,
  blgsize = 0.05,
  maxmem = 8000,
  maxmiss = 0,
  minmaf = 0,
  maxmaf = 0.5,
  pops2 = NULL,
  outpop = NULL,
  outpop_scale = TRUE,
  transitions = TRUE,
  transversions = TRUE,
  auto_only = TRUE,
  keepsnps = NULL,
  afprod = FALSE,
  fst = FALSE,
  poly_only = c("f2"),
  format = NULL,
  adjust_pseudohaploid = TRUE,
  remove_na = TRUE,
  apply_corr = TRUE,
  qpfstats = FALSE,
  verbose = TRUE,
  ...
)

Arguments

pref

Prefix of PLINK/EIGENSTRAT/PACKEDANCESTRYMAP files. EIGENSTRAT/PACKEDANCESTRYMAP have to end in .geno, .snp, .ind, PLINK has to end in .bed, .bim, .fam

inds

Individuals for which data should be extracted

pops

Populations for which data should be extracted. If both pops and inds are provided, they should have the same length and will be matched by position. If only pops is provided, all individuals from the .ind or .fam file in those populations will be extracted. If only inds is provided, each indivdual will be assigned to its own population of the same name. If neither pops nor inds is provided, all individuals and populations in the .ind or .fam file will be extracted.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

maxmem

Maximum amount of memory to be used. If the required amount of memory exceeds maxmem, allele frequency data will be split into blocks, and the computation will be performed separately on each block pair. This doesn't put a precise cap on the amount of memory used (it used to at some point). Set this parameter to lower values if you run out of memory while running this function. Set it to higher values if this function is too slow and you have lots of memory.

maxmiss

Discard SNPs which are missing in a fraction of populations higher than maxmiss

minmaf

Discard SNPs with minor allele frequency less than minmaf

maxmaf

Discard SNPs with minor allele frequency greater than than maxmaf

pops2

If specified, only a pairs between pops and pops2 will be computed

outpop

Keep only SNPs which are heterozygous in this population

outpop_scale

Scale f2-statistics by the inverse outpop heteroygosity (1/(p*(1-p))). Providing outpop and setting outpop_scale to TRUE will give the same results as the original qpGraph when the outpop parameter has been set, but it has the disadvantage of treating one population different from the others. This may limit the use of these f2-statistics for other models.

transitions

Set this to FALSE to exclude transition SNPs

transversions

Set this to FALSE to exclude transversion SNPs

auto_only

Keep only SNPs on chromosomes 1 to 22

keepsnps

SNP IDs of SNPs to keep. Overrides other SNP filtering options

afprod

Return negative average allele frequency products instead of f2-statistics. Setting afprod = TRUE will result in more precise f4-statistics when the original data had large amounts of missingness, and should be used in that case for qpdstat and qpadm. It can also be used for outgroup f3-statistics with a fixed outgroup (for example for qpgraph); values will be shifted by a constant amount compared to regular f3-statistics. This shift affects the fit of a graph only by small amounts, possibly less than bias in regular f3-statistics introduced by large amounts of missing data.

fst

Write files with pairwise FST for every population pair. Setting this to FALSE can make extract_f2 faster and will require less memory.

poly_only

Specify whether SNPs with identical allele frequencies in every population should be discarded (poly_only = TRUE), or whether they should be used (poly_only = FALSE). By default (poly_only = c("f2")), these SNPs will be used to compute FST and allele frequency products, but not to compute f2 (this is the default option in the original ADMIXTOOLS).

format

Supply this if the prefix can refer to genotype data in different formats and you want to choose which one to read. Should be plink to read .bed, .bim, .fam files, or eigenstrat, or packedancestrymap to read .geno, .snp, .ind files.

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE treats all samples as diploid and is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

apply_corr

Apply small-sample-size correction when computing f2-statistics (default TRUE)

qpfstats

Compute smoothed f2-statistics (default FALSE). In the presence of large amounts of missing data, this option can be used to retain information from all SNPs while introducing less bias than setting maxmiss to values greater than 0. When setting qpfstats = TRUE, most other options to extract_f2 will be ignored. See qpfstats for more information. Arguments to qpfstats can be passed via ...

verbose

Print progress updates

...

Pass arguments to qpfstats

Value

A 3d array of f2-statistics (or scaled allele frequency products if afprod = TRUE)

See Also

f2_from_precomp for reading previously stored f2-statistics into R, extract_f2 for storing f2-statistics on disk


Simulate an admixture graph in msprime

Description

This function generates an msprime simulation script, executes it in python, and turns the resulting genotype data into f2-statistics

Usage

f2_from_msprime(..., blgsize = 0.05, cleanup = TRUE, verbose = TRUE)

See Also

msprime_sim


Read blocked f2 statistics from disk

Description

Read blocked f2 statistics from disk

Usage

f2_from_precomp(
  dir,
  inds = NULL,
  pops = NULL,
  pops2 = NULL,
  afprod = FALSE,
  fst = FALSE,
  return_array = TRUE,
  apply_corr = TRUE,
  remove_na = TRUE,
  verbose = TRUE
)

Arguments

dir

Directory with precomputed f2 statistics, or precomputed individual pair data

inds

Individuals for which data should be read. Defaults to all individuals, which may require a lot of memory.

pops

Populations for which data should be read. Defaults to all populations, which may require a lot of memory.

pops2

Optional second vector of populations. Useful if a f4 statistics of a few against many populations should be computed. pops2 should not be specified in other cases, as most functions depend on f2-statistics for all population pairs in pops.

afprod

Return negative average allele frequency products instead of f2 estimates. This will result in more precise f4-statistics when the original data had large amounts of missingness, and should be used in that case for qpdstat and qpadm. It can also be used for outgroup f3-statistics with a fixed outgroup (for example for qpgraph); values will be shifted by a constant amount compared to regular f3-statistics. This shift affects the fit of a graph only by small amounts, possibly less than bias in regular f3-statistics introduced by large amounts of missing data. This option is currently ineffective when reading data extracted with extract_counts.

return_array

Return a 3d array (default). If false, a data frame will be returned.

apply_corr

Subtract the f2 correction factor. Setting this to FALSE can occasionally be useful

remove_na

Remove blocks with missing values

verbose

Print progress updates

Value

A 3d array of f2 statistics

Examples

## Not run: 
dir = 'my/f2/dir/'
f2_blocks = f2_from_precomp(dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)

Turn f2 data to f4 data

Description

Turn f2 data to f4 data

Usage

f2dat_f4dat(f2dat, popcomb = NULL)

Arguments

f2dat

A data frame of f2-statistics with columns pop1, pop2, f2

Value

A data frame with f4-statistics


f3 from genotype data

Description

Compute per-block f3-statistics directly from genotype data

Usage

f3blockdat_from_geno(
  pref,
  popcombs,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  allsnps = FALSE,
  adjust_pseudohaploid = TRUE,
  poly_only = FALSE,
  apply_corr = TRUE,
  outgroupmode = FALSE,
  verbose = TRUE
)

Arguments

pref

Prefix of genotype files

popcombs

A data frame with one population combination per row, and columns pop1, pop2, pop3, pop4. If there is an additional integer column named model and allsnps = FALSE, only SNPs present in every population in any given model will be used to compute f4-statistics for that model.

auto_only

Use only chromosomes 1 to 22.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

block_lengths

An optional vector with block lengths. If NULL, block lengths will be computed.

allsnps

Use all SNPs with allele frequency estimates in every population of any given population quadruple. If FALSE (the default) only SNPs which are present in all populations in popcombs (or any given model in it) will be used. Setting allsnps = TRUE in the presence of large amounts of missing data might lead to false positive results.

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

apply_corr

With apply_corr = FALSE, no bias correction is performed. With apply_corr = TRUE (the default), a bias correction term based on the heterozygosity in the first population is subtracted from the f3 estimate. With apply_corr = 2, the bias correction term is calculated based on all 3 populations. This option is not generally recommended, and only exists to match how the f3-statistics are estimated in certain scenarios in the original qpGraph program.

outgroupmode

With outgroupmode = FALSE, estimates of f3 will be normalized by estimates of the heterozygosity of the target population. This is the default option if the first argument is the prefix of genotype data. If the first argument is an array of precomputed f2-statistics, then no normalization can be performed, which corresponds to outgroupmode = TRUE.

verbose

Print progress updates

Value

A data frame with per-block f4-statistics for each population quadruple.


Compute f4 from allele frequencies

Description

Compute f4 from allele frequencies

Usage

f4_from_afdat(afdat, popcombs)

Arguments

afdat

A data frame with allele frequencies and SNP metadata. Can be grouped.

popcombs

A data frame with population combinations. Columns pop1 to pop4

Examples

## Not run: 
# Compute f4 for all mutatation classes separately
afs = plink_to_afs('/my/geno/prefix', pops = c('p1', 'p2', 'p3', 'p4', 'p5'))
afdat = bind_cols(afs$snpfile, afs$afs %>% as_tibble()) %>%
        mutate(gr = paste0(pmin(A1, A2), pmax(A1, A2))) %>%
        group_by(gr)
popcombs = tibble(pop1 = c('p1', 'p5'), pop2 = 'p2', pop3 = 'p3', pop4 = 'p4')
out = f4_from_afdat(afdat, popcombs)
out %>% ggplot(aes(gr, est)) + geom_point() +
          geom_errorbar(aes(ymin = est - se, ymax = est + se)) +
          facet_wrap(~paste(pop1, pop2, pop3, pop4), scales = 'free')

## End(Not run)

Get per-block f4-statistics

Description

This function turns per-block f2-statistics into per-block f4-statistics of the form ⁠f4(pop1, pop2; pop3, pop4)⁠

Usage

f4_from_f2(f2_data, pop1, pop2 = NULL, pop3 = NULL, pop4 = NULL)

Arguments

f2_data

A 3d array with blocked f2 statistics, output of f2_from_precomp or f2_from_geno Alternatively, a directory with precomputed data. See extract_f2 and extract_counts.

pop1

Either the name(s) of the first population(s), or a four column matrix with the names of all four populations.

pop2

Population 2 (same length as pop1)

pop3

Population 3 (same length as pop1)

pop4

Population 4 (same length as pop1)

Value

A matrix of per-block f4-statistics (⁠popcomb x block⁠)


f4 from genotype data

Description

Compute per-block f4-statistics directly from genotype data

Usage

f4blockdat_from_geno(
  pref,
  popcombs = NULL,
  left = NULL,
  right = NULL,
  auto_only = TRUE,
  blgsize = 0.05,
  block_lengths = NULL,
  f4mode = TRUE,
  allsnps = FALSE,
  poly_only = FALSE,
  snpwt = NULL,
  keepsnps = NULL,
  verbose = TRUE
)

Arguments

pref

Prefix of genotype files

popcombs

A data frame with one population combination per row, and columns pop1, pop2, pop3, pop4. If there is an additional integer column named model and allsnps = FALSE, only SNPs present in every population in any given model will be used to compute f4-statistics for that model.

left

Populations on the left side of f4 (pop1 and pop2). Can be provided together with right in place of popcombs.

right

Populations on the right side of f4 (pop3 and pop4). Can be provided together with left in place of popcombs.

auto_only

Use only chromosomes 1 to 22.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

block_lengths

An optional vector with block lengths. If NULL, block lengths will be computed.

f4mode

If TRUE: f4 is computed from allele frequencies a, b, c, and d as (a-b)*(c-d). if FALSE, D-statistics are computed instead, defined as (a-b)*(c-d) / ((a + b - 2*a*b) * (c + d - 2*c*d)), which is the same as (P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA)).

allsnps

Use all SNPs with allele frequency estimates in every population of any given population quadruple. If FALSE (the default) only SNPs which are present in all populations in popcombs (or any given model in it) will be used. Setting allsnps = TRUE in the presence of large amounts of missing data might lead to false positive results.

poly_only

Only keep SNPs with mean allele frequency not equal to 0 or 1 (default FALSE).

snpwt

A vector of SNP weights

keepsnps

A vector of SNP IDs to keep

verbose

Print progress updates

Value

A data frame with per-block f4-statistics for each population quadruple.


Turn f4 block data to 3d array

Description

Turn f4 block data to 3d array

Usage

f4blockdat_to_f4blocks(f4blockdat, remove_na = TRUE)

Arguments

f4blockdat

f4 block data frame generated by f4blockdat_from_geno

remove_na

Remove blocks with missing values


Find admixture edges

Description

Find admixture edges

Usage

find_admixedges(graph)

Arguments

graph

An admixture graph

Value

A data frame with columns from and to with admixture edges

See Also

find_normedges find_newedges


Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

find_graphs(
  data,
  numadmix = 0,
  outpop = NULL,
  stop_gen = 100,
  stop_gen2 = 15,
  stop_score = 0,
  stop_sec = NULL,
  initgraph = NULL,
  numgraphs = 10,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, place_root_random, mutate_n),
  opt_worst_residual = FALSE,
  plusminus_generations = 5,
  return_searchtree = FALSE,
  admix_constraints = NULL,
  event_constraints = NULL,
  reject_f4z = 0,
  max_admix = numadmix,
  verbose = TRUE,
  ...
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files

numadmix

Number of admixture events within each graph. (Only relevant if initgraph = NULL)

outpop

Name of the outgroup population

stop_gen

Total number of generations after which to stop

stop_gen2

Number of generations without improvement after which to stop

stop_score

Stop once this score has been reached

stop_sec

Number of seconds after which to stop

initgraph

Graph to start with. If it is specified, numadmix and outpop will be inferred from this graph.

numgraphs

Number of graphs in each generation

mutfuns

Functions used to modify graphs. Defaults to the following:

  • spr_leaves: Subtree prune and regraft leaves. Cuts a leaf node and attaches it to a random other edge in the graph.

  • spr_all: Subtree prune and regraft. Cuts any edge and attaches the new orphan node to a random other edge in the graph, keeping the number of admixture nodes constant.

  • swap_leaves: Swaps two leaf nodes.

  • move_admixedge_once: Moves an admixture edge to a nearby location.

  • flipadmix_random: Flips the direction of an admixture edge (if possible).

  • mutate_n: Apply n of the mutation functions in this list to a graph (defaults to 2).

opt_worst_residual

Optimize for lowest worst residual instead of best score. FALSE by default, because the likelihood score is generally a better indicator of the quality of the model fit, and because optimizing for the lowest worst residual is slower (because f4-statistics need to be computed).

plusminus_generations

If the best score does not improve after plusminus_generations generations, another approach to improving the score will be attempted: A number of graphs with on additional admixture edge will be generated and evaluated. The resulting graph with the best score will be picked, and new graphs will be created by removing any one admixture edge (bringing the number back to what it was originally). The graph with the lowest score will then be selected. This often makes it possible to break out of local optima, but is slower than regular graph modifications. If the current number of admixture events is lower than max_numadmix, the last step (removing an admixture edge) will be skipped.

return_searchtree

Return the search tree in addition to the models. Output will be a list with three items: models, search tree, search tree as data frame

admix_constraints

A data frame with constraints on the number of admixture events for each population. See satisfies_numadmix As soon as one graph happens to satisfy these constraints, all subsequently generated graphs will be required to also satisfy them.

event_constraints

A data frame with constraints on the order of events in an admixture graph. See satisfies_eventorder As soon as one graph happens to satisfy these constraints, all subsequently generated graphs will be required to also satisfy them.

reject_f4z

If this is a number greater than zero, all f4-statistics with abs(z) > reject_f4z will be used to constrain the search space of admixture graphs: Any graphs in which f4-statistics greater than reject_f4z are expected to be zero will not be evaluated.

max_admix

Maximum number of admixture edges. By default, this number is equal to numadmix, or to the number of admixture edges in initgraph, so the number of admixture edges stays constant. Setting this to a higher number will lead to more admixture edges being added occasionally (see plusminus_generations). Graphs with additional admixture edges will only be accepted if they improve the score by 5% or more.

verbose

Print progress updates

...

Additional arguments passed to qpgraph

Value

A nested data frame with one model per line

See Also

qpgraph, find_graphs_old

Examples

## Not run: 
res = find_graphs(example_f2_blocks, numadmix = 2)
res %>% slice_min(score)

## End(Not run)
## Not run: 
# Start with a graph with 0 admixture events, increase up to 3, and stop after 10 generations of no improvement
pops = dimnames(example_f2_blocks)[[1]]
initgraph = random_admixturegraph(pops, 0, outpop = 'Chimp.REF')
res = find_graphs(example_f2_blocks, initgraph = initgraph, stop_gen2 = 10, max_admix = 3)
res %>% slice_min(score)

## End(Not run)

Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations across numrep independent repeats to find well fitting admixturegraphs. It uses the function future_map to parallelize across the independent repeats. The function plan can be called to specify the details of the parallelization. This can be used to parallelize across cores or across nodes on a compute cluster. Setting numadmix to 0 will search for well fitting trees, which is much faster than searching for admixture graphs with many admixture nodes.

Usage

find_graphs_old(
  data,
  pops = NULL,
  outpop = NULL,
  numrep = 1,
  numgraphs = 50,
  numgen = 5,
  numsel = 5,
  numadmix = 0,
  numstart = 1,
  keep = c("all", "best", "last"),
  initgraphs = NULL,
  mutfuns = namedList(spr_leaves, spr_all, swap_leaves, move_admixedge_once,
    flipadmix_random, mutate_n),
  mutprobs = NULL,
  opt_worst_residual = FALSE,
  store_intermediate = NULL,
  parallel = TRUE,
  stop_after = NULL,
  verbose = TRUE,
  ...
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files

pops

Populations for which to fit admixture graphs (default all)

outpop

An outgroup population which will split at the root from all other populations in all tested graphs. If one of the populations is know to be an outgroup, designating it as outpop will greatly reduce the search space compared to including it and not designating it as outpop.

numrep

Number of independent repetitions (each repetition can be run in parallel)

numgraphs

Number of graphs in each generation

numgen

Number of generations

numsel

Number of graphs which are selected in each generation. Should be less than numgraphs.

numadmix

Number of admixture events within each graph

numstart

Number of random initializations in each call to qpgraph. Defaults to 1, to speed up the graph optimization.

keep

Which models should be returned. One of all, best, last

  • all (default): Return all evaluated graphs

  • best: Return only the best fitting graph from each repeat and each generation

  • last: Return all graphs from the last generation

initgraphs

Optional graph or list of igraphs to start with. If NULL, optimization will start with random graphs.

mutfuns

Functions used to modify graphs. Defaults to the following:

  • spr_leaves: Subtree prune and regraft leaves. Cuts a leaf node and attaches it to a random other edge in the graph.

  • spr_all: Subtree prune and regraft. Cuts any edge and attaches the new orphan node to a random other edge in the graph, keeping the number of admixture nodes constant.

  • swap_leaves: Swaps two leaf nodes.

  • move_admixedge_once: Moves an admixture edge to a nearby location.

  • flipadmix_random: Flips the direction of an admixture edge (if possible).

  • mutate_n: Apply n of the mutation functions in this list to a graph (defaults to 2).

See examples for how to make new mutation functions.

mutprobs

Relative frequencies of each mutation function.

  • NULL (default) means each mutation function is picked with equal probability

  • A numeric vector of length equal to mutfuns defines the relative frequency of each mutation function

  • A matrix of dimensions numgen x length(mutfuns) defines the relative frequency of each mutation function in each generation

opt_worst_residual

Optimize for lowest worst residual instead of best score. FALSE by default, because the likelihood score is generally a better indicator of the quality of the model fit. Optimizing for the lowest worst residual is also slower (because f4-statistics need to be computed).

store_intermediate

Path and prefix of files for intermediate results to .rds. Can be useful if find_graphs_old doesn't finish sucessfully.

parallel

Parallelize over repeats (if numrep > 1) or graphs (if numrep == 1) by replacing map with future_map. Will only be effective if plan has been set.

stop_after

Stop optimization after stop_after seconds (and after finishing the current generation).

verbose

Print progress updates

...

Additional arguments passed to qpgraph

Value

A nested data frame with one model per line

See Also

qpgraph

Examples

## Not run: 
find_graphs_old(example_f2_blocks, numrep = 200, numgraphs = 100,
            numgen = 20, numsel = 5, numadmix = 3)

## End(Not run)
## Not run: 
# Making new mutation functions by modifying or combining existing ones:
newfun1 = function(graph, ...) mutate_n(graph, 3, ...)
newfun2 = function(graph, ...) flipadmix_random(spr_leaves(graph, ...), ...)
find_graphs_old(f2_blocks, mutfuns = namedList(spr_leaves, newfun1, newfun2), mutprobs = c(0.2, 0.3, 0.5))

## End(Not run)

Find possible new edges

Description

Find possible new edges

Usage

find_newedges(graph, fix_outgroup = TRUE, all = TRUE)

Arguments

graph

An admixture graph

Value

A data frame with columns from and to. New edges which begin above from and end above to could be inserted

See Also

find_normedges find_admixedges


Find drift edges

Description

Find drift edges

Usage

find_normedges(graph, exclude_first = FALSE)

Arguments

graph

An admixture graph

exclude_first

Do not return edge from root to outgroup

Value

A data frame with columns from and to with drift edges

See Also

find_newedges find_admixedges


Modify a graph flipping the direction of an admixture edge

Description

Modify a graph flipping the direction of an admixture edge

Usage

flipadmix_random(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

fix_outgroup

Keep outgroup in place (has no effect here)

Value

A new admixture graph


Compute Fst

Description

This function reads Fst from a directory with precomputed f2-statistics, and turns per-block data into estimates and standard errors for each population pair. See details for how Fst is computed.

Usage

fst(data, pop1 = NULL, pop2 = NULL, boot = FALSE, verbose = FALSE, ...)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked Fst, output of f2_from_precomp with fst = TRUE

  2. A directory which contains pre-computed Fst

  3. The prefix of genotype files

pop1

One of the following four:

  1. NULL: all possible population combinations will be returned

  2. A vector of population labels. All combinations with the other pop arguments will be returned

  3. A matrix with population combinations to be tested, with one population per column and one combination per row. Other pop arguments will be ignored.

  4. the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. Other pop arguments will be ignored.

pop2

A vector of population labels

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

verbose

Print progress updates

...

Additional arguments passed to f2_from_geno when data is a genotype prefix

Details

The Hudson Fst estimator used here is described in the two publications below. For two populations with estimated allele frequency vectors p1 and p2, and allele count vectors n1 and n2, it is calculated as follows:

num = (p1 - p2)^2 - p1*(1-p1)/(n1-1) - p2*(1-p2)/(n2-1)
denom = p1 + p2 - 2*p1*p2
fst = mean(num)/mean(denom)

This is done independently for each SNP block, and is stored on disk for each population pair. Jackknifing or bootstrapping across these per-block estimates yields the overall estimates and standard errors.

References

Reich, D. (2009) Reconstructing Indian population history Nature

Bhatia, G. (2013) Estimating and interpreting Fst: the impact of rare variants Genome Research

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
fst(f2_dir, pop1, pop2)

## End(Not run)

Generate all graphs

Description

This functions generates all possible admixture graphs with a set number of admixture events for a given set of leaf nodes. It's pretty slow, and may not terminate in reasonable time for more than 5 leaves and 2 admixture events. The function is similar to the all_graphs function in the admixturegraph package, but there are a few differences:

  • The function does not return graphs with fewer than nadmix admixture events

  • The function does not return most graphs which are unidentifiable and would have equal fits as simpler identifiable graphs (for example it does not return graphs where a node is expanded to a loop)

  • The function does not return duplicated graphs, as identified by the graph_hash function

  • The function generates unique graphs which are missing in the output of all_graphs

Usage

generate_all_graphs(leaves, nadmix = 0, verbose = TRUE)

Arguments

leaves

The leaf nodes

nadmix

The number of admixture nodes

verbose

Print progress updates

Value

A list of graphs in igraph format

See Also

all_graphs, generate_all_trees, graph_hash

Examples

## Not run: 
graphs = generate_all_graphs(letters[1:4], 1)

## End(Not run)

Generate all trees

Description

This functions generates all possible trees with for a given set of leaf nodes.

Usage

generate_all_trees(leaves)

Arguments

leaves

The leaf nodes

Value

A list of trees in igraph format


Find LD-independent blocks

Description

A new block begins at the SNP after the first SNP which is not within blgsize of the start of the last block. dat needs to be ordered first by 'CHR', then by 'POS' or 'cm'

Usage

get_block_lengths(dat, blgsize = 0.05, cpp = TRUE, verbose = TRUE)

Arguments

dat

Data frame with columns 'CHR' and either 'POS' or 'cm'

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). If blgsize is 100 or greater, if will be interpreted as base pair distance rather than centimorgan distance.

cpp

Should the faster C++ version be used?

verbose

Print progress updates

Value

A numeric vector where the ith element lists the number of SNPs in the ith block.

Examples

## Not run: 
prefix = 'path/to/packedancestrymap_prefix'
pops = c('pop1', 'pop2', 'pop3')
afdat = packedancestrymap_to_afs(prefix, pops = pops)
block_lengths = get_block_lengths(afdat)

## End(Not run)

Turns f2_data into f2_blocks

Description

Turns f2_data into f2_blocks

Usage

get_f2(f2_data, pops = NULL, pops2 = NULL, afprod = FALSE, verbose = TRUE, ...)

Arguments

f2_data

f2 data as genotype file prefix, f2 directory, or f2 blocks

pops

Populations for which to extract f2-stats (defaults to all)

pops2

Optional second vector of populations. Can result in non-square array

afprod

Get allele frequency products

verbose

Print progress updates

...

Additional arguments passed to f2_from_precomp or f2_from_geno

Value

A 3d array with f2-statistics


Get the population names of a graph

Description

Get the population names of a graph

Usage

get_leafnames(graph)

Arguments

graph

An admixture graph

Value

Population names


Get the outgroup from a graph (if it exists)

Description

Get the outgroup from a graph (if it exists)

Usage

get_outpop(graph)

Arguments

graph

An admixture graph

Value

Outgroup name


Get the root name

Description

Get the root name

Usage

get_rootname(graph)

Arguments

graph

An admixture graph

Value

Root name


Add a population to an admixture graph

Description

Add a population to an admixture graph

Usage

graph_addleaf(graph, pop)

Arguments

graph

An admixture graph

pop

Population to add to the graph

Value

Admixture graph with the added population

See Also

insert_leaf adds pop at a specific position


Pairwise distance estimates for graphs

Description

Computes a distance estimate for each graph pair. Each graph is first summarized as a vector which counts for every leaf pair how many internal nodes reach that pair. The distance between two graphs is the Euclidean distance between the vectors of two graphs, and is scaled to fall between 0 and 1.

Usage

graph_distances(graphlist)

Arguments

graphlist

List of graphs

Value

A data frame with graph distances


Find well fitting admixture graphs

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

graph_equations(
  graph,
  substitute = TRUE,
  nam = c("a", "e", "f"),
  return_everything = FALSE
)

Arguments

graph

An admixture graph

substitute

Should edge names be represented by shorter symbols?

nam

Symbols used to shorten edge names

Value

A list with two data frames: equations holds the equtions for all f2-statistics; coding has the mapping from edge names to edge symbols, which is used when substitute = TRUE


Make a function representing a graph

Description

This function takes an igraph object and turns it into a function that takes edge weights as input, and outputs the expected f2-statistics.

Usage

graph_f2_function(
  graph,
  admix_default = 0.5,
  drift_default = 0.01,
  random_defaults = FALSE
)

Arguments

graph

An admixture graph

admix_default

The default weights for admixture edges

drift_default

The default weights for drift edges

random_defaults

Set default weights randomly for each edge between 0 and 1

Value

A function mapping edge weights to f2-statistics

Examples

## Not run: 
mygraph = graph_f2_function(example_igraph)
mygraph(N3N8 = 0.1, `N2N1|Vindija.DG` = 0.4)

## End(Not run)

Find all valid graphs which result from flipping one admixture edge

Description

Find all valid graphs which result from flipping one admixture edge

Usage

graph_flipadmix(graph)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_flipadmix(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Get unique hash of an admixture graph

Description

This can be used to check if two graphs are identical. For two graphs to be identical, the topology and leaf nodes have to match, but internal node names do not matter.

Usage

graph_hash(graph)

Arguments

graph

Admixture graph in igraph format

Value

A hash string of the admixture graph


Find all graphs which result from removing one admixture edge

Description

Find all graphs which result from removing one admixture edge

Usage

graph_minusone(graph, ntry = Inf)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_minusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all graphs which result from adding and removing one admixture edge

Description

Find all graphs which result from adding and removing one admixture edge

Usage

graph_minusplus(graph)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns source_from, source_to, dest_from, dest_to, and graph

Examples

## Not run: 
newgraphs = graph_minusplus(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all graphs which result from adding one admixture edge

Description

Find all graphs which result from adding one admixture edge

Usage

graph_plusone(graph, ntry = Inf)

Arguments

graph

Admixture graph in igraph format

ntry

Specify this to return only a subset of all possible graphs with one more edge

Value

A data frame with columns from, to, and graph

Examples

## Not run: 
newgraphs = graph_plusone(example_igraph)
# now evaluate the new graphs
newgraphs %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Find all trees which are part of the admixture graph

Description

Find all trees which are part of the admixture graph

Usage

graph_splittrees(graph, return_admix = FALSE, simplify = TRUE)

Arguments

graph

Admixture graph in igraph format

Value

A data frame with columns name and graph

Examples

## Not run: 
trees = graph_splittrees(example_igraph)
# now evaluate the trees
trees %>%
  rowwise %>%
  mutate(res = list(qpgraph(example_f2_blocks, graph))) %>%
  unnest_wider(res)

## End(Not run)

Simulate allele frequncies under an admixture graph

Description

This function performs a very crude simulation of allele frequencies under an admixture graph model

Usage

graph_to_afs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = FALSE
)

Arguments

graph

An admixture graph as igraph object, or as edge list data frame with a column weight, as returned by qpgraph()$edges

nsnps

Number of SNPs to simulate

drift_default

Default branch lengths. Ignored if graph is a data frame with weights

admix_default

Default admixture weights. Ignored if graph is a data frame with weights

leaves_only

Return allele frequencies for leaf nodes only

Value

A data frame with allele frequencies

See Also

graph_to_pcs

Examples

## Not run: 
afs = graph_to_afs(example_igraph)

## End(Not run)

Simulate PCs under an admixture graph

Description

This function simulates PCA of allele frequencies under an admixture graph model

Usage

graph_to_pcs(
  graph,
  nsnps = 10000,
  drift_default = 0.02,
  admix_default = 0.5,
  leaves_only = TRUE
)

Arguments

graph

An admixture graph as igraph object, or as edge list data frame with a column weight, as returned by qpgraph()$edges

nsnps

Number of SNPs to simulate

drift_default

Default branch lengths. Ignored if graph is a data frame with weights

admix_default

Default admixture weights. Ignored if graph is a data frame with weights

leaves_only

Return PCs for leaf nodes only

Value

A data frame with PCs for each population

See Also

graph_to_afs

Examples

## Not run: 
pcs = graph_to_pcs(example_igraph)
pcs %>% ggplot(aes(PC1, PC2, label = pop)) + geom_text() + geom_point()

## End(Not run)

Get all qpadm models for a graph

Description

This function tests which qpadm models should be valid for an admixture graph and a target population. By default, all models returned by qpadm_models are tested. For large graphs this will be too slow, and you may want test only some models by providing the models argument, or only a single model by providing the left and right arguments.

Usage

graph_to_qpadm(
  graph,
  target,
  left = NULL,
  right = NULL,
  models = NULL,
  weights = TRUE,
  f4dat = NULL,
  allpops = TRUE,
  more_right = TRUE,
  return_f4 = FALSE,
  eps = 1e-10
)

Arguments

graph

An admixture graph

target

Name of the target population.

left

Left populations (provide this optionally if you want to test only a single qpadm model)

right

Right populations (provide this optionally if you want to test only a single qpadm model)

models

A two column nested data frame with models to be evaluated, one model per row. The first column, l, should contain the left populations, the second column, r, should contain the right populations. The target population is provided separately in the target argument.

weights

Set this to FALSE to return only information on the ranks, not the weights, of each qpadm model. The ranks should depend only on the graph topology, while the weights and weight-validity (all weights for left populations between 0 and 1) can depend on the branch lengths of the graph. By default f4-statistics are based on equal branch lengths and admixture weights of 0.5. This can be overridden by providing f4dat.

f4dat

A data frame of f4-statistics which can be provided to override the default branch lengths.

allpops

Evaluate only models which use all populations in the admixture graph. See qpadm_models

return_f4

Include f4 statistic matrices in the results (default FALSE)

eps

Epsilon value close to zero which is used for determining which f4 matrix elements should be considered non-zero, and which weights are strictly between 0 and 1.

Details

Two validity criteria are tested for each qpadm model: Rank validity and weight validity. Rank validity means that the rank of the f4 statistic matrix for only left and right populations is the same as the rank of the f4 statistic matrix that includes the target population among the left populations. Weight validity means that the estimated admixture weights for all left populations are between 0 and 1.

Value

A data frame with one qpadm model per row and columns valid_rank and valid_weights indicating whether a model should be valid under the graph.

Note

An earlier version of this function tried to use the graph topology for identifying valid qpadm models, but this didn't work reliably. Christian Huber had the idea of using the ranks of expected f4 statistic matrices instead.

See Also

qpadm_models, graph_f2_function

Examples

## Not run: 
graph2 = example_igraph %>% simplify_graph() %>%
  delete_admix('N3N0', 'N3N4') %>% delete_admix('N3N1', 'N3N8')
graph_to_qpadm(graph2, 'Mbuti.DG') %>% filter(valid_rank, valid_weights)

## End(Not run)

Return all graphs created from permuting a subclade

Description

generates new graphs from basegraph as follows:

  1. generates all possible trees using addpops (which are not in basegraph)

  2. attaches trees to connection_edge, which is defined by two nodes in basegraph

  3. adds edges originating above each edge in source_node, to each node above addpops

Usage

graphmod_pavel(basegraph, addpops, connection_edge, source_nodes)

Arguments

basegraph

an admixture graph as igraph object. (convert from edge list using igraph::graph_from_edgelist)

addpops

a vector of population labels which are not in basegraph. These populations should form a clade. All possible trees will be generated and those trees will be attached to basegraph.

connection_edge

edge in basegraph where the tree made from addpops should be attached

source_nodes

nodes in basegraph. edges above these nodes will be added and attached to all terminal edges leading to addpops

Examples

## Not run: 
graphlist = graphmod_pavel(example_igraph, addpops = c('pop1', 'pop2', 'pop3'),
                           connection_edge = c('N2N0', 'N1N'),
                           source_nodes = c('Denisova.DG', 'N2N2'))
results = tibble(graph = graphlist) %>%
  mutate(res = map(graph, ~qpgraph(example_f2_blocks, .))) %>%
  unnest_wider(res) %>%
  mutate(worstz = map_dbl(f3, ~max(abs(.$z))))

## End(Not run)

Group precomputed data

Description

Computing f2 statistics for populations consisting of many individuals can be slow and require a lot of memory. To speed this up, this function groups individuals into populations, computes allele counts and products of pairwise allele counts with all individuals and groups, and writes the data to disk. f2 statistics for a combination of grouped and ungrouped precomputed data can then be read using f2_from_precomp, replacing individual IDs of the grouped samples with the new group labels. All groupings are listed in ⁠{dir}/groups/{groupname}.rds⁠

Usage

group_samples(dir, inds, pops, overwrite = FALSE, verbose = TRUE)

Arguments

dir

Directory with precomputed individual pair data

inds

Individuals to group

pops

Group names, either length 1, or same length as inds

overwrite

Overwrite existing files in outdir

verbose

print progress updates

See Also

delete_groups

Examples

## Not run: 
dir = 'my/f2/dir/'
inds = c('ind1', 'ind2', 'ind3', 'ind4', 'ind5')
pops = c('pop_A', 'pop_A', 'pop_A', 'pop_B', 'pop_B')
group_samples(dir, inds, pops)

## End(Not run)

Convert igraph to agraph

Description

agraph is the format used by the admixturegraph packge. igraph is used by the admixtools package

Usage

igraph_to_agraph(igraph)

Arguments

agraph

An admixture graph in igraph format

Value

An admixture graph in agraph format

Examples

## Not run: 
igraph_to_agraph(example_igraph)

## End(Not run)

Insert a single edge into graph

Description

Insert a single edge into graph

Usage

insert_admix(
  graph,
  source_from = NULL,
  source_to = NULL,
  dest_from = NULL,
  dest_to = NULL,
  substitute = FALSE,
  fix_outgroup = TRUE
)

Arguments

graph

An admixture graph

source_from

Parent node of the source edge

source_to

Child node of the source edge

dest_from

Parent node of the destination edge

dest_to

Child node of the destination edge

substitute

Should another edge be inserted, if the one specified doesn't work?

Value

Adxmiture graph with inserted edge

See Also

delete_admix, insert_admix_n


Insert admixture edges into graph

Description

Insert admixture edges into graph

Usage

insert_admix_n(graph, n = 1, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

from

List of nodes. New edges will originate above these nodes.

to

List of nodes. New edges will end above these nodes.

substitute

Should another edge be inserted, if the one specified doesn't work?

Value

Admixture graph with inserted edges

See Also

insert_admix delete_admix


Insert admixture edges into graph

Description

Insert admixture edges into graph

Usage

insert_admix_old(
  graph,
  fromnodes,
  tonodes,
  substitute_missing = FALSE,
  allow_below_admix = FALSE
)

Arguments

graph

An admixture graph

fromnodes

List of nodes. New edges will originate above these nodes.

tonodes

List of nodes. New edges will end above these nodes.

substitute_missing

If TRUE, an attempt will be made to insert random other edges if some of the provided edges could not be inserted.

allow_below_admix

Allow insertion of edges which begin or end directly underneath an admixture node

Value

Adxmiture graph with inserted edges


Add population to graph

Description

Add population to graph

Usage

insert_leaf(graph, leaf, from, to)

Arguments

graph

An admixture graph

leaf

Population to be added

from

Source node of edge onto which leaf should be added

to

Target node of edge onto which leaf should be added

Value

Admixture graph with added population

See Also

delete_leaf, graph_addleaf to add leaf at any position


Test if an admixture graph is valid

Description

Test if an admixture graph is valid

Usage

is_valid(graph)

Arguments

graph

An admixture graph

Value

TRUE if graph is valid, otherwise FALSE


Find identical graphs

Description

Find identical graphs

Usage

isomorphism_classes(igraphlist)

Arguments

igraphlist

A list with admixture graphs

Value

An integer vector with isomorphism classes. Graphs with the same number have identical topology (but may have different labels).

See Also

isomorphism_classes2


Find identical graphs

Description

Find identical graphs

Usage

isomorphism_classes2(igraphlist)

Arguments

igraphlist

A list with admixture graphs

Value

An integer vector with isomorphism classes. Graphs with the same number have identical topology and leaf labels (but may have different internal labels).

See Also

isomorphism_classes


Joint site frequency spectrum

Description

This function computes the joint site frequency spectrum from genotype files or allele frequency and count matrices. The joint site frequency spectrum lists how often each combination of haplotypes is observed. The number of combinations is equal to the product of one plus the number of haplotypes in each population. For example, five populations with a single diploid individual each have 3^5 possible combinations.

Usage

joint_sfs(afs, pref = NULL)

Arguments

afs

A named list of length two where the first element (afs) contains the allele frequency matrix, and the second element (counts) contains the allele count matrix.

pref

Instead of afs, the prefix of genotype files can be provided.

Value

A data frame with the number of times each possible combination of allele counts is observed.

Examples

## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))
joint_sfs(dat)

## End(Not run)

Estimate joint allele frequency spectrum

Description

Estimate joint allele frequency spectrum

Usage

joint_spectrum(afs)

Arguments

afs

A matrix or data frame of allele frequencies

Value

A data frame with columns pattern and proportion

Examples

## Not run: 
dat = plink_to_afs('/my/plink/file', pops = c('pop1', 'pop2', 'pop3', 'pop4', 'pop5'))

# Spectrum across all SNPs
joint_spectrum(dat$afs)

# Stratify by allele frequency in one population
dat$afs %>% as_tibble %>% select(1:4) %>%
  group_by(grp = cut(pop1, 10)) %>%
  group_modify(joint_spectrum) %>%
  ungroup

# Stratify by mutation class
dat$afs %>% as_tibble %>% select(1:4) %>%
  mutate(mut = paste(dat$snpfile$A1, dat$snpfile$A2)) %>%
  group_by(mut) %>%
  group_modify(joint_spectrum) %>%
  ungroup

## End(Not run)

Estimate admixture weights

Description

Models target as a mixture of left populations, and outgroup right populations. Uses Lazaridis method based non-negative least squares of f4 matrix.

Usage

lazadm(data, left, right, target, boot = FALSE, constrained = TRUE)

Arguments

data

The input data in the form of:

  • A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2

  • A directory with f2 statistics

  • The prefix of a genotype file

left

Left populations (sources)

right

Right populations (outgroups)

target

Target population

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

constrained

Constrain admixture weights to be non-negative

Value

lazadm returns a data frame with weights and standard errors for each left population

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 9)

See Also

qpadm

Examples

target = 'Denisova.DG'
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
lazadm(example_f2_blocks, left, right, target)
lazadm(example_f2_blocks, left, right, target, constrained = FALSE)

Generate a list of leave-one-out arrays

Description

Generate a list of leave-one-out arrays

Usage

loo_list(arr)

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension

Value

A list of leave-one-out arrays, each of which is 1 element shorter than arr along the 3rd dimension

See Also

boo_list est_to_loo


Turn leave-one-out estimates to per-block estimates

Description

Inverse of est_to_loo This works for any statistics which, when computed across N blocks, are equal to the weighted mean of the statistics across the N blocks.

Usage

loo_to_est(arr, block_lengths = NULL)

Arguments

arr

3d array with blocked estimates, with blocks in the 3rd dimension.

block_lengths

Optional block lengths. If NULL, will be parsed from 3rd dimnames in blocks

Value

A 3d array with leave-one-out estimates for jackknife. Dimensions are equal to those of arr.

See Also

est_to_loo


Modify a graph by moving an admixture edge

Description

Modify a graph by moving an admixture edge

Usage

move_admixedge_once(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

fix_outgroup

Keep outgroup in place

Value

A new admixture graph


Simulate an admixture graph in msprime v1.x.

Description

This function generates an msprime simulation script, and optionally executes it in python. Unlike msprime_sim, this function can simulate continuous sequence (not independent SNPs) and multiple chromosomes.

Usage

msprime_genome(
  graph,
  outpref = "msprime_sim",
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  time = 1000,
  fix_leaf = FALSE,
  nchr = 1,
  recomb_rate_chr = 2e-08,
  seq_length = 1000,
  admix_default = 0.5,
  run = FALSE,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)

Arguments

graph

A graph as an igraph object or edge list with columns ’from’ and ’to’. If it is an edge list with a column ’weight’ (derived possibly from a fitted graph), the admixture weights will be used. Otherwise, all admixture edges will have a weight of 0.5.

outpref

A prefix of output files.

neff

Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., c('R'=100, 'A'=50, 'B'=50)).

ind_per_pop

The number of diploid individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., c('A'=10, 'B'=20) to sample 10 and 20 diploid individuals from populations A and B, respectively).

mutation_rate

Mutation rate per site per generation. The default is 1.25e-8 per base pair per generation.

time

Either a scalar value (1000 generations by default) with the dates generated by pseudo_dates, or a named vector with dates for each graph node (in generations).

fix_leaf

A boolean value specifying if the dates of the leaf nodes will be fixed at time 0. If TRUE, all samples will be drawn at the end of the simulation (i.e., from “today”).

nchr

Number of chromosomes to simulate.

recomb_rate_chr

A float value specifying recombination rate along the chromosomes. The default is 2e-8 per base pair per generation.

seq_length

The sequence length of the chromosomes. If it is a scalar value, the sequence length will be constant for all chromosomes. Alternatively, it can be a vector with a length equal to the number of chromosomes (i.e., c(100,50) to simulate 2 chromosomes with the lengths of 100 and 50 base pairs).

admix_default

A float value specifying default admixture proportion for all admixture nodes. The default is 0.5. If another value between 0 and 1 is specified, admixture weights for each admixture event will be (value, 1-value).

run

If FALSE, the function will terminate after writing the msprime script. If TRUE, it will try to execute the msprime script with the default python installation. If you want to use some other python installation, you can set ⁠run = /my/python⁠.

ghost_lineages

A boolean value specifying whether ghost lineages will be allowed. If TRUE, admixture happens at the time points defined by the y-axis generated while plotting the graph by plot_graph. If FALSE (default), admixture occurs at the time of the previous split event.

shorten_admixed_leaves

If TRUE simulate the behavior of treemix where drift after admixture is not allowed

Value

The file name and path of the simulation script

Examples

results = qpgraph(example_f2_blocks, example_graph)
# Simulate 3 chromosomes whose lengths are 50, 100 and 100
msprime_genome(results$edges, nchr=3, seq_length=c(50, 100, 100))

Simulate an admixture graph in msprime v1.x

Description

This function generates an msprime simulation script, and optionally executes it in python. Unlike the msprime_genome function, this function simulates independent SNP sites.

Usage

msprime_sim(
  graph,
  outpref = "msprime_sim",
  nsnps = 1000,
  neff = 1000,
  ind_per_pop = 1,
  mutation_rate = 0.001,
  time = 1000,
  fix_leaf = FALSE,
  admix_default = 0.5,
  run = FALSE,
  numcores = NULL,
  ghost_lineages = FALSE,
  shorten_admixed_leaves = FALSE
)

Arguments

graph

A graph as an igraph object or edge list with columns ’from’ and ’to’. If it is an edge list with a column ’weight’ (derived possibly from a fitted graph), the admixture weights will be used. Otherwise, all admixture edges will have a weight of 0.5.

outpref

A prefix of output files.

nsnps

The number of SNPs to simulate. All SNPs will be simulated independently of each other.

neff

Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., c('R'=100, 'A'=50, 'B'=50)).

ind_per_pop

The number of diploid individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population (e.g., c('A'=10, 'B'=20) to sample 10 and 20 diploid individuals from populations A and B, respectively).

mutation_rate

Mutation rate per site per generation. The default is set to a high value (0.001 per site per generation) to obtain more polymorphic SNPs in order to speed up the simulation.

time

Either a scalar value (1000 generations by default) with the dates generated by pseudo_dates function, or a named vector with dates for each graph nodes (in generations).

fix_leaf

A boolean value specifying if the dates of the leaf nodes will be fixed at time 0. If TRUE, all samples will be drawn at the end of the simulation (i.e., from “today”).

admix_default

A float value specifying default admixture proportion for all admixture nodes. The default is 0.5. If another value between 0 and 1 is specified, admixture weights for each admixture event will be (value, 1-value).

run

If FALSE, the function will terminate after writing the msprime script. If TRUE, it will try to execute the msprime script with the default python installation. If you want to use some other python installation, you can set ⁠run = /my/python⁠.

numcores

The number of cores to use when simulating data.

ghost_lineages

A boolean value specifying whether ghost lineages will be allowed. If TRUE, admixture happens at the time points defined by the y-axis generated while plotting the graph by plot_graph. If FALSE (default), admixture occurs at the time of the previous split event.

shorten_admixed_leaves

If TRUE simulate the behavior of treemix where drift after admixture is not allowed

Value

The file name and path of the simulation script

Examples

## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
msprime_sim(results$edges, nsnps=100)

## End(Not run)

Modify a graph by applying n mutation functions

Description

Modify a graph by applying n mutation functions

Usage

mutate_n(
  graph,
  n = 2,
  funs = list(spr_all, spr_leaves, swap_leaves, move_admixedge_once, flipadmix_random,
    mutate_n),
  fix_outgroup = TRUE
)

Arguments

graph

An admixture graph

n

Number of functions to apply

funs

List of function from which to choose

fix_outgroup

Keep outgroup in place

Value

A new admixture graph


Turn a newick format tree to a matrix of edges

Description

Turn a newick format tree to a matrix of edges

Usage

newick_to_edges(newick, node = "R", edgemat = matrix(NA, 0, 2))

Arguments

newick

Tree in newick format.

node

Root label of the tree.

edgemat

Used for recursive function calls.

Value

Tree as two column matrix of edges (adjacency list)

Examples

newick = random_newick(c('a', 'b', 'c', 'd'))
newick
newick_to_edges(newick)

Count how often each node in graph occurs in other graphs

Description

Count how often each node in graph occurs in other graphs

Usage

node_counts(graph, graphlist)

Arguments

graph

An admixture graph

graphlist

List of graphs


Returns a signature of a graph consisting of the left and right descendent leaf nodes of each internal node (sorted and concatenated)

Description

Can be used to determine how often internal nodes occur in a list of other well fitting models

Usage

node_signature(graph)

Arguments

graph

An admixture graph

Value

A graph signature as character vector

Examples

## Not run: 
sigs = example_winners %>% mutate(sig = map(igraph, node_signature)) %$%
         sig %>% unlist %>% table %>% c
node_signature(example_winners$igraph[[1]])

## End(Not run)

Count number of admixture nodes

Description

Count number of admixture nodes

Usage

numadmix(graph)

Arguments

graph

An admixture graph

Value

Number of admixture nodes


Read allele frequencies from packedancestrymap files

Description

Read allele frequencies from packedancestrymap files

Usage

packedancestrymap_to_afs(
  pref,
  inds = NULL,
  pops = NULL,
  adjust_pseudohaploid = TRUE,
  first = 1,
  last = NULL,
  verbose = TRUE
)

Arguments

pref

Prefix of packedancestrymap files (files have to end in .geno, .ind, .snp)

inds

Individuals from which to compute allele frequencies

pops

Populations from which to compute allele frequencies. If NULL (default), populations will be extracted from the third column in the .ind file. If population labels are provided, they should have the same length as inds, and will be matched to them by position

adjust_pseudohaploid

Genotypes of pseudohaploid samples are usually coded as 0 or 2, even though only one allele is observed. adjust_pseudohaploid ensures that the observed allele count increases only by 1 for each pseudohaploid sample. If TRUE (default), samples that don't have any genotypes coded as 1 among the first 1000 SNPs are automatically identified as pseudohaploid. This leads to slightly more accurate estimates of f-statistics. Setting this parameter to FALSE is equivalent to the ADMIXTOOLS inbreed: NO option. Setting adjust_pseudohaploid to an integer n will check the first n SNPs instead of the first 1000 SNPs.

verbose

Print progress updates

Value

A list with three data frames: allele frequency data, allele counts, and SNP metadata

Examples

## Not run: 
afdat = packedancestrymap_to_afs(prefix, pops = pops)
afs = afdat$afs
counts = afdat$counts

## End(Not run)

Read graph in dot format

Description

Read graph in dot format

Usage

parse_dot(dotfile)

Arguments

dotfile

Name of a file with a dot formatted admixture graph

Examples

## Not run: 
graph = parse_dot("/my/graph.dot")

## End(Not run)

Read qp3Pop output file

Description

Read qp3Pop output file

Usage

parse_qp3pop_output(outfile)

Arguments

outfile

output file generated by qp3Pop.

Value

tibble with output data.


Read qpAdm output file

Description

Read qpAdm output file

Usage

parse_qpadm_output(outfile)

Arguments

outfile

Output file generated by qpAdm.

Value

Data frame with output data.


Read qpDstat output file

Description

Read qpDstat output file

Usage

parse_qpdstat_output(outfile)

Arguments

outfile

output file generated by qpDstat.

Value

tibble with output data.


Read qpF4ratio output file

Description

Read qpF4ratio output file

Usage

parse_qpf4ratio_output(parfile)

Arguments

outfile

Output file generated by qpF4ratio

Value

Data frame with output data


Read qpGraph graph file

Description

Read qpGraph graph file

Usage

parse_qpgraph_graphfile(
  graphfile,
  split_multi = TRUE,
  igraph = FALSE,
  uselabels = FALSE
)

Arguments

graphfile

File with admixture graph in qpGraph format.

split_multi

Split multifurcations

igraph

Convert to igraph format

uselabels

Should labels or full names be used? Defaults to full names.

Value

Graph represented as two column edge matrix. Can have four columns if edges are locked


Read qpGraph output file

Description

Read qpGraph output file

Usage

parse_qpgraph_output(outfile, logfile = outfile)

Arguments

outfile

output file generated by qpGraph.

Value

list of output data.


Modify a graph by permuting leaf nodes

Description

Modify a graph by permuting leaf nodes

Usage

permute_leaves(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

fix_outgroup

Keep outgroup in place

Value

A new admixture graph


Modify a graph by changing the position of the root node

Description

Modify a graph by changing the position of the root node

Usage

place_root_random(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

fix_outgroup

Keep outgroup in place

Value

A new admixture graph


Compare two models

Description

Plot a comparison of two runs of qpgraph or two runs of qpadm

Usage

plot_comparison(out1, out2, name1 = NULL, name2 = NULL)

Arguments

out1

First model

out2

Second model

name1

Optional name for first model

name2

Optional name for second model

Value

A ggplot object.

Examples

fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plot_comparison(fit1, fit2)

Plot an admixture graph

Description

Plot an admixture graph

Usage

plot_graph(
  graph,
  fix = NULL,
  title = "",
  color = TRUE,
  textsize = 2.5,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  dates = NULL,
  neff = NULL,
  scale_y = FALSE,
  hide_weights = FALSE
)

Arguments

graph

An admixture graph. If it's an edge list with a label column, those values will be displayed on the edges

fix

If TRUE, there will be an attempt to rearrange the nodes to minimize the number of intersecting edges. This can take very long for large graphs. By default this is only done for graphs with fewer than 10 leaves.

title

A plot title

color

Plot it in color or greyscale

textsize

Size of edge and node labels

highlight_unidentifiable

Highlight unidentifiable edges in red. Can be slow for large graphs. See unidentifiable_edges.

pos

An optional data frame with node coordinates (columns node, x, y)

dates

An optional named vector with dates (in generations) for each node to plot dates on the y-axis (e.g., c('R'=1000, 'A'=0, 'B'=0)). If this option is supplied, the y-axis will display dates in generations.

neff

An optional named vector with effective population sizes for each population (e.g., c('R'=100, 'A'=100, 'B'=100)). If this option is supplied, the effective population size of each population will be shown next to the corresponding edge.

scale_y

If TRUE, scale the y-axis according to dates vector. The default is FALSE.

hide_weights

A boolean value specifying if the drift values on the edges will be hidden. The default is FALSE.

Value

A ggplot object

Examples

plot_graph(example_graph)

# Plot a random simulation output. Show dates and population sizes on the plot
out = random_sim(nleaf=5, nadmix=1)
plot_graph(out$edges, dates=out$dates, neff=out$neff)

Plot an admixture graph on a map

Description

Plot an admixture graph on a map

Usage

plot_graph_map(graph, leafcoords, shapedata = NULL)

Arguments

graph

a two column matrix specifying the admixture graph. first column is source node, second column is target node. the first edge has to be root -> outgroup. admixture nodes are inferred as those nodes which are the targets of two different sources.

leafcoords

data frame with columns group, lon, lat

shapedata

shapedata

Value

a plotly object

Examples

plot_graph_map(example_igraph, example_anno)

Plot samples on a map

Description

Plot samples on a map

Usage

plot_map(
  leafcoords,
  map_layout = 1,
  color = "yearsbp",
  colorscale = "Portland",
  collog10 = TRUE
)

Arguments

leafcoords

data frame with columns iid, lon, lat

map_layout

1 or 2

color

color

colorscale

colorscale

collog10

collog10

Value

a plotly object.

Examples

## Not run: 
plot_map(example_anno, 1)

## End(Not run)

Compare two models

Description

Plot a comparison of two runs of qpgraph or two runs of qpadm

Usage

plotly_comparison(out1, out2, name1 = NULL, name2 = NULL)

Arguments

out1

First model

out2

Second model

name1

Optional name for first model

name2

Optional name for second model

Value

A plotly object.

Examples

fit1 = qpgraph(example_f2_blocks, example_graph, lsqmode = FALSE)
fit2 = qpgraph(example_f2_blocks, example_graph, lsqmode = TRUE)
plotly_comparison(fit1, fit2)

Plot an admixture graph using plotly

Description

Plot an admixture graph using plotly

Usage

plotly_graph(
  graph,
  collapse_threshold = 0,
  fix = FALSE,
  print_highlow = FALSE,
  highlight_unidentifiable = FALSE,
  pos = NULL,
  nudge_y = -0.1,
  annot = ""
)

Arguments

graph

An admixture graph

collapse_threshold

Collapse nodes if they are separated by less than this distance (for fitted graphs)

fix

If TRUE, there will be an attempt to rearrange the nodes to minimize the number of intersecting edges. This can take very long for large graphs. By default this is only done for graphs with fewer than 10 leaves.

highlight_unidentifiable

Highlight unidentifiable edges in red. Can be slow for large graphs. See unidentifiable_edges.

pos

Optional data frame with node coordinates (columns node, x, y)

Value

A plotly object

Examples

plotly_graph(example_graph)

Get pseudo dates for graph nodes

Description

This function assigns a date (in generations) to each node in an admixture graph and is used in msprime_sim and msprime_genome. The date of the node will correspond to the y-coordinate of that node used for plotting in plotly_graph, unless the fix_leaf option is set to TRUE, in which case the dates of all leaf nodes returned by get_leafnames will be set to 0.

Usage

pseudo_dates(graph, time = 1000, fix_leaf = FALSE)

Arguments

graph

An admixture graph

time

A scalar by which y-coordinate values will be multiplied to get dates

fix_leaf

A boolean specifying if the dates of the leaf nodes will be fixed at time 0 (i.e., at the most recent time). If TRUE, all samples will be drawn at the end of the simulation (i.e., from "today"). The default is FALSE

Value

A named vector with pseudo dates for each graph node


Estimate f3 statistics

Description

Computes f3 statistics of the form f3(A;B,C)f3(A; B, C). This is generally equivalent to (f2(A,B)+f2(A,C)f2(B,C))/2(f2(A, B) + f2(A, C) - f2(B, C)) / 2 and to f4(A,B;A,C)f4(A, B; A, C). However, the exact estimates depend on the type of input data, selection of SNPs, and on the values of some of the arguments, which are described below.

Usage

qp3pop(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  verbose = FALSE,
  ...
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pop1

One of the following four:

  1. NULL: all possible population combinations will be returned

  2. A vector of population labels. All combinations with the other pop arguments will be returned

  3. A matrix with population combinations to be tested, with one population per column and one combination per row. Other pop arguments will be ignored.

  4. the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. Other pop arguments will be ignored.

pop2

A vector of population labels

pop3

A vector of population labels

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

sure

The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.

unique_only

If TRUE (the default), redundant combinations will be excluded

verbose

Print progress updates

...

Additional arguments passed to f3blockdat_from_geno if data is a genotype prefix, or to get_f2 otherwise

apply_corr

With apply_corr = FALSE, no bias correction is performed. With apply_corr = TRUE (the default), a bias correction term based on the heterozygosity in the first population is subtracted from the f3 estimate. The option is ineffective if the first argument is an array of pre-computed f2-statistics. In that case, the bias correction can be toggled by the apply_corr parameter in extract_f2 or f2_from_geno. The heterozygosity calculation used in apply_corr = TRUE requires at least two haploid samples or one diploid sample in the first population. Otherwise, apply_corr = TRUE will result in missing values.

outgroupmode

With outgroupmode = FALSE, estimates of f3 will be normalized by estimates of the heterozygosity of the target population. This is the default option if the first argument is the prefix of genotype data. If the first argument is an array of pre-computed f2-statistics, then no normalization can be performed, which corresponds to outgroupmode = TRUE. As with apply_corr = TRUE, the heterozygosity calculation used in outgroupmode = FALSE requires at least two pseudodiploid samples or one diploid sample in the first population. Otherwise, outgroupmode = FALSE will result in missing values.

poly_only

Only keep SNPs with mean allele frequency not equal to 0 or 1. Defaults to FALSE. Only effective if the first argument is the prefix of genotype data. If the first argument is an array of pre-computed f2-statistics, use the poly_only argument in extract_f2 or f2_from_geno

Details

The default values of the parameters apply_corr, outgroupmode, poly_only depend on the first argument (data), and they can affect the estimated f3-statistics. When the data is the prefix of genotype data, the default parameters are generally the same as in the original qp3pop program. When data is an array of pre-computed f2-statistics, the parameters may be set while computing f2-statistics. If the first population is a single pseudodiploid sample, it is not possible to get unbiased estimates of f3. To get biased estimates for a single pseudodiploid sample from genotype data, set outgroupmode = TRUE and apply_corr = FALSE. Under this combination of parameters, ⁠f3(A; B, C)⁠ should also be identical to ⁠f4(A, B; A, C)⁠, since f4-statistics generally do not require any bias correction. See examples for more information.

Value

qp3pop returns a data frame with f3 statistics, with columns for populations 1 to 3, f3-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Alias

f3

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pop1, pop2, pop3)

pops = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
qp3pop(example_f2_blocks, pops)
qp3pop(example_f2_blocks, pops, unique_only = FALSE)
qp3pop(f2_dir, pop1, pop2, pop3)

# Below are three scenarios, and in each one `qp3pop()` and `qp3pop_wrapper()`
# should give the same or very similar estimates. Note that to compute `f3(A; B, C)`,
# `qp3pop_wrapper()`, and the original qp3pop program, expect populations to be in the order `B`, `C`, `A`.

prefix = '/path/to/geno/prefix'
qp3popbin = '/path/to/AdmixTools/bin/qp3Pop'
pops = dimnames(example_f2_blocks)[[1]]

# 1. target diploid, outgroupmode NO (this is the default when passing a geno file prefix)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = FALSE)
qp3pop(prefix, pops[1], pops[2], pops[3])
  # est, se, z match well; n is higher
qp3pop(prefix, pops[1], pops[2], pops[3], poly_only = TRUE)
  # est, se, z match less well; n is identical

# 2. target diploid, outgroupmode YES (this is the only option with precomputed f2-stats)
qp3pop_wrapper(prefix, pops[2], pops[3], pops[1], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[1], pops[2], pops[3], outgroupmode = TRUE)
  # est, se, z match (except for factor 1000)
f2b = f2_from_geno(prefix, pops = pops[1:3], poly_only = FALSE)
qp3pop(f2b, pops[1], pops[2], pops[3])

# 3. target pseudodiploid (no heterozygotes means heterozygosity rate correction is not possible)
qp3pop_wrapper(prefix, pops[1], pops[3], pops[2], bin = qp3popbin, outgroupmode = TRUE)
qp3pop(prefix, pops[2], pops[1], pops[3], outgroupmode = TRUE, apply_corr = FALSE)
  # est, se, z match (except for factor 1000)

## End(Not run)

Wrapper function around the original qp3Pop program

Description

Computes f3 statistics of the form f3(A;B,C)f3(A; B, C). Equivalent to (f2(A,B)+f2(A,C)f2(B,C))/2(f2(A, B) + f2(A, C) - f2(B, C)) / 2 and to f4(A,B;A,C)f4(A, B; A, C). Requires a working installation of qp3Pop, which will be called using system

Usage

qp3pop_wrapper(
  pref,
  source1,
  source2 = NULL,
  target = NULL,
  bin = "~np29/o2bin/qp3Pop",
  outdir = ".",
  parfile = NULL,
  inbreed = "NO",
  outgroupmode = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

pref

Path to and prefix of the packedancestrymap genotype files

source1

One of the following four:

  1. NULL: Populations will be read from poplistname or popfilename specified in parfile

  2. A vector of population labels

  3. A data frame in which the first four columns specify the population triples to be tested. source2, target will be ignored.

  4. The location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. source2 and target will be ignored.

source2

A vector of population labels

target

A vector of population labels

bin

Path to the qp3Pop binary file

outdir

Output directory. files out, parfile, poplistname, popfilename may be overwritten

parfile

qp3Pop parameter file. If this is specified, source1, source2, target will be ignored.

inbreed

inbreed

outgroupmode

outgroupmode

printonly

Should the command be printed or executed?

env

Export environmental variables. See examples.

verbose

Print progress updates

Value

If printonly, the qp3Pop command, otherwise a data frame with parsed qp3Pop output

Examples

## Not run: 
source1 = c('Altai_Neanderthal.DG', 'Vindija.DG')
source2 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
target = 'Denisova.DG'
qp3pop_wrapper('genotype_prefix', source1, source2, target,
  bin = 'path/to/qp3Pop')

## End(Not run)

Estimate admixture weights

Description

qpadm models a target population as a mixture of left (source) populations, given a set of right (outgroup) populations. It can be used to estimate whether the left populations explain all genetic variation in the target population, relative to the right populations, and to estimate admixture proportions of the left populations to the target population.

Usage

qpadm(
  data,
  left,
  right,
  target,
  f4blocks = NULL,
  fudge = 1e-04,
  fudge_twice = FALSE,
  boot = FALSE,
  getcov = TRUE,
  constrained = FALSE,
  return_f4 = FALSE,
  cpp = TRUE,
  verbose = TRUE,
  ...
)

Arguments

data

The input data in the form of:

  • A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2

  • A directory with f2 statistics

  • The prefix of a genotype file

left

Left populations (sources)

right

Right populations (outgroups)

target

Target population

f4blocks

Instead of f2 blocks, f4 blocks can be supplied. This is used by qpadm_multi

fudge

Value added to diagonal matrix elements before inverting

fudge_twice

Setting this to TRUE should result in p-values that better match those in the original qpAdm program

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

getcov

Compute weights covariance. Setting getcov = FALSE will speed up the computation.

constrained

Constrain admixture weights to be non-negative

return_f4

Return f4-statistics

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

verbose

Print progress updates

...

If data is the prefix of genotype files, additional arguments will be passed to f4blockdat_from_geno

Value

qpadm returns a list with up to four data frames describing the model fit:

  1. weights: A data frame with estimated admixture proportions where each row is a left population.

  2. f4: A data frame with estimated and fitted f4-statistics

  3. rankdrop: A data frame describing model fits with different ranks, including p-values for the overall fit and for nested models (comparing two models with rank difference of one). A model with L left populations and R right populations has an f4-matrix of dimensions (L-1)*(R-1). If no two left population form a clade with respect to all right populations, this model will have rank (L-1)*(R-1).

    • f4rank: Tested rank

    • dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)

    • chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.

    • p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)

    • dofdiff: Difference in degrees of freedom between this model and the model with one less rank

    • chisqdiff: Difference in chi-squared statistics

    • p_nested: p-value testing whether the difference between two models of rank difference 1 is significant

  4. popdrop: A data frame describing model fits with different populations. Note that all models with fewer populations use the same set of SNPs as the first model.

    • pat: A binary code indicating which populations are present in this model. A 1 represents dropped populations. The full model is all zeros.

    • wt: Number of populations dropped

    • dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)

    • chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.

    • p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)

    • f4rank: Tested rank

    • feasible: A model is feasible if all weights fall between 0 and 1

    • ⁠<population name>⁠: The weights for each population in this model

References

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 10)

See Also

qpwave, lazadm

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm(example_f2_blocks, left, right, target)
## Not run: 
# The original ADMIXTOOLS qpAadm program has an option called "allsnps"
# that selects different SNPs for each f4-statistic, which is
# useful when working with sparse genotype data.
# To get the same behavior in ADMIXTOOLS 2, supply the genotype data prefix
# and set `allsnps = TRUE`
qpadm("/my/geno/prefix", left, right, target, allsnps = TRUE)

## End(Not run)

Partition a list of populations into left and right populations

Description

Partition a list of populations into left and right populations

Usage

qpadm_models(pops, allpops = TRUE, more_right = TRUE)

Arguments

pops

A vector of populations

allpops

Return only models which use all provided populations

more_right

Return only models with more right than left populations

Value

A data frame with one qpadm model per row

See Also

qpadm_models, graph_f2_function

Examples

## Not run: 
graph_to_qpadm(get_leafnames(example_igraph))

## End(Not run)

Return all valid qpAdm models for an admixturegraph

Description

For large admixturegraph, there may be a large number of valid qpAdm models!

Usage

qpadm_models_old(graph, add_outgroup = FALSE, nested = TRUE, abbr = -1)

Arguments

graph

An admixture graph as igraph object

add_outgroup

Should the graph outgroup be added to the qpAdm right populations? Technically this shouldn't be an informative outgroup for qpAdm.

nested

Should a nested data frame be returned (TRUE), or should populations be concatenated into strings (FALSE)?

abbr

Maximum number of characters to print for each population. The default (-1) doesn't abbreviate the names.

Examples

## Not run: 
qpadm_models_old(igraph2, add_outgroup = TRUE)
qpadm_models_old(igraph2, add_outgroup = TRUE) %>% slice(1) %$% list(target, left, right)

## End(Not run)

Run multiple qpadm models

Description

This function runs multiple qpadm models, re-using f4-statistics where possible. Supports parallel evaluation of models, which can be turned on with future::plan('multisession') or similar, and turned off with future::plan('sequential').

Usage

qpadm_multi(
  data,
  models,
  allsnps = FALSE,
  full_results = TRUE,
  verbose = TRUE,
  ...
)

Arguments

data

The input data in the form of:

  • A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2

  • A directory with f2 statistics

  • The prefix of a genotype file

models

A nested list (or data frame) with qpadm models. It should consist of two or three other named lists (or columns) containing left, right, (and target) populations.

allsnps

Use all SNPs with allele frequency estimates in every population of any given population quadruple. If FALSE (the default) only SNPs which are present in all populations in popcombs (or any given model in it) will be used. When there are populations with lots of (non-randomly) missing data, allsnps = TRUE can lead to false positive results. This option only has an effect if data is the prefix of a genotype file. If data are f2-statistics, the behavior will be determined by the options that were used in computing the f2-statistics.

verbose

Print progress updates

...

Further arguments passed to qpadm

Value

A list where each element is the output of one qpadm model.

Examples

## Not run: 
# the following specifies two models: one with 2/3/1 and one with 1/2/1 left/right/target populations
models = tibble(
           left = list(c('pop1', 'pop2'), c('pop3')),
           right = list(c('pop5', 'pop6', 'pop7'), c('pop7', 'pop8')),
           target = c('pop10', 'pop10'))
results = qpadm_multi('/my/geno/prefix', models)

## End(Not run)

Faster version of qpadm with reduced output

Description

Models target as a mixture of left populations, given a set of outgroup right populations. Can be used to estimate admixture proportions, and to estimate the number of independent admixture events.

Usage

qpadm_p(
  f2_data,
  left,
  right,
  target = NULL,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  rnk = length(setdiff(left, target)) - 1,
  cpp = TRUE,
  weights = FALSE,
  f4blocks = NULL
)

Arguments

left

Left populations (sources)

right

Right populations (outgroups)

target

Target population

fudge

Value added to diagonal matrix elements before inverting

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

constrained

Constrain admixture weights to be non-negative

rnk

Rank of f4-matrix. Defaults to one less than full rank.

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

weights

Return weights (default = FALSE)

f4blocks

Instead of f2 blocks, f4 blocks can be supplied. This is used by qpadm_multi

Value

Data frame with f4rank, dof, chisq, p, feasible

See Also

qpadm

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_p(example_f2_blocks, left, right, target)

Compute p-values for many qpadm models

Description

This functions evaluates many qpadm models simultaneously by keeping the target population and the rightfix populations fixed, and distributing the leftright populations by keeping some in the set of left population and adding the remaining populations to the right populations. (See details for an example of how models are generated)

Usage

qpadm_rotate(
  f2_blocks,
  leftright,
  target,
  rightfix = NULL,
  full_results = FALSE,
  verbose = TRUE
)

Arguments

f2_blocks

3d array of blocked f2 statistics, output of f2_from_precomp.

leftright

Populations which will be distributed between left and right

target

Target population

rightfix

Populations which will be on the right side in all models

full_results

Return all output items which are returned by qpadm. By default (full_results = FALSE), weights and several other statistics will not be computed for each model, making it faster and the output more readable. If full_results = TRUE, the output will be a nested data frame where each row is one qpadm model, and each column has one data frame item from the regular qpadm output (weights, f4, rankdrop, popdrop).

verbose

Print progress updates

Details

If leftright consists of the populations L1, L2, L3, L4; rightfix is the population R; and target is T, the following models will be genrated:

(left), (right), (target)
(L1), (L2, L3, L4, R), (T)
(L2), (L1, L3, L4, R), (T)
(L3), (L1, L2, L4, R), (T)
(L4), (L1, L2, L3, R), (T)
(L1, L2), (L3, L4, R), (T)
(L1, L3), (L2, L4, R), (T)
(L1, L4), (L2, L3, R), (T)
(L2, L3), (L1, L4, R), (T)
(L2, L4), (L1, L3, R), (T)
(L3, L4), (L1, L2, R), (T)

Value

A data frame with Chi-squared statistics and p-values for each population combination

Examples

## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
qpadm_rotate(example_f2_blocks, leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)

Wrapper function around the original qpAdm program

Description

This requires a working installation of qpAdm, which will be called using system

Usage

qpadm_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

pref

Path to and prefix of the packedancestrymap genotype files

left

Left populations (or leftlist file)

right

Right populations (or rightlist file)

target

Target population

bin

Path to the qpAdm binary file

outdir

Output directory. files out, parfile, leftlist, rightlist will be overwritten

parfile

qpAdm parameter file

allsnps

allsnps

blgsize

blgsize

fancyf4

fancyf4

f4mode

f4mode

inbreed

inbreed

printonly

Should the command be printed or executed?

env

Export environmental variables. See examples.

verbose

Print progress updates

Value

If not printonly, a data frame with parsed qpAdm output

Examples

## Not run: 
left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
target = 'Denisova.DG'
qpadm_wrapper('genotype_prefix', left, right, target,
  bin = 'path/to/qpAdm')

## End(Not run)

Estimate f4 statistics

Description

Computes f4-statistics of the form f4(A,B;C,D)f4(A, B; C, D). For allele frequencies a, b, c, d, ⁠f4(A, B; C, D)⁠ is computed as the average of (a-b)*(c-d) over all SNPs in each SNP block. This is equivalent to (f2(A,D)+f2(B,C)f2(A,C)f2(B,D))/2(f2(A, D) + f2(B, C) - f2(A, C) - f2(B, D)) / 2 (assuming no missing data). The input of this function can either be a 3d array of f2-statistics generated by f2_from_precomp or f2_from_geno, a directory with f2-statistics, or the prefix of genotype files. Computing f4 from genotype files directly is slower, but provides more flexibility in dealing with missing data (see details).

Usage

qpdstat(
  data,
  pop1 = NULL,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  boot = FALSE,
  sure = FALSE,
  unique_only = TRUE,
  comb = TRUE,
  blgsize = NULL,
  block_lengths = NULL,
  f4mode = TRUE,
  afprod = TRUE,
  cpp = TRUE,
  verbose = is.character(data),
  ...
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pop1

One of the following four:

  1. NULL: all possible population combinations will be returned

  2. A vector of population labels. All combinations with the other pop arguments will be returned

  3. A matrix with population combinations to be tested, with one population per column and one combination per row. Other pop arguments will be ignored.

  4. the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. Other pop arguments will be ignored.

pop2

A vector of population labels

pop3

A vector of population labels

pop4

A vector of population labels

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

sure

The number of population combinations can get very large. This is a safety option that stops you from accidently computing all combinations if that number is large.

unique_only

If TRUE (the default), redundant combinations will be excluded

comb

Generate all combinations of pop1, pop2, pop3, pop4. If FALSE, pop1, pop2, pop3, pop4 should all be vectors of the same length.

blgsize

SNP block size in Morgan. Default is 0.05 (5 cM). Only used when data is the prefix of genotype files

block_lengths

Vector with lengths of each jackknife block. sum(block_lengths) has to match the number of SNPs. only used when data is the prefix of genotype files

f4mode

Set this to FALSE to compute D-statistics instead of f4. This only has an effect if the first argument is a genotype prefix. D-statistics are computed as (a-b)*(c-d) / ((a + b - 2*a*b) * (c + d - 2*c*d)), which is the same as (P(BABA) - P(ABBA)) / (P(ABBA) + P(BABA))

afprod

Compute f4 from allele frequency products instead of f2 (default TRUE). Only used if data is a directory with precomputed data. For populations with lots of missing data, this option reduces bias that can result from setting maxmiss to values greater than 0. In all other cases it should not make a difference.

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

verbose

Print progress updates

...

Additional arguments passed to f4blockdat_from_geno if data is a genotype file prefix or f2_from_precomp if data is a directory with f2-statistics

Details

f4- and D-statistics are informative about how four populations are related to one another. Estimates of f4 are unbiased as long as assumptions about SNP ascertainment and mutation rates are met. Missing data can violate the ascertainment assumptions: an f4-statistic may be significantly different when it is calculated from all non-missing SNPs, compared to what it would be if it were calculated from all SNPs in the genome. However, because this difference is often small, f4 is often calculated using samples or populations with missing data, on a particular subset of all SNPs. There are different strategies for choosing the SNPs in this case, and these strategies differ in how many SNPs they use, how likely they lead to bias, and whether pre-computed f2-statistics can be used.

  • Use the same SNPs for every f4-statistic
    This is the most conservative option, but also the option which will use the smallest number of SNPs, which may result in a lack of power. It is the default option when pre-computing f2-statistics (maxmiss = 0 in extract_f2 or f2_from_geno).

  • Use different SNPs for each f4-statistic
    This option strikes a balance between avoiding bias and using a larger number of SNPs. For each f4-statistic it selects all SNPs which are present in all four populations. This option only works when the first argument is a genotype prefix (it doesn't work with pre-computed f2-statistics). In that case, this option is used by default. To turn it off and instead use the same SNPs for every f4-statistic, set allsnps = FALSE. This option is the default option in the original qpDstat program (it is in fact the only option for selecting SNPs in the original qpDstat; however in the original qpAdm and qpGraph programs, this mode of selecting SNPs can be enabled with the allsnps option, whereas the default mode in qpAdm and qpGraph is equivalent to maxmiss = 0)

  • Use different SNPs for each f2-statistic
    This is the least conservative option, but also the option which uses most of the available information. It makes it possible to use pre-computed f2-statistics for a large number of populations without losing a large number of SNPs. To use this option, set the maxmiss parameter to a value greater than 0 (and not larger than 1) in extract_f2 or f2_from_geno. When using this option, be aware that bias is possible, in particular for f4-statistics where some populations have large amounts of missing data. To reduce the bias that can result from using this option, you may want to combine it with using the option afprod = TRUE in f2_from_precomp. In summary, whenever you work with populations with missing data, there is no guarantee that f4- or D-statistics involving these populations are not skewed in some way. If you choose to analyze these populations anyway, and you decide which SNPs to use, there is a trade-off between maximizing power and minimizing the risk of bias. One strategy might be to first use the least conservative option (setting maxmiss = 1 in extract_f2) to get an overview, and then spot-check individual results using more conservative options.

Value

qpdstat returns a data frame with f4 statistics, with columns for populations 1 to 4, f4-estimate (est), standard error of the estimate (se), z-score (z, est/se), p-value (2*(1-pnorm(z))), and the number of SNPs used (n; only if first argument is genotype prefix)

Alias

f4

References

Patterson, N. et al. (2012) Ancient admixture in human history Genetics

Peter, B. (2016) Admixture, Population Structure, and F-Statistics Genetics

Examples

pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat(example_f2_blocks, pop1, pop2, pop3, pop4)
## Not run: 
qpdstat(f2_dir, pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# compute D-statistics instead
qpdstat("/geno/prefix", pop1, pop2, pop3, pop4)

## End(Not run)
## Not run: 
# make a data frame with the population combinations for which f4 should be computed
combinations = tibble(pop1 = pop1, pop2 = pop2[1], pop3 = pop3, pop4 = pop4)
qpdstat(example_f2_blocks, combinations)

## End(Not run)

Wrapper function around the original qpDstat program

Description

This requires a working installation of qpDstat, which will be called using system

Usage

qpdstat_wrapper(
  pref,
  pop1,
  pop2 = NULL,
  pop3 = NULL,
  pop4 = NULL,
  bin = "~np29/o2bin/qpDstat",
  outdir = ".",
  parfile = NULL,
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

pref

Path to and prefix of the packedancestrymap genotype files

pop1

One of the following four:

  1. NULL: populations will be read from poplistname or popfilename specified in parfile

  2. A vector of population labels

  3. A data frame in which the first four columns specify the population combinations to be tested. pop2, pop3, pop4 will be ignored.

  4. the location of a file (poplistname or popfilename) which specifies the populations or population combinations to be tested. pop2, pop3, pop4 will be ignored.

pop2

A vector of population labels

pop3

A vector of population labels

pop4

A vector of population labels

bin

Path to the qpDstat binary file

outdir

Output directory. files out, parfile, poplistname, popfilename may be overwritten

parfile

qpDstat parameter file. If this is specified, pop, pop2, pop3, and pop4 will be ignored.

f4mode

f4mode

inbreed

inbreed

printonly

Should the command be printed or executed?

env

Export environmental variables. See examples.

verbose

Print progress updates

Value

If printonly, the qpDstat command, otherwise a data frame with parsed qpDstat output

Examples

## Not run: 
pop1 = 'Denisova.DG'
pop2 = c('Altai_Neanderthal.DG', 'Vindija.DG')
pop3 = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG')
pop4 = 'Switzerland_Bichon.SG'
qpdstat_wrapper('genotype_prefix', pop1, pop2, pop3, pop4,
  bin = 'path/to/qpDstat', pref = 'path/to/packedancestrymap_prefix')

## End(Not run)

Estimate f4 differences

Description

Estimate f4 differences

Usage

qpf4diff(data, pops, boot = FALSE, verbose = FALSE)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pops

A vector of 5 populations or a five column population matrix. The following ratios will be computed: ⁠f4(1, 2; 3, 4)/f4(1, 2; 5, 4)⁠

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

verbose

Print progress updates

Value

qpf4diff returns a data frame with f4 ratios


Estimate admixture proportions via f4 ratios

Description

Estimate admixture proportions via f4 ratios

Usage

qpf4ratio(data, pops, boot = FALSE, verbose = FALSE)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or f2_from_geno (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pops

A vector of 5 populations or a five column population matrix. The following ratios will be computed: ⁠f4(1, 2; 3, 4)/f4(1, 2; 5, 4)⁠

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

verbose

Print progress updates

Value

qpf4ratio returns a data frame with f4 ratios


Wrapper function around the original qpF4ratio program

Description

This requires a working installation of qpF4ratio, which will be called using system

Usage

qpf4ratio_wrapper(
  pref,
  pops,
  bin = "~np29/o2bin/qpF4ratio",
  outdir = ".",
  parfile = NULL,
  blgsize = 0.05,
  fancyf4 = "YES",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

pref

Path to and prefix of the packedancestrymap genotype files

pops

A vector of five populations, or a 5 x n matrix with population names. For each line alpha will be computed as ⁠f4(1,2; 3,4)/f4(1,2; 5,4)⁠

bin

Path to the qpF4ratio binary file

outdir

Output directory. files out, parfile, poplistname, popfilename may be overwritten

parfile

qpF4ratio parameter file. If this is specified, pops will be ignored.

blgsize

blgsize

fancyf4

fancyf4

printonly

Should the command be printed or executed?

env

Export environmental variables. See examples.

verbose

Print progress updates

Value

If printonly, the qpF4ratio command, otherwise a data frame with parsed qpF4ratio output

Examples

## Not run: 
pops = c('Denisova.DG', 'Altai_Neanderthal.DG', 'Vindija.DG', 'Chimp.REF', 'Mbuti.DG')
qpf4ratio_wrapper('genotype_prefix', pops, bin = 'path/to/qpDstat')

## End(Not run)

Get smoothed f2-statistics

Description

This function returns an array of (pseudo-) f2-statistics which are computed by taking into account other f2-, f3-, and f4-statistics. The advantage of doing that is that f3- and f4-statistics computed from these smoothed f2-statistics can be more accurate for populations with large amounts of missing data. The function uses SNPs which are missing in some populations in a manner which tends to introduce less bias than setting maxmiss to values greater than 0.

Usage

qpfstats(
  pref,
  pops,
  include_f2 = TRUE,
  include_f3 = TRUE,
  include_f4 = TRUE,
  verbose = TRUE
)

Arguments

pref

Prefix of genotype files

pops

Populations for which to compute f2-statistics

include_f2

Should f2-statistics be used to get smoothed f2-statistics? If include_f2 is a positive integer, it specifies how many randomly chosen f2-statistics should be used.

include_f3

Should f3-statistics be used to get smoothed f2-statistics? If include_f3 is a positive integer, it specifies how many randomly chosen f3-statistics should be used.

include_f4

Should f4-statistics be used to get smoothed f2-statistics? If include_f4 is a positive integer, it specifies how many randomly chosen f4-statistics should be used.

Value

A 3d-array of smoothed f2-statistics

Examples

## Not run: 
f2_blocks = qpfstats(geno_prefix, mypops)

## End(Not run)

Compute the fit of an admixture graph

Description

Computes the fit of a given admixturegraph from f2-statistics. Drift edge weights and admixture edges weights are optimized until the (negative) likelihood score is minimized. The likelihood score is based on the squared difference between estimated and fitted f3-statistics.

Usage

qpgraph(
  data,
  graph,
  lambdascale = 1,
  boot = FALSE,
  diag = 1e-04,
  diag_f3 = 1e-05,
  lsqmode = FALSE,
  numstart = 10,
  seed = NULL,
  cpp = TRUE,
  return_fstats = FALSE,
  return_pvalue = FALSE,
  f3precomp = NULL,
  f3basepop = NULL,
  constrained = TRUE,
  allsnps = FALSE,
  ppinv = NULL,
  f2_blocks_test = NULL,
  verbose = FALSE
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2 (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

graph

An admixture graph represented as a matrix of edges, an igraph object, or the path to a qpGraph graph file. Edges can be constrained by providing a matrix or data frame of edges with columns titled lower and upper with lower and upper bounds, respectively. By default, admixture edges are constrained to be between zero and one (with paired edges summing to one), and drift edges have a lower bound at zero.

lambdascale

Scales f2-statistics. This has no effect on the fit, but is used in the original qpGraph program to display branch weights on a scale that corresponds to FST distances.

boot

If FALSE (the default), each block will be left out at a time and the covariance matrix of f3 statistics will be computed using block-jackknife. Otherwise bootstrap resampling is performed n times, where n is either equal to boot if it is an integer, or equal to the number of blocks if boot is TRUE. The covariance matrix of f3 statistics will be computed using bootstrap resampling.

diag

Regularization term added to the diagonal elements of the covariance matrix of fitted branch lengths (after scaling by the matrix trace). Default is 0.0001.

diag_f3

Regularization term added to the diagonal elements of the covariance matrix of estimated f3 statistics (after scaling by the matrix trace). In the original qpGraph program, this is fixed at 0.00001.

lsqmode

Least-squares mode. If TRUE, the likelihood score will be computed using a diagonal matrix with 1/(sum(diag(f3_var)) * diag_f3), in place of the inverse f3-statistic covariance matrix.

lsqmode = 2 will use the identity matrix instead, which is equivalent to computing the score as the sum of squared residuals (sum((f3_est-f3_fit)^2)).

Both of these options do not take the covariance of f3-statistics into account. This can lead to bias, but is more stable in cases where the inverse f3-statistics covariance matrix can not be estimated precisely (for example because the number of populations is large). An alternative to lsqmode = TRUE that doesn't completely ignore the covariance of f3-statistics is to increase diag_f3.

numstart

Number of random initializations of starting weights. Defaults to 10. Increasing this number will make the optimization slower, but reduce the risk of not finding the optimal weights. Check the opt output to see how much the optimization depends on the starting weights.

seed

Random seed for generating starting weights.

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

return_fstats

Return estimated and fitted f2- and f4-statistics, as well as the worst f4-statistic residual Z-score. Defaults to FALSE because this can be slow.

f3precomp

Optional precomputed f3-statistics. This should be the output of qpgraph_precompute_f3 and can be provided instead of data. This can speed things up if many graphs are evaluated using the same set of f3-statistics.

f3basepop

Optional f3-statistics base population. Inference will be based on f3-statistics of the form ⁠f3(f3basepop; i, j)⁠ for all population pairs ⁠(i, j)⁠. Defaults to the outgroup population if the graph has one. This option is ignored if f3precomp is provided. Changing f3basepop should make very little difference.

constrained

Constrain estimated drift edge weights to be non-negative, and admixture edge weights to be between zero and one.

allsnps

Compute f3 from different SNPs for each population triplet (if data is missing for some SNPs and populations). This only has an effect when data is the prefix of genotype files.

ppinv

Optional inverse f3-statistics covariance matrix

f2_blocks_test

An optional 3d array of f2-statistics used for computing an out-of-sample score. This should contain only SNP blocks which are not part of f2_blocks. This allows to estimate the fit of a graph without overfitting and will not be used during the optimization step

verbose

Print progress updates

Value

qpgraph returns a list with data describing the model fit:

  • edges: A data frame where each row is an edge in the graph. For regular edges, the column weight is the estimated edge length, and for admixture edges, it is the estimated admixture weight.

  • score: The likelihood score of the fitted graph. Lower values correspond to better fits. The score is calculated as the inner product of the residuals (difference between estimated and fitted f3 statistics), weighted by the inverse of the f3 covariance matrix. See qpgraph_score

  • f2: Estimated and fitted f2 statistics (if return_fstats = TRUE). p-values and z-scores test the significance of the difference.

  • f3: Estimated and fitted f3 statistics. p-values and z-scores test the significance of the difference.

  • f4: Estimated and fitted f4 statistics (if return_fstats = TRUE). p-values and z-scores test the significance of the difference.

  • opt: A data frame with details of the weight-fitting step, including the randomly sampled starting weights. The column value contains the score for each set of starting weights. Columns starting with x denote initial weights, and columns starting with y denote fitted weights.

  • worst_residual: The highest absolute z-score of f4-statistics residuals (fitted - estimated f4); (returned if return_fstats = TRUE)

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

See Also

qpgraph_wrapper for a wrapper functions which calls the original qpGraph program.

Examples

out = qpgraph(example_f2_blocks, example_graph)
plot_graph(out$edges)

Compute f3-statistics from f2-statistics.

Description

Takes a 3d array of f2 block jackknife estimates and computes f3-statistics between the first population p1p1 and all population pairs i,ji, j: f3(p1;pi,pj)f3(p1; p_i, p_j)

Usage

qpgraph_precompute_f3(
  data,
  pops,
  f3basepop = NULL,
  lambdascale = 1,
  boot = FALSE,
  seed = NULL,
  diag_f3 = 1e-05,
  lsqmode = FALSE
)

Arguments

data

Input data in one of three forms:

  1. A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2 (fastest option)

  2. A directory which contains pre-computed f2-statistics

  3. The prefix of genotype files (slowest option)

pops

Populations for which to compute f3-statistics

f3basepop

f3-statistics base population. If NULL (the default), the first population in pops will be used as the basis.

lambdascale

Scales f2-statistics. This has no effect on the fit, but is used in the original qpGraph program to display branch weights on a scale that corresponds to FST distances.

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

seed

Random seed used if boot is TRUE.

diag_f3

Regularization term added to the diagonal elements of the covariance matrix of estimated f3 statistics (after scaling by the matrix trace). In the original qpGraph program, this is fixed at 0.00001.

lsqmode

Least-squares mode. If TRUE, the likelihood score will be computed using a diagonal matrix with 1/(sum(diag(f3_var)) * diag_f3), in place of the inverse f3-statistic covariance matrix. lsqmode = 2 will use the identity matrix instead, which is equivalent to computing the score as the sum of squared residuals (sum((f3_est-f3_fit)^2)). Both of these options do not take the covariance of f3-statistics into account. This can lead to bias, but is more stable in cases where the inverse f3-statistics covariance matrix can not be estimated precisely (for example because the number of populations is large). An alternative to using lsqmode = TRUE which doesn't completely ignore the covariance of f3-statistics is to increase diag_f3.

Value

A list with four items

  1. f3_est a matrix with f3-statistics for all population pairs with the output

  2. ppinv a matrix with the inverse of the f3-statistic covariance matrix

  3. f2out a data frame with f2 estimates

  4. f3out a data frame with f3 estimates

Examples

pops = get_leafnames(example_igraph)
qpgraph_precompute_f3(example_f2_blocks, pops)$f3_est
## Not run: 
qpgraph_precompute_f3(f2_dir, pops)

## End(Not run)

Evaluate a qpgraph models many times

Description

This function is used in combination with compare_fits in order to test whether one graph has a significantly better fit than another. It creates bootstrap resampled SNP block training and test sets, and uses them to evaluate multiple graphs.

Usage

qpgraph_resample_multi(f2_blocks, graphlist, nboot, verbose = TRUE, ...)

Arguments

f2_blocks

3d array of f2-statistics

graphlist

A list of admixture graphs

nboot

Number of bootstrap iterations

verbose

Print progress updates

...

Arguments passed to qpgraph

Value

A list of same length as graphlist with data frames with qpgraph results for each iteration of bootstrap resampled f2-statistics

See Also

compare_fits

Examples

## Not run: 
fits = qpgraph_resample_multi(f2_blocks, list(graph1, graph2), nboot = 100)
compare_fits(fits[[1]]$score_test, fits[[2]]$score_test)

## End(Not run)

Evaluate a qpgraph model many times

Description

This function is used in combination with compare_fits in order to test whether one graph has a significantly better fit than another. using a list of bootstrap resampled f2-statistics and corresponding test-set f2-statistics

Usage

qpgraph_resample_snps2(f2_blocks, graph, f2_blocks_test, verbose = TRUE, ...)

Arguments

f2_blocks

A list of bootstrap resampled f2-statistics

graph

An admixture graph

f2_blocks_test

A list of f2-statistics from all blocks which were not used in the corresponding f2-array in f2_blocks

verbose

Print progress updates

...

Arguments passed to qpgraph

Value

A data frame with qpgraph results for each iteration of bootstrap resampled f2-statistics

See Also

compare_fits boo_list


Wrapper function around the original qpGraph program

Description

Wrapper function around the original qpGraph program

Usage

qpgraph_wrapper(
  pref,
  graph,
  bin = "~np29/o2bin/qpGraph",
  parfile = NULL,
  outdir = ".",
  printonly = FALSE,
  badsnps = NULL,
  lambdascale = -1,
  inbreed = "NO",
  diag = 1e-04,
  blgsize = 0.05,
  outpop = "NULL",
  loadf3 = NULL,
  lsqmode = "NO",
  fstdmode = "NO",
  hires = "NO",
  forcezmode = "NO",
  zthresh = 0,
  allsnps = "NO",
  oldallsnps = "NO",
  doanalysis = "YES",
  bigiter = 100,
  initmix = 0,
  env = "",
  verbose = TRUE
)

Arguments

pref

Prefix of the packedancestrymap format genotype files.

graph

An admixture graph or qpGraph graph file

bin

Location of the qpGraph binary

parfile

qpGraph parameter file

outdir

Output directory

printonly

Should output be executed or the command just be printed?

badsnps

badsnps

lambdascale

lambdascale

inbreed

inbreed

diag

diag

blgsize

blgsize

outpop

outgroup population

loadf3

loadf3

lsqmode

least-squares mode. sets the offdiagonal elements of the block-jackknife covariance matrix to zero.

fstdmode

fstdmode

hires

hires

forcezmode

forcezmode

zthresh

zthresh

allsnps

allsnps

oldallsnps

oldallsnps

doanalysis

doanalysis

bigiter

bigiter

initmix

initmix

env

Export environmental variables. See examples.

verbose

Print progress updates

Value

A list with parsed qpGraph output

  1. edges: data frame

  2. score: scalar

  3. f2: data frame

Examples

## Not run: 
qpgraph_wrapper('genotype_prefix', example_graph,
                 bin = 'path/to/qpGraph')

## End(Not run)

Estimate admixture waves

Description

qpwave compares two sets of populations (left and right) to each other. It estimates a lower bound on the number of admixtue waves that went from left into right, by comparing a matrix of f4-statistics to low-rank approximations. For a rank of 0 this is equivalent to testing whether left and right form clades relative to each other.

Usage

qpwave(
  data,
  left,
  right,
  fudge = 1e-04,
  boot = FALSE,
  constrained = FALSE,
  cpp = TRUE,
  verbose = TRUE
)

Arguments

data

The input data in the form of:

  • A 3d array of blocked f2 statistics, output of f2_from_precomp or extract_f2

  • A directory with f2 statistics

  • The prefix of a genotype file

left

Left populations (sources)

right

Right populations (outgroups)

fudge

Value added to diagonal matrix elements before inverting

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

constrained

Constrain admixture weights to be non-negative

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.

verbose

Print progress updates

Value

qpwave returns a list with up to two data frames describing the model fit:

  1. f4 A data frame with estimated f4-statistics

  2. rankdrop: A data frame describing model fits with different ranks, including p-values for the overall fit and for nested models (comparing two models with rank difference of one). A model with L left populations and R right populations has an f4-matrix of dimensions (L-1)*(R-1). If no two left population form a clade with respect to all right populations, this model will have rank (L-1)*(R-1).

    • f4rank: Tested rank

    • dof: Degrees of freedom of the chi-squared null distribution: (L-1-f4rank)*(R-1-f4rank)

    • chisq: Chi-sqaured statistic, obtained as ⁠E'QE⁠, where E is the difference between estimated and fitted f4-statistics, and Q is the f4-statistic covariance matrix.

    • p: p-value obtained from chisq as pchisq(chisq, df = dof, lower.tail = FALSE)

    • dofdiff: Difference in degrees of freedom between this model and the model with one less rank

    • chisqdiff: Difference in chi-squared statistics

    • p_nested: p-value testing whether the difference between two models of rank difference 1 is significant

References

Patterson, N. et al. (2012) Ancient admixture in human history. Genetics

Haak, W. et al. (2015) Massive migration from the steppe was a source for Indo-European languages in Europe. Nature (SI 10)

See Also

qpadm

Examples

left = c('Altai_Neanderthal.DG', 'Vindija.DG')
right = c('Chimp.REF', 'Mbuti.DG', 'Russia_Ust_Ishim.DG', 'Switzerland_Bichon.SG')
qpwave(example_f2_blocks, left, right)

Compute all pairwise qpwave p-values

Description

For all pairs of left populations qpwave rank 0 Chi-squared statistics and p-values will be computed. This tests for each pair of left populations whether it forms a clade with respect to the right populations.

Usage

qpwave_pairs(f2_data, left, right)

Arguments

f2_data

A 3d array of blocked f2 statistics, output of f2_from_precomp. Alternatively, a directory with f2 statistics.

left

Left populations

right

Right populations

Examples

## Not run: 
left = pops[5:7]
right = pops[1:4]
f2_blocks = f2_from_precomp('/my/f2/dir/', pops = left, pops2 = right, afprod = TRUE)
qpwave_pairs(f2_blocks, left, right)
# If f2-stats are too big to load them into memory,
# the following will read the data for each pair from disk:
qpwave_pairs('/my/f2/dir/', left, right)

## End(Not run)

Wrapper function around the original qpWave program

Description

Wrapper function around the original qpWave program

Usage

qpwave_wrapper(
  pref,
  left,
  right,
  target = NULL,
  bin = "~np29/o2bin/qpAdm",
  outdir = "./",
  parfile = NULL,
  allsnps = "NO",
  blgsize = 0.05,
  fancyf4 = "NO",
  f4mode = "YES",
  inbreed = "NO",
  printonly = FALSE,
  env = "",
  verbose = TRUE
)

Arguments

left

Left populations (sources)

right

Right populations (outgroups)

target

Target population

verbose

Print progress updates


Generate a random admixture graph

Description

This function randomly generates an admixture graph for a given set of leaf nodes

Usage

random_admixturegraph(
  leaves,
  numadmix = 0,
  simple = TRUE,
  outpop = NULL,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL,
  ntry = 100
)

Arguments

leaves

Names of the leaf nodes, or a number specifying how many leaf nodes there should be

numadmix

Number of admixture events

simple

Should edges leading to admixture nodes consist of separate admix edges and normal edges

outpop

Outgroup population

Examples

rand_graph = random_admixturegraph(10, numadmix = 5)
plot_graph(rand_graph)

Get random dates for graph nodes

Description

This function assigns a date (in generations) to each node in an admixture graph and is used in random_sim. The dates are drawn from a uniform distribution with given lower and upper bounds, unless the fix_leaf option is set to TRUE, in which case the dates of all leaf nodes returned by get_leafnames will be set to 0.

Usage

random_dates(graph, min = 1000, max = 1000, fix_leaf = FALSE)

Arguments

graph

An admixture graph

min

A lower limit of uniform distribution

max

An upper limit of uniform distribution

fix_leaf

A boolean specifying if the dates of the leaf nodes will be fixed at time 0 (i.e., at the most recent time). If TRUE, all samples will be drawn at the end of the simulation (i.e., from "today"). The default is FALSE.

Value

A named vector with random dates for each graph node


Generate a random binary graph

Description

Generate a random binary graph

Usage

random_newick(n, start = "", end = "", outpop = NULL)

Arguments

n

The number of terminal nodes, or a vector of population labels.

start

Prefix.

end

Postfix.

outpop

Outgroup (if n is a vector of labels).

Value

Tree in newick format.

Examples

random_newick(5)
random_newick(c('a', 'b', 'c', 'd')) # toplogy random, but pop order fixed
random_newick(sample(c('a', 'b', 'c', 'd'))) # toplogy and pop order random

Generate a random graph and simulate it in msprime v1.x

Description

This is basically a wrapper function around the msprime_genome that allows user to create a random graph and simulate it in msprime v1.x.

Usage

random_sim(
  nleaf,
  nadmix,
  outpref = "random_sim",
  max_depth = NULL,
  ind_per_pop = 1,
  mutation_rate = 1.25e-08,
  admix_weights = 0.5,
  neff = 1000,
  time = 1000,
  fix_leaf = FALSE,
  outpop = NULL,
  nchr = 1,
  recomb_rate = 2e-08,
  seq_length = 1000,
  ghost_lineages = TRUE,
  run = FALSE
)

Arguments

nleaf

The number of leaf nodes

nadmix

The number of admixture events

outpref

A prefix of output files

max_depth

A constraint specifying the maximum time depth of the admixture graph (in generations)

ind_per_pop

The number of individuals to simulate for each population. If a scalar value, it will be constant across all populations. Alternatively, it can be a named vector with a different value for each population.

mutation_rate

Mutation rate per site per generation. The default is 1.25e-8 per base pair per generation.

admix_weights

Admixture weights. If a float value (0 < value < 1), admixture weights for each admixture event will be (value, 1-value). Alternatively, it can be a range, i.e., c(0.1, 0.4) specifying lower and upper limits of a uniform distribution from which the admixture weight value will be drawn. By default, all admixture edges have a weight of 0.5.

neff

Effective population size (in diploid individuals). If a scalar value, it will be constant across all populations. Alternatively, it can be a range, i.e., c(500, 1000) specifying lower and upper limits of an uniform distribution from which values will be drawn

time

Time between nodes. Either a scalar value (1000 by default) with the dates generated by pseudo_dates, or a range, i.e., c(500, 1000) specifying lower and upper limits of a uniform distribution from which values will be drawn (see random_dates)

fix_leaf

A boolean specifying if the dates of the leaf nodes will be fixed at time 0. If TRUE, all samples will be drawn at the end of the simulation (i.e., from “today”).

outpop

A name of the (optional) outgroup population.

nchr

The number of chromosomes to simulate

recomb_rate

A float value specifying recombination rate along the chromosomes. The default is 2e-8 per base pair per generation.

seq_length

The sequence length of the chromosomes. If it is a scalar value, the sequence length will be constant for all chromosomes. Alternatively, it can be a vector with a length equal to number of chromosomes (i.e., c(100,50) to simulate 2 chromosomes with the lengths of 100 and 50 base pairs).

ghost_lineages

A boolean value specifying whether ghost lineages will be allowed. If TRUE, admixture happens at the time points defined by the y-axis generated while plotting the graph by plot_graph If FALSE (default), admixture occurs at the time of the previous split event

run

If FALSE, the function will terminate after writing the msprime script. If TRUE, it will try and execute the function with the default python installation. If you want to use some other python installation, you can set ⁠run = /my/python⁠.

Value

A list with the path of simulation script, a data frame of graph edges, dates and population sizes:

  • out - A file name and path of the simulation script

  • edges - An edge dataframe with admixture weights

  • dates - A named vector with a date for each node

  • neffs - A named vector with an effective population size for each node

Examples

# Create simulation script that simulates 2 chromosomes that are 50base long
# where maximum depth of the tree is 5000 generations, and plot the output graph
## Not run: 
out = random_sim(nleaf=4, nadmix=0, max_depth=5000, nchr=2, seq_length=50)
plot_graph(out$edges, dates = out$dates, neff = out$neff, hide_weights = TRUE)

## End(Not run)

Read genotype data from EIGENSTRAT files

Description

Read genotype data from EIGENSTRAT files

Usage

read_eigenstrat(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)

Arguments

pref

Prefix of the packedancestrymap files

inds

Individuals for which data should be read. Defaults to all individuals

pops

Populations for which data should be read. Cannot be provided together with 'inds'

first

Index of first SNP to read

last

Index of last SNP to read

transpose

Transpose genotype matrix (default is snps x individuals)

verbose

Print progress updates

Value

A list with the genotype data matrix, the .ind file, and the .snp file

Examples

## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)

Read blocked f2 estimates from disk

Description

This function reads blocked f2 estimates (or allele frequency products) which were writtend to disk by write_f2 and returns them as a 3d array.

Usage

read_f2(
  f2_dir,
  pops = NULL,
  pops2 = NULL,
  type = "f2",
  counts = FALSE,
  remove_na = TRUE,
  verbose = FALSE
)

Arguments

f2_dir

Directory from which to read files

pops

Populations for which f2 statistics should be read. Defaults to all populations, which may require a lot of memory.

pops2

Specify this if you only want to read a subset of all population pairs. The resulting array will differ on 1st and 2nd dimension and will not work with all functions.

counts

Return allele counts instead of f2 estimates

remove_na

Remove blocks with missing values

verbose

Print progress updates

afprod

Return allele frequency products instead of f2 estimates

Value

A 3d array of block jackknife estimates

See Also

write_f2

Examples

## Not run: 
read_f2(f2_dir, pops = c('pop1', 'pop2', 'pop3'))

## End(Not run)

Read genotype data from packedancestrymap files

Description

Read genotype data from packedancestrymap files

Usage

read_packedancestrymap(
  pref,
  inds = NULL,
  pops = NULL,
  first = 1,
  last = Inf,
  transpose = FALSE,
  verbose = TRUE
)

Arguments

pref

Prefix of the packedancestrymap files

inds

Individuals for which data should be read. Defaults to all individuals

pops

Populations for which data should be read. Cannot be provided together with 'inds'

first

Index of first SNP to read

last

Index of last SNP to read

transpose

Transpose genotype matrix (default is snps x individuals)

verbose

Print progress updates

Value

A list with the genotype data matrix, the .ind file, and the .snp file

Examples

## Not run: 
samples = c('Ind1', 'Ind2', 'Ind3')
geno = read_packedancestrymap(prefix, samples)

## End(Not run)

Run models while leaving out individuals

Description

These are wrapper functions around various qp functions, which will evaluate many models at once. The models are formed by leaving out every individual which belongs to a population with more than one sample, one at a time.

Usage

qp3pop_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpdstat_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpwave_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpadm_resample_inds(dir, inds, pops, verbose = TRUE, ...)

qpgraph_resample_inds(dir, inds, pops, verbose = TRUE, ...)

Arguments

dir

directory with precomputed data

inds

vector of individual names

pops

vector of population names. Should be the same length as inds

verbose

print progress updates

...

named arguments passed to the qp function.

Value

a nested data frame where each model is a row, and the columns are model parameters and model outputs

Examples

## Not run: 
res = qpadm_resample_inds
unnest(res, weights)

## End(Not run)

Run models while leaving out SNP blocks

Description

These are wrapper functions around various qp functions, which will evaluate a model on multiple subdivisions of the data. The subdivisions are formed by leaving out one or more SNP block at a time (see boot for details).

Usage

qp3pop_resample_snps(f2_blocks, boot = FALSE, ...)

qpdstat_resample_snps(f2_blocks, boot = FALSE, ...)

qpwave_resample_snps(f2_blocks, boot = FALSE, ...)

qpadm_resample_snps(f2_blocks, boot = FALSE, ...)

qpgraph_resample_snps(f2_blocks, boot = FALSE, ...)

Arguments

f2_blocks

a 3d array of blocked f2 statistics

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

...

named arguments which are passed to the qp function.

verbose

print progress updates

Value

a nested data frame where each model is a row, and the columns are model parameters and model outputs

Examples

## Not run: 
res = qpadm_resample_snps(example_f2_blocks, target = target, left = left, right = right)
unnest(res, weights)

## End(Not run)

Rotate populations between left and right

Description

This functions creates a data frame with population combinations which can be used as the input for qpadm_multi

Usage

rotate_models(leftright, target, rightfix = NULL)

Arguments

leftright

Populations which will be distributed between left and right

target

Target population

rightfix

Populations which will be on the right side in all models

Value

A data frame with Chi-squared statistics and p-values for each population combination

Examples

## Not run: 
pops = dimnames(example_f2_blocks)[[1]]
rotate_models(leftright = pops[1:4],
             target = pops[5], rightfix = pops[6:7])

## End(Not run)

Launch ADMIXTOOLS 2 GUI

Description

Launch ADMIXTOOLS 2 GUI

Usage

run_shiny_admixtools()

Test constraints on a graph

Description

This function tests whether a graph satisfies a number of different types of constraints

Usage

satisfies_constraints(
  graph,
  nonzero_f4 = NULL,
  admix_constraints = NULL,
  event_order = NULL
)

Arguments

graph

An admixture graph

nonzero_f4

A data frame or matrix with four columns. One row for each f4-statistic which is observed to be significantly non-zero

admix_constraints

A data frame with columns pop, min, max

event_order

A data frame with columns earlier1, earlier2, later1, later2

Value

TRUE if all admixture constraints are satisfied, else FALSE

See Also

satisfies_numadmix, satisfies_zerof4, satisfies_eventorder ## Not run: # At least one admixture event for C, and none for D: constrain_cd = tibble(pop = c('C', 'D'), min = c(1, NA), max = c(NA, 0)) satisfies_numadmix(random_admixturegraph(5, 2), constrain_cd) ## End(Not run)


Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with all event orders listed in eventorder

Usage

satisfies_eventorder(graph, eventorder, strict = TRUE)

Arguments

graph

An admixture graph

eventorder

A data frame with columns earlier1, earlier2, later1, later2, optionally type

strict

What to do in case some events are not determined by the graph. If strict = TRUE (the default), the function will only return TRUE if there are no ambiguous constraints. Otherwise, TRUE will be returned as long as no constraint is directly contradicted.

Details

Each row in eventorder represents a constraint that earlier1 and earlier2 split earlier than later1 and later2. If later2 is NA, later2 will be set to the parent node of later1. By default (type = 1), a constraint will be satisfied as long as there is any lineage in which a split between earlier1 and earlier2 is ancestral to a split between later1 and later2. type = 2 is stricter and requires that the earlier1, earlier2 split is ancestral to the later1, later2 split in all lineages. In graphs with multiple admixture events there can be multiple splits between earlier1, earlier2 and later1, later2, and many ways in which these splits can relate to each other. The current implementation only covers some of the many possible topological relationships.

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether the split between A and B is earlier than the split between C and D,
#   and whether the split between C and D is earlier than the terminal branch leading to E
constrain_events = tribble(
  ~earlier1, ~earlier2, ~later1, ~later2,
  'A', 'B', 'C', 'D',
  'C', 'D', 'E', NA)
satisfies_eventorder(random_admixturegraph(5, 0), eventorder = constrain_events)

## End(Not run)

Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with the f4-statistics listed in nonzero_f4 being non-zero

Usage

satisfies_nonzerof4(graph, nonzero_f4)

Arguments

graph

An admixture graph

nonzero_f4

A data frame or matrix with four columns. One row for each f4-statistic which is expected to be non-zero

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether f4(A,B; C,D) is expected to be non-zero in this graph:
nonzero_f4 = matrix(c('A', 'B', 'C', 'D'), 1)
satisfies_nonzerof4(random_admixturegraph(5, 2), nonzero_f4)

## End(Not run)

Test admixture constraints on a graph

Description

This function returns TRUE if and only if the admixture graph satisfies all constraints on the number of admixture events for the populations in admix_constraints

Usage

satisfies_numadmix(graph, admix_constraints)

Arguments

graph

An admixture graph

admix_constraints

A data frame with columns pop, min, max

Value

TRUE if all admixture constraints are satisfied, else FALSE

Examples

## Not run: 
# At least one admixture event for C, and none for D:
constrain_cd = tribble(
  ~pop, ~min, ~max,
  'C', 1, NA,
  'D', NA, 0)
satisfies_numadmix(random_admixturegraph(5, 2), constrain_cd)

## End(Not run)

Test f4 constraints on a graph

Description

This function returns TRUE if and only if the admixture graph is compatible with the f4-statistics listed in nonzero_f4 being non-zero

Usage

satisfies_zerof4(graph, zero_f4)

Arguments

graph

An admixture graph

zero_f4

A data frame or matrix with four columns. One row for each f4-statistic which is expected to be zero

Value

TRUE if all constraints are satisfied, else FALSE

Examples

## Not run: 
# Test whether Chimp and Altai are a clade with respect to all populations X and Y:
# (whether f4("Chimp", "Altai"; X, Y) is 0 for all pairs of X and Y)
zero_f4 = expand_grid(pop1 = "Chimp", pop2 = "Altai", pop3 = X, pop4 = Y)
satisfies_zerof4(random_admixturegraph(5, 2), zero_f4)

## End(Not run)

Return shortest unique prefixes

Description

Return shortest unique prefixes

Usage

shortest_unique_prefixes(strings, min_length = 1)

Arguments

strings

A character vector

min_length

Minimum length of prefixes

Value

Character vector with shortest unique prefixes

Examples

strings = c("Abra", "Abracadabra", "Simsalabim")
shortest_unique_prefixes(strings)

Remove redundant edges

Description

Nodes with in and out degree of one will be removed.

Usage

simplify_graph(graph)

Arguments

graph

an igraph object

Examples

simple = simplify_graph(example_igraph)
plot_graph(example_igraph)
plot_graph(simple)

Split a matrix into blocks

Description

This function splits a large matrix into smaller blocks with cols_per_chunk columns per block, and saves them as .rds files with prefix prefix

Usage

split_mat(mat, cols_per_chunk, prefix, overwrite = TRUE, verbose = TRUE)

Arguments

mat

The matrix to be split

cols_per_chunk

Number of columns per block

prefix

Prefix of output files

overwrite

Overwrite existing files (default TRUE)

verbose

Print progress updates

See Also

packedancestrymap_to_afs, afs_to_f2

Examples

## Not run: 
afdat = packedancestrymap_to_afs('path/to/packedancestrymap_prefix', allpopulations)
split_mat(afdat$afs, cols_per_chunk = 20, prefix = 'afdat_split_v42.1/afs')

## End(Not run)

Split nodes with more than two edges

Description

Split nodes with more than two edges

Usage

split_multifurcations(graph)

Arguments

graph

An admixture graph

Value

A new admixture graph in which no node has more than two incoming or outgoing edges


Modify a graph by regrafting a subcomponent

Description

Modify a graph by regrafting a subcomponent

Usage

spr_all(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

Value

A new admixture graph


Modify a graph by regrafting a leaf

Description

Modify a graph by regrafting a leaf

Usage

spr_leaves(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

Value

A new admixture graph


List leaf nodes for all internal nodes

Description

List leaf nodes for all internal nodes

Usage

summarize_descendants(graph)

Arguments

graph

An admixture graphs

Value

A data frame with columns from, to, id

Examples

## Not run: 
summarize_descendants(graph)

## End(Not run)

List leaf nodes for all internal nodes in a list of graphs

Description

List leaf nodes for all internal nodes in a list of graphs

Usage

summarize_descendants_list(graphlist, rename = FALSE)

Arguments

graphlist

A list of admixture graphs

rename

If FALSE (default), the output will be a data frame indicating how often each node in each graph is observed in all other graphs. If TRUE, the output will be a list, where the inner nodes will be renamed to have these percentages as part of their name. plot_graph will print the percentages of graphs renamed in this way.

Value

A data frame with columns graph, from, n, frac

Examples

## Not run: 
summarize_descendants_list(graphlist)

## End(Not run)

List population split events in a graph

Description

List population split events in a graph

Usage

summarize_eventorder(graph, unique_only = TRUE)

Arguments

graph

An admixture graph

Value

A data frame with columns earlier1, earlier2, later1, later2

Examples

## Not run: 
summarize_eventorder(example_igraph)

## End(Not run)

List population split events in a list of graphs

Description

List population split events in a list of graphs

Usage

summarize_eventorder_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns earlier1, earlier2, later1, later2, n, frac

Examples

## Not run: 
summarize_eventorder(graphlist)

## End(Not run)

Summarize graph fits

Description

This function takes the output of qpgraph_resample_snps and creates a data frame with summaries of the estimated graph parameters. weight has the mean of all fits, while low and high have lower and upper quantiles.

Usage

summarize_fits(fits, q_low = 0.05, q_high = 0.95)

Arguments

fits

A nested data frame where each line is the output of qpgraph()

Value

A data frame summarizing the fits

See Also

qpgraph_resample_snps


List number of admixture events for each population

Description

List number of admixture events for each population

Usage

summarize_numadmix(graph)

Arguments

graph

An admixture graph

Value

A data frame with columns pop and nadmix

Examples

## Not run: 
summarize_numadmix(example_igraph)

## End(Not run)

List number of admixture events for each population in a list of graphs

Description

List number of admixture events for each population in a list of graphs

Usage

summarize_numadmix_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop, mean, mean_nonzero, min, max

Examples

## Not run: 
summarize_numadmix_list(graphlist)

## End(Not run)

Assign proxy populations to admixed populations

Description

Assign proxy populations to admixed populations

Usage

summarize_proxies(graph)

Arguments

graph

An admixture graph in igraph format, or as edge list with column weight

Value

A data frame with columns pop, nadmix, proxy, nproxies (and weight)

Examples

## Not run: 
summarize_proxies(example_igraph)

## End(Not run)

List proxy populations in graphlist

Description

List proxy populations in graphlist

Usage

summarize_proxies_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop, proxy, nadmix, nproxies, n, frac

Examples

## Not run: 
summarize_proxies_list(graphlist)

## End(Not run)

Summarize triples across graphs

Description

This function summarizes topologies of population triples across graphs.

Usage

summarize_triples(graphs)

Arguments

graphs

A list of graphs

Value

A data frame with one line for each population triple and columns summarizing the relationship of each triple across graphs.

  • clade12: Fraction of graphs in which pop1 and pop2 form a clade

  • x12: Fraction of graphs in which pop1 admixes into pop2 at the exclusion of pop3

  • toptopo: A binary string representation of the most common topology. Digits represent x12, x21, x13, x31, x23, x32

  • toptopocnt: The number of graphs in which toptopo was observed

  • topos: The counts of all topologies


List clades in a graph

Description

List clades in a graph

Usage

summarize_zerof4(graph, eps = 1e-07)

Arguments

graph

An admixture graph

eps

Used to determine what counts as zero

Value

A data frame with all (non-redundant) zero f4-statistics

Examples

## Not run: 
summarize_zerof4(example_igraph)

## End(Not run)

List clades in a list of graphs

Description

List clades in a list of graphs

Usage

summarize_zerof4_list(graphlist)

Arguments

graphlist

A list of admixture graphs

Value

A data frame with columns pop1, pop2, pop3, pop4, n, frac

Examples

## Not run: 
summarize_zerof4_list(graphlist)

## End(Not run)

Modify a graph by swapping two leaf nodes

Description

Modify a graph by swapping two leaf nodes

Usage

swap_leaves(graph, fix_outgroup = TRUE)

Arguments

graph

An admixture graph

fix_outgroup

Keep outgroup in place

Value

A new admixture graph


Test if two sets of populations form two clades

Description

A thin wrapper around qpadm_p with rnk set to zero

Usage

test_cladality(f2_data, left, right, fudge = 1e-04, boot = FALSE, cpp = TRUE)

Arguments

left

Left populations (sources)

right

Right populations (outgroups)

fudge

Value added to diagonal matrix elements before inverting

boot

If FALSE (the default), block-jackknife resampling will be used to compute standard errors. Otherwise, block-bootstrap resampling will be used to compute standard errors. If boot is an integer, that number will specify the number of bootstrap resamplings. If boot = TRUE, the number of bootstrap resamplings will be equal to the number of SNP blocks.

cpp

Use C++ functions. Setting this to FALSE will be slower but can help with debugging.


Test if a tree is part of a graph

Description

This function tests whether a tree is part of a graph. This is useful for testing whether a Y-chromosome tree is consistent with an autosomal admixture graph. Leaf node names matter, but internal node names are ignored.

Usage

tree_in_graph(tree, graph)

Arguments

tree

An admixture graph without admixture event

graph

An admixture graph

Value

TRUE if all admixture constraints are satisfied, else FALSE ## Not run: tree = graph_splittrees(example_igraph) tree_in_graph(tree, example_igraph) ## End(Not run)


Find all trees within SPR distance of 1

Description

Returns all trees which can be reached through one iteration of subtree-prune-and-regraft

Usage

tree_neighbors(tree)

Arguments

tree

A tree in igraph format

Value

A data frame with all trees


Find all unidentifiable edges

Description

This function generates and evaluates admixture graphs in numgen iterations to find well fitting admixturegraphs.

Usage

unidentifiable_edges(graph)

Arguments

graph

An admixture graph

Value

A data frame with all unidentifiable graph parameters


Convert graph to dot format

Description

Convert graph to dot format

Usage

write_dot(
  graph,
  outfile = stdout(),
  size1 = 7.5,
  size2 = 10,
  title = "",
  dot2pdf = FALSE
)

Arguments

graph

Graph as igraph object or edge list (columns labelled 'from', 'to', 'weight')

outfile

Output file name

Examples

## Not run: 
results = qpgraph(example_f2_blocks, example_graph)
write_dot(results$edges)

## End(Not run)

Write blocked f2 estimates to disk

Description

This function takes 3d arrays of blocked f2, allele frequency products, and counts, splits them by population pair, and writes each pair to a separate .rds file under {outdir}/{pop1}/{pop2}.rds.

Usage

write_f2(est_arr, count_arr, outdir, id = "f2", overwrite = FALSE)

Arguments

est_arr

A 3d array with blocked f2, allele frequency products, or fst estimates for each population pair. The first two dimensions of each array have to have population names.

count_arr

A 3d array of the same dimension with counts

outdir

Directory where data will be stored

id

Postfix showing the type of statistic ("f2", "ap", or "fst")

overwrite

Overwrite existing files in outdir

See Also

read_f2

Examples

## Not run: 
write_f2(f2_arr, count_arr, outdir = 'path/to/f2stats/')

## End(Not run)