ADMIXTOOLS 2 can read genotype data in four formats.
PACKEDANCESTRYMAP and EIGENSTRAT are described here
In all four formats a dataset consists of three files: one file for the genotype matrix, one file for the SNP metadata, and one file for the sample metadata.
The following shows what a data set for three SNPs and five samples grouped into two populations (French.DG, Onge.DG) could look like:
PLINK .fam files, have six columns, but only the first
two (family/population ID and sample ID) are relevant in ADMIXTOOLS
2
French.DG S_French-1.DG 0 0 M -9
French.DG S_French-2.DG 0 0 F -9
French.DG B_French-3.DG 0 0 M -9
Onge.DG BR_Onge-2.DG 0 0 F -9
Onge.DG BR_Onge-1.DG 0 0 F -9
PLINK .bim file:
1 rs3094315 0.02013 752566 G A
1 rs12124819 0.020242 776546 A G
1 rs28765502 0.022137 832918 T C
A PLINK 2 dataset consists of three files:
.pgen (binary genotype matrix), .pvar (variant
metadata), and .psam (sample metadata). The prefix
convention is the same as for binary PLINK (one prefix that
resolves to all three), so ADMIXTOOLS 2 auto-detects PFILE
format whenever <prefix>.pgen,
<prefix>.pvar, and <prefix>.psam
all exist at the prefix you pass.
.psam is a tab-separated text file. Population labels
live in the #FID column when present (matching the PLINK
convention of FID-as-population); when #FID is missing or
0, ADMIXTOOLS 2 falls back to
IID:
#FID IID SEX
French.DG S_French-1.DG 1
French.DG S_French-2.DG 2
French.DG B_French-3.DG 1
Onge.DG BR_Onge-2.DG 2
Onge.DG BR_Onge-1.DG 2
.pvar is tab-separated text and (unlike
.bim) supports multi-allelic sites by allowing
comma-separated ALT alleles. The
pfile_to_afs() function accepts a
multiallelic = c("error", "skip", "first_alt") argument to
control how those sites are handled (default "error"):
#CHROM POS ID REF ALT
1 752566 rs3094315 G A
1 776546 rs12124819 A G
1 832918 rs28765502 T C
.pvar may also be supplied in Zstd-compressed form as
.pvar.zst — ADMIXTOOLS 2 picks up either form
transparently. Reading .pvar.zst requires the
zstd CLI on PATH (the genotype access via
pgenlibr handles .zst natively, but the R-side
metadata parser shells out to zstd -d).
Note: .pvar does not by default carry
centimorgan positions. When the CM column is absent
ADMIXTOOLS 2 leaves cm = 0 for all variants and
falls back to base-pair distance for block partitioning
(blgsize = 1e6 works well at typical block sizes).
Alternatively you can supply an external recombination map via
cm_file = (a TSV with a SNP-identifier column —
variant_id, SNP, or ID — and a
cm column).
EIGENSTRAT and PACKEDANCESTRYMAP use the same format for the metadata. The only difference is that the genotype data is stored in a text file in EIGENSTRAT, but in a binary file in PACKEDANCESTRYMAP.
The .ind file has 3 out of the 6 columns in the
.fam file, in different order:
S_French-1.DG M French.DG
S_French-2.DG F French.DG
B_French-3.DG M French.DG
BR_Onge-2.DG F Onge.DG
BR_Onge-1.DG F Onge.DG
.snp files are the same as .bim files,
except that the first two columns are swapped:
rs3094315 1 0.02013 752566 G A
rs12124819 1 0.020242 776546 A G
rs28765502 1 0.022137 832918 T C
EIGENSTRAT .geno files look like this:
29101
10211
22901
One SNP per row, one sample per column, and missing data is denoted
by 9.
While there are other R packages for reading and writing binary PLINK files (plink2R, genio), no such packages exist to date for PACKEDANCESTRYMAP format files, so some of the functions shown here may be useful outside of the ADMIXTOOLS framework.
Note that the functions below impose the restriction than individual IDs may not be duplicated across populations or families.
geno = read_plink("plink/prefix")
geno = read_eigenstrat("eigenstrat/prefix")
geno = read_packedancestrymap("packedancestrymap/prefix")
# read SNPs 1000 to 2000
geno = read_packedancestrymap("packedancestrymap/prefix", first = 1000, last = 2000)
# read only ind1 and ind2
geno = read_packedancestrymap("packedancestrymap/prefix", inds = c('ind1', 'ind2'))For PFILE, there is no read_pfile — PFILE inputs are
consumed directly by the f-statistic pipeline (extract_f2,
f2_from_geno, f4blockdat_from_geno,
qpfstats) which auto-detect the format from the file
extensions at the prefix.
Populations can be extracted from PLINK files (and written to another set of PLINK files) like this:
extract_samples can also be used to extract samples,
similar to plink --keep.
ADMIXTOOLS 2 can’t merge data sets. Data in PLINK
format can be merged using plink --bmerge.
afs = plink_to_afs("plink/prefix")
afs = eigenstrat_to_afs("eigenstrat/prefix")
afs = packedancestrymap_to_afs("packedancestrymap/prefix")
afs = pfile_to_afs("plink2/prefix")
# pfile_to_afs supports the same range/subset args as the others
afs = pfile_to_afs("plink2/prefix", inds = c('ind1', 'ind2'))
afs = pfile_to_afs("plink2/prefix", pops = c('pop1', 'pop2'))
afs = pfile_to_afs("plink2/prefix", first = 1000, last = 2000)
# When .pvar lacks centimorgan positions (the default), supply a map
# via cm_file (TSV with variant_id + cm columns)
afs = pfile_to_afs("plink2/prefix", cm_file = "ref_genmap.tsv")This function converts genotype files from PACKEDANCESTRYMAP
or EIGENSTRAT to PLINK format. If inds or
pops is specified, only those samples will be
extracted.
ADMIXTOOLS 2 currently doesn’t provide the option to write binary PACKEDANCESTRYMAP files, so the conversion only goes one way.
When extract_f2() writes to disk it emits two sidecar
files alongside the per-pair .rds arrays:
.f2_cache_id — a short text file
containing a sha256:<64hex> hash over the inputs and
filter args. Cheap to read; useful as a cache key for orchestrators that
want to skip re-extraction when nothing has changed.cache_metadata.json — a
schema-versioned JSON file with n_snps,
n_blocks, pops, build timestamp, filter args,
and the same cache_id. Written atomically (tempfile +
rename) before .f2_cache_id, so a crash between the two
writes leaves a recoverable directory.Read the metadata programmatically:
meta = read_f2_cache_metadata("my_f2_dir/")
meta$n_snps # SNPs that actually contributed to f2 blocks
meta$pops # populations in the cache
meta$cache_id # matches .f2_cache_idTo check whether an existing cache directory still matches your current inputs without re-extracting, recompute the cache id from the genotype prefix and compare:
fresh = compute_f2_cache_id("geno_prefix", pops = pops, ...)
cached = read_f2_cache_metadata("my_f2_dir/")$cache_id
identical(fresh, cached) # TRUE if the cache is still valid for these inputsIf only the genotype prefix and the f2 directory are available
(e.g. an orchestrator probe),
compute_f2_cache_id("my_f2_dir/") reads the hash directly
from the sidecar — no genotype-file access needed.
There are several ways in which admixture graphs can be represented. Some of them are useful because they can be stored as human readable text files, and others are useful because they make it easier for R to process graphs. ADMIXTOOLS 2 has several function for importing, exporting and converting between formats.
Before fitting an admixture graph, the graph only has nodes and edges. After fitting a graph, each edge has a weight associated with it, and maybe some other information. Here I mostly show examples of graphs that are not fitted yet.
The simplest way to represent a graph in a text file is by writing each edge in the graph in a separate line:
R O
R n1
n1 n2
n1 n3
n2 A
n2 admix
n3 B
n3 n4
n4 C
n4 admix
admix D
We can save this in a text file graph.txt and read it
into R like this:
qpgraph() uses this format to represent fitted graphs.
In addition to the first two columns (from and
to), there will be additional columns to indicate the type
of edge, its estimated weight, and possibly more information.
## # A tibble: 6 × 6
## from to type weight low high
## <chr> <chr> <chr> <dbl> <dbl> <dbl>
## 1 R Chimp.REF edge 0.0724 NA NA
## 2 R N4N edge 0.0724 NA NA
## 3 N0N Altai_Neanderthal.DG edge 0.00321 NA NA
## 4 N0N N3N0 edge 0.0111 NA NA
## 5 N1N N0N edge 0.0841 NA NA
## 6 N1N N3N1 edge 0.0362 NA NA
Many ADMIXTOOLS 2 functions use the igraph
package, which has its own graph format. The most important
ADMIXTOOLS 2 functions work with graphs in either the edge list
or the igraph format, but some only work with graphs in the
igraph format. It’s easy to convert between the two
formats:
## [1] "tbl_df" "tbl" "data.frame"
## [1] "igraph"
## [1] "data.frame"
The text representation of the graph above in the original ADMIXTOOLS software would look like this:
vertex R
vertex O
vertex n1
vertex n2
vertex n3
vertex A
vertex admix
vertex B
vertex n4
vertex C
vertex D
label O O
label A A
label B B
label C C
label D D
edge R_O R O
edge R_n1 R n1
edge n1_n2 n1 n2
edge n1_n3 n1 n3
edge n2_A n2 A
edge n3_B n3 B
edge n3_n4 n3 n4
edge n4_C n4 C
edge admix_D admix D
admix admix n2 n4
If this is stored in a file called graph.txt, it can be
imported into R like this:
DOT format is often used for plotting admixture graphs. You can export a fitted admixture graph like this:
You can read a DOT file back into an igraph with
parse_dot():
parse_dot() is a minimal reader that recovers the
topology from from -> to edges. Edge attributes
(weights, types) are not preserved on the round trip; for full
round-tripping of fitted graphs, use the edge-list format instead.