--- title: "Data formats" author: "Robert Maier" date: "2022-11-12" output: rmarkdown::html_vignette: toc: true toc_depth: 2 vignette: > %\VignetteIndexEntry{Data formats} %\VignetteEngine{knitr::rmarkdown} \usepackage[utf8]{inputenc} --- ## Genotype data formats *ADMIXTOOLS 2* can read genotype data in four formats. 1. Binary *PLINK* format (PACKEDPED), described [here](https://www.cog-genomics.org/plink2/input) 2. Binary *PLINK 2* format (PFILE), described [here](https://www.cog-genomics.org/plink/2.0/input) 3. Binary *PACKEDANCESTRYMAP* format 4. Text based *EIGENSTRAT* format *PACKEDANCESTRYMAP* and *EIGENSTRAT* are described [here](https://reich.hms.harvard.edu/software/InputFileFormats) In all four formats a dataset consists of three files: one file for the genotype matrix, one file for the SNP metadata, and one file for the sample metadata. The following shows what a data set for three SNPs and five samples grouped into two populations (French.DG, Onge.DG) could look like: ### PLINK PLINK `.fam` files, have six columns, but only the first two (family/population ID and sample ID) are relevant in *ADMIXTOOLS 2* ``` French.DG S_French-1.DG 0 0 M -9 French.DG S_French-2.DG 0 0 F -9 French.DG B_French-3.DG 0 0 M -9 Onge.DG BR_Onge-2.DG 0 0 F -9 Onge.DG BR_Onge-1.DG 0 0 F -9 ``` PLINK `.bim` file: ``` 1 rs3094315 0.02013 752566 G A 1 rs12124819 0.020242 776546 A G 1 rs28765502 0.022137 832918 T C ``` ### PLINK 2 (PFILE) A *PLINK 2* dataset consists of three files: `.pgen` (binary genotype matrix), `.pvar` (variant metadata), and `.psam` (sample metadata). The prefix convention is the same as for binary *PLINK* (one prefix that resolves to all three), so *ADMIXTOOLS 2* auto-detects PFILE format whenever `.pgen`, `.pvar`, and `.psam` all exist at the prefix you pass. `.psam` is a tab-separated text file. Population labels live in the `#FID` column when present (matching the PLINK convention of FID-as-population); when `#FID` is missing or `0`, *ADMIXTOOLS 2* falls back to `IID`: ``` #FID IID SEX French.DG S_French-1.DG 1 French.DG S_French-2.DG 2 French.DG B_French-3.DG 1 Onge.DG BR_Onge-2.DG 2 Onge.DG BR_Onge-1.DG 2 ``` `.pvar` is tab-separated text and (unlike `.bim`) supports multi-allelic sites by allowing comma-separated `ALT` alleles. The `pfile_to_afs()` function accepts a `multiallelic = c("error", "skip", "first_alt")` argument to control how those sites are handled (default `"error"`): ``` #CHROM POS ID REF ALT 1 752566 rs3094315 G A 1 776546 rs12124819 A G 1 832918 rs28765502 T C ``` `.pvar` may also be supplied in Zstd-compressed form as `.pvar.zst` — *ADMIXTOOLS 2* picks up either form transparently. Reading `.pvar.zst` requires the `zstd` CLI on PATH (the genotype access via `pgenlibr` handles `.zst` natively, but the R-side metadata parser shells out to `zstd -d`). Note: `.pvar` does *not* by default carry centimorgan positions. When the `CM` column is absent *ADMIXTOOLS 2* leaves `cm = 0` for all variants and falls back to base-pair distance for block partitioning (`blgsize = 1e6` works well at typical block sizes). Alternatively you can supply an external recombination map via `cm_file =` (a TSV with a SNP-identifier column — `variant_id`, `SNP`, or `ID` — and a `cm` column). ### EIGENSTRAT/PACKEDANCESTRYMAP *EIGENSTRAT* and *PACKEDANCESTRYMAP* use the same format for the metadata. The only difference is that the genotype data is stored in a text file in *EIGENSTRAT*, but in a binary file in *PACKEDANCESTRYMAP*. The `.ind` file has 3 out of the 6 columns in the `.fam` file, in different order: ``` S_French-1.DG M French.DG S_French-2.DG F French.DG B_French-3.DG M French.DG BR_Onge-2.DG F Onge.DG BR_Onge-1.DG F Onge.DG ``` `.snp` files are the same as `.bim` files, except that the first two columns are swapped: ``` rs3094315 1 0.02013 752566 G A rs12124819 1 0.020242 776546 A G rs28765502 1 0.022137 832918 T C ``` *EIGENSTRAT* `.geno` files look like this: ``` 29101 10211 22901 ``` One SNP per row, one sample per column, and missing data is denoted by `9`. ## Reading genotype files While there are other R packages for reading and writing binary *PLINK* files ([plink2R](https://github.com/gabraham/plink2R), [genio](https://github.com/OchoaLab/genio)), no such packages exist to date for *PACKEDANCESTRYMAP* format files, so some of the functions shown here may be useful outside of the *ADMIXTOOLS* framework. Note that the functions below impose the restriction than individual IDs may not be duplicated across populations or families. ```{r, echo = FALSE, warning = FALSE, message = FALSE} library(magrittr) library(tidyverse) library(admixtools) ``` ```{r, eval = FALSE} geno = read_plink("plink/prefix") geno = read_eigenstrat("eigenstrat/prefix") geno = read_packedancestrymap("packedancestrymap/prefix") # read SNPs 1000 to 2000 geno = read_packedancestrymap("packedancestrymap/prefix", first = 1000, last = 2000) # read only ind1 and ind2 geno = read_packedancestrymap("packedancestrymap/prefix", inds = c('ind1', 'ind2')) ``` For PFILE, there is no `read_pfile` — PFILE inputs are consumed directly by the f-statistic pipeline (`extract_f2`, `f2_from_geno`, `f4blockdat_from_geno`, `qpfstats`) which auto-detect the format from the file extensions at the prefix. ## Extracting populations Populations can be extracted from *PLINK* files (and written to another set of *PLINK* files) like this: ```{r, eval = FALSE} extract_samples("input/prefix", "output/prefix", pops = c("pop1", "pop2")) ``` `extract_samples` can also be used to extract samples, similar to `plink --keep`. ## Combining data sets *ADMIXTOOLS 2* can't merge data sets. Data in *PLINK* format can be merged using `plink --bmerge`. ## Computing allele frequencies ```{r, eval = FALSE} afs = plink_to_afs("plink/prefix") afs = eigenstrat_to_afs("eigenstrat/prefix") afs = packedancestrymap_to_afs("packedancestrymap/prefix") afs = pfile_to_afs("plink2/prefix") # pfile_to_afs supports the same range/subset args as the others afs = pfile_to_afs("plink2/prefix", inds = c('ind1', 'ind2')) afs = pfile_to_afs("plink2/prefix", pops = c('pop1', 'pop2')) afs = pfile_to_afs("plink2/prefix", first = 1000, last = 2000) # When .pvar lacks centimorgan positions (the default), supply a map # via cm_file (TSV with variant_id + cm columns) afs = pfile_to_afs("plink2/prefix", cm_file = "ref_genmap.tsv") ``` ## Converting *PACKEDANCESTRYMAP* to *PLINK* This function converts genotype files from *PACKEDANCESTRYMAP* or *EIGENSTRAT* to *PLINK* format. If `inds` or `pops` is specified, only those samples will be extracted. *ADMIXTOOLS 2* currently doesn't provide the option to write binary *PACKEDANCESTRYMAP* files, so the conversion only goes one way. ```{r, eval = FALSE} eigenstrat_to_plink("eigenstrat_input_prefix", "plink_output_prefix") packedancestrymap_to_plink("packedancestrymap_input_prefix", "plink_output_prefix") ``` ## f2 cache metadata When `extract_f2()` writes to disk it emits two sidecar files alongside the per-pair `.rds` arrays: * **`.f2_cache_id`** — a short text file containing a `sha256:<64hex>` hash over the inputs and filter args. Cheap to read; useful as a cache key for orchestrators that want to skip re-extraction when nothing has changed. * **`cache_metadata.json`** — a schema-versioned JSON file with `n_snps`, `n_blocks`, `pops`, build timestamp, filter args, and the same `cache_id`. Written atomically (tempfile + rename) before `.f2_cache_id`, so a crash between the two writes leaves a recoverable directory. Read the metadata programmatically: ```{r, eval = FALSE} meta = read_f2_cache_metadata("my_f2_dir/") meta$n_snps # SNPs that actually contributed to f2 blocks meta$pops # populations in the cache meta$cache_id # matches .f2_cache_id ``` To check whether an existing cache directory still matches your current inputs without re-extracting, recompute the cache id from the genotype prefix and compare: ```{r, eval = FALSE} fresh = compute_f2_cache_id("geno_prefix", pops = pops, ...) cached = read_f2_cache_metadata("my_f2_dir/")$cache_id identical(fresh, cached) # TRUE if the cache is still valid for these inputs ``` If only the genotype prefix and the f2 directory are available (e.g. an orchestrator probe), `compute_f2_cache_id("my_f2_dir/")` reads the hash directly from the sidecar — no genotype-file access needed. ## Admixture graph formats There are several ways in which admixture graphs can be represented. Some of them are useful because they can be stored as human readable text files, and others are useful because they make it easier for R to process graphs. *ADMIXTOOLS 2* has several function for importing, exporting and converting between formats. Before fitting an admixture graph, the graph only has nodes and edges. After fitting a graph, each edge has a weight associated with it, and maybe some other information. Here I mostly show examples of graphs that are not fitted yet. ### Edge list format The simplest way to represent a graph in a text file is by writing each edge in the graph in a separate line: ``` R O R n1 n1 n2 n1 n3 n2 A n2 admix n3 B n3 n4 n4 C n4 admix admix D ``` We can save this in a text file `graph.txt` and read it into R like this: ```{r, eval = FALSE} graph_el = read_table('~/Downloads/graph.tsv', col_names = F) plotly_graph(graph_el) ``` ```{r, echo = FALSE} graph_el = tribble(~from, ~to, 'R', 'O', 'R', 'n1', 'n1', 'n2', 'n1', 'n3', 'n2', 'A', 'n2', 'admix', 'n3', 'B', 'n3', 'n4', 'n4', 'C', 'n4', 'admix', 'admix', 'D') plotly_graph(graph_el) ``` `qpgraph()` uses this format to represent fitted graphs. In addition to the first two columns (`from` and `to`), there will be additional columns to indicate the type of edge, its estimated weight, and possibly more information. ```{r} qpgraph(example_f2_blocks, example_igraph)$edges %>% head ``` ### igraph format Many *ADMIXTOOLS 2* functions use the `igraph` package, which has its own graph format. The most important *ADMIXTOOLS 2* functions work with graphs in either the edge list or the `igraph` format, but some only work with graphs in the `igraph` format. It's easy to convert between the two formats: ```{r} class(graph_el) graph = edges_to_igraph(graph_el) class(graph) graph_el2 = as.data.frame(igraph::as_edgelist(graph)) class(graph_el2) ``` ### Original ADMIXTOOLS format The text representation of the graph above in the original *ADMIXTOOLS* software would look like this: ``` vertex R vertex O vertex n1 vertex n2 vertex n3 vertex A vertex admix vertex B vertex n4 vertex C vertex D label O O label A A label B B label C C label D D edge R_O R O edge R_n1 R n1 edge n1_n2 n1 n2 edge n1_n3 n1 n3 edge n2_A n2 A edge n3_B n3 B edge n3_n4 n3 n4 edge n4_C n4 C edge admix_D admix D admix admix n2 n4 ``` If this is stored in a file called `graph.txt`, it can be imported into R like this: ```{r, eval = FALSE} graph_el = parse_qpgraph_graphfile('graph.txt') ``` ### DOT format DOT format is often used for plotting admixture graphs. You can export a fitted admixture graph like this: ```{r} fit = qpgraph(example_f2_blocks, example_igraph)$edges write_dot(fit, 'graph.dot') ``` You can read a DOT file back into an igraph with `parse_dot()`: ```{r, eval = FALSE} graph = parse_dot("graph.dot") ``` `parse_dot()` is a minimal reader that recovers the topology from `from -> to` edges. Edge attributes (weights, types) are not preserved on the round trip; for full round-tripping of fitted graphs, use the edge-list format instead.