---
title: "Parallelization"
author: "Robert Maier"
date: "2026-05-21"
output:
  rmarkdown::html_vignette:
    toc: true
    toc_depth: 2
vignette: >
  %\VignetteIndexEntry{Parallelization}
  %\VignetteEngine{knitr::rmarkdown}
  \usepackage[utf8]{inputenc}
---

*ADMIXTOOLS 2* uses the [`future`](https://github.com/HenrikBengtsson/future) / [`furrr`](https://github.com/DavisVaughan/furrr) framework. The default plan is sequential, so functions behave like single-core code unless you opt in. Opt-in is one line:

```{r, eval = FALSE}
library(future)
plan(multisession, workers = 4)
```

Switch back with `plan(sequential)`. The same `plan()` call applies to every `admixtools` function that supports parallelism; there are no per-function flags to toggle.

## What's parallelized

Core data-extraction and model-fitting workflows:

* **`extract_f2`** and **`f4blockdat_from_geno`** parallelize across SNP blocks. This is the highest-impact case: a 15-pop / 5565-popcomb / 713-block run takes about 200 s sequentially and 75 s under `plan(multisession, workers = 4)`, roughly a 2.7× speedup.
* **`qpadm`**, **`qp3pop`**, **`qp4ratio`**, and **`qpfstats`** call `f4blockdat_from_geno` when the input is a genotype-file prefix, so they benefit transparently.
* **`qpadm_multi`** and **`qpadm_sweep`** parallelize across models on top of that. Each model's f4 computation also parallelizes per block, so the two layers compose. To stay within your compute budget you usually want only one layer to run in parallel: set the number of workers to match the available cores and let either the per-block or the per-model layer absorb them.
* **`read_f2`** parallelizes the per-pair `.rds` reads when loading a precomputed f2 cache. This is useful on slow disks or NFS, where I/O latency dominates.
* **`qpgraph_resample_snps`** and **`qpgraph_resample_snps2`** parallelize across resamples.
* **`find_graphs_old`** parallelizes its independent repeats (with `parallel = TRUE`, the default).

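
The advice in the `qpadm_multi` bullet above, keeping only one layer parallel, can be stated explicitly with a nested `future` plan. The following is a minimal sketch, not the package's own mechanism: the toy function passed to `future_map()` stands in for a per-model fit, and whether the `admixtools` layers pick up the inner plan in exactly this way is an assumption. In `future`, a list passed to `plan()` assigns one strategy per nesting level.

```r
library(future)
library(furrr)

# Outer layer (e.g. one task per model): 2 background R sessions.
# Inner layer (e.g. per-block work inside each task): sequential.
plan(list(tweak(multisession, workers = 2), sequential))

# Toy stand-in for a per-model fit: each task reports how many workers
# the inner layer would be allowed to use.
inner_workers <- future_map(1:4, function(i) future::nbrOfWorkers())

# Restore the default plan when done.
plan(sequential)
```

With this layout the total number of busy R processes is bounded by the outer `workers` value, because the inner layer does not spawn workers of its own.
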
Not parallelized:

* **`find_graphs`** (the newer fitter). It is fast enough single-threaded that parallel overhead would dominate. If you want N independent runs in parallel, wrap it yourself in `furrr::future_map(1:N, ~find_graphs(...))`.
* **`qpgraph`** itself (single-graph fit).

## multisession vs multicore

`multisession` (workers are R subprocesses that communicate via sockets) is the portable default. It works on all platforms, including Windows.

`multicore` (workers are forked processes that share memory copy-on-write) is faster on Linux: a 15-pop qpadm run sees a 1.9× speedup over sequential under `plan(multicore, workers = 4)` and only 1.1× under `plan(multisession, workers = 4)`, because forking skips the worker-startup and data-marshalling cost. On macOS `multicore` works at the command line but fails inside RStudio. On Windows it is unsupported.

If you're on Linux and not in RStudio:

```{r, eval = FALSE}
plan(multicore, workers = 4)
```

Otherwise:

```{r, eval = FALSE}
plan(multisession, workers = 4)
```

## Parallelization on a compute cluster

Sometimes it makes more sense to parallelize across compute nodes rather than across cores. This can be done either in the traditional way, by writing an R script and submitting it many times in parallel as separate jobs, or interactively from within R, again using the `furrr` / `future` framework. The interactive route is more complicated to set up than parallelization across cores.

On a cluster using the *Slurm* job scheduler, the following command will set up parallelization across compute nodes.

```{r, eval = FALSE}
library(future.batchtools)
plan(tweak(batchtools_slurm, workers = 50,
           resources = list(ncpus = 1, memory = 1024,
                            walltime = 10 * 60 * 60, partition = 'short')))
```

It specifies that up to 50 jobs should be run at a time, with each one requesting one CPU, 1024 MB of memory, and 10 hours (`walltime` is given in seconds) on the partition called `short`.
This requires the [`future.batchtools`](https://github.com/HenrikBengtsson/future.batchtools) R package and a batchtools template file in the working directory; see [this example template](https://github.com/mllg/batchtools/blob/master/inst/templates/slurm-simple.tmpl).

## When parallelization doesn't help

* **Tiny workloads**: for a 5-pop / 10-popcomb / 100-block `extract_f2`, the sequential version finishes in under a second. The worker-spawn overhead of `multisession` can make the parallel run slower than the serial one on inputs this small.
* **Memory-constrained machines**: `multisession` workers each hold their own copy of the input data after R serializes it and sends it to them. For a 100 GB f2 cache and 8 workers, you'd need ~100 GB × 8 (about 800 GB) of headroom. Use `multicore` (copy-on-write) on Linux, use fewer workers, or stay sequential.
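
The memory arithmetic and the backend choice described above can be sketched in a few lines. Both helpers below, `multisession_mem_gb()` and `choose_plan()`, are hypothetical illustrations and not part of `admixtools`. The estimate assumes the worst case in which every worker ends up holding a full copy of the input. `supportsMulticore()` is provided by `future` and reports whether forked processing is available in the current session.

```r
library(future)

# Hypothetical helper: worst-case memory (in GB) if every multisession
# worker holds its own full copy of the input data.
multisession_mem_gb <- function(data_gb, workers) data_gb * workers

# The example from the text: a 100 GB f2 cache and 8 workers.
mem_needed <- multisession_mem_gb(100, 8)

# Hypothetical helper: use forked workers where `future` reports that
# forking is supported, otherwise fall back to multisession.
# Returns the name of the backend that was selected.
choose_plan <- function(workers = 4) {
  if (supportsMulticore()) {
    plan(multicore, workers = workers)
    "multicore"
  } else {
    plan(multisession, workers = workers)
    "multisession"
  }
}

backend <- choose_plan(workers = 2)

# Restore the default plan when done.
plan(sequential)
```

If the estimate exceeds the memory available on the machine, lowering `workers` or staying with `plan(sequential)` is the safer option.
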