Parallelization

ADMIXTOOLS 2 parallelizes on two levels that work together, and on a single machine you usually do not have to manage either one. The f3 and f4 statistic kernels on the genotype path already use all of your cores, so for one qpadm or one qp3pop you can simply call the function. Reach for a future plan when you have many independent units of work, a sweep of models or a batch of resamples, or when you want to spread work across the nodes of a cluster.

The two levels are these.

The f-statistic kernels on the genotype path, the inner loops that turn genotypes into allele frequencies (cpp_gmat_to_aftable) and allele frequencies into f3 and f4 statistics (cpp_aftable_to_dstat*), are parallelized in C++ with OpenMP. This is on by default and uses all available cores even when you have set up no parallelism in R. OpenMP is an optional build dependency, so if the package was compiled without it these kernels run on a single core and everything still works. The f2 kernels that extract_f2 uses to build an f2 cache (cpp_mats_to_f2_arr, cpp_outer_array_*) are not threaded by OpenMP; extract_f2 instead parallelizes across population-pair chunks in R through doParallel when you raise its n_cores argument, and runs on a single core by default.

The coarse-grained outer work, namely SNP blocks, sweeps of many models, and bootstrap resamples, is parallelized in R with the future / furrr framework. The default plan is sequential, so this outer layer stays in a single process until you opt in with one line.

library(future)
plan(multisession, workers = 4)

Switch back with plan(sequential). The same plan() call applies to every admixtools function that supports the outer layer, and there are no per-function flags to toggle.

The two layers share one machine without fighting over it. Under plan(multisession, workers = 4) on a 16-core machine each worker is handed 4 kernel threads, so the total stays at 16. The budget is availableCores() divided by the number of workers, computed once in the main process and passed down, so the block layer and the OpenMP layer never multiply out into oversubscription.

What’s parallelized

Most of these functions parallelize their outer work through future, grouped by what that outer work is. extract_f2 is the exception, and is described separately below.

Across SNP blocks

  • f4blockdat_from_geno spreads its SNP blocks across future workers. The per-block f3 and f4 work inside it is already threaded by OpenMP, so on one machine plan(sequential) is not idle, it is running the kernels on every core. The block layer adds throughput mainly when you have spare cores beyond what a single block’s kernel can keep busy, or when you spread blocks across nodes.
  • qpadm, qp3pop, qp4ratio, and qpfstats call f4blockdat_from_geno when the input is a genotype-file prefix, so they inherit the block layer transparently.

Across population-pair chunks

  • extract_f2 does not use the future layer. It splits the population set into memory-sized chunks and runs them through doParallel (foreach) when you pass n_cores > 1; the default n_cores = 1 runs serially, and its f2 kernels are single-threaded, so by default extract_f2 is single-core. Raise n_cores to use more cores, or compute the f2 cache once and reuse it. The one exception is extract_f2(..., qpfstats = TRUE), which routes through f4blockdat_from_geno and so picks up the OpenMP kernels and the future block layer above.

Across models

  • qpadm_multi and qpadm_sweep evaluate many models in parallel. With a genotype-file prefix the shared f4 computation runs once at the top level, correctly budgeted, and the per-model fits reuse it, so the per-model layer and the kernel layer do not stack into oversubscription. You no longer have to hand-tune which layer gets the cores.

Across resamples

  • qpgraph_resample_snps and qpgraph_resample_snps2 parallelize across resamples.
  • find_graphs_old parallelizes its independent repeats (with parallel = TRUE, the default).

Across cache reads

  • read_f2 parallelizes the per-pair .rds reads when loading a precomputed f2 cache. Useful on slow disks or NFS where I/O latency dominates.

Not parallelized:

  • find_graphs (the newer fitter, fast enough single-threaded that parallel overhead would dominate; if you want N independent runs in parallel, wrap it yourself in furrr::future_map(1:N, ~find_graphs(...))).
  • qpgraph itself (single-graph fit).

Choosing a future plan

This is the old multisession versus multicore question, with one change. Because the OpenMP kernels already use every core under plan(sequential), adding a future plan no longer multiplies throughput on a single f-statistic computation the way it did when the kernels were serial. A dense f4 computation over 8 populations (about 1,680 population quadruples) on a 32-core machine landed like this.

Plan Wall time Relative to plan(sequential)
plan(sequential) ~17 s baseline, kernels use all 32 cores
plan(multisession, workers = 4) ~25 s 0.7x, worker startup and data copying cost more than they save
plan(multicore, workers = 4) ~14 s 1.2x

So the plan you pick now matters most for memory and for work that has many independent units, not for speeding up a single computation.

multisession runs its workers as separate R processes talking over sockets. It is the portable default and works everywhere, including Windows. multicore runs its workers as forked processes that share memory copy-on-write, which skips the startup and copying cost and is a little faster on Linux, but it is unavailable on Windows and unsafe inside RStudio. Forking is safe alongside the OpenMP kernels because the thread pool is drained across the fork, so each forked worker starts single-threaded and builds its own pool.

If you’re on Linux and not in RStudio:

plan(multicore, workers = 4)

Otherwise:

plan(multisession, workers = 4)

Controlling how many cores are used

By default the kernels size their thread team from parallelly::availableCores() divided by the number of future workers. Three levers narrow that:

  • OMP_NUM_THREADS caps the team size. OMP_NUM_THREADS=1 forces the kernels to run single-threaded; any value limits each kernel call to at most that many threads. admixtools reads this variable explicitly and applies it as a cap, because parallelly::availableCores() does not read it and the kernels set num_threads directly (which OpenMP gives precedence over the environment variable).
  • options(admixtools.omp_num_threads = N) is the R-native equivalent and takes precedence over the environment variable.
  • Lowering what availableCores() reports, for example with options(mc.cores = 8) or the variables described in ?parallelly::availableCores, or running under a plan with more workers, which divides the budget further.

Parallelization on a compute cluster

Sometimes it makes more sense to parallelize across compute nodes rather than across cores. This can be done either in the traditional way of writing an R script and submitting it many times in parallel as separate jobs, or interactively from within R again using the furrr / future framework. The interactive route is more complicated to set up than parallelization across cores.

On a cluster using the Slurm job scheduler, the following command will set up parallelization across compute nodes.

library(future.batchtools)
plan(tweak(batchtools_slurm, workers = 50,
           resources = list(ncpus = 1, memory = 1024,  # 1024 MB per job
                            walltime = 10 * 60 * 60, partition = 'short')))

It specifies that up to 50 jobs should be run at a time, with each one requesting one CPU, 1024 MB of memory, and 10 hours on the partition called short. This requires the future.batchtools R package and a batchtools template file in the working directory, see this example template.

One detail matters if you request more than one CPU per job. The kernel’s thread budget is worked out in the submitting session from its own availableCores(), not from each job’s ncpus. With the ncpus = 1 example above the kernel correctly runs single-threaded in every job, but requesting ncpus = 4 will not by itself make the kernel use four threads. The scaling here comes from spreading the 50 jobs across nodes, not from threading within a job.

When parallelization doesn’t help

  • A single f3 or f4 computation on one machine. The genotype-path kernels already use all your cores under plan(sequential), so wrapping one qpadm or qp3pop in a future plan usually adds overhead without adding speed. extract_f2 does not use the future layer at all; to give it more cores, raise its n_cores argument rather than setting a future plan. Reach for the future layer when you have many independent units, such as a sweep of models or many resamples, or when you are spreading work across nodes.
  • Tiny workloads. For a 5-pop / 10-popcomb / 100-block extract_f2, the sequential version finishes in under a second. The worker-spawn overhead of multisession can make parallel slower than serial on tiny inputs.
  • Memory-constrained machines. multisession workers each hold their own copy of the input data after R serializes-and-sends it. For a 100 GB f2 cache and 8 workers, you’d need ~100 GB x 8 of headroom. Use multicore (copy-on-write) on Linux, fewer workers, or stay sequential.