ADMIXTOOLS 2 parallelizes
on two levels that work together, and on a single machine you usually do
not have to manage either one. The f3 and f4 statistic kernels on the
genotype path already use all of your cores, so for one
qpadm or one qp3pop you can simply call the
function. Reach for a future plan when you have many
independent units of work, a sweep of models or a batch of resamples, or
when you want to spread work across the nodes of a cluster.
The two levels are these.
The f-statistic kernels on the genotype path, the inner loops that
turn genotypes into allele frequencies
(cpp_gmat_to_aftable) and allele frequencies into f3 and f4
statistics (cpp_aftable_to_dstat*), are parallelized in C++
with OpenMP. This is on by default and uses all available cores even
when you have set up no parallelism in R. OpenMP is an optional build
dependency, so if the package was compiled without it these kernels run
on a single core and everything still works. The f2 kernels that
extract_f2 uses to build an f2 cache
(cpp_mats_to_f2_arr, cpp_outer_array_*) are
not threaded by OpenMP; extract_f2 instead parallelizes
across population-pair chunks in R through doParallel when
you raise its n_cores argument, and runs on a single core
by default.
The coarse-grained outer work, namely SNP blocks, sweeps of many
models, and bootstrap resamples, is parallelized in R with the future /
furrr
framework. The default plan is sequential, so this outer layer stays in
a single process until you opt in with one line.
Switch back with plan(sequential). The same
plan() call applies to every admixtools
function that supports the outer layer, and there are no per-function
flags to toggle.
The two layers share one machine without fighting over it. Under
plan(multisession, workers = 4) on a 16-core machine each
worker is handed 4 kernel threads, so the total stays at 16. The budget
is availableCores() divided by the number of workers,
computed once in the main process and passed down, so the block layer
and the OpenMP layer never multiply out into oversubscription.
Most of these functions parallelize their outer work through
future, grouped by what that outer work is.
extract_f2 is the exception, and is described separately
below.
Across SNP blocks
f4blockdat_from_geno spreads its SNP
blocks across future workers. The per-block f3 and f4 work
inside it is already threaded by OpenMP, so on one machine
plan(sequential) is not idle, it is running the kernels on
every core. The block layer adds throughput mainly when you have spare
cores beyond what a single block’s kernel can keep busy, or when you
spread blocks across nodes.qpadm,
qp3pop,
qp4ratio, and
qpfstats call
f4blockdat_from_geno when the input is a genotype-file
prefix, so they inherit the block layer transparently.Across population-pair chunks
extract_f2 does not use the
future layer. It splits the population set into
memory-sized chunks and runs them through doParallel
(foreach) when you pass n_cores > 1; the
default n_cores = 1 runs serially, and its f2 kernels are
single-threaded, so by default extract_f2 is single-core.
Raise n_cores to use more cores, or compute the f2 cache
once and reuse it. The one exception is
extract_f2(..., qpfstats = TRUE), which routes through
f4blockdat_from_geno and so picks up the OpenMP kernels and
the future block layer above.Across models
qpadm_multi and
qpadm_sweep evaluate many models in
parallel. With a genotype-file prefix the shared f4 computation runs
once at the top level, correctly budgeted, and the per-model fits reuse
it, so the per-model layer and the kernel layer do not stack into
oversubscription. You no longer have to hand-tune which layer gets the
cores.Across resamples
qpgraph_resample_snps and
qpgraph_resample_snps2 parallelize across
resamples.find_graphs_old parallelizes its
independent repeats (with parallel = TRUE, the
default).Across cache reads
read_f2 parallelizes the per-pair
.rds reads when loading a precomputed f2 cache. Useful on
slow disks or NFS where I/O latency dominates.Not parallelized:
find_graphs (the newer fitter, fast
enough single-threaded that parallel overhead would dominate; if you
want N independent runs in parallel, wrap it yourself in
furrr::future_map(1:N, ~find_graphs(...))).qpgraph itself (single-graph
fit).This is the old multisession versus
multicore question, with one change. Because the OpenMP
kernels already use every core under plan(sequential),
adding a future plan no longer multiplies throughput on a
single f-statistic computation the way it did when the kernels were
serial. A dense f4 computation over 8 populations (about 1,680
population quadruples) on a 32-core machine landed like this.
| Plan | Wall time | Relative to plan(sequential) |
|---|---|---|
plan(sequential) |
~17 s | baseline, kernels use all 32 cores |
plan(multisession, workers = 4) |
~25 s | 0.7x, worker startup and data copying cost more than they save |
plan(multicore, workers = 4) |
~14 s | 1.2x |
So the plan you pick now matters most for memory and for work that has many independent units, not for speeding up a single computation.
multisession runs its workers as separate R processes
talking over sockets. It is the portable default and works everywhere,
including Windows. multicore runs its workers as forked
processes that share memory copy-on-write, which skips the startup and
copying cost and is a little faster on Linux, but it is unavailable on
Windows and unsafe inside RStudio. Forking is safe alongside the OpenMP
kernels because the thread pool is drained across the fork, so each
forked worker starts single-threaded and builds its own pool.
If you’re on Linux and not in RStudio:
Otherwise:
By default the kernels size their thread team from
parallelly::availableCores() divided by the number of
future workers. Three levers narrow that:
OMP_NUM_THREADS caps the team size.
OMP_NUM_THREADS=1 forces the kernels to run
single-threaded; any value limits each kernel call to at most that many
threads. admixtools reads this variable explicitly and applies it as a
cap, because parallelly::availableCores() does not read it
and the kernels set num_threads directly (which OpenMP
gives precedence over the environment variable).options(admixtools.omp_num_threads = N)
is the R-native equivalent and takes precedence over the environment
variable.availableCores()
reports, for example with options(mc.cores = 8) or
the variables described in ?parallelly::availableCores, or
running under a plan with more workers, which divides the budget
further.Sometimes it makes more sense to parallelize across compute nodes
rather than across cores. This can be done either in the traditional way
of writing an R script and submitting it many times in parallel as
separate jobs, or interactively from within R again using the
furrr / future framework. The interactive
route is more complicated to set up than parallelization across
cores.
On a cluster using the Slurm job scheduler, the following command will set up parallelization across compute nodes.
library(future.batchtools)
plan(tweak(batchtools_slurm, workers = 50,
resources = list(ncpus = 1, memory = 1024, # 1024 MB per job
walltime = 10 * 60 * 60, partition = 'short')))It specifies that up to 50 jobs should be run at a time, with each
one requesting one CPU, 1024 MB of memory, and 10 hours on the partition
called short. This requires the future.batchtools
R package and a batchtools template file in the working directory, see
this
example template.
One detail matters if you request more than one CPU per job. The
kernel’s thread budget is worked out in the submitting session from its
own availableCores(), not from each job’s
ncpus. With the ncpus = 1 example above the
kernel correctly runs single-threaded in every job, but requesting
ncpus = 4 will not by itself make the kernel use four
threads. The scaling here comes from spreading the 50 jobs across nodes,
not from threading within a job.
plan(sequential), so wrapping one qpadm or
qp3pop in a future plan usually adds overhead
without adding speed. extract_f2 does not use the
future layer at all; to give it more cores, raise its
n_cores argument rather than setting a future
plan. Reach for the future layer when you have many
independent units, such as a sweep of models or many resamples, or when
you are spreading work across nodes.extract_f2, the sequential version finishes in
under a second. The worker-spawn overhead of multisession
can make parallel slower than serial on tiny inputs.multisession workers each hold their own copy of the input
data after R serializes-and-sends it. For a 100 GB f2 cache and 8
workers, you’d need ~100 GB x 8 of headroom. Use multicore
(copy-on-write) on Linux, fewer workers, or stay sequential.