LEGOFIT estimates
population history from the frequencies of nucleotide site patterns. It
takes a description of an admixture graph (a .lgo file) and
a file of observed site-pattern frequencies, then fits the free
parameters of the graph (split times, population sizes, admixture
fractions) to the data.
ADMIXTOOLS 2 can now hand graphs to LEGOFIT and read fitted results
back. This article walks the full loop: an ADMIXTOOLS edge tibble is
written out to a .lgo file, LEGOFIT fits it, and the fitted
parameters come back as an edge tibble you can keep working with.
graph_to_lgo() provides the write direction and
read_legofit_output() the read direction, with
run_legofit() as a thin convenience wrapper over the
external binary.
LEGOFIT itself is an external command-line tool, not an R dependency. The chunks below that would invoke it are shown but not run; their results are loaded from small precomputed files that ship with the package, so this vignette builds anywhere.
We start from a small three-leaf tree with no admixture. The leaves
are x, y, and z; xy
is the ancestor of x and y, and the root
xyz is the ancestor of xy and z.
The time column gives the length of each edge;
graph_to_lgo() accumulates those lengths from the leaves
upward to derive the absolute node ages (leaves at the present, the root
oldest).
demo_graph <- tibble::tribble(
~from, ~to, ~type, ~weight, ~time,
"xyz", "xy", "normal", NA_real_, 1.5,
"xyz", "z", "normal", NA_real_, 2,
"xy", "x", "normal", NA_real_, 0.5,
"xy", "y", "normal", NA_real_, 0.5
)
demo_graph
#> # A tibble: 4 × 5
#> from to type weight time
#> <chr> <chr> <chr> <dbl> <dbl>
#> 1 xyz xy normal NA 1.5
#> 2 xyz z normal NA 2
#> 3 xy x normal NA 0.5
#> 4 xy y normal NA 0.5graph_to_lgo() turns the tibble into LEGOFIT’s
.lgo grammar: a block of time and population-size parameter
declarations, one segment per node, and the
derive relationships that wire the topology together.
cat(graph_to_lgo(demo_graph), sep = "\n")
#> ## .lgo generated by admixtools::graph_to_lgo()
#>
#> ### Parameters
#> time fixed T_z=0
#> time fixed T_x=0
#> time fixed T_y=0
#> time free T_xyz=2
#> time free T_xy=0.5
#> twoN fixed one=1
#>
#>
#> ### Segments
#> segment xyz t=T_xyz twoN=one
#> segment xy t=T_xy twoN=one
#> segment z t=T_z twoN=one samples=1
#> segment x t=T_x twoN=one samples=1
#> segment y t=T_y twoN=one samples=1
#>
#> ### Relationships
#> derive xy from xyz
#> derive z from xyz
#> derive x from xy
#> derive y from xyLeaf times are emitted as time fixed (all 0
here, the present) and the internal split times as
time free for the optimizer. The free values shown are the
starting points derived from the edge lengths: T_xy=0.5
matches the xy-to-x edge, and
T_xyz=2 the xyz-to-z edge above
it. Each leaf segment carries samples=1.
With a .lgo file and a site-pattern file you call
LEGOFIT. run_legofit() wraps the external binary: it writes
the graph to a temporary .lgo, invokes
legofit, and parses the fitted output back through
read_legofit_output(). The site-pattern file used here
(demo.opf) is LEGOFIT’s bundled ex1 example,
vendored alongside the cached output.
The next chunk is not evaluated when the vignette builds, because it
needs the external legofit binary. It is the exact call
that produced the cached result shown below.
fit <- run_legofit(
demo_graph,
patterns = ext("demo.opf"),
args = c("--threads", "1", "-1", "-d", "0")
)Instead of running it, we load the cached result, which is exactly
what that run_legofit() call returns:
fit <- readRDS(ext("demo_fit.rds"))
fit
#> # A tibble: 4 × 9
#> from to type weight time admix_event_time twoN admix_prop admix_event
#> <chr> <chr> <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <chr>
#> 1 xyz xy normal NA 1.00 NA 1 NA <NA>
#> 2 xyz z normal NA 0 NA 1 NA <NA>
#> 3 xy x normal NA 0 NA 1 NA <NA>
#> 4 xy y normal NA 0 NA 1 NA <NA>read_legofit_output() returns the input edges with
fitted values filled in. The time column now holds the
LEGOFIT-fitted age of each edge’s younger endpoint; twoN
holds the effective population size (fixed at 1 throughout
this demo, so every value is 1 here; a model with free
twoN would show the fitted sizes instead). For an admixture
graph, admix_prop and admix_event_time would
carry the fitted mixture fractions and admixture-event times (there is
no admixture in this demo, so those are NA).
Two pieces of fit metadata ride along as attributes.
node_times is the full set of fitted node ages, including
the root, which has no incoming edge and so does not appear in the
time column:
fit_convergence records how the optimizer terminated.
status is "reached_goal" when LEGOFIT
converged, along with the final cost and the swarm spread:
attr(fit, "fit_convergence")
#> $status
#> [1] "reached_goal"
#>
#> $cost
#> [1] 2.75248e-08
#>
#> $spread
#> [1] 2.5221e-06A third attribute, identifiability, flags which fitted
values the site-pattern data can actually constrain (for example,
admixture-event times are structurally non-identifiable). See
?read_legofit_output for the full classification.
The write direction is only half of the interoperability story.
LEGOFIT .lgo files written elsewhere can be read straight
into an ADMIXTOOLS edge tibble with read_lgo(). Here we
read a real archaic-admixture model that ships as a cached fixture: 18
nodes and 21 edges, four admixture events, and two sampled internal
nodes.
rha20 <- read_lgo(ext("rha20.lgo"))
nrow(rha20)
#> [1] 21
head(rha20)
#> # A tibble: 6 × 4
#> from to type weight
#> <chr> <chr> <chr> <dbl>
#> 1 v n normal NA
#> 2 a v normal NA
#> 3 av a2 normal NA
#> 4 nd av normal NA
#> 5 nd d2 normal NA
#> 6 xy x normal NAread_lgo() is pure R and needs no external tool, so this
chunk runs at build time. The file parses into its full topology. Two of
its nodes are sampled internal nodes: LEGOFIT allows a
segment to be both an internal ancestor and a sampled
population, which a leaves-only representation cannot express.
When a downstream analysis needs every sampled population to sit on
its own tip, add_sampled_tips() splits each sampled
internal node by attaching a new leaf for the sample, leaving the
internal node purely structural. Here it adds the tips for this graph’s
sampled internal nodes:
This is the bridge to qpgraph() and the rest of the
ADMIXTOOLS graph machinery, which expect sampled populations at the
tips.
LEGOFIT is an external tool and is not installed with ADMIXTOOLS 2.
Install it from https://github.com/alanrogers/legofit and put the
legofit binary on your PATH, or pass its
location to run_legofit(bin = ...). When the binary cannot
be found, run_legofit() fails with a clear, typed condition
(legofit_binary_not_found) rather than a raw system error,
so calling code can detect and handle the missing tool.