--- title: "Application with palaeodata" output: rmarkdown::html_vignette #output: rmarkdown::pdf_document vignette: > %\VignetteIndexEntry{Application with palaeodata} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} --- ```{r, include = FALSE} knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) # set options to limit threads used by imported libraries options(cores=2) options(mc.cores=2) # xgboost uses data.table data.table::setDTthreads(2) ``` # SDMs with tidymodels for palaeo data In this article, we show how a Species Distribution Model can be fitted with `tidysdm` on time-scattered (i.e.palaeontological, archaeozoological, archaeological) data, with samples covering different time periods. We recommend users first read the "tidysdm overview" article, which introduces a number of functions and concepts that will be used in the present article. We first load `tidysdm`: ```{r} library(tidysdm) ``` # Preparing your data We start by loading a set of radiocarbon dates (calibrated) for horses, covering from 22k years ago until 8k years ago. ```{r load_presences} data(horses) horses ``` We convert our dataset into an `sf` data.frame so that we can easily plot it (here `tidyterra` shines): ```{r} library(sf) horses <- st_as_sf(horses, coords = c("longitude", "latitude")) st_crs(horses) <- 4326 ``` As a background to our presences, we will use the land mask for the present, taken from `pastclim`, and cut to cover only Europe: ```{r echo=FALSE, results='hide'} library(pastclim) set_data_path(on_CRAN = TRUE) ``` ```{r} library(pastclim) land_mask <- pastclim::get_land_mask(time_bp = 0, dataset = "Example") europe_poly <- vect(region_outline$Europe) crs(europe_poly) <- "lonlat" land_mask <- crop(land_mask, europe_poly) land_mask <- mask(land_mask, europe_poly) ``` And use `tidyterra` to plot: ```{r cast_to_sf, fig.width=6, fig.height=4} library(tidyterra) ggplot() + geom_spatraster(data = land_mask, aes(fill = land_mask_0)) + geom_sf(data = horses, aes(col = time_bp)) ``` We now thin our presences, so that locations are further than 100km and 2000 years apart. ```{r thin_by_dist} set.seed(123) horses <- thin_by_dist_time(horses, dist_min = km2m(100), interval_min = y2d(2000), time_col = "time_bp", lubridate_fun = pastclim::ybp2date ) nrow(horses) ``` And see what we have left: ```{r plot_thinned, fig.width=6, fig.height=4} ggplot() + geom_spatraster(data = land_mask, aes(fill = land_mask_0)) + geom_sf(data = horses, aes(col = time_bp)) ``` We now need a time series of palaeoclimate reconstructions. In this vignette, we will use the example dataset from `pastclim`. This dataset only has reconstructions every 5k years for the past 20k years at 1 degree resolution, with 3 bioclimatic variables. It will suffice for illustrative purposes, but we recommend that you download higher quality datasets with `pastclim` for real analysis. 
As for the land mask, we will cut the reconstructions to cover Europe only:

```{r load_climate}
library(pastclim)
climate_vars <- c("bio01", "bio10", "bio12")
climate_full <- pastclim::region_series(
  bio_variables = climate_vars,
  data = "Example",
  crop = region_outline$Europe
)
```

Now we thin the observations to only keep one per cell in the raster (it would be better if we had an equal-area projection...), and remove locations outside the desired area (if there were any):

```{r thin_by_cell}
set.seed(123)
horses <- thin_by_cell_time(horses,
  raster = climate_full,
  time_col = "time_bp",
  lubridate_fun = pastclim::ybp2date
)
nrow(horses)
```

Let's see what we have left of our points:

```{r, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = time_bp))
```

Now we sample pseudo-absences (we will constrain them to be at least 70 km away from any presences), selecting three times the number of presences:

```{r}
set.seed(123)
horses <- sample_pseudoabs_time(horses,
  n_per_presence = 3,
  raster = climate_full,
  time_col = "time_bp",
  lubridate_fun = pastclim::ybp2date,
  method = c("dist_min", km2m(70))
)
```

Let's see our presences and absences:

```{r, fig.width=6, fig.height=4}
ggplot() +
  geom_spatraster(data = land_mask, aes(fill = land_mask_0)) +
  geom_sf(data = horses, aes(col = class))
```

Now let's get the climate for these locations. `pastclim` requires a data frame with two columns of coordinates and a column of time in years before present (where negative values represent time in the past). We manipulate the `sf` object accordingly:

```{r climate_for_locations}
horses_df <- horses %>%
  dplyr::bind_cols(sf::st_coordinates(horses)) %>%
  mutate(time_bp = date2ybp(time_step)) %>%
  as.data.frame() %>%
  select(-geometry)
# get climate
horses_df <- location_slice_from_region_series(horses_df,
  region_series = climate_full
)

# add the climate reconstructions to the sf object, and remove the time_step
# as we don't need it for modelling
horses <- horses %>%
  bind_cols(horses_df[, climate_vars]) %>%
  select(-time_step)
```

# Fit the model by cross-validation

Next, we need to set up a `recipe` to define how to handle our dataset. We don't want to transform our data, so we just need to define the formula (*class* is the outcome, all other variables are predictors; note that, for `sf` objects, `geometry` is automatically ignored as a predictor):

```{r recipe}
horses_rec <- recipe(horses, formula = class ~ .)
horses_rec
```

We can quickly check that we have the variables that we want with:

```{r}
horses_rec$var_info
```

We now build a `workflow_set` of different models, defining which hyperparameters we want to tune. We will use *glm*, *gam*, *random forest* and *boosted trees* as our models; of these, only *random forest* and *boosted trees* have hyperparameters to tune. For the most commonly used models, `tidysdm` automatically chooses the most important parameters, but it is possible to fully customise the model specifications, as sketched below.
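
As a minimal sketch of such customisation, a plain `parsnip` specification can be built by hand and dropped into the `models` list of the `workflow_set()` call that follows. The object name `custom_rf_spec` and the fixed `trees = 1000` are illustrative choices, not defaults of `tidysdm`:

```{r custom_spec, eval=FALSE}
library(dplyr)
library(parsnip)
library(tune)

# A hand-rolled random forest specification: tune mtry only, fix the number
# of trees, and set the mode to classification so it can replace one of the
# sdm_spec_*() presets in the workflow_set below.
custom_rf_spec <- rand_forest(mtry = tune(), trees = 1000) %>%
  set_engine("ranger") %>%
  set_mode("classification")
```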
```{r workflow_set}
horses_models <-
  # create the workflow_set
  workflow_set(
    preproc = list(default = horses_rec),
    models = list(
      # the standard glm specs (no params to tune)
      glm = sdm_spec_glm(),
      # the standard gam specs (no params to tune)
      gam = sdm_spec_gam(),
      # rf specs with tuning
      rf = sdm_spec_rf(),
      # boosted tree model (gbm) specs with tuning
      gbm = sdm_spec_boost_tree()
    ),
    # make all combinations of preproc and models
    cross = TRUE
  ) %>%
  # set formula for gams
  update_workflow_model("default_gam",
    spec = sdm_spec_gam(),
    formula = gam_formula(horses_rec)
  ) %>%
  # tweak controls to store information needed later to create the ensemble
  option_add(control = control_ensemble_grid())
```

Note that *gams* are unusual, as we need to specify a formula to define which variables smooths will be fitted to. By default, `gam_formula()` fits a smooth to every continuous predictor, but a custom formula can be provided instead.

We now want to set up a spatial block cross-validation scheme to tune and assess our models:

```{r training_cv, fig.width=6, fig.height=4}
library(tidysdm)
set.seed(1005)
horses_cv <- spatial_block_cv(horses, v = 5)
autoplot(horses_cv)
```

We can now use the block CV folds to tune and assess the models:

```{r tune_grid}
set.seed(123)
horses_models <- horses_models %>%
  workflow_map("tune_grid",
    resamples = horses_cv, grid = 5,
    metrics = sdm_metric_set(), verbose = TRUE
  )
```

Note that `workflow_set` correctly detects that we have no tuning parameters for *glm* and *gam*. We can have a look at the performance of our models with:

```{r, fig.width=6, fig.height=4}
autoplot(horses_models)
```

Now let's create an ensemble, selecting the best set of parameters for each model (this is really only relevant for the random forest and boosted trees, as there were no hyperparameters to tune for the glm and gam). We will use the Boyce continuous index as our metric to choose the best random forest and boosted tree. When adding members to an ensemble, they are automatically fitted to the full training dataset, and so are ready to make predictions.

```{r}
horses_ensemble <- simple_ensemble() %>%
  add_member(horses_models, metric = "boyce_cont")
```

And visualise it:

```{r, fig.width=6, fig.height=4}
autoplot(horses_ensemble)
```

# Projecting to other times

We can now make predictions with this ensemble (using the default option of taking the mean of the predictions from each model) for the Last Glacial Maximum (LGM, ~21,000 years ago); here we use the closest time slice available in the Example dataset, 20,000 years ago:

```{r get_lgm}
climate_lgm <- pastclim::region_slice(
  time_bp = -20000,
  bio_variables = climate_vars,
  data = "Example",
  crop = region_outline$Europe
)
```

And predict using the ensemble:

```{r plot_lgm, fig.width=6, fig.height=4}
prediction_lgm <- predict_raster(horses_ensemble, climate_lgm)
ggplot() +
  geom_spatraster(data = prediction_lgm, aes(fill = mean)) +
  scale_fill_terrain_c()
```
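
The ensemble can be projected onto any time slice for which reconstructions exist. As a companion to the LGM map, the sketch below reuses the same functions already shown above (`pastclim::region_slice()` and `predict_raster()`) to project to the present (0 BP); the object names `climate_present` and `prediction_present` are just illustrative:

```{r plot_present, fig.width=6, fig.height=4}
# extract the present-day (0 BP) climate slice for the same variables and region
climate_present <- pastclim::region_slice(
  time_bp = 0,
  bio_variables = climate_vars,
  data = "Example",
  crop = region_outline$Europe
)

# predict with the ensemble and plot, as done for the LGM above
prediction_present <- predict_raster(horses_ensemble, climate_present)
ggplot() +
  geom_spatraster(data = prediction_present, aes(fill = mean)) +
  scale_fill_terrain_c()
```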