Title: Species Distribution Models with Tidymodels
Description: Fit species distribution models (SDMs) using the 'tidymodels' framework, which provides a standardised interface to define models and process their outputs. 'tidysdm' expands 'tidymodels' by providing methods for spatial objects, models and metrics specific to SDMs, as well as a number of specialised functions to process occurrences for contemporary and palaeo datasets. The full functionalities of the package are described in Leonardi et al. (2023) <doi:10.1101/2023.07.24.550358>.
Authors: Michela Leonardi [aut], Margherita Colucci [aut], Andrea Vittorio Pozzi [aut], Eleanor M.L. Scerri [aut], Ben Tupper [ctb], Andrea Manica [aut, cre]
Maintainer: Andrea Manica <[email protected]>
License: AGPL (>= 3)
Version: 0.9.6.9004
Built: 2024-11-18 22:28:40 UTC
Source: https://github.com/EvolEcolGroup/tidysdm
This function adds member(s) to a simple_ensemble() object, taking the best member from each workflow provided. It is possible to pass individual tune_results objects from a tuned workflow, or a workflowsets::workflow_set().
add_member(x, member, ...)

## Default S3 method:
add_member(x, member, ...)

## S3 method for class 'tune_results'
add_member(x, member, metric = NULL, id = NULL, ...)

## S3 method for class 'workflow_set'
add_member(x, member, metric = NULL, ...)
x: a simple_ensemble to which member(s) will be added
member: a tune_results object, or a workflowsets::workflow_set()
...: not used at the moment.
metric: A character string (or NULL) for which metric to optimize. If NULL, the first metric is used.
id: the name to be given to this workflow in the ensemble.
a simple_ensemble with additional member(s)
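A minimal sketch of building an ensemble, reusing the two_class_res tuning results that the autoplot() example further below also relies on:
two_class_ens <- simple_ensemble() %>%
  add_member(two_class_res, metric = "roc_auc")
two_class_ens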
This function adds repeat(s) to a repeat_ensemble object, where each repeat is a simple_ensemble. All repeats must contain the same members, selected using the same metric.
add_repeat(x, rep, ...)

## Default S3 method:
add_repeat(x, rep, ...)

## S3 method for class 'simple_ensemble'
add_repeat(x, rep, ...)

## S3 method for class 'list'
add_repeat(x, rep, ...)
x: a repeat_ensemble to which repeat(s) will be added
rep: a repeat, as a single simple_ensemble, or a list of simple_ensemble objects
...: not used at the moment.
a repeat_ensemble with additional repeat(s)
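A sketch of assembling repeats, where ens1 and ens2 stand in for simple_ensemble objects built with the same members and metric:
rep_ens <- repeat_ensemble() %>%
  add_repeat(list(ens1, ens2))
rep_ens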
This autoplot() method plots performance metrics that have been ranked using a metric.
## S3 method for class 'simple_ensemble'
autoplot(
  object,
  rank_metric = NULL,
  metric = NULL,
  std_errs = stats::qnorm(0.95),
  ...
)
object: a simple_ensemble object
rank_metric: A character string for which metric should be used to rank the results. If none is given, the first metric in the metric set is used (after filtering by the metric argument).
metric: A character vector for which metrics (apart from rank_metric) should be included in the visualization.
std_errs: The number of standard errors to plot (if the standard error exists).
...: Other options to pass to autoplot(). Currently unused.
This function is intended to produce a default plot to visualize helpful information across all possible applications of a simple_ensemble. More sophisticated plots can be produced using standard ggplot2 code for plotting.
The x-axis is the workflow rank in the set (a value of one being the best) versus the performance metric(s) on the y-axis. With multiple metrics, there will be facets for each metric, with the rank_metric first (if any was provided; otherwise the metric used to create the simple_ensemble will be used).
If multiple resamples are used, confidence bounds are shown for each result (95% confidence, by default).
A ggplot object.
# we use the two_class_example from `workflowsets`
two_class_ens <- simple_ensemble() %>%
  add_member(two_class_res, metric = "roc_auc")
autoplot(two_class_ens)
This method provides a convenient visualization of a spatial initial rsplit.
## S3 method for class 'spatial_initial_split'
autoplot(object, ..., alpha = 0.6)
object: A spatial_initial_split object.
...: Options passed to ggplot2::geom_sf().
alpha: Opacity, passed to ggplot2::geom_sf().
This plot method is a wrapper around the standard spatial_rsplit method, but it re-labels the folds as Testing and Training, following the convention for a standard initial_split object.
A ggplot object with each fold assigned a color, made using ggplot2::geom_sf().
set.seed(123)
block_initial <- spatial_initial_split(boston_canopy,
  prop = 1 / 5, spatial_block_cv
)
autoplot(block_initial)
Convert an object created with blockCV to an rsample object
This function converts objects created with blockCV into rsample objects that can be used by tidysdm. blockCV provides more sophisticated sampling options than the spatialsample library. For example, it is possible to stratify the sampling to ensure that presences and absences are evenly distributed among the folds (see the example below).
blockcv2rsample(x, data)
x: an object created with a blockCV function
data: the sf::sf object used to create the blockCV folds
Note that currently only objects of type cv_spatial and cv_cluster are supported.
an rsample object
library(blockCV)
points <- read.csv(system.file("extdata/", "species.csv", package = "blockCV"))
pa_data <- sf::st_as_sf(points, coords = c("x", "y"), crs = 7845)
sb1 <- cv_spatial(
  x = pa_data,
  column = "occ", # the response column to balance the folds
  k = 5, # number of folds
  size = 350000, # size of the blocks in metres
  selection = "random", # random blocks-to-fold
  iteration = 10 # find evenly dispersed folds
)
sb1_rsample <- blockcv2rsample(sb1, pa_data)
class(sb1_rsample)
autoplot(sb1_rsample)
This function computes the Boyce Continuous Index, a measure of model accuracy appropriate for Species Distribution Models with presence-only data (i.e. using pseudoabsences or background). The algorithm used here comes from the package enmSdm, and uses multiple overlapping windows.
boyce_cont(data, ...)

## S3 method for class 'data.frame'
boyce_cont(
  data,
  truth,
  ...,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL
)

## S3 method for class 'sf'
boyce_cont(data, ...)

boyce_cont_vec(
  truth,
  estimate,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL,
  ...
)
data: Either a data.frame containing the columns specified by the truth and estimate arguments, or a table/matrix where the true class results should be in the columns of the table.
...: A set of unquoted column names or one or more dplyr selector functions to choose which variables contain the class probabilities. If truth is binary, only 1 column should be selected, and it should correspond to the value of event_level. Otherwise, there should be as many columns as factor levels of truth and the ordering of the columns should be the same as the factor levels of truth.
truth: The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector.
estimator: One of "binary", "hand_till", "macro", or "macro_weighted" to specify the type of averaging to be done. "binary" is only relevant for the two class case. The others are general methods for calculating multiclass metrics. The default will automatically choose "binary" if truth is binary, "hand_till" if truth has >2 levels and case_weights isn't specified, or "macro" if truth has >2 levels and case_weights is specified (in which case "hand_till" isn't well-defined).
na_rm: A logical value indicating whether NA values should be stripped before the computation proceeds.
event_level: A single string. Either "first" or "second" to specify which level of truth to consider as the "event". This argument is only applicable when estimator = "binary". The default uses an internal helper that generally defaults to "first".
case_weights: The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data. For _vec() functions, a numeric vector.
estimate: If truth is binary, a numeric vector of class probabilities corresponding to the "relevant" class. Otherwise, a matrix with as many columns as factor levels of truth. It is assumed that these are in the same order as the levels of truth.
There is no multiclass version of this function; it only operates on binary predictions (e.g. presences and absences in SDMs).
A tibble with columns .metric, .estimator, and .estimate and 1 row of values. For grouped data frames, the number of rows returned will be the same as the number of groups.
Boyce, M.S., P.R. Vernier, S.E. Nielsen and F.K.A. Schmiegelow. 2002. Evaluating resource selection functions. Ecol. Model., 157, 281-300.
Hirzel, A.H., G. Le Lay, V. Helfer, C. Randin and A. Guisan. 2006. Evaluating the ability of habitat suitability models to predict species presences. Ecol. Model., 199, 142-152.
Other class probability metrics: kap_max(), tss_max()
boyce_cont(two_class_example, truth, Class1)
Predict for a new dataset by using a simple ensemble. Predictions from individual models are combined according to fun.
calib_class_thresh(object, class_thresh, metric_thresh = NULL)
object: a simple_ensemble object
class_thresh: probability threshold used to convert probabilities into classes. It can be a number (between 0 and 1), or a character metric (currently "tss_max", "kap_max" or "sensitivity"). For sensitivity, an additional target value is passed along as a second element of a vector, e.g. c("sensitivity", 0.8).
metric_thresh: a vector of length 2 giving a metric and its threshold, which will be used to prune which models in the ensemble will be used for the prediction. The 'metrics' need to have been computed when the workflow was tuned. The metric's threshold needs to match the value used during prediction. Examples are c("accuracy", 0.8) or c("boyce_cont", 0.7).
a simple_ensemble object
test_ens <- simple_ensemble() %>%
  add_member(two_class_res[1:3, ], metric = "roc_auc")
test_ens <- calib_class_thresh(test_ens, class_thresh = "tss_max")
test_ens <- calib_class_thresh(test_ens, class_thresh = "kap_max")
test_ens <- calib_class_thresh(test_ens, class_thresh = c("sens", 0.9))
In tidysdm, the string defining presences should be the first level of the response factor. This function checks that the column is correctly formatted.
check_sdm_presence(.data, .col, presence_level = "presence")
.data: a data.frame or tibble (or an sf::sf object) containing the data
.col: the column containing the presences
presence_level: the string used to define the presence level of .col
TRUE if correctly formatted
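A minimal sketch using the lacerta dataset shipped with tidysdm (assuming its class column uses "presence" as the first level):
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
  package = "tidysdm"
))
check_sdm_presence(lacerta_thin, class)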
Check the balance of presences vs pseudoabsences among splits
check_splits_balance(splits, .col)
splits: the data splits (an rset or split object)
.col: the column containing the presences
a tibble of the number of presences and pseudoabsences in the assessment and analysis set of each split (or training and testing in an initial split)
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
  package = "tidysdm"
))
lacerta_cv <- spatial_block_cv(lacerta_thin, v = 5)
check_splits_balance(lacerta_cv, class)
This function clamps the environmental variables in a terra::SpatRaster or terra::SpatRasterDataset so that their minimum and maximum values do not exceed the range in the training dataset.
clamp_predictors(x, training, .col, use_na)

## Default S3 method:
clamp_predictors(x, training, .col, use_na)

## S3 method for class 'stars'
clamp_predictors(x, ...)

## S3 method for class 'SpatRaster'
clamp_predictors(x, training, .col, use_na = FALSE)

## S3 method for class 'SpatRasterDataset'
clamp_predictors(x, training, .col, use_na = FALSE)
x: a terra::SpatRaster or terra::SpatRasterDataset to clamp
training: the training dataset (a data.frame or sf::sf object)
.col: the column containing the presences (optional). If specified, it is excluded from the clamping.
use_na: a boolean determining whether values outside the range of the training dataset are removed (set to NA). If FALSE (the default), values outside the training range are replaced with the extremes of the training range.
...: additional arguments specific to a given object type
a terra::SpatRaster or terra::SpatRasterDataset clamped to the ranges in training
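A sketch of clamping a raster of predictors before projecting to new regions or times; here climate_future is a hypothetical terra::SpatRaster whose layers match the predictor columns of lacerta_thin:
climate_clamped <- clamp_predictors(climate_future,
  training = lacerta_thin, # the sf training data
  .col = class # exclude the response column from clamping
)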
Return a tibble of performance metrics for all models.
## S3 method for class 'simple_ensemble'
collect_metrics(x, ...)

## S3 method for class 'repeat_ensemble'
collect_metrics(x, ...)
x: A simple_ensemble or repeat_ensemble object
...: Not currently used.
When applied to an ensemble, the metrics that are returned do not contain the actual tuning parameter columns and values (unlike when these collect functions are run on other objects). The reason is that ensembles contain different types of models or models with different tuning parameters.
A tibble.
collect_metrics(lacerta_ensemble)
collect_metrics(lacerta_rep_ens)
Supply these light wrappers as the control argument in a tune::tune_grid(), tune::tune_bayes(), or tune::fit_resamples() call to return the needed elements for use in an ensemble. These functions return the appropriate control grid to ensure that assessment set predictions and information on model specifications and preprocessors are supplied in the resampling results object.
To integrate ensemble settings with your existing control settings, note that these functions just call the appropriate tune::control_* function with the arguments save_pred = TRUE, save_workflow = TRUE.
These wrappers are equivalent to the ones used in the stacks package.
control_ensemble_grid()
control_ensemble_resamples()
control_ensemble_bayes()
A tune::control_grid, tune::control_bayes, or tune::control_resamples object.
See the vignettes for examples of these functions used in context.
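A sketch of how a wrapper slots into a tuning call; my_workflow and my_cv are hypothetical placeholders for a workflow and an rset of resamples:
tune_res <- tune::tune_grid(
  my_workflow,
  resamples = my_cv,
  grid = 5,
  control = control_ensemble_grid() # keep predictions and workflows
)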
For each environmental variable, this function computes the density functions of presences and absences and returns (1-overlap), which is a measure of the distance between the two distributions. Variables with a high distance are good candidates for SDMs, as species occurrences are confined to a subset of the available background.
dist_pres_vs_bg(.data, .col)
.data: a data.frame (or sf::sf object) containing the presences and the environmental variables
.col: the column containing the presences; it assumes presences to be the first level of this factor
a named vector of distances
# This should be updated to use a dataset from tidysdm
data("bradypus", package = "maxnet")
bradypus_tb <- tibble::as_tibble(bradypus) %>%
  dplyr::mutate(presence = relevel(
    factor(
      dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence")
    ),
    ref = "presence"
  )) %>%
  select(-ecoreg)
bradypus_tb %>% dist_pres_vs_bg(presence)
DALEX is designed to explore and explain the behaviour of Machine Learning methods. This function creates a DALEX explainer (see DALEX::explain()), which can then be queried by multiple functions to create explanations of the model.
explain_tidysdm(
  model, data, y, predict_function, predict_function_target_column,
  residual_function, ..., label, verbose, precalculate, colorize,
  model_info, type, by_workflow
)

## Default S3 method:
explain_tidysdm(
  model, data = NULL, y = NULL, predict_function = NULL,
  predict_function_target_column = NULL, residual_function = NULL, ...,
  label = NULL, verbose = TRUE, precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")), model_info = NULL,
  type = "classification", by_workflow = FALSE
)

## S3 method for class 'simple_ensemble'
explain_tidysdm(
  model, data = NULL, y = NULL, predict_function = NULL,
  predict_function_target_column = NULL, residual_function = NULL, ...,
  label = NULL, verbose = TRUE, precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")), model_info = NULL,
  type = "classification", by_workflow = FALSE
)

## S3 method for class 'repeat_ensemble'
explain_tidysdm(
  model, data = NULL, y = NULL, predict_function = NULL,
  predict_function_target_column = NULL, residual_function = NULL, ...,
  label = NULL, verbose = TRUE, precalculate = TRUE,
  colorize = !isTRUE(getOption("knitr.in.progress")), model_info = NULL,
  type = "classification", by_workflow = FALSE
)
model: object - a model to be explained
data: data.frame or matrix - data which will be used to calculate the explanations. If not provided, then it will be extracted from the model. Data should be passed without a target column (this shall be provided as the y argument).
y: numeric vector with outputs/scores. If provided, then it shall have the same size as data.
predict_function: function that takes two arguments: model and new data and returns a numeric vector with predictions. By default it is yhat.
predict_function_target_column: Character or numeric containing either column name or column number in the model prediction object of the class that should be considered as positive (i.e. the class that is associated with probability 1). If NULL, the second column of the output will be taken for binary classification. For a multiclass classification setting, that parameter causes a switch to binary classification mode with one vs others probabilities.
residual_function: function that takes four arguments: model, data, target vector y and predict function (optionally). It should return a numeric vector with model residuals for given data. If not provided, response residuals (y minus the predictions) are calculated.
...: other parameters
label: character - the name of the model. By default it's extracted from the 'class' attribute of the model
verbose: logical. If TRUE (default) then diagnostic messages will be printed
precalculate: logical. If TRUE (default) then predicted values and residuals are calculated when the explainer is created.
colorize: logical. If TRUE (default) then WARNINGS, ERRORS and NOTES are colorized.
model_info: a named list (package, version, type) with information about the model; if NULL, DALEX will seek the information on its own.
type: type of a model, either "classification" or "regression"
by_workflow: boolean determining whether a list of explainers, one per model, should be returned instead of a single explainer for the ensemble
an explainer object (DALEX::explain) ready to work with DALEX
# using the whole ensemble
lacerta_explainer <- explain_tidysdm(tidysdm::lacerta_ensemble)
# by workflow
explainer_list <- explain_tidysdm(tidysdm::lacerta_ensemble,
  by_workflow = TRUE
)
Compute multivariate environmental similarity surfaces (MESS), as described by Elith et al., 2010.
extrapol_mess(x, training, .col, ...)

## Default S3 method:
extrapol_mess(x, training, ...)

## S3 method for class 'stars'
extrapol_mess(x, ...)

## S3 method for class 'SpatRaster'
extrapol_mess(x, training, .col, filename = "", ...)

## S3 method for class 'data.frame'
extrapol_mess(x, training, .col, ...)

## S3 method for class 'SpatRasterDataset'
extrapol_mess(x, training, .col, ...)
x: a terra::SpatRaster, terra::SpatRasterDataset, stars object or data.frame for which MESS scores will be computed
training: matrix or data.frame or sf object containing the reference values; each column should correspond to one layer of the terra::SpatRaster x
.col: the column containing the presences (optional). If specified, it is excluded when computing the MESS scores.
...: additional arguments as for terra::writeRaster()
filename: character. Output filename (optional)
This function is a modified version of mess in package predicts, with a method added to work on terra::SpatRasterDataset. Note that the method for terra::SpatRasterDataset assumes that each variable is stored as a terra::SpatRaster with time information within x. Time is also assumed to be in years. If these conditions are not met, it is possible to manually extract a terra::SpatRaster for each time step, and use extrapol_mess on those terra::SpatRasters.
a terra::SpatRaster (or data.frame, depending on the input) with the MESS values.
Jean-Pierre Rossi, Robert Hijmans, Paulo van Breugel, Andrea Manica
Elith J., M. Kearney M., and S. Phillips, 2010. The art of modelling range-shifting species. Methods in Ecology and Evolution 1:330-342.
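A sketch of computing MESS for a projection raster; climate_future is a hypothetical terra::SpatRaster whose layers match the predictor columns of lacerta_thin:
lacerta_mess <- extrapol_mess(climate_future,
  training = lacerta_thin, # reference values from the fitting region
  .col = class # exclude the response column
)
terra::plot(lacerta_mess)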
This method finds a subset of variables that have low collinearity, such that all have a correlation below a given cutoff. It provides three methods: cor_caret, a stepwise approach to remove variables with a pairwise correlation above a given cutoff, choosing the variable with the greatest mean correlation (based on the algorithm in caret::findCorrelation); vif_step, a stepwise approach to remove variables with a variance inflation factor above a given cutoff (based on the algorithm in usdm::vifstep); and vif_cor, a stepwise approach that, at each step, finds the pair of variables with the highest correlation above the cutoff and removes the one with the largest VIF. There are methods for terra::SpatRaster, data.frame and matrix. For terra::SpatRaster and data.frame, only numeric variables will be considered.
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf, ...
)

## Default S3 method:
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf, ...
)

## S3 method for class 'stars'
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf,
  exhaustive = FALSE, ...
)

## S3 method for class 'SpatRaster'
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf,
  exhaustive = FALSE, ...
)

## S3 method for class 'data.frame'
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf, ...
)

## S3 method for class 'matrix'
filter_collinear(
  x, cutoff = NULL, verbose = FALSE, names = TRUE, to_keep = NULL,
  method = "cor_caret", cor_type = "pearson", max_cells = Inf, ...
)
x: A terra::SpatRaster, stars object, data.frame or matrix, for which collinearity will be assessed
cutoff: A numeric value used as a threshold to remove variables. For "cor_caret" and "vif_cor", it is the pair-wise absolute correlation cutoff, which defaults to 0.7. For "vif_step", it is the variable inflation factor, which defaults to 10
verbose: A boolean whether additional information should be provided on the screen
names: a logical; should the column names be returned
to_keep: A vector of variable names that we want to force in the set (note that the function will return an error if the correlation among any of those variables is higher than the cutoff).
method: character. One of "cor_caret", "vif_cor" or "vif_step".
cor_type: character. For methods that use correlation, which type of correlation: "pearson", "kendall", or "spearman". Defaults to "pearson"
max_cells: positive integer. The maximum number of cells to be used. If this is smaller than ncell(x), a regular sample of x is used
exhaustive: boolean. Used only for terra::SpatRaster and stars objects when sampling cells.
A vector of names of columns that are below the correlation threshold (when names = TRUE), otherwise a vector of indices. Note that the indices are only for numeric variables (i.e. if factors are present, the indices do not take them into account).
for cor_caret: original R code by Dong Li, modified by Max Kuhn and Andrea Manica; for vif_step and vif_cor, original algorithm by Babak Naimi, rewritten by Andrea Manica for tidysdm
Naimi, B., Hamm, N.A.S., Groen, T.A., Skidmore, A.K., and Toxopeus, A.G. 2014. Where is positional uncertainty a problem for species distribution modelling?, Ecography 37 (2): 191-203.
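A minimal sketch on a data.frame of numeric variables; mtcars stands in for a table of environmental predictors:
vars_kept <- filter_collinear(mtcars, cutoff = 0.7, method = "cor_caret")
vars_kept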
THIS FUNCTION IS DEPRECATED. USE filter_collinear with method = "cor_caret" instead.
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)

## Default S3 method:
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)

## S3 method for class 'SpatRaster'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)

## S3 method for class 'data.frame'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)

## S3 method for class 'matrix'
filter_high_cor(x, cutoff = 0.7, verbose = FALSE, names = TRUE, to_keep = NULL)
x: A terra::SpatRaster, data.frame, matrix, or a correlation matrix that was previously estimated
cutoff: A numeric value for the pair-wise absolute correlation cutoff
verbose: A boolean for printing the details
names: a logical; should the column names be returned
to_keep: A vector of variable names that we want to force in the set (note that the function will return an error if the correlation among any of those variables is higher than the cutoff).
This method finds a subset of variables such that all have a correlation below a certain cutoff. There are methods for terra::SpatRaster, data.frame, and to work directly on a correlation matrix that was previously estimated. For data.frame, only numeric variables will be considered.
The algorithm is based on caret::findCorrelation, using the exact option. The absolute values of pair-wise correlations are considered. If two variables have a high correlation, the function looks at the mean absolute correlation of each variable and removes the variable with the largest mean absolute correlation.
There are several functions in the package subselect that can also be used to accomplish the same goal, but they tend to retain more predictors.
A vector of names of columns that are below the correlation threshold (when names = TRUE), otherwise a vector of indices. Note that the indices are only for numeric variables (i.e. if factors are present, the indices do not take them into account).
This function takes the formula from a recipe, and turns numeric predictors into smooths with a given k. This formula can be passed to a workflow or workflow set when fitting a gam.
gam_formula(object, k = 10)
object: a recipes::recipe, already trained
k: the k value for the smooth
a formula
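A sketch of using the resulting formula when adding a GAM to a workflow; lacerta_rec is a placeholder for a trained recipe:
gam_spec <- parsnip::gen_additive_mod() %>%
  parsnip::set_engine("mgcv") %>%
  parsnip::set_mode("classification")
gam_wf <- workflows::workflow() %>%
  workflows::add_recipe(lacerta_rec) %>%
  workflows::add_model(gam_spec, formula = gam_formula(lacerta_rec, k = 5))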
This geometry displays the density distribution of two groups side by side, as two halves of a violin. Note that an empty x aesthetic has to be provided even if you want to plot a single variable (see example below).
geom_split_violin(
  mapping = NULL,
  data = NULL,
  stat = "ydensity",
  position = "identity",
  nudge = 0,
  ...,
  draw_quantiles = NULL,
  trim = TRUE,
  scale = "area",
  na.rm = FALSE,
  show.legend = NA,
  inherit.aes = TRUE
)
mapping: Set of aesthetic mappings created by aes().
data: The data to be displayed in this layer. There are three options: if NULL, the default, the data is inherited from the plot data; a data.frame (or other object coercible to one) will override the plot data; a function will be called with a single argument, the plot data, and must return a data.frame.
stat: Use to override the default connection between the geom and the stat.
position: A position adjustment to use on the data for this layer. This can be used in various ways, including to prevent overplotting and improving the display.
nudge: Add space between the half-violin and the middle of the space allotted to a given factor on the x-axis.
...: Other arguments passed on to layer().
draw_quantiles: If not NULL (the default), draw horizontal lines at the given quantiles of the density estimate.
trim: If TRUE (default), trim the tails of the violins to the range of the data. If FALSE, don't trim the tails.
scale: if "area" (default), all violins have the same area (before trimming the tails). If "count", areas are scaled proportionally to the number of observations. If "width", all violins have the same maximum width.
na.rm: If FALSE, the default, missing values are removed with a warning. If TRUE, missing values are silently removed.
show.legend: logical. Should this layer be included in the legends? NA, the default, includes if any aesthetics are mapped; FALSE never includes, and TRUE always includes.
inherit.aes: If FALSE, overrides the default aesthetics, rather than combining with them.
The implementation is based on https://stackoverflow.com/questions/35717353/split-violin-plot-with-ggplot2. Credit goes to @jan-jlx for providing a complete implementation on StackOverflow, and to Trang Q. Nguyen for adding the nudge parameter.
a ggplot2::layer object
data("bradypus", package = "maxnet") bradypus_tb <- tibble::as_tibble(bradypus) %>% dplyr::mutate(presence = relevel( factor( dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence") ), ref = "presence" )) ggplot(bradypus_tb, aes( x = "", y = cld6190_ann, fill = presence )) + geom_split_violin(nudge = 0.01)
data("bradypus", package = "maxnet") bradypus_tb <- tibble::as_tibble(bradypus) %>% dplyr::mutate(presence = relevel( factor( dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence") ), ref = "presence" )) ggplot(bradypus_tb, aes( x = "", y = cld6190_ann, fill = presence )) + geom_split_violin(nudge = 0.01)
This function facilitates using spatialsample::spatial_block_cv multiple times in an analysis. spatialsample::spatial_block_cv creates a grid based on the object in data. However, if spatial blocks are generated multiple times in an analysis (e.g. for a spatial_initial_split(), and then subsequently for cross-validation on the training dataset), it might be desirable to keep the same grid. This can be achieved by applying this function to the largest dataset, usually the full dataset before spatial_initial_split(). The resulting cellsize can be used as an option in spatialsample::spatial_block_cv.
grid_cellsize(data, n = c(10, 10))
data: a sf::sf dataset used to size the grid
n: the number of cells in the grid, defaults to c(10, 10), which is also the default used by spatialsample::spatial_block_cv()
the cell size
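A minimal sketch: size the grid on the full dataset before any splitting:
lacerta_thin <- readRDS(system.file("extdata/lacerta_climate_sf.RDS",
  package = "tidysdm"
))
cellsize <- grid_cellsize(lacerta_thin)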
This function facilitates using spatialsample::spatial_block_cv multiple times in an analysis. spatialsample::spatial_block_cv creates a grid based on the object in data. However, if spatial blocks are generated multiple times in an analysis (e.g. for a spatial_initial_split(), and then subsequently for cross-validation on the training dataset), it might be desirable to keep the same grid. This can be achieved by applying this function to the largest dataset, usually the full dataset before spatial_initial_split(). The resulting offset can be used as an option in spatialsample::spatial_block_cv.
grid_offset(data)
data: a sf::sf dataset used to size the grid
the grid offset
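A sketch of keeping the same grid across the initial split and the subsequent cross-validation; cellsize and offset are passed through to the grid-making step of spatialsample::spatial_block_cv:
cellsize <- grid_cellsize(lacerta_thin)
offset <- grid_offset(lacerta_thin)
lacerta_cv <- spatial_block_cv(lacerta_thin,
  v = 5,
  cellsize = cellsize,
  offset = offset
)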
Coordinates for presences of horses from 22k to 8k YBP.
horses
A tibble with 1,297 rows and 3 variables:
latitudes in degrees
longitudes in degrees
time in years before present
Cohen's Kappa (yardstick::kap()) is a measure similar to yardstick::accuracy(), but it normalises the observed accuracy by the value that would be expected by chance (this helps for unbalanced cases when one class is predominant).
kap_max(data, ...)

## S3 method for class 'data.frame'
kap_max(
  data,
  truth,
  ...,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL
)

## S3 method for class 'sf'
kap_max(data, ...)

kap_max_vec(
  truth,
  estimate,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL,
  ...
)
data: Either a data.frame containing the columns specified by the truth and estimate arguments, or a table/matrix where the true class results should be in the columns of the table.
...: A set of unquoted column names or one or more dplyr selector functions to choose which variables contain the class probabilities. If truth is binary, only 1 column should be selected, and it should correspond to the value of event_level. Otherwise, there should be as many columns as factor levels of truth and the ordering of the columns should be the same as the factor levels of truth.
truth: The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector.
estimator: One of "binary", "hand_till", "macro", or "macro_weighted" to specify the type of averaging to be done. "binary" is only relevant for the two class case. The others are general methods for calculating multiclass metrics. The default will automatically choose "binary" if truth is binary, "hand_till" if truth has >2 levels and case_weights isn't specified, or "macro" if truth has >2 levels and case_weights is specified (in which case "hand_till" isn't well-defined).
na_rm: A logical value indicating whether NA values should be stripped before the computation proceeds.
event_level: A single string. Either "first" or "second" to specify which level of truth to consider as the "event". This argument is only applicable when estimator = "binary". The default uses an internal helper that generally defaults to "first".
case_weights: The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data. For _vec() functions, a numeric vector.
estimate: If truth is binary, a numeric vector of class probabilities corresponding to the "relevant" class. Otherwise, a matrix with as many columns as factor levels of truth. It is assumed that these are in the same order as the levels of truth.
This function calibrates the probability threshold to classify presences so as to maximise kappa.
There is no multiclass version of this function; it only operates on binary predictions (e.g. presences and absences in SDMs).
A tibble with columns .metric, .estimator, and .estimate and 1 row of values. For grouped data frames, the number of rows returned will be the same as the number of groups.
Cohen, J. (1960). "A coefficient of agreement for nominal scales". Educational and Psychological Measurement. 20 (1): 37-46.
Cohen, J. (1968). "Weighted kappa: Nominal scale agreement provision for scaled disagreement or partial credit". Psychological Bulletin. 70 (4): 213-220.
Other class probability metrics: boyce_cont(), tss_max()
kap_max(two_class_example, truth, Class1)
This function takes a distance in km and converts it into meters, the units generally used by geographic operations in R. This is a trivial conversion, but this function ensures that no zeroes are lost along the way!
km2m(x)
x: the number of km
the number of meters
km2m(10000)
km2m(1)
Coordinates for presences of Lacerta schreiberi. The variables are as follows:
lacerta
A tibble with 1,297 rows and 3 variables:
ids from GBIF
latitudes in degrees
longitudes in degrees
Ensemble SDM for Lacerta schreiberi, as generated in the vignette.
lacerta_ensemble
A simple_ensemble object
Ensemble SDM for Lacerta schreiberi, as generated in the vignette.
lacerta_rep_ens
A repeat_ensemble object
Coordinates for presences of lacertidae, used as background for the lacerta dataset. The variables are as follows:
lacertidae_background
A tibble with 1,297 rows and 3 variables:
ids from GBIF
latitudes in degrees
longitudes in degrees
This function uses the presence column to create a mask to apply to the raster to define the area of interest. Two methods are available: one that uses a buffer around each presence, and one that creates a convex hull around all presences (with the possibility of further adding a buffer around the hull).
make_mask_from_presence(data, method = "buffer", buffer = 0, return_sf = FALSE)
data: An sf::sf data frame of presences
method: the method to use to create the mask. Either 'buffer' or 'convex_hull'
buffer: the buffer to add around each presence (in the units of the crs of the data; for lat/lon, the buffer will be in meters), or around the convex hull (if method is 'convex_hull')
return_sf: whether to return the mask as an sf object (TRUE) or as a terra::SpatVector (FALSE, the default)
To use terra::mask() on a raster, use return_sf = FALSE to get a terra::SpatVector object that can be used for masking.
a terra::SpatVector or an sf object (depending on the value of return_sf) with the mask
lacerta_sf <- lacerta %>%
  sf::st_as_sf(coords = c("longitude", "latitude")) %>%
  sf::st_set_crs(4326)
land_mask <- terra::readRDS(system.file("extdata/lacerta_land_mask.rds",
  package = "tidysdm"
))
mask_buffer <- make_mask_from_presence(lacerta_sf,
  method = "buffer",
  buffer = 60000
)
terra::plot(terra::mask(land_mask, mask_buffer))
mask_ch <- make_mask_from_presence(lacerta_sf, method = "convex_hull")
terra::plot(terra::mask(land_mask, mask_ch))
maxent defines the MaxEnt model as used in Species Distribution Models. A good guide to how the options of a MaxEnt model work can be found in https://onlinelibrary.wiley.com/doi/full/10.1111/j.1600-0587.2013.07872.x
maxent(
  mode = "classification",
  engine = "maxnet",
  feature_classes = NULL,
  regularization_multiplier = NULL
)
mode: A single character string for the type of model. The only possible value for this model is "classification".
engine: A single character string specifying what computational engine to use for fitting. Currently only "maxnet" is available.
feature_classes: character, continuous feature classes desired, either "default" or any subset of "lqpht" (for example, "lh")
regularization_multiplier: numeric, a constant to adjust regularization
a parsnip::model_spec for a maxent model
# format the data
data("bradypus", package = "maxnet")
bradypus_tb <- tibble::as_tibble(bradypus) %>%
  dplyr::mutate(presence = relevel(
    factor(
      dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence")
    ),
    ref = "presence"
  )) %>%
  select(-ecoreg)

# fit the model, and make some predictions
maxent_spec <- maxent(feature_classes = "lq")
maxent_fitted <- maxent_spec %>%
  fit(presence ~ ., data = bradypus_tb)
pred_prob <- predict(maxent_fitted, new_data = bradypus[, -1], type = "prob")
pred_class <- predict(maxent_fitted, new_data = bradypus[, -1], type = "class")

# Now with tuning
maxent_spec <- maxent(
  regularization_multiplier = tune(),
  feature_classes = tune()
)
set.seed(452)
cv <- vfold_cv(bradypus_tb, v = 2)
maxent_tune_res <- maxent_spec %>%
  tune_grid(presence ~ ., cv, grid = 3)
show_best(maxent_tune_res, metric = "roc_auc")
These parameters are auxiliary to MaxEnt models using the "maxnet" engine. These functions are used by the tuning functions, and the user will rarely access them directly.
regularization_multiplier(range = c(0.5, 3), trans = NULL)

feature_classes(values = c("l", "lq", "lqp", "lqph", "lqpht"))
range: A two-element vector holding the defaults for the smallest and largest possible values, respectively. If a transformation is specified, these values should be in the transformed units.
trans: A trans object from the scales package, such as scales::log10_trans() or scales::reciprocal_trans(). If not provided, the default is used which matches the units used in range. If no transformation, NULL.
values: For feature_classes(), the feature classes that can be sampled during tuning.
a param object that can be used for tuning.
regularization_multiplier()
feature_classes()
This function computes overlap metrics between two rasters. It currently implements Schoener's D and the inverse I of Hellinger's distance.
niche_overlap(x, y, method = c("Schoener", "Hellinger"))
x: a terra::SpatRaster with a single layer
y: a terra::SpatRaster with a single layer
method: a string (or vector of strings) taking values "Schoener" and "Hellinger"
Note that Hellinger's distance is normalised by dividing by square root of 2 (which is the correct asymptote for Hellinger's D), rather than the incorrect 2 used originally in Warren et al (2008), based on the Erratum for that paper.
a list of overlap metrics, with slots D and I (depending on method)
Warren, D.L., Glor, R.E. & Turelli M. (2008) Environmental niche equivalency versus conservativism: quantitative approaches to niche evolution. Evolution 62: 2868-2883
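A minimal sketch on two small random rasters, standing in for two SDM prediction surfaces:
library(terra)
r1 <- rast(ncols = 10, nrows = 10, vals = runif(100))
r2 <- rast(ncols = 10, nrows = 10, vals = runif(100))
niche_overlap(r1, r2, method = c("Schoener", "Hellinger"))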
This function returns the threshold to turn probabilities into binary classes whilst optimising a given metric. Currently available for tss_max, kap_max and sensitivity (for which a target sensitivity is required).
optim_thresh(truth, estimate, metric, event_level = "first")
truth: The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector.
estimate: the predicted probability for the event
metric: character of metric to be optimised. Currently only "tss_max", "kap_max", and "sensitivity" with a given target (e.g. c("sensitivity", 0.8))
event_level: A single string. Either "first" or "second" to specify which level of truth to consider as the "event". This argument is only applicable when estimator = "binary". The default uses an internal helper that generally defaults to "first".
the probability threshold for the event
optim_thresh(two_class_example$truth, two_class_example$Class1, metric = c("tss_max"))
optim_thresh(two_class_example$truth, two_class_example$Class1, metric = c("sens", 0.9))
pairs() for stars objects
This is a wrapper around graphics::pairs() that accepts stars objects. It is adapted from a similar function in the terra package.
## S4 method for signature 'stars'
pairs(
  x,
  hist = TRUE,
  cor = TRUE,
  use = "pairwise.complete.obs",
  maxcells = 1e+05,
  ...
)
x: a stars object
hist: logical. If TRUE a histogram of the values is shown on the diagonal
cor: logical. If TRUE the correlation coefficient is shown in the upper panels
use: argument passed to the stats::cor() function
maxcells: integer. Number of pixels to sample from each layer of a large raster
...: additional arguments (graphical parameters)
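A sketch using the example raster shipped with the stars package, splitting the band dimension into attributes so each band becomes a variable in the pairs plot:
library(stars)
tif <- system.file("tif/L7_ETMs.tif", package = "stars")
bands <- split(read_stars(tif), "band")
pairs(bands, maxcells = 5000)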
Create a composite plot contrasting the distribution of multiple variables for presences vs the background.
plot_pres_vs_bg(.data, .col)
.data: a data.frame (or sf::sf object) containing the presences and the environmental variables
.col: the column containing the presences; it assumes presences to be the first level of this factor
a patchwork composite plot
data("bradypus", package = "maxnet") bradypus_tb <- tibble::as_tibble(bradypus) %>% dplyr::mutate(presence = relevel( factor( dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence") ), ref = "presence" )) %>% select(-ecoreg) bradypus_tb %>% plot_pres_vs_bg(presence)
data("bradypus", package = "maxnet") bradypus_tb <- tibble::as_tibble(bradypus) %>% dplyr::mutate(presence = relevel( factor( dplyr::case_match(presence, 1 ~ "presence", 0 ~ "absence") ), ref = "presence" )) %>% select(-ecoreg) bradypus_tb %>% plot_pres_vs_bg(presence)
This function allows you to use a raster as data to make predictions from a variety of tidymodels objects, such as a simple_ensemble or stacks::stacks.
predict_raster(object, raster, ...)

## Default S3 method:
predict_raster(object, raster, ...)
object: the tidymodels object of interest (e.g. a simple_ensemble or stacks::stacks object)
raster: the terra::SpatRaster or stars object with the predictor variables
...: parameters to be passed to the standard predict() method of the object
a terra::SpatRaster (or stars if that is the input) with the predictions
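A sketch of projecting an ensemble onto a raster; climate_present is a hypothetical terra::SpatRaster whose layer names match the predictors used to fit lacerta_ensemble:
pred_map <- predict_raster(tidysdm::lacerta_ensemble, climate_present)
terra::plot(pred_map)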
Predict for a new dataset by using a repeat ensemble. Predictions from individual models are combined according to fun.
## S3 method for class 'repeat_ensemble'
predict(
  object,
  new_data,
  type = "prob",
  fun = "mean",
  metric_thresh = NULL,
  class_thresh = NULL,
  members = FALSE,
  ...
)
object: a repeat_ensemble object
new_data: a data frame in which to look for variables with which to predict.
type: the type of prediction, "prob" or "class".
fun: string defining the aggregating function; it can take values such as "mean" or "median".
metric_thresh: a vector of length 2 giving a metric and its threshold, which will be used to prune which models in the ensemble will be used for the prediction. The 'metrics' need to have been computed when the workflow was tuned. Examples are c("accuracy", 0.8) or c("boyce_cont", 0.7)
class_thresh: probability threshold used to convert probabilities into classes. It can be a number (between 0 and 1), or a character metric (currently "tss_max" or "sensitivity"). For sensitivity, an additional target value is passed along as a second element of a vector, e.g. c("sensitivity", 0.8).
members: boolean defining whether individual predictions for each member should be added to the ensemble prediction. The columns for individual members have the name of the workflow as a prefix, separated by "." from the usual column names of the predictions.
...: not used in this method.
a tibble of predictions
Predict for a new dataset by using a simple ensemble. Predictions from individual models (i.e. workflows) are combined according to fun.
## S3 method for class 'simple_ensemble'
predict(
  object,
  new_data,
  type = "prob",
  fun = "mean",
  metric_thresh = NULL,
  class_thresh = NULL,
  members = FALSE,
  ...
)
object: a simple_ensemble object
new_data: a data frame in which to look for variables with which to predict.
type: the type of prediction, "prob" or "class".
fun: string defining the aggregating function; it can take values such as "mean" or "median".
metric_thresh: a vector of length 2 giving a metric and its threshold, which will be used to prune which models in the ensemble will be used for the prediction. The 'metrics' need to have been computed when the workflow was tuned. Examples are c("accuracy", 0.8) or c("boyce_cont", 0.7)
class_thresh: probability threshold used to convert probabilities into classes. It can be a number (between 0 and 1), or a character metric (currently "tss_max" or "sensitivity"). For sensitivity, an additional target value is passed along as a second element of a vector, e.g. c("sensitivity", 0.8).
members: boolean defining whether individual predictions for each member should be added to the ensemble prediction. The columns for individual members have the name of the workflow as a prefix, separated by "." from the usual column names of the predictions.
...: not used in this method.
a tibble of predictions
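A sketch, assuming lacerta_new is a hypothetical data frame holding the predictor columns used to train the ensemble:
preds <- predict(tidysdm::lacerta_ensemble,
  new_data = lacerta_new,
  metric_thresh = c("boyce_cont", 0.7), # prune poorly performing members
  members = TRUE # also return per-member predictions
)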
Metrics for sf objects
tidysdm provides specialised metrics for SDMs, which have their own help pages (boyce_cont(), kap_max(), and tss_max()). Additionally, it also provides methods to handle sf::sf objects for the following standard yardstick metrics:
yardstick::average_precision()
yardstick::brier_class()
yardstick::classification_cost()
yardstick::gain_capture()
yardstick::mn_log_loss()
yardstick::pr_auc()
yardstick::roc_auc()
yardstick::roc_aunp()
yardstick::roc_aunu()
## S3 method for class 'sf'
average_precision(data, ...)

## S3 method for class 'sf'
brier_class(data, ...)

## S3 method for class 'sf'
classification_cost(data, ...)

## S3 method for class 'sf'
gain_capture(data, ...)

## S3 method for class 'sf'
mn_log_loss(data, ...)

## S3 method for class 'sf'
pr_auc(data, ...)

## S3 method for class 'sf'
roc_auc(data, ...)

## S3 method for class 'sf'
roc_aunp(data, ...)

## S3 method for class 'sf'
roc_aunu(data, ...)
data: an sf::sf object
...: any other parameters to pass to the underlying yardstick metric
Note that roc_aunp and roc_aunu are multiclass metrics, and as such are not relevant for SDMs (which work on a binary response). They are included for completeness, so that all class probability metrics from yardstick have an sf method, for applications other than SDMs.
A tibble with columns .metric, .estimator, and .estimate and 1 row of values.
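A sketch of the sf dispatch: attach made-up coordinates to two_class_example and call a metric directly on the resulting sf object:
two_class_sf <- sf::st_as_sf(
  cbind(two_class_example,
    x = runif(nrow(two_class_example)), # dummy coordinates
    y = runif(nrow(two_class_example))
  ),
  coords = c("x", "y"),
  crs = 4326
)
roc_auc(two_class_sf, truth, Class1)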
recipe() for sf objects
This method for recipes::recipe() handles the case when x is an sf::sf object, as commonly used in Species Distribution Models, and generates a spatial_recipe.
## S3 method for class 'sf'
recipe(x, ...)

spatial_recipe(x, ...)
x: An sf::sf data frame.
...: parameters to be passed to recipes::recipe()
recipes::recipe() is not natively compatible with sf::sf objects. The problem is that the geometry column of sf::sf objects is a list, which is incompatible with the translation of formulae in recipes::recipe(). This method strips the geometry column from the data.frame and replaces it with simple X and Y columns before any further operations, thus allowing the usual processing by recipes::recipe() to succeed (X and Y are given the role of coords in a spatial recipe). When prepping and baking a spatial_recipe, if a data.frame or tibble without coordinates is used as training or new_data, dummy X and Y columns are generated and filled with NAs.
NOTE that order matters! You need to use the syntax recipe(x = sf_obj, formula = class ~ .) for the method to successfully detect the sf::sf object. Starting with formula will fail.
An object of class spatial_recipe, which is a derived version of
recipes::recipe(); see the man page of recipes::recipe() for details.
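A minimal sketch (pres_abs_sf is a hypothetical sf object with a class column coding presences/absences), respecting the argument order noted above:

# x must be given first so that the sf method is dispatched
my_rec <- recipe(x = pres_abs_sf, formula = class ~ .)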
An ensemble based on multiple sets of pseudoabsences/background. This object
is a collection (list) of simple_ensemble objects for which predictions will
be combined in a simple way (e.g. by taking either the mean or median). Each
simple_ensemble contains the best version of each given model type
following tuning; all simple ensembles will need to have the same metric
estimated during the cv process.
repeat_ensemble(...)
... |
not used, this function just creates an empty |
an empty repeat_ensemble
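As a brief sketch (ens_rep1 and ens_rep2 are hypothetical simple_ensemble objects, one per set of pseudoabsences), repeats are added with add_repeat():

library(tidysdm)
my_rep_ens <- repeat_ensemble()
my_rep_ens <- add_repeat(my_rep_ens, list(ens_rep1, ens_rep2))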
This function samples background points from a raster given a set of presences. The locations returned are the center points of the sampled cells, which can overlap with the presences (in contrast to pseudo-absences, see sample_pseudoabs). The following methods are implemented:
'random': background randomly sampled from the region covered by the raster (i.e. not NAs).
'dist_max': background randomly sampled from the unioned buffers of 'dist_max' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters). Using the union of buffers means that areas that are in multiple buffers are not oversampled. This is also referred to as "thickening".
'bias': background points are sampled according to a surface representing the biased sampling effort.
sample_background(
  data,
  raster,
  n,
  coords = NULL,
  method = "random",
  class_label = "background",
  return_pres = TRUE
)
data |
An |
raster |
the terra::SpatRaster or |
n |
number of background points to sample. |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
method |
sampling method. One of 'random', 'dist_max', and 'bias'. For dist_max, the maximum distance is set as an additional element of a vector, e.g. c('dist_max', 70000). |
class_label |
the label given to the sampled points. Defaults to |
return_pres |
return presences together with background in a single tibble. |
Note that the units of distance depend on the projection of the raster.
An object of class tibble::tibble. If presences are returned, the
presence level is set as the reference (to match the expectations in the
yardstick
package that considers the first level to be the event).
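An illustrative sketch (presences and climate_rast are hypothetical), using the 'dist_max' method described above:

# sample 1000 background points within 50 km of any presence
bg_pts <- sample_background(presences,
  raster = climate_rast,
  n = 1000,
  method = c("dist_max", 50000)
)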
This function samples background points from a raster given a set of presences. The locations returned are the center points of the sampled cells, which can overlap with the presences (in contrast to pseudo-absences, see sample_pseudoabs_time). The following methods are implemented:
'random': background points randomly sampled from the region covered by the raster (i.e. not NAs).
'dist_max': background points randomly sampled from the unioned buffers of 'dist_max' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters). Using the union of buffers means that areas that are in multiple buffers are not oversampled. This is also referred to as "thickening".
'bias': background points are sampled according to a surface representing
the biased sampling effort. Note that the surface for each time step is normalised to
sum to 1; use n_per_time_step
to affect sampling effort within each time step.
sample_background_time(
  data,
  raster,
  n_per_time_step,
  coords = NULL,
  time_col = "time",
  lubridate_fun = c,
  method = "random",
  class_label = "background",
  return_pres = TRUE,
  time_buffer = 0
)
data |
An |
raster |
the terra::SpatRaster, |
n_per_time_step |
number of background points to sample for each time step (i.e. a vector of length equal to the number of time steps in raster) |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
time_col |
The name of the column with time; if time is not a lubridate object,
use |
lubridate_fun |
function to convert the time column into a lubridate object |
method |
sampling method. One of 'random', 'dist_max', or 'bias'. |
class_label |
the label given to the sampled points. Defaults to |
return_pres |
return presences together with background in a single tibble |
time_buffer |
the buffer on the time axis around presences that defines their effect when
sampling background with method 'dist_max'. If set to zero, presences have an effect only on the time step to which
they are assigned in |
Note that the time axis of the raster should be in POSIXct
or Date
format,
or use tstep = "years". See terra::time()
for details on how to set the time axis.
An object of class tibble::tibble. If presences are returned, the
presence level is set as the reference (to match the expectations in the
yardstick
package that considers the first level to be the event)
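A sketch under stated assumptions (presences_t and climate_series are hypothetical, and the time column is assumed to already be a Date, so the default lubridate_fun suffices):

# one entry of n_per_time_step per layer of the raster time series
bg_t <- sample_background_time(presences_t,
  raster = climate_series,
  n_per_time_step = rep(200, terra::nlyr(climate_series)),
  time_col = "time"
)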
This function samples pseudo-absence points from a raster given a set of presences. The locations returned are the center points of the sampled cells, which cannot overlap with the presences (in contrast to background points, see sample_background). The following methods are implemented:
'random': pseudo-absences randomly sampled from the region covered by the raster (i.e. not NAs).
'dist_min': pseudo-absences randomly sampled from the region excluding a buffer of 'dist_min' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters).
'dist_max': pseudo-absences randomly sampled from the unioned buffers of 'dist_max' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters). Using the union of buffers means that areas that are in multiple buffers are not oversampled. This is also referred to as "thickening".
'dist_disc': pseudo-absences randomly sampled from the unioned discs around presences with the two values of 'dist_disc' defining the minimum and maximum distance from presences.
sample_pseudoabs(
  data,
  raster,
  n,
  coords = NULL,
  method = "random",
  class_label = "pseudoabs",
  return_pres = TRUE
)
data |
An |
raster |
the terra::SpatRaster or |
n |
number of pseudoabsence points to sample |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
method |
sampling method. One of 'random', 'dist_min', 'dist_max', or 'dist_disc'. Threshold distances are set as additional elements of a vector, e.g. c('dist_min', 70000) or c('dist_disc', 50000, 200000). |
class_label |
the label given to the sampled points. Defaults to |
return_pres |
return presences together with pseudoabsences in a single tibble |
An object of class tibble::tibble. If presences are returned, the
presence level is set as the reference (to match the expectations in the
yardstick
package that considers the first level to be the event)
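A minimal sketch (presences and climate_rast are hypothetical), using 'dist_disc' so that pseudo-absences fall between 50 and 200 km from any presence:

pa_pts <- sample_pseudoabs(presences,
  raster = climate_rast,
  n = nrow(presences),
  method = c("dist_disc", 50000, 200000)
)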
This function samples pseudo-absence points from a raster given a set of presences. The locations returned are the center points of the sampled cells, which cannot overlap with the presences (in contrast to background points, see sample_background_time). The following methods are implemented:
'random': pseudo-absences randomly sampled from the region covered by the raster (i.e. not NAs).
'dist_min': pseudo-absences randomly sampled from the region excluding a buffer of 'dist_min' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters).
'dist_max': pseudo-absences randomly sampled from the unioned buffers of 'dist_max' from presences (distances in 'm' for lonlat rasters, and in map units for projected rasters). Using the union of buffers means that areas that are in multiple buffers are not oversampled. This is also referred to as "thickening".
'dist_disc': pseudo-absences randomly sampled from the unioned discs around presences with the two values of 'dist_disc' defining the minimum and maximum distance from presences.
sample_pseudoabs_time(
  data,
  raster,
  n_per_presence,
  coords = NULL,
  time_col = "time",
  lubridate_fun = c,
  method = "random",
  class_label = "pseudoabs",
  return_pres = TRUE,
  time_buffer = 0
)
data |
An |
raster |
the terra::SpatRaster, |
n_per_presence |
number of pseudoabsence points to sample for each presence |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
time_col |
The name of the column with time; if time is not a lubridate object,
use |
lubridate_fun |
function to convert the time column into a lubridate object |
method |
sampling method. One of 'random', 'dist_min', 'dist_max', or 'dist_disc'. |
class_label |
the label given to the sampled points. Defaults to |
return_pres |
return presences together with pseudoabsences in a single tibble |
time_buffer |
the buffer on the time axis around presences that defines their effect when
sampling pseudoabsences. If set to zero, presences have an effect only on the time step to which
they are assigned in |
Note that the time axis of the raster should be in POSIXct
or Date
format,
or use tstep = "years". See terra::time()
for details on how to set the time axis.
An object of class tibble::tibble. If presences are returned, the
presence level is set as the reference (to match the expectations in the
yardstick
package that considers the first level to be the event)
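A sketch under stated assumptions (presences_t and climate_series are hypothetical; it is also assumed here that time_buffer accepts an interval in days, such as the difftime returned by y2d()):

pa_t <- sample_pseudoabs_time(presences_t,
  raster = climate_series,
  n_per_presence = 10,
  time_col = "time",
  method = c("dist_min", 50000),
  time_buffer = y2d(1000)
)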
This function returns a yardstick::metric_set that includes boyce_cont()
,
yardstick::roc_auc()
and tss_max()
, the most commonly used metrics for
SDMs.
sdm_metric_set(...)
... |
additional metrics to be added to the
|
a yardstick::metric_set
object.
sdm_metric_set()
sdm_metric_set(accuracy)
This function returns a parsnip::model_spec for a Boosted Trees model to
be used as a classifier of presences and absences in Species Distribution Models.
It uses the library xgboost
to fit boosted trees; to use another library, simply build the
parsnip::model_spec directly.
sdm_spec_boost_tree(..., tune = c("sdm", "all", "custom", "none"))
... |
parameters to be passed to |
tune |
character defining the tuning strategy. Valid strategies are:
|
a parsnip::model_spec of the model.
Other "sdm model specifications":
sdm_spec_gam()
,
sdm_spec_glm()
,
sdm_spec_maxent()
,
sdm_spec_rand_forest()
standard_bt_spec <- sdm_spec_boost_tree()
full_bt_spec <- sdm_spec_boost_tree(tune = "all")
custom_bt_spec <- sdm_spec_boost_tree(tune = "custom", mtry = tune())
This function returns a parsnip::model_spec for a Generalised Additive Model to be used as a classifier of presences and absences in Species Distribution Models.
sdm_spec_gam(..., tune = "none")
... |
parameters to be passed to |
tune |
character defining the tuning strategy. As there are no hyperparameters
to tune in a gam, the only valid option is "none". This parameter is present
for consistency with other |
Note that, when using GAMs in a workflow_set()
, it is necessary to
update the model with gam_formula()
(see parsnip::model_formula
for a
discussion of formulas with special terms in tidymodels
):
workflow_set(
  preproc = list(default = my_recipe),
  models = list(gam = sdm_spec_gam()),
  cross = TRUE
) %>%
  update_workflow_model("default_gam",
    spec = sdm_spec_gam(),
    formula = gam_formula(my_recipe)
  )
a parsnip::model_spec of the model.
parsnip::gen_additive_mod()
gam_formula()
Other "sdm model specifications":
sdm_spec_boost_tree()
,
sdm_spec_glm()
,
sdm_spec_maxent()
,
sdm_spec_rand_forest()
my_gam_spec <- sdm_spec_gam()
This function returns a parsnip::model_spec for a Generalised Linear Model to be used as a classifier of presences and absences in Species Distribution Models.
sdm_spec_glm(..., tune = "none")
... |
parameters to be passed to |
tune |
character defining the tuning strategy. As there are no hyperparameters
to tune in a glm, the only valid option is "none". This parameter is present
for consistency with other |
a parsnip::model_spec of the model.
Other "sdm model specifications":
sdm_spec_boost_tree()
,
sdm_spec_gam()
,
sdm_spec_maxent()
,
sdm_spec_rand_forest()
my_spec_glm <- sdm_spec_glm()
This function returns a parsnip::model_spec for a MaxEnt model to be used in Species Distribution Models.
sdm_spec_maxent(..., tune = c("sdm", "all", "custom", "none"))
... |
parameters to be passed to |
tune |
character defining the tuning strategy. Valid strategies are:
|
a parsnip::model_spec of the model.
Other "sdm model specifications":
sdm_spec_boost_tree()
,
sdm_spec_gam()
,
sdm_spec_glm()
,
sdm_spec_rand_forest()
test_maxent_spec <- sdm_spec_maxent(tune = "sdm")
test_maxent_spec
# setting specific values
sdm_spec_maxent(tune = "custom", feature_classes = "lq")
This function returns a parsnip::model_spec for a Random Forest to
be used as a classifier of presences and absences in Species Distribution
Models. It uses the library ranger to fit random forests; to use another
library, simply build the parsnip::model_spec directly.
sdm_spec_rand_forest(..., tune = c("sdm", "all", "custom", "none"))

sdm_spec_rf(..., tune = c("sdm", "all", "custom", "none"))
... |
parameters to be passed to |
tune |
character defining the tuning strategy. Valid strategies are:
|
sdm_spec_rf()
is simply a short form for sdm_spec_rand_forest()
.
a parsnip::model_spec of the model.
Other "sdm model specifications":
sdm_spec_boost_tree()
,
sdm_spec_gam()
,
sdm_spec_glm()
,
sdm_spec_maxent()
test_rf_spec <- sdm_spec_rf(tune = "sdm")
test_rf_spec
# combining tuning with specific values for other hyperparameters
sdm_spec_rf(tune = "sdm", trees = 100)
A simple ensemble is a collection of workflows for which predictions will be combined in a simple way (e.g. by taking either the mean or median). Usually these workflows will each consist of the best version of a given model algorithm following tuning. The workflows are fitted to the full training dataset before making predictions.
simple_ensemble(...)
... |
not used, this function just creates an empty |
an empty simple_ensemble
. This is a tibble with columns:
wflow_id
: the name of the workflows for which the best model was
chosen
workflow
: the trained workflow objects
metrics
: metrics based on the crossvalidation resampling used
to tune the models
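As a hedged sketch of the intended use (my_wflow_set is a hypothetical, already tuned workflowsets::workflow_set), members are typically added with add_member():

library(tidysdm)
my_ens <- simple_ensemble()
my_ens <- add_member(my_ens, my_wflow_set, metric = "boyce_cont")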
spatial_initial_split
creates a single binary split of the data into a training
set and testing set. All strategies from the package spatialsample
are available;
a random split from that strategy will be used to generate the initial split.
spatial_initial_split(data, prop, strategy, ...)
data |
A dataset (data.frame or tibble) |
prop |
The proportion of data to be retained for modelling/analysis. |
strategy |
A sampling strategy from |
... |
parameters to be passed to the |
An rsplit
object that can be used with the rsample::training and rsample::testing
functions to extract the data in each split.
set.seed(123)
block_initial <- spatial_initial_split(boston_canopy,
  prop = 1 / 5, spatial_block_cv
)
testing(block_initial)
training(block_initial)
This function thins a dataset so that only one observation per cell is retained.
thin_by_cell(data, raster, coords = NULL, drop_na = TRUE, agg_fact = NULL)
data |
An |
raster |
A |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
drop_na |
boolean on whether locations that are NA in the raster should be dropped. |
agg_fact |
positive integer. Aggregation factor expressed as number of cells
in each direction (horizontally and vertically). Or two integers (horizontal
and vertical aggregation factor) or three integers (when also aggregating over layers).
Defaults to NULL, which implies no aggregation (i.e. thinning is done on the
grid of |
Further thinning can be achieved by aggregating cells in the raster
before thinning, by setting agg_fact > 1 (aggregation works in a
manner equivalent to terra::aggregate()).
An object of class sf::sf
or data.frame
, the same as "data".
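A minimal sketch (presences and climate_rast are hypothetical):

# thin to one observation per cell on a grid 5x coarser than the raster
thinned <- thin_by_cell(presences, raster = climate_rast, agg_fact = 5)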
This function thins a dataset so that only one observation per cell per time slice is retained. We use a raster with layers as time slices to define the data cube on which thinning is enforced (see details below on how time should be formatted).
thin_by_cell_time(
  data,
  raster,
  coords = NULL,
  time_col = "time",
  lubridate_fun = c,
  drop_na = TRUE,
  agg_fact = NULL
)
data |
An |
raster |
A |
coords |
a vector of length two giving the names of the "x" and "y"
coordinates, as found in |
time_col |
The name of the column with time; if time is not a lubridate object,
use |
lubridate_fun |
function to convert the time column into a lubridate object |
drop_na |
boolean on whether locations that are NA in the raster should be dropped. |
agg_fact |
positive integer. Aggregation factor expressed as number of cells
in each direction (horizontally and vertically). Or two integers (horizontal
and vertical aggregation factor) or three integers (when also aggregating over layers).
Defaults to NULL, which implies no aggregation (i.e. thinning is done on the
grid of |
Further spatial thinning can be achieved by aggregating cells in the raster
before thinning, by setting agg_fact > 1 (aggregation works in a
manner equivalent to terra::aggregate()).
An object of class sf::sf
or data.frame
, the same as "data".
This function thins a dataset so that only observations that have a distance from each other greater than "dist_min" are retained.
thin_by_dist(data, dist_min, coords = NULL)
data |
An |
dist_min |
Minimum distance between points (in units appropriate for the projection, or meters for lonlat data). |
coords |
A vector of length two giving the names of the "x" and "y"
coordinates, as found in |
Distances are measured in the appropriate units for the projection used. In case of raw latitude and longitude (e.g. as provided in a data.frame), the crs is set to WGS84, and units are set to meters.
This function is a modified version of the algorithm in spThin
, adapted
to work on sf
objects.
An object of class sf::sf
or data.frame
, the same as "data".
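A minimal sketch (presences is a hypothetical sf object in lonlat coordinates, so distances are in meters):

# retain only points at least 20 km apart
thinned <- thin_by_dist(presences, dist_min = 20000)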
This function thins a dataset so that only observations that have a distance from each other greater than "dist_min" in space and "interval_min" in time are retained.
thin_by_dist_time(
  data,
  dist_min,
  interval_min,
  coords = NULL,
  time_col = "time",
  lubridate_fun = c
)
data |
An |
dist_min |
Minimum distance between points (in units appropriate for the projection, or meters for lonlat data). |
interval_min |
Minimum time interval between points, in days. |
coords |
A vector of length two giving the names of the "x" and "y"
coordinates, as found in |
time_col |
The name of the column with time; if time is not a lubridate object,
use |
lubridate_fun |
function to convert the time column into a lubridate object |
Geographic distances are measured in the appropriate units for the projection used. In case
of raw latitude and longitude (e.g. as provided in a data.frame), the crs is set
to WGS84, and units are set to meters. Time intervals are measured in days. Note that for
very long time periods, the simple conversion x years = 365 * x days might lead
to slightly shorter intervals than expected, as it ignores leap years. The function
y2d() provides a closer approximation.
This function uses an algorithm analogous to spThin, with the exception that
neighbours are defined in terms of both space and time.
An object of class sf::sf
or data.frame
, the same as "data".
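A sketch combining both axes (presences_t is hypothetical), using y2d() to express the temporal interval in days as recommended above:

thinned_t <- thin_by_dist_time(presences_t,
  dist_min = 20000,
  interval_min = y2d(1000)
)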
The True Skill Statistic, which is defined as
tss(data, ...)

## S3 method for class 'data.frame'
tss(
  data,
  truth,
  estimate,
  estimator = NULL,
  na_rm = TRUE,
  case_weights = NULL,
  event_level = "first",
  ...
)
data |
Either a data.frame containing the columns specified by the truth and estimate arguments, or a table/matrix where the true class results should be in the columns of the table. |
... |
Not currently used. |
truth |
The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector. |
estimate |
The column identifier for the predicted class results (that is also factor). As with truth this can be specified different ways but the primary method is to use an unquoted variable name. For _vec() functions, a factor vector. |
estimator |
One of: "binary", "macro", "macro_weighted", or "micro" to specify the type of averaging to be done. "binary" is only relevant for the two class case. The other three are general methods for calculating multiclass metrics. The default will automatically choose "binary" or "macro" based on estimate. |
na_rm |
A logical value indicating whether NA values should be stripped before the computation proceeds. |
case_weights |
The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data. For _vec() functions, a numeric vector. |
event_level |
A single string. Either "first" or "second" to specify which level of truth to consider as the "event". This argument is only applicable when estimator = "binary". The default is "first". |
sensitivity + specificity - 1
This function is a wrapper around yardstick::j_index()
, another name for the
same quantity. Note that this function takes the classes as predicted by the
model without any calibration (i.e. making a split at 0.5 probability). This
is usually not the metric used for Species Distribution Models, where the
threshold is recalibrated to maximise TSS; for that purpose, use tss_max()
.
A tibble with columns .metric, .estimator, and .estimate and 1 row of values. For grouped data frames, the number of rows returned will be the same as the number of groups.
# Two class
data("two_class_example")
tss(two_class_example, truth, predicted)

# Multiclass
library(dplyr)
data(hpc_cv)

# Groups are respected
hpc_cv %>%
  group_by(Resample) %>%
  tss(obs, pred)
The True Skill Statistic, which is defined as
tss_max(data, ...)

## S3 method for class 'data.frame'
tss_max(
  data,
  truth,
  ...,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL
)

## S3 method for class 'sf'
tss_max(data, ...)

tss_max_vec(
  truth,
  estimate,
  estimator = NULL,
  na_rm = TRUE,
  event_level = "first",
  case_weights = NULL,
  ...
)
data |
Either a data.frame containing the columns specified by the truth and estimate arguments, or a table/matrix where the true class results should be in the columns of the table. |
... |
A set of unquoted column names or one or more dplyr selector functions to choose which variables contain the class probabilities. If truth is binary, only 1 column should be selected, and it should correspond to the value of event_level. Otherwise, there should be as many columns as factor levels of truth and the ordering of the columns should be the same as the factor levels of truth. |
truth |
The column identifier for the true class results (that is a factor). This should be an unquoted column name although this argument is passed by expression and supports quasiquotation (you can unquote column names). For _vec() functions, a factor vector. |
estimator |
One of "binary", "hand_till", "macro", or "macro_weighted" to specify the type of averaging to be done. "binary" is only relevant for the two class case. The others are general methods for calculating multiclass metrics. The default will automatically choose "binary" if truth is binary, "hand_till" if truth has >2 levels and case_weights isn't specified, or "macro" if truth has >2 levels and case_weights is specified (in which case "hand_till" isn't well-defined). |
na_rm |
A logical value indicating whether NA values should be stripped before the computation proceeds. |
event_level |
A single string. Either "first" or "second" to specify which level of truth to consider as the "event". This argument is only applicable when estimator = "binary". The default uses an internal helper that generally defaults to "first" |
case_weights |
The optional column identifier for case weights. This should be an unquoted column name that evaluates to a numeric column in data. For _vec() functions, a numeric vector. |
estimate |
If truth is binary, a numeric vector of class probabilities corresponding to the "relevant" class. Otherwise, a matrix with as many columns as factor levels of truth. It is assumed that these are in the same order as the levels of truth. |
sensitivity + specificity - 1
This function calibrates the probability threshold to classify presences to maximise the TSS.
There is no multiclass version of this function, it only operates on binary predictions (e.g. presences and absences in SDMs).
A tibble with columns .metric, .estimator, and .estimate and 1 row of values. For grouped data frames, the number of rows returned will be the same as the number of groups.
Other class probability metrics:
boyce_cont()
,
kap_max()
tss_max(two_class_example, truth, Class1)
This function takes a time interval in years and converts it into days,
the unit commonly used in time operations in R. The simple conversion
x * 365 does not work for large numbers of years, due to the presence of
leap years.
y2d(x)
x |
the number of years of the interval |
a difftime
object (in days)
y2d(1)
y2d(1000)