check_setup

setup(
  x_train,
  x_explain,
  approach,
  prediction_zero,
  output_size = 1,
  n_combinations,
  group,
  n_samples,
  n_batches,
  seed,
  keep_samp_for_vS,
  feature_specs,
  MSEv_uniform_comb_weights = TRUE,
  type = "normal",
  horizon = NULL,
  y = NULL,
  xreg = NULL,
  train_idx = NULL,
  explain_idx = NULL,
  explain_y_lags = NULL,
  explain_xreg_lags = NULL,
  group_lags = NULL,
  timing,
  verbose,
  is_python = FALSE,
  ...
)

Arguments

x_train

Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.

x_explain

Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.

approach

Character vector of length 1 or one less than the number of features. All elements should be either "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.
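The constraints on approach stated above can be sketched as a small base R check. The helper is_valid_approach() and the feature count m = 4 are hypothetical and only for illustration; this is not shapr's own validation code.

```r
# Approach names as listed in the argument description.
allowed <- c(
  "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical",
  "timeseries", "independence", "regression_separate", "regression_surrogate"
)
regression <- c("regression_separate", "regression_surrogate")

# Hypothetical helper: TRUE if `approach` satisfies the documented constraints
# for a model with `m` features.
is_valid_approach <- function(approach, m) {
  ok_length <- length(approach) %in% c(1, m - 1)
  ok_values <- all(approach %in% allowed)
  # A regression approach may not be mixed with any other approach.
  ok_regression <- !(any(approach %in% regression) && length(unique(approach)) > 1)
  ok_length && ok_values && ok_regression
}

m <- 4
ok_single <- is_valid_approach("gaussian", m)
ok_mixed <- is_valid_approach(c("gaussian", "empirical", "ctree"), m)
bad_regression <- is_valid_approach(c("gaussian", "regression_separate", "ctree"), m)
bad_length <- is_valid_approach(c("gaussian", "ctree"), m)
```

Here ok_single and ok_mixed are TRUE, while the last two calls return FALSE because they mix a regression approach with others or have the wrong length.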

prediction_zero

Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.
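As a minimal sketch of the two choices mentioned above, the value can be computed in base R. The vectors y_train and pred_train are made-up numbers used only for illustration.

```r
# Hypothetical training response and model predictions on the training data.
y_train <- c(21.0, 18.5, 30.2, 25.3)
pred_train <- c(20.4, 19.9, 28.8, 25.1)

# Choice 1: mean of the response variable in the training data.
prediction_zero <- mean(y_train)

# Choice 2: mean of the model's predictions in the training data.
prediction_zero_alt <- mean(pred_train)
```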

output_size

TODO: Document

n_combinations

Integer. If group = NULL, n_combinations represents the number of unique feature combinations to sample. If group != NULL, n_combinations represents the number of unique group combinations to sample. If n_combinations = NULL, the exact method is used and all combinations are considered. The maximum number of combinations equals 2^m, where m is the number of features.
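To make the 2^m bound concrete, the sketch below enumerates every feature combination for a small example in base R. The feature names are hypothetical, and the enumeration is for illustration only; it is not how shapr generates or samples combinations.

```r
# Hypothetical feature names.
features <- c("Solar.R", "Wind", "Temp", "Month")
m <- length(features)

# Each row marks which features are included in one combination,
# from the empty set (all FALSE) to the full set (all TRUE).
inclusion <- expand.grid(rep(list(c(FALSE, TRUE)), m))
names(inclusion) <- features

n_all <- nrow(inclusion)  # equals 2^m, here 16
```

With m = 4 the exact method would consider 16 combinations; the count doubles with every added feature, which is why sampling via n_combinations becomes relevant for larger m.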

group

List. If NULL, regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed. group then has length equal to the number of groups. Each list element contains a character vector with the features included in that group.
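A group list of the shape described above might look as follows. The feature and group names are hypothetical. The final check, that every feature belongs to exactly one group, is an assumption made for this sketch and is not stated in this section.

```r
# Hypothetical feature names.
features <- c("Solar.R", "Wind", "Temp", "Month")

# One list element per group; each element is a character vector of features.
group <- list(
  weather = c("Solar.R", "Wind", "Temp"),
  calendar = "Month"
)

n_groups <- length(group)

# Assumed sanity check: the groups cover every feature exactly once.
grouped <- unlist(group, use.names = FALSE)
is_partition <- !anyDuplicated(grouped) && setequal(grouped, features)
```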

n_samples

Positive integer. Indicates the maximum number of samples to use in the Monte Carlo integration for every conditional expectation. See also details.

n_batches

Positive integer (or NULL). Specifies how many batches the total number of feature combinations should be split into when calculating the contribution function for each test observation. The default value is NULL which uses a reasonable trade-off between RAM allocation and computation speed, which depends on approach and n_combinations. For models with many features, increasing the number of batches reduces the RAM allocation significantly. This typically comes with a small increase in computation time.
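The idea of splitting the feature combinations into batches can be sketched in base R as below. The values of n_combinations and n_batches are arbitrary, and the even split is an assumption for illustration; shapr's actual batching rule may differ.

```r
n_combinations <- 16
n_batches <- 4

# Assign each combination index to a batch, keeping batch sizes as even as possible.
batch_id <- sort(rep_len(seq_len(n_batches), n_combinations))
batches <- split(seq_len(n_combinations), batch_id)

# Only one batch of combinations needs to be held in memory at a time.
batch_sizes <- lengths(batches)
```

More batches mean fewer combinations per batch, which is the mechanism behind the lower RAM use described above.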

seed

Positive integer. Specifies the seed to set before any code involving randomness is run. If NULL, the seed will be inherited from the calling environment.
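The behaviour described above can be illustrated with a small stand-in function. run_with_seed() is hypothetical and only mimics the documented semantics: a non-NULL seed is set before sampling, while NULL leaves the caller's random number state untouched.

```r
# Hypothetical stand-in for a function that accepts a `seed` argument.
run_with_seed <- function(seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  runif(3)
}

# Same explicit seed: identical draws.
a <- run_with_seed(123)
b <- run_with_seed(123)

# seed = NULL: the draws depend on the seed set by the caller.
set.seed(1)
c1 <- run_with_seed(NULL)
set.seed(1)
c2 <- run_with_seed(NULL)
```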

keep_samp_for_vS

Logical. Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in internal$output).

feature_specs

List. The output from get_model_specs() or get_data_specs(). Contains the 3 elements:

labels

Character vector with the names of each feature.

classes

Character vector with the class of each feature.

factor_levels

Character vector with the levels for any categorical features.
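A list with the three elements described above can be assembled by hand in base R, as in the sketch below. The data frame x and its column names are hypothetical, and storing factor_levels as one entry per feature is an assumption of this sketch; the exact structure returned by get_model_specs() or get_data_specs() may differ.

```r
# Hypothetical data with one numeric and one categorical feature.
x <- data.frame(
  Temp = c(67, 72, 74),
  Month_factor = factor(c("5", "5", "6"))
)

feature_specs_sketch <- list(
  # Names of each feature.
  labels = names(x),
  # Class of each feature (first class only).
  classes = vapply(x, function(col) class(col)[1], character(1)),
  # Levels of categorical features; NULL for non-factor columns.
  factor_levels = lapply(x, levels)
)
```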

MSEv_uniform_comb_weights

Logical. If TRUE (default), the function weights the combinations uniformly when computing the MSEv criterion. If FALSE, the function uses the Shapley kernel weights to weight the combinations when computing the MSEv criterion. Note that the Shapley kernel weights are replaced by the sampling frequency when not all combinations are considered.

type

Character. Either "normal" or "forecast", corresponding to the function that setup() is called from and, accordingly, the type of explanation that should be generated.

horizon

Numeric. The forecast horizon to explain. Passed to the predict_model function.

y

Matrix, data.frame/data.table or a numeric vector. Contains the endogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained.

xreg

Matrix, data.frame/data.table or a numeric vector. Contains the exogenous variables used to estimate the (conditional) distributions needed to properly estimate the conditional expectations in the Shapley formula including the observations to be explained. As exogenous variables are used contemporaneously when producing a forecast, this item should contain nrow(y) + horizon rows.
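The row requirement for xreg can be checked before calling the function, as in this base R sketch. The simulated y and xreg and the horizon of 3 are made up for illustration.

```r
set.seed(42)
horizon <- 3

# Hypothetical endogenous series with 20 time points.
y <- matrix(rnorm(20), ncol = 1)

# Exogenous variables must also cover the `horizon` future time points.
xreg <- matrix(rnorm((nrow(y) + horizon) * 2), ncol = 2)

rows_ok <- nrow(xreg) == nrow(y) + horizon
```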

train_idx

Numeric vector. The row indices in data and reg denoting points in time to use when estimating the conditional expectations in the Shapley value formula. If train_idx = NULL (default), all indices not selected to be explained will be used.

explain_idx

Numeric vector. The row indices in data and reg denoting points in time to explain.

explain_y_lags

Numeric vector. Denotes the number of lags that should be used for each variable in y when making a forecast.

explain_xreg_lags

Numeric vector. If xreg != NULL, denotes the number of lags that should be used for each variable in xreg when making a forecast.

group_lags

Logical. If TRUE, all lags of each variable are grouped together and explained as a group. If FALSE, all lags of each variable are explained individually.

timing

Logical. Whether the timing of the different parts of explain() should be saved in the model object.

verbose

Integer specifying the level of verbosity. If 0 (default), shapr stays silent. If 1, it prints information about performance. If 2, some additional information is printed. TODO: Make this clearer when we end up fixing this and if they should force a progressr bar.

is_python

Logical. Indicates whether the function is called from the Python wrapper. Default is FALSE, which is never changed when calling the function via explain() in R. The parameter is later used to disallow running the AICc versions of the empirical approach, as these require data-based optimization.

...

Further arguments passed to specific approaches.