R/explain.R
explain.Rd
Computes dependence-aware Shapley values for observations in x_explain from the specified model by using the method specified in approach to estimate the conditional expectation.
explain(
model,
x_explain,
x_train,
approach,
prediction_zero,
n_combinations = NULL,
group = NULL,
n_samples = 1000,
n_batches = NULL,
seed = 1,
keep_samp_for_vS = FALSE,
predict_model = NULL,
get_model_specs = NULL,
MSEv_uniform_comb_weights = TRUE,
timing = TRUE,
verbose = 0,
...
)
The model whose predictions we want to explain.
Run get_supported_models() for a table of the models explain() supports natively. Unsupported models can still be explained by passing predict_model and (optionally) get_model_specs; see details for more information.
Matrix or data.frame/data.table. Contains the features whose predictions ought to be explained.
Matrix or data.frame/data.table. Contains the data used to estimate the (conditional) distributions for the features needed to properly estimate the conditional expectations in the Shapley formula.
Character vector of length 1 or one less than the number of features. All elements should be one of "gaussian", "copula", "empirical", "ctree", "vaeac", "categorical", "timeseries", "independence", "regression_separate", or "regression_surrogate". The two regression approaches cannot be combined with any other approach. See details for more information.
Numeric. The prediction value for unseen data, i.e. an estimate of the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices such as the mean of the predictions in the training data are also reasonable.
Integer. If group = NULL, n_combinations represents the number of unique feature combinations to sample. If group != NULL, n_combinations represents the number of unique group combinations to sample. If n_combinations = NULL, the exact method is used and all combinations are considered. The maximum number of combinations equals 2^m, where m is the number of features.
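For instance, a minimal sketch of sampling-based (approximate) estimation, assuming model, x_explain, x_train, and p are defined as in the examples below (with four features there are 2^4 = 16 combinations in total):
explain_sampled <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "gaussian",
  prediction_zero = p,
  n_combinations = 10 # sample 10 of the 16 possible feature combinations
)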
List. If NULL, regular feature-wise Shapley values are computed. If provided, group-wise Shapley values are computed. group then has length equal to the number of groups, and each list element contains a character vector with the features included in that group.
Positive integer. The maximum number of samples to use in the Monte Carlo integration of every conditional expectation. See also details.
Positive integer (or NULL). Specifies how many batches the total number of feature combinations should be split into when calculating the contribution function for each test observation. The default value is NULL, which uses a reasonable trade-off between RAM allocation and computation speed, depending on approach and n_combinations. For models with many features, increasing the number of batches reduces the RAM allocation significantly, typically at the cost of a small increase in computation time.
Positive integer. Specifies the seed before any randomness-based code is run. If NULL, the seed is inherited from the calling environment.
Logical. Indicates whether the samples used in the Monte Carlo estimation of v_S should be returned (in internal$output).
Function. The prediction function used when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function must have two arguments, model and newdata, which specify, respectively, the model and a data.frame/data.table to compute predictions for. The function must return the predictions as a numeric vector. NULL (the default) uses functions specified internally. Can also be used to override the default function for natively supported model classes.
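For example, a hedged sketch of a custom prediction function for a hypothetical gbm regression model (the gbm-specific arguments are assumptions, not part of explain()):
my_predict_model <- function(model, newdata) {
  # Return the predictions as a plain numeric vector, as required
  as.numeric(predict(model, newdata = newdata, n.trees = model$n.trees, type = "response"))
}
# Then pass it along: explain(..., predict_model = my_predict_model)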
Function. An optional function for checking model/data consistency when model is not natively supported. (Run get_supported_models() for a list of natively supported models.) The function takes model as its only argument and returns a list with three elements: a character vector with the names of each feature, a character vector with the classes of each feature, and a character vector with the levels of any categorical features. If NULL (the default), internal functions are used for natively supported model classes, and the checking is disabled for unsupported model classes. Can also be used to override the default function for natively supported model classes.
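As a hedged sketch, such a function for an lm model might look as follows; the element names (labels, classes, factor_levels) are assumptions based on the natively supported implementations, so check them against your shapr version:
my_get_model_specs <- function(model) {
  feature_data <- model$model[, -1, drop = FALSE] # drop the response column
  list(
    labels = names(feature_data),
    classes = vapply(feature_data, function(x) class(x)[1], character(1)),
    factor_levels = lapply(feature_data, levels)
  )
}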
Logical. If TRUE (default), the function weights the combinations uniformly when computing the MSEv criterion. If FALSE, the function uses the Shapley kernel weights to weight the combinations when computing the MSEv criterion. Note that the Shapley kernel weights are replaced by the sampling frequency when not all combinations are considered.
Logical. Whether the timing of the different parts of explain() should be saved in the model object.
Integer specifying the level of verbosity. If 0 (default), shapr stays silent; if 1, it prints information about performance; if 2, some additional information is printed.
Arguments passed on to setup_approach.empirical, setup_approach.independence, setup_approach.gaussian, setup_approach.copula, setup_approach.ctree, setup_approach.vaeac, setup_approach.categorical, setup_approach.regression_separate, setup_approach.regression_surrogate, and setup_approach.timeseries.
empirical.type
Character. (default = "fixed_sigma") Should be equal to either "independence", "fixed_sigma", "AICc_each_k", or "AICc_full".
empirical.eta
Numeric. (default = 0.95) Needs to satisfy 0 < eta <= 1. Represents the minimum proportion of the total empirical weight that the used data samples should cover. If e.g. eta = 0.8, we choose the K samples with the largest weights such that their weights sum to at least 80% of the total empirical weight. eta is the \(\eta\) parameter in equation (15) of Aas et al. (2021).
empirical.fixed_sigma
Positive numeric scalar. (default = 0.1) Represents the kernel bandwidth in the distance computation used when conditioning on all different combinations. Only used when empirical.type = "fixed_sigma".
empirical.n_samples_aicc
Positive integer. (default = 1000) Number of samples to consider in the AICc optimization. Only used when empirical.type is either "AICc_each_k" or "AICc_full".
empirical.eval_max_aicc
Positive integer. (default = 20) Maximum number of iterations when optimizing the AICc. Only used when empirical.type is either "AICc_each_k" or "AICc_full".
empirical.start_aicc
Numeric. (default = 0.1) Start value of the sigma parameter when optimizing the AICc. Only used when empirical.type is either "AICc_each_k" or "AICc_full".
empirical.cov_mat
Numeric matrix. (Optional, default = NULL) The covariance matrix of the data-generating distribution used to define the Mahalanobis distance. NULL means it is estimated from x_train.
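For instance, a hedged sketch passing these empirical-approach parameters through ..., assuming model, x_explain, x_train, and p are defined as in the examples below:
explain_emp <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "empirical",
  prediction_zero = p,
  empirical.type = "fixed_sigma",
  empirical.fixed_sigma = 0.2, # wider kernel bandwidth than the 0.1 default
  empirical.eta = 0.8          # keep samples covering 80% of the total weight
)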
internal
Not used.
gaussian.mu
Numeric vector. (Optional) The mean of the data-generating distribution. NULL means it is estimated from x_train.
gaussian.cov_mat
Numeric matrix. (Optional) The covariance matrix of the data-generating distribution. NULL means it is estimated from x_train.
ctree.mincriterion
Numeric scalar or vector. (default = 0.95) Either a scalar or vector of length equal to the number of features in the model. Value is equal to 1 - \(\alpha\) where \(\alpha\) is the nominal level of the conditional independence tests. If it is a vector, this indicates which value to use when conditioning on various numbers of features.
ctree.minsplit
Numeric scalar. (default = 20) Determines the minimum sum of weights in a node required for the node to be considered for splitting.
ctree.minbucket
Numeric scalar. (default = 7) Determines the minimum sum of weights in a terminal node required for a split.
ctree.sample
Boolean. (default = TRUE) If TRUE, the method always samples n_samples observations from the leaf nodes (with replacement). If FALSE and the number of observations in the leaf node is less than n_samples, the method takes all observations in the leaf. If FALSE and the number of observations in the leaf node is greater than n_samples, the method samples n_samples observations (with replacement). In other words, sampling always takes place in the leaf unless ctree.sample = FALSE and the number of observations in the node is less than n_samples.
vaeac.depth
Positive integer (default is 3
). The number of hidden layers
in the neural networks of the masked encoder, full encoder, and decoder.
vaeac.width
Positive integer (default is 32
). The number of neurons in each
hidden layer in the neural networks of the masked encoder, full encoder, and decoder.
vaeac.latent_dim
Positive integer (default is 8
). The number of dimensions in the latent space.
vaeac.lr
Positive numeric (default is 0.001
). The learning rate used in the torch::optim_adam()
optimizer.
vaeac.activation_function
A torch::nn_module() representing an activation function such as, e.g., torch::nn_relu() (default), torch::nn_leaky_relu(), torch::nn_selu(), or torch::nn_sigmoid().
vaeac.n_vaeacs_initialize
Positive integer (default is 4). The number of different vaeac models to initialize at the start. The best-performing model after vaeac.extra_parameters$epochs_initiation_phase epochs (default is 2) is selected, and training continues with that one.
vaeac.epochs
Positive integer (default is 100
). The number of epochs to train the final vaeac model.
This includes vaeac.extra_parameters$epochs_initiation_phase
, where the default is 2
.
vaeac.extra_parameters
Named list with extra parameters to the vaeac
approach. See
vaeac_get_extra_para_default()
for description of possible additional parameters and their default values.
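For instance, a hedged sketch of a vaeac call, assuming model, x_explain, x_train, and p are defined as in the examples below (requires the torch package; the reduced epoch counts are purely illustrative):
explain_vaeac <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  approach = "vaeac",
  prediction_zero = p,
  vaeac.epochs = 10,            # fewer epochs than the default 100, for speed
  vaeac.n_vaeacs_initialize = 2
)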
categorical.joint_prob_dt
Data.table. (Optional) The joint probability distribution for each combination of feature values. NULL means it is estimated from x_train and x_explain.
categorical.epsilon
Numeric value. (Optional) If categorical.joint_prob_dt is not supplied, probabilities/frequencies are estimated using x_train. If certain feature combinations occur in x_explain but NOT in x_train, then epsilon is used as the proportion of times these combinations occur in the training data. In theory, this proportion should be zero, but that causes an error later in the Shapley computation.
regression.model
A tidymodels object of class model_spec. The default is a linear regression model, i.e., parsnip::linear_reg(). See tidymodels for all possible models, and see the vignette for how to add new/own models. Note that, to make it easier to call explain() from Python, the regression.model parameter can also be a string specifying the model, which will be parsed and evaluated. For example, "parsnip::rand_forest(mtry = hardhat::tune(), trees = 100, engine = 'ranger', mode = 'regression')" is also a valid input. It is essential to include the package prefix if the package is not loaded.
regression.tune_values
Either NULL (default), a data.frame/data.table/tibble, or a function. The data.frame must contain the possible hyperparameter value combinations to try, and its column names must match the names of the tuneable parameters specified in regression.model. If regression.tune_values is a function, it should take one argument x, the training data for the current combination/coalition, and return a data.frame/data.table/tibble with the properties described above. Using a function allows the hyperparameter values to change based on the size of the combination; see the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.tune_values can also be a string containing an R function. For example, "function(x) return(dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3))" is also a valid input. It is essential to include the package prefix if the package is not loaded.
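As a hedged sketch (assuming model, x_explain, x_train, and p as in the examples below, and that the dials package is installed), tuning mtry of a random forest per coalition size might look like:
explain_tuned <- explain(
  model = model,
  x_explain = x_explain,
  x_train = x_train,
  prediction_zero = p,
  approach = "regression_separate",
  regression.model = parsnip::rand_forest(
    mtry = hardhat::tune(), trees = 50, engine = "ranger", mode = "regression"
  ),
  regression.tune_values = function(x) {
    # Let the mtry grid adapt to the number of features in the coalition
    dials::grid_regular(dials::mtry(c(1, ncol(x))), levels = 3)
  },
  regression.vfold_cv_para = list(v = 5)
)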
regression.vfold_cv_para
Either NULL
(default) or a named list containing
the parameters to be sent to rsample::vfold_cv()
. See the regression vignette for
several examples.
regression.recipe_func
Either NULL (default) or a function that takes in a recipes::recipe() object and returns a modified recipes::recipe() with potentially additional recipe steps. See the regression vignette for several examples. Note that, to make it easier to call explain() from Python, regression.recipe_func can also be a string containing an R function. For example, "function(recipe) return(recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2))" is also a valid input. It is essential to include the package prefix if the package is not loaded.
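A hedged sketch of such a function, mirroring the string example above (assumes the recipes package is installed):
my_recipe_func <- function(recipe) {
  # Add natural-spline terms for all numeric predictors
  recipes::step_ns(recipe, recipes::all_numeric_predictors(), deg_free = 2)
}
# Then pass it along: explain(..., regression.recipe_func = my_recipe_func)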
regression.surrogate_n_comb
Integer (default is internal$parameters$used_n_combinations) specifying the number of unique combinations/coalitions to apply to each training observation. The maximum allowed value is internal$parameters$used_n_combinations - 2. By default, we use all coalitions, but this can take a lot of memory in larger dimensions. Note that by "all", we mean all coalitions chosen by shapr to be used, which will be all \(2^{n_{\text{features}}}\) coalitions (minus the empty and grand coalitions) if shapr is in exact mode. If the user sets a lower value than internal$parameters$used_n_combinations, then we sample this number of unique coalitions separately for each training observation, so that, on average, all coalitions are trained equally often.
timeseries.fixed_sigma_vec
Numeric. (Default = 2) Represents the kernel bandwidth in the distance computation.
timeseries.bounds
Numeric vector of length two. (Default = c(NULL, NULL)) If one or both of these bounds are not NULL, we restrict the sampled time series to be between these bounds. This is useful if the underlying time series are scaled between 0 and 1, for example.
Object of class c("shapr", "list"). Contains the following items:
shapley_values: data.table with the estimated Shapley values.
internal: List with the different parameters, data, and functions used internally.
pred_explain: Numeric vector with the predictions for the explained observations.
MSEv: List with the values of the MSEv evaluation criterion for the approach.
shapley_values is a data.table where the number of rows equals the number of observations to explain, and the number of columns equals m + 1, where m is the total number of features in your model.
If shapley_values[i, j + 1] > 0, the j-th feature increased the prediction for the i-th observation; likewise, if shapley_values[i, j + 1] < 0, the j-th feature decreased the prediction for the i-th observation. The magnitude of the value is also important: if shapley_values[i, k + 1] and shapley_values[i, j + 1] are both greater than 0, where j != k, and shapley_values[i, k + 1] > shapley_values[i, j + 1], then features j and k both increased the value of the prediction, but the effect of the k-th feature was larger than that of the j-th feature.
The first column in shapley_values, called none, is the prediction value not assigned to any of the features (\(\phi_0\)). It is equal for all observations and is set by the user through the argument prediction_zero. The difference between the prediction and none is distributed among the other features. In theory, this value should be the expected prediction without conditioning on any features. Typically we set this value equal to the mean of the response variable in our training data, but other choices, such as the mean of the predictions in the training data, are also reasonable.
The most important thing to notice is that shapr has implemented eight different Monte Carlo-based approaches for estimating the conditional distributions of the data, namely "empirical", "gaussian", "copula", "ctree", "vaeac", "categorical", "timeseries", and "independence". shapr has also implemented two regression-based approaches, "regression_separate" and "regression_surrogate"; see the separate vignette on the regression-based approaches for more information.
In addition, the user has the option of combining the different Monte Carlo-based approaches. E.g., if you have trained a model with 10 features and would like to use the "gaussian" approach when conditioning on a single feature, the "empirical" approach when conditioning on 2-5 features, and the "copula" approach when conditioning on more than 5 features, this can be done by simply passing approach = c("gaussian", rep("empirical", 4), rep("copula", 4)). Here approach[i] = "gaussian" means that the "gaussian" approach is used when conditioning on i features. Conditioning on all features needs no approach, as that is given by the complete prediction itself, and should thus not be part of the vector.
For approach = "ctree", n_samples corresponds to the number of samples from the leaf node (see an exception related to the ctree.sample argument). For approach = "empirical", n_samples is the \(K\) parameter in equations (14-15) of Aas et al. (2021), i.e. the maximum number of observations (with the largest weights) that is used; see also the empirical.eta argument.
Aas, K., Jullum, M., & Løland, A. (2021). Explaining individual predictions when features are dependent: More accurate approximations to Shapley values. Artificial Intelligence, 298, 103502.
# Load example data
data("airquality")
airquality <- airquality[complete.cases(airquality), ]
x_var <- c("Solar.R", "Wind", "Temp", "Month")
y_var <- "Ozone"
# Split data into training and test data
data_train <- head(airquality, -3)
data_explain <- tail(airquality, 3)
x_train <- data_train[, x_var]
x_explain <- data_explain[, x_var]
# Fit a linear model
lm_formula <- as.formula(paste0(y_var, " ~ ", paste0(x_var, collapse = " + ")))
model <- lm(lm_formula, data = data_train)
# Explain predictions
p <- mean(data_train[, y_var])
# Empirical approach
explain1 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "empirical",
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Gaussian approach
explain2 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "gaussian",
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Gaussian copula approach
explain3 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "copula",
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# ctree approach
explain4 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = "ctree",
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Combined approach
approach <- c("gaussian", "gaussian", "empirical")
explain5 <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
approach = approach,
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 10 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
# Print the Shapley values
print(explain1$shapley_values)
#> none Solar.R Wind Temp Month
#> <num> <num> <num> <num> <num>
#> 1: 42.78704 6.124296 -20.137653 -5.033967 -5.987303
#> 2: 42.78704 -1.470838 11.525868 -9.487924 -5.597657
#> 3: 42.78704 3.524599 -5.335059 -16.599988 -8.703929
# Plot the results
if (requireNamespace("ggplot2", quietly = TRUE)) {
plot(explain1)
plot(explain1, plot_type = "waterfall")
}
# Group-wise explanations
group_list <- list(A = c("Temp", "Month"), B = c("Wind", "Solar.R"))
explain_groups <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
group = group_list,
approach = "empirical",
prediction_zero = p,
n_samples = 1e2
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
print(explain_groups$shapley_values)
#> none A B
#> <num> <num> <num>
#> 1: 42.78704 -11.63856 -13.396062
#> 2: 42.78704 -10.36824 5.337683
#> 3: 42.78704 -25.79874 -1.315633
# Separate and surrogate regression approaches with linear regression models.
# More complex regression models can be used, and we can use CV to
# tune the hyperparameters of the regression models and preprocess
# the data before sending it to the model. See the regression vignette
# (Shapley value explanations using the regression paradigm) for more
# details about the `regression_separate` and `regression_surrogate` approaches.
explain_separate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
prediction_zero = p,
approach = "regression_separate",
regression.model = parsnip::linear_reg()
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.
explain_surrogate_lm <- explain(
model = model,
x_explain = x_explain,
x_train = x_train,
prediction_zero = p,
approach = "regression_surrogate",
regression.model = parsnip::linear_reg()
)
#> Setting parameter 'n_batches' to 2 as a fair trade-off between memory consumption and computation time.
#> Reducing 'n_batches' typically reduces the computation time at the cost of increased memory consumption.