R/preprocess_data.R
preprocess_data.Rd
Process (check and update) data according to specified feature list
preprocess_data(x, feature_list)
x: matrix, data.frame or data.table. The data to check and update according to the specification in feature_list.
feature_list: List. Output from running get_data_specs or get_model_specs.
List with two named elements: x_dt, the checked and updated data x in data.table format, and updated_feature_list, the output from check_features.
This function takes care of all preprocessing and checking of the provided data in x against the feature_list, which is typically the output from get_model_specs.
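To make the description above more concrete, here is a minimal base-R sketch of the kind of checks such a preprocessing step might perform. It is not the package's implementation. The helper name sketch_preprocess is hypothetical, and the specific checks (presence of features, class comparison, re-levelling factors) are assumptions chosen for illustration. The toy specification only mirrors the labels, classes and factor_levels elements visible in the printed feature list, and the sketch returns a plain data.frame rather than a data.table.

```r
# Hypothetical helper, NOT the package's implementation: a base-R sketch of
# checking data against a feature specification.
sketch_preprocess <- function(x, feature_list) {
  x <- as.data.frame(x)
  labels <- feature_list$labels

  # Every feature named in the specification must be present in the data
  missing_cols <- setdiff(labels, names(x))
  if (length(missing_cols) > 0) {
    stop("Missing feature(s): ", paste(missing_cols, collapse = ", "))
  }

  # Drop columns the specification does not use and reorder the rest
  dropped <- setdiff(names(x), labels)
  if (length(dropped) > 0) {
    message("Removing unused column(s): ", paste(dropped, collapse = ", "))
  }
  x <- x[, labels, drop = FALSE]

  # Compare the class of each column with the specification
  data_classes <- vapply(x, function(col) class(col)[1], character(1))
  mismatch <- labels[data_classes[labels] != feature_list$classes[labels]]
  if (length(mismatch) > 0) {
    stop("Class mismatch for: ", paste(mismatch, collapse = ", "))
  }

  # Re-level factors so their level order follows the specification
  for (nm in labels[data_classes[labels] == "factor"]) {
    x[[nm]] <- factor(x[[nm]], levels = feature_list$factor_levels[[nm]])
  }

  list(x_dt = x, updated_feature_list = feature_list)
}

# Toy specification mirroring the structure of the printed feature list
spec <- list(
  labels = c("lstat", "rad"),
  classes = c(lstat = "numeric", rad = "factor"),
  factor_levels = list(lstat = NULL, rad = c("1", "2", "3"))
)

# Toy data with one unused column and factor levels in a different order
toy <- data.frame(
  lstat = c(4.98, 9.14, 4.03),
  rad = factor(c("1", "2", "2"), levels = c("2", "1", "3")),
  medv = c(24.0, 21.6, 34.7)
)

res <- sketch_preprocess(toy, spec)
str(res$x_dt)
```

In this sketch the unused medv column is dropped with a message, and rad keeps its values while its levels are reordered to match the specification.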
# Load example data
if (requireNamespace("MASS", quietly = TRUE)) {
  data("Boston", package = "MASS")
  # Use the first six rows as a small training set
  x_train <- data.table::as.data.table(head(Boston))
  # Treat rad as a categorical feature
  x_train[, rad := as.factor(rad)]
  # Feature specification derived from the data
  data_features <- get_data_specs(x_train)
  model <- lm(medv ~ lstat + rm + rad + indus, data = x_train)
  # Feature specification derived from the fitted model
  model_features <- get_model_specs(model)
  # Check x_train against the model's specification
  preprocess_data(x_train, model_features)
}
#>
#> Success with message:
#> The columns(s) crim, zn, chas, nox, age, dis, tax, ptratio, black, medv is not used by the model and thus removed from the data.
#> $x_dt
#>    lstat    rm rad indus
#> 1:  4.98 6.575   1  2.31
#> 2:  9.14 6.421   2  7.07
#> 3:  4.03 7.185   2  7.07
#> 4:  2.94 6.998   3  2.18
#> 5:  5.33 7.147   3  2.18
#> 6:  5.21 6.430   3  2.18
#>
#> $updated_feature_list
#> $updated_feature_list$labels
#> [1] "lstat" "rm" "rad" "indus"
#>
#> $updated_feature_list$classes
#>     lstat        rm       rad     indus
#> "numeric" "numeric"  "factor" "numeric"
#>
#> $updated_feature_list$factor_levels
#> $updated_feature_list$factor_levels$lstat
#> NULL
#>
#> $updated_feature_list$factor_levels$rm
#> NULL
#>
#> $updated_feature_list$factor_levels$rad
#> [1] "1" "2" "3"
#>
#> $updated_feature_list$factor_levels$indus
#> NULL
#>
#>
#> $updated_feature_list$specs_type
#> [1] "model"
#>
#>