Machine Learning with R-tidymodels: classification
machine learning
Last week I shared some examples on Regression Models following the very good workshop on Machine Learning organized by the Rbootcamp. This week I’m continuing with Classification Models.
setup
library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(patchwork)
tidymodels_prefer()
logistic regression
sample
airbnb <- read_csv(file = "data/airbnb.csv")
Rows: 1191 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): district, host_respons_time, kitchen, tv, coffe_machine, dishwashe...
dbl (14): price, accommodates, bedrooms, bathrooms, cleaning_fee, availabili...
lgl (1): host_superhost
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
airbnb <-
  airbnb %>%
  mutate(host_superhost = factor(host_superhost, levels = c(TRUE, FALSE)))
set.seed(123)
airbnb_split <- initial_split(airbnb, prop = .8, strata = host_superhost)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
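To illustrate what `strata = host_superhost` does in the split, here is a minimal sketch on a hypothetical toy data set. The object `toy`, its `price` column and the assumed 20% share of superhosts are made up for illustration and are not the Airbnb data. With stratified sampling, the share of superhosts should come out roughly equal in both partitions.

```r
library(tidymodels)

set.seed(123)

# Hypothetical toy data (not the Airbnb file): 100 TRUE and 400 FALSE
toy <- tibble(
  price = runif(500, 20, 200),
  host_superhost = factor(rep(c(TRUE, FALSE), times = c(100, 400)),
                          levels = c(TRUE, FALSE))
)

# Same call pattern as in the post: 80% training, stratified on the outcome
toy_split <- initial_split(toy, prop = .8, strata = host_superhost)
toy_train <- training(toy_split)
toy_test  <- testing(toy_split)

# Share of superhosts in each partition
train_share <- mean(toy_train$host_superhost == "TRUE")
test_share  <- mean(toy_test$host_superhost == "TRUE")
c(train = train_share, test = test_share)
```

Without `strata`, the two shares could drift apart by chance, which matters most when one class is rare.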
recipe
logistic_recipe <-
  recipe(host_superhost ~ ., data = airbnb_train) %>%
  step_dummy(all_nominal_predictors())
logistic_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Dummy variables from all_nominal_predictors()
model
logistic_model <-
  logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")
translate(logistic_model)
Logistic Regression Model Specification (classification)
Computational engine: glm
Model fit template:
stats::glm(formula = missing_arg(), data = missing_arg(), weights = missing_arg(),
family = stats::binomial)
workflow
logistic_workflow <-
  workflow() %>%
  add_recipe(logistic_recipe) %>%
  add_model(logistic_model)
logistic_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: logistic_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Logistic Regression Model Specification (classification)
Computational engine: glm
fit
superhost_glm <-
  logistic_workflow %>%
  fit(airbnb_train)
predict
logistic_pred <-
  predict(superhost_glm, airbnb_train, type = "prob") %>%
  bind_cols(predict(superhost_glm, airbnb_train)) %>%
  bind_cols(airbnb_train %>% select(host_superhost))
metrics
logistic_metrics <- metrics(logistic_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
plot
create_model_plot <- function(prediction_data, model_metrics, title_text) {
  # One text label per metric, stacked vertically inside the plot area
  annotation_data <- tibble(
    x_position = 0.65,
    y_position = c(0.1, 0.2, 0.3, 0.4),
    label_value = str_glue_data(model_metrics, "{.metric}: {round(.estimate, 2)}")
  )
  # ROC curve with the metric labels drawn on top
  prediction_data %>%
    roc_curve(truth = host_superhost, .pred_TRUE) %>%
    autoplot() +
    labs(title = as.character(title_text)) +
    geom_text(
      data = annotation_data,
      mapping = aes(x = x_position, y = y_position, label = label_value),
      size = 3
    )
}
lg_plot <- create_model_plot(logistic_pred, logistic_metrics, "ROC logistic reg.")
decision tree
recipe
tree_recipe <-
  recipe(host_superhost ~ ., data = airbnb_train) %>%
  step_other(all_nominal_predictors(), threshold = 0.005)
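As a rough illustration of what `step_other()` does in this recipe, the sketch below pools rare factor levels on a hypothetical toy data set. The level names `d1` to `d4` and the counts are made up, and the threshold is raised to 0.05 only because the toy data has just 100 rows.

```r
library(tidymodels)

# Hypothetical toy data: two common districts and two very rare ones
toy <- tibble(
  host_superhost = factor(rep(c("TRUE", "FALSE"), 50), levels = c("TRUE", "FALSE")),
  district = factor(c(rep("d1", 60), rep("d2", 38), "d3", "d4"))
)

toy_recipe <-
  recipe(host_superhost ~ district, data = toy) %>%
  step_other(all_nominal_predictors(), threshold = 0.05)

# prep() estimates which levels are rare, bake() applies the pooling
toy_baked <- toy_recipe %>%
  prep() %>%
  bake(new_data = NULL)

# d3 and d4 (1% of rows each) fall below the threshold and are pooled together
table(toy_baked$district)
```

Pooling rare levels this way keeps the tree from splitting on categories that appear in only a handful of rows.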
model
dt_model <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("classification")
workflow
dt_workflow <-
  workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(dt_model)
fit
superhost_dt <-
  dt_workflow %>%
  fit(airbnb_train)
predict
dt_pred <-
  predict(superhost_dt, airbnb_train, type = "prob") %>%
  bind_cols(predict(superhost_dt, airbnb_train)) %>%
  bind_cols(airbnb_train %>% select(host_superhost))
metrics
dt_metrics <- metrics(dt_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
plot
dt_plot <- create_model_plot(dt_pred, dt_metrics, "ROC Decision tree")
random forest
model
rf_model <-
  rand_forest() %>%
  set_engine("ranger") %>%
  set_mode("classification")
workflow
rf_workflow <-
  workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_model)
fit
superhost_rf <-
  rf_workflow %>%
  fit(airbnb_train)
predict
rf_pred <-
  predict(superhost_rf, airbnb_train, type = "prob") %>%
  bind_cols(predict(superhost_rf, airbnb_train)) %>%
  bind_cols(airbnb_train %>% select(host_superhost))
metrics
rf_metrics <- metrics(rf_pred, truth = host_superhost, estimate = .pred_class, .pred_TRUE)
plot
rf_plot <- create_model_plot(rf_pred, rf_metrics, "ROC Random forest")
metrics overview
lg_plot + dt_plot + rf_plot
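All three sets of predictions in this post are computed on `airbnb_train`, so the curves describe how well each model fits data it has already seen. The sketch below shows one way the same metrics could be computed on held-out data instead. It is an assumption-laden stand-in: it uses `two_class_dat` from the modeldata package (outcome `Class`, event level `Class1`) rather than the Airbnb file, and `augment()` rather than the `predict()` plus `bind_cols()` pattern used in the post.

```r
library(tidymodels)

set.seed(123)

# Stand-in data set shipped with modeldata (not the Airbnb data)
data(two_class_dat, package = "modeldata")

toy_split <- initial_split(two_class_dat, prop = .8, strata = Class)

# Fit on the training partition only
toy_fit <- workflow() %>%
  add_formula(Class ~ .) %>%
  add_model(logistic_reg() %>% set_engine("glm")) %>%
  fit(training(toy_split))

# augment() adds .pred_class and the class probabilities to the test rows
toy_test_pred <- augment(toy_fit, testing(toy_split))

# Same metrics() call pattern as in the post, but on unseen data
toy_test_metrics <- metrics(toy_test_pred, truth = Class,
                            estimate = .pred_class, .pred_Class1)
toy_test_metrics
```

For flexible models such as a random forest, training-set metrics tend to look much better than test-set metrics, so the held-out numbers are usually the fairer basis for comparison.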
For classification models, the typical performance metrics are:
- ROC AUC: area under the receiver operating characteristic (ROC) curve
- accuracy: the proportion of the data that are predicted correctly
- KAP (kappa): similar to accuracy, but normalized by the accuracy that would be expected by chance alone
- LogLoss: the counterpart of MSE and MAE for predicted probabilities (unlike accuracy, the logarithmic loss takes into account the uncertainty in the prediction)
Check ?metrics for details.
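To make these four metrics concrete, here is a small sketch that computes them with `metrics()` on six hypothetical predictions and then reproduces three of them by hand. The probabilities in `toy_pred` are made up, and the 0.5 cut-off used to derive `.pred_class` is an assumption for the example.

```r
library(tidymodels)

# Hypothetical predictions: "TRUE" is the first factor level, i.e. the event
toy_pred <- tibble(
  host_superhost = factor(c("TRUE", "TRUE", "TRUE", "FALSE", "FALSE", "FALSE"),
                          levels = c("TRUE", "FALSE")),
  .pred_TRUE = c(0.9, 0.8, 0.4, 0.6, 0.2, 0.1),
  .pred_class = factor(ifelse(.pred_TRUE > 0.5, "TRUE", "FALSE"),
                       levels = c("TRUE", "FALSE"))
)

toy_metrics <- metrics(toy_pred, truth = host_superhost,
                       estimate = .pred_class, .pred_TRUE)
toy_metrics

# The same quantities by hand
is_true   <- toy_pred$host_superhost == "TRUE"
pred_true <- toy_pred$.pred_class == "TRUE"

# accuracy: share of correct class predictions
acc_manual <- mean(is_true == pred_true)

# kappa: accuracy corrected for the agreement expected by chance
p_chance   <- mean(is_true) * mean(pred_true) + mean(!is_true) * mean(!pred_true)
kap_manual <- (acc_manual - p_chance) / (1 - p_chance)

# log loss: penalizes low probability assigned to the observed class
p_observed     <- ifelse(is_true, toy_pred$.pred_TRUE, 1 - toy_pred$.pred_TRUE)
logloss_manual <- -mean(log(p_observed))

c(accuracy = acc_manual, kap = kap_manual, mn_log_loss = logloss_manual)
```

Note that accuracy and kappa only use the predicted class, while log loss and ROC AUC use the predicted probability, which is why `metrics()` needs both `estimate` and `.pred_TRUE`.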
Additional important definitions:
- sensitivity: of the truly positive cases, the proportion classified as positive
- specificity: of the truly negative cases, the proportion classified as negative
- overlap: the proportion of cases the model fails to predict correctly
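These definitions can be checked on a small confusion matrix. The sketch below uses ten hypothetical class predictions (made up for illustration) and the yardstick functions `conf_mat()`, `sensitivity()`, `specificity()` and `accuracy()`. The last definition in the list, the share of incorrect predictions, is computed here as one minus the accuracy.

```r
library(tidymodels)

# Hypothetical predictions: 4 true superhosts and 6 non-superhosts
toy_pred <- tibble(
  host_superhost = factor(rep(c("TRUE", "FALSE"), times = c(4, 6)),
                          levels = c("TRUE", "FALSE")),
  .pred_class = factor(c("TRUE", "TRUE", "TRUE", "FALSE",
                         "TRUE", "FALSE", "FALSE", "FALSE", "FALSE", "FALSE"),
                       levels = c("TRUE", "FALSE"))
)

# Cross-tabulation of predicted versus observed classes
conf_mat(toy_pred, truth = host_superhost, estimate = .pred_class)

# Sensitivity: correctly predicted positives / all true positives
toy_sens <- sensitivity(toy_pred, truth = host_superhost, estimate = .pred_class)

# Specificity: correctly predicted negatives / all true negatives
toy_spec <- specificity(toy_pred, truth = host_superhost, estimate = .pred_class)

# Share of wrong predictions, as the complement of accuracy
toy_acc    <- accuracy(toy_pred, truth = host_superhost, estimate = .pred_class)
error_rate <- 1 - toy_acc$.estimate

c(sensitivity = toy_sens$.estimate,
  specificity = toy_spec$.estimate,
  error_rate  = error_rate)
```

Both functions treat the first factor level as the positive class by default, which is why `host_superhost` is releveled to `c(TRUE, FALSE)` at the start of the post.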