Machine Learning with R-tidymodels: regression
In a previous article I presented the tidymodels framework and the pipeline I've adopted to apply predictive models to my datasets.
Below I share an interesting case from the RBootcamp training that follows these steps and applies three different models to a regression problem.
setup
library(tidyverse)
library(tidymodels)
library(rpart.plot)
library(patchwork)
tidymodels_prefer()
linear regression
sample
airbnb <- read_csv(file = "data/airbnb.csv")
Rows: 1191 Columns: 23
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (8): district, host_respons_time, kitchen, tv, coffe_machine, dishwashe...
dbl (14): price, accommodates, bedrooms, bathrooms, cleaning_fee, availabili...
lgl (1): host_superhost
ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
set.seed(123)
airbnb_split <- initial_split(airbnb, prop = 0.8)
airbnb_train <- training(airbnb_split)
airbnb_test <- testing(airbnb_split)
recipe
airbnb_recipe <- recipe(
  formula = price ~ accommodates,
  data = airbnb
)
airbnb_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 1
model
lm_model <-
  linear_reg() %>%
  set_engine("lm") %>%
  set_mode("regression")
translate(lm_model)
Linear Regression Model Specification (regression)
Computational engine: lm
Model fit template:
stats::lm(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
workflow
lm_workflow <-
  workflow() %>%
  add_recipe(airbnb_recipe) %>%
  add_model(lm_model)
lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
0 Recipe Steps
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
fit
price_lm <-
  lm_workflow %>%
  fit(airbnb)
tidy(price_lm)
# A tibble: 2 × 5
term estimate std.error statistic p.value
<chr> <dbl> <dbl> <dbl> <dbl>
1 (Intercept) -13.5 4.04 -3.34 8.60e- 4
2 accommodates 27.6 1.12 24.6 1.15e-108
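tidy() reports the coefficient table; for model-level statistics such as R-squared and sigma, broom's glance() can be applied to the underlying lm object. A minimal sketch, assuming the fitted workflow from above:
price_lm %>%
  extract_fit_engine() %>% # pull out the raw lm object
  glance()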
predict
lm_pred <-
  price_lm %>%
  predict(new_data = airbnb) %>%
  bind_cols(airbnb)
lm_pred
# A tibble: 1,191 × 24
.pred price accommodates bedrooms bathrooms cleaning_fee availability_90_days
<dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 69.2 99 3 1 2 30 3
2 96.7 61 4 1 1 35 0
3 41.6 50 2 2 0.5 0 0
4 41.6 30 2 1 1.5 25 31
5 41.6 60 2 1 1 40 87
6 41.6 45 2 1 1.5 10 0
7 41.6 32 2 0 1 0 14
8 152. 62 6 2 1 30 76
9 41.6 50 2 1 1 0 53
10 41.6 60 2 1 1 0 16
# … with 1,181 more rows, and 17 more variables: district <chr>,
# host_respons_time <chr>, host_response_rate <dbl>, host_superhost <lgl>,
# host_listings_count <dbl>, review_scores_accuracy <dbl>,
# review_scores_cleanliness <dbl>, review_scores_checkin <dbl>,
# review_scores_communication <dbl>, review_scores_location <dbl>,
# review_scores_value <dbl>, kitchen <chr>, tv <chr>, coffe_machine <chr>,
# dishwasher <chr>, terrace <chr>, bathtub <chr>
metrics
metrics_model_lm <- lm_pred %>%
  metrics(truth = price, estimate = .pred)
plot y ~ x
lm_pred %>%
  ggplot(aes(y = price, x = accommodates)) +
  geom_point() +
  geom_smooth(method = "lm")
`geom_smooth()` using formula 'y ~ x'
plot true vs predicted
Here we create a function that generates a true-vs-predicted plot; we will reuse it for each model and combine the plots at the end.
create_model_plot <- function(prediction_data, model_metrics, title_text) {
  annotation_data <- tibble(
    x_position = 100,
    y_position = c(400, 450, 500),
    label_value = str_glue_data(model_metrics, "{.metric}: {round(.estimate, 2)}")
  )
  prediction_data %>%
    ggplot(aes(x = .pred, y = price)) +
    geom_abline(lty = 2) +
    geom_point(alpha = 0.5) +
    geom_text(
      data = annotation_data,
      mapping = aes(x = x_position, y = y_position, label = label_value),
      size = 3
    ) +
    labs(
      title = as.character(title_text),
      caption = "Line = perfect performance",
      x = "Predicted Prices in $",
      y = "True Prices in $"
    ) +
    # Scale and size the x- and y-axis uniformly and fix both to 0-500:
    coord_obs_pred(ratio = 1, xlim = c(0, 500), ylim = c(0, 500)) +
    theme_light()
}
lm_plot <- create_model_plot(lm_pred, metrics_model_lm, "Linear Regression")
linear regression (update)
recipe
airbnb_recipe <- recipe(
  formula = price ~ .,
  data = airbnb_train
) %>%
  step_dummy(all_nominal_predictors())
airbnb_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Dummy variables from all_nominal_predictors()
workflow
lm_workflow <-
  lm_workflow %>%
  update_recipe(airbnb_recipe)
lm_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: linear_reg()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_dummy()
── Model ───────────────────────────────────────────────────────────────────────
Linear Regression Model Specification (regression)
Computational engine: lm
fit
price_lm <-
  lm_workflow %>%
  fit(airbnb_train)
predict
lm_pred <-
  price_lm %>%
  predict(new_data = airbnb_train) %>%
  bind_cols(airbnb_train %>% select(price))
metrics
metrics_model_lm2 <- lm_pred %>%
  metrics(truth = price, estimate = .pred)
plot
lm_plot2 <- create_model_plot(lm_pred, metrics_model_lm2, "Linear Regression 2")
decision tree
recipe
tree_recipe <- recipe(
  formula = price ~ .,
  data = airbnb_train
) %>%
  step_other(all_nominal_predictors(), threshold = 0.005)
tree_recipe
Recipe
Inputs:
role #variables
outcome 1
predictor 22
Operations:
Collapsing factor levels for all_nominal_predictors()
model
dt_model <-
  decision_tree() %>%
  set_engine("rpart") %>%
  set_mode("regression")
dt_model
Decision Tree Model Specification (regression)
Computational engine: rpart
translate(dt_model)
Decision Tree Model Specification (regression)
Computational engine: rpart
Model fit template:
rpart::rpart(formula = missing_arg(), data = missing_arg(), weights = missing_arg())
workflow
dt_workflow <-
  workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(dt_model)
dt_workflow
══ Workflow ════════════════════════════════════════════════════════════════════
Preprocessor: Recipe
Model: decision_tree()
── Preprocessor ────────────────────────────────────────────────────────────────
1 Recipe Step
• step_other()
── Model ───────────────────────────────────────────────────────────────────────
Decision Tree Model Specification (regression)
Computational engine: rpart
fit
price_dt <-
  dt_workflow %>%
  fit(data = airbnb_train)
price_dt %>%
  extract_fit_parsnip() %>%
  pluck("fit")
n= 952
node), split, n, deviance, yval
* denotes terminal node
1) root 952 9279565.0 71.40336
2) accommodates< 11.5 942 1939145.0 65.97665
4) bedrooms< 1.5 759 649554.8 53.89592
8) cleaning_fee< 35.5 589 315195.7 47.13243 *
9) cleaning_fee>=35.5 170 214063.6 77.32941 *
5) bedrooms>=1.5 183 719389.8 116.08200
10) cleaning_fee< 49.5 87 116223.1 88.41379 *
11) cleaning_fee>=49.5 96 476208.7 141.15620 *
3) accommodates>=11.5 10 4699458.0 582.60000 *
tree plot
price_dt %>%
  extract_fit_parsnip() %>%
  pluck("fit") %>%
  rpart.plot(roundint = FALSE) # roundint = FALSE: the original data is not stored in the fit
The number of distinct values a regression tree can predict equals its number of terminal nodes, which is one more than the number of splits. This tree has 4 splits and can therefore predict 5 price buckets.
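This can be checked directly on the fitted rpart object: its frame element marks terminal nodes with "<leaf>". A small sketch, assuming the fit from above:
price_dt %>%
  extract_fit_parsnip() %>%
  pluck("fit", "frame") %>% # the rpart frame, one row per node
  filter(var == "<leaf>") %>% # keep only terminal nodes
  nrow()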
dt_pred <-
  price_dt %>%
  predict(new_data = airbnb_train) %>%
  bind_cols(airbnb_train %>% select(price))
metrics_model_dt <- dt_pred %>%
  metrics(truth = price, estimate = .pred)
dt_plot <- create_model_plot(dt_pred, metrics_model_dt, "Decision tree")
random forest
Random forests are collections of decision trees. Each tree considers only a small random subset of the predictors at each split and is grown without pruning. Once all trees are grown (500 by default), their predictions are averaged.
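These defaults can also be spelled out explicitly through parsnip's arguments. A sketch, mirroring the values the ranger engine reports for this data below (number of trees, predictors tried per split, minimal node size):
rand_forest(trees = 500, mtry = 4, min_n = 5) %>%
  set_engine("ranger") %>%
  set_mode("regression")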
model
rf_model <-
  rand_forest() %>%
  set_engine("ranger", importance = "impurity") %>%
  set_mode("regression")
translate(rf_model)
Random Forest Model Specification (regression)
Engine-Specific Arguments:
importance = impurity
Computational engine: ranger
Model fit template:
ranger::ranger(x = missing_arg(), y = missing_arg(), case.weights = missing_arg(),
importance = "impurity", num.threads = 1, verbose = FALSE,
seed = sample.int(10^5, 1))
workflow
rf_workflow <-
  workflow() %>%
  add_recipe(tree_recipe) %>%
  add_model(rf_model)
fit
price_rf <-
  rf_workflow %>%
  fit(data = airbnb_train)
price_rf %>%
  extract_fit_parsnip() %>%
  pluck("fit")
Ranger result
Call:
ranger::ranger(x = maybe_data_frame(x), y = y, importance = ~"impurity", num.threads = 1, verbose = FALSE, seed = sample.int(10^5, 1))
Type: Regression
Number of trees: 500
Sample size: 952
Number of independent variables: 22
Mtry: 4
Target node size: 5
Variable importance mode: impurity
Splitrule: variance
OOB prediction error (MSE): 5200.677
R squared (OOB): 0.4670177
predict
rf_pred <-
  price_rf %>%
  predict(new_data = airbnb_train) %>%
  bind_cols(airbnb_train %>% select(price))
metrics
metrics_model_rf <- rf_pred %>%
  metrics(truth = price, estimate = .pred)
plot
rf_plot <- create_model_plot(rf_pred, metrics_model_rf, "Random forest")
metrics overview
In supervised models there is a trade-off between transparency and performance. As this example shows, the random forest achieves the best performance, but at the cost of transparency: linear regressions and decision trees are easy to inspect, while random forests are not.
(lm_plot + lm_plot2) / (dt_plot + rf_plot)
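Some transparency can be recovered from the forest through variable importance, which the importance = "impurity" engine argument enabled above. A minimal sketch:
price_rf %>%
  extract_fit_engine() %>% # the underlying ranger object
  ranger::importance() %>% # named vector of impurity-based importances
  sort(decreasing = TRUE) %>%
  head()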
The metrics are usually given by loss functions that summarize the errors the model makes. Typical metrics in regression models are:
- Mean Squared Error (MSE): average squared distance between predictions and true values
- Mean Absolute Error (MAE): average absolute distance between predictions and true values
- R-squared: the proportion of the outcome's variance explained by the model (one minus the ratio of residual variance to outcome variance)
Because the distances are squared, the MSE weights points that are further away more strongly.
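Note that yardstick's metrics() defaults to RMSE (the square root of the MSE), R-squared and MAE; an explicit metric set makes this choice visible. A sketch, reusing the predictions from above:
reg_metrics <- metric_set(rmse, rsq, mae)
reg_metrics(lm_pred, truth = price, estimate = .pred)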
Note that performance measured on training data is generally deceiving: it can always be improved by a more sophisticated model, and if the model is too complex it will start fitting the noise in the data. In this example we measured performance on the training data; in practice this should be done on the test data obtained from the initial split.
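A minimal sketch of such a test-set evaluation, reusing the random forest and the airbnb_test split created at the start:
price_rf %>%
  predict(new_data = airbnb_test) %>% # predict on unseen data
  bind_cols(airbnb_test %>% select(price)) %>%
  metrics(truth = price, estimate = .pred)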
Variable intercorrelation has not been assessed here. It is less of a concern in machine learning than in statistics, where it inflates the standard errors and makes it harder to assess the significance of each factor.
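For completeness, a quick sketch of how the pairwise correlations among the numeric predictors could be inspected:
airbnb_train %>%
  select(where(is.numeric)) %>%
  cor(use = "pairwise.complete.obs") %>% # handle missing values pairwise
  round(2)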