Machine Learning with R-tidymodels: overview
introduction
I’ve just attended a very good workshop on Machine Learning organized by the Rbootcamp. The two instructors, Dirk Wulff and Markus Steiner, proved very knowledgeable in R and kept the audience engaged throughout the course.
The course started with a short review of ML and then quickly dived into the practical details. Most explanations and exercises were based on applying the R framework tidymodels to predict house prices in a case study built on Airbnb data. Specifically, these covered Supervised Learning approaches with examples of the topics below:
- regression: linear regression, decision tree, random forest
- classification: logistic regression, decision tree, random forest
- model assessment on training and test datasets
- regression metrics: rmse, rsq, mae
- classification metrics: accuracy, kappa, log loss, roc auc
- plotting: regression, trees, ROC curve
For clear explanations and details on the concepts, it is worth going through the excellent book An Introduction to Statistical Learning by Gareth James et al. (2021).
workflow
The tidymodels framework helps a lot in structuring the work. On this basis, I’ve prepared for myself a pipeline with the following steps:
sample > recipe > model > workflow > tune > fit > predict > metrics > plot
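As a minimal sketch of how these steps chain together (using the built-in mtcars dataset as a stand-in, since the workshop’s Airbnb data is not reproduced here), the pipeline maps onto code roughly like this:

```r
library(tidymodels)

set.seed(123)

# sample: split the data into training and test sets
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# recipe: declare the outcome, predictors and preprocessing
rec <- recipe(mpg ~ ., data = train)

# model: a linear regression with the lm engine
mod <- linear_reg() %>% set_engine("lm")

# workflow: bundle the recipe and the model
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

# fit on training data, predict on test data, compute metrics
fitted <- fit(wf, data = train)
preds  <- predict(fitted, new_data = test) %>% bind_cols(test)
metrics(preds, truth = mpg, estimate = .pred)
```

This is a sketch, not the workshop’s actual code, but the same sequence applies regardless of dataset or model.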
packages and functions
This pipeline can be further detailed by listing the associated packages and functions; a short worked example combining the resampling functions follows the list:
- sample
  - {rsample}
    - initial_split()
    - training()
    - testing()
    - vfold_cv()
    - bootstraps()
- recipe
  - {recipes}
    - recipe()
    - step_dummy()
- model
  - {parsnip}
    - linear_reg()
    - rand_forest()
    - set_engine()
    - show_engines()
    - set_mode()
    - translate()
- workflow
  - {workflows}
    - workflow()
    - add_recipe()
    - add_model()
- tune
  - {dials}
    - grid_regular()
    - mixture()
    - penalty()
  - {tune}
    - tune_grid()
    - fit_resamples()
    - collect_metrics()
    - select_best()
    - finalize_workflow()
- fit
  - {parsnip}
    - fit()
  - {tune}
    - last_fit()
- predict
  - {stats}
    - predict()
- metrics
  - {yardstick}
    - metrics()
  - {broom}
    - tidy()
- plot
  - {ggplot2}
    - ggplot()
    - autoplot()
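To see how the resampling functions fit in, here is a hedged continuation of the sketch above (reusing the split, train and wf objects defined there): cross-validate the workflow with {rsample} and {tune}, then run last_fit() for a single training/test evaluation.

```r
set.seed(123)

# 5-fold cross-validation on the training data
folds <- vfold_cv(train, v = 5)

# assess the workflow across the resamples
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)

# last_fit(): fit once on the training set and
# evaluate once on the held-out test set of the split
final <- last_fit(wf, split)
collect_metrics(final)
collect_predictions(final)
```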
model tuning
A very important phase is of course model tuning, which can be done with a wide variety of models, each with its own engines and tuning parameters. A first summary is given in the table below; a tuning sketch follows the definitions.
| model | function | engine | tuning parameters |
|---|---|---|---|
| linear regression | linear_reg | lm | mixture, penalty |
| ridge regression | linear_reg | glmnet | mixture, penalty |
| lasso regression | linear_reg | glmnet | mixture, penalty |
| decision tree | decision_tree | rpart | cost_complexity |
| random forest | rand_forest | ranger | mtry |
definitions:
- penalty corresponds to the regularization strength lambda; mixture controls the type of penalty (0 = ridge, 1 = lasso)
- mtry is the number of predictors sampled at each split; a small mtry gives a more diverse forest, a large mtry a more similar one
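As an illustration (a sketch continuing the example above, reusing the rec and folds objects and assuming the {glmnet} package is installed), tuning an elastic net over penalty and mixture looks like this:

```r
# tunable specification: penalty (lambda) and mixture left open
tune_mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

tune_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tune_mod)

# regular grid over both parameters, built with {dials}
grid <- grid_regular(penalty(), mixture(), levels = 5)

# evaluate every grid point on the cross-validation folds
tuned <- tune_grid(tune_wf, resamples = folds, grid = grid)
autoplot(tuned)  # visualize metrics across the grid

# lock the best combination (by RMSE) into the workflow
best <- select_best(tuned, metric = "rmse")
final_wf <- finalize_workflow(tune_wf, best)
```

The finalized workflow can then be passed to last_fit() as shown earlier for a final evaluation on the test set.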
next steps
It may seem like a lot of functions to learn and memorize, but it all falls into place quickly because the sequence is very logical. I’m now rather convinced of the framework’s convenience: it provides a uniform interface to feed parameters to the models and to extract metrics and predictions. It also makes it very easy to test different models and parameters on the same dataset by simply modifying, copying or updating model objects.
In coming articles I will explore some of the case studies presented and concrete applications of these packages and functions, discussing potential uses in product development and manufacturing.
references
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer.