Machine Learning with R-tidymodels: overview
introduction
I’ve just attended a very good workshop on Machine Learning organized by the Rbootcamp. The two instructors, Dirk Wulff and Markus Steiner, proved very knowledgeable in R and kept the audience engaged throughout the course.
The course started with a short review of ML and then quickly dived into the practical details. Most explanations and exercises were based on applying the R framework tidymodels to predict house prices in a case study built on Airbnb data. Specifically, these covered Supervised Learning approaches with examples of the topics below:
- regression: linear regression, decision tree, random forest
- classification: logistic regression, decision tree, random forest
- model assessment on training and test datasets
- regression metrics: rmse, rsq, mae
- classification metrics: accuracy, kappa, log loss, roc auc
- plotting: regression, trees, ROC curve
For clear explanations and details on the concepts, it is worth going through the excellent book An Introduction to Statistical Learning by Gareth James et al. (2021).
workflow
The tidymodels framework helps a lot in structuring the work. On this basis, I’ve prepared for myself a pipeline with the following steps:
sample > recipe > model > workflow > tune > fit > predict > metrics > plot
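As a minimal sketch of how these steps chain together (using the built-in mtcars dataset as a stand-in, since the workshop’s Airbnb data is not reproduced here), the pipeline maps onto code roughly like this:

```r
library(tidymodels)

set.seed(123)

# sample: split the data into training and test sets
split <- initial_split(mtcars, prop = 0.8)
train <- training(split)
test  <- testing(split)

# recipe: declare the outcome, predictors and preprocessing
rec <- recipe(mpg ~ ., data = train)

# model: a linear regression with the lm engine
mod <- linear_reg() %>% set_engine("lm")

# workflow: bundle the recipe and the model
wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(mod)

# fit on training data, predict on test data, compute metrics
fitted <- fit(wf, data = train)
preds  <- predict(fitted, new_data = test) %>% bind_cols(test)
metrics(preds, truth = mpg, estimate = .pred)
```

This is a sketch, not the workshop’s actual code, but the same sequence applies regardless of dataset or model.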
packages and functions
This pipeline can be further detailed by listing the associated packages and functions; a short worked example combining the resampling functions follows the list:
- sample
  - {rsample}
    - initial_split()
    - training()
    - testing()
    - vfold_cv()
    - bootstraps()
- recipe
  - {recipes}
    - recipe()
    - step_dummy()
- model
  - {parsnip}
    - linear_reg()
    - rand_forest()
    - set_engine()
    - show_engines()
    - set_mode()
    - translate()
- workflow
  - {workflows}
    - workflow()
    - add_recipe()
    - add_model()
- tune
  - {dials}
    - grid_regular()
    - mixture()
    - penalty()
  - {tune}
    - tune_grid()
    - fit_resamples()
    - collect_metrics()
    - select_best()
    - finalize_workflow()
- fit
  - {parsnip}
    - fit()
  - {tune}
    - last_fit()
- predict
  - {stats}
    - predict()
- metrics
  - {yardstick}
    - metrics()
  - {broom}
    - tidy()
- plot
  - {ggplot2}
    - ggplot()
    - autoplot()
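To see how the resampling functions fit in, here is a hedged continuation of the sketch above (reusing the split, train and wf objects defined there): cross-validate the workflow with {rsample} and {tune}, then run last_fit() for a single training/test evaluation.

```r
set.seed(123)

# 5-fold cross-validation on the training data
folds <- vfold_cv(train, v = 5)

# assess the workflow across the resamples
res <- fit_resamples(wf, resamples = folds)
collect_metrics(res)

# last_fit(): fit once on the training set and
# evaluate once on the held-out test set of the split
final <- last_fit(wf, split)
collect_metrics(final)
collect_predictions(final)
```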
model tuning
A very important phase is of course model tuning, which can be done with a wide variety of models, each with its own engines and tuning parameters. A first summary is given in the table below; a tuning sketch follows the definitions.
| model | function | engine | tuning parameters |
|---|---|---|---|
| linear regression | linear_reg | lm | mixture, penalty |
| ridge regression | linear_reg | glmnet | mixture, penalty |
| lasso regression | linear_reg | glmnet | mixture, penalty |
| decision tree | decision_tree | rpart | cost_complexity |
| random forest | rand_forest | ranger | mtry |
definitions:
- penalty corresponds to the regularization strength lambda; mixture controls the type of penalty (0 = ridge, 1 = lasso)
- mtry is the number of predictors sampled at each split; a small mtry gives a more diverse forest, a large mtry a more similar one
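As an illustration (a sketch continuing the example above, reusing the rec and folds objects and assuming the {glmnet} package is installed), tuning an elastic net over penalty and mixture looks like this:

```r
# tunable specification: penalty (lambda) and mixture left open
tune_mod <- linear_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet")

tune_wf <- workflow() %>%
  add_recipe(rec) %>%
  add_model(tune_mod)

# regular grid over both parameters, built with {dials}
grid <- grid_regular(penalty(), mixture(), levels = 5)

# evaluate every grid point on the cross-validation folds
tuned <- tune_grid(tune_wf, resamples = folds, grid = grid)
autoplot(tuned)  # visualize metrics across the grid

# lock the best combination (by RMSE) into the workflow
best <- select_best(tuned, metric = "rmse")
final_wf <- finalize_workflow(tune_wf, best)
```

The finalized workflow can then be passed to last_fit() as shown earlier for a final evaluation on the test set.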
next steps
It may seem like a lot of functions to learn and memorize, but it all falls into place quickly because the sequence is very logical. I’m now rather convinced of the framework’s convenience: it provides a uniform interface to feed parameters to the models and to extract metrics and predictions. It also makes it very easy to test different models and parameters on the same dataset by simply modifying, copying or updating model objects.
In coming articles I will explore some of the case studies presented and concrete applications of these packages and functions, discussing potential uses in product development and manufacturing.
references
James, G., Witten, D., Hastie, T., & Tibshirani, R. (2021). An Introduction to Statistical Learning: with Applications in R (2nd ed.). Springer.