Python and R: combined workflows
Objective
Plenty of articles discuss when to choose Python or R for data science. Here I summarize how I use, and plan to use, both languages together.
Data collection
As usual, a clear picture of the situation starts with some data. I started with R on DataCamp in mid-2018 and estimate that I have been coding professionally in R for at least 1/3 of my time since 2020. As my working year is 200 days x 8 h = 1’600 h/year, this makes approximately 500 h/year over 3 years, i.e. about 1’500 h.
Status end 2022:
89h - DataCamp Data Scientist with R
650h - EPFL data science with R
1500h - on the job in R
2239h - total in R
66h - DataCamp Data Analyst with Python
66h - total in Python
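The arithmetic behind this tally can be checked with a few lines of Python. This is only a sketch: the figures are copied from the list above, and the variable names are hypothetical, chosen for illustration.

```python
# Hours per activity, copied from the status list (end of 2022).
r_hours = {
    "DataCamp, data scientist with R": 89,
    "EPFL, data science with R": 650,
    "on the job in R": 1500,
}
python_hours = {"DataCamp, data analyst with Python": 66}

# Working-year estimate from the text: 200 days of 8 hours each.
hours_per_year = 200 * 8

# One third of a working year; the text rounds this down to about 500 h.
r_hours_per_year = hours_per_year / 3

total_r = sum(r_hours.values())
total_python = sum(python_hours.values())
python_share = total_python / (total_r + total_python)

print(f"Working year: {hours_per_year} h, of which R: ~{r_hours_per_year:.0f} h")
print(f"Total R: {total_r} h, total Python: {total_python} h")
print(f"Python share of all logged hours: {python_share:.1%}")
```

The totals match the list: 2239 h in R against 66 h in Python.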
Status
The number of hours in R is still much higher than in Python, but this doesn’t tell the whole story. I’ve been learning and using other languages such as Bash, JavaScript and some C++, and have read up on the history of programming languages. I’m now familiar with many programming concepts, which makes learning Python much easier. The experience is not new to me either: I learned Spanish and French after Portuguese, and some German after English, and each time it felt like an extension of previous knowledge, the underlying structures being the same.
Now, the more I learn Python, the more I tend (or try) to see R as a domain-specific language for statistics and Python as a general-purpose language. There is a very large overlap though, and it is hard to draw the border, as R provides all sorts of tools for OS handling, web applications and nearly everything else one can think of. I’ve been experimenting with tidymodels and other R machine learning packages but am not entirely satisfied.
Next steps
As I now move deeper into the machine learning domain, also professionally, I’ve decided to make Python my first language.
This means that for typical data science tasks like loading data, wrangling dataframes and so on, I will be using pandas and Matplotlib instead of the tidyverse. I don’t expect the transition to take much effort: the concepts are the same, and since the syntax is hard to memorize I keep referring to the tidyverse cheatsheets anyway (GitHub Copilot may one day change this). R will remain useful for specific statistics tasks such as analysing designs of experiments.
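To illustrate how closely the concepts map, here is a minimal sketch of a filter, group and summarise step in pandas, with a rough dplyr equivalent in the comments. The data frame and its column names (`machine`, `diameter_mm`) are invented for illustration and are not taken from any real project.

```python
import pandas as pd

# Hypothetical measurements, invented for illustration only.
df = pd.DataFrame(
    {
        "machine": ["A", "A", "B", "B", "B"],
        "diameter_mm": [10.1, 9.9, 10.4, 10.6, 10.5],
    }
)

# Rough dplyr equivalent:
#   df |> filter(diameter_mm > 10) |> group_by(machine) |>
#     summarise(mean_diameter = mean(diameter_mm), n = n())
summary = (
    df[df["diameter_mm"] > 10]           # filter rows
    .groupby("machine", as_index=False)  # group by machine
    .agg(                                # named aggregations
        mean_diameter=("diameter_mm", "mean"),
        n=("diameter_mm", "size"),
    )
)

print(summary)
```

If Matplotlib is installed, a call such as `summary.plot.bar(x="machine", y="mean_diameter")` should give a quick bar chart of the result, roughly where ggplot2 would come in on the R side.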
In this sense I expect Python to become my primary programming language. With Posit’s Quarto notebooks I have a working tool in which I can freely combine both. I see my workflow moving more and more towards doing most things in Python and reserving R for some advanced statistical analysis.
Summary
Below is a short summary of my workflow domains, current and future (a star marks my selected packages / approaches):
Domain | R | Python | Comments |
---|---|---|---|
Statistics | *stats | statsmodels | Adoption of Python under evaluation |
Modelling | *stats, lme4 | statsmodels | Adoption of Python under evaluation |
MSA | *SixSigma | - | May remain in R as very specific |
DoE | *FrF2 | - | May remain in R as very specific |
Process Capability | Custom functions in R | - | Adoption of Python only after Statistics |
Machine Learning | tidymodels, mlr3 | *scikit-learn | Initially selected tidymodels but still not satisfied |
Data science | tidyverse, ggplot2 | *numpy, pandas, matplotlib, seaborn | Python becomes inevitable with the choice of scikit-learn |
Dashboards | *shiny | flask | Strong investment in Shiny already. Consider embedding Python in the back end as needed, if ever. |
Text | stringr | *str, builtins | |
NLP, Cartography, APIs | - | - | If ever needed, will use the team’s language |
OS | - | - | Linux directly |
References
https://www.oreilly.com/library/view/python-and-r/9781492093398/