Software and programming languages for industrial statistics
As with most things in life, there is a continuum of approaches and tools for treating data in an industrial context. In my professional life I have seen, and still see, people using everything from pen and paper to neural networks programmed in the cloud. Colleagues with such different practices may well sit next door to each other and argue over their findings on “real world” events without ever discussing or sharing their tools. The typical tasks in data science can be summarised as below:
Task | Examples of approaches | Examples of tools and languages |
---|---|---|
file handling | operating system | Linux Bash, Windows Explorer |
data manipulation | flat files, databases | CSV, SQLite, Excel, text files |
statistics, data science | spreadsheets, commercial software, programming languages | Excel, Minitab, SAS, SPSS, R, Python |
communication of results | presentations, dashboards, documents | PowerPoint, Power BI, Shiny, LaTeX |
As expected, a varied toolset is available for each typical task, with different trade-offs between learning effort and power. The question is which learning path is realistic and useful for each individual according to their needs. Let’s look in a bit more detail at the trade-offs at each level and at how a regular user of base tools such as Windows can progress to more powerful ones.
File handling: most people remain at the level of Windows Explorer, SharePoint or a similar file navigation tool and are not aware of other possibilities. The terminal is nevertheless making a return with younger generations who work with data pipelines and use the command line to move files around, especially with Git and GitHub. For someone who is not into coding, the “explorer” level is fine.
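As a small illustration of what scripted file handling can look like, here is a minimal Python sketch that sorts the files of a folder into subfolders by extension. The function name `sort_by_extension`, the "other" folder and the file names are all assumptions made up for this example; it only uses the standard library and works on a temporary folder.

```python
import shutil
import tempfile
from pathlib import Path


def sort_by_extension(folder: Path) -> dict:
    """Move each file in `folder` into a subfolder named after its extension."""
    moved = {}
    # sorted() lists the folder content once, before any subfolder is created.
    for item in sorted(folder.iterdir()):
        if not item.is_file():
            continue
        # Files without an extension go to a hypothetical "other" folder.
        subfolder = folder / (item.suffix.lstrip(".").lower() or "other")
        subfolder.mkdir(exist_ok=True)
        shutil.move(str(item), str(subfolder / item.name))
        moved[item.name] = subfolder.name
    return moved


# Hypothetical files created in a temporary folder for the demo.
demo = Path(tempfile.mkdtemp())
for name in ("trial1.csv", "trial2.csv", "report.pdf", "notes"):
    (demo / name).write_text("placeholder")

result = sort_by_extension(demo)
print(result)
```

The same job can of course be done by hand in a file explorer; the script only starts to pay off when the folder holds hundreds of files or when the task repeats every week.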
Data manipulation: here again Windows dominates the software toolset, and most people use Excel to copy and paste chunks of data back and forth. An alternative is to use R or Python, which can access data sources on the internet directly and make it easy to manipulate larger data sets. I believe there is room here to install a programming IDE such as RStudio or Spyder and guide the user through some initial tasks such as loading, filtering and saving back data. This is fully compatible with the typical Excel tasks and there’s no redundancy. It further encourages best practices in Excel file handling, such as avoiding merged cells and using a tabular format.
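The initial tasks mentioned above (loading, filtering and saving back data) could look roughly like the following Python sketch. It uses only the standard `csv` module; the file names, the `viscosity` column and the threshold are invented for illustration, and in practice many users would reach for a data frame library instead.

```python
import csv
import tempfile
from pathlib import Path


def filter_rows(source: Path, target: Path, column: str, minimum: float) -> int:
    """Copy rows whose `column` value is at least `minimum`; return rows kept."""
    with source.open(newline="") as src, target.open("w", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        kept = 0
        for row in reader:
            if float(row[column]) >= minimum:
                writer.writerow(row)
                kept += 1
    return kept


# Hypothetical measurements, written to a temporary folder for the demo.
workdir = Path(tempfile.mkdtemp())
raw = workdir / "measurements.csv"
raw.write_text("batch,viscosity\nA,101.5\nB,98.2\nC,104.9\n")

filtered = workdir / "measurements_filtered.csv"
kept_rows = filter_rows(raw, filtered, column="viscosity", minimum=100.0)
print(kept_rows, "rows kept in", filtered.name)
```

Note that the sketch relies on the data being in a plain tabular layout with one header row, which is exactly the Excel habit recommended above.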
Statistics: many technicians and engineers had statistical training at school and can perform some statistical tests or prepare a simple model for prediction. Commercial software such as Minitab also greatly facilitates these tasks. A user who has taken some steps in data manipulation could progressively transfer their statistical testing skills to R or Python too. The effort is greater than for data manipulation and visualization alone, because the statistical tools are less friendly and require a deeper understanding of the mathematical concepts in order to provide the right arguments to the functions.
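To give an idea of the concepts hiding behind a single menu click, here is a minimal sketch of a two-sample comparison written with Python's standard library only. It computes the Welch t statistic and its approximate degrees of freedom; the function name `welch_t` and the fill weights are assumptions for illustration, and a statistics package would normally also return the p-value.

```python
import math
import statistics


def welch_t(sample_a, sample_b):
    """Return the Welch t statistic and its approximate degrees of freedom."""
    mean_a, mean_b = statistics.mean(sample_a), statistics.mean(sample_b)
    # Squared standard error of each mean: sample variance over sample size.
    se2_a = statistics.variance(sample_a) / len(sample_a)
    se2_b = statistics.variance(sample_b) / len(sample_b)
    t = (mean_a - mean_b) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation of the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (len(sample_a) - 1) + se2_b ** 2 / (len(sample_b) - 1)
    )
    return t, df


# Hypothetical fill weights (grams) from two production lines.
line_1 = [50.1, 49.8, 50.3, 50.0, 49.9]
line_2 = [50.6, 50.4, 50.9, 50.5, 50.7]

t_stat, dof = welch_t(line_1, line_2)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

Even in this short sketch the user has to decide whether the variances are assumed equal, which sample comes first and what the sign of the statistic means, which are the kinds of choices that ready-made software often makes silently.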
Communication: PowerPoint remains the tool of choice in corporate environments and training classes for presenting content in an easy-to-follow format, but dashboards are becoming accessible to the general public with Power BI and Tableau. Custom-made solutions can be built with software such as R Shiny and Python Dash, and professional-quality documents can be prepared with R Markdown and LaTeX.
For those who are willing to dedicate a bit of their lives to programming, a unique world opens up: a world where information is manipulated at another scale, far beyond what can be done with the apps we have on a phone. In professional life, programming even at an entry level opens up the possibility of developing tailor-made dashboards and of automating data manipulation tasks at a scale that is hard to imagine with manual work, saving endless hours of copy and paste. Programming also gives access to new areas such as descriptive statistics for big and complex data (e.g. spatial, geographic, web scraping), and to advanced modeling and prediction. As we can see in the book Concepts of Programming Languages by R. W. Sebesta (2016), the number of programming languages is very large, each with different possibilities and complexities. I’m just giving here a short list that we can explore in a future article:
C, C++, Bash, SQL, Haskell, Perl, Ruby, Prolog, MATLAB, Octave, R, Python, Julia, HTML, CSS, JavaScript
References