ML with scikit-learn: end-to-end machine learning project
machine learning
Published
January 28, 2023
Introduction
This post combines lessons from several of the previous articles. It shows how to edit a Jupyter notebook in JupyterLab and then publish it online with Quarto. It also confirms the configuration of conda environments within Quarto posts and the use of a specific ipykernel located in the conda environment. Finally, it brings together the full end-to-end machine learning workflow, using the open-source notebook by Aurélien Géron from his book Hands-On Machine Learning.
!which python
%conda info
%pwd
/home/joao/JR-IA/renv/python/condaenvs/renv-python/bin/python
active environment : /home/joao/JR-IA/renv/python/condaenvs/renv-python
active env location : /home/joao/JR-IA/renv/python/condaenvs/renv-python
shell level : 2
user config file : /home/joao/.condarc
populated config files :
conda version : 23.1.0
conda-build version : not installed
python version : 3.9.12.final.0
virtual packages : __archspec=1=x86_64
__glibc=2.31=0
__linux=5.4.0=0
__unix=0=0
base environment : /home/joao/.local/share/r-miniconda (writable)
conda av data dir : /home/joao/.local/share/r-miniconda/etc/conda
conda av metadata url : None
channel URLs : https://repo.anaconda.com/pkgs/main/linux-64
https://repo.anaconda.com/pkgs/main/noarch
https://repo.anaconda.com/pkgs/r/linux-64
https://repo.anaconda.com/pkgs/r/noarch
package cache : /home/joao/.local/share/r-miniconda/pkgs
/home/joao/.conda/pkgs
envs directories : /home/joao/.local/share/r-miniconda/envs
/home/joao/.conda/envs
platform : linux-64
user-agent : conda/23.1.0 requests/2.28.1 CPython/3.9.12 Linux/5.4.0-137-generic ubuntu/20.04.5 glibc/2.31
UID:GID : 1000:1000
netrc file : /home/joao/.netrc
offline mode : False
Note: you may need to restart the kernel to use updated packages.
'/home/joao/JR-IA/posts/20230128'
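The same check can also be done from plain Python, without shell or IPython magics. The following is a minimal sketch, not part of the original notebook: the helper names `describe_python_environment` and `kernel_uses_environment` are hypothetical, and the check assumes that `sys.prefix` points at the environment directory, as it does for conda environments.

```python
import os
import sys
from pathlib import Path


def describe_python_environment():
    """Collect a few facts identifying the interpreter behind the current kernel."""
    return {
        "executable": sys.executable,
        "prefix": sys.prefix,
        "version": ".".join(str(part) for part in sys.version_info[:3]),
        # Set by `conda activate`; may be None if the kernel was started another way.
        "conda_prefix": os.environ.get("CONDA_PREFIX"),
    }


def kernel_uses_environment(env_path):
    """Return True if the running interpreter's prefix is the given environment directory."""
    return Path(sys.prefix).resolve() == Path(env_path).resolve()


for key, value in describe_python_environment().items():
    print(f"{key:>12} : {value}")
```

Calling `kernel_uses_environment` with the environment path reported by `conda info` should return `True` when the notebook runs on the intended ipykernel.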
import numpy as np
import pandas as pd
# Plotting setup
import matplotlib as mpl
import matplotlib.pyplot as plt
%matplotlib inline
mpl.rc('axes', labelsize=14)
mpl.rc('xtick', labelsize=12)
mpl.rc('ytick', labelsize=12)
Data
Excerpt from the book readme file: “This dataset appeared in a 1997 paper titled Sparse Spatial Autoregressions by Pace, R. Kelley and Ronald Barry, published in the Statistics and Probability Letters journal. They built it using the 1990 California census data. It contains one row per census block group. A block group is the smallest geographical unit for which the U.S. Census Bureau publishes sample data (a block group typically has a population of 600 to 3,000 people).”
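Since the dataset has one row per census block group, a first sanity check after loading is to count the rows and look for missing values. The sketch below assumes the data has already been saved locally as a CSV file. The path `datasets/housing/housing.csv` and the helper names are assumptions for illustration, not taken from this post.

```python
from pathlib import Path

import pandas as pd

# Hypothetical location of the CSV; adjust to wherever the file was saved.
HOUSING_CSV = Path("datasets") / "housing" / "housing.csv"


def load_housing_data(csv_path=HOUSING_CSV):
    """Read the housing CSV into a DataFrame, one row per census block group."""
    return pd.read_csv(csv_path)


def summarize_block_groups(housing):
    """Return the row count and the number of missing values per column."""
    return {
        "n_block_groups": len(housing),
        "missing_per_column": housing.isna().sum().to_dict(),
    }


if HOUSING_CSV.is_file():
    housing = load_housing_data()
    print(summarize_block_groups(housing))
else:
    print(f"{HOUSING_CSV} not found; download the dataset first.")
```

Keeping the load step in a small function makes it easy to re-run the notebook against a fresh copy of the file.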