Project: California Housing Prices



Housing Data

This notebook example is from the book Hands-On Machine Learning by Aurélien Géron.

It is a great, quick overview of how a Machine Learning algorithm can be built and leveraged in real-life business applications with only a couple dozen lines of code in a Jupyter notebook.

Load the data

In this example the data is a quick and easy load from GitHub into the Jupyter notebook. In later examples I will incorporate loading from S3, databases, and APIs, but the goal here is just to get familiar with the Machine Learning toolkit.

import pandas as pd
import numpy as np
HOUSING_PATH = "https://raw.githubusercontent.com/ageron/handson-ml2/master/datasets/housing/housing.csv"

def load_housing_data(housing_path=HOUSING_PATH):
    return pd.read_csv(housing_path)

Quick Data Exploration

Now that the data is loaded into the notebook, we can start to explore it and learn as much as we can about its data types and structure. We should keep a couple of things in mind about the data.

Size of the data

Is the dataset we are using large enough to develop a machine learning algorithm that applies to the real world? If the dataset is too small we risk only fitting the model to a small subset of information without truly representing the population data.

On the other hand, if the dataset is too large, we need to start worrying about whether the machine we are developing on can process that amount of data, and we could potentially need to move into the Data Engineering space and distribute our data over several nodes in a cluster. Don’t worry if this doesn’t make sense to you right now, as I will focus on Data Engineering in detail in other posts. If you are interested in distributed computing with Python now, you can research Dask or Spark (PySpark) to get a better understanding; a quick sizing check is sketched below.
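As a minimal sketch of how you might check whether the data comfortably fits on one machine (the exact approach is an illustrative assumption, not part of the original notebook):

## Rough size check, using the loader defined above
df = load_housing_data()
print(df.shape)                                  # (number of rows, number of columns)
print(df.memory_usage(deep=True).sum() / 1e6)    # approximate in-memory size in MB

## If the data were far too large for one machine, a distributed reader such as Dask could
## be swapped in (assuming dask and its fsspec/aiohttp HTTP dependencies are installed):
# import dask.dataframe as dd
# df = dd.read_csv(HOUSING_PATH)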

Quality of the Data

Is the data from a reputable and authoritative data source, and is it complete and accurate? This is one of the major issues everyone will run into when doing Exploratory Data Analysis (EDA), even at major companies with mature data pipelines. It is best to make sure the data passes a “sniff” test. For example, if there is a date field, are there only dates in it? Does a population field only contain integers? Keep these questions in mind when exploring the datasets. A quick way to check a couple of records is the .head() function, which shows the top few rows (a couple of quick programmatic checks follow the output below).

housing = load_housing_data()
housing.head()
|     | longitude | latitude | housing_median_age | total_rooms | total_bedrooms | population | households | median_income | median_house_value | ocean_proximity |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | -122.23 | 37.88 | 41.0 | 880.0 | 129.0 | 322.0 | 126.0 | 8.3252 | 452600.0 | NEAR BAY |
| 1 | -122.22 | 37.86 | 21.0 | 7099.0 | 1106.0 | 2401.0 | 1138.0 | 8.3014 | 358500.0 | NEAR BAY |
| 2 | -122.24 | 37.85 | 52.0 | 1467.0 | 190.0 | 496.0 | 177.0 | 7.2574 | 352100.0 | NEAR BAY |
| 3 | -122.25 | 37.85 | 52.0 | 1274.0 | 235.0 | 558.0 | 219.0 | 5.6431 | 341300.0 | NEAR BAY |
| 4 | -122.25 | 37.85 | 52.0 | 1627.0 | 280.0 | 565.0 | 259.0 | 3.8462 | 342200.0 | NEAR BAY |
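Beyond eyeballing the first few rows, a couple of quick programmatic “sniff” tests can be run as well. This is a minimal sketch; the specific checks are illustrative, not exhaustive:

## Does the population field contain only whole numbers, as a count of people should?
print((housing["population"] % 1 == 0).all())

## How many missing values does each column have?
print(housing.isnull().sum())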

Data Types

Since we just discussed data quality, a good way to verify that an INT field is actually an integer is to use the .info() function to view the data types (Dtype). It also provides other useful information, such as the in-memory size of the dataset (which helps when deciding whether you should or need to distribute the data), the column count, the number of records, and how many of them are non-null.

housing.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20640 entries, 0 to 20639
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           20640 non-null  float64
 1   latitude            20640 non-null  float64
 2   housing_median_age  20640 non-null  float64
 3   total_rooms         20640 non-null  float64
 4   total_bedrooms      20433 non-null  float64
 5   population          20640 non-null  float64
 6   households          20640 non-null  float64
 7   median_income       20640 non-null  float64
 8   median_house_value  20640 non-null  float64
 9   ocean_proximity     20640 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.6+ MB

Summary Statistics

Some more information you will want to know (and that comes in handy later when deciding which features and labels you may want to use) comes from the .describe() function on the dataset. This will give you the count of records as well as the min/max, percentiles, mean, and standard deviation of each numeric column. This lets us sanity-check the dataset against our own experience or anecdotal evidence. For example, if I notice that the average number of bedrooms in our housing dataset is 8, I may come to a couple of conclusions: maybe there are a lot of mansions in this dataset, or maybe apartments are being counted as “houses”. We can always find out what is really going on with more analysis, but it could help me realize the data is poor or wrong before continuing my analysis.

housing.describe()
|     | longitude | latitude | housing\_median\_age | total_rooms | total_bedrooms | population | households | median_income | median\_house\_value |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20433.000000 | 20640.000000 | 20640.000000 | 20640.000000 | 20640.000000 |
| mean | -119.569704 | 35.631861 | 28.639486 | 2635.763081 | 537.870553 | 1425.476744 | 499.539680 | 3.870671 | 206855.816909 |
| std | 2.003532 | 2.135952 | 12.585558 | 2181.615252 | 421.385070 | 1132.462122 | 382.329753 | 1.899822 | 115395.615874 |
| min | -124.350000 | 32.540000 | 1.000000 | 2.000000 | 1.000000 | 3.000000 | 1.000000 | 0.499900 | 14999.000000 |
| 25% | -121.800000 | 33.930000 | 18.000000 | 1447.750000 | 296.000000 | 787.000000 | 280.000000 | 2.563400 | 119600.000000 |
| 50% | -118.490000 | 34.260000 | 29.000000 | 2127.000000 | 435.000000 | 1166.000000 | 409.000000 | 3.534800 | 179700.000000 |
| 75% | -118.010000 | 37.710000 | 37.000000 | 3148.000000 | 647.000000 | 1725.000000 | 605.000000 | 4.743250 | 264725.000000 |
| max | -114.310000 | 41.950000 | 52.000000 | 39320.000000 | 6445.000000 | 35682.000000 | 6082.000000 | 15.000100 | 500001.000000 |

Categorical Variables

Notice in our last output that one field from the dataset is missing. That is because we cannot run summary statistics on columns that are not INT or FLOAT. So what can we do? We can count the categories within that field to get a better understanding of exactly what makes it up and how many records fall into each category.

Be aware of what might actually be a primary key (a unique key per record), as this will not be valuable as a categorical variable; a quick check for this is shown after the output below.

housing['ocean_proximity'].value_counts()
<1H OCEAN     9136
INLAND        6551
NEAR OCEAN    2658
NEAR BAY      2290
ISLAND           5
Name: ocean_proximity, dtype: int64
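As a minimal sketch of the primary-key check mentioned above: if a column has roughly as many unique values as the dataset has rows, it is effectively an identifier rather than a useful categorical variable. The column choice here is just for illustration:

## ocean_proximity has only 5 unique values across 20,640 rows, so it is a genuine category
print(housing["ocean_proximity"].nunique(), "unique values in", len(housing), "rows")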

Distribution of Numerical Data

A good next step in the process is to visualize our numerical data to see how often certain values occur and how they are distributed. We can use histograms or density plots to easily visualize these differences. Histograms separate the data into “bins”: each bin counts the data points that fall into its category or numeric range.

import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
(The output is a grid of histograms, one for each of the nine numerical features.)
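The paragraph above also mentions density plots as an alternative. As a minimal sketch (the column is chosen just for illustration, and scipy is assumed to be installed for the density estimate), pandas can draw one directly:

## KDE (density) plot of a single feature
housing["median_income"].plot(kind="kde")
plt.show()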

Building the Training and Testing Sets for the model

An important part of building any Machine Learning model is making sure that you have a training set (used to build the model) and a test set that you can evaluate the resulting model against. This mainly helps in making sure that your model is neither underfitting nor overfitting and is valid for building a business application. For this example, we are going to build a model on a random selection of 80% of the data and test it on the remaining 20%. This may not always be the best way to split, as we may want to stratify the data evenly along income or location, but we will dive deeper into “stratified sampling” later. For right now, a simple random split will do just fine.

from sklearn.model_selection import train_test_split

train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)
train_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 16512 entries, 14196 to 15795
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           16512 non-null  float64
 1   latitude            16512 non-null  float64
 2   housing_median_age  16512 non-null  float64
 3   total_rooms         16512 non-null  float64
 4   total_bedrooms      16512 non-null  float64
 5   population          16512 non-null  float64
 6   households          16512 non-null  float64
 7   median_income       16512 non-null  float64
 8   median_house_value  16512 non-null  float64
 9   ocean_proximity     16512 non-null  object 
dtypes: float64(9), object(1)
memory usage: 1.4+ MB
test_set.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 4128 entries, 20046 to 3665
Data columns (total 10 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   longitude           4128 non-null   float64
 1   latitude            4128 non-null   float64
 2   housing_median_age  4128 non-null   float64
 3   total_rooms         4128 non-null   float64
 4   total_bedrooms      3921 non-null   float64
 5   population          4128 non-null   float64
 6   households          4128 non-null   float64
 7   median_income       4128 non-null   float64
 8   median_house_value  4128 non-null   float64
 9   ocean_proximity     4128 non-null   object 
dtypes: float64(9), object(1)
memory usage: 354.8+ KB

Building a Predictive Model for Housing Price based on Income

Given our dataset, we are predicting that housing prices can be predicted based on the income of the household. This means that our next step (considering that we have already completed some general Exploratory Data Analysis) is to create training and test sets that reflect this information. We have just created a test and training set, but there is a good chance that income is not distributed evenly; for example, there are most likely very few people with an income over $500k compared to people making less than $500k, yet we want to make sure we include a representative sample of incomes over $500k so that we can model that information. This is a great chance to implement stratified sampling, which means dividing the data into homogeneous subgroups called strata and sampling from each.

housing['income_cat'] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()
(The output is a histogram of the five income categories.)

Stratifying the Income Brackets

Scikit-Learn provides helpers for nearly every step of a Machine Learning workflow; here we use StratifiedShuffleSplit to generate a train/test split that preserves the income-category proportions.

from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

strat_test_set["income_cat"].value_counts() / len(strat_test_set)
3    0.350533
2    0.318798
4    0.176357
5    0.114583
1    0.039729
Name: income_cat, dtype: float64

Train/Test Split Cleanup

Now that we have stratified the test set, we need to remove the ‘income_cat’ field we created to stratify the train/test split. Doing train/test splits is a vital part of building Machine Learning models, and stratification is a way to make the split more representative.

for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)

Visualizing Geographical Data

Now that we have our train and test sets, we can start visualizing data and developing the model. First, let us make a copy of the training set to do our model creation on. Then we will plot the geographical data, since location is one of this model's features. We also want to make sure that when we visualize the data, the graph is easy to read and understand. I won’t try to explain all of the options here; see the pandas plotting documentation for the full list. The only thing I will mention about graph selection is to be aware of the type of “x” and “y” you are plotting. For geographical data, as in this instance, a scatter plot makes sense because latitude/longitude are points on a grid. For a time series I would select a line graph, and a KDE/density plot works if I only have an “x” value and want to see its distribution without the bins of a histogram.

housing = strat_train_set.copy()

housing.plot(kind="scatter", x="longitude", y="latitude",
             alpha=0.4, s=housing["population"]/100, label="Population", figsize=(10, 7),
             c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,)

plt.legend()
(The output is a scatter plot of the districts by longitude and latitude, with marker size proportional to population and color indicating median house value.)

Finding the Correlations

Remember that we were told that income is a strong indicator of housing value. We can measure this with something called the standard correlation coefficient, also called Pearson’s r. It ranges from -1 to 1: 1 is a perfect positive correlation, 0 is no correlation, and -1 is a perfect inverse correlation.

We can check correlation in two ways: by producing a list of correlation values, or by inspecting scatter plots visually. Optimally, we do both.
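For reference, the standard correlation coefficient between a feature x and the target y over m records is defined as follows, where the bars denote the means of x and y:

$$ r_{xy} = \frac{\sum_{i=1}^{m}\big(x^i - \bar{x}\big)\big(y^i - \bar{y}\big)}{\sqrt{\sum_{i=1}^{m}\big(x^i - \bar{x}\big)^2}\,\sqrt{\sum_{i=1}^{m}\big(y^i - \bar{y}\big)^2}} $$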

## Note: on newer pandas versions you may need housing.corr(numeric_only=True) to skip the text column
corr_matrix = housing.corr()

corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value    1.000000
median_income         0.687160
total_rooms           0.135097
housing_median_age    0.114110
households            0.064506
total_bedrooms        0.047689
population           -0.026920
longitude            -0.047432
latitude             -0.142724
Name: median_house_value, dtype: float64
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms", "housing_median_age"]

scatter_matrix(housing[attributes], figsize=(12, 8))
(The output is a 4×4 scatter matrix of the four selected attributes.)

housing.plot(kind="scatter", x="median_income", y="median_house_value", alpha=.1, c="lightgreen")
(The output is a scatter plot of median_income against median_house_value.)

Creating New Features

There doesn’t seem to be any strong correlation besides house value relative to income. This is frustrating when trying to build a robust model. One way around this is to create new features from the ones that already exist in the dataset. For example, say we have a very wealthy area where the rich live on a lot of land. In our dataset, that area would show a low total number of bedrooms but might have a high number of rooms per household, which could well be correlated with house value. Let’s make a couple of adjustments to add the new features into our dataset by mutating the current dataframe.

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"] = housing["population"]/housing["households"]

corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
median_house_value          1.000000
median_income               0.687160
rooms_per_household         0.146285
total_rooms                 0.135097
housing_median_age          0.114110
households                  0.064506
total_bedrooms              0.047689
population_per_household   -0.021985
population                 -0.026920
longitude                  -0.047432
latitude                   -0.142724
bedrooms_per_room          -0.259984
Name: median_house_value, dtype: float64

Prepare Data for the Model

Let’s clean up the current variables and training sets we are using in order to move forward with our data modeling, now that our Exploratory Data Analysis (EDA) is complete. We can now create two sets: one being our features and the other being our labels.

housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

Imputing Values

One thing to also be cautious of is missing values in the dataset. A quick and easy way to get rid of these is to use .drop() or .dropna() to remove the records with missing values. Another option is to impute the median and fill the nulls. Scikit-Learn provides a useful class called SimpleImputer which takes care of the missing values for you. Since it is the easiest to use, we will use it here. Keep in mind that median imputation cannot be done on non-numerical values, so you will need to copy those columns out of your dataset, drop them, and add them back later.

from sklearn.impute import SimpleImputer

## Imputation type (we are using a median)
imputer = SimpleImputer(strategy="median")

## Drop the non-numerical values
housing_num = housing.drop("ocean_proximity", axis=1)

## Fit the median values to the housing numbers 
imputer.fit(housing_num)

## Notice that the two following outputs are the same
print(imputer.statistics_)
print(housing_num.median().values)

## Now transform the training set by using your imputer
X = imputer.transform(housing_num)

## Put it back into a dataframe
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)

housing_tr
[-118.51     34.26     29.     2119.5     433.     1164.      408.
    3.5409]
[-118.51     34.26     29.     2119.5     433.     1164.      408.
    3.5409]

|     | longitude | latitude | housing\_median\_age | total_rooms | total_bedrooms | population | households | median_income |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 17606 | -121.89 | 37.29 | 38.0 | 1568.0 | 351.0 | 710.0 | 339.0 | 2.7042 |
| 18632 | -121.93 | 37.05 | 14.0 | 679.0 | 108.0 | 306.0 | 113.0 | 6.4214 |
| 14650 | -117.20 | 32.77 | 31.0 | 1952.0 | 471.0 | 936.0 | 462.0 | 2.8621 |
| 3230 | -119.61 | 36.31 | 25.0 | 1847.0 | 371.0 | 1460.0 | 353.0 | 1.8839 |
| 3555 | -118.59 | 34.23 | 17.0 | 6592.0 | 1525.0 | 4459.0 | 1463.0 | 3.0347 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 6563 | -118.13 | 34.20 | 46.0 | 1271.0 | 236.0 | 573.0 | 210.0 | 4.9312 |
| 12053 | -117.56 | 33.88 | 40.0 | 1196.0 | 294.0 | 1052.0 | 258.0 | 2.0682 |
| 13908 | -116.40 | 34.09 | 9.0 | 4855.0 | 872.0 | 2098.0 | 765.0 | 3.2723 |
| 11159 | -118.01 | 33.82 | 31.0 | 1960.0 | 380.0 | 1356.0 | 356.0 | 4.0625 |
| 15775 | -122.45 | 37.77 | 52.0 | 3095.0 | 682.0 | 1269.0 | 639.0 | 3.5750 |

Categorical Variables

Let us now look at the only categorical variable in our dataset. Something of note is the use of single brackets, housing["ocean_proximity"], vs. double brackets, housing[["ocean_proximity"]]. A single bracket on a Pandas DataFrame (in this case housing) returns a Pandas Series, while a double bracket returns a Pandas DataFrame. You can always add a Pandas Series as a new column to an existing DataFrame, but combining multiple Series into one table means building a DataFrame anyway. In this case, we want to keep the underlying DataFrame structure, so we will extract the column with double brackets, even though that gives up some of the flexibility of working with a Series.
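A minimal check of that difference (housing here is the training copy used above):

print(type(housing["ocean_proximity"]))    # <class 'pandas.core.series.Series'>
print(type(housing[["ocean_proximity"]]))  # <class 'pandas.core.frame.DataFrame'>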

housing_cat = housing[["ocean_proximity"]]
housing_cat
|     | ocean_proximity |
| --- | --- |
| 17606 | <1H OCEAN |
| 18632 | <1H OCEAN |
| 14650 | NEAR OCEAN |
| 3230 | INLAND |
| 3555 | <1H OCEAN |
| ... | ... |
| 6563 | INLAND |
| 12053 | INLAND |
| 13908 | INLAND |
| 11159 | <1H OCEAN |
| 15775 | NEAR BAY |

Converting Categorical Variables to Numerical Values - Ordinal Encoders

Machine Learning algorithms require data to be in numerical formats in order to build their models properly, but how can we take an object (categorical variable) and convert it to a numeric value? We can use a Scikit-Learn transformer called OrdinalEncoder. It identifies the categories in the column and assigns a number to each, for example blue = 0, red = 1, yellow = 2, and so on.

from sklearn.preprocessing import OrdinalEncoder

ordinal_encoder = OrdinalEncoder()
housing_cat_encoded = ordinal_encoder.fit_transform(housing_cat)
print(housing_cat_encoded[:10])
print(ordinal_encoder.categories_)
[[0.]
 [0.]
 [4.]
 [1.]
 [0.]
 [1.]
 [0.]
 [1.]
 [0.]
 [0.]]
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]

One Hot Encoders

The categorical variable ocean_proximity is now in a numerical format, but there is a new issue. The encoding is ordered, so the Machine Learning model will assume 0 is closer to 1 than 0 is to 4. This is fine if the categories are like ranks, as in 0 = terrible, 1 = bad, 2 = okay, etc., but one ocean proximity is not necessarily better than another. We don’t want our model thinking that NEAR OCEAN is “better” than INLAND, and this is where one-hot encoding comes into play. One-hot encoding replaces the single ordinal column with one binary column per category, so each row has a 1 in the column for its category and 0 everywhere else.

from sklearn.preprocessing import OneHotEncoder

cat_encoder = OneHotEncoder()
housing_cat_1hot = cat_encoder.fit_transform(housing_cat)
print(cat_encoder.categories_)
print(housing_cat_1hot.toarray())
[array(['<1H OCEAN', 'INLAND', 'ISLAND', 'NEAR BAY', 'NEAR OCEAN'],
      dtype=object)]
[[1. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 0. 1.]
 ...
 [0. 1. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 0. 0. 1. 0.]]

Custom Transformers

Recall that earlier we created our X variable using SimpleImputer to make sure that any cells left null were filled in, so we would not receive an error from our Machine Learning model. Let’s also add in a couple of the additional fields we discussed earlier. We will import TransformerMixin, which gives us fit_transform() on our transformer, and BaseEstimator, which gives us two extra methods, get_params() and set_params(), that are useful for automatic hyperparameter tuning.

It might be a good time to explain what a hyperparameter is. In machine learning, a hyperparameter is a parameter whose value is set before the learning process begins; by contrast, the values of other parameters are derived via training. In our case, the transformer has only one hyperparameter, add_bedrooms_per_room, which lets us test whether adding that feature helps the learning algorithm or not (a quick look at get_params() follows the output below).

from sklearn.base import BaseEstimator, TransformerMixin

# Location of total_rooms, total_bedrooms, population, and households in our array
rooms_ix, bedrooms_ix, population_ix, households_ix = 3, 4, 5, 6


class CombinedAttributesAdder(BaseEstimator, TransformerMixin):
    # Determine whether or not we want to calculate bedrooms per room
    def __init__(self, add_bedrooms_per_room=True):
        self.add_bedrooms_per_room = add_bedrooms_per_room

    # Fit the model to X
    def fit(self, X, y=None):
        return self

    # Create the arrays `rooms_per_household` and `population_per_household`
    # If add_bedrooms_per_room = True, then do that calculation
    # Then concat the arrays
    def transform(self, X, y=None):
        rooms_per_household = X[:, rooms_ix] / X[:, households_ix]
        population_per_household = X[:, population_ix] / X[:, households_ix]
        if self.add_bedrooms_per_room:
            bedrooms_per_room = X[:, bedrooms_ix] / X[:, rooms_ix]
            return np.c_[X, rooms_per_household, population_per_household, bedrooms_per_room]
        else:
            return np.c_[X, rooms_per_household, population_per_household]


# Initiate the CombinedAttributesAdder Class without add bedrooms per room calculation
attr_adder = CombinedAttributesAdder(add_bedrooms_per_room=False)
# Do the transforms on the arrays
housing_extra_attribs = attr_adder.transform(housing.values)
housing_extra_attribs
array([[-121.89, 37.29, 38.0, ..., '<1H OCEAN', 4.625368731563422,
        2.094395280235988],
       [-121.93, 37.05, 14.0, ..., '<1H OCEAN', 6.008849557522124,
        2.7079646017699117],
       [-117.2, 32.77, 31.0, ..., 'NEAR OCEAN', 4.225108225108225,
        2.0259740259740258],
       ...,
       [-116.4, 34.09, 9.0, ..., 'INLAND', 6.34640522875817,
        2.742483660130719],
       [-118.01, 33.82, 31.0, ..., '<1H OCEAN', 5.50561797752809,
        3.808988764044944],
       [-122.45, 37.77, 52.0, ..., 'NEAR BAY', 4.843505477308295,
        1.9859154929577465]], dtype=object)
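As a minimal illustration of what BaseEstimator provides (using the attr_adder instance created above):

## get_params() comes from BaseEstimator and exposes the hyperparameter defined in __init__
print(attr_adder.get_params())    # {'add_bedrooms_per_room': False}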

Feature Scaling

One of the issues we have right now with our dataset is in the numerical data. If you look back at the summary statistics, the range of values varies widely across features. For example, total_rooms ranges from 2 to 39,320, while median_income only ranges from about 0.5 to 15. Many Machine Learning algorithms perform poorly when features are on such different scales. We can fix this with min-max scaling (commonly referred to as normalization) or standardization. Min-max scaling rescales all values in each feature to the range 0–1. Standardization subtracts the mean value (so standardized values always have zero mean) and then divides by the standard deviation. Standardization is the better choice when most of your numbers are in a small range (say 0–20) but you have a few outliers (92, 99, 105); with min-max scaling, the majority of the values would be squeezed into the 0–0.2 range.
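As a minimal sketch of the two approaches on a toy column (the numbers are invented purely for illustration):

from sklearn.preprocessing import MinMaxScaler, StandardScaler

toy = np.array([[1.], [5.], [10.], [20.], [99.]])   # a small range plus one outlier

print(MinMaxScaler().fit_transform(toy).ravel())    # values below 20 all land between 0 and ~0.2
print(StandardScaler().fit_transform(toy).ravel())  # zero mean, unit variance; the outlier dominates less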

Transformation Pipelines

There are many data transformations that need to take place when building a Machine Learning model, and it is important that they get executed in the correct order, which is why Scikit-Learn provides a Pipeline class.

## Handle Numerical Data
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

num_pipeline = Pipeline([
    ('imputer', SimpleImputer(strategy="median")),
    ('attribs_adder', CombinedAttributesAdder()),
    ('std_scaler', StandardScaler()),
])

## Handle Categorical Data
from sklearn.compose import ColumnTransformer

num_attribs = list(housing_num)
cat_attribs = ["ocean_proximity"]

full_pipeline = ColumnTransformer([
    ("num", num_pipeline, num_attribs),
    ("cat", OneHotEncoder(), cat_attribs),
])

housing_prepared = full_pipeline.fit_transform(housing)
housing_prepared
array([[-1.15604281,  0.77194962,  0.74333089, ...,  0.        ,
         0.        ,  0.        ],
       [-1.17602483,  0.6596948 , -1.1653172 , ...,  0.        ,
         0.        ,  0.        ],
       [ 1.18684903, -1.34218285,  0.18664186, ...,  0.        ,
         0.        ,  1.        ],
       ...,
       [ 1.58648943, -0.72478134, -1.56295222, ...,  0.        ,
         0.        ,  0.        ],
       [ 0.78221312, -0.85106801,  0.18664186, ...,  0.        ,
         0.        ,  0.        ],
       [-1.43579109,  0.99645926,  1.85670895, ...,  0.        ,
         1.        ,  0.        ]])

Training the Model

Since we are trying to predict a continuous variable (median_house_value) based on a number of inputs, a linear regression model should be a sufficient starting point. With just a couple of lines we can build our regression model.

from sklearn.linear_model import LinearRegression

lin_reg = LinearRegression()
lin_reg.fit(housing_prepared, housing_labels)

## Let's test out the result
some_data = housing.iloc[:5]
some_labels = housing_labels.iloc[:5]
some_data_prepared = full_pipeline.transform(some_data)
print("Predictions:", lin_reg.predict(some_data_prepared))
print("Labels:", list(some_labels))
Predictions: [210644.60459286 317768.80697211 210956.43331178  59218.98886849
 189747.55849879]
Labels: [286600.0, 340600.0, 196900.0, 46300.0, 254500.0]

Root Mean Squared Error

We can now test the accuracy of the model with the Root Mean Squared Error (RMSE). Variables:

  • m – the number of instances we are testing on, e.g. 2,000 users of a website
  • h – the system's prediction function; e.g. if h(x^i) = 100 but the actual value is 98, then the error for that instance is 2
  • y – the actual value

$$ RMSE = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\big(h(x^i)-y^i\big)^2} $$

from sklearn.metrics import mean_squared_error

housing_predictions = lin_reg.predict(housing_prepared)
lin_mse = mean_squared_error(housing_labels, housing_predictions)
lin_rmse = np.sqrt(lin_mse)
print("Linear Regression RMSE: $", lin_rmse)

from sklearn.tree import DecisionTreeRegressor

tree_reg = DecisionTreeRegressor()
tree_reg.fit(housing_prepared, housing_labels)

## Let's test out the result
housing_predictions = tree_reg.predict(housing_prepared)
tree_mse = mean_squared_error(housing_labels, housing_predictions)
tree_rmse = np.sqrt(tree_mse)
print("Decision Tree Regressor RMSE: $", tree_rmse)
Linear Regression RMSE: $ 68628.19819848923
Decision Tree Regressor RMSE: $ 0.0

Underfitting & Overfitting

Linear Regression: we have an RMSE of roughly $68,000, which is far too high for our model to be useful. You can look back at either our summary statistics or the histograms to see that a significant majority of houses are valued under $300,000, which means most of the time our predictions will be off by a large fraction of the true value. This means we need to select a more powerful model to better fit the data. Let us try a Decision Tree.

Decision Tree Regressor: we now have an RMSE of 0, which means we have badly overfit the model. There is another way we can validate: cross-validation.

Cross-Validation

Cross-validation splits the training set into smaller subsets called folds. With cv=10 it then trains and evaluates the model 10 different times, each time training on nine folds and validating on the one held out.

from sklearn.model_selection import cross_val_score

## Tree Regressor Cross-Validation
tree_scores = cross_val_score(tree_reg, housing_prepared, housing_labels
                        , scoring="neg_mean_squared_error", cv=10)

tree_rmse_scores = np.sqrt(-tree_scores)

def display_scores(scores, score_type=False):
    if score_type:
        print(score_type)
    print("Scores:", scores)
    print("Mean:", scores.mean())
    print("Standard Deviation:", scores.std(), "\n")
    
display_scores(tree_rmse_scores, score_type="Tree Regressor Scores")

## Linear Regression Cross-Validation
lin_scores = cross_val_score(lin_reg, housing_prepared, housing_labels
                            , scoring="neg_mean_squared_error", cv=10)

lin_rmse_scores = np.sqrt(-lin_scores)
display_scores(lin_rmse_scores, "Linear Regression")
Tree Regressor Scores
Scores: [68443.13363341 66963.43220524 71951.5871376  69744.3544759
 70956.28726992 74411.84802589 69512.13019592 71007.52586295
 77136.84489598 69924.55189707]
Mean: 71005.16955998751
Standard Deviation: 2792.6193178409467 

Linear Regression
Scores: [66782.73843989 66960.118071   70347.95244419 74739.57052552
 68031.13388938 71193.84183426 64969.63056405 68281.61137997
 71552.91566558 67665.10082067]
Mean: 69052.46136345083
Standard Deviation: 2731.6740017983493 

Random Forest Regressor

Those scores are pretty bad given the values of our houses, and the Decision Tree Regressor actually performed worse than Linear Regression once we did validation, after it had overfit the original training set. We can try a Random Forest Regressor, which is still a Decision Tree algorithm, but it trains many Decision Trees on random subsets of the data and averages their predictions. This is called Ensemble Learning: building a model on top of other models.

from sklearn.ensemble import RandomForestRegressor

forest_reg = RandomForestRegressor()
forest_reg.fit(housing_prepared, housing_labels)

## Let's test out the result
housing_predictions = forest_reg.predict(housing_prepared)
forest_mse = mean_squared_error(housing_labels, housing_predictions)
forest_rmse = np.sqrt(forest_mse)
print("Random Forest Regressor RMSE: $", forest_rmse)

## Cross-Validation
forest_scores = cross_val_score(forest_reg, housing_prepared, housing_labels
                               , scoring="neg_mean_squared_error", cv=10)

forest_rmse_scores = np.sqrt(-forest_scores)
display_scores(forest_rmse_scores, "Random Forest Regressor")
Random Forest Regressor RMSE: $ 18721.054609116614
Random Forest Regressor
Scores: [49724.37068679 47567.33847834 49775.12954272 52280.23837291
 49646.01360053 53781.59978957 48636.66258168 48236.17463882
 53141.66716146 50344.06960132]
Mean: 50313.326445413615
Standard Deviation: 1991.1343839008105 

Grid Search

Depending on your dataset and model, it could take a very long time to find the hyperparameters that best tune your model. Scikit-Learn has a special method that makes searching over hyperparameters easier: GridSearchCV, which as a bonus also does the cross-validation for you. Using GridSearchCV we will rebuild our model using a Random Forest Regressor.

from sklearn.model_selection import GridSearchCV

param_grid = [
    {'n_estimators': [3, 10, 30], 'max_features': [2, 4, 6, 8]},
    {'bootstrap': [False], 'n_estimators': [3, 10], 'max_features': [2, 3, 4]},
]

forest_reg = RandomForestRegressor()

grid_search = GridSearchCV(forest_reg, param_grid, cv=5,
                          scoring='neg_mean_squared_error',
                          return_train_score=True)

grid_search.fit(housing_prepared, housing_labels)

print("Best Parameters: ", grid_search.best_params_)
print("Best Estimators: ", grid_search.best_estimator_)

cvres = grid_search.cv_results_
for mean_score, params in zip(cvres["mean_test_score"], cvres["params"]):
    print(np.sqrt(-mean_score), params)
Best Parameters:  {'max_features': 8, 'n_estimators': 30}
Best Estimators:  RandomForestRegressor(bootstrap=True, ccp_alpha=0.0, criterion='mse',
                      max_depth=None, max_features=8, max_leaf_nodes=None,
                      max_samples=None, min_impurity_decrease=0.0,
                      min_impurity_split=None, min_samples_leaf=1,
                      min_samples_split=2, min_weight_fraction_leaf=0.0,
                      n_estimators=30, n_jobs=None, oob_score=False,
                      random_state=None, verbose=0, warm_start=False)
63751.62463032517 {'max_features': 2, 'n_estimators': 3}
55519.26758887549 {'max_features': 2, 'n_estimators': 10}
52854.50704958727 {'max_features': 2, 'n_estimators': 30}
59946.17467921899 {'max_features': 4, 'n_estimators': 3}
52927.20373239646 {'max_features': 4, 'n_estimators': 10}
50796.772611790264 {'max_features': 4, 'n_estimators': 30}
58807.941310200236 {'max_features': 6, 'n_estimators': 3}
52189.7593194014 {'max_features': 6, 'n_estimators': 10}
50201.77749605266 {'max_features': 6, 'n_estimators': 30}
58782.83779869211 {'max_features': 8, 'n_estimators': 3}
52334.76522229874 {'max_features': 8, 'n_estimators': 10}
49960.09392368449 {'max_features': 8, 'n_estimators': 30}
63674.945782081035 {'bootstrap': False, 'max_features': 2, 'n_estimators': 3}
54311.122616722656 {'bootstrap': False, 'max_features': 2, 'n_estimators': 10}
59749.420879375226 {'bootstrap': False, 'max_features': 3, 'n_estimators': 3}
53270.33634572209 {'bootstrap': False, 'max_features': 3, 'n_estimators': 10}
58698.551341328275 {'bootstrap': False, 'max_features': 4, 'n_estimators': 3}
51497.26148683206 {'bootstrap': False, 'max_features': 4, 'n_estimators': 10}

Analyzing Different Models

Each feature has a relative importance to the model. Here we can use grid_search.best_estimator_.feature_importances_ to determine how much each feature affects our model.

feature_importances = grid_search.best_estimator_.feature_importances_
extra_attribs = ["rooms_per_household", "pop_per_household", "bedrooms_per_room"]
cat_encoder = full_pipeline.named_transformers_["cat"]
cat_one_hot_attribs = list(cat_encoder.categories_[0])
attributes = num_attribs + extra_attribs + cat_one_hot_attribs

sorted(zip(feature_importances, attributes), reverse=True)
[(0.3794309763098958, 'median_income'),
 (0.14579475504671696, 'INLAND'),
 (0.10784884040860719, 'pop_per_household'),
 (0.07514859090712843, 'longitude'),
 (0.06187209331031717, 'latitude'),
 (0.06114686250510409, 'bedrooms_per_room'),
 (0.046797629222457254, 'rooms_per_household'),
 (0.04216877408475512, 'housing_median_age'),
 (0.015056733749506916, 'population'),
 (0.014845512974142713, 'total_rooms'),
 (0.014243477381994003, 'total_bedrooms'),
 (0.0140434292103578, 'households'),
 (0.01124699346515189, '<1H OCEAN'),
 (0.006471116507291989, 'NEAR OCEAN'),
 (0.0038259856761146635, 'NEAR BAY'),
 (5.8229240457955225e-05, 'ISLAND')]

Evaluate the Model on the Test Set

Now to test the final model!

final_model = grid_search.best_estimator_

X_test = strat_test_set.drop("median_house_value", axis=1)
y_test = strat_test_set["median_house_value"].copy()

X_test_prepared = full_pipeline.transform(X_test)

final_predictions = final_model.predict(X_test_prepared)

final_mse = mean_squared_error(y_test, final_predictions)
final_rmse = np.sqrt(final_mse)

print("Final MSE: ", final_mse)
print("Final RMSE: ", final_rmse)

## Show the confidence interval
from scipy import stats

confidence = 0.95
squared_errors = (final_predictions - y_test) ** 2
interval = np.sqrt(stats.t.interval(confidence, len(squared_errors) - 1,
                                   loc=squared_errors.mean(),
                                   scale=stats.sem(squared_errors)))
print("Interval: ", interval)
Final MSE:  2283090426.2149405
Final RMSE:  47781.695514233696
Interval:  [45783.60307351 49699.52255341]