5 minute read

This notebook is an exercise from Kaggle's Intermediate Machine Learning course and accompanies the cross-validation tutorial.

In this exercise, you will use what you’ve learned to tune a machine learning model with cross-validation.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex5 import *
print("Setup Complete")
Setup Complete

You will work with the Housing Prices Competition for Kaggle Learn Users from the previous exercise.

[Image: Ames Housing dataset]

Run the next code cell without changes to load the training and test data in X and X_test. For simplicity, we drop categorical variables.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
train_data = pd.read_csv('../input/train.csv', index_col='Id')
test_data = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
train_data.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = train_data.SalePrice              
train_data.drop(['SalePrice'], axis=1, inplace=True)

# Select numeric columns only
numeric_cols = [cname for cname in train_data.columns if train_data[cname].dtype in ['int64', 'float64']]
X = train_data[numeric_cols].copy()
X_test = test_data[numeric_cols].copy()

Use the next code cell to print the first several rows of the data.

X.head()
MSSubClass LotFrontage LotArea OverallQual OverallCond YearBuilt YearRemodAdd MasVnrArea BsmtFinSF1 BsmtFinSF2 ... GarageArea WoodDeckSF OpenPorchSF EnclosedPorch 3SsnPorch ScreenPorch PoolArea MiscVal MoSold YrSold
Id
1 60 65.0 8450 7 5 2003 2003 196.0 706 0 ... 548 0 61 0 0 0 0 0 2 2008
2 20 80.0 9600 6 8 1976 1976 0.0 978 0 ... 460 298 0 0 0 0 0 0 5 2007
3 60 68.0 11250 7 5 2001 2002 162.0 486 0 ... 608 0 42 0 0 0 0 0 9 2008
4 70 60.0 9550 7 5 1915 1970 0.0 216 0 ... 642 0 35 272 0 0 0 0 2 2006
5 60 84.0 14260 8 5 2000 2000 350.0 655 0 ... 836 192 84 0 0 0 0 0 12 2008

5 rows × 36 columns

X.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1460 entries, 1 to 1460
Data columns (total 36 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSSubClass     1460 non-null   int64  
 1   LotFrontage    1201 non-null   float64
 2   LotArea        1460 non-null   int64  
 3   OverallQual    1460 non-null   int64  
 4   OverallCond    1460 non-null   int64  
 5   YearBuilt      1460 non-null   int64  
 6   YearRemodAdd   1460 non-null   int64  
 7   MasVnrArea     1452 non-null   float64
 8   BsmtFinSF1     1460 non-null   int64  
 9   BsmtFinSF2     1460 non-null   int64  
 10  BsmtUnfSF      1460 non-null   int64  
 11  TotalBsmtSF    1460 non-null   int64  
 12  1stFlrSF       1460 non-null   int64  
 13  2ndFlrSF       1460 non-null   int64  
 14  LowQualFinSF   1460 non-null   int64  
 15  GrLivArea      1460 non-null   int64  
 16  BsmtFullBath   1460 non-null   int64  
 17  BsmtHalfBath   1460 non-null   int64  
 18  FullBath       1460 non-null   int64  
 19  HalfBath       1460 non-null   int64  
 20  BedroomAbvGr   1460 non-null   int64  
 21  KitchenAbvGr   1460 non-null   int64  
 22  TotRmsAbvGrd   1460 non-null   int64  
 23  Fireplaces     1460 non-null   int64  
 24  GarageYrBlt    1379 non-null   float64
 25  GarageCars     1460 non-null   int64  
 26  GarageArea     1460 non-null   int64  
 27  WoodDeckSF     1460 non-null   int64  
 28  OpenPorchSF    1460 non-null   int64  
 29  EnclosedPorch  1460 non-null   int64  
 30  3SsnPorch      1460 non-null   int64  
 31  ScreenPorch    1460 non-null   int64  
 32  PoolArea       1460 non-null   int64  
 33  MiscVal        1460 non-null   int64  
 34  MoSold         1460 non-null   int64  
 35  YrSold         1460 non-null   int64  
dtypes: float64(3), int64(33)
memory usage: 422.0 KB

So far, you’ve learned how to build pipelines with scikit-learn. For instance, the pipeline below uses SimpleImputer() to replace missing values in the data before training a random forest model with RandomForestRegressor(). The n_estimators parameter sets the number of trees in the random forest, and setting random_state ensures reproducibility.

from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer

my_pipeline = Pipeline(steps=[
    ('preprocessor', SimpleImputer()),
    ('model', RandomForestRegressor(n_estimators=50, random_state=0))
])

You have also learned how to use pipelines in cross-validation. The code below uses the cross_val_score() function to obtain the mean absolute error (MAE), averaged across five different folds. Recall we set the number of folds with the cv parameter.

from sklearn.model_selection import cross_val_score

# Multiply by -1 since sklearn calculates *negative* MAE
scores = -1 * cross_val_score(my_pipeline, X, y,
                              cv=5,
                              scoring='neg_mean_absolute_error')

print("Average MAE score:", scores.mean())
Average MAE score: 18276.410356164386
scores
array([18549.88568493, 17896.67034247, 18462.68472603, 16587.02472603,
       19885.78630137])

Step 1: Write a useful function

In this exercise, you’ll use cross-validation to select parameters for a machine learning model.

Begin by writing a function get_score() that reports the average (over three cross-validation folds) MAE of a machine learning pipeline that uses:

  • the data in X and y to create folds,
  • SimpleImputer() (with all parameters left as default) to replace missing values, and
  • RandomForestRegressor() (with random_state=0) to fit a random forest model.

The n_estimators parameter supplied to get_score() is used when setting the number of trees in the random forest model.

def get_score(n_estimators):
    """Return the average MAE over 3 CV folds of a random forest model.
    
    Keyword argument:
    n_estimators -- the number of trees in the forest
    """
    my_pipeline = Pipeline(steps=[
        ('preprocessor', SimpleImputer()),
        ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))
    ])
    # Multiply by -1 since sklearn calculates *negative* MAE
    scores = -1 * cross_val_score(my_pipeline, X, y,
                                  cv=3,
                                  scoring='neg_mean_absolute_error')
    return scores.mean()

# Check your answer
step_1.check()

Correct

# Lines below will give you a hint or solution code
#step_1.hint()
#step_1.solution()

Step 2: Test different parameter values

Now, you will use the function that you defined in Step 1 to evaluate the model performance corresponding to eight different values for the number of trees in the random forest: 50, 100, 150, …, 300, 350, 400.

Store your results in a Python dictionary results, where results[i] is the average MAE returned by get_score(i).

results = {n_estimators : get_score(n_estimators) for n_estimators in range(50, 401, 50)}
    
results
{50: 18353.8393511688,
 100: 18395.2151680032,
 150: 18288.730020956387,
 200: 18248.345889801505,
 250: 18255.26922247291,
 300: 18275.241922621914,
 350: 18270.29183308043,
 400: 18270.197974402367}

# Check your answer
step_2.check()

Correct

# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()

Use the next cell to visualize your results from Step 2. Run the code without changes.

import matplotlib.pyplot as plt
%matplotlib inline

plt.plot(list(results.keys()), list(results.values()))
plt.show()

[Plot: average MAE for each value of n_estimators]

Step 3: Find the best parameter value

Given the results, which value for n_estimators seems best for the random forest model? Use your answer to set the value of n_estimators_best.

n_estimators_best = 200

# Check your answer
step_3.check()

Correct

# Lines below will give you a hint or solution code
#step_3.hint()
#step_3.solution()
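
Rather than reading the best value off the plot, you could also select it programmatically from the results dictionary built in Step 2. A minimal sketch:

# Pick the n_estimators value whose average MAE is lowest
n_estimators_best = min(results, key=results.get)
print(n_estimators_best)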

In this exercise, you have explored one method for choosing appropriate parameters in a machine learning model.

If you’d like to learn more about hyperparameter optimization, you’re encouraged to start with grid search, which is a straightforward method for determining the best combination of parameters for a machine learning model. Thankfully, scikit-learn also provides a built-in class, GridSearchCV(), that can make your grid search code very efficient!
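
As a rough illustration (not part of this exercise), the parameter sweep from Step 2 could be expressed with GridSearchCV(). In the sketch below, the 'model__n_estimators' key targets the n_estimators parameter of the 'model' step in my_pipeline defined above, and the grid contains the same candidate values tried in Step 2.

from sklearn.model_selection import GridSearchCV

# Search over the same candidate values used in Step 2
param_grid = {'model__n_estimators': list(range(50, 401, 50))}
grid_search = GridSearchCV(my_pipeline, param_grid,
                           cv=3,
                           scoring='neg_mean_absolute_error')
grid_search.fit(X, y)

# Multiply by -1 since sklearn calculates *negative* MAE
print("Best n_estimators:", grid_search.best_params_['model__n_estimators'])
print("Best average MAE:", -1 * grid_search.best_score_)

Because the search is run on the pipeline as a whole, imputation is refit inside each fold, so there is no leakage between the training and validation folds.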

Keep going

Continue to learn about gradient boosting, a powerful technique that achieves state-of-the-art results on a variety of datasets.


Have questions or comments? Visit the course discussion forum to chat with other learners.
