Intermediate Machine Learning: Pipelines
This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.
In this exercise, you will use pipelines to improve the efficiency of your machine learning code.
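As a minimal sketch of the idea (the X, y, and X_new names below are hypothetical placeholders, not variables from this exercise), a pipeline bundles preprocessing and modeling into a single object, so one fit call and one predict call replace several manual steps:
# A minimal sketch: bundle an imputer and a model into one object.
# `X`, `y`, and `X_new` are hypothetical placeholders for your own data.
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

pipeline = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),    # fill missing values
    ('model', RandomForestRegressor(random_state=0))  # then fit the model
])
# pipeline.fit(X, y)               # preprocesses X, then trains the model
# preds = pipeline.predict(X_new)  # applies the same preprocessing before predicting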
Setup
The questions below will give you feedback on your work. Run the following cell to set up the feedback system.
# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv")
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")
Setup Complete
You will work with data from the Housing Prices Competition for Kaggle Learn Users.
Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.
import pandas as pd
from sklearn.model_selection import train_test_split
# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')
# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)
# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y,
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)
# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and
                    X_train_full[cname].dtype == "object"]
# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if
                  X_train_full[cname].dtype in ['int64', 'float64']]
# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
X_train.head()
Id | MSZoning | Street | Alley | LotShape | LandContour | Utilities | LotConfig | LandSlope | Condition1 | Condition2 | ... | GarageArea | WoodDeckSF | OpenPorchSF | EnclosedPorch | 3SsnPorch | ScreenPorch | PoolArea | MiscVal | MoSold | YrSold
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---
619 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | Norm | Norm | ... | 774 | 0 | 108 | 0 | 0 | 260 | 0 | 0 | 7 | 2007 |
871 | RL | Pave | NaN | Reg | Lvl | AllPub | Inside | Gtl | PosN | Norm | ... | 308 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 8 | 2009 |
93 | RL | Pave | Grvl | IR1 | HLS | AllPub | Inside | Gtl | Norm | Norm | ... | 432 | 0 | 0 | 44 | 0 | 0 | 0 | 0 | 8 | 2009 |
818 | RL | Pave | NaN | IR1 | Lvl | AllPub | CulDSac | Gtl | Norm | Norm | ... | 857 | 150 | 59 | 0 | 0 | 0 | 0 | 0 | 7 | 2008 |
303 | RL | Pave | NaN | IR1 | Lvl | AllPub | Corner | Gtl | Norm | Norm | ... | 843 | 468 | 81 | 0 | 0 | 0 | 0 | 0 | 1 | 2006 |
5 rows × 76 columns
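If you are curious which categorical columns the cardinality filter above excluded, a quick optional check (the object_cols and cardinality names are introduced here just for illustration) is:
# Optional: inspect cardinality of the object-typed columns.
# Columns with 10 or more unique values were excluded above.
object_cols = [col for col in X_train_full.columns if X_train_full[col].dtype == "object"]
cardinality = X_train_full[object_cols].nunique().sort_values(ascending=False)
print(cardinality.head())  # the columns most likely to blow up one-hot encoding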
The next code cell uses code from the tutorial to preprocess the data and train a model. Run this code without changes.
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])
# Preprocessing of training data, fit model
clf.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)
print('MAE:', mean_absolute_error(y_valid, preds))
MAE: 17861.780102739725
The code yields a value around 17862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
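As a reminder, the MAE is simply the average absolute gap between predicted and true prices. A tiny worked example with made-up numbers:
# A small worked example of mean absolute error (illustrative numbers only)
from sklearn.metrics import mean_absolute_error
y_true = [200000, 150000, 300000]
y_pred = [210000, 145000, 280000]
# |200000-210000| + |150000-145000| + |300000-280000| = 35000; 35000 / 3 = 11666.67
print(mean_absolute_error(y_true, y_pred))  # 11666.666...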
Step 1: Improve the performance
Part A
Now, it’s your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:
- numerical_transformer
- categorical_transformer
- model
To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.
X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1168 entries, 619 to 685
Data columns (total 76 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSZoning 1168 non-null object
1 Street 1168 non-null object
2 Alley 71 non-null object
3 LotShape 1168 non-null object
4 LandContour 1168 non-null object
5 Utilities 1168 non-null object
6 LotConfig 1168 non-null object
7 LandSlope 1168 non-null object
8 Condition1 1168 non-null object
9 Condition2 1168 non-null object
10 BldgType 1168 non-null object
11 HouseStyle 1168 non-null object
12 RoofStyle 1168 non-null object
13 RoofMatl 1168 non-null object
14 MasVnrType 1162 non-null object
15 ExterQual 1168 non-null object
16 ExterCond 1168 non-null object
17 Foundation 1168 non-null object
18 BsmtQual 1140 non-null object
19 BsmtCond 1140 non-null object
20 BsmtExposure 1140 non-null object
21 BsmtFinType1 1140 non-null object
22 BsmtFinType2 1139 non-null object
23 Heating 1168 non-null object
24 HeatingQC 1168 non-null object
25 CentralAir 1168 non-null object
26 Electrical 1167 non-null object
27 KitchenQual 1168 non-null object
28 Functional 1168 non-null object
29 FireplaceQu 617 non-null object
30 GarageType 1110 non-null object
31 GarageFinish 1110 non-null object
32 GarageQual 1110 non-null object
33 GarageCond 1110 non-null object
34 PavedDrive 1168 non-null object
35 PoolQC 4 non-null object
36 Fence 214 non-null object
37 MiscFeature 49 non-null object
38 SaleType 1168 non-null object
39 SaleCondition 1168 non-null object
40 MSSubClass 1168 non-null int64
41 LotFrontage 956 non-null float64
42 LotArea 1168 non-null int64
43 OverallQual 1168 non-null int64
44 OverallCond 1168 non-null int64
45 YearBuilt 1168 non-null int64
46 YearRemodAdd 1168 non-null int64
47 MasVnrArea 1162 non-null float64
48 BsmtFinSF1 1168 non-null int64
49 BsmtFinSF2 1168 non-null int64
50 BsmtUnfSF 1168 non-null int64
51 TotalBsmtSF 1168 non-null int64
52 1stFlrSF 1168 non-null int64
53 2ndFlrSF 1168 non-null int64
54 LowQualFinSF 1168 non-null int64
55 GrLivArea 1168 non-null int64
56 BsmtFullBath 1168 non-null int64
57 BsmtHalfBath 1168 non-null int64
58 FullBath 1168 non-null int64
59 HalfBath 1168 non-null int64
60 BedroomAbvGr 1168 non-null int64
61 KitchenAbvGr 1168 non-null int64
62 TotRmsAbvGrd 1168 non-null int64
63 Fireplaces 1168 non-null int64
64 GarageYrBlt 1110 non-null float64
65 GarageCars 1168 non-null int64
66 GarageArea 1168 non-null int64
67 WoodDeckSF 1168 non-null int64
68 OpenPorchSF 1168 non-null int64
69 EnclosedPorch 1168 non-null int64
70 3SsnPorch 1168 non-null int64
71 ScreenPorch 1168 non-null int64
72 PoolArea 1168 non-null int64
73 MiscVal 1168 non-null int64
74 MoSold 1168 non-null int64
75 YrSold 1168 non-null int64
dtypes: float64(3), int64(33), object(40)
memory usage: 702.6+ KB
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='mean')
# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])
# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)
# Check your answer
step_1.a.check()
Correct
# Lines below will give you a hint or solution code
step_1.a.hint()
#step_1.a.solution()
Hint: While there are many different potential solutions to this problem, we achieved satisfactory results by changing only numerical_transformer from the default value - specifically, we changed the strategy parameter that decides how missing values are imputed.
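One way to act on this hint is to loop over candidate imputation strategies and compare the validation MAE of each. This optional sketch reuses the variables defined above; the trial_* names are introduced here for illustration:
# Optional experiment: compare numerical imputation strategies by validation MAE.
# Reuses categorical_transformer, the column lists, and the train/valid splits from above.
for strategy in ['mean', 'median', 'most_frequent', 'constant']:
    trial_preprocessor = ColumnTransformer(transformers=[
        ('num', SimpleImputer(strategy=strategy), numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])
    trial_pipeline = Pipeline(steps=[('preprocessor', trial_preprocessor),
                                     ('model', RandomForestRegressor(n_estimators=100, random_state=0))])
    trial_pipeline.fit(X_train, y_train)
    trial_mae = mean_absolute_error(y_valid, trial_pipeline.predict(X_valid))
    print(f'{strategy}: MAE = {trial_mae:.0f}')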
Part B
Run the code cell below without changes.
To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You’re encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)
# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])
# Preprocessing of training data, fit model
my_pipeline.fit(X_train, y_train)
# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)
# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)
# Check your answer
step_1.b.check()
MAE: 17479.87044520548
Correct
# Line below will give you a hint
#step_1.b.hint()
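If you would like to push the MAE lower still, pipelines plug directly into scikit-learn's hyperparameter search tools, where a step's parameters are addressed as stepname__parameter. An optional sketch follows; the parameter grid is only an illustration, and the search can take a few minutes to run:
# Optional: search over pipeline hyperparameters with cross-validation.
# Nested step parameters use the 'stepname__parameter' naming convention.
from sklearn.model_selection import GridSearchCV

param_grid = {
    'preprocessor__num__strategy': ['mean', 'median'],  # numerical imputation
    'model__n_estimators': [100, 200],                  # forest size
}
search = GridSearchCV(my_pipeline, param_grid,
                      scoring='neg_mean_absolute_error', cv=3)
search.fit(X_train, y_train)
print(search.best_params_, -search.best_score_)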
Step 2: Generate test predictions
Now, you’ll use your trained model to generate predictions with the test data.
# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)
# Check your answer
step_2.check()
Correct
# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()
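The single predict call above works because the fitted pipeline applies the exact preprocessing it learned from the training data before the model sees X_test. If you are curious, you can peek at the transformed feature matrix (the X_test_transformed name is introduced here for illustration):
# Optional: inspect the preprocessed test features the model actually sees.
X_test_transformed = my_pipeline.named_steps['preprocessor'].transform(X_test)
print(X_test_transformed.shape)  # rows of X_test, columns after imputation + one-hot encoding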
Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.
# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1459 non-null int64
1 SalePrice 1459 non-null float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB
Submit your results
Once you have successfully completed Step 2, you’re ready to submit your results to the leaderboard! If you choose to do so, make sure that you have already joined the competition by clicking on the Join Competition button at this link.
- Begin by clicking on the Save Version button in the top right corner of the window. This will generate a pop-up window.
- Ensure that the Save and Run All option is selected, and then click on the Save button.
- This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (…) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
- Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the Submit button to submit your results to the leaderboard.
You have now successfully submitted to the competition!
If you want to keep working to improve your performance, select the Edit button in the top right of the screen. Then you can change your code and repeat the process. There’s a lot of room to improve, and you will climb up the leaderboard as you work.
Keep going
Move on to learn about cross-validation, a technique you can use to obtain more accurate estimates of model performance!
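As a small preview (reusing the my_pipeline, X_full, my_cols, and y objects from this exercise), cross-validation scores a pipeline on several train/validation splits instead of a single one:
# Preview of cross-validation: score the pipeline on 5 different splits.
from sklearn.model_selection import cross_val_score

scores = -1 * cross_val_score(my_pipeline, X_full[my_cols], y,
                              cv=5, scoring='neg_mean_absolute_error')
print('Average MAE across folds:', scores.mean())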
Have questions or comments? Visit the course discussion forum to chat with other learners.