This notebook is an exercise in the Intermediate Machine Learning course. You can reference the tutorial at this link.


In this exercise, you will use pipelines to improve the efficiency of your machine learning code.

Setup

The questions below will give you feedback on your work. Run the following cell to set up the feedback system.

# Set up code checking
import os
if not os.path.exists("../input/train.csv"):
    os.symlink("../input/home-data-for-ml-course/train.csv", "../input/train.csv")  
    os.symlink("../input/home-data-for-ml-course/test.csv", "../input/test.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.ml_intermediate.ex4 import *
print("Setup Complete")
Setup Complete

You will work with data from the Housing Prices Competition for Kaggle Learn Users.

Ames Housing dataset image

Run the next code cell without changes to load the training and validation sets in X_train, X_valid, y_train, and y_valid. The test set is loaded in X_test.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the data
X_full = pd.read_csv('../input/train.csv', index_col='Id')
X_test_full = pd.read_csv('../input/test.csv', index_col='Id')

# Remove rows with missing target, separate target from predictors
X_full.dropna(axis=0, subset=['SalePrice'], inplace=True)
y = X_full.SalePrice
X_full.drop(['SalePrice'], axis=1, inplace=True)

# Break off validation set from training data
X_train_full, X_valid_full, y_train, y_valid = train_test_split(X_full, y, 
                                                                train_size=0.8, test_size=0.2,
                                                                random_state=0)

# "Cardinality" means the number of unique values in a column
# Select categorical columns with relatively low cardinality (convenient but arbitrary)
categorical_cols = [cname for cname in X_train_full.columns if
                    X_train_full[cname].nunique() < 10 and 
                    X_train_full[cname].dtype == "object"]

# Select numerical columns
numerical_cols = [cname for cname in X_train_full.columns if 
                X_train_full[cname].dtype in ['int64', 'float64']]

# Keep selected columns only
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_valid = X_valid_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
X_train.head()
Id   MSZoning  Street  Alley  LotShape  LandContour  Utilities  LotConfig  LandSlope  Condition1  Condition2  ...  GarageArea  WoodDeckSF  OpenPorchSF  EnclosedPorch  3SsnPorch  ScreenPorch  PoolArea  MiscVal  MoSold  YrSold
619  RL        Pave    NaN    Reg       Lvl          AllPub     Inside     Gtl        Norm        Norm        ...  774         0           108          0              0          260          0         0        7       2007
871  RL        Pave    NaN    Reg       Lvl          AllPub     Inside     Gtl        PosN        Norm        ...  308         0           0            0              0          0            0         0        8       2009
93   RL        Pave    Grvl   IR1       HLS          AllPub     Inside     Gtl        Norm        Norm        ...  432         0           0            44             0          0            0         0        8       2009
818  RL        Pave    NaN    IR1       Lvl          AllPub     CulDSac    Gtl        Norm        Norm        ...  857         150         59           0              0          0            0         0        7       2008
303  RL        Pave    NaN    IR1       Lvl          AllPub     Corner     Gtl        Norm        Norm        ...  843         468         81           0              0          0            0         0        1       2006

5 rows × 76 columns
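
To see how the column selection in the cell above behaves, here is a minimal sketch on a toy DataFrame. The data, the `toy` name, and the threshold of 3 are made up for illustration (the notebook itself uses a threshold of 10 on the Ames data).

```python
import pandas as pd

# Toy data (hypothetical, not the Ames dataset)
toy = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", "Pave", "Pave"], dtype=object),
    "Neighborhood": pd.Series(["A", "B", "C", "D"], dtype=object),
    "LotArea": pd.Series([8450, 9600, 11250, 9550], dtype="int64"),
    "LotFrontage": pd.Series([65.0, None, 68.0, 60.0], dtype="float64"),
})

# Hypothetical threshold for this toy example; the notebook uses 10
max_cardinality = 3

# Text columns with few unique values are kept for one-hot encoding
low_card_cols = [c for c in toy.columns
                 if toy[c].dtype == "object" and toy[c].nunique() < max_cardinality]

# Numerical columns are kept regardless of how many distinct values they have
num_cols = [c for c in toy.columns if toy[c].dtype in ["int64", "float64"]]

print(low_card_cols)
print(num_cols)
```

Here `Neighborhood` is dropped because it has four unique values, which is not below the toy threshold. Note that `nunique()` ignores missing values by default.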

The next code cell uses code from the tutorial to preprocess the data and train a model. Run this code without changes.

from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='constant')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Bundle preprocessing and modeling code in a pipeline
clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('model', model)
                     ])

# Preprocessing of training data, fit model 
clf.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = clf.predict(X_valid)

print('MAE:', mean_absolute_error(y_valid, preds))
MAE: 17861.780102739725

The code yields a value around 17862 for the mean absolute error (MAE). In the next step, you will amend the code to do better.
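
If it is unclear what the preprocessor is doing to each column, the following sketch applies the same kind of `ColumnTransformer` to a tiny made-up frame. The data and the names `train`, `new`, `toy_preprocessor`, and `to_dense` are assumptions for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy training data with one missing value per column (hypothetical)
train = pd.DataFrame({
    "Street": pd.Series(["Pave", "Grvl", np.nan, "Pave"], dtype=object),
    "LotArea": [8450.0, np.nan, 11250.0, 9550.0],
})
# A new row with a category that never appeared in training
new = pd.DataFrame({
    "Street": pd.Series(["Dirt"], dtype=object),
    "LotArea": [np.nan],
})

toy_preprocessor = ColumnTransformer(transformers=[
    # strategy='constant' fills numerical gaps with 0 by default
    ("num", SimpleImputer(strategy="constant"), ["LotArea"]),
    ("cat", Pipeline(steps=[
        ("imputer", SimpleImputer(strategy="most_frequent")),
        ("onehot", OneHotEncoder(handle_unknown="ignore")),
    ]), ["Street"]),
])

def to_dense(m):
    # ColumnTransformer may return a sparse matrix; convert for printing
    return m.toarray() if hasattr(m, "toarray") else np.asarray(m)

train_out = to_dense(toy_preprocessor.fit_transform(train))
new_out = to_dense(toy_preprocessor.transform(new))

print(train_out)  # columns: LotArea, Street=Grvl, Street=Pave
print(new_out)    # unseen category is encoded as all zeros
```

The missing `Street` value is replaced with the most frequent category before encoding, and `handle_unknown='ignore'` means an unseen category produces a row of zeros instead of raising an error.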

Step 1: Improve the performance

Part A

Now, it’s your turn! In the code cell below, define your own preprocessing steps and random forest model. Fill in values for the following variables:

  • numerical_transformer
  • categorical_transformer
  • model

To pass this part of the exercise, you need only define valid preprocessing steps and a random forest model.

X_train.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 1168 entries, 619 to 685
Data columns (total 76 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   MSZoning       1168 non-null   object 
 1   Street         1168 non-null   object 
 2   Alley          71 non-null     object 
 3   LotShape       1168 non-null   object 
 4   LandContour    1168 non-null   object 
 5   Utilities      1168 non-null   object 
 6   LotConfig      1168 non-null   object 
 7   LandSlope      1168 non-null   object 
 8   Condition1     1168 non-null   object 
 9   Condition2     1168 non-null   object 
 10  BldgType       1168 non-null   object 
 11  HouseStyle     1168 non-null   object 
 12  RoofStyle      1168 non-null   object 
 13  RoofMatl       1168 non-null   object 
 14  MasVnrType     1162 non-null   object 
 15  ExterQual      1168 non-null   object 
 16  ExterCond      1168 non-null   object 
 17  Foundation     1168 non-null   object 
 18  BsmtQual       1140 non-null   object 
 19  BsmtCond       1140 non-null   object 
 20  BsmtExposure   1140 non-null   object 
 21  BsmtFinType1   1140 non-null   object 
 22  BsmtFinType2   1139 non-null   object 
 23  Heating        1168 non-null   object 
 24  HeatingQC      1168 non-null   object 
 25  CentralAir     1168 non-null   object 
 26  Electrical     1167 non-null   object 
 27  KitchenQual    1168 non-null   object 
 28  Functional     1168 non-null   object 
 29  FireplaceQu    617 non-null    object 
 30  GarageType     1110 non-null   object 
 31  GarageFinish   1110 non-null   object 
 32  GarageQual     1110 non-null   object 
 33  GarageCond     1110 non-null   object 
 34  PavedDrive     1168 non-null   object 
 35  PoolQC         4 non-null      object 
 36  Fence          214 non-null    object 
 37  MiscFeature    49 non-null     object 
 38  SaleType       1168 non-null   object 
 39  SaleCondition  1168 non-null   object 
 40  MSSubClass     1168 non-null   int64  
 41  LotFrontage    956 non-null    float64
 42  LotArea        1168 non-null   int64  
 43  OverallQual    1168 non-null   int64  
 44  OverallCond    1168 non-null   int64  
 45  YearBuilt      1168 non-null   int64  
 46  YearRemodAdd   1168 non-null   int64  
 47  MasVnrArea     1162 non-null   float64
 48  BsmtFinSF1     1168 non-null   int64  
 49  BsmtFinSF2     1168 non-null   int64  
 50  BsmtUnfSF      1168 non-null   int64  
 51  TotalBsmtSF    1168 non-null   int64  
 52  1stFlrSF       1168 non-null   int64  
 53  2ndFlrSF       1168 non-null   int64  
 54  LowQualFinSF   1168 non-null   int64  
 55  GrLivArea      1168 non-null   int64  
 56  BsmtFullBath   1168 non-null   int64  
 57  BsmtHalfBath   1168 non-null   int64  
 58  FullBath       1168 non-null   int64  
 59  HalfBath       1168 non-null   int64  
 60  BedroomAbvGr   1168 non-null   int64  
 61  KitchenAbvGr   1168 non-null   int64  
 62  TotRmsAbvGrd   1168 non-null   int64  
 63  Fireplaces     1168 non-null   int64  
 64  GarageYrBlt    1110 non-null   float64
 65  GarageCars     1168 non-null   int64  
 66  GarageArea     1168 non-null   int64  
 67  WoodDeckSF     1168 non-null   int64  
 68  OpenPorchSF    1168 non-null   int64  
 69  EnclosedPorch  1168 non-null   int64  
 70  3SsnPorch      1168 non-null   int64  
 71  ScreenPorch    1168 non-null   int64  
 72  PoolArea       1168 non-null   int64  
 73  MiscVal        1168 non-null   int64  
 74  MoSold         1168 non-null   int64  
 75  YrSold         1168 non-null   int64  
dtypes: float64(3), int64(33), object(40)
memory usage: 702.6+ KB
# Preprocessing for numerical data
numerical_transformer = SimpleImputer(strategy='mean')

# Preprocessing for categorical data
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))
])

# Bundle preprocessing for numerical and categorical data
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numerical_transformer, numerical_cols),
        ('cat', categorical_transformer, categorical_cols)
    ])

# Define model
model = RandomForestRegressor(n_estimators=100, random_state=0)

# Check your answer
step_1.a.check()

Correct

# Lines below will give you a hint or solution code
step_1.a.hint()
#step_1.a.solution()

Hint: While there are many different potential solutions to this problem, we achieved satisfactory results by changing only column_transformer from the default value - specifically, we changed the strategy parameter that decides how missing values are imputed.
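
As a rough illustration of what the `strategy` parameter changes, the sketch below fills the same missing value with each of the four `SimpleImputer` strategies. The column values are made up, and the `filled` dictionary is just a convenience for this example.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# One numerical column with a single missing entry (toy values)
col = np.array([[1.0], [2.0], [2.0], [np.nan], [10.0]])

filled = {}
for strategy in ["mean", "median", "most_frequent", "constant"]:
    imputer = SimpleImputer(strategy=strategy)
    # Record the value that replaced the NaN in row 3
    filled[strategy] = float(imputer.fit_transform(col)[3, 0])

print(filled)
```

With these values, the mean is pulled up by the outlier (3.75), the median and most frequent value are both 2.0, and `constant` falls back to 0 for numerical data when no `fill_value` is given. Which one works best for a given dataset is something to check against validation MAE rather than assume.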

Part B

Run the code cell below without changes.

To pass this step, you need to have defined a pipeline in Part A that achieves lower MAE than the code above. You’re encouraged to take your time here and try out many different approaches, to see how low you can get the MAE! (If your code does not pass, please amend the preprocessing steps and model in Part A.)

# Bundle preprocessing and modeling code in a pipeline
my_pipeline = Pipeline(steps=[('preprocessor', preprocessor),
                              ('model', model)
                             ])

# Preprocessing of training data, fit model 
my_pipeline.fit(X_train, y_train)

# Preprocessing of validation data, get predictions
preds = my_pipeline.predict(X_valid)

# Evaluate the model
score = mean_absolute_error(y_valid, preds)
print('MAE:', score)

# Check your answer
step_1.b.check()
MAE: 17479.87044520548

Correct

# Line below will give you a hint
#step_1.b.hint()
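
One way to try out many approaches without copying cells is to wrap the pipeline in a small function and loop over candidate settings. The sketch below does this on synthetic data so that it runs on its own; the `score_pipeline` helper, the synthetic columns, and the candidate values are all assumptions, not part of the exercise.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.impute import SimpleImputer
from sklearn.metrics import mean_absolute_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Synthetic stand-in for the housing data (hypothetical columns and prices)
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({"area": rng.uniform(50, 250, 200),
                       "age": rng.uniform(0, 80, 200)})
y_demo = 1000 * X_demo["area"] - 500 * X_demo["age"] + rng.normal(0, 5000, 200)
X_demo.loc[X_demo.index[::7], "age"] = np.nan  # inject some missing values

Xtr, Xva, ytr, yva = train_test_split(X_demo, y_demo, train_size=0.8,
                                      test_size=0.2, random_state=0)

def score_pipeline(strategy, n_estimators):
    """Fit an imputer + random forest pipeline and return validation MAE."""
    pipe = Pipeline(steps=[
        ("imputer", SimpleImputer(strategy=strategy)),
        ("model", RandomForestRegressor(n_estimators=n_estimators, random_state=0)),
    ])
    pipe.fit(Xtr, ytr)
    return mean_absolute_error(yva, pipe.predict(Xva))

# Try every combination of imputation strategy and forest size
results = {(s, n): score_pipeline(s, n)
           for s in ["mean", "median", "constant"]
           for n in [20, 50]}

best = min(results, key=results.get)
print("Best setting:", best, "MAE:", results[best])
```

In the exercise itself you would build the pipeline from `preprocessor` and `model` and score it on `X_valid` and `y_valid` instead of the synthetic split.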

Step 2: Generate test predictions

Now, you’ll use your trained model to generate predictions with the test data.

# Preprocessing of test data, get predictions
preds_test = my_pipeline.predict(X_test)

# Check your answer
step_2.check()

Correct

# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()
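
Calling `predict` on a fitted pipeline reuses the preprocessing learned from the training data; nothing is refit on the test set. The sketch below checks this on toy data. It swaps in `LinearRegression` for the random forest so the numbers are easy to verify by hand, and all names and values are made up.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import Pipeline

# Toy data where y = 2 * x once the gap is filled with the training mean
train_X = pd.DataFrame({"x": [1.0, 2.0, 3.0, np.nan]})
train_y = [2.0, 4.0, 6.0, 4.0]
test_X = pd.DataFrame({"x": [np.nan, 100.0]})

pipe = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="mean")),
    ("model", LinearRegression()),  # stand-in for RandomForestRegressor
])
pipe.fit(train_X, train_y)

# The imputer stores the mean of the training column
learned_mean = pipe.named_steps["imputer"].statistics_[0]

# predict() imputes the test NaN with the training mean, then predicts
test_preds = pipe.predict(test_X)

# The stored statistic is unchanged after predicting on test data
mean_after_predict = pipe.named_steps["imputer"].statistics_[0]

print(learned_mean, test_preds, mean_after_predict)
```

The missing test value is filled with the training mean of 2.0, not with anything computed from the test rows, which is the behavior you want when generating test predictions.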

Run the next code cell without changes to save your results to a CSV file that can be submitted directly to the competition.

# Save test predictions to file
output = pd.DataFrame({'Id': X_test.index,
                       'SalePrice': preds_test})
output.to_csv('submission.csv', index=False)
output.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1459 entries, 0 to 1458
Data columns (total 2 columns):
 #   Column     Non-Null Count  Dtype  
---  ------     --------------  -----  
 0   Id         1459 non-null   int64  
 1   SalePrice  1459 non-null   float64
dtypes: float64(1), int64(1)
memory usage: 22.9 KB
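
Before submitting, it can help to read the file back and confirm it has the expected shape. This is a minimal sketch with made-up IDs and predictions written to a temporary directory; the specific checks are suggestions, not competition requirements stated in this exercise.

```python
import os
import tempfile

import numpy as np
import pandas as pd

# Hypothetical IDs and predictions standing in for X_test.index and preds_test
demo_index = pd.Index([1461, 1462, 1463], name="Id")
demo_preds = np.array([120000.0, 155000.5, 180250.0])

demo_output = pd.DataFrame({"Id": demo_index, "SalePrice": demo_preds})

# Write to a temporary location so the real submission.csv is not overwritten
path = os.path.join(tempfile.mkdtemp(), "submission.csv")
demo_output.to_csv(path, index=False)

# Read the file back and run a few sanity checks
check = pd.read_csv(path)
problems = []
if list(check.columns) != ["Id", "SalePrice"]:
    problems.append("unexpected columns")
if check["SalePrice"].isna().any():
    problems.append("missing predictions")
if check["Id"].duplicated().any():
    problems.append("duplicate ids")

print(problems if problems else "Submission file looks consistent")
```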

Submit your results

Once you have successfully completed Step 2, you’re ready to submit your results to the leaderboard! If you choose to do so, make sure that you have already joined the competition by clicking on the Join Competition button at this link.

  1. Begin by clicking on the Save Version button in the top right corner of the window. This will generate a pop-up window.
  2. Ensure that the Save and Run All option is selected, and then click on the Save button.
  3. This generates a window in the bottom left corner of the notebook. After it has finished running, click on the number to the right of the Save Version button. This pulls up a list of versions on the right of the screen. Click on the ellipsis (…) to the right of the most recent version, and select Open in Viewer. This brings you into view mode of the same page. You will need to scroll down to get back to these instructions.
  4. Click on the Output tab on the right of the screen. Then, click on the file you would like to submit, and click on the Submit button to submit your results to the leaderboard.

You have now successfully submitted to the competition!

If you want to keep working to improve your performance, select the Edit button in the top right of the screen. Then you can change your code and repeat the process. There’s a lot of room to improve, and you will climb up the leaderboard as you work.

Keep going

Move on to learn about cross-validation, a technique you can use to obtain more accurate estimates of model performance!


Have questions or comments? Visit the course discussion forum to chat with other learners.
