This notebook is an exercise in the Feature Engineering course. You can reference the tutorial at this link.


Introduction

In this exercise you’ll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we’re creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.

Run this cell to set everything up!

# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *

import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor


def score_dataset(X, y, model=XGBRegressor()):
    # Work on a copy so the caller's DataFrame is not modified in place
    X = X.copy()
    # Label encoding for categoricals
    for colname in X.select_dtypes(["category", "object"]):
        X[colname], _ = X[colname].factorize()
    # Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
    score = cross_val_score(
        model, X, y, cv=5, scoring="neg_mean_squared_log_error",
    )
    score = -1 * score.mean()
    score = np.sqrt(score)
    return score


# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
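
To make the metric concrete, here is a minimal sketch of RMSLE written with NumPy alone. The helper name `rmsle` and the prices are made up for illustration; the formula uses log(1 + x), which is the form scikit-learn's mean squared log error uses. Note that `score_dataset` averages the squared log error over the five folds before taking the square root, so its result can differ slightly from computing the metric over all rows at once.

```python
import numpy as np


def rmsle(y_true, y_pred):
    """Root mean squared log error, computed on log(1 + x)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return np.sqrt(np.mean((np.log1p(y_pred) - np.log1p(y_true)) ** 2))


# Hypothetical sale prices, not taken from the Ames data
y_true = [100_000, 200_000, 300_000]
y_pred = [110_000, 180_000, 330_000]
print(rmsle(y_true, y_pred))
```

Because the error is measured on a log scale, a 10% miss on an inexpensive house counts about as much as a 10% miss on an expensive one.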

Let’s start with a few mathematical combinations. We’ll focus on features describing areas: because they share the same units (square feet), they are easy to combine in sensible ways. Since we’re using XGBoost (a tree-based model), we’ll focus on ratios and sums, which tree-based models tend to have trouble learning on their own.

1) Create Mathematical Transforms

Create the following features:

  • LivLotRatio: the ratio of GrLivArea to LotArea
  • Spaciousness: the sum of FirstFlrSF and SecondFlrSF divided by TotRmsAbvGrd
  • TotalOutsideSF: the sum of WoodDeckSF, OpenPorchSF, EnclosedPorch, Threeseasonporch, and ScreenPorch
X.head()
MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities LotConfig ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YearSold SaleType SaleCondition
0 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 141.0 31770.0 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... 0.0 0.0 No_Pool No_Fence None 0.0 5 2010 WD Normal
1 One_Story_1946_and_Newer_All_Styles Residential_High_Density 80.0 11622.0 Pave No_Alley_Access Regular Lvl AllPub Inside ... 120.0 0.0 No_Pool Minimum_Privacy None 0.0 6 2010 WD Normal
2 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 81.0 14267.0 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Corner ... 0.0 0.0 No_Pool No_Fence Gar2 12500.0 6 2010 WD Normal
3 One_Story_1946_and_Newer_All_Styles Residential_Low_Density 93.0 11160.0 Pave No_Alley_Access Regular Lvl AllPub Corner ... 0.0 0.0 No_Pool No_Fence None 0.0 4 2010 WD Normal
4 Two_Story_1946_and_Newer Residential_Low_Density 74.0 13830.0 Pave No_Alley_Access Slightly_Irregular Lvl AllPub Inside ... 0.0 0.0 No_Pool Minimum_Privacy None 0.0 3 2010 WD Normal

5 rows × 78 columns

X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 78 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   MSSubClass        2930 non-null   object 
 1   MSZoning          2930 non-null   object 
 2   LotFrontage       2930 non-null   float64
 3   LotArea           2930 non-null   float64
 4   Street            2930 non-null   object 
 5   Alley             2930 non-null   object 
 6   LotShape          2930 non-null   object 
 7   LandContour       2930 non-null   object 
 8   Utilities         2930 non-null   object 
 9   LotConfig         2930 non-null   object 
 10  LandSlope         2930 non-null   object 
 11  Neighborhood      2930 non-null   object 
 12  Condition1        2930 non-null   object 
 13  Condition2        2930 non-null   object 
 14  BldgType          2930 non-null   object 
 15  HouseStyle        2930 non-null   object 
 16  OverallQual       2930 non-null   object 
 17  OverallCond       2930 non-null   object 
 18  YearBuilt         2930 non-null   int64  
 19  YearRemodAdd      2930 non-null   int64  
 20  RoofStyle         2930 non-null   object 
 21  RoofMatl          2930 non-null   object 
 22  Exterior1st       2930 non-null   object 
 23  Exterior2nd       2930 non-null   object 
 24  MasVnrType        2930 non-null   object 
 25  MasVnrArea        2930 non-null   float64
 26  ExterQual         2930 non-null   object 
 27  ExterCond         2930 non-null   object 
 28  Foundation        2930 non-null   object 
 29  BsmtQual          2930 non-null   object 
 30  BsmtCond          2930 non-null   object 
 31  BsmtExposure      2930 non-null   object 
 32  BsmtFinType1      2930 non-null   object 
 33  BsmtFinSF1        2930 non-null   float64
 34  BsmtFinType2      2930 non-null   object 
 35  BsmtFinSF2        2930 non-null   float64
 36  BsmtUnfSF         2930 non-null   float64
 37  TotalBsmtSF       2930 non-null   float64
 38  Heating           2930 non-null   object 
 39  HeatingQC         2930 non-null   object 
 40  CentralAir        2930 non-null   object 
 41  Electrical        2930 non-null   object 
 42  FirstFlrSF        2930 non-null   float64
 43  SecondFlrSF       2930 non-null   float64
 44  LowQualFinSF      2930 non-null   float64
 45  GrLivArea         2930 non-null   float64
 46  BsmtFullBath      2930 non-null   int64  
 47  BsmtHalfBath      2930 non-null   int64  
 48  FullBath          2930 non-null   int64  
 49  HalfBath          2930 non-null   int64  
 50  BedroomAbvGr      2930 non-null   int64  
 51  KitchenAbvGr      2930 non-null   int64  
 52  KitchenQual       2930 non-null   object 
 53  TotRmsAbvGrd      2930 non-null   int64  
 54  Functional        2930 non-null   object 
 55  Fireplaces        2930 non-null   int64  
 56  FireplaceQu       2930 non-null   object 
 57  GarageType        2930 non-null   object 
 58  GarageFinish      2930 non-null   object 
 59  GarageCars        2930 non-null   int64  
 60  GarageArea        2930 non-null   float64
 61  GarageQual        2930 non-null   object 
 62  GarageCond        2930 non-null   object 
 63  PavedDrive        2930 non-null   object 
 64  WoodDeckSF        2930 non-null   float64
 65  OpenPorchSF       2930 non-null   float64
 66  EnclosedPorch     2930 non-null   float64
 67  Threeseasonporch  2930 non-null   float64
 68  ScreenPorch       2930 non-null   float64
 69  PoolArea          2930 non-null   float64
 70  PoolQC            2930 non-null   object 
 71  Fence             2930 non-null   object 
 72  MiscFeature       2930 non-null   object 
 73  MiscVal           2930 non-null   float64
 74  MoSold            2930 non-null   int64  
 75  YearSold          2930 non-null   int64  
 76  SaleType          2930 non-null   object 
 77  SaleCondition     2930 non-null   object 
dtypes: float64(19), int64(13), object(46)
memory usage: 1.7+ MB
# YOUR CODE HERE
X_1 = pd.DataFrame()  # dataframe to hold new features

sf_columns = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]

X_1["LivLotRatio"] = X.GrLivArea / X.LotArea
X_1["Spaciousness"] = (X.FirstFlrSF + X.SecondFlrSF) / X.TotRmsAbvGrd
X_1["TotalOutsideSF"] = X[sf_columns].sum(axis=1)


# Check your answer
q_1.check()

Correct

# Lines below will give you a hint or solution code
# q_1.hint()
#q_1.solution()
X_1.head()
LivLotRatio Spaciousness TotalOutsideSF
0 0.052125 236.571429 272.0
1 0.077095 179.200000 260.0
2 0.093152 221.500000 429.0
3 0.189068 263.750000 0.0
4 0.117787 271.500000 246.0
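
As a quick sanity check, the same three formulas can be run on a tiny hand-made frame where the answers are easy to verify by eye. This is only a sketch: the column names match the exercise, but the two houses and the names `toy`, `outside`, and `feats` are invented for illustration.

```python
import pandas as pd

# Two made-up houses; the values are illustrative only
toy = pd.DataFrame({
    "GrLivArea": [1500.0, 2000.0],
    "LotArea": [10000.0, 8000.0],
    "FirstFlrSF": [1000.0, 1200.0],
    "SecondFlrSF": [500.0, 800.0],
    "TotRmsAbvGrd": [6, 8],
    "WoodDeckSF": [100.0, 0.0],
    "OpenPorchSF": [50.0, 0.0],
    "EnclosedPorch": [0.0, 0.0],
    "Threeseasonporch": [0.0, 0.0],
    "ScreenPorch": [0.0, 120.0],
})

outside = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch",
           "Threeseasonporch", "ScreenPorch"]

feats = pd.DataFrame({
    # Ratio: share of the lot covered by living area
    "LivLotRatio": toy.GrLivArea / toy.LotArea,
    # Ratio of a sum: average floor area per room
    "Spaciousness": (toy.FirstFlrSF + toy.SecondFlrSF) / toy.TotRmsAbvGrd,
    # Row-wise sum across the outdoor-area columns
    "TotalOutsideSF": toy[outside].sum(axis=1),
})
print(feats)
```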

If you’ve discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:

# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")

# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)

# Join the new features to the feature set
X = X.join(X_new)
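
The pattern above uses placeholder column names, so here is a small runnable sketch of the same idea on an invented three-row frame. The `dtype=float` argument is an addition of this sketch, not part of the original snippet: recent pandas versions return boolean dummies by default, and asking for floats makes the result type explicit.

```python
import pandas as pd

# Invented rows; only the column names come from the Ames data
toy = pd.DataFrame({
    "BldgType": ["OneFam", "Duplex", "OneFam"],
    "GrLivArea": [1656.0, 1200.0, 896.0],
})

# One indicator column per category
dummies = pd.get_dummies(toy.BldgType, prefix="Bldg", dtype=float)

# Multiply each row of indicators by that row's GrLivArea
interaction = dummies.mul(toy.GrLivArea, axis=0)
print(interaction)
```

Each row ends up with its living area in the column for its own building type and zeros elsewhere, which lets a model fit a separate area effect per building type.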

2) Interaction with a Categorical

We discovered an interaction between BldgType and GrLivArea in Exercise 2. Now create their interaction features.

# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(X.BldgType, prefix="Bldg")
# Multiply
X_2 = X_2.mul(X.GrLivArea, axis=0)


# Check your answer
q_2.check()

Correct

X_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 5 columns):
 #   Column         Non-Null Count  Dtype  
---  ------         --------------  -----  
 0   Bldg_Duplex    2930 non-null   float64
 1   Bldg_OneFam    2930 non-null   float64
 2   Bldg_Twnhs     2930 non-null   float64
 3   Bldg_TwnhsE    2930 non-null   float64
 4   Bldg_TwoFmCon  2930 non-null   float64
dtypes: float64(5)
memory usage: 114.6 KB
X_2.head()
Bldg_Duplex Bldg_OneFam Bldg_Twnhs Bldg_TwnhsE Bldg_TwoFmCon
0 0.0 1656.0 0.0 0.0 0.0
1 0.0 896.0 0.0 0.0 0.0
2 0.0 1329.0 0.0 0.0 0.0
3 0.0 2110.0 0.0 0.0 0.0
4 0.0 1629.0 0.0 0.0 0.0
# Lines below will give you a hint or solution code
q_2.hint()
#q_2.solution()

Hint: Your code should look something like:

X_2 = pd.get_dummies(____, prefix="Bldg")
X_2 = X_2.mul(____, axis=0)

3) Count Feature

Let’s try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature PorchTypes that counts how many of the following are greater than 0.0:

  • WoodDeckSF
  • OpenPorchSF
  • EnclosedPorch
  • Threeseasonporch
  • ScreenPorch

X_3 = pd.DataFrame()

# YOUR CODE HERE
porch_columns = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]
X_3["PorchTypes"] = X[porch_columns].gt(0).sum(axis=1)


# Check your answer
q_3.check()

Correct

# Lines below will give you a hint or solution code
#q_3.hint()
#q_3.solution()
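
The counting step can be seen in isolation on a small invented frame: `gt(0)` turns each area into True or False, and summing across columns counts the True values per row. The frame `toy` and its values are hypothetical, and only three of the five columns are used to keep it short.

```python
import pandas as pd

# Three made-up houses with a subset of the outdoor-area columns
toy = pd.DataFrame({
    "WoodDeckSF": [210.0, 0.0, 0.0],
    "OpenPorchSF": [62.0, 0.0, 34.0],
    "ScreenPorch": [0.0, 0.0, 120.0],
})

# Boolean mask of "has this kind of area", then count per row
porch_types = toy.gt(0).sum(axis=1)
print(porch_types)
```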

4) Break Down a Categorical Feature

MSSubClass describes the type of a dwelling:

df.MSSubClass.unique()
array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
       'One_Story_PUD_1946_and_Newer',
       'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
       'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
       'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
       'Two_Family_conversion_All_Styles_and_Ages',
       'One_and_Half_Story_Unfinished_All_Ages',
       'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
       'One_Story_with_Finished_Attic_All_Ages',
       'PUD_Multilevel_Split_Level_Foyer',
       'One_and_Half_Story_PUD_All_Ages'], dtype=object)

You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting MSSubClass at the first underscore _. (Hint: pass n=1 to the split method so that it splits only once.)

X_4 = pd.DataFrame()

# YOUR CODE HERE
X_4["MSClass"] = df.MSSubClass.str.split(pat="_", n=1, expand=True)[0]

# Check your answer
q_4.check()

Correct

# Lines below will give you a hint or solution code
q_4.hint()
#q_4.solution()

Hint: Your code should look something like:

X_4 = pd.DataFrame()

X_4["MSClass"] = df.____.str.____(____, n=1, expand=True)[____]
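
To see what `n=1` and `expand=True` do, here is a short sketch on three of the category strings listed above. The variable names `s`, `parts`, and `first` are chosen for this illustration only.

```python
import pandas as pd

s = pd.Series([
    "One_Story_1946_and_Newer_All_Styles",
    "Split_Foyer",
    "Duplex_All_Styles_and_Ages",
])

# n=1 splits at the first underscore only; expand=True returns a DataFrame
# with one column per piece (column 0 = first word, column 1 = the rest)
parts = s.str.split("_", n=1, expand=True)
first = parts[0]
print(parts)
```

Without `n=1`, every underscore would produce a new column and the shorter strings would be padded with missing values.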

5) Use a Grouped Transform

The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature MedNhbdArea that describes the median of GrLivArea grouped on Neighborhood.

X_5 = pd.DataFrame()

# YOUR CODE HERE
X_5["MedNhbdArea"] = (
    X.groupby("Neighborhood")["GrLivArea"].transform("median")
)

# Check your answer
q_5.check()

Correct

X_5.MedNhbdArea.unique()
array([1200. , 1560. , 1767. , 1632. , 1555. , 1092. , 1322. , 1832. ,
       1455.5, 2418. , 1575. , 1052. , 1226. , 1231. , 1374. , 1128. ,
       1694. , 1536.5, 1195.5, 1504. , 1648. , 1118. , 1282. , 1650.5,
       1706.5, 1398.5, 1320. ])
# Lines below will give you a hint or solution code
#q_5.hint()
#q_5.solution()
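
A grouped transform returns one value per row rather than one per group, which is what lets it be assigned straight back as a column. The sketch below shows this on an invented frame with two neighborhoods. The extra column `AreaVsMedian` is a hypothetical follow-up feature, not part of the exercise; it is one possible way to express how a home compares with typical homes nearby.

```python
import pandas as pd

# Made-up data: three homes in neighborhood "A", two in "B"
toy = pd.DataFrame({
    "Neighborhood": ["A", "A", "A", "B", "B"],
    "GrLivArea": [1000.0, 1200.0, 2000.0, 900.0, 1100.0],
})

# Each row receives the median living area of its own neighborhood
toy["MedNhbdArea"] = (
    toy.groupby("Neighborhood")["GrLivArea"].transform("median")
)

# Hypothetical relative feature: this home's area versus the local median
toy["AreaVsMedian"] = toy.GrLivArea / toy.MedNhbdArea
print(toy)
```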

Now you’ve made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:

X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)
0.13847331710099203

Keep Going

Untangle spatial relationships by adding cluster labels to your dataset.


Have questions or comments? Visit the course discussion forum to chat with other learners.
