Creating Featues
This notebook is an exercise in the Feature Engineering course. You can reference the tutorial at this link.
Introduction
In this exercise you’ll start developing the features you identified in Exercise 2 as having the most potential. As you work through this exercise, you might take a moment to look at the data documentation again and consider whether the features we’re creating make sense from a real-world perspective, and whether there are any useful combinations that stand out to you.
Run this cell to set everything up!
# Setup feedback system
from learntools.core import binder
binder.bind(globals())
from learntools.feature_engineering_new.ex3 import *
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from xgboost import XGBRegressor
def score_dataset(X, y, model=XGBRegressor()):
# Label encoding for categoricals
for colname in X.select_dtypes(["category", "object"]):
X[colname], _ = X[colname].factorize()
# Metric for Housing competition is RMSLE (Root Mean Squared Log Error)
score = cross_val_score(
model, X, y, cv=5, scoring="neg_mean_squared_log_error",
)
score = -1 * score.mean()
score = np.sqrt(score)
return score
# Prepare data
df = pd.read_csv("../input/fe-course-data/ames.csv")
X = df.copy()
y = X.pop("SalePrice")
Let’s start with a few mathematical combinations. We’ll focus on features describing areas – having the same units (square-feet) makes it easy to combine them in sensible ways. Since we’re using XGBoost (a tree-based model), we’ll focus on ratios and sums.
1) Create Mathematical Transforms
Create the following features:
LivLotRatio
: the ratio ofGrLivArea
toLotArea
Spaciousness
: the sum ofFirstFlrSF
andSecondFlrSF
divided byTotRmsAbvGrd
TotalOutsideSF
: the sum ofWoodDeckSF
,OpenPorchSF
,EnclosedPorch
,Threeseasonporch
, andScreenPorch
X.head()
MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | LotConfig | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YearSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 141.0 | 31770.0 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Corner | ... | 0.0 | 0.0 | No_Pool | No_Fence | None | 0.0 | 5 | 2010 | WD | Normal |
1 | One_Story_1946_and_Newer_All_Styles | Residential_High_Density | 80.0 | 11622.0 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Inside | ... | 120.0 | 0.0 | No_Pool | Minimum_Privacy | None | 0.0 | 6 | 2010 | WD | Normal |
2 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 81.0 | 14267.0 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Corner | ... | 0.0 | 0.0 | No_Pool | No_Fence | Gar2 | 12500.0 | 6 | 2010 | WD | Normal |
3 | One_Story_1946_and_Newer_All_Styles | Residential_Low_Density | 93.0 | 11160.0 | Pave | No_Alley_Access | Regular | Lvl | AllPub | Corner | ... | 0.0 | 0.0 | No_Pool | No_Fence | None | 0.0 | 4 | 2010 | WD | Normal |
4 | Two_Story_1946_and_Newer | Residential_Low_Density | 74.0 | 13830.0 | Pave | No_Alley_Access | Slightly_Irregular | Lvl | AllPub | Inside | ... | 0.0 | 0.0 | No_Pool | Minimum_Privacy | None | 0.0 | 3 | 2010 | WD | Normal |
5 rows × 78 columns
X.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 78 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MSSubClass 2930 non-null object
1 MSZoning 2930 non-null object
2 LotFrontage 2930 non-null float64
3 LotArea 2930 non-null float64
4 Street 2930 non-null object
5 Alley 2930 non-null object
6 LotShape 2930 non-null object
7 LandContour 2930 non-null object
8 Utilities 2930 non-null object
9 LotConfig 2930 non-null object
10 LandSlope 2930 non-null object
11 Neighborhood 2930 non-null object
12 Condition1 2930 non-null object
13 Condition2 2930 non-null object
14 BldgType 2930 non-null object
15 HouseStyle 2930 non-null object
16 OverallQual 2930 non-null object
17 OverallCond 2930 non-null object
18 YearBuilt 2930 non-null int64
19 YearRemodAdd 2930 non-null int64
20 RoofStyle 2930 non-null object
21 RoofMatl 2930 non-null object
22 Exterior1st 2930 non-null object
23 Exterior2nd 2930 non-null object
24 MasVnrType 2930 non-null object
25 MasVnrArea 2930 non-null float64
26 ExterQual 2930 non-null object
27 ExterCond 2930 non-null object
28 Foundation 2930 non-null object
29 BsmtQual 2930 non-null object
30 BsmtCond 2930 non-null object
31 BsmtExposure 2930 non-null object
32 BsmtFinType1 2930 non-null object
33 BsmtFinSF1 2930 non-null float64
34 BsmtFinType2 2930 non-null object
35 BsmtFinSF2 2930 non-null float64
36 BsmtUnfSF 2930 non-null float64
37 TotalBsmtSF 2930 non-null float64
38 Heating 2930 non-null object
39 HeatingQC 2930 non-null object
40 CentralAir 2930 non-null object
41 Electrical 2930 non-null object
42 FirstFlrSF 2930 non-null float64
43 SecondFlrSF 2930 non-null float64
44 LowQualFinSF 2930 non-null float64
45 GrLivArea 2930 non-null float64
46 BsmtFullBath 2930 non-null int64
47 BsmtHalfBath 2930 non-null int64
48 FullBath 2930 non-null int64
49 HalfBath 2930 non-null int64
50 BedroomAbvGr 2930 non-null int64
51 KitchenAbvGr 2930 non-null int64
52 KitchenQual 2930 non-null object
53 TotRmsAbvGrd 2930 non-null int64
54 Functional 2930 non-null object
55 Fireplaces 2930 non-null int64
56 FireplaceQu 2930 non-null object
57 GarageType 2930 non-null object
58 GarageFinish 2930 non-null object
59 GarageCars 2930 non-null int64
60 GarageArea 2930 non-null float64
61 GarageQual 2930 non-null object
62 GarageCond 2930 non-null object
63 PavedDrive 2930 non-null object
64 WoodDeckSF 2930 non-null float64
65 OpenPorchSF 2930 non-null float64
66 EnclosedPorch 2930 non-null float64
67 Threeseasonporch 2930 non-null float64
68 ScreenPorch 2930 non-null float64
69 PoolArea 2930 non-null float64
70 PoolQC 2930 non-null object
71 Fence 2930 non-null object
72 MiscFeature 2930 non-null object
73 MiscVal 2930 non-null float64
74 MoSold 2930 non-null int64
75 YearSold 2930 non-null int64
76 SaleType 2930 non-null object
77 SaleCondition 2930 non-null object
dtypes: float64(19), int64(13), object(46)
memory usage: 1.7+ MB
# YOUR CODE HERE
X_1 = pd.DataFrame() # dataframe to hold new features
sf_columns = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]
X_1["LivLotRatio"] = X.GrLivArea / X.LotArea
X_1["Spaciousness"] = (X.FirstFlrSF + X.SecondFlrSF) / X.TotRmsAbvGrd
X_1["TotalOutsideSF"] = X[sf_columns].sum(axis=1)
# Check your answer
q_1.check()
<IPython.core.display.Javascript object>
Correct
# Lines below will give you a hint or solution code
# q_1.hint()
#q_1.solution()
X_1.head()
LivLotRatio | Spaciousness | TotalOutsideSF | |
---|---|---|---|
0 | 0.052125 | 236.571429 | 272.0 |
1 | 0.077095 | 179.200000 | 260.0 |
2 | 0.093152 | 221.500000 | 429.0 |
3 | 0.189068 | 263.750000 | 0.0 |
4 | 0.117787 | 271.500000 | 246.0 |
If you’ve discovered an interaction effect between a numeric feature and a categorical feature, you might want to model it explicitly using a one-hot encoding, like so:
# One-hot encode Categorical feature, adding a column prefix "Cat"
X_new = pd.get_dummies(df.Categorical, prefix="Cat")
# Multiply row-by-row
X_new = X_new.mul(df.Continuous, axis=0)
# Join the new features to the feature set
X = X.join(X_new)
2) Interaction with a Categorical
We discovered an interaction between BldgType
and GrLivArea
in Exercise 2. Now create their interaction features.
# YOUR CODE HERE
# One-hot encode BldgType. Use `prefix="Bldg"` in `get_dummies`
X_2 = pd.get_dummies(X.BldgType, prefix="Bldg")
# Multiply
X_2 = X_2.mul(X.GrLivArea, axis=0)
# Check your answer
q_2.check()
<IPython.core.display.Javascript object>
Correct
X_2.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2930 entries, 0 to 2929
Data columns (total 5 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Bldg_Duplex 2930 non-null float64
1 Bldg_OneFam 2930 non-null float64
2 Bldg_Twnhs 2930 non-null float64
3 Bldg_TwnhsE 2930 non-null float64
4 Bldg_TwoFmCon 2930 non-null float64
dtypes: float64(5)
memory usage: 114.6 KB
X_2.head()
Bldg_Duplex | Bldg_OneFam | Bldg_Twnhs | Bldg_TwnhsE | Bldg_TwoFmCon | |
---|---|---|---|---|---|
0 | 0.0 | 1656.0 | 0.0 | 0.0 | 0.0 |
1 | 0.0 | 896.0 | 0.0 | 0.0 | 0.0 |
2 | 0.0 | 1329.0 | 0.0 | 0.0 | 0.0 |
3 | 0.0 | 2110.0 | 0.0 | 0.0 | 0.0 |
4 | 0.0 | 1629.0 | 0.0 | 0.0 | 0.0 |
# Lines below will give you a hint or solution code
q_2.hint()
#q_2.solution()
<IPython.core.display.Javascript object>
Hint: Your code should look something like:
X_2 = pd.get_dummies(____, prefix="Bldg")
X_2 = X_2.mul(____, axis=0)
3) Count Feature
Let’s try creating a feature that describes how many kinds of outdoor areas a dwelling has. Create a feature PorchTypes
that counts how many of the following are greater than 0.0:
WoodDeckSF
OpenPorchSF
EnclosedPorch
Threeseasonporch
ScreenPorch
X_3 = pd.DataFrame()
# YOUR CODE HERE
porch_columns = ["WoodDeckSF", "OpenPorchSF", "EnclosedPorch", "Threeseasonporch", "ScreenPorch"]
X_3["PorchTypes"] = X[porch_columns].gt(0).sum(axis=1)
# Check your answer
q_3.check()
<IPython.core.display.Javascript object>
Correct
# Lines below will give you a hint or solution code
#q_3.hint()
#q_3.solution()
4) Break Down a Categorical Feature
MSSubClass
describes the type of a dwelling:
df.MSSubClass.unique()
array(['One_Story_1946_and_Newer_All_Styles', 'Two_Story_1946_and_Newer',
'One_Story_PUD_1946_and_Newer',
'One_and_Half_Story_Finished_All_Ages', 'Split_Foyer',
'Two_Story_PUD_1946_and_Newer', 'Split_or_Multilevel',
'One_Story_1945_and_Older', 'Duplex_All_Styles_and_Ages',
'Two_Family_conversion_All_Styles_and_Ages',
'One_and_Half_Story_Unfinished_All_Ages',
'Two_Story_1945_and_Older', 'Two_and_Half_Story_All_Ages',
'One_Story_with_Finished_Attic_All_Ages',
'PUD_Multilevel_Split_Level_Foyer',
'One_and_Half_Story_PUD_All_Ages'], dtype=object)
You can see that there is a more general categorization described (roughly) by the first word of each category. Create a feature containing only these first words by splitting MSSubClass
at the first underscore _
. (Hint: In the split
method use an argument n=1
.)
X_4 = pd.DataFrame()
# YOUR CODE HERE
X_4["MSClass"] = df.MSSubClass.str.split(pat="_", n=1, expand=True)[0]
# Check your answer
q_4.check()
<IPython.core.display.Javascript object>
Correct
# Lines below will give you a hint or solution code
q_4.hint()
#q_4.solution()
<IPython.core.display.Javascript object>
Hint: Your code should look something like:
X_4 = pd.DataFrame()
X_4["MSClass"] = df.____.str.____(____, n=1, expand=True)[____]
5) Use a Grouped Transform
The value of a home often depends on how it compares to typical homes in its neighborhood. Create a feature MedNhbdArea
that describes the median of GrLivArea
grouped on Neighborhood
.
X_5 = pd.DataFrame()
# YOUR CODE HERE
X_5["MedNhbdArea"] = (
X.groupby("Neighborhood")
["GrLivArea"]
.transform("median"))
# Check your answer
q_5.check()
<IPython.core.display.Javascript object>
Correct
X_5.MedNhbdArea.unique()
array([1200. , 1560. , 1767. , 1632. , 1555. , 1092. , 1322. , 1832. ,
1455.5, 2418. , 1575. , 1052. , 1226. , 1231. , 1374. , 1128. ,
1694. , 1536.5, 1195.5, 1504. , 1648. , 1118. , 1282. , 1650.5,
1706.5, 1398.5, 1320. ])
# Lines below will give you a hint or solution code
#q_5.hint()
#q_5.solution()
Now you’ve made your first new feature set! If you like, you can run the cell below to score the model with all of your new features added:
X_new = X.join([X_1, X_2, X_3, X_4, X_5])
score_dataset(X_new, y)
0.13847331710099203
Keep Going
Untangle spatial relationships by adding cluster labels to your dataset.
Have questions or comments? Visit the course discussion forum to chat with other learners.
Leave a comment