Pandas Data Types and Missing Values
This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
Run the following cell to load your data and some utility functions.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.data_types_and_missing_data import *
print("Setup complete.")
Setup complete.
Exercises
1.
What is the data type of the points column in the dataset?
# Your code here
dtype = reviews.points.dtype
# Check your answer
q1.check()
Correct
#q1.hint()
#q1.solution()
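The dtype lookup used above can be tried on a small frame without loading the full CSV. The sketch below assumes a made-up three-row DataFrame named toy (not part of the exercise data) and shows both the single-column dtype attribute and the dtypes attribute, which reports every column at once.

```python
import pandas as pd

# Hypothetical toy data standing in for the wine reviews (values are made up).
toy = pd.DataFrame({
    "points": [87, 90, 86],
    "price": [15.0, None, 22.5],
})

# dtype of one column, as in the exercise answer.
points_dtype = toy.points.dtype

# dtypes returns a Series mapping each column name to its dtype.
all_dtypes = toy.dtypes

print(points_dtype)
print(all_dtypes)
```

Note that the price column comes out as a float type here: a numeric column that contains a missing value is stored as floating point, because NaN is a float.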
2.
Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.
point_strings = reviews.points.astype('str')
# Check your answer
q2.check()
Correct
#q2.hint()
#q2.solution()
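As a minimal sketch of the same conversion, the snippet below applies astype to a small hand-written Series (the values are illustrative, not taken from the dataset). Passing the built-in str works the same way as passing the string 'str' used above.

```python
import pandas as pd

# Hypothetical stand-in for reviews.points.
points = pd.Series([87, 90, 86], name="points")

# Convert every entry to a Python string.
as_strings = points.astype(str)

print(as_strings.tolist())  # ['87', '90', '86']
```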
3.
Some entries in the price column are null. How many reviews in the dataset are missing a price?
reviews[pd.isnull(reviews.price)]
 | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
13 | Italy | This is dominated by oak and oak-driven aromas... | Rosso | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Masseria Setteporte 2012 Rosso (Etna) | Nerello Mascalese | Masseria Setteporte |
30 | France | Red cherry fruit comes laced with light tannin... | Nouveau | 86 | NaN | Beaujolais | Beaujolais-Villages | NaN | Roger Voss | @vossroger | Domaine de la Madone 2012 Nouveau (Beaujolais... | Gamay | Domaine de la Madone |
31 | Italy | Merlot and Nero d'Avola form the base for this... | Calanìca Nero d'Avola-Merlot | 86 | NaN | Sicily & Sardinia | Sicilia | NaN | NaN | NaN | Duca di Salaparuta 2010 Calanìca Nero d'Avola-... | Red Blend | Duca di Salaparuta |
32 | Italy | Part of the extended Calanìca series, this Gri... | Calanìca Grillo-Viognier | 86 | NaN | Sicily & Sardinia | Sicilia | NaN | NaN | NaN | Duca di Salaparuta 2011 Calanìca Grillo-Viogni... | White Blend | Duca di Salaparuta |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
129844 | Italy | Doga delle Clavule is a neutral, mineral-drive... | Doga delle Clavule | 86 | NaN | Tuscany | Morellino di Scansano | NaN | NaN | NaN | Caparzo 2006 Doga delle Clavule (Morellino di... | Sangiovese | Caparzo |
129860 | Portugal | This rich wine has a firm structure as well as... | Pacheca Superior | 90 | NaN | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta da Pacheca 2013 Pacheca Superior Red (D... | Portuguese Red | Quinta da Pacheca |
129863 | Portugal | This mature wine that has 50% Touriga Nacional... | Reserva | 90 | NaN | Dão | NaN | NaN | Roger Voss | @vossroger | Seacampo 2011 Reserva Red (Dão) | Portuguese Red | Seacampo |
129893 | Italy | Aromas of passion fruit, hay and a vegetal not... | Corte Menini | 91 | NaN | Veneto | Soave Classico | NaN | Kerin O’Keefe | @kerinokeefe | Le Mandolare 2015 Corte Menini (Soave Classico) | Garganega | Le Mandolare |
129964 | France | Initially quite muted, this wine slowly develo... | Domaine Saint-Rémy Herrenweg | 90 | NaN | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Ehrhart 2013 Domaine Saint-Rémy Herren... | Gewürztraminer | Domaine Ehrhart |
8996 rows × 13 columns
type(reviews[pd.isnull(reviews.price)])
pandas.core.frame.DataFrame
n_missing_prices = len(reviews[pd.isnull(reviews.price)])
# Check your answer
q3.check()
Correct
n_missing_prices
8996
q3.hint()
#q3.solution()
Hint: Use pd.isnull().
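The answer above filters the DataFrame and takes its length. An equivalent approach, sketched here on a made-up four-row frame (toy is a hypothetical name, not part of the exercise), is to sum the boolean mask directly, since True counts as 1. This avoids building the filtered DataFrame.

```python
import pandas as pd

# Hypothetical toy data: two of the four prices are missing.
toy = pd.DataFrame({"price": [15.0, None, 22.5, None]})

# Approach used in the exercise: filter rows, then count them.
n_missing_len = len(toy[pd.isnull(toy.price)])

# Alternative: sum the boolean mask without materializing the rows.
n_missing_sum = toy.price.isnull().sum()

print(n_missing_len, n_missing_sum)  # 2 2
```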
4.
What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:
Unknown 21247
Napa Valley 4480
...
Bardolino Superiore 1
Primitivo del Tarantino 1
Name: region_1, Length: 1230, dtype: int64
reviews.region_1.fillna('Unknown').value_counts()
Unknown 21247
Napa Valley 4480
Columbia Valley (WA) 4124
Russian River Valley 3091
California 2629
...
Lamezia 1
Trentino Superiore 1
Grave del Friuli 1
Vin Santo di Carmignano 1
Paestum 1
Name: region_1, Length: 1230, dtype: int64
reviews_per_region = reviews.region_1.fillna('Unknown').value_counts()
# Check your answer
q4.check()
Correct
#q4.hint()
#q4.solution()
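No explicit sort appears in the answer above because value_counts already returns counts in descending order by default. The sketch below illustrates this on a short hand-written Series (the region values are a hypothetical sample, chosen so that every count is distinct).

```python
import pandas as pd

# Hypothetical sample of region_1 values; None marks missing data.
regions = pd.Series(
    ["Etna", None, "Napa Valley", "Napa Valley", None, None],
    name="region_1",
)

# Replace missing values first, then count occurrences of each value.
# value_counts sorts by count, largest first, unless told otherwise.
counts = regions.fillna("Unknown").value_counts()

print(counts)
```

If fillna were skipped, value_counts would drop the missing entries by default, so the Unknown row would not appear at all.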
Keep going
Move on to renaming and combining.
Have questions or comments? Visit the course discussion forum to chat with other learners.