This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.


Introduction

Run the following cell to load your data and some utility functions.

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.data_types_and_missing_data import *
print("Setup complete.")
Setup complete.
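
As a side note on the `index_col=0` argument used above, here is a minimal sketch of what it changes. The CSV text is a made-up two-row stand-in, not the real wine reviews file, and the names `csv_text`, `with_index` and `without_index` are hypothetical.

```python
import io
import pandas as pd

# Made-up CSV whose first column is an unnamed row index, similar in shape
# to a file that was written with DataFrame.to_csv().
csv_text = ",country,points\n0,Italy,87\n1,France,86\n"

# index_col=0 uses the first column as the row index.
with_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

# Without it, the first column is kept as an ordinary data column.
without_index = pd.read_csv(io.StringIO(csv_text))

print(with_index.columns.tolist())
print(without_index.columns.tolist())
```

Without `index_col=0`, the unnamed first column shows up as an extra data column, which is usually not what you want for a file like this one.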

Exercises

1.

What is the data type of the points column in the dataset?

# Your code here
dtype = reviews.points.dtype

# Check your answer
q1.check()

Correct

#q1.hint()
#q1.solution()
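
To see how `dtype` behaves outside the exercise, the following sketch inspects a small, made-up DataFrame (the `toy` frame and its values are assumptions for illustration, not the real dataset). It also shows `dtypes`, which reports the type of every column at once.

```python
import pandas as pd

# Hypothetical stand-in for the wine reviews data; values are made up.
toy = pd.DataFrame({
    "points": [87, 86, 90],
    "price": [15.0, None, 30.0],   # None becomes NaN, so the column is floating point
    "country": ["Italy", "France", "Portugal"],
})

points_dtype = toy.points.dtype   # dtype of a single column
all_dtypes = toy.dtypes           # Series mapping each column name to its dtype

print(points_dtype)
print(all_dtypes)
```

Note that a column with missing numeric values is stored as a float type, because NaN is itself a floating-point value.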

2.

Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

point_strings = reviews.points.astype('str')

# Check your answer
q2.check()

Correct

#q2.hint()
#q2.solution()
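
A short sketch of what the conversion does, using a made-up three-element Series rather than the real `points` column. After `astype(str)`, each entry is a Python string, so string methods apply; the variable names here are hypothetical.

```python
import pandas as pd

# Made-up scores standing in for reviews.points.
points = pd.Series([87, 86, 90], name="points")

# astype(str) and astype('str') are equivalent here.
point_strings = points.astype(str)

first = point_strings.iloc[0]             # a Python str, not an int
joined = point_strings.str.cat(sep="-")   # string methods now work on the Series

print(type(first), first)
print(joined)
```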

3.

Sometimes the price column is null. How many reviews in the dataset are missing a price?

reviews[pd.isnull(reviews.price)]
| | country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery |
| 0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
| 13 | Italy | This is dominated by oak and oak-driven aromas... | Rosso | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Masseria Setteporte 2012 Rosso (Etna) | Nerello Mascalese | Masseria Setteporte |
| 30 | France | Red cherry fruit comes laced with light tannin... | Nouveau | 86 | NaN | Beaujolais | Beaujolais-Villages | NaN | Roger Voss | @vossroger | Domaine de la Madone 2012 Nouveau (Beaujolais... | Gamay | Domaine de la Madone |
| 31 | Italy | Merlot and Nero d'Avola form the base for this... | Calanìca Nero d'Avola-Merlot | 86 | NaN | Sicily & Sardinia | Sicilia | NaN | NaN | NaN | Duca di Salaparuta 2010 Calanìca Nero d'Avola-... | Red Blend | Duca di Salaparuta |
| 32 | Italy | Part of the extended Calanìca series, this Gri... | Calanìca Grillo-Viognier | 86 | NaN | Sicily & Sardinia | Sicilia | NaN | NaN | NaN | Duca di Salaparuta 2011 Calanìca Grillo-Viogni... | White Blend | Duca di Salaparuta |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 129844 | Italy | Doga delle Clavule is a neutral, mineral-drive... | Doga delle Clavule | 86 | NaN | Tuscany | Morellino di Scansano | NaN | NaN | NaN | Caparzo 2006 Doga delle Clavule (Morellino di... | Sangiovese | Caparzo |
| 129860 | Portugal | This rich wine has a firm structure as well as... | Pacheca Superior | 90 | NaN | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta da Pacheca 2013 Pacheca Superior Red (D... | Portuguese Red | Quinta da Pacheca |
| 129863 | Portugal | This mature wine that has 50% Touriga Nacional... | Reserva | 90 | NaN | Dão | NaN | NaN | Roger Voss | @vossroger | Seacampo 2011 Reserva Red (Dão) | Portuguese Red | Seacampo |
| 129893 | Italy | Aromas of passion fruit, hay and a vegetal not... | Corte Menini | 91 | NaN | Veneto | Soave Classico | NaN | Kerin O’Keefe | @kerinokeefe | Le Mandolare 2015 Corte Menini (Soave Classico) | Garganega | Le Mandolare |
| 129964 | France | Initially quite muted, this wine slowly develo... | Domaine Saint-Rémy Herrenweg | 90 | NaN | Alsace | Alsace | NaN | Roger Voss | @vossroger | Domaine Ehrhart 2013 Domaine Saint-Rémy Herren... | Gewürztraminer | Domaine Ehrhart |

8996 rows × 13 columns

type(reviews[pd.isnull(reviews.price)])
pandas.core.frame.DataFrame

n_missing_prices = len(reviews[pd.isnull(reviews.price)])

# Check your answer
q3.check()

Correct

n_missing_prices
8996

q3.hint()
#q3.solution()

Hint: Use pd.isnull().
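
The answer above filters the DataFrame and takes its length. An equivalent approach is to sum the boolean mask directly, since `True` counts as 1. The sketch below compares the two on a made-up four-row frame; `toy`, `by_len` and `by_sum` are hypothetical names, and this is one possible alternative rather than the course's official solution.

```python
import numpy as np
import pandas as pd

# Made-up prices with two missing values.
toy = pd.DataFrame({"price": [15.0, np.nan, 30.0, np.nan]})

# Approach used above: keep only the rows with a null price, then count them.
by_len = len(toy[pd.isnull(toy.price)])

# Alternative: sum the boolean mask, which avoids building the filtered DataFrame.
by_sum = toy.price.isnull().sum()

print(by_len, by_sum)
```

Both give the same count; the second form tends to be shorter when only the number is needed and not the rows themselves.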

4.

What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64

reviews.region_1.fillna('Unknown').value_counts()
Unknown                    21247
Napa Valley                 4480
Columbia Valley (WA)        4124
Russian River Valley        3091
California                  2629
                           ...  
Lamezia                        1
Trentino Superiore             1
Grave del Friuli               1
Vin Santo di Carmignano        1
Paestum                        1
Name: region_1, Length: 1230, dtype: int64

reviews_per_region = reviews.region_1.fillna('Unknown').value_counts()

# Check your answer
q4.check()

Correct

#q4.hint()
#q4.solution()
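
The two steps in this answer can be checked on a small example. The sketch below uses a made-up six-element Series in place of `reviews.region_1`. It relies on `value_counts()` sorting in descending order by default, which is why no explicit sort is needed.

```python
import numpy as np
import pandas as pd

# Made-up regions with three missing entries.
region = pd.Series(
    ["Etna", np.nan, "Napa Valley", "Etna", np.nan, np.nan],
    name="region_1",
)

# Replace missing values first, so they are counted under one label;
# value_counts() would otherwise drop NaN entries.
counts = region.fillna("Unknown").value_counts()

print(counts)
```

If the missing values were left in place, `value_counts()` would skip them unless `dropna=False` were passed.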

Keep going

Move on to renaming and combining.


Have questions or comments? Visit the course discussion forum to chat with other learners.
