Pandas Data Types and Missing Values

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.

Introduction

Run the following cell to load your data and some utility functions.

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.data_types_and_missing_data import *
print("Setup complete.")

Setup complete.

Exercises

1.

What is the data type of the points column in the dataset?

# Your code here
dtype = reviews.points.dtype

# Check your answer
q1.check()

Correct

#q1.hint()
#q1.solution()
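
As a small illustration of the `dtype` lookup used above, here is a sketch on a toy DataFrame. The column values are made up for the example and are not taken from the wine reviews dataset; only `.dtype` and `.dtypes` are the point.

```python
import pandas as pd

# Toy stand-in for the reviews data; all values here are hypothetical.
toy = pd.DataFrame({
    "points": [87, 90, 86],
    "price": [15.0, None, 22.5],
    "country": ["Italy", "Portugal", "France"],
})

# dtype of a single column
points_dtype = toy.points.dtype

# dtypes of every column at once, returned as a Series
all_dtypes = toy.dtypes

print(points_dtype)
print(all_dtypes)
```

Note that the `price` column becomes a floating-point column because the missing entry is stored as NaN, which is a float value.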

2.

Create a Series from entries in the points column, but convert the entries to strings. Hint: strings are str in native Python.

point_strings = reviews.points.astype('str')

# Check your answer
q2.check()

Correct

#q2.hint()
#q2.solution()
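
To see what the conversion does to individual entries, here is a minimal sketch on a hypothetical three-element Series (the numbers are invented for the example). Passing the built-in `str` works the same way as passing the string `'str'`.

```python
import pandas as pd

# Hypothetical points values, not taken from the dataset.
points = pd.Series([87, 90, 86], name="points")

# Convert every entry to a Python string.
point_strings = points.astype(str)

first = point_strings.iloc[0]
print(repr(first))
print(type(first))
```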

3.

Some entries in the price column are null. How many reviews in the dataset are missing a price?

reviews[pd.isnull(reviews.price)]

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
13	Italy	This is dominated by oak and oak-driven aromas...	Rosso	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Masseria Setteporte 2012 Rosso (Etna)	Nerello Mascalese	Masseria Setteporte
30	France	Red cherry fruit comes laced with light tannin...	Nouveau	86	NaN	Beaujolais	Beaujolais-Villages	NaN	Roger Voss	@vossroger	Domaine de la Madone 2012 Nouveau (Beaujolais...	Gamay	Domaine de la Madone
31	Italy	Merlot and Nero d'Avola form the base for this...	Calanìca Nero d'Avola-Merlot	86	NaN	Sicily & Sardinia	Sicilia	NaN	NaN	NaN	Duca di Salaparuta 2010 Calanìca Nero d'Avola-...	Red Blend	Duca di Salaparuta
32	Italy	Part of the extended Calanìca series, this Gri...	Calanìca Grillo-Viognier	86	NaN	Sicily & Sardinia	Sicilia	NaN	NaN	NaN	Duca di Salaparuta 2011 Calanìca Grillo-Viogni...	White Blend	Duca di Salaparuta
...	...	...	...	...	...	...	...	...	...	...	...	...	...
129844	Italy	Doga delle Clavule is a neutral, mineral-drive...	Doga delle Clavule	86	NaN	Tuscany	Morellino di Scansano	NaN	NaN	NaN	Caparzo 2006 Doga delle Clavule (Morellino di...	Sangiovese	Caparzo
129860	Portugal	This rich wine has a firm structure as well as...	Pacheca Superior	90	NaN	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta da Pacheca 2013 Pacheca Superior Red (D...	Portuguese Red	Quinta da Pacheca
129863	Portugal	This mature wine that has 50% Touriga Nacional...	Reserva	90	NaN	Dão	NaN	NaN	Roger Voss	@vossroger	Seacampo 2011 Reserva Red (Dão)	Portuguese Red	Seacampo
129893	Italy	Aromas of passion fruit, hay and a vegetal not...	Corte Menini	91	NaN	Veneto	Soave Classico	NaN	Kerin O’Keefe	@kerinokeefe	Le Mandolare 2015 Corte Menini (Soave Classico)	Garganega	Le Mandolare
129964	France	Initially quite muted, this wine slowly develo...	Domaine Saint-Rémy Herrenweg	90	NaN	Alsace	Alsace	NaN	Roger Voss	@vossroger	Domaine Ehrhart 2013 Domaine Saint-Rémy Herren...	Gewürztraminer	Domaine Ehrhart

8996 rows × 13 columns

type(reviews[pd.isnull(reviews.price)])

pandas.core.frame.DataFrame

n_missing_prices = len(reviews[pd.isnull(reviews.price)])

# Check your answer
q3.check()

Correct

n_missing_prices

q3.hint()
#q3.solution()

Hint: Use pd.isnull().
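
The approach above filters the DataFrame and takes its length. An alternative sketch, shown here on a hypothetical toy column, sums the boolean mask directly, since each `True` counts as 1. Both give the same number; the toy prices are invented for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical prices with two missing entries.
toy = pd.DataFrame({"price": [15.0, np.nan, 22.5, np.nan]})

# Boolean mask: True where the price is missing.
mask = pd.isnull(toy.price)

# Approach used in this notebook: filter rows, then count them.
n_by_len = len(toy[mask])

# Alternative: sum the mask without building a filtered DataFrame.
n_by_sum = int(toy.price.isnull().sum())

print(n_by_len, n_by_sum)
```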

4.

What are the most common wine-producing regions? Create a Series counting the number of times each value occurs in the region_1 field. This field is often missing data, so replace missing values with Unknown. Sort in descending order. Your output should look something like this:

Unknown                    21247
Napa Valley                 4480
                           ...  
Bardolino Superiore            1
Primitivo del Tarantino        1
Name: region_1, Length: 1230, dtype: int64

reviews.region_1.fillna('Unknown').value_counts()

Unknown                    21247
Napa Valley                 4480
Columbia Valley (WA)        4124
Russian River Valley        3091
California                  2629
                           ...  
Lamezia                        1
Trentino Superiore             1
Grave del Friuli               1
Vin Santo di Carmignano        1
Paestum                        1
Name: region_1, Length: 1230, dtype: int64

reviews_per_region = reviews.region_1.fillna('Unknown').value_counts()

# Check your answer
q4.check()

Correct

#q4.hint()
#q4.solution()
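
The following sketch shows why the `fillna` step matters, using a hypothetical six-element Series rather than the real `region_1` column. By default `value_counts()` skips missing values and sorts the counts in descending order, so filling first is what makes the `Unknown` row appear.

```python
import pandas as pd

# Hypothetical region values with three missing entries.
regions = pd.Series(
    ["Etna", None, "Napa Valley", "Etna", None, None],
    name="region_1",
)

# Without fillna, missing values are dropped from the counts.
counts_without_fill = regions.value_counts()

# With fillna, missing values are counted under the label "Unknown".
counts = regions.fillna("Unknown").value_counts()

print(counts_without_fill)
print(counts)
```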

Keep going

Move on to renaming and combining.

Have questions or comments? Visit the course discussion forum to chat with other learners.

안철현(Charles An)
