Pandas Summary, Functions and Maps

4 minute read

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.

Introduction

Now you are ready to get a deeper understanding of your data.

Run the following cell to load your data and some utility functions (including code to check your answers).

import pandas as pd
pd.set_option("display.max_rows", 5)
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.summary_functions_and_maps import *
print("Setup complete.")

reviews.head()

Setup complete.

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

Exercises

1.

What is the median of the points column in the reviews DataFrame?

median_points = reviews.points.median()

# Check your answer
q1.check()

<IPython.core.display.Javascript object>

Correct

#q1.hint()
#q1.solution()

2.

What countries are represented in the dataset? (Your answer should not include any duplicates.)

reviews.country.unique()

array(['Italy', 'Portugal', 'US', 'Spain', 'France', 'Germany',
       'Argentina', 'Chile', 'Australia', 'Austria', 'South Africa',
       'New Zealand', 'Israel', 'Hungary', 'Greece', 'Romania', 'Mexico',
       'Canada', nan, 'Turkey', 'Czech Republic', 'Slovenia',
       'Luxembourg', 'Croatia', 'Georgia', 'Uruguay', 'England',
       'Lebanon', 'Serbia', 'Brazil', 'Moldova', 'Morocco', 'Peru',
       'India', 'Bulgaria', 'Cyprus', 'Armenia', 'Switzerland',
       'Bosnia and Herzegovina', 'Ukraine', 'Slovakia', 'Macedonia',
       'China', 'Egypt'], dtype=object)

countries = reviews.country.unique()

# Check your answer
q2.check()

<IPython.core.display.Javascript object>

Correct

#q2.hint()
#q2.solution()

3.

How often does each country appear in the dataset? Create a Series reviews_per_country mapping countries to the count of reviews of wines from that country.

reviews_per_country = reviews.country.value_counts()

# Check your answer
q3.check()

<IPython.core.display.Javascript object>

Correct

#q3.hint()
#q3.solution()

4.

Create variable centered_price containing a version of the price column with the mean price subtracted.

(Note: this ‘centering’ transformation is a common preprocessing step before applying various machine learning algorithms.)

centered_price = reviews.price - reviews.price.mean()

# Check your answer
q4.check()

<IPython.core.display.Javascript object>

Correct

#q4.hint()
#q4.solution()

5.

I’m an economical wine buyer. Which wine is the “best bargain”? Create a variable bargain_wine with the title of the wine with the highest points-to-price ratio in the dataset.

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

# Check your answer
q5.check()

<IPython.core.display.Javascript object>

Correct

bargain_wine

'Bandit NV Merlot (California)'

badwine_idx = (reviews.points / reviews.price).idxmin()
bad_wine = reviews.loc[badwine_idx, 'title']

bad_wine

'Château les Ormes Sorbet 2013  Médoc'

q5.hint()
q5.solution()

<IPython.core.display.Javascript object>

Hint: The idxmax method may be useful here.

<IPython.core.display.Javascript object>

Solution:

bargain_idx = (reviews.points / reviews.price).idxmax()
bargain_wine = reviews.loc[bargain_idx, 'title']

6.

There are only so many words you can use when describing a bottle of wine. Is a wine more likely to be “tropical” or “fruity”? Create a Series descriptor_counts counting how many times each of these two words appears in the description column in the dataset. (For simplicity, let’s ignore the capitalized versions of these words.)

reviews.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

reviews.description.count()

n_trop = reviews.description.map(lambda desc: 'tropical' in desc).sum()
n_fruity = reviews.description.map(lambda desc: 'fruity' in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

# Check your answer
q6.check()

<IPython.core.display.Javascript object>

Correct

q6.hint()
q6.solution()

<IPython.core.display.Javascript object>

Hint: Use a map to check each description for the string tropical, then count up the number of times this is True. Repeat this for fruity. Finally, create a Series combining the two values.

<IPython.core.display.Javascript object>

Solution:

n_trop = reviews.description.map(lambda desc: "tropical" in desc).sum()
n_fruity = reviews.description.map(lambda desc: "fruity" in desc).sum()
descriptor_counts = pd.Series([n_trop, n_fruity], index=['tropical', 'fruity'])

7.

We’d like to host these wine reviews on our website, but a rating system ranging from 80 to 100 points is too hard to understand - we’d like to translate them into simple star ratings. A score of 95 or higher counts as 3 stars, a score of at least 85 but less than 95 is 2 stars. Any other score is 1 star.

Also, the Canadian Vintners Association bought a lot of ads on the site, so any wines from Canada should automatically get 3 stars, regardless of points.

Create a series star_ratings with the number of stars corresponding to each review in the dataset.

type(reviews)

pandas.core.frame.DataFrame

def make_star(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85 and row.points <95:
        return 2
    else:
        return 1
        

star_ratings = reviews.apply(make_star, axis='columns')

# Check your answer
q7.check()

<IPython.core.display.Javascript object>

Correct

star_ratings.describe()

count    129971.000000
mean          1.924999
             ...      
75%           2.000000
max           3.000000
Length: 8, dtype: float64

q7.hint()
q7.solution()

<IPython.core.display.Javascript object>

Hint: Begin by writing a custom function that accepts a row from the DataFrame as input and returns the star rating corresponding to the row. Then, use DataFrame.apply to apply the custom function to every row in the dataset.

<IPython.core.display.Javascript object>

Solution:

def stars(row):
    if row.country == 'Canada':
        return 3
    elif row.points >= 95:
        return 3
    elif row.points >= 85:
        return 2
    else:
        return 1
    
star_ratings = reviews.apply(stars, axis='columns')

Keep going

Continue to grouping and sorting.

Have questions or comments? Visit the course discussion forum to chat with other learners.

Share on

Twitter Facebook LinkedIn

안철현(Charles An)

Pandas Summary, Functions and Maps

Introduction

Exercises

1.

2.

3.

4.

5.

6.

7.

Keep going

Share on

Leave a comment

You may also enjoy

파사트 헤드라이트 램프(HID) 교체

Clustering with K-Means

Creating Featues

Tips for Creating Features