Pandas grouping and Sorting
This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
In these exercises we’ll apply groupwise analysis to our dataset.
Run the code cell below to load the data before running the exercises.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")
Setup complete.
Exercises
1.
Who are the most common wine reviewers in the dataset? Create a Series
whose index is the taster_twitter_handle
category from the dataset, and whose values count how many reviews each person wrote.
reviews.columns
Index(['country', 'description', 'designation', 'points', 'price', 'province',
'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
'variety', 'winery'],
dtype='object')
reviews.taster_twitter_handle.head()
0 @kerinokeefe
1 @vossroger
2 @paulgwine
3 NaN
4 @paulgwine
Name: taster_twitter_handle, dtype: object
reviews.groupby('taster_twitter_handle').describe()
points | price | |||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | |
taster_twitter_handle | ||||||||||||||||
@AnneInVino | 3685.0 | 90.562551 | 2.373100 | 80.0 | 89.0 | 90.0 | 92.0 | 97.0 | 3398.0 | 31.230135 | 25.295871 | 10.0 | 19.0 | 25.0 | 38.00 | 1100.0 |
@JoeCz | 5147.0 | 88.536235 | 2.858701 | 80.0 | 87.0 | 89.0 | 91.0 | 100.0 | 5012.0 | 35.175579 | 44.434444 | 5.0 | 16.0 | 22.0 | 40.00 | 850.0 |
@bkfiona | 27.0 | 86.888889 | 1.739437 | 82.0 | 86.0 | 87.0 | 88.0 | 91.0 | 27.0 | 31.148148 | 16.154789 | 17.0 | 22.5 | 27.0 | 35.00 | 80.0 |
@gordone_cellars | 4177.0 | 88.626287 | 2.698341 | 80.0 | 87.0 | 89.0 | 91.0 | 97.0 | 4171.0 | 26.935507 | 17.475901 | 4.0 | 15.0 | 23.0 | 32.00 | 220.0 |
@kerinokeefe | 10776.0 | 88.867947 | 2.474240 | 80.0 | 87.0 | 89.0 | 90.0 | 100.0 | 9874.0 | 41.953413 | 38.727135 | 6.0 | 20.0 | 30.0 | 50.00 | 800.0 |
@laurbuzz | 1835.0 | 87.739510 | 2.530672 | 81.0 | 86.0 | 88.0 | 90.0 | 95.0 | 1713.0 | 24.492703 | 21.967640 | 5.0 | 13.0 | 18.0 | 28.00 | 350.0 |
@mattkettmann | 6332.0 | 90.008686 | 2.571257 | 81.0 | 88.0 | 90.0 | 92.0 | 97.0 | 6237.0 | 38.642136 | 31.994921 | 7.0 | 25.0 | 35.0 | 48.00 | 2013.0 |
@paulgwine | 9532.0 | 89.082564 | 2.814445 | 80.0 | 87.0 | 89.0 | 91.0 | 100.0 | 9498.0 | 33.644873 | 18.936484 | 5.0 | 20.0 | 29.0 | 42.00 | 275.0 |
@suskostrzewa | 1085.0 | 86.609217 | 2.376140 | 80.0 | 85.0 | 87.0 | 88.0 | 94.0 | 1073.0 | 22.908667 | 17.311163 | 7.0 | 14.0 | 19.0 | 26.00 | 320.0 |
@vboone | 9537.0 | 89.213379 | 2.996796 | 80.0 | 87.0 | 90.0 | 91.0 | 99.0 | 9507.0 | 46.621963 | 32.655537 | 7.0 | 25.0 | 39.0 | 56.00 | 625.0 |
@vossroger | 25514.0 | 88.708003 | 3.036373 | 80.0 | 86.0 | 88.0 | 91.0 | 100.0 | 20172.0 | 38.649960 | 71.540473 | 5.0 | 15.0 | 22.0 | 40.00 | 3300.0 |
@wawinereport | 4966.0 | 88.755739 | 2.458547 | 80.0 | 87.0 | 89.0 | 91.0 | 97.0 | 4925.0 | 34.085888 | 20.029977 | 6.0 | 20.0 | 30.0 | 42.00 | 240.0 |
@wineschach | 15134.0 | 86.907493 | 3.022859 | 80.0 | 85.0 | 87.0 | 89.0 | 98.0 | 14951.0 | 25.231155 | 28.723655 | 4.0 | 12.0 | 17.0 | 25.00 | 770.0 |
@winewchristina | 6.0 | 87.833333 | 3.600926 | 82.0 | 87.0 | 88.0 | 89.0 | 93.0 | 6.0 | 29.333333 | 11.165423 | 19.0 | 22.0 | 28.5 | 29.75 | 50.0 |
@worldwineguys | 1005.0 | 88.719403 | 2.044055 | 82.0 | 88.0 | 89.0 | 90.0 | 97.0 | 995.0 | 25.238191 | 19.984364 | 6.0 | 15.0 | 20.0 | 30.00 | 320.0 |
# Your code here
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
# Check your answer
q1.check()
<IPython.core.display.Javascript object>
Correct:
reviews_written = reviews.groupby('taster_twitter_handle').size()
or
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()
#q1.hint()
#q1.solution()
2.
What is the best wine I can buy for a given amount of money? Create a Series
whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0
dollars is at the top and 3300.0
dollars is at the bottom).
reviews.groupby('price').points.max()
price
4.0 86
5.0 87
6.0 88
7.0 91
8.0 91
..
1900.0 98
2000.0 97
2013.0 91
2500.0 96
3300.0 88
Name: points, Length: 390, dtype: int64
best_rating_per_price = reviews.groupby('price').points.max()
# Check your answer
q2.check()
<IPython.core.display.Javascript object>
Correct
#q2.hint()
#q2.solution()
3.
What are the minimum and maximum prices for each variety
of wine? Create a DataFrame
whose index is the variety
category from the dataset and whose values are the min
and max
values thereof.
reviews.groupby('variety').price.agg([min, max])
min | max | |
---|---|---|
variety | ||
Abouriou | 15.0 | 75.0 |
Agiorgitiko | 10.0 | 66.0 |
Aglianico | 6.0 | 180.0 |
Aidani | 27.0 | 27.0 |
Airen | 8.0 | 10.0 |
... | ... | ... |
Zinfandel | 5.0 | 100.0 |
Zlahtina | 13.0 | 16.0 |
Zweigelt | 9.0 | 70.0 |
Çalkarası | 19.0 | 19.0 |
Žilavka | 15.0 | 15.0 |
707 rows × 2 columns
price_extremes = reviews.groupby('variety').price.agg([min, max])
# Check your answer
q3.check()
<IPython.core.display.Javascript object>
Correct
#q3.hint()
#q3.solution()
4.
What are the most expensive wine varieties? Create a variable sorted_varieties
containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).
price_extremes.head()
min | max | |
---|---|---|
variety | ||
Abouriou | 15.0 | 75.0 |
Agiorgitiko | 10.0 | 66.0 |
Aglianico | 6.0 | 180.0 |
Aidani | 27.0 | 27.0 |
Airen | 8.0 | 10.0 |
sorted_varieties = price_extremes.sort_values(by=['min', 'max'],ascending=False)
# Check your answer
q4.check()
<IPython.core.display.Javascript object>
Correct
# q4.hint()
# q4.solution()
5.
Create a Series
whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name
and points
columns.
reviews.groupby('taster_name').points.mean().sort_values()
taster_name
Alexander Peartree 85.855422
Carrie Dykes 86.395683
Susan Kostrzewa 86.609217
Fiona Adams 86.888889
Michael Schachner 86.907493
Lauren Buzzeo 87.739510
Christina Pickard 87.833333
Jeff Jenssen 88.319756
Anna Lee C. Iijima 88.415629
Joe Czerwinski 88.536235
Jim Gordon 88.626287
Roger Voss 88.708003
Sean P. Sullivan 88.755739
Kerin O’Keefe 88.867947
Paul Gregutt 89.082564
Mike DeSimone 89.101167
Virginie Boone 89.213379
Matt Kettmann 90.008686
Anne Krebiehl MW 90.562551
Name: points, dtype: float64
reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()
# Check your answer
q5.check()
<IPython.core.display.Javascript object>
Correct
# q5.hint()
#q5.solution()
Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe()
method to see a summary of the range of values.
reviewer_mean_ratings.describe()
count 19.000000
mean 88.233026
std 1.243610
min 85.855422
25% 87.323501
50% 88.536235
75% 88.975256
max 90.562551
Name: points, dtype: float64
6.
What combination of countries and varieties are most common? Create a Series
whose index is a MultiIndex
of {country, variety}
pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}
. Sort the values in the Series
in descending order based on wine count.
reviews.groupby(['country', 'variety']).describe()
points | price | ||||||||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
count | mean | std | min | 25% | 50% | 75% | max | count | mean | std | min | 25% | 50% | 75% | max | ||
country | variety | ||||||||||||||||
Argentina | Barbera | 1.0 | 85.000000 | NaN | 85.0 | 85.00 | 85.0 | 85.00 | 85.0 | 1.0 | 18.000000 | NaN | 18.0 | 18.00 | 18.0 | 18.00 | 18.0 |
Bonarda | 105.0 | 86.504762 | 2.587410 | 80.0 | 85.00 | 87.0 | 89.00 | 92.0 | 105.0 | 16.628571 | 6.367370 | 9.0 | 13.00 | 15.0 | 18.00 | 40.0 | |
Bordeaux-style Red Blend | 89.0 | 89.820225 | 2.990750 | 81.0 | 88.00 | 91.0 | 92.00 | 96.0 | 86.0 | 41.546512 | 24.686580 | 10.0 | 25.00 | 35.5 | 50.00 | 120.0 | |
Bordeaux-style White Blend | 1.0 | 83.000000 | NaN | 83.0 | 83.00 | 83.0 | 83.00 | 83.0 | 1.0 | 14.000000 | NaN | 14.0 | 14.00 | 14.0 | 14.00 | 14.0 | |
Cabernet Blend | 8.0 | 88.250000 | 2.549510 | 85.0 | 86.75 | 88.0 | 89.25 | 93.0 | 8.0 | 36.250000 | 24.852422 | 17.0 | 24.75 | 30.0 | 35.75 | 95.0 | |
... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
Uruguay | Tannat-Cabernet Franc | 2.0 | 90.000000 | 1.414214 | 89.0 | 89.50 | 90.0 | 90.50 | 91.0 | 2.0 | 19.000000 | 4.242641 | 16.0 | 17.50 | 19.0 | 20.50 | 22.0 |
Tannat-Merlot | 6.0 | 86.500000 | 1.048809 | 85.0 | 86.00 | 86.5 | 87.00 | 88.0 | 6.0 | 17.666667 | 6.860515 | 12.0 | 13.00 | 14.5 | 21.25 | 29.0 | |
Tannat-Syrah | 1.0 | 84.000000 | NaN | 84.0 | 84.00 | 84.0 | 84.00 | 84.0 | 1.0 | 16.000000 | NaN | 16.0 | 16.00 | 16.0 | 16.00 | 16.0 | |
Albariño | 6.0 | 87.333333 | 2.581989 | 85.0 | 85.25 | 86.5 | 89.25 | 91.0 | 6.0 | 22.333333 | 4.131182 | 17.0 | 19.00 | 25.0 | 25.00 | 25.0 | |
Tempranillo-Tannat | 1.0 | 88.000000 | NaN | 88.0 | 88.00 | 88.0 | 88.00 | 88.0 | 1.0 | 20.000000 | NaN | 20.0 | 20.00 | 20.0 | 20.00 | 20.0 |
1612 rows × 16 columns
reviews.groupby(['country', 'variety']).variety.count().sort_values(ascending=False)
country variety
US Pinot Noir 9885
Cabernet Sauvignon 7315
Chardonnay 6801
France Bordeaux-style Red Blend 4725
Italy Red Blend 3624
...
Mexico Cinsault 1
Grenache 1
Merlot 1
Rosado 1
Uruguay White Blend 1
Name: variety, Length: 1612, dtype: int64
country_variety_counts = reviews.groupby(['country', 'variety']).variety.count().sort_values(ascending=False)
# Check your answer
q6.check()
<IPython.core.display.Javascript object>
Correct
#q6.hint()
#q6.solution()
Keep going
Move on to the data types and missing data.
Have questions or comments? Visit the course discussion forum to chat with other learners.
Leave a comment