Pandas grouping and Sorting

6 minute read

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.

Introduction

In these exercises we’ll apply groupwise analysis to our dataset.

Run the code cell below to load the data before running the exercises.

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
#pd.set_option("display.max_rows", 5)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.grouping_and_sorting import *
print("Setup complete.")

Setup complete.

Exercises

1.

Who are the most common wine reviewers in the dataset? Create a Series whose index is the taster_twitter_handle category from the dataset, and whose values count how many reviews each person wrote.

reviews.columns

Index(['country', 'description', 'designation', 'points', 'price', 'province',
       'region_1', 'region_2', 'taster_name', 'taster_twitter_handle', 'title',
       'variety', 'winery'],
      dtype='object')

reviews.taster_twitter_handle.head()

  @kerinokeefe
    @vossroger
   @paulgwine 
           NaN
   @paulgwine 
Name: taster_twitter_handle, dtype: object

reviews.groupby('taster_twitter_handle').describe()

	points								price
	count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
taster_twitter_handle
@AnneInVino	3685.0	90.562551	2.373100	80.0	89.0	90.0	92.0	97.0	3398.0	31.230135	25.295871	10.0	19.0	25.0	38.00	1100.0
@JoeCz	5147.0	88.536235	2.858701	80.0	87.0	89.0	91.0	100.0	5012.0	35.175579	44.434444	5.0	16.0	22.0	40.00	850.0
@bkfiona	27.0	86.888889	1.739437	82.0	86.0	87.0	88.0	91.0	27.0	31.148148	16.154789	17.0	22.5	27.0	35.00	80.0
@gordone_cellars	4177.0	88.626287	2.698341	80.0	87.0	89.0	91.0	97.0	4171.0	26.935507	17.475901	4.0	15.0	23.0	32.00	220.0
@kerinokeefe	10776.0	88.867947	2.474240	80.0	87.0	89.0	90.0	100.0	9874.0	41.953413	38.727135	6.0	20.0	30.0	50.00	800.0
@laurbuzz	1835.0	87.739510	2.530672	81.0	86.0	88.0	90.0	95.0	1713.0	24.492703	21.967640	5.0	13.0	18.0	28.00	350.0
@mattkettmann	6332.0	90.008686	2.571257	81.0	88.0	90.0	92.0	97.0	6237.0	38.642136	31.994921	7.0	25.0	35.0	48.00	2013.0
@paulgwine	9532.0	89.082564	2.814445	80.0	87.0	89.0	91.0	100.0	9498.0	33.644873	18.936484	5.0	20.0	29.0	42.00	275.0
@suskostrzewa	1085.0	86.609217	2.376140	80.0	85.0	87.0	88.0	94.0	1073.0	22.908667	17.311163	7.0	14.0	19.0	26.00	320.0
@vboone	9537.0	89.213379	2.996796	80.0	87.0	90.0	91.0	99.0	9507.0	46.621963	32.655537	7.0	25.0	39.0	56.00	625.0
@vossroger	25514.0	88.708003	3.036373	80.0	86.0	88.0	91.0	100.0	20172.0	38.649960	71.540473	5.0	15.0	22.0	40.00	3300.0
@wawinereport	4966.0	88.755739	2.458547	80.0	87.0	89.0	91.0	97.0	4925.0	34.085888	20.029977	6.0	20.0	30.0	42.00	240.0
@wineschach	15134.0	86.907493	3.022859	80.0	85.0	87.0	89.0	98.0	14951.0	25.231155	28.723655	4.0	12.0	17.0	25.00	770.0
@winewchristina	6.0	87.833333	3.600926	82.0	87.0	88.0	89.0	93.0	6.0	29.333333	11.165423	19.0	22.0	28.5	29.75	50.0
@worldwineguys	1005.0	88.719403	2.044055	82.0	88.0	89.0	90.0	97.0	995.0	25.238191	19.984364	6.0	15.0	20.0	30.00	320.0

# Your code here
reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

# Check your answer
q1.check()

<IPython.core.display.Javascript object>

Correct:

reviews_written = reviews.groupby('taster_twitter_handle').size()

reviews_written = reviews.groupby('taster_twitter_handle').taster_twitter_handle.count()

#q1.hint()
#q1.solution()

2.

What is the best wine I can buy for a given amount of money? Create a Series whose index is wine prices and whose values is the maximum number of points a wine costing that much was given in a review. Sort the values by price, ascending (so that 4.0 dollars is at the top and 3300.0 dollars is at the bottom).

reviews.groupby('price').points.max()

price
0       86
0       87
0       88
0       91
0       91
          ..
0    98
0    97
0    91
0    96
0    88
Name: points, Length: 390, dtype: int64

best_rating_per_price = reviews.groupby('price').points.max()

# Check your answer
q2.check()

<IPython.core.display.Javascript object>

Correct

#q2.hint()
#q2.solution()

3.

What are the minimum and maximum prices for each variety of wine? Create a DataFrame whose index is the variety category from the dataset and whose values are the min and max values thereof.

reviews.groupby('variety').price.agg([min, max])

	min	max
variety
Abouriou	15.0	75.0
Agiorgitiko	10.0	66.0
Aglianico	6.0	180.0
Aidani	27.0	27.0
Airen	8.0	10.0
...	...	...
Zinfandel	5.0	100.0
Zlahtina	13.0	16.0
Zweigelt	9.0	70.0
Çalkarası	19.0	19.0
Žilavka	15.0	15.0

707 rows × 2 columns

price_extremes = reviews.groupby('variety').price.agg([min, max])

# Check your answer
q3.check()

<IPython.core.display.Javascript object>

Correct

#q3.hint()
#q3.solution()

4.

What are the most expensive wine varieties? Create a variable sorted_varieties containing a copy of the dataframe from the previous question where varieties are sorted in descending order based on minimum price, then on maximum price (to break ties).

price_extremes.head()

	min	max
variety
Abouriou	15.0	75.0
Agiorgitiko	10.0	66.0
Aglianico	6.0	180.0
Aidani	27.0	27.0
Airen	8.0	10.0

sorted_varieties = price_extremes.sort_values(by=['min', 'max'],ascending=False)

# Check your answer
q4.check()

<IPython.core.display.Javascript object>

Correct

# q4.hint()
# q4.solution()

5.

Create a Series whose index is reviewers and whose values is the average review score given out by that reviewer. Hint: you will need the taster_name and points columns.

reviews.groupby('taster_name').points.mean().sort_values()

taster_name
Alexander Peartree    85.855422
Carrie Dykes          86.395683
Susan Kostrzewa       86.609217
Fiona Adams           86.888889
Michael Schachner     86.907493
Lauren Buzzeo         87.739510
Christina Pickard     87.833333
Jeff Jenssen          88.319756
Anna Lee C. Iijima    88.415629
Joe Czerwinski        88.536235
Jim Gordon            88.626287
Roger Voss            88.708003
Sean P. Sullivan      88.755739
Kerin O’Keefe         88.867947
Paul Gregutt          89.082564
Mike DeSimone         89.101167
Virginie Boone        89.213379
Matt Kettmann         90.008686
Anne Krebiehl MW      90.562551
Name: points, dtype: float64

reviewer_mean_ratings = reviews.groupby('taster_name').points.mean()

# Check your answer
q5.check()

<IPython.core.display.Javascript object>

Correct

# q5.hint()
#q5.solution()

Are there significant differences in the average scores assigned by the various reviewers? Run the cell below to use the describe() method to see a summary of the range of values.

reviewer_mean_ratings.describe()

count    19.000000
mean     88.233026
std       1.243610
min      85.855422
25%      87.323501
50%      88.536235
75%      88.975256
max      90.562551
Name: points, dtype: float64

6.

What combination of countries and varieties are most common? Create a Series whose index is a MultiIndexof {country, variety} pairs. For example, a pinot noir produced in the US should map to {"US", "Pinot Noir"}. Sort the values in the Series in descending order based on wine count.

reviews.groupby(['country', 'variety']).describe()

		points								price
		count	mean	std	min	25%	50%	75%	max	count	mean	std	min	25%	50%	75%	max
country	variety
Argentina	Barbera	1.0	85.000000	NaN	85.0	85.00	85.0	85.00	85.0	1.0	18.000000	NaN	18.0	18.00	18.0	18.00	18.0
	Bonarda	105.0	86.504762	2.587410	80.0	85.00	87.0	89.00	92.0	105.0	16.628571	6.367370	9.0	13.00	15.0	18.00	40.0
	Bordeaux-style Red Blend	89.0	89.820225	2.990750	81.0	88.00	91.0	92.00	96.0	86.0	41.546512	24.686580	10.0	25.00	35.5	50.00	120.0
	Bordeaux-style White Blend	1.0	83.000000	NaN	83.0	83.00	83.0	83.00	83.0	1.0	14.000000	NaN	14.0	14.00	14.0	14.00	14.0
	Cabernet Blend	8.0	88.250000	2.549510	85.0	86.75	88.0	89.25	93.0	8.0	36.250000	24.852422	17.0	24.75	30.0	35.75	95.0
...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...	...
Uruguay	Tannat-Cabernet Franc	2.0	90.000000	1.414214	89.0	89.50	90.0	90.50	91.0	2.0	19.000000	4.242641	16.0	17.50	19.0	20.50	22.0
	Tannat-Merlot	6.0	86.500000	1.048809	85.0	86.00	86.5	87.00	88.0	6.0	17.666667	6.860515	12.0	13.00	14.5	21.25	29.0
	Tannat-Syrah	1.0	84.000000	NaN	84.0	84.00	84.0	84.00	84.0	1.0	16.000000	NaN	16.0	16.00	16.0	16.00	16.0
	Albariño	6.0	87.333333	2.581989	85.0	85.25	86.5	89.25	91.0	6.0	22.333333	4.131182	17.0	19.00	25.0	25.00	25.0
	Tempranillo-Tannat	1.0	88.000000	NaN	88.0	88.00	88.0	88.00	88.0	1.0	20.000000	NaN	20.0	20.00	20.0	20.00	20.0

1612 rows × 16 columns

reviews.groupby(['country', 'variety']).variety.count().sort_values(ascending=False)

country  variety                 
US       Pinot Noir                  9885
         Cabernet Sauvignon          7315
         Chardonnay                  6801
France   Bordeaux-style Red Blend    4725
Italy    Red Blend                   3624
                                     ... 
Mexico   Cinsault                       1
         Grenache                       1
         Merlot                         1
         Rosado                         1
Uruguay  White Blend                    1
Name: variety, Length: 1612, dtype: int64

country_variety_counts = reviews.groupby(['country', 'variety']).variety.count().sort_values(ascending=False)

# Check your answer
q6.check()

<IPython.core.display.Javascript object>

Correct

#q6.hint()
#q6.solution()

Keep going

Move on to the data types and missing data.

Have questions or comments? Visit the course discussion forum to chat with other learners.

Share on

Twitter Facebook LinkedIn

안철현(Charles An)

Pandas grouping and Sorting

Introduction

Exercises

1.

2.

3.

4.

5.

6.

Keep going

Share on

Leave a comment

You may also enjoy

FastAPI 시작해봅시다

파사트 헤드라이트 램프(HID) 교체

Clustering with K-Means

Creating Featues