11 minute read

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.


Introduction

Run the following cell to load your data and some utility functions.

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.renaming_and_combining import *
print("Setup complete.")
Setup complete.

Exercises

View the first several lines of your data by running the cell below:

reviews.head()
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks

1.

region_1 and region_2 are pretty uninformative names for locale columns in the dataset. Create a copy of reviews with these columns renamed to region and locale, respectively.

# Your code here
renamed = reviews.rename(columns = {'region_1':'region', 'region_2':'locale'})

# Check your answer
q1.check()
<IPython.core.display.Javascript object>

Correct

renamed.head()
country description designation points price province region locale taster_name taster_twitter_handle title variety winery
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
#q1.hint()
#q1.solution()

2.

Set the index name in the dataset to wines.

reindexed = reviews.rename_axis("wines", axis='rows')

# Check your answer
q2.check()
<IPython.core.display.Javascript object>

Correct

reindexed.head()
country description designation points price province region_1 region_2 taster_name taster_twitter_handle title variety winery
wines
0 Italy Aromas include tropical fruit, broom, brimston... Vulkà Bianco 87 NaN Sicily & Sardinia Etna NaN Kerin O’Keefe @kerinokeefe Nicosia 2013 Vulkà Bianco (Etna) White Blend Nicosia
1 Portugal This is ripe and fruity, a wine that is smooth... Avidagos 87 15.0 Douro NaN NaN Roger Voss @vossroger Quinta dos Avidagos 2011 Avidagos Red (Douro) Portuguese Red Quinta dos Avidagos
2 US Tart and snappy, the flavors of lime flesh and... NaN 87 14.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Rainstorm 2013 Pinot Gris (Willamette Valley) Pinot Gris Rainstorm
3 US Pineapple rind, lemon pith and orange blossom ... Reserve Late Harvest 87 13.0 Michigan Lake Michigan Shore NaN Alexander Peartree NaN St. Julian 2013 Reserve Late Harvest Riesling ... Riesling St. Julian
4 US Much like the regular bottling from 2012, this... Vintner's Reserve Wild Child Block 87 65.0 Oregon Willamette Valley Willamette Valley Paul Gregutt @paulgwine Sweet Cheeks 2012 Vintner's Reserve Wild Child... Pinot Noir Sweet Cheeks
#q2.hint()
#q2.solution()

3.

The Things on Reddit dataset includes product links from a selection of top-ranked forums (“subreddits”) on reddit.com. Run the cell below to load a dataframe of products mentioned on the /r/gaming subreddit and another dataframe for products mentioned on the r//movies subreddit.

gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
gaming_products['subreddit'] = "r/gaming"
movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
movie_products['subreddit'] = "r/movies"

Create a DataFrame of products mentioned on either subreddit.

gaming_products.head()
name category amazon_link total_mentions subreddit_mentions subreddit
0 BOOMco Halo Covenant Needler Blaster Toys & Games https://www.amazon.com/BOOMco-Halo-Covenant-Ne... 4.0 4 r/gaming
1 Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... Electronics https://www.amazon.com/Raspberry-Model-A1-2GHz... 19.0 3 r/gaming
2 CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... Electronics https://www.amazon.com/CanaKit-Raspberry-Suppl... 7.0 3 r/gaming
3 Panasonic K-KJ17MCA4BA Advanced Individual Cel... Electronics https://www.amazon.com/Panasonic-Advanced-Indi... 29.0 2 r/gaming
4 Mayflash GameCube Controller Adapter for Wii U... Electronics https://www.amazon.com/GameCube-Controller-Ada... 24.0 2 r/gaming
gaming_products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                493 non-null    object 
 1   category            493 non-null    object 
 2   amazon_link         493 non-null    object 
 3   total_mentions      491 non-null    float64
 4   subreddit_mentions  493 non-null    int64  
 5   subreddit           493 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 23.2+ KB
gaming_products.describe()
total_mentions subreddit_mentions
count 491.000000 493.000000
mean 2.853360 1.064909
std 4.504293 0.284858
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 1.000000
75% 3.000000 1.000000
max 46.000000 4.000000
movie_products.head()
name category amazon_link total_mentions subreddit_mentions subreddit
0 Marvel Cinematic Universe: Phase One - Avenger... Movies & TV https://www.amazon.com/Marvel-Cinematic-Univer... 4.0 3 r/movies
1 On Stranger Tides Books https://www.amazon.com/Stranger-Tides-Tim-Powe... 3.0 3 r/movies
2 Superintelligence: Paths, Dangers, Strategies Books https://www.amazon.com/Superintelligence-Dange... 7.0 2 r/movies
3 The Secret History of Star Wars Books https://www.amazon.com/Secret-History-Star-War... 4.0 2 r/movies
4 2D Glasses 4 Pack - Turns 3D movies back into ... Electronics https://www.amazon.com/gp/product/B00K9E7GCC 3.0 2 r/movies
movie_products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                303 non-null    object 
 1   category            303 non-null    object 
 2   amazon_link         303 non-null    object 
 3   total_mentions      302 non-null    float64
 4   subreddit_mentions  303 non-null    int64  
 5   subreddit           303 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 14.3+ KB
movie_products.describe()
total_mentions subreddit_mentions
count 302.000000 303.000000
mean 1.470199 1.046205
std 1.131300 0.239710
min 1.000000 1.000000
25% 1.000000 1.000000
50% 1.000000 1.000000
75% 1.000000 1.000000
max 8.000000 3.000000
combined_products = pd.concat([gaming_products, movie_products])

# Check your answer
q3.check()
<IPython.core.display.Javascript object>

Correct

#q3.hint()
#q3.solution()

4.

The Powerlifting Database dataset on Kaggle includes one CSV table for powerlifting meets and a separate one for powerlifting competitors. Run the cell below to load these datasets into dataframes:

powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")
powerlifting_meets.head()
MeetID MeetPath Federation Date MeetCountry MeetState MeetTown MeetName
0 0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha...
1 1 365strong/1602 365Strong 2016-11-19 USA MO Ozark Thanksgiving Powerlifting Classic
2 2 365strong/1603 365Strong 2016-07-09 USA NC Charlotte Charlotte Europa Games
3 3 365strong/1604 365Strong 2016-06-11 USA SC Rock Hill Carolina Cup Push Pull Challenge
4 4 365strong/1605 365Strong 2016-04-10 USA SC Rock Hill Eastern USA Challenge
powerlifting_meets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8482 entries, 0 to 8481
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   MeetID       8482 non-null   int64 
 1   MeetPath     8482 non-null   object
 2   Federation   8482 non-null   object
 3   Date         8482 non-null   object
 4   MeetCountry  8482 non-null   object
 5   MeetState    5496 non-null   object
 6   MeetTown     6973 non-null   object
 7   MeetName     8482 non-null   object
dtypes: int64(1), object(7)
memory usage: 530.2+ KB
powerlifting_competitors.head()
MeetID Name Sex Equipment Age Division BodyweightKg WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
0 0 Angie Belk Terry F Wraps 47.0 Mst 45-49 59.60 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
1 0 Dawn Bogart F Single-ply 42.0 Mst 40-44 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
2 0 Dawn Bogart F Single-ply 42.0 Open Senior 58.51 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
3 0 Dawn Bogart F Raw 42.0 Open Senior 58.51 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
4 0 Destiny Dula F Raw 18.0 Teen 18-19 63.68 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47
powerlifting_competitors.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386414 entries, 0 to 386413
Data columns (total 17 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   MeetID          386414 non-null  int64  
 1   Name            386414 non-null  object 
 2   Sex             386414 non-null  object 
 3   Equipment       386414 non-null  object 
 4   Age             147147 non-null  float64
 5   Division        370571 non-null  object 
 6   BodyweightKg    384012 non-null  float64
 7   WeightClassKg   382602 non-null  object 
 8   Squat4Kg        1243 non-null    float64
 9   BestSquatKg     298071 non-null  float64
 10  Bench4Kg        1962 non-null    float64
 11  BestBenchKg     356364 non-null  float64
 12  Deadlift4Kg     2800 non-null    float64
 13  BestDeadliftKg  317847 non-null  float64
 14  TotalKg         363237 non-null  float64
 15  Place           385322 non-null  object 
 16  Wilks           362194 non-null  float64
dtypes: float64(10), int64(1), object(6)
memory usage: 50.1+ MB
powerlifting_competitors.describe()
MeetID Age BodyweightKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Wilks
count 386414.000000 147147.000000 384012.000000 1243.000000 298071.000000 1962.000000 356364.000000 2800.000000 317847.000000 363237.000000 362194.000000
mean 5143.015804 31.668237 86.934912 107.036404 176.569941 45.722905 118.347509 113.597193 195.040633 424.000249 301.080601
std 2552.099838 12.900342 23.140843 166.976620 69.222785 151.668221 54.848850 170.201657 61.580675 196.355147 116.360396
min 0.000000 5.000000 15.880000 -440.500000 -477.500000 -360.000000 -522.500000 -461.000000 -410.000000 11.000000 13.730000
25% 2979.000000 22.000000 70.300000 87.500000 127.500000 -90.000000 79.380000 110.000000 147.500000 272.160000 237.380000
50% 5960.000000 28.000000 83.200000 145.000000 174.630000 90.250000 115.000000 157.500000 195.000000 424.110000 319.660000
75% 7175.000000 39.000000 100.000000 212.500000 217.720000 167.500000 150.000000 219.990000 238.140000 565.000000 379.290000
max 8481.000000 95.000000 242.400000 450.000000 573.790000 378.750000 488.500000 418.000000 460.400000 1365.310000 779.380000

Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.

left = powerlifting_meets.set_index('MeetID')
right = powerlifting_competitors.set_index('MeetID')
powerlifting_combined = left.join(right)

# Check your answer
q4.check()
<IPython.core.display.Javascript object>

Correct

powerlifting_combined.head()
MeetPath Federation Date MeetCountry MeetState MeetTown MeetName Name Sex Equipment ... WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
MeetID
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Angie Belk Terry F Wraps ... 60 NaN 47.63 NaN 20.41 NaN 70.31 138.35 1 155.05
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply ... 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Single-ply ... 60 NaN 142.88 NaN 95.25 NaN 163.29 401.42 1 456.38
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Dawn Bogart F Raw ... 60 NaN NaN NaN 95.25 NaN NaN 95.25 1 108.29
0 365strong/1601 365Strong 2016-10-29 USA NC Charlotte 2016 Junior & Senior National Powerlifting Cha... Destiny Dula F Raw ... 67.5 NaN NaN NaN 31.75 NaN 90.72 122.47 1 130.47

5 rows × 23 columns

powerlifting_combined.tail()
MeetPath Federation Date MeetCountry MeetState MeetTown MeetName Name Sex Equipment ... WeightClassKg Squat4Kg BestSquatKg Bench4Kg BestBenchKg Deadlift4Kg BestDeadliftKg TotalKg Place Wilks
MeetID
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals William Barabas M Multi-ply ... 125 NaN NaN NaN NaN NaN 347.5 347.5 2 202.60
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Justin Zottl M Multi-ply ... 125 NaN NaN NaN NaN NaN 322.5 322.5 3 185.77
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Jake Anderson M Multi-ply ... 125 NaN NaN NaN NaN NaN 367.5 367.5 1 211.17
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Jeff Bumanglag M Multi-ply ... 140 NaN NaN NaN NaN NaN 320.0 320.0 3 181.85
8481 xpc/2017-finals XPC 2017-03-03 USA OH Columbus 2017 XPC Finals Shane Hammock M Multi-ply ... 140 NaN NaN NaN NaN NaN 362.5 362.5 2 205.18

5 rows × 23 columns

powerlifting_combined.columns
Index(['MeetPath', 'Federation', 'Date', 'MeetCountry', 'MeetState',
       'MeetTown', 'MeetName', 'Name', 'Sex', 'Equipment', 'Age', 'Division',
       'BodyweightKg', 'WeightClassKg', 'Squat4Kg', 'BestSquatKg', 'Bench4Kg',
       'BestBenchKg', 'Deadlift4Kg', 'BestDeadliftKg', 'TotalKg', 'Place',
       'Wilks'],
      dtype='object')
powerlifting_combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386414 entries, 0 to 8481
Data columns (total 23 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 0   MeetPath        386414 non-null  object 
 1   Federation      386414 non-null  object 
 2   Date            386414 non-null  object 
 3   MeetCountry     386414 non-null  object 
 4   MeetState       314271 non-null  object 
 5   MeetTown        292414 non-null  object 
 6   MeetName        386414 non-null  object 
 7   Name            386414 non-null  object 
 8   Sex             386414 non-null  object 
 9   Equipment       386414 non-null  object 
 10  Age             147147 non-null  float64
 11  Division        370571 non-null  object 
 12  BodyweightKg    384012 non-null  float64
 13  WeightClassKg   382602 non-null  object 
 14  Squat4Kg        1243 non-null    float64
 15  BestSquatKg     298071 non-null  float64
 16  Bench4Kg        1962 non-null    float64
 17  BestBenchKg     356364 non-null  float64
 18  Deadlift4Kg     2800 non-null    float64
 19  BestDeadliftKg  317847 non-null  float64
 20  TotalKg         363237 non-null  float64
 21  Place           385322 non-null  object 
 22  Wilks           362194 non-null  float64
dtypes: float64(10), object(13)
memory usage: 70.8+ MB
#q4.hint()
#q4.solution()

Congratulations!

You’ve finished the Pandas micro-course. Many data scientists feel efficiency with Pandas is the most useful and practical skill they have, because it allows you to progress quickly in any project you have.

If you’d like to apply your new skills to examining geospatial data, you’re encouraged to check out our Geospatial Analysis micro-course.

You can also take advantage of your Pandas skills by entering a Kaggle Competition or by answering a question you find interesting using Kaggle Datasets.


Have questions or comments? Visit the course discussion forum to chat with other learners.

Leave a comment