Pandas Renaming and Combining
This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.
Introduction
Run the following cell to load your data and some utility functions.
import pandas as pd
reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)
from learntools.core import binder; binder.bind(globals())
from learntools.pandas.renaming_and_combining import *
print("Setup complete.")
Setup complete.
Exercises
View the first several lines of your data by running the cell below:
reviews.head()
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
1.
region_1
and region_2
are pretty uninformative names for locale columns in the dataset. Create a copy of reviews
with these columns renamed to region
and locale
, respectively.
# Your code here
renamed = reviews.rename(columns = {'region_1':'region', 'region_2':'locale'})
# Check your answer
q1.check()
<IPython.core.display.Javascript object>
Correct
renamed.head()
country | description | designation | points | price | province | region | locale | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
#q1.hint()
#q1.solution()
2.
Set the index name in the dataset to wines
.
reindexed = reviews.rename_axis("wines", axis='rows')
# Check your answer
q2.check()
<IPython.core.display.Javascript object>
Correct
reindexed.head()
country | description | designation | points | price | province | region_1 | region_2 | taster_name | taster_twitter_handle | title | variety | winery | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
wines | |||||||||||||
0 | Italy | Aromas include tropical fruit, broom, brimston... | Vulkà Bianco | 87 | NaN | Sicily & Sardinia | Etna | NaN | Kerin O’Keefe | @kerinokeefe | Nicosia 2013 Vulkà Bianco (Etna) | White Blend | Nicosia |
1 | Portugal | This is ripe and fruity, a wine that is smooth... | Avidagos | 87 | 15.0 | Douro | NaN | NaN | Roger Voss | @vossroger | Quinta dos Avidagos 2011 Avidagos Red (Douro) | Portuguese Red | Quinta dos Avidagos |
2 | US | Tart and snappy, the flavors of lime flesh and... | NaN | 87 | 14.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Rainstorm 2013 Pinot Gris (Willamette Valley) | Pinot Gris | Rainstorm |
3 | US | Pineapple rind, lemon pith and orange blossom ... | Reserve Late Harvest | 87 | 13.0 | Michigan | Lake Michigan Shore | NaN | Alexander Peartree | NaN | St. Julian 2013 Reserve Late Harvest Riesling ... | Riesling | St. Julian |
4 | US | Much like the regular bottling from 2012, this... | Vintner's Reserve Wild Child Block | 87 | 65.0 | Oregon | Willamette Valley | Willamette Valley | Paul Gregutt | @paulgwine | Sweet Cheeks 2012 Vintner's Reserve Wild Child... | Pinot Noir | Sweet Cheeks |
#q2.hint()
#q2.solution()
3.
The Things on Reddit dataset includes product links from a selection of top-ranked forums (“subreddits”) on reddit.com. Run the cell below to load a dataframe of products mentioned on the /r/gaming subreddit and another dataframe for products mentioned on the r//movies subreddit.
gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
gaming_products['subreddit'] = "r/gaming"
movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
movie_products['subreddit'] = "r/movies"
Create a DataFrame
of products mentioned on either subreddit.
gaming_products.head()
name | category | amazon_link | total_mentions | subreddit_mentions | subreddit | |
---|---|---|---|---|---|---|
0 | BOOMco Halo Covenant Needler Blaster | Toys & Games | https://www.amazon.com/BOOMco-Halo-Covenant-Ne... | 4.0 | 4 | r/gaming |
1 | Raspberry PI 3 Model B 1.2GHz 64-bit quad-core... | Electronics | https://www.amazon.com/Raspberry-Model-A1-2GHz... | 19.0 | 3 | r/gaming |
2 | CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ... | Electronics | https://www.amazon.com/CanaKit-Raspberry-Suppl... | 7.0 | 3 | r/gaming |
3 | Panasonic K-KJ17MCA4BA Advanced Individual Cel... | Electronics | https://www.amazon.com/Panasonic-Advanced-Indi... | 29.0 | 2 | r/gaming |
4 | Mayflash GameCube Controller Adapter for Wii U... | Electronics | https://www.amazon.com/GameCube-Controller-Ada... | 24.0 | 2 | r/gaming |
gaming_products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 493 non-null object
1 category 493 non-null object
2 amazon_link 493 non-null object
3 total_mentions 491 non-null float64
4 subreddit_mentions 493 non-null int64
5 subreddit 493 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 23.2+ KB
gaming_products.describe()
total_mentions | subreddit_mentions | |
---|---|---|
count | 491.000000 | 493.000000 |
mean | 2.853360 | 1.064909 |
std | 4.504293 | 0.284858 |
min | 1.000000 | 1.000000 |
25% | 1.000000 | 1.000000 |
50% | 1.000000 | 1.000000 |
75% | 3.000000 | 1.000000 |
max | 46.000000 | 4.000000 |
movie_products.head()
name | category | amazon_link | total_mentions | subreddit_mentions | subreddit | |
---|---|---|---|---|---|---|
0 | Marvel Cinematic Universe: Phase One - Avenger... | Movies & TV | https://www.amazon.com/Marvel-Cinematic-Univer... | 4.0 | 3 | r/movies |
1 | On Stranger Tides | Books | https://www.amazon.com/Stranger-Tides-Tim-Powe... | 3.0 | 3 | r/movies |
2 | Superintelligence: Paths, Dangers, Strategies | Books | https://www.amazon.com/Superintelligence-Dange... | 7.0 | 2 | r/movies |
3 | The Secret History of Star Wars | Books | https://www.amazon.com/Secret-History-Star-War... | 4.0 | 2 | r/movies |
4 | 2D Glasses 4 Pack - Turns 3D movies back into ... | Electronics | https://www.amazon.com/gp/product/B00K9E7GCC | 3.0 | 2 | r/movies |
movie_products.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 6 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 name 303 non-null object
1 category 303 non-null object
2 amazon_link 303 non-null object
3 total_mentions 302 non-null float64
4 subreddit_mentions 303 non-null int64
5 subreddit 303 non-null object
dtypes: float64(1), int64(1), object(4)
memory usage: 14.3+ KB
movie_products.describe()
total_mentions | subreddit_mentions | |
---|---|---|
count | 302.000000 | 303.000000 |
mean | 1.470199 | 1.046205 |
std | 1.131300 | 0.239710 |
min | 1.000000 | 1.000000 |
25% | 1.000000 | 1.000000 |
50% | 1.000000 | 1.000000 |
75% | 1.000000 | 1.000000 |
max | 8.000000 | 3.000000 |
combined_products = pd.concat([gaming_products, movie_products])
# Check your answer
q3.check()
<IPython.core.display.Javascript object>
Correct
#q3.hint()
#q3.solution()
4.
The Powerlifting Database dataset on Kaggle includes one CSV table for powerlifting meets and a separate one for powerlifting competitors. Run the cell below to load these datasets into dataframes:
powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")
powerlifting_meets.head()
MeetID | MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName | |
---|---|---|---|---|---|---|---|---|
0 | 0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... |
1 | 1 | 365strong/1602 | 365Strong | 2016-11-19 | USA | MO | Ozark | Thanksgiving Powerlifting Classic |
2 | 2 | 365strong/1603 | 365Strong | 2016-07-09 | USA | NC | Charlotte | Charlotte Europa Games |
3 | 3 | 365strong/1604 | 365Strong | 2016-06-11 | USA | SC | Rock Hill | Carolina Cup Push Pull Challenge |
4 | 4 | 365strong/1605 | 365Strong | 2016-04-10 | USA | SC | Rock Hill | Eastern USA Challenge |
powerlifting_meets.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8482 entries, 0 to 8481
Data columns (total 8 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MeetID 8482 non-null int64
1 MeetPath 8482 non-null object
2 Federation 8482 non-null object
3 Date 8482 non-null object
4 MeetCountry 8482 non-null object
5 MeetState 5496 non-null object
6 MeetTown 6973 non-null object
7 MeetName 8482 non-null object
dtypes: int64(1), object(7)
memory usage: 530.2+ KB
powerlifting_competitors.head()
MeetID | Name | Sex | Equipment | Age | Division | BodyweightKg | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 0 | Angie Belk Terry | F | Wraps | 47.0 | Mst 45-49 | 59.60 | 60 | NaN | 47.63 | NaN | 20.41 | NaN | 70.31 | 138.35 | 1 | 155.05 |
1 | 0 | Dawn Bogart | F | Single-ply | 42.0 | Mst 40-44 | 58.51 | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
2 | 0 | Dawn Bogart | F | Single-ply | 42.0 | Open Senior | 58.51 | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
3 | 0 | Dawn Bogart | F | Raw | 42.0 | Open Senior | 58.51 | 60 | NaN | NaN | NaN | 95.25 | NaN | NaN | 95.25 | 1 | 108.29 |
4 | 0 | Destiny Dula | F | Raw | 18.0 | Teen 18-19 | 63.68 | 67.5 | NaN | NaN | NaN | 31.75 | NaN | 90.72 | 122.47 | 1 | 130.47 |
powerlifting_competitors.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386414 entries, 0 to 386413
Data columns (total 17 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MeetID 386414 non-null int64
1 Name 386414 non-null object
2 Sex 386414 non-null object
3 Equipment 386414 non-null object
4 Age 147147 non-null float64
5 Division 370571 non-null object
6 BodyweightKg 384012 non-null float64
7 WeightClassKg 382602 non-null object
8 Squat4Kg 1243 non-null float64
9 BestSquatKg 298071 non-null float64
10 Bench4Kg 1962 non-null float64
11 BestBenchKg 356364 non-null float64
12 Deadlift4Kg 2800 non-null float64
13 BestDeadliftKg 317847 non-null float64
14 TotalKg 363237 non-null float64
15 Place 385322 non-null object
16 Wilks 362194 non-null float64
dtypes: float64(10), int64(1), object(6)
memory usage: 50.1+ MB
powerlifting_competitors.describe()
MeetID | Age | BodyweightKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|
count | 386414.000000 | 147147.000000 | 384012.000000 | 1243.000000 | 298071.000000 | 1962.000000 | 356364.000000 | 2800.000000 | 317847.000000 | 363237.000000 | 362194.000000 |
mean | 5143.015804 | 31.668237 | 86.934912 | 107.036404 | 176.569941 | 45.722905 | 118.347509 | 113.597193 | 195.040633 | 424.000249 | 301.080601 |
std | 2552.099838 | 12.900342 | 23.140843 | 166.976620 | 69.222785 | 151.668221 | 54.848850 | 170.201657 | 61.580675 | 196.355147 | 116.360396 |
min | 0.000000 | 5.000000 | 15.880000 | -440.500000 | -477.500000 | -360.000000 | -522.500000 | -461.000000 | -410.000000 | 11.000000 | 13.730000 |
25% | 2979.000000 | 22.000000 | 70.300000 | 87.500000 | 127.500000 | -90.000000 | 79.380000 | 110.000000 | 147.500000 | 272.160000 | 237.380000 |
50% | 5960.000000 | 28.000000 | 83.200000 | 145.000000 | 174.630000 | 90.250000 | 115.000000 | 157.500000 | 195.000000 | 424.110000 | 319.660000 |
75% | 7175.000000 | 39.000000 | 100.000000 | 212.500000 | 217.720000 | 167.500000 | 150.000000 | 219.990000 | 238.140000 | 565.000000 | 379.290000 |
max | 8481.000000 | 95.000000 | 242.400000 | 450.000000 | 573.790000 | 378.750000 | 488.500000 | 418.000000 | 460.400000 | 1365.310000 | 779.380000 |
Both tables include references to a MeetID
, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.
left = powerlifting_meets.set_index('MeetID')
right = powerlifting_competitors.set_index('MeetID')
powerlifting_combined = left.join(right)
# Check your answer
q4.check()
<IPython.core.display.Javascript object>
Correct
powerlifting_combined.head()
MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName | Name | Sex | Equipment | ... | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MeetID | |||||||||||||||||||||
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Angie Belk Terry | F | Wraps | ... | 60 | NaN | 47.63 | NaN | 20.41 | NaN | 70.31 | 138.35 | 1 | 155.05 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Single-ply | ... | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Single-ply | ... | 60 | NaN | 142.88 | NaN | 95.25 | NaN | 163.29 | 401.42 | 1 | 456.38 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Dawn Bogart | F | Raw | ... | 60 | NaN | NaN | NaN | 95.25 | NaN | NaN | 95.25 | 1 | 108.29 |
0 | 365strong/1601 | 365Strong | 2016-10-29 | USA | NC | Charlotte | 2016 Junior & Senior National Powerlifting Cha... | Destiny Dula | F | Raw | ... | 67.5 | NaN | NaN | NaN | 31.75 | NaN | 90.72 | 122.47 | 1 | 130.47 |
5 rows × 23 columns
powerlifting_combined.tail()
MeetPath | Federation | Date | MeetCountry | MeetState | MeetTown | MeetName | Name | Sex | Equipment | ... | WeightClassKg | Squat4Kg | BestSquatKg | Bench4Kg | BestBenchKg | Deadlift4Kg | BestDeadliftKg | TotalKg | Place | Wilks | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
MeetID | |||||||||||||||||||||
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | William Barabas | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 347.5 | 347.5 | 2 | 202.60 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Justin Zottl | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 322.5 | 322.5 | 3 | 185.77 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Jake Anderson | M | Multi-ply | ... | 125 | NaN | NaN | NaN | NaN | NaN | 367.5 | 367.5 | 1 | 211.17 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Jeff Bumanglag | M | Multi-ply | ... | 140 | NaN | NaN | NaN | NaN | NaN | 320.0 | 320.0 | 3 | 181.85 |
8481 | xpc/2017-finals | XPC | 2017-03-03 | USA | OH | Columbus | 2017 XPC Finals | Shane Hammock | M | Multi-ply | ... | 140 | NaN | NaN | NaN | NaN | NaN | 362.5 | 362.5 | 2 | 205.18 |
5 rows × 23 columns
powerlifting_combined.columns
Index(['MeetPath', 'Federation', 'Date', 'MeetCountry', 'MeetState',
'MeetTown', 'MeetName', 'Name', 'Sex', 'Equipment', 'Age', 'Division',
'BodyweightKg', 'WeightClassKg', 'Squat4Kg', 'BestSquatKg', 'Bench4Kg',
'BestBenchKg', 'Deadlift4Kg', 'BestDeadliftKg', 'TotalKg', 'Place',
'Wilks'],
dtype='object')
powerlifting_combined.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 386414 entries, 0 to 8481
Data columns (total 23 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 MeetPath 386414 non-null object
1 Federation 386414 non-null object
2 Date 386414 non-null object
3 MeetCountry 386414 non-null object
4 MeetState 314271 non-null object
5 MeetTown 292414 non-null object
6 MeetName 386414 non-null object
7 Name 386414 non-null object
8 Sex 386414 non-null object
9 Equipment 386414 non-null object
10 Age 147147 non-null float64
11 Division 370571 non-null object
12 BodyweightKg 384012 non-null float64
13 WeightClassKg 382602 non-null object
14 Squat4Kg 1243 non-null float64
15 BestSquatKg 298071 non-null float64
16 Bench4Kg 1962 non-null float64
17 BestBenchKg 356364 non-null float64
18 Deadlift4Kg 2800 non-null float64
19 BestDeadliftKg 317847 non-null float64
20 TotalKg 363237 non-null float64
21 Place 385322 non-null object
22 Wilks 362194 non-null float64
dtypes: float64(10), object(13)
memory usage: 70.8+ MB
#q4.hint()
#q4.solution()
Congratulations!
You’ve finished the Pandas micro-course. Many data scientists feel efficiency with Pandas is the most useful and practical skill they have, because it allows you to progress quickly in any project you have.
If you’d like to apply your new skills to examining geospatial data, you’re encouraged to check out our Geospatial Analysis micro-course.
You can also take advantage of your Pandas skills by entering a Kaggle Competition or by answering a question you find interesting using Kaggle Datasets.
Have questions or comments? Visit the course discussion forum to chat with other learners.
Leave a comment