Pandas Renaming and Combining

11 minute read

This notebook is an exercise in the Pandas course. You can reference the tutorial at this link.

Introduction

Run the following cell to load your data and some utility functions.

import pandas as pd

reviews = pd.read_csv("../input/wine-reviews/winemag-data-130k-v2.csv", index_col=0)

from learntools.core import binder; binder.bind(globals())
from learntools.pandas.renaming_and_combining import *
print("Setup complete.")

Setup complete.

Exercises

View the first several lines of your data by running the cell below:

reviews.head()

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

1.

region_1 and region_2 are pretty uninformative names for locale columns in the dataset. Create a copy of reviews with these columns renamed to region and locale, respectively.

# Your code here
renamed = reviews.rename(columns = {'region_1':'region', 'region_2':'locale'})

# Check your answer
q1.check()

<IPython.core.display.Javascript object>

Correct

renamed.head()

	country	description	designation	points	price	province	region	locale	taster_name	taster_twitter_handle	title	variety	winery
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

#q1.hint()
#q1.solution()

2.

Set the index name in the dataset to wines.

reindexed = reviews.rename_axis("wines", axis='rows')

# Check your answer
q2.check()

<IPython.core.display.Javascript object>

Correct

reindexed.head()

	country	description	designation	points	price	province	region_1	region_2	taster_name	taster_twitter_handle	title	variety	winery
wines
0	Italy	Aromas include tropical fruit, broom, brimston...	Vulkà Bianco	87	NaN	Sicily & Sardinia	Etna	NaN	Kerin O’Keefe	@kerinokeefe	Nicosia 2013 Vulkà Bianco (Etna)	White Blend	Nicosia
1	Portugal	This is ripe and fruity, a wine that is smooth...	Avidagos	87	15.0	Douro	NaN	NaN	Roger Voss	@vossroger	Quinta dos Avidagos 2011 Avidagos Red (Douro)	Portuguese Red	Quinta dos Avidagos
2	US	Tart and snappy, the flavors of lime flesh and...	NaN	87	14.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Rainstorm 2013 Pinot Gris (Willamette Valley)	Pinot Gris	Rainstorm
3	US	Pineapple rind, lemon pith and orange blossom ...	Reserve Late Harvest	87	13.0	Michigan	Lake Michigan Shore	NaN	Alexander Peartree	NaN	St. Julian 2013 Reserve Late Harvest Riesling ...	Riesling	St. Julian
4	US	Much like the regular bottling from 2012, this...	Vintner's Reserve Wild Child Block	87	65.0	Oregon	Willamette Valley	Willamette Valley	Paul Gregutt	@paulgwine	Sweet Cheeks 2012 Vintner's Reserve Wild Child...	Pinot Noir	Sweet Cheeks

#q2.hint()
#q2.solution()

3.

The Things on Reddit dataset includes product links from a selection of top-ranked forums (“subreddits”) on reddit.com. Run the cell below to load a dataframe of products mentioned on the /r/gaming subreddit and another dataframe for products mentioned on the r//movies subreddit.

gaming_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/g/gaming.csv")
gaming_products['subreddit'] = "r/gaming"
movie_products = pd.read_csv("../input/things-on-reddit/top-things/top-things/reddits/m/movies.csv")
movie_products['subreddit'] = "r/movies"

Create a DataFrame of products mentioned on either subreddit.

gaming_products.head()

	name	category	amazon_link	total_mentions	subreddit_mentions	subreddit
0	BOOMco Halo Covenant Needler Blaster	Toys & Games	https://www.amazon.com/BOOMco-Halo-Covenant-Ne...	4.0	4	r/gaming
1	Raspberry PI 3 Model B 1.2GHz 64-bit quad-core...	Electronics	https://www.amazon.com/Raspberry-Model-A1-2GHz...	19.0	3	r/gaming
2	CanaKit 5V 2.5A Raspberry Pi 3 Power Supply / ...	Electronics	https://www.amazon.com/CanaKit-Raspberry-Suppl...	7.0	3	r/gaming
3	Panasonic K-KJ17MCA4BA Advanced Individual Cel...	Electronics	https://www.amazon.com/Panasonic-Advanced-Indi...	29.0	2	r/gaming
4	Mayflash GameCube Controller Adapter for Wii U...	Electronics	https://www.amazon.com/GameCube-Controller-Ada...	24.0	2	r/gaming

gaming_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 493 entries, 0 to 492
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                493 non-null    object 
 1   category            493 non-null    object 
 2   amazon_link         493 non-null    object 
 3   total_mentions      491 non-null    float64
 4   subreddit_mentions  493 non-null    int64  
 5   subreddit           493 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 23.2+ KB

gaming_products.describe()

	total_mentions	subreddit_mentions
count	491.000000	493.000000
mean	2.853360	1.064909
std	4.504293	0.284858
min	1.000000	1.000000
25%	1.000000	1.000000
50%	1.000000	1.000000
75%	3.000000	1.000000
max	46.000000	4.000000

movie_products.head()

	name	category	amazon_link	total_mentions	subreddit_mentions	subreddit
0	Marvel Cinematic Universe: Phase One - Avenger...	Movies & TV	https://www.amazon.com/Marvel-Cinematic-Univer...	4.0	3	r/movies
1	On Stranger Tides	Books	https://www.amazon.com/Stranger-Tides-Tim-Powe...	3.0	3	r/movies
2	Superintelligence: Paths, Dangers, Strategies	Books	https://www.amazon.com/Superintelligence-Dange...	7.0	2	r/movies
3	The Secret History of Star Wars	Books	https://www.amazon.com/Secret-History-Star-War...	4.0	2	r/movies
4	2D Glasses 4 Pack - Turns 3D movies back into ...	Electronics	https://www.amazon.com/gp/product/B00K9E7GCC	3.0	2	r/movies

movie_products.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 303 entries, 0 to 302
Data columns (total 6 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   name                303 non-null    object 
 1   category            303 non-null    object 
 2   amazon_link         303 non-null    object 
 3   total_mentions      302 non-null    float64
 4   subreddit_mentions  303 non-null    int64  
 5   subreddit           303 non-null    object 
dtypes: float64(1), int64(1), object(4)
memory usage: 14.3+ KB

movie_products.describe()

	total_mentions	subreddit_mentions
count	302.000000	303.000000
mean	1.470199	1.046205
std	1.131300	0.239710
min	1.000000	1.000000
25%	1.000000	1.000000
50%	1.000000	1.000000
75%	1.000000	1.000000
max	8.000000	3.000000

combined_products = pd.concat([gaming_products, movie_products])

# Check your answer
q3.check()

<IPython.core.display.Javascript object>

Correct

#q3.hint()
#q3.solution()

4.

The Powerlifting Database dataset on Kaggle includes one CSV table for powerlifting meets and a separate one for powerlifting competitors. Run the cell below to load these datasets into dataframes:

powerlifting_meets = pd.read_csv("../input/powerlifting-database/meets.csv")
powerlifting_competitors = pd.read_csv("../input/powerlifting-database/openpowerlifting.csv")

powerlifting_meets.head()

	MeetID	MeetPath	Federation	Date	MeetCountry	MeetState	MeetTown	MeetName
0	0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...
1	1	365strong/1602	365Strong	2016-11-19	USA	MO	Ozark	Thanksgiving Powerlifting Classic
2	2	365strong/1603	365Strong	2016-07-09	USA	NC	Charlotte	Charlotte Europa Games
3	3	365strong/1604	365Strong	2016-06-11	USA	SC	Rock Hill	Carolina Cup Push Pull Challenge
4	4	365strong/1605	365Strong	2016-04-10	USA	SC	Rock Hill	Eastern USA Challenge

powerlifting_meets.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8482 entries, 0 to 8481
Data columns (total 8 columns):
 #   Column       Non-Null Count  Dtype 
---  ------       --------------  ----- 
 0   MeetID       8482 non-null   int64 
 1   MeetPath     8482 non-null   object
 2   Federation   8482 non-null   object
 3   Date         8482 non-null   object
 4   MeetCountry  8482 non-null   object
 5   MeetState    5496 non-null   object
 6   MeetTown     6973 non-null   object
 7   MeetName     8482 non-null   object
dtypes: int64(1), object(7)
memory usage: 530.2+ KB

powerlifting_competitors.head()

	Name	Sex	Equipment	Age	Division	BodyweightKg	WeightClassKg	Squat4Kg	BestSquatKg	Bench4Kg	BestBenchKg	Deadlift4Kg	BestDeadliftKg	TotalKg	Place	Wilks
0	Angie Belk Terry	F	Wraps	47.0	Mst 45-49	59.60	60	NaN	47.63	NaN	20.41	NaN	70.31	138.35	1	155.05
1	Dawn Bogart	F	Single-ply	42.0	Mst 40-44	58.51	60	NaN	142.88	NaN	95.25	NaN	163.29	401.42	1	456.38
2	Dawn Bogart	F	Single-ply	42.0	Open Senior	58.51	60	NaN	142.88	NaN	95.25	NaN	163.29	401.42	1	456.38
3	Dawn Bogart	F	Raw	42.0	Open Senior	58.51	60	NaN	NaN	NaN	95.25	NaN	NaN	95.25	1	108.29
4	Destiny Dula	F	Raw	18.0	Teen 18-19	63.68	67.5	NaN	NaN	NaN	31.75	NaN	90.72	122.47	1	130.47

powerlifting_competitors.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 386414 entries, 0 to 386413
Data columns (total 17 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 MeetID          386414 non-null  int64  
 Name            386414 non-null  object 
 Sex             386414 non-null  object 
 Equipment       386414 non-null  object 
 Age             147147 non-null  float64
 Division        370571 non-null  object 
 BodyweightKg    384012 non-null  float64
 WeightClassKg   382602 non-null  object 
 Squat4Kg        1243 non-null    float64
 BestSquatKg     298071 non-null  float64
Bench4Kg        1962 non-null    float64
BestBenchKg     356364 non-null  float64
Deadlift4Kg     2800 non-null    float64
BestDeadliftKg  317847 non-null  float64
TotalKg         363237 non-null  float64
Place           385322 non-null  object 
Wilks           362194 non-null  float64
dtypes: float64(10), int64(1), object(6)
memory usage: 50.1+ MB

powerlifting_competitors.describe()

	MeetID	Age	BodyweightKg	Squat4Kg	BestSquatKg	Bench4Kg	BestBenchKg	Deadlift4Kg	BestDeadliftKg	TotalKg	Wilks
count	386414.000000	147147.000000	384012.000000	1243.000000	298071.000000	1962.000000	356364.000000	2800.000000	317847.000000	363237.000000	362194.000000
mean	5143.015804	31.668237	86.934912	107.036404	176.569941	45.722905	118.347509	113.597193	195.040633	424.000249	301.080601
std	2552.099838	12.900342	23.140843	166.976620	69.222785	151.668221	54.848850	170.201657	61.580675	196.355147	116.360396
min	0.000000	5.000000	15.880000	-440.500000	-477.500000	-360.000000	-522.500000	-461.000000	-410.000000	11.000000	13.730000
25%	2979.000000	22.000000	70.300000	87.500000	127.500000	-90.000000	79.380000	110.000000	147.500000	272.160000	237.380000
50%	5960.000000	28.000000	83.200000	145.000000	174.630000	90.250000	115.000000	157.500000	195.000000	424.110000	319.660000
75%	7175.000000	39.000000	100.000000	212.500000	217.720000	167.500000	150.000000	219.990000	238.140000	565.000000	379.290000
max	8481.000000	95.000000	242.400000	450.000000	573.790000	378.750000	488.500000	418.000000	460.400000	1365.310000	779.380000

Both tables include references to a MeetID, a unique key for each meet (competition) included in the database. Using this, generate a dataset combining the two tables into one.

left = powerlifting_meets.set_index('MeetID')
right = powerlifting_competitors.set_index('MeetID')

powerlifting_combined = left.join(right)

# Check your answer
q4.check()

<IPython.core.display.Javascript object>

Correct

powerlifting_combined.head()

	MeetPath	Federation	Date	MeetCountry	MeetState	MeetTown	MeetName	Name	Sex	Equipment	...	WeightClassKg	Squat4Kg	BestSquatKg	Bench4Kg	BestBenchKg	Deadlift4Kg	BestDeadliftKg	TotalKg	Place	Wilks
MeetID
0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...	Angie Belk Terry	F	Wraps	...	60	NaN	47.63	NaN	20.41	NaN	70.31	138.35	1	155.05
0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...	Dawn Bogart	F	Single-ply	...	60	NaN	142.88	NaN	95.25	NaN	163.29	401.42	1	456.38
0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...	Dawn Bogart	F	Single-ply	...	60	NaN	142.88	NaN	95.25	NaN	163.29	401.42	1	456.38
0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...	Dawn Bogart	F	Raw	...	60	NaN	NaN	NaN	95.25	NaN	NaN	95.25	1	108.29
0	365strong/1601	365Strong	2016-10-29	USA	NC	Charlotte	2016 Junior & Senior National Powerlifting Cha...	Destiny Dula	F	Raw	...	67.5	NaN	NaN	NaN	31.75	NaN	90.72	122.47	1	130.47

5 rows × 23 columns

powerlifting_combined.tail()

	MeetPath	Federation	Date	MeetCountry	MeetState	MeetTown	MeetName	Name	Sex	Equipment	...	WeightClassKg	Squat4Kg	BestSquatKg	Bench4Kg	BestBenchKg	Deadlift4Kg	BestDeadliftKg	TotalKg	Place	Wilks
MeetID
8481	xpc/2017-finals	XPC	2017-03-03	USA	OH	Columbus	2017 XPC Finals	William Barabas	M	Multi-ply	...	125	NaN	NaN	NaN	NaN	NaN	347.5	347.5	2	202.60
8481	xpc/2017-finals	XPC	2017-03-03	USA	OH	Columbus	2017 XPC Finals	Justin Zottl	M	Multi-ply	...	125	NaN	NaN	NaN	NaN	NaN	322.5	322.5	3	185.77
8481	xpc/2017-finals	XPC	2017-03-03	USA	OH	Columbus	2017 XPC Finals	Jake Anderson	M	Multi-ply	...	125	NaN	NaN	NaN	NaN	NaN	367.5	367.5	1	211.17
8481	xpc/2017-finals	XPC	2017-03-03	USA	OH	Columbus	2017 XPC Finals	Jeff Bumanglag	M	Multi-ply	...	140	NaN	NaN	NaN	NaN	NaN	320.0	320.0	3	181.85
8481	xpc/2017-finals	XPC	2017-03-03	USA	OH	Columbus	2017 XPC Finals	Shane Hammock	M	Multi-ply	...	140	NaN	NaN	NaN	NaN	NaN	362.5	362.5	2	205.18

5 rows × 23 columns

powerlifting_combined.columns

Index(['MeetPath', 'Federation', 'Date', 'MeetCountry', 'MeetState',
       'MeetTown', 'MeetName', 'Name', 'Sex', 'Equipment', 'Age', 'Division',
       'BodyweightKg', 'WeightClassKg', 'Squat4Kg', 'BestSquatKg', 'Bench4Kg',
       'BestBenchKg', 'Deadlift4Kg', 'BestDeadliftKg', 'TotalKg', 'Place',
       'Wilks'],
      dtype='object')

powerlifting_combined.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 386414 entries, 0 to 8481
Data columns (total 23 columns):
 #   Column          Non-Null Count   Dtype  
---  ------          --------------   -----  
 MeetPath        386414 non-null  object 
 Federation      386414 non-null  object 
 Date            386414 non-null  object 
 MeetCountry     386414 non-null  object 
 MeetState       314271 non-null  object 
 MeetTown        292414 non-null  object 
 MeetName        386414 non-null  object 
 Name            386414 non-null  object 
 Sex             386414 non-null  object 
 Equipment       386414 non-null  object 
Age             147147 non-null  float64
Division        370571 non-null  object 
BodyweightKg    384012 non-null  float64
WeightClassKg   382602 non-null  object 
Squat4Kg        1243 non-null    float64
BestSquatKg     298071 non-null  float64
Bench4Kg        1962 non-null    float64
BestBenchKg     356364 non-null  float64
Deadlift4Kg     2800 non-null    float64
BestDeadliftKg  317847 non-null  float64
TotalKg         363237 non-null  float64
Place           385322 non-null  object 
Wilks           362194 non-null  float64
dtypes: float64(10), object(13)
memory usage: 70.8+ MB

#q4.hint()
#q4.solution()

Congratulations!

You’ve finished the Pandas micro-course. Many data scientists feel efficiency with Pandas is the most useful and practical skill they have, because it allows you to progress quickly in any project you have.

If you’d like to apply your new skills to examining geospatial data, you’re encouraged to check out our Geospatial Analysis micro-course.

You can also take advantage of your Pandas skills by entering a Kaggle Competition or by answering a question you find interesting using Kaggle Datasets.

Have questions or comments? Visit the course discussion forum to chat with other learners.

Share on

Twitter Facebook LinkedIn

안철현(Charles An)

Pandas Renaming and Combining

Introduction

Exercises

1.

2.

3.

4.

Congratulations!

Share on

Leave a comment

You may also enjoy

Clustering with K-Means

Creating Featues

Tips for Creating Features

Mutual Information