Scatter Plots

8 minute read

This notebook is an exercise in the Data Visualization course. You can reference the tutorial at this link.

In this exercise, you will use your new knowledge to propose a solution to a real-world scenario. To succeed, you will need to import data into Python, answer questions using the data, and generate scatter plots to understand patterns in the data.

Scenario

You work for a major candy producer, and your goal is to write a report that your company can use to guide the design of its next product. Soon after starting your research, you stumble across this very interesting dataset containing results from a fun survey to crowdsource favorite candies.

Setup

Run the next cell to import and configure the Python libraries that you need to complete the exercise.

import pandas as pd
pd.plotting.register_matplotlib_converters()
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
print("Setup Complete")

Setup Complete

The questions below will give you feedback on your work. Run the following cell to set up our feedback system.

# Set up code checking
import os
if not os.path.exists("../input/candy.csv"):
    os.symlink("../input/data-for-datavis/candy.csv", "../input/candy.csv") 
from learntools.core import binder
binder.bind(globals())
from learntools.data_viz_to_coder.ex4 import *
print("Setup Complete")

Setup Complete

Step 1: Load the Data

Read the candy data file into candy_data. Use the "id" column to label the rows.

# Path of the file to read
candy_filepath = "../input/candy.csv"

# Fill in the line below to read the file into a variable candy_data
candy_data = pd.read_csv(candy_filepath, index_col="id")

# Run the line below with no changes to check that you've loaded the data correctly
step_1.check()

<IPython.core.display.Javascript object>

Correct

# Lines below will give you a hint or solution code
#step_1.hint()
#step_1.solution()

Step 2: Review the data

Use a Python command to print the first five rows of the data.

# Print the first five rows of the data
candy_data.head()

	competitorname	chocolate	fruity	caramel	peanutyalmondy	nougat	crispedricewafer	hard	bar	pluribus	sugarpercent	pricepercent	winpercent
id
0	100 Grand	Yes	No	Yes	No	No	Yes	No	Yes	No	0.732	0.860	66.971725
1	3 Musketeers	Yes	No	No	No	Yes	No	No	Yes	No	0.604	0.511	67.602936
2	Air Heads	No	Yes	No	No	No	No	No	No	No	0.906	0.511	52.341465
3	Almond Joy	Yes	No	No	Yes	No	No	No	Yes	No	0.465	0.767	50.347546
4	Baby Ruth	Yes	No	Yes	Yes	Yes	No	No	Yes	No	0.604	0.767	56.914547

candy_data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 83 entries, 0 to 82
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype  
---  ------            --------------  -----  
 0   competitorname    83 non-null     object 
 1   chocolate         83 non-null     object 
 2   fruity            83 non-null     object 
 3   caramel           83 non-null     object 
 4   peanutyalmondy    83 non-null     object 
 5   nougat            83 non-null     object 
 6   crispedricewafer  83 non-null     object 
 7   hard              83 non-null     object 
 8   bar               83 non-null     object 
 9   pluribus          83 non-null     object 
 10  sugarpercent      83 non-null     float64
 11  pricepercent      83 non-null     float64
 12  winpercent        83 non-null     float64
dtypes: float64(3), object(10)
memory usage: 9.1+ KB

The dataset contains 83 rows, where each corresponds to a different candy bar. There are 13 columns:

'competitorname' contains the name of the candy bar.
the next 9 columns (from 'chocolate' to 'pluribus') describe the candy. For instance, rows with chocolate candies have "Yes" in the 'chocolate' column (and candies without chocolate have "No" in the same column).
'sugarpercent' provides some indication of the amount of sugar, where higher values signify higher sugar content.
'pricepercent' shows the price per unit, relative to the other candies in the dataset.
'winpercent' is calculated from the survey results; higher values indicate that the candy was more popular with survey respondents.

Use the first five rows of the data to answer the questions below.

# Fill in the line below: Which candy was more popular with survey respondents:
# '3 Musketeers' or 'Almond Joy'?  (Please enclose your answer in single quotes.)
more_popular = '3 Musketeers'

# Fill in the line below: Which candy has higher sugar content: 'Air Heads'
# or 'Baby Ruth'? (Please enclose your answer in single quotes.)
more_sugar = 'Air Heads'

# Check your answers
step_2.check()

<IPython.core.display.Javascript object>

Correct

# Lines below will give you a hint or solution code
#step_2.hint()
#step_2.solution()

Step 3: The role of sugar

Do people tend to prefer candies with higher sugar content?

Part A

Create a scatter plot that shows the relationship between 'sugarpercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Don’t add a regression line just yet – you’ll do that in the next step!

# Scatter plot showing the relationship between 'sugarpercent' and 'winpercent'
sns.scatterplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

# Check your answer
step_3.a.check()

<IPython.core.display.Javascript object>

Correct

png

# Lines below will give you a hint or solution code
#step_3.a.hint()
#step_3.a.solution_plot()

Part B

Does the scatter plot show a strong correlation between the two variables? If so, are candies with more sugar relatively more or less popular with the survey respondents?

step_3.b.hint()

<IPython.core.display.Javascript object>

Hint: Compare candies with higher sugar content (on the right side of the chart) to candies with lower sugar content (on the left side of the chart). Is one group clearly more popular than the other?

# Check your answer (Run this code cell to receive credit!)
step_3.b.solution()

<IPython.core.display.Javascript object>

Solution: The scatter plot does not show a strong correlation between the two variables. Since there is no clear relationship between the two variables, this tells us that sugar content does not play a strong role in candy popularity.

Step 4: Take a closer look

Part A

Create the same scatter plot you created in Step 3, but now with a regression line!

# Scatter plot w/ regression line showing the relationship between 'sugarpercent' and 'winpercent'
sns.regplot(x=candy_data['sugarpercent'], y=candy_data['winpercent'])

# Check your answer
step_4.a.check()

<IPython.core.display.Javascript object>

Correct

png

# Lines below will give you a hint or solution code
#step_4.a.hint()
#step_4.a.solution_plot()

Part B

According to the plot above, is there a slight correlation between 'winpercent' and 'sugarpercent'? What does this tell you about the candy that people tend to prefer?

step_4.b.hint()

<IPython.core.display.Javascript object>

Hint: Does the regression line have a positive or negative slope?

# Check your answer (Run this code cell to receive credit!)
step_4.b.solution()

<IPython.core.display.Javascript object>

Solution: Since the regression line has a slightly positive slope, this tells us that there is a slightly positive correlation between 'winpercent' and 'sugarpercent'. Thus, people have a slight preference for candies containing relatively more sugar.

Step 5: Chocolate!

In the code cell below, create a scatter plot to show the relationship between 'pricepercent' (on the horizontal x-axis) and 'winpercent' (on the vertical y-axis). Use the 'chocolate' column to color-code the points. Don’t add any regression lines just yet – you’ll do that in the next step!

# Scatter plot showing the relationship between 'pricepercent', 'winpercent', and 'chocolate'
sns.scatterplot(x=candy_data['pricepercent'], y=candy_data['winpercent'], hue=candy_data['chocolate'])

# Check your answer
step_5.check()

<IPython.core.display.Javascript object>

Correct

png

# Lines below will give you a hint or solution code
#step_5.hint()
#step_5.solution_plot()

Can you see any interesting patterns in the scatter plot? We’ll investigate this plot further by adding regression lines in the next step!

Step 6: Investigate chocolate

Part A

Create the same scatter plot you created in Step 5, but now with two regression lines, corresponding to (1) chocolate candies and (2) candies without chocolate.

# Color-coded scatter plot w/ regression lines
sns.lmplot(x='pricepercent', y='winpercent', hue='chocolate', data=candy_data)

# Check your answer
step_6.a.check()

<IPython.core.display.Javascript object>

Correct

png

# Lines below will give you a hint or solution code
#step_6.a.hint()
#step_6.a.solution_plot()

Part B

Using the regression lines, what conclusions can you draw about the effects of chocolate and price on candy popularity?

step_6.b.hint()

<IPython.core.display.Javascript object>

Hint: Look at each regression line - do you notice a positive or negative slope?

# Check your answer (Run this code cell to receive credit!)
step_6.b.solution()

<IPython.core.display.Javascript object>

Solution: We’ll begin with the regression line for chocolate candies. Since this line has a slightly positive slope, we can say that more expensive chocolate candies tend to be more popular (than relatively cheaper chocolate candies). Likewise, since the regression line for candies without chocolate has a negative slope, we can say that if candies don’t contain chocolate, they tend to be more popular when they are cheaper. One important note, however, is that the dataset is quite small – so we shouldn’t invest too much trust in these patterns! To inspire more confidence in the results, we should add more candies to the dataset.

Step 7: Everybody loves chocolate.

Part A

Create a categorical scatter plot to highlight the relationship between 'chocolate' and 'winpercent'. Put 'chocolate' on the (horizontal) x-axis, and 'winpercent' on the (vertical) y-axis.

# Scatter plot showing the relationship between 'chocolate' and 'winpercent'
sns.swarmplot(x=candy_data['chocolate'], y=candy_data['winpercent'])

# Check your answer
step_7.a.check()

<IPython.core.display.Javascript object>

Correct

png

# Lines below will give you a hint or solution code
#step_7.a.hint()
#step_7.a.solution_plot()

Part B

You decide to dedicate a section of your report to the fact that chocolate candies tend to be more popular than candies without chocolate. Which plot is more appropriate to tell this story: the plot from Step 6, or the plot from Step 7?

step_7.b.hint()

<IPython.core.display.Javascript object>

Hint: Which plot communicates more information? In general, it’s good practice to use the simplest plot that tells the entire story of interest.

# Check your answer (Run this code cell to receive credit!)
step_7.b.solution()

<IPython.core.display.Javascript object>

Solution: In this case, the categorical scatter plot from Step 7 is the more appropriate plot. While both plots tell the desired story, the plot from Step 6 conveys far more information that could distract from the main point.

Keep going

Explore histograms and density plots.

Have questions or comments? Visit the course discussion forum to chat with other learners.

Share on

Twitter Facebook LinkedIn

안철현(Charles An)

Scatter Plots

Scenario

Setup

Step 1: Load the Data

Step 2: Review the data

Step 3: The role of sugar

Part A

Part B

Step 4: Take a closer look

Part A

Part B

Step 5: Chocolate!

Step 6: Investigate chocolate

Part A

Part B

Step 7: Everybody loves chocolate.

Part A

Part B

Keep going

Share on

Leave a comment

You may also enjoy

FastAPI 시작해봅시다

파사트 헤드라이트 램프(HID) 교체

Clustering with K-Means

Creating Featues