Graphical Displays


In this section, we use the dataset cargame.csv to demonstrate how to create basic graphical displays in Python. Below is the scenario for the data:
  • A toy company has four types of vehicles for sale: car, truck, racer, and taxi. To judge the quality of the different types of vehicles, a team records the following characteristics for each trial: the user's gender (Gender), the type of vehicle (Name of Car), the distance the vehicle is pulled back (Pull Distance), the distance the vehicle travels after being released (Distance), and the time for the trial in seconds (Time).

First, we use the pandas module to open and read the data. If you have not enabled the object inspector, which displays a function's documentation when the function is called, and would like to use this feature, please visit our Basic Syntax page.

import urllib.request  # the module for reading a URL (Python 3)
import numpy as np
import pandas as pd  # the module for reading the .csv file
import matplotlib.pyplot as plt  # the module we will use for several graphical displays
import scipy.stats as stats  # the module for probability plots
from mpl_toolkits.mplot3d import Axes3D  # the class that enables 3D plots
#### Read the dataset first, open the file directly by using the URL
cargame_online = urllib.request.urlopen("https://miamioh.instructure.com//files//3082832//download?download_frd=1")
cargame = pd.read_csv(cargame_online)  # read the file
cargame.head()  # display the first five rows of the dataset
cargame_group = cargame.groupby("Name of Car")  # group data by Name of Car for later use
total = cargame_group.sum(numeric_only=True)  # get the sum of each numeric column for each group
counts = cargame_group.size()  # get the size of each group
category = np.arange(len(counts))  # assign a number to each type of vehicle
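
To make the grouping step concrete, here is a minimal sketch on a small made-up table. The values and the toy_ names are hypothetical and are not taken from cargame.csv. It shows what size() and sum() return for each group, and that dividing the group totals by the group sizes gives the same result as the group means.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, NOT the real cargame.csv values
toy = pd.DataFrame({
    "Name of Car": ["car", "truck", "car", "taxi", "truck", "car"],
    "Distance": [40.0, 30.0, 60.0, 70.0, 34.0, 50.0],
})

toy_group = toy.groupby("Name of Car")      # groups are sorted alphabetically
toy_counts = toy_group.size()               # number of rows per vehicle type
toy_total = toy_group["Distance"].sum()     # total distance per vehicle type
toy_average = toy_total / toy_counts        # average distance per vehicle type
toy_category = np.arange(len(toy_counts))   # one integer position per group

print(toy_counts)
print(toy_average)
```

The integer positions in toy_category play the same role as category in the code above: they give each group a place on the axis, and the group names are attached later as tick labels.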

Bar Chart

A bar chart can be used to see the distribution of a categorical variable. In the following example, we study the number of vehicles in each group and use a bar chart to see the distribution.

plt.figure("Bar Chart Example 1")  # name the figure; this is optional for graphing
plt.bar(category, counts, align="center", color="purple", alpha=0.5)  # plot a bar chart; alpha is the transparency parameter (0.0 transparent through 1.0 opaque)
plt.title("Distribution of Vehicles", fontsize=16, color="blue")  # set title
plt.xticks(category, ("Car", "Racer", "Taxi", "Truck"), fontsize=14, color="blue")  # set xticks
plt.ylabel("Counts", fontsize = 14, color="blue")  # set label for y axis
plt.show()  # show the plot

bar chart showing counts of cars is second largest at around 83, racers around 75, taxi around 89, and trucks around 75

Note: The y-axis can also be changed to represent the relative frequency. Can you figure out how to do this?
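
One possible answer, as a minimal sketch: divide each count by the total before plotting. The counts below are hypothetical round numbers chosen for illustration, not the cargame values, and demo_counts is an assumed stand-in for the counts Series.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical counts per vehicle type (stand-in for the real counts)
demo_counts = pd.Series({"Car": 20, "Racer": 30, "Taxi": 10, "Truck": 40})

rel_freq = demo_counts / demo_counts.sum()   # each count divided by the total
positions = np.arange(len(rel_freq))

plt.figure("Relative Frequency Sketch")
plt.bar(positions, rel_freq, align="center", color="purple", alpha=0.5)
plt.xticks(positions, rel_freq.index)
plt.ylabel("Relative Frequency")
plt.show()
```

The bars keep the same shape as in the count version; only the scale of the y-axis changes, and the bar heights now add up to one.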

In the following example, we study the average distance covered by each type of vehicle and create a horizontal bar chart by using the Python function plt.barh().

plt.figure("Bar Chart Example 2")
average_D = total["Distance"]/counts
plt.barh(category, average_D, align="center", color = "green", alpha = 0.3)
plt.xlabel("Average Distance (inches)")
plt.yticks(category, ("Car", "Racer", "Taxi", "Truck"))
plt.show()

horizontal bar chart of average distance in inches on the x-axis. truck at the top with average distance of around 32 inches, taxi next around 65 inches, then racer around 75 inches and finally car at around 51 inches traveled on average

In the following example, we create a stacked bar chart of the total distance grouped by two variables: Name of Car and Gender. Here we use the plot() function in the module Pandas. In the legend method, we use two parameters: loc and ncol.

loc indicates the location of the legend; it can be an integer (0 to 10), a string, or a pair of floats.

ncol is an integer that sets the number of columns in the legend.
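
As a small illustration of these two parameters, the sketch below draws two made-up lines and places their legend at the top center in two columns. The data are hypothetical; in matplotlib the string "upper center" corresponds to the integer code 9.

```python
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
# Hypothetical values, for illustration only
ax.plot([1, 2, 3], [1, 4, 9], label="Female")
ax.plot([1, 2, 3], [2, 3, 5], label="Male")

# loc="upper center" is the string form of loc=9;
# ncol=2 lays the two legend entries side by side instead of stacking them
legend = ax.legend(loc="upper center", ncol=2)
plt.show()
```

Try ncol=1 or another loc value, such as "lower right", to see how the legend moves and reflows.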

# create a new data frame from the original dataset, you should think about why we need this
onlydistance = cargame[["Gender", "Name of Car", "Distance"]]
# calculate the total distance by groups
cargame_moregroup = onlydistance.groupby(["Name of Car", "Gender"]).sum()
# use stacked = False for a side-by-side bar chart 
myplot = cargame_moregroup.unstack().plot(kind="bar", stacked=True, title="Total Distance by Name of Car", figsize=(8, 6), rot=90, alpha=0.5)  # rot rotates the tick labels (in degrees), try rot=0
myplot.set_xlabel("Name of Car", fontsize=14)
myplot.set_ylabel("Total Distance (inches)", fontsize=14)
myplot.legend(["Female", "Male"], loc=9, ncol=2)  # set legend

stacked bar chart comparing female and male users of the vehicles. type of car shown on the x-axis with car showing around 30% of about 4300 total inches for male user, racer showing around 10% of about 5800 total inches for male users, taxi showing around 50% of about 5800 total inches for male users, and finally trucks showing around 50% of 2500 total inches covered for males

Tutorials for learning how to create Python bar charts can be found at matplotlib, PythonSpot, pyplot, Plotly, pandas, and seaborn. (You need to download the seaborn library first, but it has lots of good features and is highly recommended for professional data visualization.)

Pie Chart

Similarly, we can use a pie chart to see the distribution of Vehicles.

plt.figure("Pie Chart Example")
# set colors we would like to use
colors = ['lightgreen', 'gold', 'lightskyblue', 'lightcoral']  
# use autopct to display the percent value
plt.pie(counts, labels = ["Car", "Racer", "Taxi", "Truck"], colors = colors, autopct='%1.1f%%')  
plt.title("Distribution of Vehicles")
plt.show()

pie chart showing percentage of all vehicles is about the same with racer, car, truck, and taxi at 23.4%, 25.8%, 23.4%, and 27.4% respectively

Remark: We use autopct to display the percent value using Python string formatting. For example, autopct='%1.1f%%' means that each wedge's percentage is shown with one digit after the decimal point, followed by a percent sign (the doubled %% produces a literal %). Try autopct='%1.2f%%' or autopct='%1.1f' to see for yourself how it works.
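
The format strings can also be tried outside of a plot. The sketch below applies them to a single hypothetical percentage with Python's % operator, which is how matplotlib uses an autopct string for each wedge.

```python
# autopct strings use Python's printf-style (%) formatting
value = 25.8461  # hypothetical wedge percentage

one_decimal = '%1.1f%%' % value    # one decimal place, then a literal percent sign
two_decimals = '%1.2f%%' % value   # two decimal places, then a literal percent sign
no_sign = '%1.1f' % value          # one decimal place, no percent sign

print(one_decimal)    # 25.8%
print(two_decimals)   # 25.85%
print(no_sign)        # 25.8
```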

Tutorials for learning how to make Python pie charts can be found at matplotlib, PythonSpot, and Plotly.

Line Plot

The following example shows how to create a line plot using the average distance of each group.

plt.figure("Line Plot Example")
plt.plot(category, average_D, color="red", linestyle="--", linewidth=3)  # "--" draws a dashed line
plt.xticks(category, ("Car", "Racer", "Taxi", "Truck"))
plt.xlabel("Name of Car", fontsize=14, color = "blue")
plt.ylabel("Average Distance (inches)", fontsize=14, color = "blue")
plt.show()

Average distance in inches shown along the vertical axis, with actual values connected by a dashed line of car, to racer, to taxi, and finally truck which are depicted on the horizontal axis

Tutorials for learning Python line plots can be found at matplotlib, PythonSpot, and Plotly.

Scatterplot

Now, we would like to see the association between two variables: Pull Distance and Distance.

# number of trials in the dataset (325 here)
n = len(cargame)
# randomly select one number per trial for the colors; a single color would also work
colors = np.random.rand(n)
# randomly select the area of each dot for the scatterplot; a single marker size would also work
area = np.pi*(20*np.random.rand(n))
plt.figure("Scatter Plot Example")
plt.scatter(cargame["Pull Distance"], cargame["Distance"], s=area, c=colors, marker = "o")
plt.xlabel("Pull Distance", fontsize=14)
plt.ylabel("Distance", fontsize=14)
plt.show()

scatter plot with distance covered on the vertical axis and pull distance on the horizontal axis. the plot shows a fanning pattern starting close to zero zero and moves out towards unusual results of around 400 inches traveled with a pull distance of around 20 inches and around 190 inches traveled with a pull distance of about 41 inches. Most observations are clustered between pull distances of zero and 20 inches and covered distances of zero to 150 inches.

Tutorials for learning how to make Python scatter plots can be found at matplotlib and Plotly.

Histogram

In the following example, we use a histogram to study the distribution of Distance.

plt.figure("Histogram Example 1")
plt.hist(cargame["Distance"], bins=20, color='purple', density=False, alpha=0.5)  # set density = True for a probability distribution (older matplotlib versions called this parameter normed)
plt.title("Distribution of Distance")
plt.xlabel("Distance (inches)")
plt.ylabel("Frequency")
plt.show()

histogram with frequency on the vertical axis and ranging from zero to 80. The horizontal axis of distance covered in inches ranges from zero to almost 400 inches. each bin has a width of 20 inches. The first five bars represent the majority of the observations, ranging from zero to 100 inches and show frequencies of around 48, 75, 60, 58, and 50 vehicles. The next six bars show a dramatic decline in frequencies of around 17, 5, 2, 2, 1, and 1 vehicle. Then there is a gap in the histogram from 220 inches until a single count of an approximate distance of 300 inches and a final gap from this observation to a single count of one last vehicle at around 380 inches.

Sometimes, we may want to plot two histograms on the same figure, so we can easily compare the distributions of two quantitative variables. Below is an example using our previous histograms; however, this can be extended to multiple plots as well.

plt.figure("Histogram Example 2")
plt.hist(cargame["Pull Distance"], bins=5, color='green', density=True, alpha=0.5, label="Pull Distance")
plt.hist(cargame["Distance"], bins=20, color='purple', density=True, alpha=0.5, label="Distance")
# set density = True for a probability distribution (older matplotlib versions called this parameter normed)
plt.title("Distributions of Pull Distance and Distance")
plt.ylabel("Probability")
plt.legend()  # display the labels given to each histogram
plt.show()

frequency shown on the vertical axis and distance in inches on the horizontal axis. The green histogram of pull distance displays skinnier bin size of 5 inches and overlaps the purple histogram of traveled distance with a wider bin size of 20 inches. However, there is enough transparency in the histograms to see all of each histogram even where they overlap.
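
When a histogram is normalized to a probability distribution, the bar heights are rescaled so that the total area of the bars equals one. In current matplotlib this option is spelled density=True, and np.histogram accepts the same keyword. The sketch below checks the area numerically on hypothetical right-skewed data; the sample is randomly generated and is not the cargame data.

```python
import numpy as np

rng = np.random.default_rng(0)
sample = rng.exponential(scale=50.0, size=325)  # hypothetical right-skewed data

# density=True rescales the counts so that sum(height * bin width) == 1
heights, edges = np.histogram(sample, bins=20, density=True)
widths = np.diff(edges)
total_area = np.sum(heights * widths)

print(total_area)
```

Because each height is divided by the bin width, two histograms with different bin widths can be compared on the same axis, which is why both histograms in the overlaid figure are normalized.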

Tutorials for learning Python histograms can be found at matplotlib, PythonSpot, Plotly, and seaborn.

Boxplot

Here, we use a boxplot to examine the Time each trial took. We use the plot() function in the module Pandas and set patch_artist = True to fill the boxes with color. If we set notch = False instead, we will have a regular (rectangular) boxplot.

cargame[["Time"]].plot(kind="box", notch=True, patch_artist=True, color={'medians': 'blue', 'boxes': 'gold', 'whiskers': 'red'}, medianprops={'linestyle': '--', 'linewidth': 3})

Boxplot is displayed vertically. Box portion indicates Q1 and Q3 at roughly 2.5 and 4.5 with the median identified by a short horizontal black line located around 3 units. Whiskers extend above and below the box as a vertical dashed line ending at around 7.5 and zero units respectively. Further, there are approximately six outliers identified with plus symbols and are scattered from the top of the high end of the whisker up to the max value of 12 units shown on the vertical axis.

Now, we want to compare the distributions of Distance across the types of car. We use the boxplot() function in the module Pandas because it works well with grouping. To assign different colors to the boxes, we ask boxplot() to return the parts of the plot as a dictionary (return_type="dict") and then change the color of each box.

onlydistance = cargame[["Gender", "Name of Car", "Distance"]]
myboxplot = onlydistance.boxplot(notch = True, patch_artist=True, by="Name of Car", return_type="dict")  
# by = "Name of Car" means we want to plot by grouping "Name of Car"
colors = ["lightgreen", "pink", "lightskyblue", "tan"]  # set colors we want to use
# assign colors to boxes 
for box, color in zip(myboxplot["Distance"]["boxes"], colors):
    box.set_color(color)

Side by side box plots are positioned beside each other horizontally and represent the four types of vehicles - car, racer, taxi, and truck, in that order. Each boxplot is colored differently, in the order of green, red, blue, and brown. The vertical axis is automatically adjusted to allow the outliers for each vehicle type to be visible.

Furthermore, we study distributions of Distance in different car groups and genders. Again, we use the boxplot() function in the module Pandas.

onlydistance.boxplot(notch = True, patch_artist=True, by = ["Name of Car", "Gender"])

Similar to the general example of a side by side box plot except now there are eight box plots ordered first by vehicle type and then by gender. Another difference is that all boxplots are colored the same, each is blue.

Tutorials for learning to make boxplots in Python can be found at matplotlib, plotly, pandas, and seaborn.

Density Plot

Next, we would like to study the density plot of Distance. Here is a way to do it.

cargame[["Distance"]].plot(kind="density", color = "red")

a continuous red curve depicts the density of the distribution for distance. Though it closely resembles the shape of the histogram, the curve could be misunderstood since the smoothing of the observations into a curve makes it look like negative distances were possible.
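
The apparent negative distances come from the smoothing itself. A kernel density estimate places a small bell-shaped curve on every observation, so observations near zero spill some density below zero. The sketch below shows this with scipy's gaussian_kde on hypothetical non-negative data; the sample is randomly generated and is not the cargame data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
distances = rng.exponential(scale=50.0, size=325)  # hypothetical data, all >= 0

kde = stats.gaussian_kde(distances)                 # Gaussian kernel density estimate
density_below_zero = kde(np.array([-5.0]))[0]       # estimated density at x = -5
mass_below_zero = kde.integrate_box_1d(-np.inf, 0)  # estimated probability of x < 0

print(distances.min(), density_below_zero, mass_below_zero)
```

Even though no observation is negative, the estimated density at -5 is positive. This is a property of the estimate, not of the data.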

Tutorials for learning to make Python density plots can be found at seaborn.

QQ Plot

Assume we would like to compare the quantiles of the normal distribution with the observed values of Distance. Here, we need the function probplot() in the stats submodule of scipy (imported as stats in the code at the top of this section).

# set the distribution to be normal and the plot function is plt 
stats.probplot(cargame["Distance"], dist="norm", plot=plt)

 The automatically sorted values for the observed distances are depicted on the vertical axis and assumed quantiles of the normal distribution on the horizontal axis. There are solid blue dots representing the data and a solid red line as a reference to where the normal quantiles between -3 and 3 should land. Also, the R squared value for the linear relationship is depicted inside the plot as text.

It seems that the variable Distance is not normally distributed, which matches our earlier findings when inspecting the histogram of Distance.
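
The visual judgment can be backed up with numbers. When probplot() is called without the plot argument, it returns the theoretical quantiles, the sorted data, and a least-squares fit including the correlation r. The sketch below compares a hypothetical normal sample with a hypothetical right-skewed sample; both are randomly generated and neither is the cargame data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
normal_sample = rng.normal(loc=50.0, scale=10.0, size=325)  # hypothetical
skewed_sample = rng.exponential(scale=50.0, size=325)       # hypothetical

# Without plot=..., probplot only returns the numbers behind the figure:
# (theoretical quantiles, sorted data), (slope, intercept, r)
(theo_q, ordered), (slope, intercept, r_normal) = stats.probplot(normal_sample, dist="norm")
_, (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print(r_normal**2, r_skewed**2)
```

An r value close to one means the points fall near the reference line. The skewed sample gives a lower value, matching the curved pattern such data show in a Q-Q plot.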

# Q-Q plot comparing the quantiles of Pull Distance with those of Distance
# read the data as lists, so we can sort them
Pull_D = list(cargame["Pull Distance"])
D = list(cargame["Distance"])
Pull_D.sort()  # sort Pull Distance
D.sort()  # sort Distance
plt.plot(Pull_D, D, "o")
z = np.polyfit(Pull_D, D, 1)  # fit a straight reference line
p = np.poly1d(z)
plt.plot(Pull_D, p(Pull_D), "r--", linewidth=3)  # "r--" draws a red dashed line
plt.title("Q-Q plot", size=24)
plt.xlabel("Pull Distance quantiles", size=14)
plt.ylabel("Distance quantiles", size=14)
plt.tick_params(labelsize=12)
plt.show()

3-D plot

If we have three quantitative variables, we may like to see the association between them visually. For this, we create 3-D axes, which matplotlib provides through the Axes3D class in the mplot3d module of mpl_toolkits, and draw a 3-D scatter plot on them.

fig = plt.figure("3D Scatter Plot Example")
my3Dplot = fig.add_subplot(projection="3d")  # create 3-D axes (an Axes3D object) on the figure
my3Dplot.scatter(cargame["Pull Distance"], cargame["Time"], cargame["Distance"], c="blue", marker="o", alpha=0.5)
my3Dplot.set_xlabel("Pull Distance", fontsize=14)
my3Dplot.set_ylabel("Time", fontsize=14)
my3Dplot.set_zlabel("Distance", fontsize=14)
plt.show()

3 D scatter plot with pull distance depicted left to right, time depicted front to back, and distance depicted up and down

Tutorials for learning Python 3-D plots can be found at matplotlib.

Subplots

Subplots are very useful when organizing multiple plots in a single figure. Here is a simple example to demonstrate how to create subplots.

plt.figure("Subplot Example")  # name the figure; this is optional for graphing
plt.subplot(2, 1, 1)  # (2, 1, 1) means the plot has 2 rows and 1 column, and this is the first subplot
plt.hist(cargame["Pull Distance"], bins=10, color='purple', density=True, alpha=0.5)  # set density = True for a probability distribution
plt.title("Distribution of Pull Distance")
plt.xlabel("Pull Distance (inches)")
plt.ylabel("Probability")

plt.subplot(2, 1, 2)  # this is the second subplot
plt.hist(cargame["Distance"], bins=10, color='purple', density=True, alpha=0.5)  # set density = True for a probability distribution
plt.title("Distribution of Distance")
plt.xlabel("Distance (inches)")
plt.ylabel("Probability")
plt.tight_layout()
# tight_layout() adjusts the spacing between subplots to minimize overlaps; put # in front of that line and run the code again to see the difference
plt.show()

Separate histograms are shown stacked on top of one another with pull distance on top and distance on bottom. Both depict probability on the vertical axis.
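
The same layout can also be written with plt.subplots(), which creates the figure and all axes in one call and returns them as objects. The sketch below is one possible version; the two samples are randomly generated stand-ins, not the cargame columns.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
pull = rng.uniform(0, 40, size=325)      # hypothetical stand-in for Pull Distance
dist = rng.exponential(50.0, size=325)   # hypothetical stand-in for Distance

fig, axes = plt.subplots(2, 1)           # 2 rows, 1 column; axes is an array of 2 axes

axes[0].hist(pull, bins=10, color="purple", density=True, alpha=0.5)
axes[0].set_title("Distribution of Pull Distance")
axes[0].set_xlabel("Pull Distance (inches)")

axes[1].hist(dist, bins=10, color="purple", density=True, alpha=0.5)
axes[1].set_title("Distribution of Distance")
axes[1].set_xlabel("Distance (inches)")

fig.tight_layout()                       # reduce overlap between the two subplots
plt.show()
```

Working with the axes objects makes it explicit which subplot each command applies to, which helps once a figure has more than two or three panels.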

Tutorials for learning how to make Python subplots can be found at matplotlib, pyplot, plotly, and seaborn.