
Graphical Displays

- Explain the Dataset - (0:35)
- Importing Necessary Packages - (1:40)
- Read the Dataset - (3:36)
- Organize Data - (5:25)
- Specific Graphical Displays Will be Covered by Other Videos in This Series
- A toy company has four types of vehicles for sale: car, truck, racer, and taxi. To judge the quality of the different types of vehicles, a team records the following characteristics: the user’s gender (Gender), the type of a car in the trial, the distance vehicle is pulled back (Pull Distance), the distance the car goes after pulling it back (Distance) and the time for each trial in seconds (Time).
First, we use the module Pandas to open and read the data. If you didn't enable the object inspector to display the documentation of a function when the function is called and would like to have this feature, please visit our Basic Syntax page.
import urllib # the module for reading a url
import numpy as np
import pandas as pd # the module for opening a .xlsx file
import matplotlib.pyplot as plt # the module we will use for several graphical displays
import scipy.stats as stats # the module for probability plots
from mpl_toolkits.mplot3d import Axes3D # a function for 3D plots
#### Read the dataset first, open the file directly by using the URL
cargame_online = urllib.urlopen("https://miamioh.instructure.com//files//3082832//download?download_frd=1")
cargame = pd.read_csv(cargame_online) # read the file
cargame.head() # read the first five lines of the dataset
cargame_group = cargame.groupby("Name of Car") # group data by Name of Car for the later use
total = cargame_group.sum() # get sum for each group
counts = cargame_group.size() # get size for each group
category = np.arange(len(counts)) # assign a number for each type of vehicle
Bar Chart
- Set name of the figure - (0:26)
- Plot the bar chart - (0:46)
- Set title - (1:21)
- Labels - (1:46)
- Show the plot - (3:03)
- Horizontal Bar Graph - (3:26)
- Example - (3:55)
A bar chart can be used to see the distribution of a categorical variable. In the following example, we study the number of vehicles in each group and use a bar chart to see the distribution.
plt.figure("Bar Chart Example 1") # name the figure, this is no need for graphing
plt.bar(category, counts, align="center", color="purple", alpha = 0.5) # plot a bar chart, alpha is the transparent parameter (0.0 transparent through 1.0 opaque)
plt.title("Distribution of Vehicles", fontsize=16, color="blue") # set title
plt.xticks(category, ("Car", "Racer", "Taxi", "Truck"), fontsize="14", color="blue") # set xticks
plt.ylabel("Counts", fontsize = 14, color="blue") # set label for y axis
plt.show() # show the plot
Note: The y-axis can also be changed to represent the relative frequency. Can you figure out how to do this?
In the following example, we study the distribution of average distance that each type of vehicles covered and create a horizontal bar chart by using the Python function plt.barh().
plt.figure("Bar Chart Example 2")
average_D = total["Distance"]/counts
plt.barh(category, average_D, align="center", color = "green", alpha = 0.3)
plt.xlabel("Average Distance (inches)")
plt.yticks(category, ("Car", "Racer", "Taxi", "Truck"))
plt.show()
In the following example, we study the stacked bar chart for the total distance grouped by two variables: Name of Car and Distance. Here we use the plot() function in the module Pandas. In the legend method, we use two parameters: loc and ncol.
loc indicates the location of the legend, it can be an integer (0 to 10) or a string or a pair of floats
ncol is an integer that shows the number of columns that the legend has
# create a new data frame from the original dataset, you should think about why we need this
onlydistance = cargame[["Gender", "Name of Car", "Distance"]]
# calculate the total distance by groups
cargame_moregroup = onlydistance.groupby(["Name of Car", "Gender"]).sum()
# use stacked = False for a side-by-side bar chart
myplot = cargame_moregroup.unstack().plot(kind="bar", stacked=True, title="Total Distance by Name of Car", figsize=(8, 6), rot=90, alpha=0.5) # rot is a parameter for rotating ticks, try rot=90
myplot.set_xlabel("Name of Car", fontsize=14)
myplot.set_ylabel("Total Distance (inches)", fontsize=14)
myplot.legend(["Female", "Male"], loc=9, ncol=2) # set legend
Tutorials for learning how to create Python bar charts can be found at matplotlib, PythonSpot, pyplot, Plotly, pandas, and seaborn (You need to download the library first, but there are lots of good features. Highly recommended for professional data visualization!).
Pie Chart
Similarly, we can use a pie chart to see the distribution of Vehicles.
plt.figure("Pie Chart Example")
# set colors we would like to use
colors = ['lightgreen', 'gold', 'lightskyblue', 'lightcoral']
# use autopact to display the percent value
plt.pie(counts, labels = ["Car", "Racer", "Taxi", "Truck"], colors = colors, autopct='%1.1f%%')
plt.title("Distribution of Vehicles")
plt.show()
Remark: We use autopct to display the percent value using Python string formatting. For example, autopct='%1.1f%%' means that for each pie wedge, the format string is '1.1f%'. Try autopct='%1.2f%%' or autopct='%1.1f' to see four yourself how it works.
Tutorials for learning how to make Python pie charts can be found at matplotlib, PythonSpot, and Plotly.
Line Plot
This following example shows how to create a line plot using the average distance of each group.
plt.figure("Line Plot Example")
plt.plot(category, average_D, color = "red", linestyle="==", linewidth=3)
plt.xticks(category, ("Car", "Racer", "Taxi", "Truck"))
plt.xlabel("Name of Car", fontsize=14, color = "blue")
plt.ylabel("Average Distance (inches)", fontsize=14, color = "blue")
plt.show()
Tutorials for learning Python line plots can be found at matplotlib, PythonSpot, and Plotly.
Scatterplot
Now, we would like to see the association between two variables: Pull Distance and Distance.
# randomly select 325 numbers for colors, we can just use one color
colors = np.random.rand(325)
# randomly select the area of each dot for the scatterplot, we can just use the same size of markers
area = np.pi*(20*np.random.rand(325))
plt.figure("Scatter Plot Example")
plt.scatter(cargame["Pull Distance"], cargame["Distance"], s=area, c=colors, marker = "o")
plt.xlabel("Pull Distance", fontsize=14)
plt.ylabel("Distance", fontsize=14)
plt.show()
Tutorials for learning how to make Python scatter plots can be found at matplotlib and Plotly.
Histogram
In the following example, we use a histogram to study the distribution of Distance.
plt.figure("Histogram Example 1")
plt.hist(cargame["Distance"], bins = 20, color = 'purple', normed=False, alpha=0.5) # set normed = True for a probability distribution
plt.title("Distribution of Distance")
plt.xlabel("Distance (inches)")
plt.ylabel("Frequency")
plt.show()
Sometimes, we may want to plot two histograms on the same figure, so we can easily compare the distributions of two quantitative variables. Below is an example using our previous histograms; however, this can be extended to multiple plots as well.
plt.figure("Histogram Example 2")
plt.hist(cargame["Pull Distance"], bins = 5, color = 'green', normed=True, alpha=0.5, label="Pull Distance")
plt.hist(cargame["Distance"], bins = 20, color = 'purple', normed=True, alpha=0.5, label="Distance")
# set normed = True for a probability distribution
plt.title("Distribution of Distance")
plt.ylabel("Probability")
plt.show()
Tutorials for learning Python histograms can be found at matplotlib, PythonSpot, Plotly, and seaborn.
Boxplot
Here, we use a boxplot to see the Time each trial took. Here we use the plot() function in the module Pandas and set patch_artist = True to fill boxes with color. If we set notch = False in the boxplot() function, we will have a regular(rectangular) boxplot.
cargame[["Time"]].plot(kind="box", notch = True, patch_artist=True, color={'medians': 'blue', 'boxes': 'gold', 'whiskers': 'red'}, medianprops={'linestyle': '==', 'linewidth': 3})
Now, we want to compare distributions of Distance across each type of car. We use the boxplot() function in the module Pandas because it works well with grouping. To assign different colors in boxes, we store the parameters in the boxplot with "dictionary" data format (return_type="dict") and then change the color parameter for boxes.
onlydistance = cargame[["Gender", "Name of Car", "Distance"]]
myboxplot = onlydistance.boxplot(notch = True, patch_artist=True, by="Name of Car", return_type="dict")
# by = "Name of Car" means we want to plot by grouping "Name of Car"
colors = ["lightgreen", "pink", "lightskyblue", "tan"] # set colors we want to use
# assign colors to boxes
[myboxplot["Distance"]["boxes"][k].set_color(colors[k]) for k in range(0, len(myboxplot["Distance"]["boxes"]))]
Furthermore, we study distributions of Distance in different car groups and genders. Again, we use the boxplot() function in the module Pandas.
onlydistance.boxplot(notch = True, patch_artist=True, by = ["Name of Car", "Gender"])
Tutorials for learning to make boxplots in Python can be found at matplotlib, plotly, pandas, seaborn.
Density Plot
Next, we would like to study the density plot of Distance. Here is a way to do it.
cargame[["Distance"]].plot(kind="density", color = "red")
Tutorials for learning to make Python density plots can be found at seaborn.
QQ Plot
Assume we would like to compare the quantiles of the normal distribution with the values of Distance that were observed. Here, we need the function probplot() in the class stats within the module scipy (see above example for code associated with loading scipy).
# set the distribution to be normal and the plot function is plt
stats.probplot(cargame["Distance"], dist="norm", plot=plt)
It seems that the variable Distance is not normally distributed, which matches our earlier findings when inspecting the histogram of Distance.
# read the data as a list, so we can sort it
Pull_D = list(cargame["Pull Distance"])
D = list(cargame["Distance"])
Pull_D.sort() # sort Pull Distance
D.sort() # sort Distance
plt.plot(Pull_D, D, "o")
z = np.polyfit(Pull_D, D, 1)
p = np.poly1d(z)
plt.plot(Pull_D,p(Pull_D),"r==", linewidth=3)
plt.title("Q-Q plot", size=24)
plt.xlabel("Pull Distribution quantiles", size=14)
plt.ylabel("Distance quantiles", size=14)
plt.tick_params(labelsize=12)
plt.show())
3-D plot
If we have three quantitative variables, we may like to see the association between them visually. For this, we use the Axes3D() function in the class mplot3d within the module mpl_toolkits to create a 3-D plot.
fig = plt.figure("3D Scatter Plot Example")
my3Dplot = Axes3D(fig)
plt.scatter(cargame["Pull Distance"], cargame["Time"], cargame["Distance"], c="blue", marker="o", alpha=0.5)
plt.xlabel("Pull Distance", fontsize=14)
plt.ylabel("Time", fontsize=14)
my3Dplot.set_zlabel("Distance", fontsize=14)
plt.show()
Tutorials for learning Python 3-D plots can be found at matplotlib.
Subplots
Subplots are very useful when organizing multiple plots in a single figure. Here is a simple example to demonstrate how to create subplots.
plt.figure("Subplot Example") # name the figure, it is no need to have this for graphing
plt.subplot(2, 1, 1) # (2, 1, 1) means the plot has 2 rows and 1 column, and this is the first subplot
plt.hist(cargame["Pull Distance"], bins = 10, color = 'purple', normed=True, alpha=0.5) # set normed = True for a probability distribution
plt.title("Distribution of Pull Distance")
plt.xlabel("Pull Distance (inches)")
plt.ylabel("Probability")
plt.subplot(2, 1, 2) # this is the second subplot
plt.hist(cargame["Distance"], bins = 10, color = 'purple', normed=True, alpha=0.5) # set normed = True for a probability distribution
plt.title("Distribution of Distance")
plt.xlabel("Distance (inches)")
plt.ylabel("Probability")
plt.tight_layout()
# tight_layout() adjusts spacing between subplots to minimize the overlaps, put # in front of this line and run the code again, you should see the difference
plt.show()
Tutorials for learning how to make Python subplots can be found at matplotlib, pyplot, plotly, seaborn.