Part 4: Pandas Dataframes

Topic 1: What is a Dataframe

A dataframe is a data structure made up of rows and columns, similar to a database table or an Excel spreadsheet. It can be thought of as a dictionary of lists in which each list has its own identifier, or key, such as “last name” or “food group.”

Topic 2: Creating a Dataframe

One common way to create a dataframe is to start with a dictionary. A dictionary is a collection of values linked to keys. Each key is separated from its value by a colon, and the whole dictionary is enclosed in curly braces. In this case, the dictionary keys will become the column names for the DataFrame. For example, the key could be “Grades” and its values “A, B, C, D, F”.
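As a minimal sketch of the dictionary described above (the variable name `grades` is chosen here for illustration), the key "Grades" maps to a list of letter grades:

```python
# A dictionary maps keys to values; here one key maps to a list of values.
grades = {"Grades": ["A", "B", "C", "D", "F"]}

print(list(grades.keys()))  # the keys will become column names
print(grades["Grades"])     # the values will fill that column
```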

Below are the dictionary methods and what they do. We won’t go into too much detail on dictionaries, but they may become important later if you work with data structures and algorithms.

Method        Usage
values()      Returns a view of all the values in the dictionary
update()      Updates the dictionary with the specified key-value pairs
setdefault()  Returns the value of the specified key. If the key does not exist, inserts the key with the specified value
clear()       Removes all the elements from the dictionary
keys()        Returns a view of the keys of the dictionary
pop()         Removes the element with the specified key and returns its value
popitem()     Removes and returns the last inserted key-value pair
get()         Returns the value of the specified key, or None (or a given default) if the key does not exist
items()       Returns a view containing a tuple for each key-value pair
copy()        Returns a shallow copy of the dictionary
fromkeys()    Returns a new dictionary with the specified keys and value
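To make a few of these methods concrete, here is a short sketch. The `student` dictionary and its values are made up for illustration.

```python
# Hypothetical sample data, not taken from the tutorial.
student = {"last name": "Lee", "gpa": 3.5}

print(student.get("gpa"))            # value for an existing key
print(student.get("major"))          # missing key: get() returns None

student.setdefault("major", "Math")  # key was missing, so it is inserted
student.update({"gpa": 3.7})         # overwrites the existing gpa value
removed = student.pop("last name")   # removes the key and returns its value

print(list(student.items()))         # remaining key-value pairs as tuples
```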

 

To begin, we pass a dictionary of lists to the DataFrame() constructor.
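A minimal sketch of this step might look like the following. The student names, columns, and values are invented for illustration; only `pd.DataFrame()` itself comes from pandas.

```python
import pandas as pd

# Hypothetical student data: each key becomes a column name,
# and each list supplies that column's values.
data = {
    "name": ["Ana", "Ben", "Cara", "Dev"],
    "gpa": [3.9, 3.2, 3.6, 2.8],
    "year": [1, 2, 2, 3],
}

df = pd.DataFrame(data)
print(df)
```

All of the lists need to be the same length, since each position in the lists becomes one row.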

DataFrames are automatically indexed from 0 to n - 1, with n being the number of rows (the length of each list in the dictionary). We can override this by passing the index parameter after our dictionary to manually set the row labels for our data.

Note: Most times you won’t specify an index and pandas will create one automatically.
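The difference between the default index and a custom one can be sketched as follows. The data and the row labels in `students` are assumptions made up for this example.

```python
import pandas as pd

# Hypothetical values, for illustration only.
data = {"gpa": [3.9, 3.2, 3.6, 2.8], "year": [1, 2, 2, 3]}

# Without an index argument, pandas numbers the rows from 0.
default_df = pd.DataFrame(data)
print(list(default_df.index))  # [0, 1, 2, 3]

# With index=..., the given labels become the row headers.
students = ["Ana", "Ben", "Cara", "Dev"]
df = pd.DataFrame(data, index=students)
print(df)
```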

Topic 3: Looking at the Data

Now for some useful commands for working with pandas dataframes.

When you want to see the top of a dataframe, the head() method returns rows starting from the first indexed row: the first 5 by default, or the first n if you call head(n). For example, head(4) returns the first 4 rows. The tail() method does the same but counts back from the last indexed row.
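Here is a short sketch of head() and tail(). It uses a made-up ten-row dataframe (not the tutorial's student data) so that the default of five rows is visible.

```python
import pandas as pd

# Hypothetical frame with 10 rows so the defaults are visible.
df = pd.DataFrame({
    "student_id": range(10),
    "gpa": [3.0 + 0.1 * i for i in range(10)],
})

print(df.head())   # first 5 rows (the default)
print(df.head(4))  # first 4 rows
print(df.tail(3))  # last 3 rows
```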

Because we specified an index when creating the dataframe, the index also shows up when we select specific columns. In this example we want just the gpa column.

The output shows our student index alongside each gpa. To get one column, the syntax is df['<column>']. For multiple columns, you pass a list, so it looks like df[['<column>', '<column>']]. Two sets of brackets are required.
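The single-bracket and double-bracket forms can be sketched like this. The dataframe contents are invented for illustration; only the gpa column name comes from the text.

```python
import pandas as pd

# Hypothetical student data with a custom index.
df = pd.DataFrame(
    {
        "gpa": [3.9, 3.2, 3.6, 2.8],
        "year": [1, 2, 2, 3],
        "major": ["Math", "Art", "Bio", "CS"],
    },
    index=["Ana", "Ben", "Cara", "Dev"],
)

gpa = df["gpa"]               # single brackets: one column, returned as a Series
subset = df[["gpa", "year"]]  # double brackets: a list of columns, returned as a DataFrame

print(gpa)
print(subset)
```

In both cases the row index (the student labels here) is carried along with the selected columns.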

Selecting only some of the rows or columns instead of all of the values is a tad more complex. Let’s start by selecting only the first three of the four student records. You do so by slicing the rows with df[:3].
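As a sketch of that row slice, using a made-up four-student dataframe (the names and values are assumptions):

```python
import pandas as pd

# Hypothetical data: four student records.
df = pd.DataFrame(
    {"gpa": [3.9, 3.2, 3.6, 2.8], "year": [1, 2, 2, 3]},
    index=["Ana", "Ben", "Cara", "Dev"],
)

# Slice the rows by position: rows 0, 1, and 2.
first_three = df[:3]
print(first_three)
```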

Choosing the middle columns:

Remember, indexing starts at 0. 

Slicing does not include the end position. In the first subset, df[:3], this means positions 0, 1, and 2: up to but not including the 4th row (position 3, since indexing starts at 0).
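The same end-exclusive rule applies when selecting from the middle. The sketch below is one possible way to do it, not necessarily the tutorial's original example: the data is made up, and using `iloc` for the column slice is an assumption.

```python
import pandas as pd

# Hypothetical data with four rows and four columns.
df = pd.DataFrame(
    {
        "gpa": [3.9, 3.2, 3.6, 2.8],
        "year": [1, 2, 2, 3],
        "major": ["Math", "Art", "Bio", "CS"],
        "credits": [15, 12, 16, 14],
    },
    index=["Ana", "Ben", "Cara", "Dev"],
)

# Rows at positions 1 and 2 (position 3 is excluded).
middle_rows = df[1:3]

# Columns at positions 1 and 2, selected by position with iloc.
middle_cols = df.iloc[:, 1:3]

print(middle_rows)
print(middle_cols)
```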

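To get a single value from a dataframe, the syntax is df.<column>[<index label>]. With the default index, the label is simply the row number, starting at 0.

For example, with the default index, df.gpa[2] gives us the gpa of the third student.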
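A minimal sketch of single-value lookup, assuming a default 0-based index and made-up gpa values. The `iloc` and `at` forms are shown as alternatives that make it explicit whether you mean a position or a label.

```python
import pandas as pd

# Hypothetical data with the default index 0..3.
df = pd.DataFrame({"gpa": [3.9, 3.2, 3.6, 2.8]})

third_gpa = df.gpa[2]            # label 2 in the default index: the third student
by_position = df["gpa"].iloc[2]  # third row by position, whatever the index is
by_label = df.at[2, "gpa"]       # row label 2, column "gpa"

print(third_gpa, by_position, by_label)
```

If the dataframe has a custom index (for example student names), look the value up by that label, or use iloc when you mean a position.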
