Intro to pandas

pandas is a Python library for analyzing, cleaning, exploring, and manipulating data. It is a key part of the data viz toolchain and is widely used in the data science world. pandas specializes in manipulating tabular, row-and-column data with its core datatype, the DataFrame, which can be thought of as a very fast, programmatic spreadsheet.

Why Use Pandas?

While Python on its own is good at manipulating data, prior to pandas it lacked the data analysis and modeling capabilities to rival languages like R. pandas fills this gap and, when used with Jupyter Notebook and Matplotlib, gives us a great way to conduct interactive data analysis in Python. Since pandas is built on NumPy arrays, it also integrates easily with other tools in the NumPy ecosystem like SciPy and scikit-learn.

Pandas Setup

Installing pandas

pandas comes pre-installed with Python distributions such as Anaconda, or can be installed through pip using the below command:

pip install pandas

Importing pandas

pandas is typically imported under the alias pd as shown below.

import pandas as pd

The DataFrame

When working with pandas, the first step is usually to load some data into a DataFrame. A DataFrame is pandas’ representation of a data set and can simply be thought of as a two-dimensional table.

Creating a DataFrame

To create a DataFrame, just pass your data object to the DataFrame constructor. By convention, DataFrame variable names start with df:

data = {
  "calories": [420, 380, 390],
  "duration": [50, 40, 45]
}

df = pd.DataFrame(data)

print(df)

If your data is in a JSON-like, record-based format (a list of dicts), you can use the from_dict method below to convert it into pandas’ column-based format. The below code and the above code produce the same DataFrame.

data = [
	{"calories": 420, "duration": 50},
	{"calories": 380, "duration": 40},
	{"calories": 390, "duration": 45}
]

df = pd.DataFrame.from_dict(data)

print(df)

# Out:
#    calories  duration
# 0       420        50
# 1       380        40
# 2       390        45

DataFrame Rows

As demonstrated above, a DataFrame is essentially a table with rows and columns. You can use pandas’ loc indexer to return one or more specified rows of data. Selecting a single row returns a Series, while selecting multiple rows returns a DataFrame.

print(df.loc[0])

# Out: calories    420
#      duration     50
#      Name: 0, dtype: int64

print(df.loc[[0, 1]])

# Out:
#    calories  duration
# 0       420        50
# 1       380        40

DataFrame Columns

You can grab the values from a specific column in your DataFrame using either of the below notations. Both return a Series holding the column’s data (more on Series below). Bracket notation works for any column name; dot notation only works when the name is a valid Python identifier that doesn’t clash with an existing DataFrame attribute.

calorie_col = df.calories
calorie_col = df["calories"]

print(calorie_col)

# Out:
# 0    420
# 1    380
# 2    390
# Name: calories, dtype: int64
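Passing a list of column names instead selects multiple columns at once and returns a DataFrame rather than a Series:

subset = df[["calories", "duration"]]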

DataFrame Indices

The numbers to the left of the data in the above examples are the default numerical indices, which can be used to access rows. Indices can also be replaced with custom values if needed. The first example below creates custom named indices, and the second uses the calories column values as the index.

df = pd.DataFrame(data, index=["day1", "day2", "day3"])
print(df)

# Out:
#       calories  duration
# day1       420        50
# day2       380        40
# day3       390        45

df = pd.DataFrame(data)
df = df.set_index('calories')
print(df)

# Out:
#           duration
# calories
# 420             50
# 380             40
# 390             45

Once named indices are set, you can locate the corresponding rows using the index name. Using the day-indexed DataFrame from the first example:

print(df.loc["day1"])
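# Out: calories    420
#      duration     50
#      Name: day1, dtype: int64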

DataFrame Series

Both the rows and columns of a DataFrame are stored as pandas Series. A Series is basically a one-dimensional array that can contain any datatype. As with NumPy arrays, you can manipulate all of the elements of a Series with a single function. Series can also be combined to form a DataFrame.

Series can be created using a Python list or NumPy array (with or without specified indices). They can also be created using a dictionary that has a specified index as the key and value as the value.

s = pd.Series([1, 2, 3])
print(s)

# Out:
# 0    1
# 1    2
# 2    3

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
print(s)

# Out:
# a    1
# b    2
# c    3

s = pd.Series({'a': 1, 'b': 2, 'c': 3})
print(s)

# Out:
# a    1
# b    2
# c    3

Finally, we can pass a single, scalar value as data to the Series, provided we also specify an index (an ordered collection such as a list, since sets are unordered). The scalar value is then repeated for each index label:

s = pd.Series(9, index=['a', 'b', 'c'])
print(s)

# Out:
# a    9
# b    9
# c    9

Series are like NumPy arrays (ndarray), so they can be passed to most NumPy functions:

import numpy as np

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s = np.sqrt(s)
print(s)

# Out:
# a    1.000000
# b    1.414214
# c    1.732051

Slicing operations also work as they would with Python lists or ndarrays, and the index labels are preserved. Since the Series above now holds floats, we recreate the integer Series first, then take a positional slice with iloc:

s = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
s = s.iloc[1:3]
print(s)

# Out:
# b    2
# c    3

Unlike NumPy arrays, Series can comfortably hold mixed data types. Operations applied to such a Series work element-wise across the different data types:

combined = pd.Series([1, 2.1, 'foo']) + pd.Series([2, 3, 'bar'])
print(combined)

# Out:
# 0         3
# 1       5.1
# 2    foobar

Series can then be easily joined together using the concat method shown below. The name property provides a column name, and an axis argument of 1 indicates that each Series is a column.

calories = pd.Series([420, 380], name='calories')
durations = pd.Series([50, 40], name='duration')

df = pd.concat([calories, durations], axis=1)
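print(df)

# Out:
#    calories  duration
# 0       420        50
# 1       380        40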

Selecting Groups

Particularly when exploring data, it is often useful to group data based on their values in various columns. One way to do this is to use the DataFrame’s groupby method to group a column and then use the get_group method to select the required group:

data = [
	{"name": "Zola", "species": "cat"},
	{"name": "Didi", "species": "cat"},
	{"name": "Stella", "species": "dog"}
]
df = pd.DataFrame.from_dict(data)

species_groups = df.groupby("species")
cats = species_groups.get_group("cat")
print(cats)

# Out:
#    name species
# 0  Zola     cat
# 1  Didi     cat
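Beyond pulling out a single group, you can aggregate across all of the groups at once; for example, the groupby object’s size method counts the rows in each group:

print(species_groups.size())

# Out:
# species
# cat    2
# dog    1
# dtype: int64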

The below example achieves the same result using a Boolean mask. Comparing a column against a value produces a Boolean Series, with True or False for each row in the table. The mask can then be applied to the original DataFrame to select a subset of its rows:

print(df.species == "cat")

# Out:
# 0     True
# 1     True
# 2    False
# Name: species, dtype: bool

cats = df[df.species == "cat"]
print(cats)

# Out:
#    name species
# 0  Zola     cat
# 1  Didi     cat
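Masks can also be combined using the element-wise operators & (and) and | (or), with each condition in its own parentheses. A quick sketch using the same DataFrame:

other_cats = df[(df.species == "cat") & (df["name"] != "Zola")]
print(other_cats)

# Out:
#    name species
# 1  Didi     cat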

Loading External Data into a DataFrame

It is unlikely that you will always be creating DataFrames from Python containers directly, as we have done thus far. It is more common to use one of pandas’ data-reading methods to load data in from a different file format (JSON, CSV, a database, etc.).

The below example loads data in from a JSON file. You can optionally specify an orient argument ('split', 'records', 'index', 'columns', or 'values'), but pandas can usually detect the form of your data automatically.

df = pd.read_json('file.json')
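If the auto-detection guesses wrong, you can state the orientation explicitly; a sketch assuming file.json holds a list of records:

df = pd.read_json('file.json', orient='records')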

The process of loading data in from a CSV file is quite similar. There are a number of CSV-specific options that can be specified as well, which are covered in the pandas read_csv documentation.

df = pd.read_csv('data.csv')
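For example, here is a sketch with a couple of common options (the delimiter and extra NaN markers are assumptions about the file):

df = pd.read_csv('data.csv', sep=',', na_values=['NA', 'n/a'])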

pandas uses Python’s xlrd module to read older .xls files and the openpyxl module to read .xlsx (Excel 2007+) files. If using openpyxl, install it with pip:

pip install openpyxl

Excel documents often have multiple sheets. You can use the below method to load data from more than one sheet, specifying the column to use as the DataFrame’s row labels with index_col. You can also provide a list of additional strings for pandas to recognize as NaN via the na_values argument. Finally, the usecols parameter (which replaced the older parse_cols) lets you select which sheet columns to parse if you only want certain ones. More options can be passed in as well, which you can read about in the pandas documentation.

dfs = pd.read_excel('data/nobel_winners.xlsx',
                    sheet_name=['WinnersSheet1', 'WinnersSheet2'],
                    index_col=1, na_values=['NA'], usecols=[1, 3])
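Because multiple sheets were requested, read_excel returns a dict of DataFrames keyed by sheet name:

winners_df = dfs['WinnersSheet1']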

Saving Your Cleaned DataFrame

Once you have cleaned your DataFrame, you will likely want to save your cleaned data. You can do so using one of the below commands, depending on the format in which you would like to save your data.

When saving to JSON, pandas’ default column-based data format is not particularly useful for data viz, so you will likely want to save your data using the records orientation.

# Save data in JSON format (to_json writes the file and returns None)
df.to_json('data_cleaned.json', orient='records')

# Save data to a CSV file
df.to_csv('data.csv', encoding='utf-8')

# Save data to an Excel sheet
df.to_excel('data_cleaned.xlsx', sheet_name='CleanedData')

Working with Databases

DataFrames are also often created from databases such as SQL databases and MongoDB. Importing data from and exporting data to databases is relatively straightforward: pandas’ read_sql and to_sql methods handle SQL databases (via SQLAlchemy), while MongoDB query results, which arrive as lists of dicts, can be passed straight to the DataFrame constructor. A small SQL sketch follows below.
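A minimal sketch of the SQL round trip, assuming SQLAlchemy is installed; the SQLite file and table names here are hypothetical:

from sqlalchemy import create_engine

# Connect to (or create) a local SQLite database file
engine = create_engine('sqlite:///nobel_winners.db')

# Write the DataFrame to a table, replacing it if it already exists
df.to_sql('winners', engine, if_exists='replace', index=False)

# Read the table back into a new DataFrame
df = pd.read_sql('winners', engine)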

Wrapping Up

The above should provide a solid foundation in the core pandas concepts. Actually cleaning and manipulating data requires many more methods and techniques; I have linked some resources below that go more in depth on those topics.

To see a real-life demonstration of how pandas can be used to clean and explore data, check out this post from my Visualizing Migration in Football project.

Resources

I’d like to give a massive shoutout to the wonderful book, Data Visualization with Python and JavaScript by Kyran Dale. Many of the above examples were from this book and I highly recommend it to anybody looking to improve their data visualization skills using Python and D3.

Some other great resources:

pandas Documentation

W3 Schools pandas Tutorial

Real Python Data Cleaning Tutorial

Real Python Exploring Data Tutorial

Created by Zoe Ferencova

Author portfolio | GitHub profile | LinkedIn profile