Pandas Best Practices

What is this article about?
This article provides a curated list of day-to-day best practices and use cases for the Pandas library. It is meant to be a rolling release note in which I will record every better practice I come across on my journey into data science. Accordingly, the list is going to be (very) long, so feel free to use the table of contents to jump to the topic that interests you.

What is Pandas?

/posts/pandas-best-practice/not_pandas.jpg
This is not a Panda or the Pandas I’m talking about.

Pandas is one of the most widely used Python libraries for dealing with matrix-like, transformable datasets, ranging from simple tabular data to more complex structures. It is an open-source library providing a fast, flexible, and expressive means of data manipulation, and one of the essential libraries to learn if you are getting into the data science field.
It mainly provides two data structures suited to multiple data types: Series and DataFrame. The following sections look at how to use both of them and the cases in which one may be more helpful than the other.

Installation

Pandas can be installed via pip from PyPI

pip install pandas

or using the Anaconda or Miniconda environment,

conda install pandas

Pandas components

As mentioned earlier, to handle different types of datasets pandas provides two data structures, largely built on top of NumPy. A Series is a one-dimensional structure similar to a NumPy array, except that it explicitly associates an index with the values. A DataFrame is a two-dimensional structure that can be thought of as a collection of Series, and its rows are also indexed by an integer range by default unless an index is set explicitly. The two are very similar in terms of operations, as almost the same manipulations can be applied to both.

credit: www.learndatasci.com

Pandas main components
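
As a quick illustration of the relationship between the two structures, here is a minimal sketch (the labels and values are made up for illustration):

import pandas as pd

# A Series is a one-dimensional array of values with an explicit index (labels)
ages = pd.Series([20, 40, 33], index=['driver_1', 'driver_2', 'driver_3'], name='age')

# A DataFrame can be seen as a collection of Series sharing the same index
genders = pd.Series(['M', 'M', 'F'], index=['driver_1', 'driver_2', 'driver_3'], name='gender')
drivers = pd.DataFrame({'age': ages, 'gender': genders})

# Most operations apply to both structures in the same way
ages.max()             # 40
drivers['age'].max()   # 40; a DataFrame column is itself a Series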

Dataset

The datasets used in the following are either created manually, loaded from the Seaborn repository, or, for the more advanced operations, the Traffic and Pedestrian Stops by the Police in Rhode Island dataset. The choice lies in the fact that the latter has multiple data types from which multiple scenarios can be derived.

Best practices

1. Importing necessary libraries

import pandas as pd 
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt 

# whether to explicitly show the whole interactive output cells in 
# jupyter notebook rather than showing only the last line.
# use `last_expr` to revert back the default behavior
from IPython.core.interactiveshell import InteractiveShell
InteractiveShell.ast_node_interactivity = "all"
# whether to display all DataFrame rows and cols 
pd.set_option('display.max_rows', None)
pd.set_option('display.max_columns', None)  
Info
  • For full data analysis and visualization tasks, pandas alone is often not enough, and hands-on skills with other Python libraries such as NumPy, Matplotlib, or Seaborn are helpful, though not mandatory; they are only used here to help us generate datasets or plot important findings.
  • In this article, additional information about the code is explained using Python comments.

2. Get to know the dataset

This is the first essential task you would perform before starting your analysis. Understanding the dataset may extend from knowing what is inside it, its content, its shape, the variable names, the variable data types, and the ratio of null values, to going deeper into the number of unique values of categorical variables, or even gathering descriptive statistics and variable distributions. The following code snippet shows a few essential functions along with the must-know arguments to answer these questions.

police = pd.read_csv('datasets/police_project.csv', 
                 sep=',', 
                 parse_dates=['stop_date'],
                 index_col=None,
                 usecols=['stop_date', 'driver_gender','driver_age_raw', 'violation'])

police.head(3)
police.info(memory_usage='deep')
# Uncomment the next line to display all columns grouped by their dtypes
#police.columns.to_series().groupby(police.dtypes).groups
# Summary statistics for the `number`, `datetime`, and `object` columns
police.dropna().describe(include=[np.number, np.datetime64, object])
police.select_dtypes(include=['number']).describe()
Output

pd.read_csv() is one of the most commonly used Pandas functions, designed specifically, but not only, to read a CSV data source. By default, CSV columns are separated by a comma , as the name implies, but sometimes the data source comes with a different column separator such as ;, |, or a tab \t. In such a case, you can explicitly specify the separator using the sep argument. Here, the parse_dates argument is used to parse the date column into the datetime type, and index_col is the way to set a variable as the DataFrame index; you can pass a list of columns to this argument. Using usecols, you can define the subset of columns on which you are going to perform the analysis.
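
As an illustration, a hypothetical semicolon-separated file could be loaded as follows (the file name and column names are made up for this example):

# Hypothetical semicolon-separated file: 'trip_date' is parsed as datetime,
# used as the index, and only three columns are read at all
trips = pd.read_csv('datasets/trips.csv',
                    sep=';',
                    parse_dates=['trip_date'],
                    index_col='trip_date',
                    usecols=['trip_date', 'duration', 'distance'])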


df.head(3) shows the top 3 rows. You can request as many rows as you want; without an argument, it returns the top 5 rows.

stop_date driver_gender driver_age violation
0 2005-01-02 M 20.0 Speeding
1 2005-01-18 M 40.0 Speeding
2 2005-01-23 M 33.0 Speeding

df.info() gives us some important properties of the entire dataset. By specifying memory_usage='deep', it will introspect the data and return the exact amount of memory consumed.

  RangeIndex: 91741 entries, 0 to 91740
  Data columns (total 4 columns):
    #   Column         Non-Null Count  Dtype         
  ---  ------         --------------  -----         
    0   stop_date      91741 non-null  datetime64[ns]
    1   driver_gender  86406 non-null  object        
    2   driver_age     86120 non-null  float64       
    3   violation      86408 non-null  object        
  dtypes: datetime64[ns](1), float64(1), object(2)
  memory usage: 12.0 MB

dropna().describe() and dropna().select_dtypes().describe() are used here to get a quick summary of statistics for columns filtered by data type. include=[np.number, np.datetime64, object] and select_dtypes(include=['number']) both filter columns by data type; note that describe() without include only summarizes numeric columns, and select_dtypes() requires at least one of include or exclude and will throw an error when called with neither. The include list has no visible effect in the first call since every dtype present in the dataset is listed, whereas the second statement only outputs the numeric column, which here is driver_age_raw. To get a quick summary of column data types grouped by dtype, you can uncomment the police.columns.to_series().groupby(police.dtypes).groups line in the snippet above. We can also notice in the first output that many cells are NaN; this is because the computed statistics are data-type dependent, so those that apply to object data cannot apply to other data types and vice versa.

stop_date driver_gender driver_age_raw violation
count 86405 86405 86405.000000 86405
unique 3767 2 NaN 6
top 2012-02-28 00:00:00 M NaN Speeding
freq 64 62895 NaN 48461
first 2005-01-02 00:00:00 NaN NaN NaN
last 2015-12-31 00:00:00 NaN NaN NaN
mean NaN NaN 1970.536080 NaN
std NaN NaN 110.514739 NaN
min NaN NaN 0.000000 NaN
25% NaN NaN 1967.000000 NaN
50% NaN NaN 1980.000000 NaN
75% NaN NaN 1987.000000 NaN
max NaN NaN 8801.000000 NaN
driver_age_raw
count 86414.000000
mean 1970.491228
std 110.914909
min 0.000000
25% 1967.000000
50% 1980.000000
75% 1987.000000
max 8801.000000
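
The introduction to this section also mentioned checking the number of unique values of categorical variables; although not part of the snippet above, nunique() and value_counts() are the usual way to do that:

# Number of distinct values per column
police.nunique()
# Frequency of each category of a single column
police['violation'].value_counts()
# The same frequencies as ratios instead of counts
police['violation'].value_counts(normalize=True)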

3. Generate data sample and rename columns

Generating synthetic samples may help us in some situations, such as conducting an experiment, testing algorithms, making visualizations to better understand changes over the course of an experiment, and so on. There are several ways to generate samples for any purpose, but the following makes use of Pandas and NumPy to keep it simple.

# Create a dataframe and rename columns 
# Eg. 1
foo = pd.DataFrame({'Col One': [100, 200], 'Col Two': [300, 400]})
foo.rename({'Col One': 'col_one', 'Col Two': 'col_two'}, axis=1, inplace=True)
# Eg. 2
bar = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
bar.add_prefix('X_')
# 
Output

Here we have two different ways to create a DataFrame and later rename its columns. The first creates a DataFrame with two columns and two rows, then renames the columns using the rename() function.

col_one col_two
0 100 300
1 200 400

The second constructs random samples, names the columns after the list passed to the columns argument, then prefixes these names using the add_prefix() method. Note that add_prefix() returns a new DataFrame rather than modifying bar in place.

X_a X_b X_c X_d X_e
0 0.226161 0.588816 0.335782 0.866124 0.274059
1 0.364500 0.021647 0.243754 0.628392 0.124118
2 0.507057 0.769180 0.364644 0.246468 0.716658
3 0.865142 0.201212 0.165379 0.397729 0.195590
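
Note that np.random.rand produces different values on every run; if reproducibility matters for your experiment, you can seed NumPy's generator first, as in this small variant of the snippet above:

# Seed the random generator so the synthetic sample is reproducible across runs
np.random.seed(42)
bar = pd.DataFrame(np.random.rand(4, 5), columns=list('abcde'))
bar.add_prefix('X_')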

4. Reverse row and column order

These are basic operations that reverse the original order of the DataFrame row-wise or column-wise. Here, reset_index(drop=True) is used to drop the old index and create a new one starting from 0; we can notice the difference between the first and the second output table.

# Reverse rows
police.loc[::-1, :].head(3)
# Reverse rows and drop the old index 
police.loc[::-1, :].reset_index(drop=True).head(3)
# Reverse columns
police.loc[:, ::-1].head(3)
Output

police.loc[::-1, :].head(3)

stop_date driver_gender driver_age_raw violation
91740 2015-12-31 M 1959.0 Speeding
91739 2015-12-31 M 1993.0 Speeding
91738 2015-12-31 M 1992.0 Moving violation

police.loc[::-1, :].reset_index(drop=True).head(3)

stop_date driver_gender driver_age_raw violation
0 2015-12-31 M 1959.0 Speeding
1 2015-12-31 M 1993.0 Speeding
2 2015-12-31 M 1992.0 Moving violation

police.loc[:, ::-1].head(3)

violation driver_age_raw driver_gender stop_date
0 Speeding 1985.0 M 2005-01-02
1 Speeding 1965.0 M 2005-01-18
2 Speeding 1972.0 M 2005-01-23
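
As an aside, the same row reversal can also be written with .iloc or sort_index; a sketch of equivalent alternatives:

# Equivalent ways to reverse the row order
police.iloc[::-1].head(3)
police.sort_index(ascending=False).head(3)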

5. Missing data

Missing data is a common problem we encounter in data analysis tasks. In machine learning, one can define missing values as missing information that would have brought additional information had it been observed1. For that reason, investigating missing information plays an important role in data analysis. The following shows a method that displays the column-wise percentage of missing data, then a pythonic way of getting rid of columns containing only missing data, and finally a handy way to filter out columns based on their ratio of missing values.

# Reload the dataset with full features
police = pd.read_csv('datasets/police_project.csv', 
                 sep=',', 
                 parse_dates=['stop_date'],
                 index_col=None)

def percent_missing(df):
    '''Missing data information summary'''
    percent_missing = (df.isnull().sum() / len(df) * 100).round(2)
    missing = pd.DataFrame({'column_name': df.columns, 'percent_missing': percent_missing})
    missing.sort_values('percent_missing', ascending=False, inplace=True)
    missing.reset_index(drop=True, inplace=True)
    return missing

def remove_missing(data, keep_ratio=.9):
    '''Missing data removal'''
    df = data.copy()
    # The pythonic way of getting rid of columns having only missing values
    df.dropna(axis="columns", how='all', inplace=True)
    # Keep columns having at least keep_ratio (90% by default) non-missing values
    df.dropna(axis=1, thresh=len(df)*keep_ratio, inplace=True)
    # Alternative using the per-column missing ratio
    #df = df[df.columns[df.isna().sum() / df.shape[0] <= 1 - keep_ratio]]
    return df
    
# Missing data summary 
missing = percent_missing(police)
missing
# Keep columns having at least 90% values
new_police = remove_missing(police, .9)
new_missing = percent_missing(new_police)
new_missing
Output

The full police dataset (first rows)

stop_date stop_time county_name driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop
0 2005-01-02 01:55 NaN M 1985.0 20.0 White Speeding Speeding False NaN Citation False 0-15 Min False
1 2005-01-18 08:15 NaN M 1965.0 40.0 White Speeding Speeding False NaN Citation False 0-15 Min False
2 2005-01-23 23:15 NaN M 1972.0 33.0 White Speeding Speeding False NaN Citation False 0-15 Min False

Missing data summary: missing

features missing %
0 county_name 100.00
1 search_type 96.52
2 driver_age 6.13
3 driver_gender 5.82
4 driver_age_raw 5.81
5 driver_race 5.81
6 violation_raw 5.81
7 violation 5.81
8 stop_outcome 5.81
9 is_arrested 5.81
10 stop_duration 5.81
11 stop_date 0.00
12 stop_time 0.00
13 search_conducted 0.00
14 drugs_related_stop 0.00

Keep columns containing at least 90% non-missing values: new_missing
Here we can notice that the number of features is reduced. The first two columns, county_name and search_type, have been removed as their share of missing values exceeded the defined ratio.

features missing %
0 driver_age 6.13
1 driver_gender 5.82
2 driver_age_raw 5.81
3 driver_race 5.81
4 violation_raw 5.81
5 violation 5.81
6 stop_outcome 5.81
7 is_arrested 5.81
8 stop_duration 5.81
9 stop_date 0.00
10 stop_time 0.00
11 search_conducted 0.00
12 drugs_related_stop 0.00
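
To make the thresh argument used in remove_missing() more concrete, here is a minimal sketch on a toy DataFrame (the column names are made up for illustration):

toy = pd.DataFrame({'a': [1, 2, 3, 4],            # 0% missing
                    'b': [1, None, 3, 4],         # 25% missing
                    'c': [None, None, None, 4]})  # 75% missing
# Keep only columns having at least 90% non-missing values:
# thresh = len(toy) * 0.9 = 3.6 non-null cells required per column
toy.dropna(axis=1, thresh=len(toy) * 0.9)
# -> only column 'a' survives; 'b' has 3 non-null cells (< 3.6) and is dropped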

6. Split cells data

Splitting data is useful when dealing with object or category-like data types. Let us assume that we are interested in splitting stop_date into day, month, and year. There are at least two ways of doing so. The first and easiest approach is to treat the stop_date attribute as a string and apply the transformation to the string, since Python and Pandas offer a couple of string manipulation helpers. But in our case, stop_date was already parsed to datetime64 while loading the dataset; to use the string helpers we are obliged to cast the field back to str and then apply the split, which is computationally expensive when dealing with millions of rows. The second approach addresses this problem: we treat the attribute as datetime64 and apply the transformation to it directly.

# Approach 1.1: string parsing
police[['year', 'month', 'day']] = police['stop_date'].astype(str).str.split('-', expand=True)
police.head(3)
police.drop(columns=['day', 'month', 'year'], inplace=True)

# Approach 1.2: string parsing
police['year'] = police["stop_date"].apply(lambda x: str(x).split('-')[0])
police['month'] = police["stop_date"].apply(lambda x: str(x).split('-')[1])
police['day'] = police["stop_date"].apply(lambda x: str(x).split('-')[2])
police.head(3)
police.drop(columns=['day', 'month', 'year'], inplace=True)

# Approach 2: datetime parsing
police['day'] = police['stop_date'].dt.day
police['month'] = police['stop_date'].dt.month
police['year'] = police['stop_date'].dt.year
police.head(3)
police.drop(columns=['day', 'month', 'year'], inplace=True)
Output

The string parsing approaches produce the same output, although there is a slight difference in how the data is parsed. The first method, via astype(str), converts the already-parsed datetime64 column into strings of the form YYYY-MM-DD, whereas the lambda applies str() to each Timestamp individually, which also carries the time. Therefore, the first one gets a clean day string whereas the second gets the day along with the time, i.e. the hour, minute, and second.
Datetime parsing with the .dt accessor is a built-in Pandas feature that can only be applied to columns of type datetime, but it is the most convenient way to extract date components.

stop_date stop_time county_name driver_gender driver_age_raw driver_age driver_race violation_raw violation search_conducted search_type stop_outcome is_arrested stop_duration drugs_related_stop year month day
0 2005-01-02 01:55 NaN M 1985.0 20.0 White Speeding Speeding False NaN Citation False 0-15 Min False 2005 01 02
1 2005-01-18 08:15 NaN M 1965.0 40.0 White Speeding Speeding False NaN Citation False 0-15 Min False 2005 01 18
2 2005-01-23 23:15 NaN M 1972.0 33.0 White Speeding Speeding False NaN Citation False 0-15 Min False 2005 01 23
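
If a column was not parsed as a date at load time (for example stop_time, which is read as a plain string), it must first be converted with pd.to_datetime before the .dt accessor becomes available. A minimal sketch, assuming all stop_time values follow the HH:MM pattern:

# Convert the HH:MM strings to datetime, then extract the hour component
stop_time = pd.to_datetime(police['stop_time'], format='%H:%M')
police['hour'] = stop_time.dt.hour
police.head(3)
police.drop(columns=['hour'], inplace=True)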

The same operation can easily be applied to object or category data types


df = pd.DataFrame({'fullname':['John Doe McArthur', 'Jane Smith McDonald'],
                   'address':['Los Angeles, CA', 'Washington, DC']})

# Split fullname using any whitespace (space or tab) as the separator
df['fullname'].str.split(r'\s+', expand=True)
df
# Split fullname using the same separator and name the newly generated columns
df[["First", "Middle", "Last"]] = df['fullname'].str.split(r'\s+', expand=True)
# Split address into city and state using the comma as the separator
df[["City", "State"]] = df['address'].str.split(',', expand=True)
df
Output

The dataframe looks like

fullname address
0 John Doe McArthur Los Angeles, CA
1 Jane Smith McDonald Washington, DC

After splitting fullname by space, it will generate three fields

0 1 2
0 John Doe McArthur
1 Jane Smith McDonald

After splitting fullname and address, the generated fields now have their own header names

fullname address First Middle Last City State
0 John Doe McArthur Los Angeles, CA John Doe McArthur Los Angeles CA
1 Jane Smith McDonald Washington, DC Jane Smith McDonald Washington DC

7. Filter a dataframe

What makes a Pandas DataFrame different from other multi-dimensional data structures, such as NumPy arrays, is the way it organizes the dataset to ease manipulation. Pandas uses an index along every dimension of a Series or DataFrame. The index serves as a label for each row and column, allowing users to access any specific set of cells, which makes filtering tasks pretty straightforward. Indexes are, by default, a range of integers unless they are explicitly set to something else.
There are multiple ways to access the indexes, but two of them are common in the data science community: the indexing operator, which refers to the brackets [] attached directly after a Series or DataFrame, and the indexers .loc and .iloc, which are built-in attributes of Series and DataFrame.
As in the previous sections, I will give a set of examples below from which you can easily derive your own use case.
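
Before the detailed examples, here is a minimal sketch of what the three access styles look like on the police dataset:

# Indexing operator []: select a column, or filter rows with a boolean mask
police['violation']
police[police['driver_gender'] == 'F']

# .loc: label-based selection of rows and columns
police.loc[0:2, ['stop_date', 'violation']]

# .iloc: position-based selection of rows and columns
police.iloc[0:3, 0:2]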