Pandas Best Practices
What is Pandas?
Pandas is one of the most widely used Python libraries for working with matrix-like, transformable datasets, ranging from simple tabular data to more complex structures. It is an open-source library that provides fast, flexible, and expressive tools for data manipulation, and it is an essential library to learn if you are getting into the data science field.
It mainly provides two data structures well suited to multiple data types: Series and DataFrame. This article covers how to use both of them and the cases in which one may be more helpful than the other.
Installation
Pandas can be installed via pip from PyPI:
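```bash
pip install pandas
```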
or, if you are using an Anaconda or Miniconda environment, via conda:
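```bash
conda install pandas
```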
Pandas components
As mentioned earlier, to handle different types of datasets pandas mainly provides two data structures, both largely built on top of NumPy. A `Series` is a one-dimensional structure similar to a NumPy array, except that it explicitly associates an index with the values. A `DataFrame` is a two-dimensional structure that can be thought of as a collection of Series; its rows are indexed by default unless an index is set explicitly. The two are quite similar in terms of operations, as almost the same manipulations can be applied to both.
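As a quick illustration, here is a minimal sketch of both structures (the toy values are not taken from the article's dataset):

```python
import pandas as pd

# A Series: one-dimensional values explicitly paired with an index
ages = pd.Series([20, 40, 33], index=["a", "b", "c"])

# A DataFrame: two-dimensional, essentially a collection of Series
# sharing the same row index
drivers = pd.DataFrame({"age": ages, "gender": ["M", "M", "F"]})

# Almost the same manipulations work on both
print(ages.mean())             # mean of the Series
print(drivers["age"].mean())   # a DataFrame column is itself a Series
```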
Dataset
The datasets used in the following are either created manually, loaded from the Seaborn repository, or, for the more advanced operations, taken from the Traffic and Pedestrian Stops by the Police in Rhode Island dataset. The latter was chosen because it contains multiple data types from which multiple scenarios can be derived.
Best practices
1. Importing necessary libraries
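A minimal set of imports that the examples in this article could rely on (Matplotlib and Seaborn are optional, as noted below):

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
```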
- For full data analysis and visualization tasks, pandas alone is often not enough, and hands-on skills with other Python libraries such as NumPy, Matplotlib, or Seaborn are useful, though not mandatory. They are only used here to help generate datasets or plot important findings.
- In this article, additional information about the code is explained using Python comments.
2. Get to know the dataset
This is the first essential task to perform before getting started with your analysis. Understanding the dataset may range from knowing what is inside it, its content, its shape, the variable names, the variable data types, and the ratio of null values, to going deeper and checking the number of unique values of categorical variables, the descriptive statistics, or the variable distributions. The following code snippet shows a few essential functions, along with their must-know arguments, to answer these questions.
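A sketch of what this step could look like; the file name `police.csv`, the variable names, and the exact column selection are assumptions reconstructed from the outputs shown below rather than the author's original code:

```python
import numpy as np
import pandas as pd

# Full dataset, reused in later sections; parse_dates converts stop_date
# to datetime64 at load time, and sep only needs to be overridden when the
# file is not comma-separated (index_col could likewise promote a column
# to the DataFrame index).
police = pd.read_csv("police.csv", sep=",", parse_dates=["stop_date"])

# Restricted load using usecols, roughly matching the outputs in this section
df = pd.read_csv(
    "police.csv",
    parse_dates=["stop_date"],
    usecols=["stop_date", "driver_gender", "driver_age_raw", "violation"],
)

df.head(3)                     # top 3 rows (5 by default)
df.info(memory_usage="deep")   # dtypes, null counts, exact memory consumption
df.dropna().describe(include=[np.number, np.datetime64, object])
df.dropna().select_dtypes(include=["number"]).describe()
# df.dtypes.value_counts()     # uncomment for a quick summary of dtypes
```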
`pd.read_csv()` is one of the most common pandas methods, designed specifically, but not only, to read a CSV data source. By default, CSV columns are separated by a comma `,`, as the name suggests, but sometimes the data source comes with a different column separator such as `;`, `|`, or `\t` (tab). In such a case, you can explicitly specify the separator using the `sep` argument. Here, the argument `parse_dates` is used to parse the date column into the `datetime` type, and `index_col` is the way to set some variable as the DataFrame index; you can pass a list of columns to this argument. Using `usecols`, you can define the set of columns on which you are about to perform the analysis.
`df.head(3)` shows the top 3 rows. You can request as many rows as you want; without an argument, it returns the top 5 rows.
| | stop_date | driver_gender | driver_age | violation |
|---|---|---|---|---|
| 0 | 2005-01-02 | M | 20.0 | Speeding |
| 1 | 2005-01-18 | M | 40.0 | Speeding |
| 2 | 2005-01-23 | M | 33.0 | Speeding |
`df.info()` gives us some important properties of the entire dataset. By specifying `memory_usage='deep'`, it introspects the data and returns the exact amount of memory consumed.
`dropna().describe()` and `dropna().select_dtypes().describe()` are used here to get a quick summary of statistics for columns filtered by data type. `include=[np.number, np.datetime64, np.object]` and `select_dtypes(include=['number'])` filter columns by data type; with no arguments, the first includes all columns, whereas the second throws an error because `select_dtypes()` requires at least one parameter. The first has no effect in this case, since all of the dataset's dtypes are included in the list, whereas the second only outputs the numeric columns, which here means `driver_age_raw`. To get a quick summary of column data types grouped by type, you can uncomment the commented-out line in the snippet above. We can notice from the first output that some of the values are `NaN`; this is because the statistics are data-type dependent, which means those that apply to object data cannot apply to other data types and vice versa.
| | stop_date | driver_gender | driver_age_raw | violation |
|---|---|---|---|---|
| count | 86405 | 86405 | 86405.000000 | 86405 |
| unique | 3767 | 2 | NaN | 6 |
| top | 2012-02-28 00:00:00 | M | NaN | Speeding |
| freq | 64 | 62895 | NaN | 48461 |
| first | 2005-01-02 00:00:00 | NaN | NaN | NaN |
| last | 2015-12-31 00:00:00 | NaN | NaN | NaN |
| mean | NaN | NaN | 1970.536080 | NaN |
| std | NaN | NaN | 110.514739 | NaN |
| min | NaN | NaN | 0.000000 | NaN |
| 25% | NaN | NaN | 1967.000000 | NaN |
| 50% | NaN | NaN | 1980.000000 | NaN |
| 75% | NaN | NaN | 1987.000000 | NaN |
| max | NaN | NaN | 8801.000000 | NaN |
| | driver_age_raw |
|---|---|
| count | 86414.000000 |
| mean | 1970.491228 |
| std | 110.914909 |
| min | 0.000000 |
| 25% | 1967.000000 |
| 50% | 1980.000000 |
| 75% | 1987.000000 |
| max | 8801.000000 |
3. Generate data sample and rename columns
Generating a synthetic sample may help in some situations, such as conducting an experiment, for instance testing algorithms, or making visualizations to better understand changes over the course of your experiment. There are several ways of generating samples for any purpose, but the following makes use of pandas and NumPy to keep it simple.
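One way this snippet could be written; the pre-rename column names and the sample shape are assumptions matching the outputs below:

```python
import numpy as np
import pandas as pd

# First way: build a 2-row, 2-column DataFrame, then rename its columns
df = pd.DataFrame({"col one": [100, 200], "col two": [300, 400]})
df = df.rename(columns={"col one": "col_one", "col two": "col_two"})

# Second way: a random 4x5 sample, columns named from a list, then prefixed
sample = pd.DataFrame(np.random.rand(4, 5), columns=list("abcde")).add_prefix("X_")
```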
Here we have two different ways of creating a DataFrame and then renaming its columns. The first creates a DataFrame with 2 columns and 2 rows, then renames the columns using the `rename()` function.
| | col_one | col_two |
|---|---|---|
| 0 | 100 | 300 |
| 1 | 200 | 400 |
The second constructs random samples, names the columns after the list passed to the `columns` argument, and then prefixes these names using the `add_prefix()` method.
| | X_a | X_b | X_c | X_d | X_e |
|---|---|---|---|---|---|
| 0 | 0.226161 | 0.588816 | 0.335782 | 0.866124 | 0.274059 |
| 1 | 0.364500 | 0.021647 | 0.243754 | 0.628392 | 0.124118 |
| 2 | 0.507057 | 0.769180 | 0.364644 | 0.246468 | 0.716658 |
| 3 | 0.865142 | 0.201212 | 0.165379 | 0.397729 | 0.195590 |
4. Reverse row and column order
These are basic operations that consist of reversing the original order of the DataFrame row-wise or column-wise. Here, `reset_index(drop=True)` is used to drop the old index and create a new one starting from 0. We can notice the difference between the first and the second output tables.
`police.loc[::-1, :].head(3)`

| | stop_date | driver_gender | driver_age_raw | violation |
|---|---|---|---|---|
| 91740 | 2015-12-31 | M | 1959.0 | Speeding |
| 91739 | 2015-12-31 | M | 1993.0 | Speeding |
| 91738 | 2015-12-31 | M | 1992.0 | Moving violation |
`police.loc[::-1, :].reset_index(drop=True).head(3)`

| | stop_date | driver_gender | driver_age_raw | violation |
|---|---|---|---|---|
| 0 | 2015-12-31 | M | 1959.0 | Speeding |
| 1 | 2015-12-31 | M | 1993.0 | Speeding |
| 2 | 2015-12-31 | M | 1992.0 | Moving violation |
`police.loc[:, ::-1].head(3)`

| | violation | driver_age_raw | driver_gender | stop_date |
|---|---|---|---|---|
| 0 | Speeding | 1985.0 | M | 2005-01-02 |
| 1 | Speeding | 1965.0 | M | 2005-01-18 |
| 2 | Speeding | 1972.0 | M | 2005-01-23 |
5. Missing data
Missing data is a common problem that we encounter in data analysis tasks. In machine learning, one can define missing values as missing information that would bring additional information if it were observed[^1]. For that reason, investigating the missing information plays an important role in a data analysis problem. The following displays the column-wise percentage of missing data, followed by a pythonic way of getting rid of columns holding only missing data, and then a trick to filter out columns based on their ratio of missing values.
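A sketch of how these summaries could be computed, assuming the full `police` DataFrame loaded earlier; the exact code is reconstructed, but the names `missing` and `new_missing` and the 90% threshold follow the outputs shown below:

```python
# Column-wise percentage of missing values, sorted from most to least missing
missing = (
    police.isna().mean().mul(100).round(2)
    .sort_values(ascending=False)
    .rename("missing %")
    .rename_axis("features")
    .reset_index()
)

# Pythonic way to drop columns that contain only missing values
police = police.dropna(axis="columns", how="all")

# Keep only the columns holding at least 90% non-missing values
new_police = police.dropna(axis="columns", thresh=int(len(police) * 0.9))

# Recompute the summary on the reduced DataFrame
new_missing = (
    new_police.isna().mean().mul(100).round(2)
    .sort_values(ascending=False)
    .rename("missing %")
    .rename_axis("features")
    .reset_index()
)
```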
The `police` dataset:

| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
| 1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
| 2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False |
Missing data summary: `missing`

| | features | missing % |
|---|---|---|
| 0 | county_name | 100.00 |
| 1 | search_type | 96.52 |
| 2 | driver_age | 6.13 |
| 3 | driver_gender | 5.82 |
| 4 | driver_age_raw | 5.81 |
| 5 | driver_race | 5.81 |
| 6 | violation_raw | 5.81 |
| 7 | violation | 5.81 |
| 8 | stop_outcome | 5.81 |
| 9 | is_arrested | 5.81 |
| 10 | stop_duration | 5.81 |
| 11 | stop_date | 0.00 |
| 12 | stop_time | 0.00 |
| 13 | search_conducted | 0.00 |
| 14 | drugs_related_stop | 0.00 |
Keep columns containing at least 90% non-missing values: `new_missing`

Here we can notice that the number of features has been reduced: the first two columns, `county_name` and `search_type`, have been removed as they held missing values beyond the defined ratio.
| | features | missing % |
|---|---|---|
| 0 | driver_age | 6.13 |
| 1 | driver_gender | 5.82 |
| 2 | driver_age_raw | 5.81 |
| 3 | driver_race | 5.81 |
| 4 | violation_raw | 5.81 |
| 5 | violation | 5.81 |
| 6 | stop_outcome | 5.81 |
| 7 | is_arrested | 5.81 |
| 8 | stop_duration | 5.81 |
| 9 | stop_date | 0.00 |
| 10 | stop_time | 0.00 |
| 11 | search_conducted | 0.00 |
| 12 | drugs_related_stop | 0.00 |
6. Split cells data
Splitting data is useful when dealing with `object` or `category` like data types. Let us assume that we are interested in splitting `stop_date` into `day`, `month`, and `year`. There are at least two ways of doing so. The first and easiest approach is to treat the `stop_date` attribute as a string and apply the transformation to the string, since Python offers a couple of string manipulation helpers. But in our case, `stop_date` has already been parsed to `datetime64` while loading the dataset. To use the string helper functions, we are obliged to cast the field back to `str` and then apply the split to the data afterward. This is computationally expensive when there are millions of rows. The second approach comes into play to address this problem: we treat the attribute as `datetime64` and apply the transformation to it directly.
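A sketch of the two approaches, reconstructed from the description above rather than copied from the original snippet; the new column names follow the output table below:

```python
# Approach 1: treat stop_date as a string and split on '-'
# (a) astype(str) keeps only the date part of the datetime64 values
police[["year", "month", "day"]] = (
    police["stop_date"].astype(str).str.split("-", expand=True)
)
# (b) a lambda applied element-wise; str(Timestamp) also carries the time part
# police["stop_date"].apply(lambda ts: str(ts).split("-"))

# Approach 2: treat stop_date as datetime64 and use the .dt accessor
police["year"] = police["stop_date"].dt.year
police["month"] = police["stop_date"].dt.month
police["day"] = police["stop_date"].dt.day
```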
The string parsing approach will produce the same output, though there is a slight difference in how the data gets parsed in each case. The first method, via `astype()`, parses the date from the column already loaded as `datetime64`, whereas the `lambda` function processes each raw value right after the data has been loaded. Therefore, the first one ends up with only the day string, whereas the second also carries the time, which is a set of hour, minute, and second.

The datetime parsing is a built-in pandas feature that can only be applied to columns of type `datetime`, but it is the most convenient way to extract date components.
| | stop_date | stop_time | county_name | driver_gender | driver_age_raw | driver_age | driver_race | violation_raw | violation | search_conducted | search_type | stop_outcome | is_arrested | stop_duration | drugs_related_stop | year | month | day |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 2005-01-02 | 01:55 | NaN | M | 1985.0 | 20.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | 2005 | 01 | 02 |
| 1 | 2005-01-18 | 08:15 | NaN | M | 1965.0 | 40.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | 2005 | 01 | 18 |
| 2 | 2005-01-23 | 23:15 | NaN | M | 1972.0 | 33.0 | White | Speeding | Speeding | False | NaN | Citation | False | 0-15 Min | False | 2005 | 01 | 23 |
The same operation can easily be applied to `object` or `category` data types.
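A sketch of how this could look; the DataFrame variable name `people` is an assumption, while the sample values come from the tables below:

```python
import pandas as pd

people = pd.DataFrame(
    {
        "fullname": ["John Doe McArthur", "Jane Smith McDonald"],
        "address": ["Los Angeles, CA", "Washington, DC"],
    }
)

# Splitting fullname on spaces yields three unnamed columns: 0, 1, 2
people["fullname"].str.split(" ", expand=True)

# Assigning the split results to new columns gives the fields their own header names
people[["First", "Middle", "Last"]] = people["fullname"].str.split(" ", expand=True)
people[["City", "State"]] = people["address"].str.split(", ", expand=True)
```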
The DataFrame looks like:

| | fullname | address |
|---|---|---|
| 0 | John Doe McArthur | Los Angeles, CA |
| 1 | Jane Smith McDonald | Washington, DC |
After splitting `fullname` by space, it will generate three fields:
| | 0 | 1 | 2 |
|---|---|---|---|
| 0 | John | Doe | McArthur |
| 1 | Jane | Smith | McDonald |
After splitting `fullname` and `address`, the generated fields now have their own header names:
| | fullname | address | First | Middle | Last | City | State |
|---|---|---|---|---|---|---|---|
| 0 | John Doe McArthur | Los Angeles, CA | John | Doe | McArthur | Los Angeles | CA |
| 1 | Jane Smith McDonald | Washington, DC | Jane | Smith | McDonald | Washington | DC |
7. Filter a dataframe
What makes a pandas DataFrame different from other multi-dimensional data structures, such as NumPy arrays, is the way it organizes the dataset to ease manipulation. Pandas uses an index along every dimension of a Series or DataFrame. The index serves as a label for every row and column, allowing users to access any specific set of cells. This makes filtering tasks pretty straightforward. Indexes are, by default, of integer type unless they have been explicitly set otherwise.
There are multiple ways to work with these indexes, but two of them are common among the data science community: the indexing operator, which refers to the brackets `[]` attached directly after a Series or DataFrame, and indexers, which are built-in attributes of a Series and DataFrame such as `.iloc` and `.loc`.
As with the previous approaches, I will give a set of examples below from which you can easily derive your own use case.
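For instance, a few common filtering patterns on the `police` dataset might look like this (a sketch, not the article's original examples):

```python
# Indexing operator with a boolean mask
speeding = police[police["violation"] == "Speeding"]

# .loc: label-based selection of rows and columns
males = police.loc[police["driver_gender"] == "M", ["stop_date", "violation"]]

# .iloc: position-based selection (first 10 rows, first 3 columns)
subset = police.iloc[:10, :3]
```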
[^1]: All about missing data handling, by Baijayanta Roy