dplyr-style Data Manipulation with Pipes in Python

I often use R’s dplyr package for exploratory data analysis and data manipulation. In addition to providing a consistent set of functions that one can use to solve the most common data manipulation problems, dplyr also allows one to write elegant, chainable data manipulation code using pipes.

Now, Python is my main language and pandas is my swiss army knife for data analysis, yet I often wished there was a Python package that allowed dplyr-style data manipulation directly on pandas DataFrames. I searched the Internet and found a package called dfply, developed by Kiefer Katovich. Like dplyr, dfply also allows chaining of multiple operations with pipe operators.

This post will focus on the core functions of the dfply package and show how to use them to manipulate pandas DataFrames. The complete source code and dataset is available on Github.

Getting Started

The first thing we need to do is install the package using pip.

pip install dfply

According to the project’s Github repo, dfply only works with Python 3, so ensure you have the right version of Python installed.

Data

To explore the functionality of dfply, we will use the same data used by the Introduction to dplyr vignette. The data is from the Bureau of Transporation Statistics and it contains information about all the 336,776 flights that departed from New York City in 2013.

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')
flight_data.head()
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
0 201311517.05152.0830.081911.0UA1545N14228EWRIAH227.014005152013-01-01 05:00:00
1 201311533.05294.0850.083020.0UA1714N24211LGAIAH227.014165292013-01-01 05:00:00
2 201311542.05402.0923.085033.0AA1141N619AAJFKMIA160.010895402013-01-01 05:00:00
3 201311544.0545-1.01004.01022-18.0B6725N804JBJFKBQN183.015765452013-01-01 05:00:00
4 201311554.0600-6.0812.0837-25.0DL461N668DNLGAATL116.0762602013-01-01 06:00:00

Piping

Let’s say you want to perform n discrete transformation operations on your dataset before outputting the final result. The most common way is to perform the operations step by step and store the result of each step in a variable. The variable holding the intermediate result is then used in the next step of the transformation pipeline. Let’s take a look at an abstract example.

# 'original_data' could be a pandas DataFrame.
result_1 = transformation_1(original_data, *args, **kwargs)
result_2 = transformation_2(result_1, *args, **kwargs)
result_3 = transformation_3(result_2, *args, **kwargs)
.
.
.
final_result = transformation_n(result_n-1, *args, **kwargs)

This isn’t very elegant code and it can get confusing and messy to write. This is where piping comes to the rescue. Piping allows us to rewrite the above code without needing those intermediate variables.

final_result = original_data -->
                transformation_1(*args, **kwargs) -->
                transformation_2(*args, **kwargs) -->
                transformation_3(*args, **kwargs) -->
                .
                .
                .
                transformation_n(*args, **kwargs)

Magic?! No, it isn’t. Piping works by implicitly making the output of one stage the input of the following stage. In other words, each transformation step works on the transformed result of its previous step.

Piping with dfply

dfply allows chaining multiple operations on a pandas DataFrame with the >> operator. One can chain operations and assign the final output (a pandas DataFrame, since dfply works directly on DataFrames) to a variable. In dfply, the DataFrame result of each step of a chain of operations is represented by X.
For example, if you want to select three columns from a DataFrame in a step, drop the third column in the next step, and then show the first three rows of the final dataframe, you could do something like this:

# 'data' is the original pandas DataFrame
(data >>
 select(X.first_col, X.second_col, X.third_col) >>
 drop(X.third_col) >>
 head(3))

select and drop are both dfply transformation functions, while X represents the result of each transformation step.

Exploring some of dfply’s transformation methods

dfply provides a set of functions for selecting and dropping columns, subsetting and filtering rows, grouping data, and reshaping data, to name a few.

Select and drop columns with select() and drop()

Occassionally, you will work on datasets with a lot of columns, but only a subset of the columns will be of interest; select() allows you to select these columns.
For example, to select the origin, dest, and hour columns in the flight_data DataFrame we loaded earlier, we do:

(flight_data >>
 select(X.origin, X.dest, X.hour))
origin dest hour
0 EWRIAH5
1 LGAIAH5
2 JFKMIA5
3 JFKBQN5
4 LGAATL6

drop() is the inverse of select(). It returns all the columns except those passed in as arguments.
For example, to get all the columns except the year, month, and day columns:

(flight_data >>
 drop(X.year, X.month, X.day))

You can also drop columns inside the select() method by putting a tilde ~ in front of the column(s) you wish to drop.
For example, to select all but the hour and minute columns in the flight_data DataFrame:

(flight_data >>
 select(~X.hour, ~X.minute))

Filter rows with mask()

mask() allows you to select a subset of rows in a pandas DataFrame based on logical criteria. mask() selects all the rows where the criteria is/are true.
For example, to select all flights longer than 10 hours that originated from JFK airport on January 1:

(flight_data >>
  mask(X.month == 1, X.day == 1, X.origin == 'JFK', X.hour > 10))
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
151 201311848.01835853.01001.01950851.0MQ3944N942MQJFKBWI41.018418352013-01-01 18:00:00
258 2013111059.01100-1.01210.01215-5.0MQ3792N509MQJFKDCA50.02131102013-01-01 11:00:00
265 2013111111.01115-4.01222.01226-4.0B624N279JBJFKBTV52.026611152013-01-01 11:00:00
266 2013111112.0110012.01440.014382.0UA285N517UAJFKSFO364.025861102013-01-01 11:00:00
272 2013111124.0110024.01435.014314.0B6641N590JBJFKSFO349.025861102013-01-01 11:00:00

Sort rows with arrange()

arrange() allows you to order rows based on one or multiple columns; the default behaviour is to sort the rows in ascending order.
For example, to sort by distance and then by the number of hours the flights take, we do:

(flight_data >>
 arrange(X.distance, X.hour))
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier flight tailnum origin dest air_time distance hour minute time_hour
275945 2013727NaN106NaNNaN245NaNUS1632NaNEWRLGANaN17162013-07-27 01:00:00
3083 2013141240.0120040.01333.0130627.0EV4193N14972EWRPHL30.0801202013-01-04 12:00:00
3901 2013151155.01200-5.01241.01306-25.0EV4193N14902EWRPHL29.0801202013-01-05 12:00:00
3426 2013141829.01615134.01937.01721136.0EV4502N15983EWRPHL28.08016152013-01-04 16:00:00
10235 20131121613.01617-4.01708.01722-14.0EV4616N11150EWRPHL36.08016172013-01-12 16:00:00

To sort in descending order, you set the ascending keyword argument of arrange() to False, like this:

(flight_data >>
 arrange(X.distance, X.hour, ascending=False))

Add new columns with mutate()

mutate() allows you to create new columns in the DataFrame. The new columns can be composed from existing columns.
For example, let’s create two new columns: one by dividing the distance column by 1000, and the other by concatenating the carrier and origin columns. We will name these new columns new_distance and carrier_origin respectively.

(flight_data >>
 mutate(
   new_distance = X.distance / 1000,
   carrier_origin = X.carrier + X.origin
 ))
year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier ... tailnum origin dest air_time distance hour minute time_hour carrier_origin new_distance
0 201311517.05152.0830.081911.0UA...N14228EWRIAH227.014005152013-01-01 05:00:00UAEWR1.400
1 201311533.05294.0850.083020.0UA...N24211LGAIAH227.014165292013-01-01 05:00:00UALGA1.416
2 201311542.05402.0923.085033.0AA...N619AAJFKMIA160.010895402013-01-01 05:00:00AAJFK1.089
3 201311544.0545-1.01004.01022-18.0B6...N804JBJFKBQN183.015765452013-01-01 05:00:00B6JFK1.576
4 201311554.0600-6.0812.0837-25.0DL...N668DNLGAATL116.0762602013-01-01 06:00:00DLLGA0.762

The newly created columns will be at the end of the DataFrame.

Group and ungroup data with group_by() and ungroup()

group_by() allows you to group the DataFrame by one or multiple columns. Functions chained after group_by() are applied on the group until the DataFrame is ungrouped with the ungroup() function. For example, to group the data by the originating airport, we do:

(flight_data >>
 group_by(X.origin))

Summarise data using summarize()

summarize() is typically used together with group_by() to reduce each group into a single row summary. In other words, the output will have one row for each group. For example, to calculate the mean distance for flights originating from every airport, we do:

(flight_data >>
 group_by(X.origin) >>
 summarize(mean_distance = X.distance.mean())
)
origin mean_distance
0 EWR1056.742790
1 JFK1266.249077
2 LGA779.835671

Bringing it all together with pipes

Let’s say you want to perform the following operations on the flights data

  • [Step 1]: Filter out all flights less than 10 hours
  • [Step 2]: Create a new column, speed, using the formula [distance / (air time * 60)]
  • [Step 3]: Calculate the mean speed for flights originating from each airport
  • [Step 4]: Sort the result by mean speed in descending order

We will write the operations using dfply piping operator >>. We won’t have to use intermediate variables to save the result of each step.

(flight_data >>
  mask(X.hour > 10) >> # step 1
  mutate(speed = X.distance / (X.air_time * 60)) >> # step 2
  group_by(X.origin) >> # step 3a
  summarize(mean_speed = X.speed.mean()) >> # step 3b
  arrange(X.mean_speed, ascending=False) # step 4
)
origin mean_speed
0 EWR0.109777
1 JFK0.109427
2 LGA0.107362

If we use pandas data manipulation functions instead of dfply’s, our code will look something like this:

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

I find the dfply version easier to read and understand than the pandas version.

Conclusion

This is by no means an exhaustive coverage of the functionality of the dfply package. The package documentation is really good and I advise that you check it out to learn more.

If you have suggestions or questions, please drop a comment in the comment section below. You can also send me an email at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.

The complete source code for this blog post is available on Github. Thank you for reading, and please don’t forget to share.