dplyr-style Data Manipulation with Pipes in Python

03 Jan 2018

I often use R’s dplyr package for exploratory data analysis and data manipulation. In addition to providing a consistent set of functions that one can use to solve the most common data manipulation problems, dplyr also allows one to write elegant, chainable data manipulation code using pipes.

Now, Python is my main language and pandas is my swiss army knife for data analysis, yet I often wished there was a Python package that allowed dplyr-style data manipulation directly on pandas DataFrames. I searched the Internet and found a package called dfply, developed by Kiefer Katovich. Like dplyr, dfply also allows chaining of multiple operations with pipe operators.

This post will focus on the core functions of the dfply package and show how to use them to manipulate pandas DataFrames. The complete source code and dataset is available on Github.

Getting Started

The first thing we need to do is install the package using pip.

pip install dfply

According to the project’s Github repo, dfply only works with Python 3, so ensure you have the right version of Python installed.

Data

To explore the functionality of dfply, we will use the same data used by the Introduction to dplyr vignette. The data is from the Bureau of Transporation Statistics and it contains information about all the 336,776 flights that departed from New York City in 2013.

from dfply import *
import pandas as pd

flight_data = pd.read_csv('nycflights13.csv')
flight_data.head()

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
0	2013	1	1	517.0	515	2.0	830.0	819	11.0	UA	1545	N14228	EWR	IAH	227.0	1400	5	15	2013-01-01 05:00:00
1	2013	1	1	533.0	529	4.0	850.0	830	20.0	UA	1714	N24211	LGA	IAH	227.0	1416	5	29	2013-01-01 05:00:00
2	2013	1	1	542.0	540	2.0	923.0	850	33.0	AA	1141	N619AA	JFK	MIA	160.0	1089	5	40	2013-01-01 05:00:00
3	2013	1	1	544.0	545	-1.0	1004.0	1022	-18.0	B6	725	N804JB	JFK	BQN	183.0	1576	5	45	2013-01-01 05:00:00
4	2013	1	1	554.0	600	-6.0	812.0	837	-25.0	DL	461	N668DN	LGA	ATL	116.0	762	6	0	2013-01-01 06:00:00

Piping

Let’s say you want to perform n discrete transformation operations on your dataset before outputting the final result. The most common way is to perform the operations step by step and store the result of each step in a variable. The variable holding the intermediate result is then used in the next step of the transformation pipeline. Let’s take a look at an abstract example.

# 'original_data' could be a pandas DataFrame.
result_1 = transformation_1(original_data, *args, **kwargs)
result_2 = transformation_2(result_1, *args, **kwargs)
result_3 = transformation_3(result_2, *args, **kwargs)
.
.
.
final_result = transformation_n(result_n-1, *args, **kwargs)

This isn’t very elegant code and it can get confusing and messy to write. This is where piping comes to the rescue. Piping allows us to rewrite the above code without needing those intermediate variables.

final_result = original_data -->
                transformation_1(*args, **kwargs) -->
                transformation_2(*args, **kwargs) -->
                transformation_3(*args, **kwargs) -->
                .
                .
                .
                transformation_n(*args, **kwargs)

Magic?! No, it isn’t. Piping works by implicitly making the output of one stage the input of the following stage. In other words, each transformation step works on the transformed result of its previous step.

Piping with dfply

dfply allows chaining multiple operations on a pandas DataFrame with the >> operator. One can chain operations and assign the final output (a pandas DataFrame, since dfply works directly on DataFrames) to a variable. In dfply, the DataFrame result of each step of a chain of operations is represented by X.
For example, if you want to select three columns from a DataFrame in a step, drop the third column in the next step, and then show the first three rows of the final dataframe, you could do something like this:

# 'data' is the original pandas DataFrame
(data >>
 select(X.first_col, X.second_col, X.third_col) >>
 drop(X.third_col) >>
 head(3))

select and drop are both dfply transformation functions, while X represents the result of each transformation step.

Exploring some of dfply’s transformation methods

dfply provides a set of functions for selecting and dropping columns, subsetting and filtering rows, grouping data, and reshaping data, to name a few.

Select and drop columns with `select()` and `drop()`

Occassionally, you will work on datasets with a lot of columns, but only a subset of the columns will be of interest; select() allows you to select these columns.
For example, to select the origin, dest, and hour columns in the flight_data DataFrame we loaded earlier, we do:

(flight_data >>
 select(X.origin, X.dest, X.hour))

	origin	dest	hour
0	EWR	IAH	5
1	LGA	IAH	5
2	JFK	MIA	5
3	JFK	BQN	5
4	LGA	ATL	6

drop() is the inverse of select(). It returns all the columns except those passed in as arguments.
For example, to get all the columns except the year, month, and day columns:

(flight_data >>
 drop(X.year, X.month, X.day))

You can also drop columns inside the select() method by putting a tilde ~ in front of the column(s) you wish to drop.
For example, to select all but the hour and minute columns in the flight_data DataFrame:

(flight_data >>
 select(~X.hour, ~X.minute))

Filter rows with `mask()`

mask() allows you to select a subset of rows in a pandas DataFrame based on logical criteria. mask() selects all the rows where the criteria is/are true.
For example, to select all flights longer than 10 hours that originated from JFK airport on January 1:

(flight_data >>
  mask(X.month == 1, X.day == 1, X.origin == 'JFK', X.hour > 10))

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
151	2013	1	1	848.0	1835	853.0	1001.0	1950	851.0	MQ	3944	N942MQ	JFK	BWI	41.0	184	18	35	2013-01-01 18:00:00
258	2013	1	1	1059.0	1100	-1.0	1210.0	1215	-5.0	MQ	3792	N509MQ	JFK	DCA	50.0	213	11	0	2013-01-01 11:00:00
265	2013	1	1	1111.0	1115	-4.0	1222.0	1226	-4.0	B6	24	N279JB	JFK	BTV	52.0	266	11	15	2013-01-01 11:00:00
266	2013	1	1	1112.0	1100	12.0	1440.0	1438	2.0	UA	285	N517UA	JFK	SFO	364.0	2586	11	0	2013-01-01 11:00:00
272	2013	1	1	1124.0	1100	24.0	1435.0	1431	4.0	B6	641	N590JB	JFK	SFO	349.0	2586	11	0	2013-01-01 11:00:00

Sort rows with `arrange()`

arrange() allows you to order rows based on one or multiple columns; the default behaviour is to sort the rows in ascending order.
For example, to sort by distance and then by the number of hours the flights take, we do:

(flight_data >>
 arrange(X.distance, X.hour))

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	flight	tailnum	origin	dest	air_time	distance	hour	minute	time_hour
275945	2013	7	27	NaN	106	NaN	NaN	245	NaN	US	1632	NaN	EWR	LGA	NaN	17	1	6	2013-07-27 01:00:00
3083	2013	1	4	1240.0	1200	40.0	1333.0	1306	27.0	EV	4193	N14972	EWR	PHL	30.0	80	12	0	2013-01-04 12:00:00
3901	2013	1	5	1155.0	1200	-5.0	1241.0	1306	-25.0	EV	4193	N14902	EWR	PHL	29.0	80	12	0	2013-01-05 12:00:00
3426	2013	1	4	1829.0	1615	134.0	1937.0	1721	136.0	EV	4502	N15983	EWR	PHL	28.0	80	16	15	2013-01-04 16:00:00
10235	2013	1	12	1613.0	1617	-4.0	1708.0	1722	-14.0	EV	4616	N11150	EWR	PHL	36.0	80	16	17	2013-01-12 16:00:00

To sort in descending order, you set the ascending keyword argument of arrange() to False, like this:

(flight_data >>
 arrange(X.distance, X.hour, ascending=False))

Add new columns with `mutate()`

mutate() allows you to create new columns in the DataFrame. The new columns can be composed from existing columns.
For example, let’s create two new columns: one by dividing the distance column by 1000, and the other by concatenating the carrier and origin columns. We will name these new columns new_distance and carrier_origin respectively.

(flight_data >>
 mutate(
   new_distance = X.distance / 1000,
   carrier_origin = X.carrier + X.origin
 ))

	year	month	day	dep_time	sched_dep_time	dep_delay	arr_time	sched_arr_time	arr_delay	carrier	...	tailnum	origin	dest	air_time	distance	hour	minute	time_hour	carrier_origin	new_distance
0	2013	1	1	517.0	515	2.0	830.0	819	11.0	UA	...	N14228	EWR	IAH	227.0	1400	5	15	2013-01-01 05:00:00	UAEWR	1.400
1	2013	1	1	533.0	529	4.0	850.0	830	20.0	UA	...	N24211	LGA	IAH	227.0	1416	5	29	2013-01-01 05:00:00	UALGA	1.416
2	2013	1	1	542.0	540	2.0	923.0	850	33.0	AA	...	N619AA	JFK	MIA	160.0	1089	5	40	2013-01-01 05:00:00	AAJFK	1.089
3	2013	1	1	544.0	545	-1.0	1004.0	1022	-18.0	B6	...	N804JB	JFK	BQN	183.0	1576	5	45	2013-01-01 05:00:00	B6JFK	1.576
4	2013	1	1	554.0	600	-6.0	812.0	837	-25.0	DL	...	N668DN	LGA	ATL	116.0	762	6	0	2013-01-01 06:00:00	DLLGA	0.762

The newly created columns will be at the end of the DataFrame.

Group and ungroup data with `group_by()` and `ungroup()`

group_by() allows you to group the DataFrame by one or multiple columns. Functions chained after group_by() are applied on the group until the DataFrame is ungrouped with the ungroup() function. For example, to group the data by the originating airport, we do:

(flight_data >>
 group_by(X.origin))

Summarise data using `summarize()`

summarize() is typically used together with group_by() to reduce each group into a single row summary. In other words, the output will have one row for each group. For example, to calculate the mean distance for flights originating from every airport, we do:

(flight_data >>
 group_by(X.origin) >>
 summarize(mean_distance = X.distance.mean())
)

	origin	mean_distance
0	EWR	1056.742790
1	JFK	1266.249077
2	LGA	779.835671

Bringing it all together with pipes

Let’s say you want to perform the following operations on the flights data

[Step 1]: Filter out all flights less than 10 hours
[Step 2]: Create a new column, speed, using the formula [distance / (air time * 60)]
[Step 3]: Calculate the mean speed for flights originating from each airport
[Step 4]: Sort the result by mean speed in descending order

We will write the operations using dfply piping operator >>. We won’t have to use intermediate variables to save the result of each step.

(flight_data >>
  mask(X.hour > 10) >> # step 1
  mutate(speed = X.distance / (X.air_time * 60)) >> # step 2
  group_by(X.origin) >> # step 3a
  summarize(mean_speed = X.speed.mean()) >> # step 3b
  arrange(X.mean_speed, ascending=False) # step 4
)

	origin	mean_speed
0	EWR	0.109777
1	JFK	0.109427
2	LGA	0.107362

If we use pandas data manipulation functions instead of dfply’s, our code will look something like this:

flight_data.loc[flight_data['hour'] > 10, 'speed'] = flight_data['distance'] / (flight_data['air_time'] * 60)
result = flight_data.groupby('origin', as_index=False)['speed'].mean()
result.sort_values('speed', ascending=False)

I find the dfply version easier to read and understand than the pandas version.

Conclusion

This is by no means an exhaustive coverage of the functionality of the dfply package. The package documentation is really good and I advise that you check it out to learn more.

If you have suggestions or questions, please drop a comment in the comment section below. You can also send me an email at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.

The complete source code for this blog post is available on Github. Thank you for reading, and please don’t forget to share.