dplyr-style Data Manipulation with Pipes in Python
I often use R’s dplyr
package for exploratory data analysis and data manipulation. In addition to providing a consistent set of functions that one can use to solve the most common data manipulation problems, dplyr also allows one to write elegant, chainable data manipulation code using pipes.
Now, Python is my main language and pandas
is my Swiss Army knife for data analysis, yet I often wished there was a Python package that allowed dplyr-style data manipulation directly on pandas DataFrames. I searched the Internet and found a package called dfply, developed by Kiefer Katovich. Like dplyr, dfply also allows chaining of multiple operations with pipe operators.
This post will focus on the core functions of the dfply package and show how to use them to manipulate pandas DataFrames. The complete source code and dataset are available on GitHub.
Getting Started
The first thing we need to do is install the package using pip:
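```
pip install dfply
```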
According to the project’s GitHub repo, dfply only works with Python 3, so ensure you have the right version of Python installed.
Data
To explore the functionality of dfply, we will use the same data used by the Introduction to dplyr vignette. The data is from the Bureau of Transportation Statistics, and it contains information about all 336,776 flights that departed from New York City in 2013.
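A minimal setup sketch (the CSV file name below is illustrative; point pd.read_csv at wherever your copy of the flights data lives):

```python
import pandas as pd
from dfply import *  # brings in X and the transformation functions

# Load the nycflights13 flights data; the file name is an assumption.
flight_data = pd.read_csv('nycflights13.csv')
flight_data.head()
```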
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | 1545 | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01 05:00:00 |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | 1714 | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01 05:00:00 |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | 1141 | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01 05:00:00 |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | 725 | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01 05:00:00 |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | 461 | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01 06:00:00 |
Piping
Let’s say you want to perform n
discrete transformation operations on your dataset before outputting the final result. The most common way is to perform the operations step by step and store the result of each step in a variable. The variable holding the intermediate result is then used in the next step of the transformation pipeline. Let’s take a look at an abstract example.
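```python
# A sketch with three hypothetical transformation steps standing in
# for real ones; each intermediate result needs its own variable.
def step_1(df): return df  # e.g., filter rows
def step_2(df): return df  # e.g., add a column
def step_3(df): return df  # e.g., aggregate

result_1 = step_1(flight_data)   # first intermediate variable
result_2 = step_2(result_1)      # second intermediate variable
final_result = step_3(result_2)
```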
This isn’t very elegant code and it can get confusing and messy to write. This is where piping comes to the rescue. Piping allows us to rewrite the above code without needing those intermediate variables.
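Written with a pipe operator, the same pipeline collapses into a single chain (illustrative pseudocode; the next section makes the >> syntax concrete):

```python
# Illustrative pseudocode: each stage's output implicitly becomes
# the next stage's input, so no intermediate variables are needed.
final_result = flight_data >> step_1 >> step_2 >> step_3
```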
Magic?! No, it isn’t. Piping works by implicitly making the output of one stage the input of the following stage. In other words, each transformation step works on the transformed result of its previous step.
Piping with dfply
dfply allows chaining multiple operations on a pandas DataFrame with the >>
operator. One can chain operations and assign the final output (a pandas DataFrame, since dfply works directly on DataFrames) to a variable. In dfply, the DataFrame result of each step of a chain of operations is represented by X.
For example, if you want to select three columns from a DataFrame in one step, drop the third column in the next step, and then show the first three rows of the final DataFrame, you could do something like this:
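```python
# A sketch using dfply's select, drop, and head functions.
(flight_data >>
 select(X.origin, X.dest, X.hour) >>  # keep three columns
 drop(X.hour) >>                      # drop the third of them
 head(3))                             # show the first three rows
```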
select
and drop
are both dfply transformation functions, while X
represents the result of each transformation step.
Exploring some of dfply’s transformation methods
dfply
provides a set of functions for selecting and dropping columns, subsetting and filtering rows, grouping data, and reshaping data, to name a few.
Select and drop columns with select() and drop()
Occasionally, you will work on datasets with a lot of columns, but only a subset of the columns will be of interest; select()
allows you to select these columns.
For example, to select the origin, dest, and hour columns in the flight_data DataFrame we loaded earlier, we do:
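```python
# The table below shows the first five rows of the result.
flight_data >> select(X.origin, X.dest, X.hour)
```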
| | origin | dest | hour |
---|---|---|---|
0 | EWR | IAH | 5 |
1 | LGA | IAH | 5 |
2 | JFK | MIA | 5 |
3 | JFK | BQN | 5 |
4 | LGA | ATL | 6 |
drop()
is the inverse of select()
. It returns all the columns except those passed in as arguments.
For example, to get all the columns except the year, month, and day columns:
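```python
# Everything except year, month, and day.
flight_data >> drop(X.year, X.month, X.day)
```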
You can also drop columns inside select() by putting a tilde (~) in front of the column(s) you wish to drop.
For example, to select all but the hour
and minute
columns in the flight_data
DataFrame:
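```python
# The tilde inverts the selection: keep every column but these two.
flight_data >> select(~X.hour, ~X.minute)
```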
Filter rows with mask()
mask() allows you to select a subset of rows in a pandas DataFrame based on logical criteria; it keeps all the rows where the given criteria are true.
For example, to select all the January 1 flights out of JFK with a scheduled departure hour later than 10:
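```python
# Multiple criteria passed to mask() must all hold (logical AND).
flight_data >> mask(X.month == 1, X.day == 1,
                    X.origin == 'JFK', X.hour > 10)
```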
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
151 | 2013 | 1 | 1 | 848.0 | 1835 | 853.0 | 1001.0 | 1950 | 851.0 | MQ | 3944 | N942MQ | JFK | BWI | 41.0 | 184 | 18 | 35 | 2013-01-01 18:00:00 |
258 | 2013 | 1 | 1 | 1059.0 | 1100 | -1.0 | 1210.0 | 1215 | -5.0 | MQ | 3792 | N509MQ | JFK | DCA | 50.0 | 213 | 11 | 0 | 2013-01-01 11:00:00 |
265 | 2013 | 1 | 1 | 1111.0 | 1115 | -4.0 | 1222.0 | 1226 | -4.0 | B6 | 24 | N279JB | JFK | BTV | 52.0 | 266 | 11 | 15 | 2013-01-01 11:00:00 |
266 | 2013 | 1 | 1 | 1112.0 | 1100 | 12.0 | 1440.0 | 1438 | 2.0 | UA | 285 | N517UA | JFK | SFO | 364.0 | 2586 | 11 | 0 | 2013-01-01 11:00:00 |
272 | 2013 | 1 | 1 | 1124.0 | 1100 | 24.0 | 1435.0 | 1431 | 4.0 | B6 | 641 | N590JB | JFK | SFO | 349.0 | 2586 | 11 | 0 | 2013-01-01 11:00:00 |
Sort rows with arrange()
arrange()
allows you to order rows based on one or multiple columns; the default behaviour is to sort the rows in ascending order.
For example, to sort by distance and then by the hour of departure, we do:
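```python
# Ascending sort on distance, with ties broken by departure hour.
flight_data >> arrange(X.distance, X.hour)
```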
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | flight | tailnum | origin | dest | air_time | distance | hour | minute | time_hour |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
275945 | 2013 | 7 | 27 | NaN | 106 | NaN | NaN | 245 | NaN | US | 1632 | NaN | EWR | LGA | NaN | 17 | 1 | 6 | 2013-07-27 01:00:00 |
3083 | 2013 | 1 | 4 | 1240.0 | 1200 | 40.0 | 1333.0 | 1306 | 27.0 | EV | 4193 | N14972 | EWR | PHL | 30.0 | 80 | 12 | 0 | 2013-01-04 12:00:00 |
3901 | 2013 | 1 | 5 | 1155.0 | 1200 | -5.0 | 1241.0 | 1306 | -25.0 | EV | 4193 | N14902 | EWR | PHL | 29.0 | 80 | 12 | 0 | 2013-01-05 12:00:00 |
3426 | 2013 | 1 | 4 | 1829.0 | 1615 | 134.0 | 1937.0 | 1721 | 136.0 | EV | 4502 | N15983 | EWR | PHL | 28.0 | 80 | 16 | 15 | 2013-01-04 16:00:00 |
10235 | 2013 | 1 | 12 | 1613.0 | 1617 | -4.0 | 1708.0 | 1722 | -14.0 | EV | 4616 | N11150 | EWR | PHL | 36.0 | 80 | 16 | 17 | 2013-01-12 16:00:00 |
To sort in descending order, you set the ascending keyword argument of arrange() to False, like this:
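```python
# ascending=False is passed through to the underlying sort.
flight_data >> arrange(X.distance, X.hour, ascending=False)
```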
Add new columns with mutate()
mutate()
allows you to create new columns in the DataFrame. The new columns can be composed from existing columns.
For example, let’s create two new columns: one by dividing the distance column by 1000, and the other by concatenating the carrier and origin columns. We will name these new columns new_distance and carrier_origin, respectively.
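```python
# Both new columns are built from existing columns via X.
flight_data >> mutate(
    new_distance = X.distance / 1000,
    carrier_origin = X.carrier + X.origin
)
```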
| | year | month | day | dep_time | sched_dep_time | dep_delay | arr_time | sched_arr_time | arr_delay | carrier | ... | tailnum | origin | dest | air_time | distance | hour | minute | time_hour | carrier_origin | new_distance |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 2013 | 1 | 1 | 517.0 | 515 | 2.0 | 830.0 | 819 | 11.0 | UA | ... | N14228 | EWR | IAH | 227.0 | 1400 | 5 | 15 | 2013-01-01 05:00:00 | UAEWR | 1.400 |
1 | 2013 | 1 | 1 | 533.0 | 529 | 4.0 | 850.0 | 830 | 20.0 | UA | ... | N24211 | LGA | IAH | 227.0 | 1416 | 5 | 29 | 2013-01-01 05:00:00 | UALGA | 1.416 |
2 | 2013 | 1 | 1 | 542.0 | 540 | 2.0 | 923.0 | 850 | 33.0 | AA | ... | N619AA | JFK | MIA | 160.0 | 1089 | 5 | 40 | 2013-01-01 05:00:00 | AAJFK | 1.089 |
3 | 2013 | 1 | 1 | 544.0 | 545 | -1.0 | 1004.0 | 1022 | -18.0 | B6 | ... | N804JB | JFK | BQN | 183.0 | 1576 | 5 | 45 | 2013-01-01 05:00:00 | B6JFK | 1.576 |
4 | 2013 | 1 | 1 | 554.0 | 600 | -6.0 | 812.0 | 837 | -25.0 | DL | ... | N668DN | LGA | ATL | 116.0 | 762 | 6 | 0 | 2013-01-01 06:00:00 | DLLGA | 0.762 |
The newly created columns will be at the end of the DataFrame.
Group and ungroup data with group_by() and ungroup()
group_by()
allows you to group the DataFrame by one or multiple columns. Functions chained after group_by()
are applied on the group until the DataFrame is ungrouped with the ungroup()
function. For example, to group the data by the originating airport, we do:
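```python
# Functions chained after this operate within each origin group
# until ungroup() is called.
flight_data >> group_by(X.origin)
```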
Summarise data using summarize()
summarize()
is typically used together with group_by()
to reduce each group into a single summary row. In other words, the output will have one row for each group. For example, to calculate the mean distance of flights originating from each airport, we do:
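```python
(flight_data >>
 group_by(X.origin) >>
 summarize(mean_distance = X.distance.mean()))
```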
| | origin | mean_distance |
---|---|---|
0 | EWR | 1056.742790 |
1 | JFK | 1266.249077 |
2 | LGA | 779.835671 |
Bringing it all together with pipes
Let’s say you want to perform the following operations on the flights data
- [Step 1]: Keep only the flights with a scheduled departure hour later than 10
- [Step 2]: Create a new column, speed, using the formula distance / (air_time * 60)
- [Step 3]: Calculate the mean speed for flights originating from each airport
- [Step 4]: Sort the result by mean speed in descending order
We will write the operations using dfply’s piping operator >>. We won’t have to use intermediate variables to save the result of each step.
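```python
(flight_data >>
 mask(X.hour > 10) >>                               # Step 1
 mutate(speed = X.distance / (X.air_time * 60)) >>  # Step 2
 group_by(X.origin) >>                              # Step 3
 summarize(mean_speed = X.speed.mean()) >>          # Step 3
 arrange(X.mean_speed, ascending=False))            # Step 4
```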
| | origin | mean_speed |
---|---|---|
0 | EWR | 0.109777 |
1 | JFK | 0.109427 |
2 | LGA | 0.107362 |
If we use pandas data manipulation functions instead of dfply’s, our code will look something like this:
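```python
# One possible pandas translation of the same pipeline.
pandas_result = (
    flight_data[flight_data['hour'] > 10]
    .assign(speed=lambda df: df['distance'] / (df['air_time'] * 60))
    .groupby('origin')['speed']
    .mean()
    .reset_index(name='mean_speed')
    .sort_values('mean_speed', ascending=False)
)
```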
I find the dfply
version easier to read and understand than the pandas
version.
Conclusion
This is by no means an exhaustive coverage of the functionality of the dfply
package. The package documentation is really good and I advise that you check it out to learn more.
If you have suggestions or questions, please drop a comment in the comment section below. You can also send me an email at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.
The complete source code for this blog post is available on GitHub. Thank you for reading, and please don’t forget to share.