Exploratory Analysis of the Washington Post's Police Shooting Dataset using R and Plotly
The Washington Post has been compiling a database of fatal shootings by on-duty police officers in the United States since January 1, 2015. The dataset contains information such as the date and the state in which the killing occurred, and the age, gender, and race of the deceased.
In light of the recent Police killings in the US, I thought it would be interesting to perform the following exploratory analyses using the dataset:
- The distribution of the deceased by age group
- The number of police killings per million people in each state, shown on a choropleth map
We will use R and Plotly for this purpose.
Required Packages
For this analysis, we will need the dplyr package for data manipulation, the magrittr package for the pipe operator, and the plotly package for creating interactive graphs. Install these packages using R’s install.packages() function if you haven’t already, then load them using the library() function.
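A minimal sketch of that setup step, assuming the packages are installed from CRAN. The required_packages and missing_packages names are my own and are only there for illustration.

```r
# Packages used in this post (the vector name is illustrative)
required_packages <- c("dplyr", "magrittr", "plotly")

# Install only the packages that are not already installed
missing_packages <- setdiff(required_packages, rownames(installed.packages()))
if (length(missing_packages) > 0) {
  install.packages(missing_packages)
}

# Load the packages; plotly lists ggplot2 as a dependency and attaches it too
library(dplyr)
library(magrittr)
library(plotly)
```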
The complete source code for this post is available on GitHub.
Data Preparation
The first thing we need to do is read the data into R as a data frame from where it is hosted on GitHub.
We use the colnames() function to get the column names of the data frame.
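The sketch below shows the idea without touching the network. The three rows are made up, and the 14 column names are my assumption about the layout of the CSV file at the time of writing. In practice you would pass the URL of the hosted CSV file to read.csv() instead of the text argument.

```r
# Made-up rows that mimic the assumed layout of the Washington Post CSV file
csv_text <- "id,name,date,manner_of_death,armed,age,gender,race,city,state,signs_of_mental_illness,threat_level,flee,body_camera
1,Person A,2015-01-02,shot,gun,53,M,A,City A,WA,True,attack,Not fleeing,False
2,Person B,2015-01-03,shot,unarmed,,M,W,City B,OR,False,attack,Not fleeing,False
3,Person C,2015-01-04,shot and Tasered,knife,23,F,H,City C,KS,False,other,Not fleeing,False"

# Read the CSV text into a data frame, keeping strings as character vectors
shooting_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# List the column names of the data frame
colnames(shooting_data)
```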
We see that there are 14 variables. We only need the age and state variables for our analysis, so we use dplyr’s select() function to create a new data frame with only those variables. Remember that you can always check a function’s documentation in R using ?function_name.
Let’s see what the new data frame looks like.
The age column contains integer values and the state column contains the 2-letter abbreviations for the states in which the killings occurred.
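As a rough illustration of the selection and preview steps, here is a sketch on a small made-up data frame that stands in for the full dataset. Only the age and state column names come from the dataset itself.

```r
library(dplyr)

# Illustrative stand-in for the full dataset (made-up rows)
shooting_data <- data.frame(
  id    = 1:4,
  age   = c(53L, NA, 23L, 17L),
  race  = c("A", "W", "H", "B"),
  state = c("WA", "OR", "KS", "DC"),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed for the analysis
shooting_data <- shooting_data %>% select(age, state)

# Preview the first rows of the reduced data frame
head(shooting_data)
```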
Handling Missing Values
Before performing any analysis, it’s good practice to check the data for anomalies such as missing values. Missing values are coded as NA in most datasets, so we check for this in ours. Make sure you know how missing values are represented in your dataset and handle them accordingly.
The sapply() function applies the function supplied as its argument to each column of the data frame. Here, the supplied function counts the cells whose value is NA, so the result is the number of missing values in each column. To learn more about the apply family of functions in R, check this tutorial.
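A minimal sketch of this check on a made-up data frame. The na_counts name is my own.

```r
# Illustrative data frame with two missing ages (made-up rows)
shooting_data <- data.frame(
  age   = c(53L, NA, 23L, NA),
  state = c("WA", "OR", "KS", "DC"),
  stringsAsFactors = FALSE
)

# For every column, count the cells whose value is NA
na_counts <- sapply(shooting_data, function(column) sum(is.na(column)))
na_counts
```

The result is a named vector with one count per column, which makes it easy to spot the columns that need attention.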
We see that there are 37 missing values coded as NA in the age column, so we handle them by recoding them to 0. The reason for choosing 0 will become apparent in the following section.
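The recoding itself can be done with logical indexing. This is a sketch on made-up data, not necessarily the exact code used for the post.

```r
# Illustrative data frame with missing ages (made-up rows)
shooting_data <- data.frame(
  age   = c(53L, NA, 23L, NA),
  state = c("WA", "OR", "KS", "DC"),
  stringsAsFactors = FALSE
)

# Replace every missing age with 0 (0L keeps the column an integer vector)
shooting_data$age[is.na(shooting_data$age)] <- 0L

# Confirm that no missing ages remain
sum(is.na(shooting_data$age))
```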
How many people are killed by age group?
Since the ages are integer values, I thought it would be more useful to group them and then plot the distribution of killings across the age groups. We convert the integer values into categories (factors) using the cut() function. Running ?cut in R, we get the following documentation:
cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
Since we coded our missing age values as 0, we label any value from -Inf up to, but not including, 1 as “Unknown”. This interval covers the missing values coded as 0. Ages from 1 up to 18 are grouped into “Under 18”, ages from 18 up to 30 into “18-29”, ages from 30 up to 45 into “30-44”, and ages from 45 up to 60 into “45-59”, while ages of 60 and over are coded as “60 above”. Each interval includes its lower bound and excludes its upper bound.
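The grouping described above can be sketched as follows. The breaks vector and the right = FALSE argument are my reading of the intervals in the text: right = FALSE makes each interval include its lower bound and exclude its upper bound, so an age of 18 lands in “18-29” rather than “Under 18”.

```r
# Illustrative ages, with 0 standing in for a missing value
shooting_data <- data.frame(age = c(0, 17, 18, 29, 30, 45, 60))

# Boundaries of the age groups and the label for each interval
age_breaks <- c(-Inf, 1, 18, 30, 45, 60, Inf)
age_labels <- c("Unknown", "Under 18", "18-29", "30-44", "45-59", "60 above")

# right = FALSE gives intervals of the form [lower, upper)
shooting_data$cat_age <- cut(
  shooting_data$age,
  breaks = age_breaks,
  labels = age_labels,
  right  = FALSE
)

shooting_data
```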
In the following code snippet, we group the data frame by the age group column cat_age we created above and then summarise each group by counting the number of killings. We pass the summarised data into ggplot and display it as a bar chart.
Plotly was chosen for this analysis because it allows us to create beautiful, interactive visualisations, with APIs in Python, R and many other languages. The ggplotly() function accepts a ggplot object as its argument and turns it into an interactive graph. Check out some example plots created with Plotly’s R library.
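A sketch of the group, summarise, and plot steps on made-up data. The killings_by_age, count, age_plot, and interactive_plot names and the axis labels are my own choices.

```r
library(dplyr)
library(ggplot2)
library(plotly)

# Illustrative age groups (made-up rows)
shooting_data <- data.frame(
  cat_age = factor(c("18-29", "18-29", "30-44", "30-44", "30-44", "Unknown"))
)

# Count the number of killings in each age group
killings_by_age <- shooting_data %>%
  group_by(cat_age) %>%
  summarise(count = n())

# Static bar chart built with ggplot2
age_plot <- ggplot(killings_by_age, aes(x = cat_age, y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "Age group", y = "Number of killings")

# Convert the static chart into an interactive Plotly graph
interactive_plot <- ggplotly(age_plot)
interactive_plot
```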
The %>% operator in the snippet above might look strange to some R users. %>%, available in the magrittr package, allows us to pipe a value into an expression or a function call. It improves code readability by removing the need to create a bunch of variables to hold the results of function calls. Check out magrittr’s vignette for a detailed explanation of how to use the package.
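To make the difference concrete, here is a small made-up example that computes the same value with nested calls and with the pipe. The ages vector and the result names are only for illustration.

```r
library(magrittr)

ages <- c(53, 23, 17, 41)

# Nested calls are read from the inside out
nested_result <- round(mean(sqrt(ages)), 2)

# With the pipe, the same steps read from left to right
piped_result <- ages %>%
  sqrt() %>%
  mean() %>%
  round(2)

# Both approaches give the same value
identical(nested_result, piped_result)
```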
The plot is interactive: hover over it or click on it to explore the data.
How many police killings per million people are there in each state?
To compare the number of police killings across the US states in our dataset, we will compute the number of police killings per million people for each state and plot these values on a Plotly choropleth map.
For this analysis, we need the population estimates for each state and the full state names. These details are not in the Washington Post’s dataset, so we need to get them from other sources. We will use the state.abb (2-letter state name abbreviations) and state.name (full state names) datasets available in R to add the full state names to our shooting_data data frame.
The state datasets contain information about the 50 states of the US, arranged in alphabetical order of the state names. So for an index i, the 2-letter abbreviation at state.abb[i] maps to the full state name at state.name[i]. One caveat of the state datasets is that they do not contain information about the District of Columbia.
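A quick way to see this correspondence, using only the datasets that ship with R. The texas_index name is my own.

```r
# The first few abbreviations and the names at the same positions
state.abb[1:3]
state.name[1:3]

# Look up the position of an abbreviation, then read the name at that position
texas_index <- which(state.abb == "TX")
state.name[texas_index]

# The vectors cover the 50 states only, so DC is not among the abbreviations
length(state.abb)
"DC" %in% state.abb
```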
For the population estimates of each state, I created a state population data set from data I extracted from the United States Census Bureau website. I have cleaned the data set and it contains 2015 population estimates for each of the 50 US states and the District of Columbia. The data set is hosted here.
We start our analysis by adding the full state name to each row of our data frame, based on its 2-letter abbreviation. We check for matches between the values of our shooting_data$state column and the state.abb vector. If there is a match, the match() function returns the index of the match in the state.abb vector; otherwise it returns NA. In this case, that only happens for the DC (District of Columbia) abbreviation.
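A sketch of this lookup on made-up rows. Filling in the name for DC by hand is my assumption about how the NA is handled, not necessarily what the original code does.

```r
# Illustrative rows, including one from the District of Columbia
shooting_data <- data.frame(
  age   = c(53L, 23L, 17L),
  state = c("WA", "OR", "DC"),
  stringsAsFactors = FALSE
)

# match() returns the position of each abbreviation in state.abb, or NA if absent
state_index <- match(shooting_data$state, state.abb)

# Use those positions to look up the full state names
shooting_data$state_name <- state.name[state_index]

# DC is not in state.abb, so its name is NA at this point; set it explicitly
shooting_data$state_name[shooting_data$state == "DC"] <- "District of Columbia"

shooting_data
```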
Let’s take a look at the data frame now.
We see that the full state name has been added to the dataframe. To get the population estimates for each state, we read in the state population data set.
In the following code snippet, we group the data frame by state name and count the number of killings in each state. We use merge() to perform an inner join of our shooting_data data frame and the newly imported state_population_data data frame on their common column, state_name. We then use dplyr’s mutate() function to add the computed police killings per million people for each state to the data frame. The hover variable, also added to the data frame, is used by Plotly to display hover text when the user interacts with the choropleth map.
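The steps above can be sketched as follows. The rows and population figures are made up, and the population, count, and killings_per_million column names are hypothetical, since the layout of the population file is not shown here.

```r
library(dplyr)

# Illustrative shooting records (made-up rows)
shooting_data <- data.frame(
  state      = c("TX", "TX", "AK", "AK", "AK"),
  state_name = c("Texas", "Texas", "Alaska", "Alaska", "Alaska"),
  stringsAsFactors = FALSE
)

# Illustrative population table (made-up figures, hypothetical column name)
state_population_data <- data.frame(
  state_name = c("Texas", "Alaska"),
  population = c(2000000, 500000),
  stringsAsFactors = FALSE
)

killings_by_state <- shooting_data %>%
  group_by(state, state_name) %>%
  summarise(count = n()) %>%                          # killings per state
  ungroup() %>%
  as.data.frame() %>%
  merge(state_population_data, by = "state_name") %>% # inner join
  mutate(
    killings_per_million = round(count / population * 1e6, 2),
    hover = paste0(state_name, "<br>Killings per million: ", killings_per_million)
  )

killings_by_state
```

merge() keeps only the rows whose state_name appears in both data frames, which is what makes it an inner join by default.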
The next step is to use Plotly’s plot_ly() function to create the choropleth map. We pass in the killings_by_state data frame we created above as an argument to the function, along with other configuration arguments.
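A minimal sketch of such a map on made-up values. The map_options and killings_map names, the title, and the colour palette are my own choices, and the column names follow the hypothetical ones used for illustration rather than the original code.

```r
library(plotly)

# Illustrative summary table (made-up values)
killings_by_state <- data.frame(
  state = c("AK", "TX"),
  killings_per_million = c(6, 1),
  hover = c("Alaska<br>Killings per million: 6",
            "Texas<br>Killings per million: 1"),
  stringsAsFactors = FALSE
)

# Restrict the map to the US and use the Albers USA projection
map_options <- list(
  scope      = "usa",
  projection = list(type = "albers usa"),
  showlakes  = TRUE,
  lakecolor  = toRGB("white")
)

# locations holds the 2-letter state codes, z the value that drives the colour
killings_map <- plot_ly(
  killings_by_state,
  type         = "choropleth",
  locationmode = "USA-states",
  locations    = ~state,
  z            = ~killings_per_million,
  text         = ~hover,
  colors       = "Reds"
) %>%
  layout(title = "Police killings per million people", geo = map_options)

killings_map
```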
Hover over the map and zoom in or out to interact with it.
Conclusion
This blog post shows the power of R for quick exploratory analysis and demonstrates how static graphs can be brought to life with Plotly. Plotly is easy to pick up for users familiar with ggplot, as it uses a similar syntax. If you have suggestions or questions, please leave a comment in the comment section below. You can also email me at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.
The complete source code is available on GitHub. Feel free to check it out and suggest improvements. Thank you for reading!