Exploratory Analysis of the Washington Post's Police Shooting Dataset using R and Plotly
The Washington Post has been compiling a database of fatal shootings in the United States by police officers in the line of duty since January 1, 2015. The dataset contains information such as the date and state in which each killing occurred, and the age, gender, and race of the deceased.
In light of the recent police killings in the US, I thought it would be interesting to perform the following exploratory analyses using the dataset:
- The distribution of the deceased by age group
- The distribution of police killings per million people for each state, shown on a choropleth map
We will use R and Plotly for this purpose.
For this analysis, we will need the dplyr package for data manipulation, the magrittr package for pipe-like operations, and the plotly package for creating interactive graphs. Install these packages using R’s install.packages() function if you haven’t already, then load them using the library() function.
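A minimal sketch of this setup step might look like the following. The `install.packages()` call is commented out on the assumption that the packages are already installed.

```r
# Install once if needed (commented out to avoid reinstalling on every run).
# install.packages(c("dplyr", "magrittr", "plotly"))

# Attach the three packages used in this post.
library(dplyr)     # data manipulation
library(magrittr)  # pipe operators
library(plotly)    # interactive graphs
```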
The complete source code for this post is available on GitHub.
The first thing we need to do is read the data into R as a data frame from where it is hosted on GitHub. We then use the colnames() function to get the column names of the data frame.
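The hosted file's URL is not reproduced here, so the sketch below reads a few made-up rows from an inline string instead. The name `csv_text`, its rows, and its reduced set of columns are assumptions for illustration only; with the real file you would pass its URL to `read.csv()`.

```r
# Stand-in for the hosted CSV: made-up rows with only a handful of columns.
# With the real file, pass its URL to read.csv() instead of using `text =`.
csv_text <- "id,name,age,gender,state
1,Person A,34,M,CA
2,Person B,NA,F,TX
3,Person C,17,M,DC"

shooting_data <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# List the column names of the data frame.
colnames(shooting_data)
```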
We see that there are 14 variables. We only need the age and state variables for our analysis, so we use the select() function to create a new data frame with only those variables selected. Remember that you can always check a function’s documentation in R by typing ? followed by its name, as in ?select.
Let’s see what the new data frame looks like.
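As a rough sketch of the column selection, assuming a data frame named `shooting_data` with `age` and `state` columns (the rows below are made up):

```r
library(dplyr)

# Made-up rows standing in for the full dataset.
shooting_data <- data.frame(
  name  = c("Person A", "Person B", "Person C"),
  age   = c(34L, NA, 17L),
  state = c("CA", "TX", "DC"),
  stringsAsFactors = FALSE
)

# Keep only the two columns needed for the analysis.
shooting_data <- select(shooting_data, age, state)

# Inspect the first few rows.
head(shooting_data)
```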
The age column contains integer values and the state column contains the 2-letter abbreviations for the states in which the killings occurred.
Handling Missing Values
Before performing any analysis, it’s good practice to check the data for anomalies such as missing values. Missing values are coded as NA in most datasets, so we check for this in ours. Be familiar with how missing values are represented in your dataset and handle them accordingly.
The sapply() function applies the function supplied as its argument to each column of the data frame. Here, that function returns the number of cells in each column whose value is NA. To learn more about the apply family of functions in R, check this tutorial.
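One way to write that per-column count, shown on a small made-up data frame (the name `na_counts` is introduced here for illustration):

```r
# Made-up rows; two of the ages are missing.
shooting_data <- data.frame(
  age   = c(34L, NA, 17L, NA),
  state = c("CA", "TX", "DC", "NY"),
  stringsAsFactors = FALSE
)

# For each column, count the cells whose value is NA.
na_counts <- sapply(shooting_data, function(column) sum(is.na(column)))
na_counts
```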
We see that there are 37 missing values coded as NA in the age column, so we handle this by recoding them to 0. The reason for choosing 0 will become apparent in the following section.
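A minimal sketch of the recoding, using logical indexing on a made-up data frame:

```r
# Made-up rows; two of the ages are missing.
shooting_data <- data.frame(
  age   = c(34L, NA, 17L, NA),
  state = c("CA", "TX", "DC", "NY"),
  stringsAsFactors = FALSE
)

# Replace every missing age with 0.
shooting_data$age[is.na(shooting_data$age)] <- 0L

shooting_data$age
```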
How many people are killed by age group?
Since the ages are integer values, I thought it would be more useful to group them and then plot the distribution of killings across the age groups. We convert the integer values into categories (factors) using the cut() function. Running ?cut in R, we get the following documentation:
cut divides the range of x into intervals and codes the values in x according to which interval they fall. The leftmost interval corresponds to level one, the next leftmost to level two and so on.
Since we coded our missing age values as 0, we set the label for any value between -Inf and 1 to “Unknown”. This interval covers our missing values coded as 0. Ages between 1 and 18 are labelled “Under 18”, 18 to 30 “18-29”, 30 to 45 “30-44”, and 45 to 60 “45-59”, while ages between 60 and Inf are labelled “60 above”. As the labels suggest, the upper bound of each interval is excluded.
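The grouping described above could be sketched as follows. Passing `right = FALSE` is an assumption made here so that each interval includes its lower bound and excludes its upper bound, which is what the labels imply; the `ages` vector is made up.

```r
# Made-up ages, including 0 for a recoded missing value and values
# sitting exactly on the interval boundaries.
ages <- c(0, 5, 17, 18, 29, 30, 44, 45, 59, 60, 83)

# right = FALSE makes the intervals [lower, upper), so 18 falls in "18-29".
cat_age <- cut(
  ages,
  breaks = c(-Inf, 1, 18, 30, 45, 60, Inf),
  labels = c("Unknown", "Under 18", "18-29", "30-44", "45-59", "60 above"),
  right  = FALSE
)

# Count how many ages fall in each group.
table(cat_age)
```

With the default `right = TRUE`, an age of exactly 18 would land in “Under 18” instead, so the choice matters at the boundaries.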
In the following code snippet, we group the data frame by cat_age, the age group column we created above, and then summarise each group by counting the number of killings. We pass the summarised data into ggplot and display it as a bar chart.
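A sketch of that group-and-count step followed by a static bar chart, on made-up data. The names `killings_by_age`, `count`, and `age_plot` are introduced here for illustration.

```r
library(dplyr)
library(ggplot2)

# Made-up age groups standing in for the cat_age column.
shooting_data <- data.frame(
  cat_age = factor(
    c("18-29", "30-44", "18-29", "Unknown", "30-44", "18-29"),
    levels = c("Unknown", "18-29", "30-44")
  )
)

# Count the number of rows (killings) in each age group.
killings_by_age <- shooting_data %>%
  group_by(cat_age) %>%
  summarise(count = n())

# Draw the counts as a bar chart.
age_plot <- ggplot(killings_by_age, aes(x = cat_age, y = count)) +
  geom_bar(stat = "identity") +
  labs(x = "Age group", y = "Number of killings")
```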
Plotly was chosen for this analysis because it allows us to create beautiful, interactive visualisations, with APIs in Python, R and many other languages. The ggplotly() function accepts a ggplot object as its argument and turns it into an interactive graph. Check out some example plots created with Plotly’s R library.
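To illustrate the conversion, here is a small sketch that builds a static bar chart from made-up counts and wraps it with `ggplotly()`. The variable names are assumptions for illustration.

```r
library(ggplot2)
library(plotly)

# Made-up summary counts per age group.
killings_by_age <- data.frame(
  cat_age = c("Under 18", "18-29", "30-44"),
  count   = c(1, 3, 2)
)

# Build the static ggplot object first.
static_plot <- ggplot(killings_by_age, aes(x = cat_age, y = count)) +
  geom_bar(stat = "identity")

# Convert it into an interactive plotly object.
interactive_plot <- ggplotly(static_plot)

# Print `interactive_plot` in an interactive session to view it.
```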
The %>% operator in the snippet above might look strange to some R users. %>%, available in the magrittr package, allows us to pipe values into an expression or a function call. It improves code readability by removing the need to create a bunch of variables to hold the results of function calls. Check out magrittr’s vignette for a detailed explanation of how to use the package.
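As a small illustration of what the pipe does, the two expressions below compute the same value; the `ages` vector is made up.

```r
library(magrittr)

ages <- c(34, 17, 52, 29)

# Nested calls: read from the inside out.
nested <- round(mean(sqrt(ages)), 2)

# Piped calls: read from left to right. Each result becomes the
# first argument of the next function.
piped <- ages %>% sqrt() %>% mean() %>% round(2)

identical(nested, piped)
```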
The plot is interactive, so hover over it and click on it to explore.
What is the number of police killings per million people in each state?
To compare the number of police killings across US states in our dataset, we will compute the number of police killings per million people for each state and plot these values on a Plotly choropleth map.
For this analysis, we need the population estimates for each state and the full state names. These details are not in the Washington Post’s dataset, so we need to get them from other sources. We will use the state.abb (2-letter state abbreviations) and state.name (full state names) datasets available in R to get the full state names into our data frame. The state datasets contain information relating to the 50 states of the US, arranged in alphabetical order of the state names. So for an index i, the 2-letter abbreviation at state.abb[i] maps to the full state name at state.name[i]. One caveat of the state datasets is that they do not contain information about the District of Columbia.
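The index correspondence and the District of Columbia caveat can be checked directly in base R. The `state_lookup` data frame is introduced here only to show the two vectors side by side.

```r
# The same index gives an abbreviation and its full state name.
state.abb[5]   # "CA"
state.name[5]  # "California"

# Put the two built-in vectors side by side.
state_lookup <- data.frame(
  abbreviation = state.abb,
  name         = state.name,
  stringsAsFactors = FALSE
)
head(state_lookup, 3)

# The District of Columbia is not among the 50 entries.
"DC" %in% state.abb  # FALSE
```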
For the population estimates of each state, I created a state population data set from data I extracted from the United States Census Bureau website. I have cleaned the data set and it contains 2015 population estimates for each of the 50 US states and the District of Columbia. The data set is hosted here.
We start our analysis by adding the full state name to each row of our data frame, based on its 2-letter state abbreviation. We check for matches between the values of the shooting_data$state column and the state.abb vector. If there is a match, the match() function returns the index of the match in the state.abb vector; otherwise it returns NA. In this case, that only happens for the DC (District of Columbia) abbreviation.
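A sketch of the lookup on made-up rows follows. Filling in the name for DC by hand is an assumption about how the NA might be handled, and `idx` is a name introduced for illustration.

```r
# Made-up rows, including one from DC.
shooting_data <- data.frame(
  age   = c(34L, 0L, 17L),
  state = c("CA", "TX", "DC"),
  stringsAsFactors = FALSE
)

# Position of each abbreviation in state.abb; NA where there is no match.
idx <- match(shooting_data$state, state.abb)

# Use those positions to pull the full names; unmatched rows stay NA.
shooting_data$state_name <- state.name[idx]

# Assumed handling: fill in the one abbreviation that state.abb lacks.
shooting_data$state_name[shooting_data$state == "DC"] <- "District of Columbia"

shooting_data
```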
Let’s take a look at how the data frame looks now.
We see that the full state name has been added to the dataframe. To get the population estimates for each state, we read in the state population data set.
In the following code snippet, we group the data frame by state name and count the number of killings in each state. We use merge() to perform an inner join of our shooting_data data frame and the newly imported state_population_data data frame on their common column, state_name. We then use the mutate() function to add the computed police killings per million for each state to the data frame. The hover variable, also added to the data frame, is used by Plotly to display hover text when the user interacts with the choropleth map.
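The steps above can be sketched as follows. The population figures are made-up round numbers, not Census estimates, and the column names `population`, `killings`, and `killings_per_million` are assumptions for illustration. The sketch also groups by the abbreviation so that it stays available for the map.

```r
library(dplyr)

# Made-up rows: three killings in California, two in Texas.
shooting_data <- data.frame(
  state      = c("CA", "CA", "TX", "CA", "TX"),
  state_name = c("California", "California", "Texas", "California", "Texas"),
  stringsAsFactors = FALSE
)

# Made-up populations for illustration only (not real estimates).
state_population_data <- data.frame(
  state_name = c("California", "Texas"),
  population = c(30e6, 10e6),
  stringsAsFactors = FALSE
)

killings_by_state <- shooting_data %>%
  group_by(state, state_name) %>%
  summarise(killings = n(), .groups = "drop") %>%
  # Inner join on the shared state_name column.
  merge(state_population_data, by = "state_name") %>%
  mutate(
    killings_per_million = round(killings / population * 1e6, 2),
    hover = paste0(state_name, "<br>", killings_per_million,
                   " killings per million")
  )

killings_by_state
```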
The next step is to use the plot_ly() function to create the choropleth map. We pass in the killings_by_state data frame we created above as an argument, along with other configuration arguments.
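A minimal sketch of such a call is shown below on a made-up two-row data frame. The colour scale, title, and projection are illustrative choices, not necessarily the ones used for the original map.

```r
library(plotly)

# Made-up summary rows standing in for killings_by_state.
killings_by_state <- data.frame(
  state = c("CA", "TX"),
  killings_per_million = c(0.1, 0.2),
  hover = c("California<br>0.1 killings per million",
            "Texas<br>0.2 killings per million"),
  stringsAsFactors = FALSE
)

choropleth_map <- plot_ly(
  killings_by_state,
  type         = "choropleth",
  locations    = ~state,               # 2-letter state abbreviations
  locationmode = "USA-states",         # interpret locations as US states
  z            = ~killings_per_million,
  text         = ~hover,               # shown when hovering over a state
  colorscale   = "Reds"
) %>%
  layout(
    title = "Police killings per million people by state",
    geo   = list(scope = "usa", projection = list(type = "albers usa"))
  )

# Print `choropleth_map` in an interactive session to view it.
```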
Hover over the map and zoom in and out to interact with it.
This blog post shows the power of R for quick exploratory analysis and demonstrates how static graphs can be brought to life with Plotly. Plotly is easy to use for users familiar with ggplot as it uses a similar syntax. If you have suggestions or questions, please leave a comment in the comment section below. You can also email me at hello [at] allenkunle [dot] me or tweet at me @allenakinkunle, and I will reply as soon as I can.
The complete source code is available on GitHub. Feel free to check it out and suggest improvements. Thank you for reading!