Motivation


After reading one of my regular blogs, Ben Wellington’s IQuantNY post, I was intrigued to find that there is a spike in the number of robberies that occur between 3 and 4PM. Does this correlate to the time the schools in NYC usually end, if so, we expect the spike in the robberies to occur on weekdays, and not weekends. If we refer to the blog post, it does indeed.

I decided to explore this for myself, though I wanted to see whether the robberies occured near schools. My hypothesis is that robberies which occur after school (between 3 and 4PM), will happen close to schools. I gather the NYC felony data from the NYC Open data website here, and the NYC school data here.

If there are a large number of robberies occuring close to schools then this add weight to our hypothesis. I will go about trying to confirm my hypothesis by grouping the robberies using k-means clustering and seeing if many of the clusters overlap, or are close to schools. I make the assumption that school children will "diffuse" evenly from their school in all and equal directions, such that their mean location will converge at the school.

I will perform the analysis in R, opposed to python since the mapping function is quick and easy in R, using the ggmaps library. In the end, I will transfer my dataset to carto as the maps produced are pretty, and the tooltips function are interactive, intuitive and useful for displaying information to users.

Following, I will compare the distances from robberies to the closest schools with a uniform random distribution of locations across New York City, if there is any statistical significance between the two, we can say that the location that robberies occur is dependent on their distance to the closest school.

The R source code can be found on my github here, and a link to the carto profile is here.


Exploratory Data Analysis in R


First I load in the usual suspects:

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggmap)
library(caret)
library(caret)
library(sp)
library(raster)
library(rgdal)
library(rgeos)
library(spatstat)

Following, I read in the data.

felony.data <- read_csv('data/NYPD_7_Major_Felony_Incidents.csv')
str(felony.data)

NYC Felony Data Structure

See here ▼

Classes 'tbl_df', 'tbl' and 'data.frame':   1123465 obs. of  20 variables:
 $ OBJECTID              : int  1 2 3 4 5 6 7 8 9 10 ...
 $ Identifier            : chr  "f070032d" "c6245d4d" "716dbc6f" "638cd7b7" ...
 $ Occurrence Date       : chr  "09/06/1940 07:30:00 PM" "12/14/1968 12:20:00 AM" "10/30/1970 03:30:00 PM" "07/18/1972 11:00:00 PM" ...
 $ Day of Week           : chr  "Friday" "Saturday" "Friday" "Tuesday" ...
 $ Occurrence Month      : chr  "Sep" "Dec" "Oct" "Jul" ...
 $ Occurrence Day        : int  6 14 30 18 21 1 13 2 8 26 ...
 $ Occurrence Year       : int  1940 1968 1970 1972 1987 1990 1990 1992 1994 1994 ...
 $ Occurrence Hour       : int  19 0 15 23 0 9 0 16 18 0 ...
 $ CompStat Month        : int  9 12 10 7 5 9 6 3 7 6 ...
 $ CompStat Day          : int  7 14 31 19 28 17 7 27 31 4 ...
 $ CompStat Year         : int  2010 2008 2008 2012 2009 2014 2007 2012 2008 2008 ...
 $ Offense               : chr  "BURGLARY" "GRAND LARCENY" "BURGLARY" "GRAND LARCENY OF MOTOR VEHICLE" ...
 $ Offense Classification: chr  "FELONY" "FELONY" "FELONY" "FELONY" ...
 $ Sector                : chr  "D" "G" "H" "F" ...
 $ Precinct              : int  66 28 84 73 75 105 73 101 103 17 ...
 $ Borough               : chr  "BROOKLYN" "MANHATTAN" "BROOKLYN" "BROOKLYN" ...
 $ Jurisdiction          : chr  "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" ...
 $ XCoordinate           : int  987478 996470 986508 1005876 1017958 1058407 1010272 1053678 1041749 992029 ...
 $ YCoordinate           : int  166141 232106 190249 182440 182266 204788 183760 159044 196938 213332 ...
 $ Location 1            : chr  "(40.6227027620001, -73.9883732929999)" "(40.8037530600001, -73.955861904)" "(40.688874254, -73.9918594329999)" "(40.6674141890001, -73.9220463899999)" ...

To tidy the I use dplyr. As a sanity check I will try to approximately recreate the plot showing the spike in robberies in at 3PM for weekdays. This will make sure I am looking at the correct set of data before moving on. The original plot can be found on the blog post here.

tidy.fdata <- felony.data %>% filter(Offense == 'ROBBERY') %>%
  select(Hour = `Occurrence Hour`,Day.Week = `Day of Week`) %>%
  filter(!(Day.Week %in% c('Saturday', 'Sunday'))) %>%
  group_by(Hour) %>%
  summarise(Hourly.Crime.Rate= n())

Now that I have aggregated and summarised the data frame using dplyr I can recreate the plot from IQuantNY simply using ggplot.

# Recreate plot from IQuantNY (roughly)
ggplot(data = tidy.fdata) + geom_line(aes(x = Hour, y = Hourly.Crime.Rate)
                                      , color= 'darkgreen')

We can see the the plot is recreated, with the characteristic spike at 3PM (15:00hr). Now, to analyse the data at 3PM, we can filter out that subset of the data.

new.fdata <- felony.data %>% filter(Offense == 'ROBBERY') %>%
  select(Hour = `Occurrence Hour`,Day.Week = `Day of Week`, `Location 1`) %>%
  filter(!(Day.Week %in% c('Saturday', 'Sunday')), Hour =='15')


Now we have the felony data, we want to get the school location data, so we load it into the workspace:

schooldf <- read_csv('data/Location_Information_Report.csv')
str(schooldf)


NYC School Data Structure

See here ▼

Classes 'tbl_df', 'tbl' and 'data.frame':   1641 obs. of  45 variables:
 $ School Year                            : chr  "2015-16" "2015-16" "2015-16" "2015-16" ...
 $ ATS System Code                        : chr  "01M015" "01M019" "01M020" "01M034" ...
 $ Location Code                          : chr  "M015" "M019" "M020" "M034" ...
 $ Location Name                          : chr  "P.S. 015 Roberto Clemente" "P.S. 019 Asher Levy" "P.S. 020 Anna Silver" "P.S. 034 Franklin D. Roosevelt" ...
 $ BEDS Number                            : num  3.1e+11 3.1e+11 3.1e+11 3.1e+11 3.1e+11 ...
 $ Managed By Name                        : chr  "DOE" "DOE" "DOE" "DOE" ...
 $ Location Type Description              : chr  "General Academic" "General Academic" "General Academic" "General Academic" ...
 $ Location Category Description          : chr  "Elementary" "Elementary" "Elementary" "K-8" ...
 $ Grades                                 : chr  "PK,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,06,07,08,SE" ...
 $ Grades Final                           : chr  "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05,06,07,08" ...
 $ Open Date                              : chr  "01/01/1644" "01/01/20637" "01/01/23193" "01/01/20271" ...
 $ Status Description                     : chr  "Open" "Open" "Open" "Open" ...
 $ Building Code                          : chr  "M015" "M019" "M020" "M034" ...
 $ Primary Address                        : chr  "333 EAST 4 STREET" "185 1 AVENUE" "166 ESSEX STREET" "730 EAST 12 STREET" ...
 $ City                                   : chr  "MANHATTAN" "MANHATTAN" "MANHATTAN" "MANHATTAN" ...
 $ State Code                             : chr  "NY" "NY" "NY" "NY" ...
 $ Zip                                    : int  10009 10003 10002 10009 10009 10009 10002 10002 10002 10002 ...
 $ Principal Name                         : chr  "IRENE SANCHEZ" "JACQUELINE FLANAGAN" "Carmen Colon" "Rosemarie Gonzalez" ...
 $ Principal Title                        : chr  "PRINCIPAL" "PRINCIPAL" "PRINCIPAL" "PRINCIPAL" ...
 $ Principal Email                        : chr  "ISanchez11@schools.nyc.gov" "JFlanagan@schools.nyc.gov" "CColon5@schools.nyc.gov" "RGonzal52@schools.nyc.gov" ...
 $ Principal Phone Number                 : chr  "212-228-8730" "212-533-5340" "212-254-9577" "212-228-4433" ...
 $ Fax Number                             : chr  "212-477-0931" "212-673-1477" "212-254-3526" "212-353-1973" ...
 $ Geographical District Code             : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Administrative District Code           : int  1 1 1 1 1 1 1 1 1 1 ...
 $ Administrative District Location Code  : chr  "M801" "M801" "M801" "M801" ...
 $ Administrative District Name           : chr  "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" ...
 $ Superintendent                         : chr  "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" ...
 $ Superintendent Location Code           : chr  "M801" "M801" "M801" "M801" ...
 $ Superintendent Email                   : chr  "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" ...
 $ Community School Sup Name              : chr  "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" ...
 $ Community School Sup Email             : chr  "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" ...
 $ BFSC Location Code                     : chr  "MFSC" "MFSC" "MFSC" "MFSC" ...
 $ BFSC Director Name                     : chr  "CHU, YUET" "CHU, YUET" "CHU, YUET" "CHU, YUET" ...
 $ BFSC Director Title                    : chr  "BFSC Director" "BFSC Director" "BFSC Director" "BFSC Director" ...
 $ BFSC Director Phone                    : chr  "646-470-0721" "646-470-0721" "646-470-0721" "646-470-0721" ...
 $ BFSC Director Email                    : chr  "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" ...
 $ HighSchool Network Location Code       : chr  NA NA NA NA ...
 $ HighSchool Network Name                : chr  NA NA NA NA ...
 $ HighSchool Network Superintendent      : chr  NA NA NA NA ...
 $ HighSchool Network Superintendent Email: chr  NA NA NA NA ...
 $ PROSE SCHOOL                           : chr  "N" "N" "N" "N" ...
 $ X Coordinate                           : int  990141 988547 988044 991163 988071 989351 989817 988726 988726 988749 ...
 $ Y Coordinate                           : int  202349 205239 202068 203782 203210 202733 199992 199497 199497 201282 ...
 $ Latitude                               : num  40.7 40.7 40.7 40.7 40.7 ...
 $ Longitude                              : num  -74 -74 -74 -74 -74 ...


From looking at the structure of the school data above we can see that there is a lot of extraneous variables that can be disgarded, also maybe not every school is worth looking at, since I do not expect pre-kindegarten children to either be robbing people or being robbed themselves. We can look at the different types of schoolin the dataset:

unique(schooldf$`Location Category Description`)
[1] "Elementary"                      "K-8"
[3] "High school"                     "Secondary School"
[5] "Junior High-Intermediate-Middle" "K-12 all grades"
[7] "Early Childhood"                 "District Pre-K Center"          


We will just use those that schools that contain the students above 8th grade.

rel.schools <- c('High School', 'Secondary School', 'K-12 all grades')

tidy.schooldf <- schooldf %>%
  select(`Location Name`, `Location Category Description`, Latitude, Longitude) %>%
  filter(`Location Category Description` %in% rel.schools)

dim(tidy.schooldf)
[1] 119   4


There are 119 total schools in NYC that contain students above 8th grade.

Following the longitude and latitude coordinates from the felony data need to be obtained. Since they have the form:

new.fdata$`Location 1`[1]
[1] "(40.61384715, -73.981437773)"

I use regular expressions to obtain the exact coordinates.

new.fdata2 <- new.fdata %>%
  separate(`Location 1`, c('Lat_tmp', 'Long_tmp'),sep = ',') %>%
  mutate(Lat = as.numeric(gsub("\\(|\\)","", Lat_tmp))) %>%
  mutate(Long = as.numeric(gsub("\\(|\\)","", Long_tmp)))


It may not be helpful to plot all the locations of the robberies, so I cluster the longitude and latitude coordinates - choosing 150 clusters. I want to choose a cluster number that is more than the total number of school, but not so much that we don't see any clustering.

set.seed(666)
latlong_cluster <- kmeans(cbind(new.fdata2$Lat, new.fdata2$Long), centers  = 150)


Next, I bind the new latitude, longitude coordinates, group by the cluster number and summarise by counting the number of instances to get total number of robberies in that cluster of longitude and latitude coordinates.

new.fdata3 <- cbind(new.fdata2, Cluster = latlong_cluster$cluster,
                    Cluster.Lat = latlong_cluster$centers[latlong_cluster$cluster,1],
                    Cluster.Long = latlong_cluster$centers[latlong_cluster$cluster,2])

new.fdata4 <- new.fdata3 %>%
  group_by(Cluster.Lat, Cluster.Long, Cluster) %>%
  summarise(Total.Counts= n())


Next I can plot the map in R.

# = Plot on Map
map <- get_map(location = 'New York City', zoom= 10,
               source = 'google', color = 'color')
map <- ggmap(map) + geom_point(data = new.fdata4, aes(x=Cluster.Long, y=Cluster.Lat,
                                                      color = new.fdata4$Total.Counts),
                               size = new.fdata4$Total.Counts/100) +
  scale_color_gradient(low = 'red', high = 'yellow') +
  geom_jitter(data = tidy.schooldf, aes(x= Longitude, y= Latitude), shape = 4) +
  labs(title = 'New York City', x = 'Longitude', y = 'Latitude')
map


The data can also be exported to carto.com since it looks so good, and the result is shown below:



We can see that there are many large clusters where many robberies occur which are between schools. For example, the cluster of 162 robberies in east Williamsburg, between the "Young Women's Leadership School", "Lyons Community School", and "Juan Morel Campos Secondary School". Also in Mott Haven, in the Bronx, there is a cluster of 78 robberies by P.S. 168, as well as a cluster of 137 robberies by P.S. K396, and "Teachers Preporatory High School" in Brownsville, Brooklyn.

There seems to be more correlation between the location of the schools and robberies in the Bronx. This may be however due to a larger concentration of schools in a smaller are compared to other boroughs to be conclusive.


Geo-Spatial Distribution of the Data


We can examine this concept further by looking at the distribution of the distance bewteen the unclustered robberies and the closest school. To determine whether robberies do occur close to schools we can compare to a uniform random distribution of points in New York City.

To do this I use a shapefile, which can be found here. This gives me the boundary of New York City, so I'm not choosing random points that lie in the east river. I load in the shapefile and access the boundary using the following R code:

# load shapefile and read in
shp <-'../Data-Incubator-Challenge/Vision-Zero/nybb_16b/nybb.shp'
ny <- readOGR(shp, ogrListLayers(shp)[1], stringsAsFactors=FALSE)
OGR data source with driver: ESRI Shapefile
Source: "../Data-Incubator-Challenge/Vision-Zero/nybb_16b/nybb.shp", layer: "nybb"
with 5 features
It has 4 fields
map <- spTransform(ny, CRS("+proj=longlat +datum=WGS84"))


Sanity check on 400 random points.

num.Rands_test = 400 # number of random numbers
# Outer longitude, latitude limits of NYC
long.Min_test = -74.2; long.Max_test = -73.65
lat.Min_test = 40.4; lat.Max_test = 40.95
# generate random numbers
random.Longs_test = runif(num.Rands_test, long.Min_test, long.Max_test)
random.Lats_test = runif(num.Rands_test, lat.Min_test, lat.Max_test)

# convert to dataframe
dat_test <- data.frame(Longitude = random.Longs_test,
                  Latitude  = random.Lats_test)
LongLats_test <- dat_test
# convert to coordinates
coordinates(dat_test) <- ~ Longitude + Latitude
# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
proj4string(dat_test) <- proj4string(map)

# select points only over manhattan boundry
rand.points_test <- cbind(over(dat_test, map), LongLats_test) %>%
  filter(!is.na(BoroName)) %>%
  dplyr::select(Latitude, Longitude)

nycmap <- get_googlemap('maspeth NYC', zoom = 11)
ggmap(nycmap) + geom_point(data = rand.points_test, aes(x = Longitude, y= Latitude))

All the points do lie in New York City, and we can continue making our random distribution, using more points.

num.Rands = 10000 # number of random numbers
# Outer longitude, latitude limits of NYC
long.Min = -74.2; long.Max = -73.65
lat.Min = 40.4; lat.Max = 40.95
# generate random numbers
random.Longs = runif(num.Rands, long.Min, long.Max)
random.Lats = runif(num.Rands, lat.Min, lat.Max)

# convert to dataframe
dat <- data.frame(Longitude = random.Longs,
                  Latitude  = random.Lats)
LongLats <- dat
# convert to coordinates
coordinates(dat) <- ~ Longitude + Latitude
# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
proj4string(dat) <- proj4string(map)

# select points only over manhattan boundry
rand.points <- cbind(over(dat, map), LongLats) %>%
  filter(BoroName == 'Manhattan') %>%
  dplyr::select(Latitude, Longitude)
# Use approx from http://www.movable-type.co.uk/scripts/latlong.html
# average kms in a lat or long degree
ave.km.per.Lat.NY <- 111.2
ave.km.per.Long.NY <- 84.24

tidy.schooldf<- filter(tidy.schooldf, !is.na(Latitude) & !is.na(Longitude))
# Convert data frames to lists for calculating geographical distances
felony_df_lst <- structure(list(lon = new.fdata2$Long*ave.km.per.Long.NY,
                                 lat = new.fdata2$Lat*ave.km.per.Lat.NY),
                            .Names = c("lon", "lat"),
                            row.names = c(NA, as.integer(nrow(new.fdata2))),
                            class = "data.frame")

school_df_lst <- structure(list(name = tidy.schooldf$`Location Name`,
                lon = tidy.schooldf$Longitude*ave.km.per.Long.NY,
                                lat = tidy.schooldf$Latitude*ave.km.per.Lat.NY),
                           .Names = c("lon", "lat"),
                           row.names = c(NA, as.integer(nrow(tidy.schooldf))),
                           class = "data.frame")

rand.points_lst <- structure(list(lon = rand.points$Longitude*ave.km.per.Long.NY,
                                  lat = rand.points$Latitude*ave.km.per.Lat.NY),
                             .Names = c("lon", "lat"), row.names = c(NA, nrow(rand.points)),
                             class = "data.frame")

felony_df_sp <- SpatialPoints(felony_df_lst)
school_df_sp <- SpatialPoints(school_df_lst[,2:3])
rand.points_sp <- SpatialPoints(rand.points_lst)

# Check distance to closest school
rand.points_lst$nearest.school <- apply(gDistance(rand.points_sp,
                                                        school_df_sp, byid=TRUE), 2, min)
felony_df_lst$nearest.school <- apply(gDistance(felony_df_sp, school_df_sp, byid=TRUE), 2, min)

felony_df_lst$which.school <- school_df_lst[apply(gDistance(felony_df_sp, school_df_sp, byid=TRUE), 2, which.min),1]

top.schools.robbery <- felony_df_lst %>%
  dplyr::group_by(which.school) %>%
  dplyr::summarise(robbery.count = n()) %>%
  dplyr::arrange(desc(robbery.count))

# Multiply by 1000 to convert kilometers to meters
rand_min_Dists_df <- data.frame(dists = rand.points_lst$nearest.school*1000)
felony_min_Dists_df <- data.frame(dists = felony_df_lst$nearest.school*1000)

# Get mean of log distance for plotting
mean.Randpts2school <- mean(log(rand_min_Dists_df$dists))
mean.felony2school <- mean(log(felony_min_Dists_df$dists))

# Plot histogram
ggplot() +
  geom_histogram(data = log(rand_min_Dists_df),
                 aes(x = dists, y=..count../sum(..count..), fill= 'green', color = 'green'),
                 fill = "green", alpha = 0.2, bins = 70) +
  geom_histogram(data = log(felony_min_Dists_df),
                 aes(x = dists, y=..count../sum(..count..), fill= 'red', color = 'red'),
                 fill = "red",  alpha = 0.2, bins = 70) +
  geom_vline(xintercept = (mean.Randpts2school), color = 'green') +
  geom_vline(xintercept = (mean.felony2school), color = 'red') +
  labs(list(Title = 'Log Distance to Closest Tourist Attraction'
                                , y = 'Normalized Count', x = 'Log distance(m)')) +
  # ylim(c(0, 0.042)) +
  scale_colour_manual(name="group", values=c("red" = "red", "green"="green"),
                      labels=c("green"="Random Uniform Distribution", "red"="Felony Distribution")) +
  scale_fill_manual(name="group", values=c("red" = "red", "green"="green"),
                    labels=c("green"="Random Uniform Distribution", "red"="Felony Distribution"))


Histogram of distance to schools


We can see that the distribution of robberies is significantly shifted from the random uniform points distribution. We find from the data that the mean distance from any point in New York City to the closest school is 1905.9m, and the mean distance from a robbery to the closest school is 1185.3m, a reduction of 37.8%. Moreover, we can perofrm a t-test on the two distributions to determine whether there is any statistical significance between the two.

t.test(rand_min_Dists_df, felony_min_Dists_df)
    Welch Two Sample t-test

data:  rand_min_Dists_df and felony_min_Dists_df
t = -14.279, df = 13625, p-value < 2.2e-16
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -389.1212 -295.1829
sample estimates:
mean of x mean of y
 843.1478 1185.2999 


The p-value between the two distributions is less than 0.05, so the two distributions are statistically significant, we can reject the null hypothesis that robberies do not occur close to schools.


Robberies Occur Around Which Schools


Taking the top 10 results from the data frame top.schools.robbery above.


School Name Number of robberies
Young Women's Leadership School, Queens 378
Metropolitan Expeditionary Learning School 339
East New York Family Academy 317
Academy for College Preparation and Career Exploration: A College Board School 290
Academy for Scholarship and Entrepreneurship: A College Board School 257
Teachers Preparatory High School 252
P.S. X012 Lewis and Clark School 239
East-West School of International Studies 220
Frederick Douglass Academy IV Secondary School 217
Eagle Academy for Young Men III 214


Interestingly, the names of these schools do not fit the typical archetype of schools that would have many robberies around them. One might expect public schools to have many robberies around them, yet here the data only shows one in the top ten. This may be because there are other schools close by, but not strictly he closest to the robberies.

Conclusion


To conclude, this analysis has provided convincing results that robberies that occur between 3PM and 4PM happen close to schools. However, we are unable to determine from the data whether school children are the robbery victims or suspects, or both. This analysis could be used to determine the effectiveness of after-school programs. Since a spatial analysis as well as a temporal analysis is provided this could identify which schools would be a good potential for such after-school programs to keep students safe or out of trouble.