## Motivation

After reading one of my regular blogs, Ben Wellingtonâ€™s IQuantNY post, I was intrigued to find that there is a spike in the number of robberies that occur between 3 and 4PM. Does this correlate to the time the schools in NYC usually end, if so, we expect the spike in the robberies to occur on weekdays, and not weekends. If we refer to the blog post, it does indeed.

I decided to explore this for myself, though I wanted to see whether the robberies occured near schools. My hypothesis is that robberies which occur after school (between 3 and 4PM), will happen close to schools. I gather the NYC felony data from the NYC Open data website here, and the NYC school data here.

If there are a large number of robberies occuring close to schools then this add weight to our hypothesis. I will go about trying to confirm my hypothesis by grouping the robberies using k-means clustering and seeing if many of the clusters overlap, or are close to schools. I make the assumption that school children will "diffuse" evenly from their school in all and equal directions, such that their mean location will converge at the school.

I will perform the analysis in R, opposed to python since the mapping function is quick and easy in R, using the ggmaps library. In the end, I will transfer my dataset to carto as the maps produced are pretty, and the tooltips function are interactive, intuitive and useful for displaying information to users.

Following, I will compare the distances from robberies to the closest schools with a uniform random distribution of locations across New York City, if there is any statistical significance between the two, we can say that the location that robberies occur is dependent on their distance to the closest school.

The R source code can be found on my github here, and a link to the carto profile is here.

## Exploratory Data Analysis in R

First I load in the usual suspects:

library(readr)
library(dplyr)
library(tidyr)
library(ggplot2)
library(ggmap)
library(caret)
library(caret)
library(sp)
library(raster)
library(rgdal)
library(rgeos)
library(spatstat)

Following, I read in the data.

felony.data <- read_csv('data/NYPD_7_Major_Felony_Incidents.csv')
str(felony.data)

### NYC Felony Data Structure

#### See here ▼

Classes 'tbl_df', 'tbl' and 'data.frame':   1123465 obs. of  20 variables:
$OBJECTID : int 1 2 3 4 5 6 7 8 9 10 ...$ Identifier            : chr  "f070032d" "c6245d4d" "716dbc6f" "638cd7b7" ...
$Occurrence Date : chr "09/06/1940 07:30:00 PM" "12/14/1968 12:20:00 AM" "10/30/1970 03:30:00 PM" "07/18/1972 11:00:00 PM" ...$ Day of Week           : chr  "Friday" "Saturday" "Friday" "Tuesday" ...
$Occurrence Month : chr "Sep" "Dec" "Oct" "Jul" ...$ Occurrence Day        : int  6 14 30 18 21 1 13 2 8 26 ...
$Occurrence Year : int 1940 1968 1970 1972 1987 1990 1990 1992 1994 1994 ...$ Occurrence Hour       : int  19 0 15 23 0 9 0 16 18 0 ...
$CompStat Month : int 9 12 10 7 5 9 6 3 7 6 ...$ CompStat Day          : int  7 14 31 19 28 17 7 27 31 4 ...
$CompStat Year : int 2010 2008 2008 2012 2009 2014 2007 2012 2008 2008 ...$ Offense               : chr  "BURGLARY" "GRAND LARCENY" "BURGLARY" "GRAND LARCENY OF MOTOR VEHICLE" ...
$Offense Classification: chr "FELONY" "FELONY" "FELONY" "FELONY" ...$ Sector                : chr  "D" "G" "H" "F" ...
$Precinct : int 66 28 84 73 75 105 73 101 103 17 ...$ Borough               : chr  "BROOKLYN" "MANHATTAN" "BROOKLYN" "BROOKLYN" ...
$Jurisdiction : chr "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" "N.Y. POLICE DEPT" ...$ XCoordinate           : int  987478 996470 986508 1005876 1017958 1058407 1010272 1053678 1041749 992029 ...
$YCoordinate : int 166141 232106 190249 182440 182266 204788 183760 159044 196938 213332 ...$ Location 1            : chr  "(40.6227027620001, -73.9883732929999)" "(40.8037530600001, -73.955861904)" "(40.688874254, -73.9918594329999)" "(40.6674141890001, -73.9220463899999)" ...

To tidy the I use dplyr. As a sanity check I will try to approximately recreate the plot showing the spike in robberies in at 3PM for weekdays. This will make sure I am looking at the correct set of data before moving on. The original plot can be found on the blog post here.

tidy.fdata <- felony.data %>% filter(Offense == 'ROBBERY') %>%
select(Hour = Occurrence Hour,Day.Week = Day of Week) %>%
filter(!(Day.Week %in% c('Saturday', 'Sunday'))) %>%
group_by(Hour) %>%
summarise(Hourly.Crime.Rate= n())

Now that I have aggregated and summarised the data frame using dplyr I can recreate the plot from IQuantNY simply using ggplot.

# Recreate plot from IQuantNY (roughly)
ggplot(data = tidy.fdata) + geom_line(aes(x = Hour, y = Hourly.Crime.Rate)
, color= 'darkgreen')

We can see the the plot is recreated, with the characteristic spike at 3PM (15:00hr). Now, to analyse the data at 3PM, we can filter out that subset of the data.

new.fdata <- felony.data %>% filter(Offense == 'ROBBERY') %>%
select(Hour = Occurrence Hour,Day.Week = Day of Week, Location 1) %>%
filter(!(Day.Week %in% c('Saturday', 'Sunday')), Hour =='15')

Now we have the felony data, we want to get the school location data, so we load it into the workspace:

schooldf <- read_csv('data/Location_Information_Report.csv')
str(schooldf)

### NYC School Data Structure

#### See here ▼

Classes 'tbl_df', 'tbl' and 'data.frame':   1641 obs. of  45 variables:
$School Year : chr "2015-16" "2015-16" "2015-16" "2015-16" ...$ ATS System Code                        : chr  "01M015" "01M019" "01M020" "01M034" ...
$Location Code : chr "M015" "M019" "M020" "M034" ...$ Location Name                          : chr  "P.S. 015 Roberto Clemente" "P.S. 019 Asher Levy" "P.S. 020 Anna Silver" "P.S. 034 Franklin D. Roosevelt" ...
$BEDS Number : num 3.1e+11 3.1e+11 3.1e+11 3.1e+11 3.1e+11 ...$ Managed By Name                        : chr  "DOE" "DOE" "DOE" "DOE" ...
$Location Type Description : chr "General Academic" "General Academic" "General Academic" "General Academic" ...$ Location Category Description          : chr  "Elementary" "Elementary" "Elementary" "K-8" ...
$Grades : chr "PK,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,SE" "PK,0K,01,02,03,04,05,06,07,08,SE" ...$ Grades Final                           : chr  "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05" "PK,0K,01,02,03,04,05,06,07,08" ...
$Open Date : chr "01/01/1644" "01/01/20637" "01/01/23193" "01/01/20271" ...$ Status Description                     : chr  "Open" "Open" "Open" "Open" ...
$Building Code : chr "M015" "M019" "M020" "M034" ...$ Primary Address                        : chr  "333 EAST 4 STREET" "185 1 AVENUE" "166 ESSEX STREET" "730 EAST 12 STREET" ...
$City : chr "MANHATTAN" "MANHATTAN" "MANHATTAN" "MANHATTAN" ...$ State Code                             : chr  "NY" "NY" "NY" "NY" ...
$Zip : int 10009 10003 10002 10009 10009 10009 10002 10002 10002 10002 ...$ Principal Name                         : chr  "IRENE SANCHEZ" "JACQUELINE FLANAGAN" "Carmen Colon" "Rosemarie Gonzalez" ...
$Principal Title : chr "PRINCIPAL" "PRINCIPAL" "PRINCIPAL" "PRINCIPAL" ...$ Principal Email                        : chr  "ISanchez11@schools.nyc.gov" "JFlanagan@schools.nyc.gov" "CColon5@schools.nyc.gov" "RGonzal52@schools.nyc.gov" ...
$Principal Phone Number : chr "212-228-8730" "212-533-5340" "212-254-9577" "212-228-4433" ...$ Fax Number                             : chr  "212-477-0931" "212-673-1477" "212-254-3526" "212-353-1973" ...
$Geographical District Code : int 1 1 1 1 1 1 1 1 1 1 ...$ Administrative District Code           : int  1 1 1 1 1 1 1 1 1 1 ...
$Administrative District Location Code : chr "M801" "M801" "M801" "M801" ...$ Administrative District Name           : chr  "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" "COMMUNITY SCHOOL DISTRICT 01" ...
$Superintendent : chr "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" ...$ Superintendent Location Code           : chr  "M801" "M801" "M801" "M801" ...
$Superintendent Email : chr "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" ...$ Community School Sup Name              : chr  "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" "PHILLIPS, DANIELLA" ...
$Community School Sup Email : chr "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" "DPhilli@schools.nyc.gov" ...$ BFSC Location Code                     : chr  "MFSC" "MFSC" "MFSC" "MFSC" ...
$BFSC Director Name : chr "CHU, YUET" "CHU, YUET" "CHU, YUET" "CHU, YUET" ...$ BFSC Director Title                    : chr  "BFSC Director" "BFSC Director" "BFSC Director" "BFSC Director" ...
$BFSC Director Phone : chr "646-470-0721" "646-470-0721" "646-470-0721" "646-470-0721" ...$ BFSC Director Email                    : chr  "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" "YChu@schools.nyc.gov" ...
$HighSchool Network Location Code : chr NA NA NA NA ...$ HighSchool Network Name                : chr  NA NA NA NA ...
$HighSchool Network Superintendent : chr NA NA NA NA ...$ HighSchool Network Superintendent Email: chr  NA NA NA NA ...
$PROSE SCHOOL : chr "N" "N" "N" "N" ...$ X Coordinate                           : int  990141 988547 988044 991163 988071 989351 989817 988726 988726 988749 ...
$Y Coordinate : int 202349 205239 202068 203782 203210 202733 199992 199497 199497 201282 ...$ Latitude                               : num  40.7 40.7 40.7 40.7 40.7 ...
$Longitude : num -74 -74 -74 -74 -74 ... From looking at the structure of the school data above we can see that there is a lot of extraneous variables that can be disgarded, also maybe not every school is worth looking at, since I do not expect pre-kindegarten children to either be robbing people or being robbed themselves. We can look at the different types of schoolin the dataset: unique(schooldf$Location Category Description)
[1] "Elementary"                      "K-8"
[3] "High school"                     "Secondary School"
[5] "Junior High-Intermediate-Middle" "K-12 all grades"
[7] "Early Childhood"                 "District Pre-K Center"          

We will just use those that schools that contain the students above 8th grade.

rel.schools <- c('High School', 'Secondary School', 'K-12 all grades')

tidy.schooldf <- schooldf %>%
select(Location Name, Location Category Description, Latitude, Longitude) %>%
filter(Location Category Description %in% rel.schools)

dim(tidy.schooldf)
[1] 119   4

There are 119 total schools in NYC that contain students above 8th grade.

Following the longitude and latitude coordinates from the felony data need to be obtained. Since they have the form:

new.fdata$Location 1[1]  [1] "(40.61384715, -73.981437773)" I use regular expressions to obtain the exact coordinates. new.fdata2 <- new.fdata %>% separate(Location 1, c('Lat_tmp', 'Long_tmp'),sep = ',') %>% mutate(Lat = as.numeric(gsub("\$$|\$$","", Lat_tmp))) %>% mutate(Long = as.numeric(gsub("\$$|\$$","", Long_tmp))) It may not be helpful to plot all the locations of the robberies, so I cluster the longitude and latitude coordinates - choosing 150 clusters. I want to choose a cluster number that is more than the total number of school, but not so much that we don't see any clustering. set.seed(666) latlong_cluster <- kmeans(cbind(new.fdata2$Lat, new.fdata2$Long), centers = 150) Next, I bind the new latitude, longitude coordinates, group by the cluster number and summarise by counting the number of instances to get total number of robberies in that cluster of longitude and latitude coordinates. new.fdata3 <- cbind(new.fdata2, Cluster = latlong_cluster$cluster,
Cluster.Lat = latlong_cluster$centers[latlong_cluster$cluster,1],
Cluster.Long = latlong_cluster$centers[latlong_cluster$cluster,2])

new.fdata4 <- new.fdata3 %>%
group_by(Cluster.Lat, Cluster.Long, Cluster) %>%
summarise(Total.Counts= n())

Next I can plot the map in R.

# = Plot on Map
map <- get_map(location = 'New York City', zoom= 10,
source = 'google', color = 'color')
map <- ggmap(map) + geom_point(data = new.fdata4, aes(x=Cluster.Long, y=Cluster.Lat,
color = new.fdata4$Total.Counts), size = new.fdata4$Total.Counts/100) +
scale_color_gradient(low = 'red', high = 'yellow') +
geom_jitter(data = tidy.schooldf, aes(x= Longitude, y= Latitude), shape = 4) +
labs(title = 'New York City', x = 'Longitude', y = 'Latitude')
map

The data can also be exported to carto.com since it looks so good, and the result is shown below:

We can see that there are many large clusters where many robberies occur which are between schools. For example, the cluster of 162 robberies in east Williamsburg, between the "Young Women's Leadership School", "Lyons Community School", and "Juan Morel Campos Secondary School". Also in Mott Haven, in the Bronx, there is a cluster of 78 robberies by P.S. 168, as well as a cluster of 137 robberies by P.S. K396, and "Teachers Preporatory High School" in Brownsville, Brooklyn.

There seems to be more correlation between the location of the schools and robberies in the Bronx. This may be however due to a larger concentration of schools in a smaller are compared to other boroughs to be conclusive.

## Geo-Spatial Distribution of the Data

We can examine this concept further by looking at the distribution of the distance bewteen the unclustered robberies and the closest school. To determine whether robberies do occur close to schools we can compare to a uniform random distribution of points in New York City.

To do this I use a shapefile, which can be found here. This gives me the boundary of New York City, so I'm not choosing random points that lie in the east river. I load in the shapefile and access the boundary using the following R code:

# load shapefile and read in
shp <-'../Data-Incubator-Challenge/Vision-Zero/nybb_16b/nybb.shp'
ny <- readOGR(shp, ogrListLayers(shp)[1], stringsAsFactors=FALSE)
OGR data source with driver: ESRI Shapefile
Source: "../Data-Incubator-Challenge/Vision-Zero/nybb_16b/nybb.shp", layer: "nybb"
with 5 features
It has 4 fields
map <- spTransform(ny, CRS("+proj=longlat +datum=WGS84"))

Sanity check on 400 random points.

num.Rands_test = 400 # number of random numbers
# Outer longitude, latitude limits of NYC
long.Min_test = -74.2; long.Max_test = -73.65
lat.Min_test = 40.4; lat.Max_test = 40.95
# generate random numbers
random.Longs_test = runif(num.Rands_test, long.Min_test, long.Max_test)
random.Lats_test = runif(num.Rands_test, lat.Min_test, lat.Max_test)

# convert to dataframe
dat_test <- data.frame(Longitude = random.Longs_test,
Latitude  = random.Lats_test)
LongLats_test <- dat_test
# convert to coordinates
coordinates(dat_test) <- ~ Longitude + Latitude
# Set the projection of the SpatialPointsDataFrame using the projection of the shapefile
proj4string(dat_test) <- proj4string(map)

# select points only over manhattan boundry
rand.points_test <- cbind(over(dat_test, map), LongLats_test) %>%
filter(!is.na(BoroName)) %>%
dplyr::select(Latitude, Longitude)

nycmap <- get_googlemap('maspeth NYC', zoom = 11)
ggmap(nycmap) + geom_point(data = rand.points_test, aes(x = Longitude, y= Latitude))