### Hypothesis

My hypothesis is that changes in the concentration of lead in the NYC air will be accompanied by correlated changes in the number of lead-related complaints in the NYC 311 data.
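As a rough illustration of what "correlated changes" could mean in practice, the base R sketch below counts complaints per day, joins the counts to the lead measurements by date, and computes a correlation coefficient. The data frames `pb_toy` and `complaints_toy` and all of their values are made up for illustration; only the column names `date_` and `Pb.conc` follow the ones used later in this analysis.

```r
# Toy data: made-up values, NOT taken from the EPA or 311 datasets.
pb_toy <- data.frame(
  date_   = as.Date(c("2015-01-06", "2015-01-12", "2015-01-18", "2015-01-24")),
  Pb.conc = c(0.0040, 0.0048, 0.0028, 0.0038)
)
complaints_toy <- data.frame(
  date_ = as.Date(c("2015-01-06", "2015-01-06", "2015-01-12",
                    "2015-01-18", "2015-01-24", "2015-01-24", "2015-01-25"))
)

# Count the complaints made on each day.
complaints_toy$n_complaints <- 1L
daily_counts <- aggregate(n_complaints ~ date_, data = complaints_toy, FUN = sum)

# Inner join: keep only the days that also have a lead measurement.
merged <- merge(pb_toy, daily_counts, by = "date_")

# Pearson correlation between lead concentration and daily complaint count.
r <- cor(merged$Pb.conc, merged$n_complaints)
print(merged)
print(r)
```

With real data, `cor.test()` would also give a p-value, which matters because a correlation computed from only a handful of days says very little.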

The 311 dataset comes from the NYC open data website that can be found here: ‘NYC 311’. The data on the air quality came from the EPA website here: ‘EPA’.

#### Choosing complaints

There are many different complaint types in the NYC 311 dataset. I have chosen the types that I think are most likely to show a direct correlation. Specifically, I focus on complaints that could be due to lead contaminants in water, lead in paint, and lead in waste.

library(dplyr)

# Complaint types that could plausibly be related to lead
complaints <- c('Lead','Radioactive Material','Water Quality',
                'Air Quality','Industrial Waste','Drinking Water',
                'Water System','Drinking','PAINT/PLASTER',
                'PLUMBING', 'General Construction/Plumbing')
# Backticks are needed because the column name contains a space
filtered_df <- df_311 %>% filter(`Complaint Type` %in% complaints) %>%
  select(date_ = Date, `Complaint Type`, Longitude, Latitude)

dim(filtered_df)
[1] 139413      4

We can see that there are over 139,000 observations.

During other analyses of the PM2.5, carbon monoxide, and ozone air quality datasets, the air quality index (AQI) was a good indicator for those variables.

If we look at the unique values of the AQI and of the daily observation count in the lead dataset, we can see that they do not vary at all, so they would not be helpful in any analysis.

unique(df_pb$DAILY_AQI_VALUE)
[1] 0
unique(df_pb$DAILY_OBS_COUNT)
[1] 1
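The same check can be run over every column at once. The sketch below is a minimal base R example on a toy stand-in for `df_pb`: the data frame `df_pb_toy` and its values are assumptions made for illustration, and only the column names come from the output above.

```r
# Toy stand-in for df_pb; the values are made up for illustration.
df_pb_toy <- data.frame(
  DAILY_AQI_VALUE = c(0, 0, 0, 0),
  DAILY_OBS_COUNT = c(1, 1, 1, 1),
  Pb.conc         = c(0.0040, 0.0048, 0.0028, 0.0038)
)

# Count the distinct values in each column.
n_unique <- vapply(df_pb_toy, function(col) length(unique(col)), integer(1))

# Columns with a single distinct value cannot explain any variation.
constant_cols <- names(n_unique)[n_unique == 1]
print(constant_cols)
```

Any column listed in `constant_cols` can be dropped before looking for correlations, since a constant has zero variance.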

In this case we will just use the concentration of lead in the air as our independent variable.

Next we will check for duplicate dates. Duplicates can occur when air quality measurements are taken on the same day at different measurement stations. Where they occur, we will take the mean of the measurements for that day.

sum(duplicated(df_pb2$date_))
[1] 235
df_pb3 <- df_pb2 %>% group_by(date_) %>% dplyr::summarise(Pb.conc = mean(Pb.conc))
head(df_pb3)
Source: local data frame [6 x 2]

date_ Pb.conc
(date)   (dbl)
1 2015-01-06  0.0040
2 2015-01-12  0.0048
3 2015-01-18  0.0028
4 2015-01-24  0.0038
5 2015-01-30  0.0100
6 2015-02-05  0.0156

There are indeed duplicate dates, and we can also see that measurements are taken every 6 days.
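The 6-day spacing can be checked directly instead of being read off the printed rows. The sketch below uses only the six dates shown in the output above; the name `dates_head` is introduced here for illustration, and on the full data the same call would be applied to `df_pb3$date_`.

```r
# The six dates printed by head(df_pb3) above.
dates_head <- as.Date(c("2015-01-06", "2015-01-12", "2015-01-18",
                        "2015-01-24", "2015-01-30", "2015-02-05"))

# Gap in days between consecutive measurement dates.
gaps <- as.numeric(diff(sort(dates_head)), units = "days")

# Tabulate the gaps; a single entry means the spacing is regular.
print(table(gaps))
```

On the full dataset, any gap other than 6 in this table would point to missing or extra sampling days.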

We can plot the variation in lead concentration in $$\frac{\mu g}{m^3}$$ over time.
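A minimal base R sketch of such a plot is shown below. It uses only the six rows printed by `head(df_pb3)` above, stored in a data frame named `pb_head` for illustration; the original figure may well have been drawn with a different plotting library and the full dataset.

```r
# The six rows shown by head(df_pb3) above.
pb_head <- data.frame(
  date_   = as.Date(c("2015-01-06", "2015-01-12", "2015-01-18",
                      "2015-01-24", "2015-01-30", "2015-02-05")),
  Pb.conc = c(0.0040, 0.0048, 0.0028, 0.0038, 0.0100, 0.0156)
)

# Line-and-point plot of lead concentration against date.
plot(pb_head$date_, pb_head$Pb.conc, type = "b",
     xlab = "Date",
     ylab = expression(paste("Pb concentration (", mu, "g/", m^3, ")")),
     main = "Lead concentration over time")
```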

We can plot the location of the measurement station in the NYC area, and see that there is only one. Because of this, we will only use complaints from the boroughs closest to the measurement station (Manhattan, Staten Island, and Brooklyn), as the measured lead concentration will most closely represent the concentration at the location of those complaints. Moreover, we will only use the lead concentration from this measurement station.
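One way to restrict the complaints to these boroughs is sketched below in base R. It assumes the complaints data carries a `Borough` column with upper-case borough names; that column is an assumption, since it is not among the columns selected into `filtered_df` earlier and would have to be added to the `select()` call. The data frame `complaints_boro_toy` is made up for illustration.

```r
# Toy complaints table; the rows and the Borough column are assumptions.
complaints_boro_toy <- data.frame(
  Complaint.Type = c("Lead", "PLUMBING", "Water Quality", "PAINT/PLASTER"),
  Borough        = c("MANHATTAN", "QUEENS", "BROOKLYN", "BRONX"),
  stringsAsFactors = FALSE
)

# Boroughs treated as close to the measurement station.
nearby_boroughs <- c("MANHATTAN", "STATEN ISLAND", "BROOKLYN")

# Keep only the complaints made in those boroughs (case-insensitive match).
is_nearby <- toupper(complaints_boro_toy$Borough) %in% nearby_boroughs
nearby_df <- complaints_boro_toy[is_nearby, ]
print(nearby_df)
```

An alternative that needs no borough column would be to compute the distance from each complaint's `Longitude` and `Latitude` to the station and keep complaints within a chosen radius.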