library(tidyverse)
library(tidylog)
library(openintro)
library(ggthemes)
theme_set(theme_clean())
<- read_csv("../raw/Public_Schools.csv")
Public_Schools <- read_csv("../data/hh.csv") hh
Because there are schools included in this dataset that are not in the US, we are going to filter out all schools that are not in the US. Additionally, lets get rid of any variables that are irrelevant to our future analyses. District ID is the leaid so we are renaming it for consistency with other datasets.
<- Public_Schools %>%
US_PS filter(COUNTRY == "USA") %>%
select(-(X:OBJECTID),
-(ADDRESS:WEBSITE),
-FT_TEACHER,
-SHELTER_ID,
-NCESID,
-ENROLLMENT)%>%
rename(leaid =DISTRICTID,
level = LEVEL_)
## filter: removed 1,379 rows (1%), 100,955 rows remaining
## select: dropped 28 variables (X, Y, OBJECTID, NCESID, ADDRESS, …)
## rename: renamed 2 variables (level, leaid)
Lets join using the LEAID (district ID) so we can have a better idea of waht is going on in relation to hh conditions.
<- US_PS %>%
school_dist_join left_join(hh, by = c("leaid"))
## left_join: added 17 columns (state, dist, children, pct_pov, pct_SP, …)
## > rows only in x 9,705
## > rows only in y ( 315)
## > matched rows 91,250
## > =========
## > rows total 100,955
There appears to be 11769 schools in the public school dataset that do not have coinciding school districts documented in the household dataset.
%>%
school_dist_join is.na() %>%
colSums()
## NAME level ST_GRADE END_GRADE leaid state dist children
## 0 0 0 0 0 9705 9705 9705
## pct_pov pct_SP SP_MOE pct_HHVJ HHVJ_MOE pct_CC CC_MOE pct_NCI
## 9705 9705 9705 9705 9705 9705 9705 9705
## nci_MOE pct_CD CD_MOE pct_CLI CLI_MOE region
## 9705 9705 9705 9705 9705 9821
%>%
school_dist_joinfilter(is.na(dist))
## filter: removed 91,250 rows (90%), 9,705 rows remaining
## # A tibble: 9,705 × 22
## NAME level ST_GRADE END_GRADE leaid state dist children pct_pov pct_SP
## <chr> <chr> <chr> <chr> <dbl> <chr> <chr> <dbl> <dbl> <dbl>
## 1 PARAMOUN… HIGH 06 12 6.01e5 <NA> <NA> NA NA NA
## 2 OLIVE GR… OTHER KG 12 6.01e5 <NA> <NA> NA NA NA
## 3 PINNACLE… HIGH 05 12 6.91e5 <NA> <NA> NA NA NA
## 4 TRINITY … OTHER KG 12 6.91e5 <NA> <NA> NA NA NA
## 5 OAKLAND … MIDD… 06 08 6.91e5 <NA> <NA> NA NA NA
## 6 KINGSMAN… HIGH 06 12 1.10e6 <NA> <NA> NA NA NA
## 7 FRIENDSH… ELEM… PK 05 1.10e6 <NA> <NA> NA NA NA
## 8 TWO RIVE… ELEM… PK 03 1.10e6 <NA> <NA> NA NA NA
## 9 MONUMENT… MIDD… 05 07 1.10e6 <NA> <NA> NA NA NA
## 10 ARC HIGH 10 12 1.70e6 <NA> <NA> NA NA NA
## # … with 9,695 more rows, and 12 more variables: SP_MOE <dbl>, pct_HHVJ <dbl>,
## # HHVJ_MOE <dbl>, pct_CC <dbl>, CC_MOE <dbl>, pct_NCI <dbl>, nci_MOE <dbl>,
## # pct_CD <dbl>, CD_MOE <dbl>, pct_CLI <dbl>, CLI_MOE <dbl>, region <chr>
Lets take a look at what levels of education exist in the data set so we can try to isolate the districts with high schools. Based on the level variable, we can see that there are 22886 schools categorized as high schools.
%>%
school_dist_joincount(level)
## count: now 10 rows and 2 columns, ungrouped
## # A tibble: 10 × 2
## level n
## <chr> <int>
## 1 ADULT EDUCATION 28
## 2 ELEMENTARY 52608
## 3 HIGH 22886
## 4 MIDDLE 16393
## 5 NOT APPLICABLE 818
## 6 NOT REPORTED 2169
## 7 OTHER 3827
## 8 PREKINDERGARTEN 1479
## 9 SECONDARY 597
## 10 UNGRADED 150
There are a few schools that are considered “other” that include K-12 schools with graduation rates, as well as many of the “secondary” schools, so lets filter this down to only include districts with high schools in them.
Using the end grade variable as a marker of high schools, there are 26366 schools that end with grade 12 (3480 more schools than are included in the “HIGH” category”.
%>%
school_dist_joincount(END_GRADE)
## count: now 19 rows and 2 columns, ungrouped
## # A tibble: 19 × 2
## END_GRADE n
## <chr> <int>
## 1 01 541
## 2 02 1604
## 3 03 1441
## 4 04 3946
## 5 05 27618
## 6 06 10786
## 7 07 498
## 8 08 21569
## 9 09 740
## 10 10 260
## 11 11 314
## 12 12 26366
## 13 13 99
## 14 AE 28
## 15 KG 529
## 16 M 2169
## 17 N 818
## 18 PK 1479
## 19 UG 150
This inconsistency in the dataset bares the question of how we are going to accurately filter the dataset to only include high schools. One option is to include schools that end in grade 12 OR are classified as high schools. This would include some of schools that appear to be high schools but are categorized as “other”, “secondary” or “middle”. Lets try it out.
<- school_dist_join %>%
high_dist filter(END_GRADE == "12" | level == "HIGH")
## filter: removed 74,498 rows (74%), 26,457 rows remaining
%>%
high_distcount(level)
## count: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
## level n
## <chr> <int>
## 1 HIGH 22886
## 2 MIDDLE 4
## 3 OTHER 3567
This removed 74,498 schools, leaving only 26,457 schools in the data set that are likely high schools. This included most of the schools in the “other” category, as well as four schools categorized as “middle”.
We want a list of all leaids (districts) with high schools as a method of filtering the rest of our datasets to only include districts with high schools. It would make sense to do this given that there should only be graduation rates for districts with high schools.
<- high_dist %>%
high_dist select(leaid)%>%
distinct()
## select: dropped 21 variables (NAME, level, ST_GRADE, END_GRADE, state, …)
## distinct: removed 13,481 rows (51%), 12,976 rows remaining