library(tidyverse)
library(tidylog)
library(openintro)
library(ggthemes)
theme_set(theme_clean())

Import data

Public_Schools <- read_csv("../raw/Public_Schools.csv")
hh <- read_csv("../data/hh.csv")

Because there are schools included in this dataset that are not in the US, we are going to filter out all schools that are not in the US. Additionally, lets get rid of any variables that are irrelevant to our future analyses. District ID is the leaid so we are renaming it for consistency with other datasets.

US_PS <- Public_Schools %>%
  filter(COUNTRY == "USA") %>%
  select(-(X:OBJECTID),
         -(ADDRESS:WEBSITE), 
         -FT_TEACHER, 
         -SHELTER_ID, 
         -NCESID, 
         -ENROLLMENT)%>%
  rename(leaid =DISTRICTID, 
         level = LEVEL_)
## filter: removed 1,379 rows (1%), 100,955 rows remaining
## select: dropped 28 variables (X, Y, OBJECTID, NCESID, ADDRESS, …)
## rename: renamed 2 variables (level, leaid)

Join household data with the public school data.

Lets join using the LEAID (district ID) so we can have a better idea of waht is going on in relation to hh conditions.

school_dist_join <- US_PS %>%
  left_join(hh, by = c("leaid"))
## left_join: added 17 columns (state, dist, children, pct_pov, pct_SP, …)
##            > rows only in x     9,705
##            > rows only in y  (    315)
##            > matched rows      91,250
##            >                 =========
##            > rows total       100,955

There appears to be 11769 schools in the public school dataset that do not have coinciding school districts documented in the household dataset.

school_dist_join %>%
  is.na() %>%
  colSums()
##      NAME     level  ST_GRADE END_GRADE     leaid     state      dist  children 
##         0         0         0         0         0      9705      9705      9705 
##   pct_pov    pct_SP    SP_MOE  pct_HHVJ  HHVJ_MOE    pct_CC    CC_MOE   pct_NCI 
##      9705      9705      9705      9705      9705      9705      9705      9705 
##   nci_MOE    pct_CD    CD_MOE   pct_CLI   CLI_MOE    region 
##      9705      9705      9705      9705      9705      9821
school_dist_join%>%
  filter(is.na(dist))
## filter: removed 91,250 rows (90%), 9,705 rows remaining
## # A tibble: 9,705 × 22
##    NAME      level ST_GRADE END_GRADE  leaid state dist  children pct_pov pct_SP
##    <chr>     <chr> <chr>    <chr>      <dbl> <chr> <chr>    <dbl>   <dbl>  <dbl>
##  1 PARAMOUN… HIGH  06       12        6.01e5 <NA>  <NA>        NA      NA     NA
##  2 OLIVE GR… OTHER KG       12        6.01e5 <NA>  <NA>        NA      NA     NA
##  3 PINNACLE… HIGH  05       12        6.91e5 <NA>  <NA>        NA      NA     NA
##  4 TRINITY … OTHER KG       12        6.91e5 <NA>  <NA>        NA      NA     NA
##  5 OAKLAND … MIDD… 06       08        6.91e5 <NA>  <NA>        NA      NA     NA
##  6 KINGSMAN… HIGH  06       12        1.10e6 <NA>  <NA>        NA      NA     NA
##  7 FRIENDSH… ELEM… PK       05        1.10e6 <NA>  <NA>        NA      NA     NA
##  8 TWO RIVE… ELEM… PK       03        1.10e6 <NA>  <NA>        NA      NA     NA
##  9 MONUMENT… MIDD… 05       07        1.10e6 <NA>  <NA>        NA      NA     NA
## 10 ARC       HIGH  10       12        1.70e6 <NA>  <NA>        NA      NA     NA
## # … with 9,695 more rows, and 12 more variables: SP_MOE <dbl>, pct_HHVJ <dbl>,
## #   HHVJ_MOE <dbl>, pct_CC <dbl>, CC_MOE <dbl>, pct_NCI <dbl>, nci_MOE <dbl>,
## #   pct_CD <dbl>, CD_MOE <dbl>, pct_CLI <dbl>, CLI_MOE <dbl>, region <chr>

Quick look into education levels

Lets take a look at what levels of education exist in the data set so we can try to isolate the districts with high schools. Based on the level variable, we can see that there are 22886 schools categorized as high schools.

school_dist_join%>%
  count(level) 
## count: now 10 rows and 2 columns, ungrouped
## # A tibble: 10 × 2
##    level               n
##    <chr>           <int>
##  1 ADULT EDUCATION    28
##  2 ELEMENTARY      52608
##  3 HIGH            22886
##  4 MIDDLE          16393
##  5 NOT APPLICABLE    818
##  6 NOT REPORTED     2169
##  7 OTHER            3827
##  8 PREKINDERGARTEN  1479
##  9 SECONDARY         597
## 10 UNGRADED          150

There are a few schools that are considered “other” that include K-12 schools with graduation rates, as well as many of the “secondary” schools, so lets filter this down to only include districts with high schools in them.

Using the end grade variable as a marker of high schools, there are 26366 schools that end with grade 12 (3480 more schools than are included in the “HIGH” category”.

school_dist_join%>%
  count(END_GRADE) 
## count: now 19 rows and 2 columns, ungrouped
## # A tibble: 19 × 2
##    END_GRADE     n
##    <chr>     <int>
##  1 01          541
##  2 02         1604
##  3 03         1441
##  4 04         3946
##  5 05        27618
##  6 06        10786
##  7 07          498
##  8 08        21569
##  9 09          740
## 10 10          260
## 11 11          314
## 12 12        26366
## 13 13           99
## 14 AE           28
## 15 KG          529
## 16 M          2169
## 17 N           818
## 18 PK         1479
## 19 UG          150

Filtering method

This inconsistency in the dataset bares the question of how we are going to accurately filter the dataset to only include high schools. One option is to include schools that end in grade 12 OR are classified as high schools. This would include some of schools that appear to be high schools but are categorized as “other”, “secondary” or “middle”. Lets try it out.

high_dist <- school_dist_join %>%
  filter(END_GRADE == "12" | level == "HIGH")
## filter: removed 74,498 rows (74%), 26,457 rows remaining
high_dist%>%
  count(level) 
## count: now 3 rows and 2 columns, ungrouped
## # A tibble: 3 × 2
##   level      n
##   <chr>  <int>
## 1 HIGH   22886
## 2 MIDDLE     4
## 3 OTHER   3567

This removed 74,498 schools, leaving only 26,457 schools in the data set that are likely high schools. This included most of the schools in the “other” category, as well as four schools categorized as “middle”.

Create dataset with high school leaids

We want a list of all leaids (districts) with high schools as a method of filtering the rest of our datasets to only include districts with high schools. It would make sense to do this given that there should only be graduation rates for districts with high schools.

high_dist <- high_dist %>%
  select(leaid)%>%
  distinct()
## select: dropped 21 variables (NAME, level, ST_GRADE, END_GRADE, state, …)
## distinct: removed 13,481 rows (51%), 12,976 rows remaining
See script for creating the high school district leaid set
LS0tCnRpdGxlOiAiRGlzdHJpY3RzIHdpdGggSGlnaCBTY2hvb2xzIgphdXRob3I6ICJKb24gR2VpZ2VyLCBOb2VsIEdvb2R3aW4sIEFiaWdhaWwgSm9wcGEiCmRhdGU6ICJgciBTeXMuRGF0ZSgpYCIKb3V0cHV0OiBvcGVuaW50cm86OmxhYl9yZXBvcnQKLS0tCgpgYGB7ciBsb2FkLXBhY2thZ2VzLCBtZXNzYWdlPUZBTFNFLCB3YXJuaW5nPUZBTFNFfQpsaWJyYXJ5KHRpZHl2ZXJzZSkKbGlicmFyeSh0aWR5bG9nKQpsaWJyYXJ5KG9wZW5pbnRybykKbGlicmFyeShnZ3RoZW1lcykKdGhlbWVfc2V0KHRoZW1lX2NsZWFuKCkpCmBgYAoKCiMjIyBJbXBvcnQgZGF0YQoKYGBge3IsIG1lc3NhZ2U9RkFMU0V9ClB1YmxpY19TY2hvb2xzIDwtIHJlYWRfY3N2KCIuLi9yYXcvUHVibGljX1NjaG9vbHMuY3N2IikKaGggPC0gcmVhZF9jc3YoIi4uL2RhdGEvaGguY3N2IikKCmBgYAoKQmVjYXVzZSB0aGVyZSBhcmUgc2Nob29scyBpbmNsdWRlZCBpbiB0aGlzIGRhdGFzZXQgdGhhdCBhcmUgbm90IGluIHRoZSBVUywgd2UgYXJlIGdvaW5nIHRvIGZpbHRlciBvdXQgYWxsIHNjaG9vbHMgdGhhdCBhcmUgbm90IGluIHRoZSBVUy4gQWRkaXRpb25hbGx5LCBsZXRzIGdldCByaWQgb2YgYW55IHZhcmlhYmxlcyB0aGF0IGFyZSBpcnJlbGV2YW50IHRvIG91ciBmdXR1cmUgYW5hbHlzZXMuIERpc3RyaWN0IElEIGlzIHRoZSBsZWFpZCBzbyB3ZSBhcmUgcmVuYW1pbmcgaXQgZm9yIGNvbnNpc3RlbmN5IHdpdGggb3RoZXIgZGF0YXNldHMuIApgYGB7cn0KClVTX1BTIDwtIFB1YmxpY19TY2hvb2xzICU+JQogIGZpbHRlcihDT1VOVFJZID09ICJVU0EiKSAlPiUKICBzZWxlY3QoLShYOk9CSkVDVElEKSwKICAgICAgICAgLShBRERSRVNTOldFQlNJVEUpLCAKICAgICAgICAgLUZUX1RFQUNIRVIsIAogICAgICAgICAtU0hFTFRFUl9JRCwgCiAgICAgICAgIC1OQ0VTSUQsIAogICAgICAgICAtRU5ST0xMTUVOVCklPiUKICByZW5hbWUobGVhaWQgPURJU1RSSUNUSUQsIAogICAgICAgICBsZXZlbCA9IExFVkVMXykKCmBgYAoKIyMjIEpvaW4gaG91c2Vob2xkIGRhdGEgd2l0aCB0aGUgcHVibGljIHNjaG9vbCBkYXRhLiAKCkxldHMgam9pbiB1c2luZyB0aGUgTEVBSUQgKGRpc3RyaWN0IElEKSBzbyB3ZSBjYW4gaGF2ZSBhIGJldHRlciBpZGVhIG9mIHdhaHQgaXMgZ29pbmcgb24gaW4gcmVsYXRpb24gdG8gaGggY29uZGl0aW9ucy4gIApgYGB7cn0Kc2Nob29sX2Rpc3Rfam9pbiA8LSBVU19QUyAlPiUKICBsZWZ0X2pvaW4oaGgsIGJ5ID0gYygibGVhaWQiKSkKYGBgCgpUaGVyZSBhcHBlYXJzIHRvIGJlIDExNzY5IHNjaG9vbHMgaW4gdGhlIHB1YmxpYyBzY2hvb2wgZGF0YXNldCB0aGF0IGRvIG5vdCBoYXZlIGNvaW5jaWRpbmcgc2Nob29sIGRpc3RyaWN0cyBkb2N1bWVudGVkIGluIHRoZSBob3VzZWhvbGQgZGF0YXNldC4gCgpgYGB7cn0Kc2Nob29sX2Rpc3Rfam9pbiAlPiUKICBpcy5uYSgpICU+JQogIGNvbFN1bXMoKQoKc2Nob29sX2Rpc3Rfam9pbiU+JQogIGZpbHRlcihpcy5uYShkaXN0KSkKYGBgCgoKIyMjIFF1aWNrIGxvb2sgaW50byBlZHVjYXRpb24gbGV2ZWxzCgpMZXRzIHRha2UgYSBsb29rIGF0IHdoYXQgbGV2ZWxzIG9mIGVkdWNhdGlvbiBleGlzdCBpbiB0aGUgZGF0YSBzZXQgc28gd2UgY2FuIHRyeSB0byBpc29sYXRlIHRoZSBkaXN0cmljdHMgd2l0aCBoaWdoIHNjaG9vbHMuIEJhc2VkIG9uIHRoZSBsZXZlbCB2YXJpYWJsZSwgd2UgY2FuIHNlZSB0aGF0IHRoZXJlIGFyZSAyMjg4NiBzY2hvb2xzIGNhdGVnb3JpemVkIGFzIGhpZ2ggc2Nob29scy4gCgpgYGB7cn0Kc2Nob29sX2Rpc3Rfam9pbiU+JQogIGNvdW50KGxldmVsKSAKYGBgCgpUaGVyZSBhcmUgYSBmZXcgc2Nob29scyB0aGF0IGFyZSBjb25zaWRlcmVkICJvdGhlciIgdGhhdCBpbmNsdWRlIEstMTIgc2Nob29scyB3aXRoIGdyYWR1YXRpb24gcmF0ZXMsIGFzIHdlbGwgYXMgbWFueSBvZiB0aGUgInNlY29uZGFyeSIgc2Nob29scywgc28gbGV0cyBmaWx0ZXIgdGhpcyBkb3duIHRvIG9ubHkgaW5jbHVkZSBkaXN0cmljdHMgd2l0aCBoaWdoIHNjaG9vbHMgaW4gdGhlbS4gCgpVc2luZyB0aGUgZW5kIGdyYWRlIHZhcmlhYmxlIGFzIGEgbWFya2VyIG9mIGhpZ2ggc2Nob29scywgdGhlcmUgYXJlIDI2MzY2IHNjaG9vbHMgdGhhdCBlbmQgd2l0aCBncmFkZSAxMiAoMzQ4MCBtb3JlIHNjaG9vbHMgdGhhbiBhcmUgaW5jbHVkZWQgaW4gdGhlICJISUdIIiBjYXRlZ29yeSIuCgpgYGB7cn0Kc2Nob29sX2Rpc3Rfam9pbiU+JQogIGNvdW50KEVORF9HUkFERSkgCgpgYGAKCiMjIyBGaWx0ZXJpbmcgbWV0aG9kCgpUaGlzIGluY29uc2lzdGVuY3kgaW4gdGhlIGRhdGFzZXQgYmFyZXMgdGhlIHF1ZXN0aW9uIG9mIGhvdyB3ZSBhcmUgZ29pbmcgdG8gYWNjdXJhdGVseSBmaWx0ZXIgdGhlIGRhdGFzZXQgdG8gb25seSBpbmNsdWRlIGhpZ2ggc2Nob29scy4gT25lIG9wdGlvbiBpcyB0byBpbmNsdWRlIHNjaG9vbHMgdGhhdCBlbmQgaW4gZ3JhZGUgMTIgT1IgYXJlIGNsYXNzaWZpZWQgYXMgaGlnaCBzY2hvb2xzLiBUaGlzIHdvdWxkIGluY2x1ZGUgc29tZSBvZiBzY2hvb2xzIHRoYXQgYXBwZWFyIHRvIGJlIGhpZ2ggc2Nob29scyBidXQgYXJlIGNhdGVnb3JpemVkIGFzICJvdGhlciIsICJzZWNvbmRhcnkiIG9yICJtaWRkbGUiLiBMZXRzIHRyeSBpdCBvdXQuIAoKYGBge3J9CmhpZ2hfZGlzdCA8LSBzY2hvb2xfZGlzdF9qb2luICU+JQogIGZpbHRlcihFTkRfR1JBREUgPT0gIjEyIiB8IGxldmVsID09ICJISUdIIikKCmhpZ2hfZGlzdCU+JQogIGNvdW50KGxldmVsKSAKYGBgClRoaXMgcmVtb3ZlZCA3NCw0OTggc2Nob29scywgbGVhdmluZyBvbmx5IDI2LDQ1NyBzY2hvb2xzIGluIHRoZSBkYXRhIHNldCB0aGF0IGFyZSBsaWtlbHkgaGlnaCBzY2hvb2xzLiBUaGlzIGluY2x1ZGVkIG1vc3Qgb2YgdGhlIHNjaG9vbHMgaW4gdGhlICJvdGhlciIgY2F0ZWdvcnksIGFzIHdlbGwgYXMgZm91ciBzY2hvb2xzIGNhdGVnb3JpemVkIGFzICJtaWRkbGUiLgoKIyMjIENyZWF0ZSBkYXRhc2V0IHdpdGggaGlnaCBzY2hvb2wgbGVhaWRzCgpXZSB3YW50IGEgbGlzdCBvZiBhbGwgbGVhaWRzIChkaXN0cmljdHMpIHdpdGggaGlnaCBzY2hvb2xzIGFzIGEgbWV0aG9kIG9mIGZpbHRlcmluZyB0aGUgcmVzdCBvZiBvdXIgZGF0YXNldHMgdG8gb25seSBpbmNsdWRlIGRpc3RyaWN0cyB3aXRoIGhpZ2ggc2Nob29scy4gSXQgd291bGQgbWFrZSBzZW5zZSB0byBkbyB0aGlzIGdpdmVuIHRoYXQgdGhlcmUgc2hvdWxkIG9ubHkgYmUgZ3JhZHVhdGlvbiByYXRlcyBmb3IgZGlzdHJpY3RzIHdpdGggaGlnaCBzY2hvb2xzLiAKCmBgYHtyfQpoaWdoX2Rpc3QgPC0gaGlnaF9kaXN0ICU+JQogIHNlbGVjdChsZWFpZCklPiUKICBkaXN0aW5jdCgpCmBgYAoKIyMjIyMgU2VlIHNjcmlwdCBmb3IgY3JlYXRpbmcgdGhlIGhpZ2ggc2Nob29sIGRpc3RyaWN0IGxlYWlkIHNldAo=