library(tidyverse)
library(openintro)
library(corrplot)
library(ggthemes)
theme_set(theme_clean())
source("../scripts/prune_race_variables.R")
<- read_csv("../data/hh.csv") %>%
grad_race_HH mutate(leaid = as.integer(leaid)) %>%
left_join(read_csv("../data/race.csv") %>%
mutate(leaid = as.integer(leaid)) %>%
prune_and_predom(),
by = c("leaid" = "leaid")) %>%
left_join(read_csv("../data/grad.csv") %>%
mutate(leaid = as.integer(leaid)),
by = c("leaid" = "leaid"))
Lets take a look at the summary statistics of graduation rates. The average graduation rates across the US is about 87.84. Interestingly, there are school districts that have as low as 3.23 percent graduation rates.
summary(grad_race_HH$grad_rate_midpt)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 3.23 81.53 87.52 85.68 91.51 97.93 2550
Taking a closer look with the histogram, we can see that there is an abnormal distribution skewed right with the majority of graduation rates centered between 80 and 90 percent. We are able to see where the outlier school district graduation rates are in the boxplot. Lets take a closer look at these outliers.
hist(grad_race_HH$grad_rate_midpt)
ggplot(grad_race_HH, aes(x = grad_rate_midpt))+
geom_boxplot()
## Warning: Removed 2550 rows containing non-finite values (stat_boxplot).
There are 759 school districts that are considered outliers from the normal distribution. The district with the smallest graduation rate is Shannon Country School District with 3.23% graduation rate and 3856 children. Another thing to note is the school district with the second lowest graduation rate is an elementary school district. If there are only Elementary schools in this district, as it is labeled, how did they document a graduation rate? As I continued looking deeper into the schools with the lowest graduation rates, I found that several of them are elementary and middle school districts that do not have high schools. This begs the questions of how many of the outlier districts are elementary/middle school districts and how are they able to document graduation rates?
<- grad_race_HH %>%
grad_race_HH_outliers filter(grad_rate_midpt < 75) %>%
select(dist, state, children, grad_rate_midpt) %>%
arrange(grad_rate_midpt)
grad_race_HH_outliers
## # A tibble: 815 × 4
## dist state children grad_rate_midpt
## <chr> <chr> <dbl> <dbl>
## 1 Shannon County School District 65-1 Sout… 3856 3.23
## 2 Hermosa Beach City Elementary School District Cali… 2257 15.4
## 3 Mitchell School District 55 Oreg… 31 23.6
## 4 Vilas School District RE-5 Colo… 52 24
## 5 Mackinac Island Public Schools Mich… 77 24
## 6 Vanderbilt Area Schools Mich… 253 24
## 7 College Corner Local School District Ohio 116 24
## 8 Annex School District 29 Oreg… 84 24
## 9 Manchester City School District Tenn… 1849 24
## 10 Kiryas Joel Village Union Free School District New … 9176 24
## # … with 805 more rows
Lets take a look at Shannon County School District. We can see that the household conditions are quite high with about 52% in poverty, 70% with single parents, 50% without computer access ect. We can also see that just over 90 percent of these students are estimated to be native american, so we can assume it is probably located on an Indian reservation. The rest of the student population consists of minority students with no white population.
US News gives more insight into this school district, reporting that 100% of those attending the school are “economically disadvantaged” https://www.usnews.com/education/best-high-schools/south-dakota/districts/oglala-lakota-county-65-1/oglala-lakota-county-virtual-high-school-92-17876
Here is a link to the shool district if you are curious: https://www.olcsd.com/
Given this information, it makes sense that graduation rates may be low (or it may be that they are under reported due to lack of resources?)
<- grad_race_HH %>%
Shannon_County filter(dist == "Shannon County School District 65-1" ) %>%
glimpse()
## Rows: 1
## Columns: 26
## $ state <chr> "South Dakota"
## $ leaid <int> 4665460
## $ dist <chr> "Shannon County School District 65-1"
## $ children <dbl> 3856
## $ pct_pov <dbl> 0.5222682
## $ pct_SP <dbl> 0.6934171
## $ SP_MOE <dbl> 0.075
## $ pct_HHVJ <dbl> 0.2386855
## $ HHVJ_MOE <dbl> 0.05
## $ pct_CC <dbl> 0.3375886
## $ CC_MOE <dbl> 0.055
## $ pct_NCI <dbl> 0.4896849
## $ nci_MOE <dbl> 0.08
## $ pct_CD <dbl> 0.06224066
## $ CD_MOE <dbl> 0.02
## $ pct_CLI <dbl> 0
## $ CLI_MOE <dbl> 0.005
## $ region <chr> "North Central"
## $ predom_race <fct> Native American
## $ pct_hisp_latino <dbl> 5
## $ pct_white <dbl> 0
## $ pct_black <dbl> 0
## $ pct_native <dbl> 90.7
## $ pct_asian <dbl> 0
## $ pct_PI <dbl> 0
## $ grad_rate_midpt <dbl> 3.22963
Graduation rates have the strongest negative correlation with households in poverty and those with single parents in the entire US.
<- grad_race_HH %>%
grad_race_HH_corr select(dist,
children,
grad_rate_midpt, :pct_CLI,
pct_pov
pct_hisp_latino,
pct_white,
pct_black,
pct_native, %>%
pct_asian)na.omit()
2:ncol(grad_race_HH_corr)] %>%
grad_race_HH_corr[,cor() %>%
corrplot(method = "number")
I am curious how this matrix would change if I filtered out the outliers with grad rates specifically. It did not change it much, in fact it made the correlations between grad rates and the household conditions less significant.
<- grad_race_HH%>%
grad_race_HH_norm filter(grad_rate_midpt > 75) %>%
select(dist,
children,
grad_rate_midpt, :pct_CLI,
pct_pov
pct_hisp_latino,
pct_white,
pct_black,
pct_native, %>%
pct_asian) na.omit()
2:ncol(grad_race_HH_norm)] %>%
grad_race_HH_norm[,cor() %>%
corrplot(method = "number")
Lets look into some regional differences. Here is a correlation matrix of the Bible Belt states.
<- grad_race_HH %>%
bible_belt filter(state == "Alabama" |
== "Arkansas" |
state == "Georgia"|
state == "Kentucky"|
state == "Louisiana"|
state == "Mississippi"|
state == "North Carolina"|
state == "Oklahoma"|
state == "South Carolina"|
state == "Tennessee"|
state == "Texas"|
state == "Utah") %>%
state select(dist,
children,
grad_rate_midpt, :pct_CLI,
pct_pov
pct_hisp_latino,
pct_white,
pct_black,
pct_native, %>%
pct_asian) na.omit()
2:ncol(bible_belt)] %>%
bible_belt[,cor() %>%
corrplot(method = "number")
Now lets see about grad rates in states with the “worst” school systems. (https://www.newsnationnow.com/us-news/list-states-with-the-best-and-worst-public-school-systems/)
<- grad_race_HH %>%
worst_states filter(state == "Alabama" |
== "Alaska" |
state == "Arizona"|
state == "Louisiana"|
state == "New Mexico") %>%
state select(dist,
children,
grad_rate_midpt, :pct_CLI,
pct_pov
pct_hisp_latino,
pct_white,
pct_black,
pct_native, %>%
pct_asian) na.omit()
2:ncol(worst_states)] %>%
worst_states[,cor() %>%
corrplot(method = "number")
What about the states with the “best” school systems? This plot is especially fascinating with a significantly higher negative correlation between grad rates and household conditions, as well as for black and Hispanic students.
<- grad_race_HH %>%
best_states filter(state == "Massachusetts" |
== "Connecticut" |
state == "New Jersey"|
state == "Virginia"|
state == "New Hampshire") %>%
state select(dist,
children,
grad_rate_midpt, :pct_CLI,
pct_pov
pct_hisp_latino,
pct_white,
pct_black,
pct_native, %>%
pct_asian) na.omit()
2:ncol(best_states)] %>%
best_states[,cor() %>%
corrplot(method = "number")
ggplot(grad_race_HH,
aes(x = pct_native, y = grad_rate_midpt)) +
geom_point(alpha = 0.1) +
geom_smooth(method = "lm", se = T) +
scale_x_log10()
## Warning: Transformation introduced infinite values in continuous x-axis
## Transformation introduced infinite values in continuous x-axis
## `geom_smooth()` using formula 'y ~ x'
## Warning: Removed 9788 rows containing non-finite values (stat_smooth).
## Warning: Removed 3112 rows containing missing values (geom_point).