library(tidyverse)
library(tidylog)
library(openintro)
library(leaps)
library(ggthemes)
theme_set(theme_clean())
source("../scripts/prune_race_variables.R")
The goal of this analysis is to fit a linear regression model that predicts a district's average high school graduation rate from the household conditions and racial composition of the surrounding area.
grad <- read_csv("../data/grad.csv")
## Rows: 12663 Columns: 2
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## dbl (2): leaid, grad_rate_midpt
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
hh <- read_csv("../data/hh.csv") %>%
  select(-ends_with("MOE"))
## Rows: 13313 Columns: 18
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): state, dist, region
## dbl (15): leaid, children, pct_pov, pct_SP, SP_MOE, pct_HHVJ, HHVJ_MOE, pct_...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## select: dropped 6 variables (SP_MOE, HHVJ_MOE, CC_MOE, nci_MOE, CD_MOE, …)
race <- read_csv("../data/race.csv") %>%
  prune_and_predom()
## Rows: 11910 Columns: 66
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (3): geo_id, dist, year
## dbl (63): leaid, total_pop_est, total_pop_moe, total_hisp_latino, total_hisp...
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
## select: dropped 59 variables (geo_id, dist, year, total_pop_est, total_pop_moe, …)
##
## mutate: converted 'leaid' from double to integer (0 new NA)
##
## select: dropped one variable (leaid)
##
## mutate: new variable 'predom_race' (factor) with 5 unique values and 0% NA
##
## select: dropped 6 variables (pct_hisp_latino, pct_white, pct_black, pct_native, pct_asian, …)
data <- grad %>%
  left_join(hh, by = "leaid") %>%
  left_join(race, by = "leaid") %>%
  select(-region, -predom_race)
## left_join: added 11 columns (state, dist, children, pct_pov, pct_SP, …)
## > rows only in x 1,900
## > rows only in y ( 2,550)
## > matched rows 10,763
## > ========
## > rows total 12,663
## left_join: added 7 columns (predom_race, pct_hisp_latino, pct_white, pct_black, pct_native, …)
## > rows only in x 2,462
## > rows only in y ( 1,709)
## > matched rows 10,201
## > ========
## > rows total 12,663
## select: dropped 2 variables (region, predom_race)
glimpse(data)
## Rows: 12,663
## Columns: 18
## $ leaid <dbl> 100005, 100006, 100007, 100008, 100011, 100012, 100013…
## $ grad_rate_midpt <dbl> 90.62127, 88.86380, 92.01013, 95.81859, 90.77311, 86.4…
## $ state <chr> "Alabama", "Alabama", "Alabama", "Alabama", "Alabama",…
## $ dist <chr> "Albertville City School District", "Marshall County S…
## $ children <dbl> 4591, 8299, 15397, 9416, 2324, 1644, 4476, 3064, 1425,…
## $ pct_pov <dbl> 0.30053842, 0.26622182, 0.07259553, 0.08298157, 0.1688…
## $ pct_SP <dbl> 0.3529261, 0.2951967, 0.2083155, 0.1783983, 0.2743837,…
## $ pct_HHVJ <dbl> 0.3265352, 0.2348667, 0.2921655, 0.2093061, 0.3043552,…
## $ pct_CC <dbl> 0.047664441, 0.026383169, 0.015484872, 0.016083051, 0.…
## $ pct_NCI <dbl> 0.22147135, 0.14248565, 0.03168848, 0.10363837, 0.2376…
## $ pct_CD <dbl> 0.01807885, 0.04349922, 0.03877379, 0.05777400, 0.0800…
## $ pct_CLI <dbl> 0.1736005247, 0.0386793576, 0.0441644490, 0.0042480882…
## $ pct_hisp_latino <dbl> 44.6, 24.3, 12.9, 7.8, 9.1, 10.8, 1.8, 11.8, 1.7, 1.0,…
## $ pct_white <dbl> 49.4, 70.0, 56.7, 64.1, 65.5, 86.9, 82.6, 43.2, 63.3, …
## $ pct_black <dbl> 3.5, 1.7, 23.1, 15.0, 24.6, 0.0, 6.7, 41.7, 33.0, 77.7…
## $ pct_native <dbl> 0.0, 0.2, 0.0, 0.1, 0.8, 0.0, 0.0, 0.0, 0.0, 0.0, 0.3,…
## $ pct_asian <dbl> 0.8, 0.8, 4.8, 5.5, 0.0, 0.0, 2.6, 2.4, 1.0, 0.2, 0.7,…
## $ pct_PI <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, …
We can notice immediately that the household-condition variables in our original dataset, such as pct_pov and pct_SP, are proportions ranging from 0 to 1, while the racial-composition columns, such as pct_hisp_latino and pct_white, are percentages ranging from 0 to 100. Before fitting a regression model, we will rescale the proportion-valued predictors so that all of them range from 0 to 100. That way a coefficient value of "1" can be interpreted as "1%" rather than 100%.
data_standardized <- data %>%
  mutate(
    across(.cols = pct_pov:pct_CLI,
           .fns = ~ . * 100)
  )
## mutate: changed 10,752 values (85%) of 'pct_pov' (0 new NA)
## changed 10,725 values (85%) of 'pct_SP' (0 new NA)
## changed 10,756 values (85%) of 'pct_HHVJ' (0 new NA)
## changed 10,276 values (81%) of 'pct_CC' (0 new NA)
## changed 10,608 values (84%) of 'pct_NCI' (0 new NA)
## changed 10,409 values (82%) of 'pct_CD' (0 new NA)
## changed 5,075 values (40%) of 'pct_CLI' (0 new NA)
Now that the variables are on a common scale, let's check out what a linear regression model looks like. We will construct three models: one using only the household-condition variables as predictors, one using only the racial-composition variables, and one using both.
full_df <- data_standardized %>%
  select(pct_pov:pct_PI, grad_rate_midpt) %>%
  na.omit()
## select: dropped 4 variables (leaid, state, dist, children)
hh_df <- full_df %>%
  select(pct_pov:pct_CLI, grad_rate_midpt)
## select: dropped 6 variables (pct_hisp_latino, pct_white, pct_black, pct_native, pct_asian, …)
race_df <- full_df %>%
  select(pct_hisp_latino:pct_PI, grad_rate_midpt)
## select: dropped 7 variables (pct_pov, pct_SP, pct_HHVJ, pct_CC, pct_NCI, …)
hh_model <- lm(grad_rate_midpt ~ ., data = hh_df)
race_model <- lm(grad_rate_midpt ~ ., data = race_df)
full_model <- lm(grad_rate_midpt ~ ., data = full_df)
summary(hh_model)
##
## Call:
## lm(formula = grad_rate_midpt ~ ., data = hh_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -73.642 -3.230 1.381 4.470 23.745
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 95.295719 0.387741 245.771 < 2e-16 ***
## pct_pov -0.140192 0.012211 -11.481 < 2e-16 ***
## pct_SP -0.128985 0.007249 -17.793 < 2e-16 ***
## pct_HHVJ -0.062998 0.014420 -4.369 1.26e-05 ***
## pct_CC -0.399606 0.028942 -13.807 < 2e-16 ***
## pct_NCI -0.014030 0.008569 -1.637 0.10159
## pct_CD -0.041546 0.021244 -1.956 0.05053 .
## pct_CLI 0.057739 0.019129 3.018 0.00255 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.126 on 10193 degrees of freedom
## Multiple R-squared: 0.1811, Adjusted R-squared: 0.1805
## F-statistic: 321.9 on 7 and 10193 DF, p-value: < 2.2e-16
summary(race_model)
##
## Call:
## lm(formula = grad_rate_midpt ~ ., data = race_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -71.464 -3.267 1.444 4.788 24.271
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 67.98072 1.62965 41.715 < 2e-16 ***
## pct_hisp_latino 0.15221 0.01698 8.964 < 2e-16 ***
## pct_white 0.20780 0.01695 12.261 < 2e-16 ***
## pct_black 0.08658 0.01778 4.869 1.14e-06 ***
## pct_native -0.04875 0.01957 -2.491 0.01275 *
## pct_asian 0.42143 0.02435 17.310 < 2e-16 ***
## pct_PI -0.60659 0.18706 -3.243 0.00119 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.109 on 10194 degrees of freedom
## Multiple R-squared: 0.1847, Adjusted R-squared: 0.1842
## F-statistic: 385 on 6 and 10194 DF, p-value: < 2.2e-16
summary(full_model)
##
## Call:
## lm(formula = grad_rate_midpt ~ ., data = full_df)
##
## Residuals:
## Min 1Q Median 3Q Max
## -72.874 -3.065 1.361 4.395 22.342
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 77.397473 1.694660 45.671 < 2e-16 ***
## pct_pov -0.130124 0.012245 -10.626 < 2e-16 ***
## pct_SP -0.076844 0.007840 -9.801 < 2e-16 ***
## pct_HHVJ -0.071285 0.014064 -5.068 4.08e-07 ***
## pct_CC -0.164292 0.033814 -4.859 1.20e-06 ***
## pct_NCI 0.015406 0.008514 1.809 0.070412 .
## pct_CD -0.069184 0.020942 -3.304 0.000958 ***
## pct_CLI -0.096282 0.022501 -4.279 1.89e-05 ***
## pct_hisp_latino 0.169099 0.017266 9.794 < 2e-16 ***
## pct_white 0.170447 0.016735 10.185 < 2e-16 ***
## pct_black 0.127181 0.017664 7.200 6.45e-13 ***
## pct_native -0.013247 0.020182 -0.656 0.511596
## pct_asian 0.303210 0.024836 12.209 < 2e-16 ***
## pct_PI -0.518837 0.182115 -2.849 0.004395 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.903 on 10187 degrees of freedom
## Multiple R-squared: 0.232, Adjusted R-squared: 0.231
## F-statistic: 236.7 on 13 and 10187 DF, p-value: < 2.2e-16
We can immediately notice that some of these predictors are extremely significant for predicting graduation rate: pct_pov, pct_SP, pct_hisp_latino, pct_white, and pct_asian all have p-values below 2e-16.
To see whether each set of predictors adds significant explanatory power, we can run a quick ANOVA comparison of each reduced model (household conditions only, race only) against the full model.
anova(hh_model, full_model)
## Analysis of Variance Table
##
## Model 1: grad_rate_midpt ~ pct_pov + pct_SP + pct_HHVJ + pct_CC + pct_NCI +
## pct_CD + pct_CLI
## Model 2: grad_rate_midpt ~ pct_pov + pct_SP + pct_HHVJ + pct_CC + pct_NCI +
## pct_CD + pct_CLI + pct_hisp_latino + pct_white + pct_black +
## pct_native + pct_asian + pct_PI
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 10193 517534
## 2 10187 485357 6 32177 112.56 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
anova(race_model, full_model)
## Analysis of Variance Table
##
## Model 1: grad_rate_midpt ~ pct_hisp_latino + pct_white + pct_black + pct_native +
## pct_asian + pct_PI
## Model 2: grad_rate_midpt ~ pct_pov + pct_SP + pct_HHVJ + pct_CC + pct_NCI +
## pct_CD + pct_CLI + pct_hisp_latino + pct_white + pct_black +
## pct_native + pct_asian + pct_PI
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 10194 515220
## 2 10187 485357 7 29863 89.542 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can see that for both the race-only and the household-conditions-only models, including the other set of predictors yields a statistically significant reduction in the model's RSS.
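As an aside, the nested-model F-test that anova() performs here can be sketched on a built-in dataset (mtcars, purely for illustration; the choice of variables is arbitrary):

```r
# Nested-model comparison: does adding hp and drat improve on a wt-only model?
small <- lm(mpg ~ wt, data = mtcars)
big   <- lm(mpg ~ wt + hp + drat, data = mtcars)

comparison <- anova(small, big)
p_value <- comparison$`Pr(>F)`[2]  # F-test p-value for the larger model's improvement
```

A small p-value means the extra predictors reduce the RSS by more than chance alone would explain, which is exactly the logic behind the two comparisons above.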
In order to choose the best set of predictors, we will perform best subset selection using the regsubsets() function from the leaps package.
fit_full <- regsubsets(grad_rate_midpt ~ .,
                       data = full_df,
                       nvmax = ncol(full_df) - 1)
reg_summary <- summary(fit_full)
reg_summary
## Subset selection object
## Call: regsubsets.formula(grad_rate_midpt ~ ., data = full_df, nvmax = ncol(full_df) -
## 1)
## 13 Variables (and intercept)
## Forced in Forced out
## pct_pov FALSE FALSE
## pct_SP FALSE FALSE
## pct_HHVJ FALSE FALSE
## pct_CC FALSE FALSE
## pct_NCI FALSE FALSE
## pct_CD FALSE FALSE
## pct_CLI FALSE FALSE
## pct_hisp_latino FALSE FALSE
## pct_white FALSE FALSE
## pct_black FALSE FALSE
## pct_native FALSE FALSE
## pct_asian FALSE FALSE
## pct_PI FALSE FALSE
## 1 subsets of each size up to 13
## Selection Algorithm: exhaustive
## pct_pov pct_SP pct_HHVJ pct_CC pct_NCI pct_CD pct_CLI pct_hisp_latino
## 1 ( 1 ) " " "*" " " " " " " " " " " " "
## 2 ( 1 ) "*" " " " " " " " " " " " " " "
## 3 ( 1 ) "*" "*" " " " " " " " " " " " "
## 4 ( 1 ) "*" "*" " " " " " " " " " " " "
## 5 ( 1 ) "*" "*" " " " " " " " " " " " "
## 6 ( 1 ) "*" "*" " " " " " " " " " " "*"
## 7 ( 1 ) "*" "*" " " "*" " " " " " " "*"
## 8 ( 1 ) "*" "*" "*" "*" " " " " " " "*"
## 9 ( 1 ) "*" "*" "*" "*" " " " " "*" "*"
## 10 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*"
## 11 ( 1 ) "*" "*" "*" "*" " " "*" "*" "*"
## 12 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*" "*" "*" "*"
## pct_white pct_black pct_native pct_asian pct_PI
## 1 ( 1 ) " " " " " " " " " "
## 2 ( 1 ) " " " " "*" " " " "
## 3 ( 1 ) " " " " "*" " " " "
## 4 ( 1 ) "*" " " "*" " " " "
## 5 ( 1 ) "*" " " "*" "*" " "
## 6 ( 1 ) "*" "*" " " "*" " "
## 7 ( 1 ) "*" "*" " " "*" " "
## 8 ( 1 ) "*" "*" " " "*" " "
## 9 ( 1 ) "*" "*" " " "*" " "
## 10 ( 1 ) "*" "*" " " "*" " "
## 11 ( 1 ) "*" "*" " " "*" "*"
## 12 ( 1 ) "*" "*" " " "*" "*"
## 13 ( 1 ) "*" "*" "*" "*" "*"
In this output, each row represents one model. We fit thirteen models, each of which is the "best model" for its number of predictors.
We can immediately notice something interesting: the best one-variable model uses pct_SP, the percentage of single-parent households, as its predictor. The best two-variable model, however, does not use pct_SP at all; it uses pct_pov and pct_native instead. The best three-variable model uses all three of these predictors.
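The coefficients of any of these best-of-size models can be pulled out with coef(). A self-contained sketch of that pattern on mtcars (illustrative only, not our district data):

```r
library(leaps)

# Exhaustive best-subset search over all predictors of mpg
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = ncol(mtcars) - 1)

# Intercept plus the three predictors chosen for the best 3-variable model
coef(fit, 3)
```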
Let's see which of these models performs best according to RSS, adjusted \(R^2\), Mallow's \(C_p\), and BIC (the Bayesian Information Criterion).
par(mfrow = c(2, 2))
plot(reg_summary$rss, xlab = "Number of Variables",
     ylab = "RSS", type = "l")
plot(reg_summary$adjr2, xlab = "Number of Variables",
     ylab = "Adjusted RSq", type = "l")
points(which.max(reg_summary$adjr2), max(reg_summary$adjr2),
       col = "red", cex = 2, pch = 20)
plot(reg_summary$cp, xlab = "Number of Variables",
     ylab = "Cp", type = "l")
points(which.min(reg_summary$cp), min(reg_summary$cp),
       col = "red", cex = 2, pch = 20)
plot(reg_summary$bic, xlab = "Number of Variables",
     ylab = "BIC", type = "l")
points(which.min(reg_summary$bic), min(reg_summary$bic),
       col = "red", cex = 2, pch = 20)
We can look more closely at, say, the Bayesian Information Criterion:
plot(fit_full, scale = "bic")
In this plot, the top row represents the "best" model and the bottom row the "worst" model, as measured by BIC. The best model includes all variables aside from the percentage of households with no computer or internet access (pct_NCI) and the percentage of Native Americans in the school district (pct_native).
To do: decide whether pct_native should be kept in the model.