Functions and Iteration

Megha Joshi

Functions

Example Dataset Background

  • Tennessee’s Student/Teacher Achievement Ratio (STAR) (Mosteller, 1995)

  • Longitudinal project examining effect of class size on student achievement

  • Data on language, math, reading outcomes

  • The data that we are using is from the evalITR package

    • Please review information about the data here

Example Dataset

library(tidyverse)
library(janitor)
library(evalITR)
library(estimatr)
library(broom)

star_dat <- star %>%
    clean_names()

glimpse(star_dat)
Rows: 1,911
Columns: 14
$ g3tlangss  <dbl> 608, 644, 638, 595, 694, 644, 618, 668, 677, 650, 635, 657,…
$ g3treadss  <dbl> 582, 636, 578, 612, 631, 592, 583, 661, 675, 713, 614, 595,…
$ g3tmathss  <dbl> 591, 622, 610, 612, 631, 631, 606, 648, 648, 645, 636, 610,…
$ treatment  <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,…
$ gender     <fct> 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2,…
$ race       <fct> 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,…
$ birthmonth <fct> 11, 11, 12, 11, 9, 3, 6, 6, 1, 6, 11, 12, 7, 5, 5, 11, 12, …
$ birthyear  <fct> 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980,…
$ schlurbn   <fct> 2, 2, 3, 3, 3, 1, 1, 2, 2, 4, 3, 1, 3, 3, 4, 2, 1, 1, 3, 3,…
$ grdrange   <fct> 5, 6, 3, 8, 3, 6, 6, 5, 5, 6, 8, 6, 8, 4, 4, 4, 6, 6, 5, 6,…
$ gkenrmnt   <dbl> 603, 621, 512, 389, 444, 521, 888, 709, 709, 531, 650, 479,…
$ gkfrlnch   <dbl> 90, 10, 32, 51, 30, 90, 98, 1, 1, 48, 64, 70, 64, 32, 28, 2…
$ gkbused    <dbl> 62, 61, 98, 71, 90, 13, 1, 98, 98, 67, 95, 17, 95, 93, 94, …
$ gkwhite    <dbl> 19, 100, 97, 94, 98, 0, 0, 97, 97, 92, 99, 1, 99, 85, 98, 6…

Run a Linear Regression to Assess the Impact of Treatment on Language

lang_mod <- lm_robust(g3tlangss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(lang_mod)

Call:
lm_robust(formula = g3tlangss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
              Estimate Std. Error  t value  Pr(>|t|)   CI Lower  CI Upper   DF
(Intercept) 633.366942    5.36016 118.1620 0.000e+00 622.854526 643.87936 1898
treatment     4.113126    1.58483   2.5953 9.523e-03   1.004940   7.22131 1898
gender2      10.144257    1.57216   6.4524 1.393e-10   7.060914  13.22760 1898
race2       -13.713573    2.45781  -5.5796 2.758e-08 -18.533876  -8.89327 1898
schlurbn2     8.602082    3.00381   2.8637 4.233e-03   2.710965  14.49320 1898
schlurbn3     6.301093    3.16943   1.9881 4.695e-02   0.085161  12.51702 1898
schlurbn4     6.421318    4.25744   1.5083 1.317e-01  -1.928435  14.77107 1898
grdrange4    -4.028959    4.00679  -1.0055 3.148e-01 -11.887125   3.82921 1898
grdrange5    -6.074969    3.34049  -1.8186 6.913e-02 -12.626392   0.47645 1898
grdrange6     0.334935    2.90863   0.1152 9.083e-01  -5.369513   6.03938 1898
grdrange7   -26.247871    6.98357  -3.7585 1.761e-04 -39.944145 -12.55160 1898
grdrange8    -6.151212    3.37377  -1.8232 6.842e-02 -12.767900   0.46548 1898
gkenrmnt      0.003051    0.00552   0.5526 5.806e-01  -0.007776   0.01388 1898

Multiple R-squared:  0.06845 ,  Adjusted R-squared:  0.06256 
F-statistic: 13.02 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Language Results

tidy(lang_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())
    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

On Math

math_mod <- lm_robust(g3tmathss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(math_mod)

Call:
lm_robust(formula = g3tmathss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
             Estimate Std. Error   t value  Pr(>|t|)  CI Lower   CI Upper   DF
(Intercept) 627.27238   6.034736 103.94363 0.000e+00 615.43697  6.391e+02 1898
treatment     6.36724   1.732919   3.67429 2.451e-04   2.96862  9.766e+00 1898
gender2       0.03764   1.726153   0.02181 9.826e-01  -3.34772  3.423e+00 1898
race2       -21.85364   3.005153  -7.27205 5.153e-13 -27.74739 -1.596e+01 1898
schlurbn2     8.66786   3.564785   2.43152 1.513e-02   1.67655  1.566e+01 1898
schlurbn3     3.57552   3.717144   0.96190 3.362e-01  -3.71459  1.087e+01 1898
schlurbn4    -0.61152   4.817304  -0.12694 8.990e-01 -10.05929  8.836e+00 1898
grdrange4    10.40834   4.146388   2.51022 1.215e-02   2.27638  1.854e+01 1898
grdrange5     3.24765   3.552424   0.91421 3.607e-01  -3.71941  1.021e+01 1898
grdrange6     7.68867   3.205376   2.39868 1.655e-02   1.40224  1.398e+01 1898
grdrange7   -19.31582   8.736040  -2.21105 2.715e-02 -36.44907 -2.183e+00 1898
grdrange8     6.11783   3.533298   1.73148 8.353e-02  -0.81172  1.305e+01 1898
gkenrmnt     -0.01167   0.005995  -1.94649 5.174e-02  -0.02343  8.830e-05 1898

Multiple R-squared:  0.08643 ,  Adjusted R-squared:  0.08066 
F-statistic: 16.42 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Math Results

tidy(lang_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())
    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

On Reading

read_mod <- lm_robust(g3tlangss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(read_mod)

Call:
lm_robust(formula = g3tlangss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
              Estimate Std. Error  t value  Pr(>|t|)   CI Lower  CI Upper   DF
(Intercept) 633.366942    5.36016 118.1620 0.000e+00 622.854526 643.87936 1898
treatment     4.113126    1.58483   2.5953 9.523e-03   1.004940   7.22131 1898
gender2      10.144257    1.57216   6.4524 1.393e-10   7.060914  13.22760 1898
race2       -13.713573    2.45781  -5.5796 2.758e-08 -18.533876  -8.89327 1898
schlurbn2     8.602082    3.00381   2.8637 4.233e-03   2.710965  14.49320 1898
schlurbn3     6.301093    3.16943   1.9881 4.695e-02   0.085161  12.51702 1898
schlurbn4     6.421318    4.25744   1.5083 1.317e-01  -1.928435  14.77107 1898
grdrange4    -4.028959    4.00679  -1.0055 3.148e-01 -11.887125   3.82921 1898
grdrange5    -6.074969    3.34049  -1.8186 6.913e-02 -12.626392   0.47645 1898
grdrange6     0.334935    2.90863   0.1152 9.083e-01  -5.369513   6.03938 1898
grdrange7   -26.247871    6.98357  -3.7585 1.761e-04 -39.944145 -12.55160 1898
grdrange8    -6.151212    3.37377  -1.8232 6.842e-02 -12.767900   0.46548 1898
gkenrmnt      0.003051    0.00552   0.5526 5.806e-01  -0.007776   0.01388 1898

Multiple R-squared:  0.06845 ,  Adjusted R-squared:  0.06256 
F-statistic: 13.02 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Reading Results

tidy(read_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())
    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

Repetition

  • Redundant code

  • Repetitive

  • Prone to error when copy and pasting and changing a few things. Can you list out all the errors I made just now?

  • If we want to change the model, let’s say add few other variables, we have to repeat that correction multiple times

    • Which may also open up space for more error

Functions

  • Input

  • Do something with the input

  • Output

  • Run it over again with different input

sum_it <- function(x, # input 1 
                   y  # input 2
                   ){
  
  total <- x + y # do something with the input
  
  return(total) # output
  

}

sum_it(2, 3)
[1] 5
sum_it(1000, 5000)
[1] 6000

Function Example

run_reg <- function(outcome_var) {

    lm_equation <- paste(outcome_var, " ~  treatment + gender + race + schlurbn + grdrange + gkenrmnt")

    mod <- lm_robust(as.formula(lm_equation), 
                     data = star_dat,
                     se_type = "HC2")

    res_clean <- tidy(mod) %>%
        filter(term == "treatment") %>%
        mutate(outcome = outcome_var) %>%
        dplyr::select(outcome, everything())

                
    return(res_clean)
    
}

Run the Function

run_reg(outcome = "g3tlangss")
    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898
run_reg(outcome = "g3tmathss")
    outcome      term estimate std.error statistic      p.value conf.low
1 g3tmathss treatment 6.367242  1.732919  3.674288 0.0002451289 2.968616
  conf.high   df
1  9.765868 1898
run_reg(outcome = "g3treadss")
    outcome      term estimate std.error statistic      p.value conf.low
1 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  10.63627 1898

Critiques of the run_reg() Function

  • Can you think of potential issues with this function?

  • What are the issues and consequences of those issues?

  • What can you do to improve the function?

Repetition

  • For the sake of this presentation, I am only showing three outcomes

  • Imagine if there were more than 10 outcomes that you needed to analyze

  • Think of potential issues in this situation - both when doing brute force analysis and even when using a function like run_reg()

Iteration

Repetitive Tasks Examples

  • Running analysis for multiple outcomes like above

  • Reading in 50 csv files in a folder with different csv files containing data for different school year

  • Conduct power analysis to estimate sample size for a randomized trial based on multiple assumed values of the effect size

For Loops

  • Classic way to run repetitive analyses or tasks

  • For each item in a vector of items, run the code

numbers <- 1:10 # A vector of items to loop through

for(n in numbers){ # for each number in the vector numbers
  
  x <- n * 2 # multiply the number by 2
  
  print(x) # print the result
  
  
}
[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20

STAR Analysis Example: Loop through Outcomes

outcome <- c("g3tlangss", "g3tmathss", "g3treadss") # outcomes to loop through

res <- NULL # initialize the results 

for(y in outcome){ # for each outcome in the outcome vector 
  
  output <- run_reg(outcome = y) # run run_reg for that outcome
  res <- bind_rows(res, output) # append the output to res
  
  
}

res # look at res after the looping
    outcome      term estimate std.error statistic      p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 9.523254e-03 1.004940
2 g3tmathss treatment 6.367242  1.732919  3.674288 2.451289e-04 2.968616
3 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  7.221312 1898
2  9.765868 1898
3 10.636268 1898

For Loops are Slow…

  • For loops are still clunky and involve fairly extensive coding
  • Need to think through how to append and store the outcome or print the outcome
  • Also a bit slow..(er than other stuff we are going to talk about) in R

purrr

  • Functional programming in R

  • Like for loop - apply a function to each element in a vector but less clunky

  • Major family of functions: map()

    • Loop over a vector and output some value

    • Different variations of map() will return different types of output

    • Please read through the different variations here

map()

numbers <- 1:5 # vector with items to loop over

# function that you want run
times_two <- function(x){
  x * 2
}

# loop times_two through each number, notice the output
map(numbers, .f = times_two)
[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10

Different Output Format

map_int(numbers, .f = times_two)
[1]  2  4  6  8 10

Back to our run_reg()

map_dfr(outcome, run_reg)
    outcome      term estimate std.error statistic      p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 9.523254e-03 1.004940
2 g3tmathss treatment 6.367242  1.732919  3.674288 2.451289e-04 2.968616
3 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  7.221312 1898
2  9.765868 1898
3 10.636268 1898

Functions with Multiple arguments

  • The map() family of functions is limited in that we can only loop through one thing

  • We may run into a scenario where we want through loop through multiple things

  • Can you think of some such scenarios?

  • In such case, we can use the pmap() family of functions. Please read through https://purrr.tidyverse.org/reference/pmap.html

Thank you!!