Functions and Iteration

Megha Joshi

Functions

Example Dataset Background

Tennessee’s Student/Teacher Achievement Ratio (STAR) (Mosteller, 1995)
Longitudinal project examining effect of class size on student achievement
Data on language, math, reading outcomes
The data that we are using is from the evalITR package
- Please review information about the data here

Example Dataset

library(tidyverse)
library(janitor)
library(evalITR)
library(estimatr)
library(broom)

star_dat <- star %>%
    clean_names()

glimpse(star_dat)

Rows: 1,911
Columns: 14
$ g3tlangss  <dbl> 608, 644, 638, 595, 694, 644, 618, 668, 677, 650, 635, 657,…
$ g3treadss  <dbl> 582, 636, 578, 612, 631, 592, 583, 661, 675, 713, 614, 595,…
$ g3tmathss  <dbl> 591, 622, 610, 612, 631, 631, 606, 648, 648, 645, 636, 610,…
$ treatment  <dbl> 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0,…
$ gender     <fct> 2, 1, 2, 1, 2, 1, 1, 2, 2, 2, 2, 2, 2, 1, 1, 1, 2, 2, 1, 2,…
$ race       <fct> 2, 1, 1, 1, 1, 2, 2, 1, 1, 1, 1, 2, 1, 1, 1, 1, 2, 2, 1, 1,…
$ birthmonth <fct> 11, 11, 12, 11, 9, 3, 6, 6, 1, 6, 11, 12, 7, 5, 5, 11, 12, …
$ birthyear  <fct> 1979, 1979, 1979, 1979, 1980, 1980, 1980, 1980, 1980, 1980,…
$ schlurbn   <fct> 2, 2, 3, 3, 3, 1, 1, 2, 2, 4, 3, 1, 3, 3, 4, 2, 1, 1, 3, 3,…
$ grdrange   <fct> 5, 6, 3, 8, 3, 6, 6, 5, 5, 6, 8, 6, 8, 4, 4, 4, 6, 6, 5, 6,…
$ gkenrmnt   <dbl> 603, 621, 512, 389, 444, 521, 888, 709, 709, 531, 650, 479,…
$ gkfrlnch   <dbl> 90, 10, 32, 51, 30, 90, 98, 1, 1, 48, 64, 70, 64, 32, 28, 2…
$ gkbused    <dbl> 62, 61, 98, 71, 90, 13, 1, 98, 98, 67, 95, 17, 95, 93, 94, …
$ gkwhite    <dbl> 19, 100, 97, 94, 98, 0, 0, 97, 97, 92, 99, 1, 99, 85, 98, 6…

Run a Linear Regression to Assess the Impact of Treatment on Language

lang_mod <- lm_robust(g3tlangss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(lang_mod)


Call:
lm_robust(formula = g3tlangss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
              Estimate Std. Error  t value  Pr(>|t|)   CI Lower  CI Upper   DF
(Intercept) 633.366942    5.36016 118.1620 0.000e+00 622.854526 643.87936 1898
treatment     4.113126    1.58483   2.5953 9.523e-03   1.004940   7.22131 1898
gender2      10.144257    1.57216   6.4524 1.393e-10   7.060914  13.22760 1898
race2       -13.713573    2.45781  -5.5796 2.758e-08 -18.533876  -8.89327 1898
schlurbn2     8.602082    3.00381   2.8637 4.233e-03   2.710965  14.49320 1898
schlurbn3     6.301093    3.16943   1.9881 4.695e-02   0.085161  12.51702 1898
schlurbn4     6.421318    4.25744   1.5083 1.317e-01  -1.928435  14.77107 1898
grdrange4    -4.028959    4.00679  -1.0055 3.148e-01 -11.887125   3.82921 1898
grdrange5    -6.074969    3.34049  -1.8186 6.913e-02 -12.626392   0.47645 1898
grdrange6     0.334935    2.90863   0.1152 9.083e-01  -5.369513   6.03938 1898
grdrange7   -26.247871    6.98357  -3.7585 1.761e-04 -39.944145 -12.55160 1898
grdrange8    -6.151212    3.37377  -1.8232 6.842e-02 -12.767900   0.46548 1898
gkenrmnt      0.003051    0.00552   0.5526 5.806e-01  -0.007776   0.01388 1898

Multiple R-squared:  0.06845 ,  Adjusted R-squared:  0.06256 
F-statistic: 13.02 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Language Results

tidy(lang_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())

    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

On Math

math_mod <- lm_robust(g3tmathss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(math_mod)


Call:
lm_robust(formula = g3tmathss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
             Estimate Std. Error   t value  Pr(>|t|)  CI Lower   CI Upper   DF
(Intercept) 627.27238   6.034736 103.94363 0.000e+00 615.43697  6.391e+02 1898
treatment     6.36724   1.732919   3.67429 2.451e-04   2.96862  9.766e+00 1898
gender2       0.03764   1.726153   0.02181 9.826e-01  -3.34772  3.423e+00 1898
race2       -21.85364   3.005153  -7.27205 5.153e-13 -27.74739 -1.596e+01 1898
schlurbn2     8.66786   3.564785   2.43152 1.513e-02   1.67655  1.566e+01 1898
schlurbn3     3.57552   3.717144   0.96190 3.362e-01  -3.71459  1.087e+01 1898
schlurbn4    -0.61152   4.817304  -0.12694 8.990e-01 -10.05929  8.836e+00 1898
grdrange4    10.40834   4.146388   2.51022 1.215e-02   2.27638  1.854e+01 1898
grdrange5     3.24765   3.552424   0.91421 3.607e-01  -3.71941  1.021e+01 1898
grdrange6     7.68867   3.205376   2.39868 1.655e-02   1.40224  1.398e+01 1898
grdrange7   -19.31582   8.736040  -2.21105 2.715e-02 -36.44907 -2.183e+00 1898
grdrange8     6.11783   3.533298   1.73148 8.353e-02  -0.81172  1.305e+01 1898
gkenrmnt     -0.01167   0.005995  -1.94649 5.174e-02  -0.02343  8.830e-05 1898

Multiple R-squared:  0.08643 ,  Adjusted R-squared:  0.08066 
F-statistic: 16.42 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Math Results

tidy(lang_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())

    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

On Reading

read_mod <- lm_robust(g3tlangss ~ treatment + gender + race + schlurbn + grdrange + gkenrmnt, 
                  data = star_dat,
                  se_type = "HC2")

summary(read_mod)


Call:
lm_robust(formula = g3tlangss ~ treatment + gender + race + schlurbn + 
    grdrange + gkenrmnt, data = star_dat, se_type = "HC2")

Standard error type:  HC2 

Coefficients:
              Estimate Std. Error  t value  Pr(>|t|)   CI Lower  CI Upper   DF
(Intercept) 633.366942    5.36016 118.1620 0.000e+00 622.854526 643.87936 1898
treatment     4.113126    1.58483   2.5953 9.523e-03   1.004940   7.22131 1898
gender2      10.144257    1.57216   6.4524 1.393e-10   7.060914  13.22760 1898
race2       -13.713573    2.45781  -5.5796 2.758e-08 -18.533876  -8.89327 1898
schlurbn2     8.602082    3.00381   2.8637 4.233e-03   2.710965  14.49320 1898
schlurbn3     6.301093    3.16943   1.9881 4.695e-02   0.085161  12.51702 1898
schlurbn4     6.421318    4.25744   1.5083 1.317e-01  -1.928435  14.77107 1898
grdrange4    -4.028959    4.00679  -1.0055 3.148e-01 -11.887125   3.82921 1898
grdrange5    -6.074969    3.34049  -1.8186 6.913e-02 -12.626392   0.47645 1898
grdrange6     0.334935    2.90863   0.1152 9.083e-01  -5.369513   6.03938 1898
grdrange7   -26.247871    6.98357  -3.7585 1.761e-04 -39.944145 -12.55160 1898
grdrange8    -6.151212    3.37377  -1.8232 6.842e-02 -12.767900   0.46548 1898
gkenrmnt      0.003051    0.00552   0.5526 5.806e-01  -0.007776   0.01388 1898

Multiple R-squared:  0.06845 ,  Adjusted R-squared:  0.06256 
F-statistic: 13.02 on 12 and 1898 DF,  p-value: < 2.2e-16

Tidy Reading Results

tidy(read_mod) %>%
    filter(term == "treatment") %>%
    mutate(outcome = "g3tlangss") %>%
    dplyr::select(outcome, everything())

    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

Repetition

Redundant code
Repetitive
Prone to error when copy and pasting and changing a few things. Can you list out all the errors I made just now?
If we want to change the model, let’s say add few other variables, we have to repeat that correction multiple times
- Which may also open up space for more error

Functions

Input
Do something with the input
Output
Run it over again with different input

sum_it <- function(x, # input 1 
                   y  # input 2
                   ){
  
  total <- x + y # do something with the input
  
  return(total) # output
  

}

sum_it(2, 3)

[1] 5

sum_it(1000, 5000)

[1] 6000

Function Example

run_reg <- function(outcome_var) {

    lm_equation <- paste(outcome_var, " ~  treatment + gender + race + schlurbn + grdrange + gkenrmnt")

    mod <- lm_robust(as.formula(lm_equation), 
                     data = star_dat,
                     se_type = "HC2")

    res_clean <- tidy(mod) %>%
        filter(term == "treatment") %>%
        mutate(outcome = outcome_var) %>%
        dplyr::select(outcome, everything())

                
    return(res_clean)
    
}

Run the Function

run_reg(outcome = "g3tlangss")

    outcome      term estimate std.error statistic     p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 0.009523254  1.00494
  conf.high   df
1  7.221312 1898

run_reg(outcome = "g3tmathss")

    outcome      term estimate std.error statistic      p.value conf.low
1 g3tmathss treatment 6.367242  1.732919  3.674288 0.0002451289 2.968616
  conf.high   df
1  9.765868 1898

run_reg(outcome = "g3treadss")

    outcome      term estimate std.error statistic      p.value conf.low
1 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  10.63627 1898

Critiques of the `run_reg()` Function

Can you think of potential issues with this function?
What are the issues and consequences of those issues?
What can you do to improve the function?

Repetition

For the sake of this presentation, I am only showing three outcomes
Imagine if there were more than 10 outcomes that you needed to analyze
Think of potential issues in this situation - both when doing brute force analysis and even when using a function like run_reg()

Iteration

Repetitive Tasks Examples

Running analysis for multiple outcomes like above
Reading in 50 csv files in a folder with different csv files containing data for different school year
Conduct power analysis to estimate sample size for a randomized trial based on multiple assumed values of the effect size

For Loops

Classic way to run repetitive analyses or tasks
For each item in a vector of items, run the code

numbers <- 1:10 # A vector of items to loop through

for(n in numbers){ # for each number in the vector numbers
  
  x <- n * 2 # multiply the number by 2
  
  print(x) # print the result
  
  
}

[1] 2
[1] 4
[1] 6
[1] 8
[1] 10
[1] 12
[1] 14
[1] 16
[1] 18
[1] 20

STAR Analysis Example: Loop through Outcomes

outcome <- c("g3tlangss", "g3tmathss", "g3treadss") # outcomes to loop through

res <- NULL # initialize the results 

for(y in outcome){ # for each outcome in the outcome vector 
  
  output <- run_reg(outcome = y) # run run_reg for that outcome
  res <- bind_rows(res, output) # append the output to res
  
  
}

res # look at res after the looping

    outcome      term estimate std.error statistic      p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 9.523254e-03 1.004940
2 g3tmathss treatment 6.367242  1.732919  3.674288 2.451289e-04 2.968616
3 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  7.221312 1898
2  9.765868 1898
3 10.636268 1898

For Loops are Slow…

For loops are still clunky and involve fairly extensive coding
Need to think through how to append and store the outcome or print the outcome
Also a bit slow..(er than other stuff we are going to talk about) in R

`purrr`

Functional programming in R
Like for loop - apply a function to each element in a vector but less clunky
Major family of functions: map()
- Loop over a vector and output some value
- Different variations of map() will return different types of output
- Please read through the different variations here

`map()`

numbers <- 1:5 # vector with items to loop over

# function that you want run
times_two <- function(x){
  x * 2
}

# loop times_two through each number, notice the output
map(numbers, .f = times_two)

[[1]]
[1] 2

[[2]]
[1] 4

[[3]]
[1] 6

[[4]]
[1] 8

[[5]]
[1] 10

Different Output Format

map_int(numbers, .f = times_two)

[1]  2  4  6  8 10

Back to our `run_reg()`

map_dfr(outcome, run_reg)

    outcome      term estimate std.error statistic      p.value conf.low
1 g3tlangss treatment 4.113126  1.584827  2.595315 9.523254e-03 1.004940
2 g3tmathss treatment 6.367242  1.732919  3.674288 2.451289e-04 2.968616
3 g3treadss treatment 7.411505  1.644268  4.507479 6.960466e-06 4.186741
  conf.high   df
1  7.221312 1898
2  9.765868 1898
3 10.636268 1898

Functions with Multiple arguments

The map() family of functions is limited in that we can only loop through one thing
We may run into a scenario where we want through loop through multiple things
Can you think of some such scenarios?
In such case, we can use the pmap() family of functions. Please read through https://purrr.tidyverse.org/reference/pmap.html

Functions and Iteration

Functions

Example Dataset Background

Example Dataset

Run a Linear Regression to Assess the Impact of Treatment on Language

Tidy Language Results

On Math

Tidy Math Results

On Reading

Tidy Reading Results

Repetition

Functions

Function Example

Run the Function

Critiques of the run_reg() Function

Repetition

Iteration

Repetitive Tasks Examples

For Loops

STAR Analysis Example: Loop through Outcomes

For Loops are Slow…

purrr

map()

Different Output Format

Back to our run_reg()

Functions with Multiple arguments

Thank you!!

Critiques of the `run_reg()` Function

`purrr`

`map()`

Back to our `run_reg()`