Introduction

In this demonstration, we will illustrate the basic functionality of the SuperLearner package.

Installing the SuperLearner package

You can execute the following commands to install (if necessary) and load the packages needed to complete this demo.

# if needed, install all the necessary packages to execute this demo
# install.packages(c("SuperLearner","gam","caret","randomForest","arm","RCurl","MASS","tmle","ggplot2","gbm"))

# load the packages
require(SuperLearner)
require(gam)
require(caret)
require(randomForest)
require(RCurl)
require(MASS)
require(tmle)
require(ggplot2)
require(gbm)

Loading data from GitHub

I have made three data sets available on GitHub that will be used in this demo. These can be read directly from GitHub using the following commands:

# prediction data set
chspred <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv"), header = TRUE)

# statin data
statins <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/statins.csv"), header = TRUE)

# hiv vaccine data
hiv <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/hiv.csv"), header = TRUE)

Super Learner with a simple library

We begin by illustrating the “default” functionality of the SuperLearner function. Using the chspred data, we are interested in predicting myocardial infarction (mi) using the available covariate data. Let’s take a quick peek at the data to see what variables we have:

head(chspred, 3)
##       waist    alcoh      hdl beta smoke ace      ldl      bmi aspirin
## 1 110.16419 0.000000 66.49739    0     0   1 114.2162 27.99746       0
## 2  89.97632 0.000000 50.06524    0     0   0 103.7766 20.89314       0
## 3 106.19407 8.417438 40.50595    0     0   0 165.7158 28.45541       1
##   gend      age estrgn      glu       ins   cysgfr dm   fetuina       whr
## 1    0 73.51795      0 159.9314  70.33431 75.00781  1 0.1751561 1.1689797
## 2    0 61.77225      0 153.3888  33.96945 82.74331  1 0.5716507 0.9011436
## 3    1 72.93120      0 121.7145 -17.30173 74.69888  0 0.3516775 1.1797129
##   hsed race   logcystat  logtrig    logcrp      logcre health  logkcal
## 1    1    1 -0.34202495 5.406297 2.0125961 -0.67385483      0 4.392559
## 2    0    0 -0.08465306 4.859234 3.2932809 -0.55509011      1 6.207056
## 3    0    1 -0.44510754 4.508810 0.3013225 -0.01151792      0 6.731968
##      sysbp mi
## 1 177.1345  0
## 2 136.3742  0
## 3 135.1993  0

For the sake of computational expediency, we will initially consider only a simple library of algorithms: a main-effects GLM and an unadjusted (i.e., intercept-only) model. Later, we will look at how these algorithms are constructed for use with SuperLearner.

# because cross-validation is used, we need to set the seed to ensure reproducibility
set.seed(1234)

# execute the call to SuperLearner
sl1 <- SuperLearner(
  # Y is the outcome variable
  Y = chspred$mi,
  # X is a dataframe of predictor variables, in this case
  # everything in chspred except for mi
  X = chspred[,-ncol(chspred)], 
  # newX will be discussed later, for now leave as NULL (default)
  newX = NULL,
  # family will be discussed in more detail when we see how wrappers
  # are written, for now set to binomial() for 0/1 outcome
  family = binomial(), 
  # SL.library (for now) is specified as a vector of names of functions
  # that implement the desired algorithms. SL.glm and SL.mean
  # are included in the Super Learner package
  SL.library = c("SL.glm","SL.mean"),
  # method specifies how the ensembling is done, for now we will use
  # the \sum_{k=1}^K \alpha_k f_{k,n} method by using the default
  # option for method (method.NNLS)
  method = "method.NNLS",
  # id specifies a unique subject identifier so that whole subjects 
  # are sampled in CV, not just rows of data. chspred only has one row 
  # per subject, so OK to leave as NULL (default)
  id = NULL, 
  # verbose controls the printing of messages of SuperLearner's progress.
  # We'll leave as FALSE (default) for now
  verbose = FALSE, 
  # control contains options related to logistic ensemble (trimLogit) 
  # and whether to save the fit library to look at individual 
  # algorithms later. We will leave as default
  control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
  # cvControl specifies parameters related to cross validation. Of note
  # the default is for V=10-fold cross validation. See ?SuperLearner
  # for more details
  cvControl = list(V = 10L, stratifyCV = FALSE, shuffle = TRUE, 
                   validRows = NULL)
)

sl1
## 
## Call:  
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], newX = NULL,  
##     family = binomial(), SL.library = c("SL.glm", "SL.mean"), method = "method.NNLS",  
##     id = NULL, verbose = FALSE, control = list(saveFitLibrary = TRUE,  
##         trimLogit = 0.001), cvControl = list(V = 10L, stratifyCV = FALSE,  
##         shuffle = TRUE, validRows = NULL)) 
## 
## 
##                   Risk      Coef
## SL.glm_All  0.02658219 0.8871921
## SL.mean_All 0.03041508 0.1128079

From the output we see that SL.glm_All had the lowest cross-validated risk and is thus the Discrete Super Learner. We will discuss why the name of each algorithm has been augmented with the suffix _All when we illustrate variable screening functions later in the document.
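These quantities can also be pulled directly from the fitted object; for example:

# cross-validated risk estimates for each algorithm
sl1$cvRisk

# ensemble (Super Learner) weights for each algorithm
sl1$coef

# name of the discrete Super Learner
names(sl1$cvRisk)[which.min(sl1$cvRisk)]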

Predictions from the discrete and continuous Super Learner on the observed data can now be obtained as follows:

# default call to predict
slPred <- predict(sl1)
# slPred is a list with two components
#   pred = continuous SL predictions
#   library.predict = predictions from each algorithm

# store the continuous SL predictions
cslPred <- slPred$pred

# get the discrete SL predictions
dslPred <- slPred$library.predict[,which(sl1$cvRisk==min(sl1$cvRisk))]

We can also obtain predictions on a new observation:

# generate a new observation set to the mean of each variable
newObs <- data.frame(t(colMeans(chspred[,-ncol(chspred)])))

# all predictions on newObs
slPredNew <- predict(sl1,newdata=newObs)

# continuous SL prediction on newObs
cslPredNew <- slPredNew$pred

# discrete SL prediction on newObs
dslPredNew <- slPredNew$library.predict[,which(sl1$cvRisk==min(sl1$cvRisk))]

If one wishes to access the fitted object for any of the component algorithms (applied to all the data), this can be accessed through the fitLibrary component of the SuperLearner object. For example, to access the glm object from the SL.glm algorithm, we can use:

# obtain the fitted logistic regression (family = binomial) object
glmObject <- sl1$fitLibrary$SL.glm$object

# summarize the fit
summary(glmObject)
## 
## Call:
## glm(formula = Y ~ ., family = family, data = X, weights = obsWeights)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4162  -0.2271  -0.1342  -0.0766   3.5835  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  5.4405844  1.9012226   2.862  0.00421 ** 
## waist       -0.0064552  0.0071065  -0.908  0.36370    
## alcoh        0.1127744  0.0110880  10.171  < 2e-16 ***
## hdl          0.0044309  0.0052140   0.850  0.39543    
## beta         0.3328622  0.2747572   1.211  0.22571    
## smoke        0.1358397  0.2704664   0.502  0.61550    
## ace          0.3195770  0.2900528   1.102  0.27055    
## ldl         -0.0155763  0.0019280  -8.079 6.54e-16 ***
## bmi         -0.1718827  0.0188935  -9.097  < 2e-16 ***
## aspirin      0.1921275  0.1940136   0.990  0.32204    
## gend         0.1720266  0.2266058   0.759  0.44777    
## age         -0.0498989  0.0165272  -3.019  0.00253 ** 
## estrgn      -0.2844683  0.3469420  -0.820  0.41226    
## glu         -0.0016521  0.0024212  -0.682  0.49503    
## ins          0.0009482  0.0022495   0.422  0.67338    
## cysgfr      -0.0071902  0.0045757  -1.571  0.11609    
## dm           0.4097270  0.2472677   1.657  0.09752 .  
## fetuina     -0.3428327  0.7446875  -0.460  0.64525    
## whr          0.0397150  0.7534039   0.053  0.95796    
## hsed         0.0805406  0.1955452   0.412  0.68043    
## race         0.0936239  0.2876978   0.325  0.74486    
## logcystat   -0.0288129  0.3140621  -0.092  0.92690    
## logtrig     -0.0586191  0.1576144  -0.372  0.70996    
## logcrp      -0.0336396  0.0715266  -0.470  0.63813    
## logcre      -0.2222800  0.2643888  -0.841  0.40050    
## health       0.2299788  0.2707350   0.849  0.39562    
## logkcal      0.0325390  0.0410616   0.792  0.42810    
## sysbp        0.0047877  0.0033635   1.423  0.15461    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1280.44  on 4589  degrees of freedom
## Residual deviance:  944.26  on 4562  degrees of freedom
## AIC: 1000.3
## 
## Number of Fisher Scoring iterations: 7

Writing algorithms for Super Learner

We now discuss how to supply new algorithms to the SuperLearner function. First, it is useful to check the algorithms that are included in the SuperLearner by default:

listWrappers()
## All prediction algorithm wrappers in SuperLearner:
##  [1] "SL.bayesglm"         "SL.caret"            "SL.caret.rpart"     
##  [4] "SL.cforest"          "SL.earth"            "SL.gam"             
##  [7] "SL.gbm"              "SL.glm"              "SL.glm.interaction" 
## [10] "SL.glmnet"           "SL.ipredbagg"        "SL.knn"             
## [13] "SL.leekasso"         "SL.loess"            "SL.logreg"          
## [16] "SL.mean"             "SL.nnet"             "SL.nnls"            
## [19] "SL.polymars"         "SL.randomForest"     "SL.ridge"           
## [22] "SL.rpart"            "SL.rpartPrune"       "SL.step"            
## [25] "SL.step.forward"     "SL.step.interaction" "SL.stepAIC"         
## [28] "SL.svm"              "SL.template"
## 
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP"           "screen.corRank"        "screen.glmnet"        
## [4] "screen.randomForest"   "screen.SIS"            "screen.template"      
## [7] "screen.ttest"          "write.screen.template"

Note that both “prediction” and “screening” algorithms are shown. We focus first on prediction algorithms; screening algorithms are discussed in the next section. Let’s look at the guts of the SL.glm algorithm:

SL.glm
## function (Y, X, newX, family, obsWeights, ...) 
## {
##     fit.glm <- glm(Y ~ ., data = X, family = family, weights = obsWeights)
##     pred <- predict(fit.glm, newdata = newX, type = "response")
##     fit <- list(object = fit.glm)
##     class(fit) <- "SL.glm"
##     out <- list(pred = pred, fit = fit)
##     return(out)
## }
## <environment: namespace:SuperLearner>

Note that SL.glm is a function that takes as input Y, X, newX, family, obsWeights, and other arguments via .... The family option allows a single prediction function to be used whether the outcome is continuous or binary: SL.glm with family=gaussian() calls glm with family=gaussian() (linear regression), while with family=binomial() it calls glm with family=binomial() (logistic regression). The output of the function is a list with components pred, a vector of predictions computed on the newX object (not X! source of many errors in my life…), and fit, which contains anything that is (1) required for predicting new values later, or (2) desired for later access via the fitLibrary component of the SuperLearner object. Because this fit object may be used for prediction later, it is important to specify its class so that an S3 predict method can be used on the object. Note that such a method is already included for SL.glm:

predict.SL.glm
## function (object, newdata, ...) 
## {
##     pred <- predict(object = object$object, newdata = newdata, 
##         type = "response")
##     pred
## }
## <environment: namespace:SuperLearner>

This input/output structure is all that is needed to define a new prediction algorithm for SuperLearner.

As an illustration, we could write a new algorithm: a glm that uses the Poisson error distribution and log link (the defaults for family = poisson()):

SL.poisglm <- function(Y, X, newX, family, obsWeights, ...){
  # Poisson regression can be used regardless of whether Y is 
  # continuous or binary, so we will not specify anything related to 
  # family
  
  # fit glm with family=poisson()
  fit.glm <- glm(Y ~ .,data=X, family=poisson())
  
  # get predictions on newX object
  pred <- predict(fit.glm, newdata=newX, type="response")
  
  # save the fit object
  fit <- list(object=fit.glm)
  
  # because this is simply a different form of glm, 
  # we can use predict.SL.glm to get predictions back, 
  # i.e. no need to write a new predict function
  class(fit) <- "SL.glm"
  
  # out must be list with named objects pred (predictions on newX)
  # and fit (anything needed to get predictions later)
  out <- list(pred=pred, fit=fit)
  return(out)
}

We have now defined a new algorithm for use in the SuperLearner.
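Before adding it to a library, we can sanity-check the wrapper by calling it directly; a quick sketch (the vector of unit weights is passed only to satisfy the wrapper's arguments):

# call the new wrapper directly to verify its output structure
testFit <- SL.poisglm(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)],
  newX = chspred[1:3,-ncol(chspred)],
  family = binomial(),
  obsWeights = rep(1, nrow(chspred))
)

# pred holds predictions on newX; fit carries class "SL.glm" so that
# predict.SL.glm can be reused later
testFit$pred
class(testFit$fit)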

This new algorithm can now be added to the library we used previously:

set.seed(1234)

sl2 <- SuperLearner(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)],
  SL.library = c("SL.glm","SL.mean","SL.poisglm")
  )

sl2
## 
## Call:  
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = c("SL.glm",  
##     "SL.mean", "SL.poisglm")) 
## 
## 
##                      Risk      Coef
## SL.glm_All     0.02819407 0.6554559
## SL.mean_All    0.03041508 0.1026581
## SL.poisglm_All 0.03585962 0.2418860

We can double-check that predict.SL.glm works for the new algorithm we defined by attempting to predict on a new observation:

slPredNew2 <- predict(sl2,newdata=newObs)
slPredNew2
## $pred
##            [,1]
## [1,] 0.02667622
## 
## $library.predict
##      SL.glm_All SL.mean_All SL.poisglm_All
## [1,] 0.03137255  0.03137255     0.01195707

Screening algorithms for the Super Learner

We now discuss how screening algorithms can be utilized to create Super Learner libraries. As the name suggests, these are algorithms that define a screening step prior to the execution of the prediction algorithm. The SuperLearner function will apply this screening step in each of the V folds. The combination of screening algorithm and prediction algorithm defines a new algorithm. We can look at how screening algorithms are constructed for use with the SuperLearner package:

write.screen.template()
## screen.template <- function(Y, X, family, obsWeights, id, ...) {
##   # load required packages
##   # require('pkg')
##   if (family$family == 'gaussian') {
##     
##   }
##   if (family$family == 'binomial') {
##   
##   }
##   # whichVariable is a logical vector,
##   # TRUE indicates variable will be used
##   whichVariable <- rep(TRUE, ncol(X))
##   return(whichVariable)
## }

Screening algorithms take the same input as prediction algorithms, but output a logical vector with TRUE indicating that a column of X should be used in the prediction step. To illustrate why these functions are useful, suppose we suspect that a small subset of the covariates (say, the demographic variables) may be sufficient for prediction, but we are not certain. We may then wish to include algorithms that are fit using all of the covariates alongside algorithms fit using only that subset. Screening algorithms let us construct such a library.

Let’s write a screening algorithm that only includes demographic variables:

demographics <- function(X,...){
  returnCols <- rep(FALSE, ncol(X))
  returnCols[names(X) %in% c("age","gend","race","hsed")] <- TRUE
  return(returnCols)
}
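As a quick check (not part of the original demo), we can apply the screen directly to the predictors and list the retained columns:

# which columns does the demographics screen keep?
colnames(chspred[,-ncol(chspred)])[demographics(X = chspred[,-ncol(chspred)])]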

Now we can fit the SuperLearner using the GLM-based algorithms fit both with all variables and with only the demographic variables. The call to SuperLearner is nearly identical; however, we now specify SL.library as a list, where each component is a vector of the form c(predictionAlgorithm, screeningAlgorithm). To include all the covariates, we specify the All screening algorithm that is included in the SuperLearner package.

set.seed(1234) 

# Fit the Super Learner
sl3 <- SuperLearner(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)],
  SL.library=list(c("SL.glm","All"),c("SL.glm","demographics"),
                  c("SL.mean","All"), # not adjusted, so doesn't matter
                  c("SL.poisglm","All"),c("SL.poisglm","demographics"))
  )

sl3
## 
## Call:  
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = list(c("SL.glm",  
##     "All"), c("SL.glm", "demographics"), c("SL.mean", "All"), c("SL.poisglm",  
##     "All"), c("SL.poisglm", "demographics"))) 
## 
## 
##                               Risk      Coef
## SL.glm_All              0.02819407 0.0000000
## SL.glm_demographics     0.02819407 0.6554559
## SL.mean_All             0.03041508 0.1026581
## SL.poisglm_All          0.03585962 0.2418860
## SL.poisglm_demographics 0.03585962 0.0000000

Note that the output for sl3 lists five algorithms: the GLM and Poisson GLM each fit with all covariates (_All) and with only the demographic variables (_demographics), plus the unadjusted mean. This also explains why the output for sl1 contained the _All suffix: by default, SuperLearner pairs each prediction algorithm with the All screening function, which passes every variable in X through to the prediction algorithms.

This flexibility in combining screening and prediction algorithms to generate new algorithms allows one to easily implement a library containing a large number of candidate algorithms. Check out listWrappers() to see other screening functions that are useful in higher-dimensional settings.
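For instance, here is a minimal sketch pairing SL.glm with the included univariate-correlation screen screen.corP (which retains covariates whose univariate association with the outcome has a small p-value), alongside the unscreened versions:

set.seed(1234)

sl.screen <- SuperLearner(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)],
  family = binomial(),
  SL.library = list(c("SL.glm","All"),
                    c("SL.glm","screen.corP"),
                    c("SL.mean","All"))
)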

Using different loss/ensemble functions

So far, we have been using mean squared error loss, by virtue of using the default method=method.NNLS. Because our outcome is binary, we may instead prefer the negative log-likelihood loss function. We can easily change our original call to SuperLearner to use this loss function:

set.seed(1234)

sl4 <- SuperLearner(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)], 
  SL.library=c("SL.glm","SL.mean"),
  method = "method.NNloglik"
  )

sl4
## 
## Call:  
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = c("SL.glm",  
##     "SL.mean"), method = "method.NNloglik") 
## 
## 
##                  Risk Coef
## SL.glm_All        NaN    1
## SL.mean_All 0.1399198    0

Note the NaN risk for SL.glm_All; the negative log-likelihood risk can become unstable (infinite or undefined) when some cross-validated predictions are numerically indistinguishable from 0 or 1, which is likely what happened here. We may wish instead to maximize AUC (equivalent to minimizing rank loss); for this, we can specify method=method.AUC:

set.seed(1234)

sl5 <- SuperLearner(
  Y = chspred$mi,
  X = chspred[,-ncol(chspred)], 
  SL.library=c("SL.glm","SL.mean"),
  method = "method.AUC",
  family=binomial()
  )
## Loading required package: cvAUC
## Loading required package: ROCR
## Loading required package: gplots
## 
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
## 
##     lowess
## Loading required package: data.table
## 
## cvAUC version: 1.1.0
## Notice to cvAUC users: Major speed improvements in version 1.1.0
## 
sl5
## 
## Call:  
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], family = binomial(),  
##     SL.library = c("SL.glm", "SL.mean"), method = "method.AUC") 
## 
## 
##                  Risk      Coef
## SL.glm_All  0.1583508  643.9684
## SL.mean_All 0.5996542 -798.7203

Or we can even write our own method. The package contains a template for doing so. A method is a function that returns a list with three components: (1) require lists the packages needed to execute the functions; (2) computeCoef is a function that takes the matrix of cross-validated library predictions Z and the outcome Y (among other inputs) and returns a cross-validated risk estimate and a vector of weight coefficients corresponding to the \(K\) algorithms in the library; (3) computePred is a function that computes the ensemble prediction.

method.template
## function () 
## {
##     out <- list(require = NULL, computeCoef = function(Z, Y, 
##         libraryNames, obsWeights, control, verbose, ...) {
##         cvRisk <- numeric()
##         coef <- numeric()
##         out <- list(cvRisk = cvRisk, coef = coef)
##         return(out)
##     }, computePred = function(predY, coef, control, ...) {
##         out <- crossprod(t(predY), coef)
##         return(out)
##     })
##     invisible(out)
## }
## <environment: namespace:SuperLearner>
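As a sketch of what a completed method might look like, here is a hypothetical method.discreteMSE (not part of the package) that mimics the discrete Super Learner under squared error loss: computeCoef estimates each algorithm's cross-validated mean squared error from Z and places all of the ensemble weight on the single best algorithm.

method.discreteMSE <- function(){
  out <- list(
    # no extra packages are needed
    require = NULL,
    computeCoef = function(Z, Y, libraryNames, obsWeights, control, verbose, ...) {
      # cross-validated (weighted) mean squared error for each algorithm
      cvRisk <- apply(Z, 2, function(x) mean(obsWeights * (x - Y)^2))
      names(cvRisk) <- libraryNames
      # put all ensemble weight on the single best algorithm
      coef <- rep(0, ncol(Z))
      coef[which.min(cvRisk)] <- 1
      out <- list(cvRisk = cvRisk, coef = coef)
      return(out)
    },
    computePred = function(predY, coef, control, ...) {
      # ensemble prediction is the weighted combination of library predictions
      out <- crossprod(t(predY), coef)
      return(out)
    }
  )
  invisible(out)
}

This could then be supplied in a call to SuperLearner via method = "method.discreteMSE".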

Statin Analysis – Super Learner + Efficient Estimation

We now illustrate an analysis of the statin data that combines Super Learning-derived estimates with one-step estimation and TMLE. As you learned in class, the canonical gradient of the parameter \(\psi_0 = E_0\bigl\{E_0(Y | A=1, W) - E_0(Y | A=0, W)\bigr\}\) at \((Q,g)\) in a nonparametric model is given by \[ D^*(Q,g)(o) = \frac{2a - 1}{g^a(w)} \bigl(y - \bar{Q}^a(w)\bigr) + \bar{Q}^1(w) - \bar{Q}^0(w) - \Psi(Q) \ . \] The one-step estimator is defined as \[ \psi_n^+ := \Psi(Q_n) + P_n D^*(Q_n,g_n) \ . \] We will now use the statin data to illustrate how to construct this estimator using Super Learner-based estimates of the nuisance parameters.

Let’s first take a peek at the statin data to familiarize ourselves with what variables are there

head(statins, 3)
##   statin educ gend      bmi      hdl smoke      age race     chol      ldl
## 1      0    0    1 24.09128 67.48161     0 75.40926    1 231.2540  94.5397
## 2      0    0    0 24.04775 59.35625     0 70.24054    1 306.1297 196.1082
## 3      0    0    1 33.48020 66.70018     1 73.49698    0 195.6578 102.1743
##   hyper copd    kcal death
## 1     0    0    0.00     1
## 2     1    0 1276.99     0
## 3     1    0    0.00     0

First we need to estimate both the outcome regression and the propensity score using Super Learner. We start with the propensity score:

# because cross-validation is used, set the seed for reproducibility
set.seed(1234)

# estimate g_0^1
sl.g1 <- SuperLearner(
  # our outcome is now the statin variable
  Y = statins$statin, 
  # our predictors are all variables except for statin (column 1) and death
  X = statins[,-c(1,ncol(statins))],
  # outcome is binary, so let's use family = binomial()
  family = binomial(),
  # and the negative log-likelihood method
  method=method.NNloglik,
  # simple library for computational efficiency
  SL.library = c("SL.glm","SL.mean")
)

# get predicted probability that statin = 1
# (Super Learner predictions are stored in the SL.predict component)
g1n <- sl.g1$SL.predict

# the predicted probability that statin = 0 is 1-g1n
g0n <- 1 - g1n

# get predicted probability that statin = observed value
# (ifelse selects elementwise between the full-length vectors g0n and g1n)
gan <- ifelse(statins$statin == 0, g0n, g1n)

We can now estimate the outcome regression and return predictions setting statin equal to zero and one:

set.seed(1234)

# estimate \bar{Q}_0
sl.Q <- SuperLearner(
  # our outcome is death
  Y = statins$death, 
  # our predictors are all variables other than death
  X = statins[,-ncol(statins)],
  # outcome is binary, so let's use family = binomial()
  family = binomial(),
  # and the negative log-likelihood method
  method=method.NNloglik,
  # simple library for computational efficiency
  SL.library = c("SL.glm","SL.mean")
)

# set up a data frame where everyone has statin=1
statins1 <- statins[,-ncol(statins)]
statins1$statin <- 1

# get \bar{Q}_n^1
Q1n <- predict(sl.Q, newdata=statins1)$pred

# set up a data frame where everyone has statin=0
statins0 <- statins[,-ncol(statins)]
statins0$statin <- 0

# get \bar{Q}_n^0
Q0n <- predict(sl.Q, newdata=statins0)$pred

# get \bar{Q}_n^a
# (again selecting elementwise between the full-length vectors Q0n and Q1n)
Qan <- ifelse(statins$statin == 0, Q0n, Q1n)

We now have all the ingredients to construct the one-step estimator:

# naive plug-in estimator
psi.naive <- mean(Q1n - Q0n)
psi.naive
## [1] -0.0329666
# bias correction (P_n applied to canonical gradient)
PnDQngn <- mean((2*statins$statin - 1)/gan * (statins$death - Qan) + (Q1n - Q0n) - psi.naive)

# one step estimator
psi.os <- psi.naive + PnDQngn
psi.os
## [1] -0.01373641
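Because the one-step estimator is asymptotically linear with influence function \(D^*(Q_0,g_0)\), a Wald-style confidence interval can be formed from the sample variance of the estimated influence function; a minimal sketch:

# estimated influence function values (plugging in the one-step estimate)
Dn <- as.numeric((2*statins$statin - 1)/gan * (statins$death - Qan) + 
                   Q1n - Q0n - psi.os)

# standard error and 95% Wald-style confidence interval
se.os <- sqrt(var(Dn)/length(Dn))
c(psi.os - 1.96*se.os, psi.os + 1.96*se.os)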

We can also estimate \(\psi_0\) using TMLE combined with Super Learning. The tmle package is already integrated with SuperLearner, so we do not have to do as much work ourselves:

set.seed(1234)

tmleFit <- tmle(
  Y = statins$death,
  A = statins$statin,
  W = statins[,-c(1,ncol(statins))],
  Q.SL.library = c("SL.glm","SL.mean"),
  g.SL.library = c("SL.glm","SL.mean")
)

tmleFit
##  Additive Effect
##    Parameter Estimate:  -0.046762
##    Estimated Variance:  0.0014712
##               p-value:  0.22278
##     95% Conf Interval: (-0.12194, 0.028416)
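If desired, the point estimate and confidence interval can be extracted from the estimates component of the fitted object (this assumes the tmle package's standard return structure):

# extract the additive treatment effect estimate and its confidence interval
tmleFit$estimates$ATE$psi
tmleFit$estimates$ATE$CI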

Evaluating the Super Learner

The SuperLearner package comes with an additional function to objectively evaluate the performance of the Super Learner predictions relative to those from its component methods. This is achieved by adding an additional outer layer of V-fold cross-validation to the procedure. That is, the data are split into, for example, twenty equally sized pieces; each algorithm, including the Super Learner (which itself uses 10-fold cross-validation internally), is trained on nineteen-twentieths of the data and evaluated on the remaining piece. Each piece of data serves as the evaluation set once, and the cross-validated risk of the Super Learner and of each component algorithm is computed.

We can use the CV.SuperLearner function to evaluate our over-simplified library:

set.seed(1234)

# fit cross-validated super learner
cvsl1 <- CV.SuperLearner(
  Y = statins$death, 
  X = statins[,-ncol(statins)],
  # V specifies the number of outer CV layers used to evaluate
  # the Super Learner (which by default uses 10-fold CV)
  V = 20,
  family = binomial(),
  method="method.NNLS",
  SL.library = c("SL.glm","SL.mean")
)

The object itself is not all that informative:

cvsl1
## 
## Call:  
## CV.SuperLearner(Y = statins$death, X = statins[, -ncol(statins)], V = 20,  
##     family = binomial(), SL.library = c("SL.glm", "SL.mean"), method = "method.NNLS") 
## 
## 
## 
## Cross-validated predictions from the SuperLearner:  SL.predict 
## 
## Cross-validated predictions from the discrete super learner (cross-validation selector):  discreteSL.predict 
## 
## Which library algorithm was the discrete super learner:  whichDiscreteSL 
## 
## Cross-validated prediction for all algorithms in the library:  library.predict

However, there is a nice plotting function to display the results:

# plot cross-validated risk
plot(cvsl1)

The plot shows the ordered cross-validated risk estimates and 95% confidence intervals about these estimates for each of the candidate algorithms, in addition to the discrete and continuous Super Learner.
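The same information can be viewed in tabular form using the summary method included in the package:

# tabulate cross-validated risk estimates and standard errors
summary(cvsl1)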

Using Super Learner to tune a single method

The Super Learner software can easily be used to implement a single method with many different tuning parameter values. As an example, consider using Random Forests to estimate the outcome regression in the statin example. The method has three main tuning parameters: the number of trees to build ntree, the terminal node size nodesize, and the number of covariates randomly sampled as candidates at each split mtry.

We can easily define a new algorithm that uses a single set of values for these parameters:

SL.randomForest_m5_nt1000_ns3 <- function(...,mtry=5,ntree=1000,nodesize=3){
  SL.randomForest(...,mtry=mtry,ntree=ntree,nodesize=nodesize)
}

We can also use a loop to define functions over a grid of tuning parameter values:

tuneGrid <-  expand.grid(mtry = c(3,5), ntree=c(500,1000), nodesize=c(1,3))

for(i in seq(nrow(tuneGrid))) { 
  eval(parse(text = paste0("SL.randomForest_m",tuneGrid[i,1],"_nt",tuneGrid[i,2],"_ns",tuneGrid[i,3], 
                      "<- function(..., mtry = ", tuneGrid[i, 1], ", ntree = ", tuneGrid[i, 2], 
                      ", nodesize = ", tuneGrid[i,3],") { SL.randomForest(..., mtry = mtry, ntree = ntree, nodesize=nodesize)}")))
  }

We have now created eight new prediction algorithms, one for each combination of tuning parameters specified in tuneGrid. For example, we can look at the algorithm that uses mtry=3, ntree=500, and nodesize=1:

SL.randomForest_m3_nt500_ns1
## function (..., mtry = 3, ntree = 500, nodesize = 1) 
## {
##     SL.randomForest(..., mtry = mtry, ntree = ntree, nodesize = nodesize)
## }

We can collect all of these algorithms by searching through R objects with a similar name:

# get vector of all objects in R
allObjects <- ls()
# search names of objects for 'SL.randomForest_m'
myRfObjects <- grep("SL.randomForest_m",allObjects)
# get only objects with 'SL.randomForest_m' in their name
allRf <- allObjects[myRfObjects]
allRf
## [1] "SL.randomForest_m3_nt1000_ns1" "SL.randomForest_m3_nt1000_ns3"
## [3] "SL.randomForest_m3_nt500_ns1"  "SL.randomForest_m3_nt500_ns3" 
## [5] "SL.randomForest_m5_nt1000_ns1" "SL.randomForest_m5_nt1000_ns3"
## [7] "SL.randomForest_m5_nt500_ns1"  "SL.randomForest_m5_nt500_ns3"

We can now use Super Learner to evaluate the performance of random forests over this grid of tuning parameter values:

rf.sl <- SuperLearner(
  Y = statins$death, 
  X = statins[,-ncol(statins)],
  family = binomial(),
  method="method.NNLS",
  SL.library = allRf
  )

rf.sl
## 
## Call:  
## SuperLearner(Y = statins$death, X = statins[, -ncol(statins)], family = binomial(),  
##     SL.library = allRf, method = "method.NNLS") 
## 
## 
##                                        Risk       Coef
## SL.randomForest_m3_nt1000_ns1_All 0.1600278 0.51486107
## SL.randomForest_m3_nt1000_ns3_All 0.1600289 0.41608786
## SL.randomForest_m3_nt500_ns1_All  0.1612663 0.00000000
## SL.randomForest_m3_nt500_ns3_All  0.1602803 0.06905107
## SL.randomForest_m5_nt1000_ns1_All 0.1615840 0.00000000
## SL.randomForest_m5_nt1000_ns3_All 0.1619054 0.00000000
## SL.randomForest_m5_nt500_ns1_All  0.1621477 0.00000000
## SL.randomForest_m5_nt500_ns3_All  0.1618559 0.00000000

Combining Super Learner with the caret package

The caret package provides a uniform approach to tuning parameter selection for a number of algorithms (for a full list, see http://topepo.github.io/caret/modelList.html). This provides a way to include algorithms in the Super Learner that implicitly use cross-validation. For example, one candidate algorithm in the Super Learner might use a gradient boosted machine with fixed values for tuning parameters, while another candidate algorithm uses a gradient boosted machine with tuning parameters determined adaptively (e.g., through twenty-fold cross-validation).

The SuperLearner package provides an algorithm SL.caret to use caret to train an algorithm. I have written a slight modification to this function that fixes an issue when one trains a gradient boosted machine with verbose=FALSE (even with this option specified, SL.caret still outputs a TON of useless information).

SL.caret1 <- function (Y, X, newX, family, obsWeights, method = "rf", tuneLength = 3, 
                       trControl = trainControl(method = "cv", number = 20, verboseIter = FALSE), 
                       metric,...) 
{
  if (length(unique(Y))>2){
    if(is.matrix(Y)) Y <- as.numeric(Y)
    metric <- "RMSE"
    if(method=="gbm"){
      suppressWarnings(
        # pass verbose==FALSE directly to train (verboseIter doesn't 
        # suppress output)
      fit.train <- caret::train(x = X, y = Y, weights = obsWeights, 
                              metric = metric, method = method, 
                              tuneLength = tuneLength, 
                              trControl = trControl,verbose=FALSE)
      )
    }else{
      suppressWarnings(
      fit.train <- caret::train(x = X, y = Y, weights = obsWeights, 
                                metric = metric, method = method, 
                                tuneLength = tuneLength, 
                                trControl = trControl)
      )
    }
    pred <- predict(fit.train, newdata = newX, type = "raw")
  }
  if (length(unique(Y))<=2) {
    metric <- "Accuracy"
    Y.f <- as.factor(Y)
    levels(Y.f) <- c("A0", "A1")
    if(method=="gbm"){
      suppressWarnings(
        # pass verbose==FALSE directly to train (verboseIter doesn't 
        # suppress output)
      fit.train <- caret::train(x = X, y = Y.f, weights = obsWeights,
                              metric = metric, method = method, 
                              tuneLength = tuneLength, 
                              trControl = trControl, verbose = FALSE)
      )
    }else{
      suppressWarnings(
      # use the factor version of the outcome so caret fits a classifier
      fit.train <- caret::train(x = X, y = Y.f, weights = obsWeights, 
                                metric = metric, method = method, 
                                tuneLength = tuneLength, 
                                trControl = trControl)
      )
    }
    pred <- predict(fit.train, newdata = newX, type = "prob")[,2]
  }
  fit <- list(object = fit.train)
  out <- list(pred = pred, fit = fit)
  class(out$fit) <- c("SL.caret")
  return(out)
}

As a brief demonstration, we illustrate the implementation of a Super Learner that uses three gradient boosted machine algorithms: SL.gbm uses fixed tuning parameters (gbm.trees=10000 and interaction.depth=2); SL.gbm.caret1 uses twenty-fold cross-validation (the default in SL.caret1) to select these tuning parameters from a grid of three candidate values; SL.gbm.caret2 uses ten-fold cross-validation to select tuning parameters from the same grid. First we must define SL.gbm.caret1 and SL.gbm.caret2:

SL.gbm.caret1 <- function(...,method="gbm",tuneLength=3){
  SL.caret1(...,method=method,tuneLength=tuneLength)
}

SL.gbm.caret2 <- function(...,method="gbm",tuneLength=3, trControl=trainControl(method="cv",number=10,verboseIter=FALSE)){
  SL.caret1(...,method=method,tuneLength=tuneLength,trControl=trControl)
}

We can now implement the Super Learner. Note that the run time on this Super Learner will be considerably longer than previous runs due to the additional layers of cross validation.

set.seed(123)

# fit super learner using three GBMs -- only use the first 150 rows 
# for the sake of expediency
gbm.sl <- SuperLearner(
  Y = statins$death[1:150], 
  X = statins[1:150,-ncol(statins)],
  family = binomial(),
  SL.library = c("SL.gbm","SL.gbm.caret1","SL.gbm.caret2")
)

gbm.sl
## 
## Call:  
## SuperLearner(Y = statins$death[1:150], X = statins[1:150, -ncol(statins)],  
##     family = binomial(), SL.library = c("SL.gbm", "SL.gbm.caret1", "SL.gbm.caret2")) 
## 
## 
## 
##                        Risk Coef
## SL.gbm_All        0.1592559    1
## SL.gbm.caret1_All 0.1751551    0
## SL.gbm.caret2_All 0.1874641    0