In this demonstration, we will illustrate the basic functionality of the SuperLearner
package.
You can execute the following commands to install the packages needed to complete this demo.
# if needed, install all the necessary pacakges to execute this demo
# install.packages(c("SuperLearner","gam","caret","randomForest","arm","RCurl","MASS","tmle","ggplot2","gbm"))
# load the packages
require(SuperLearner)
require(gam)
require(caret)
require(randomForest)
require(RCurl)
require(MASS)
require(tmle)
require(ggplot2)
require(gbm)
I have made three data sets available on GitHub that will be used in this demo. These can be read directly from GitHub using the following commands:
# prediction data set
chspred <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/chspred.csv"), header = TRUE)
# statin data
statins <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/statins.csv"), header = TRUE)
# hiv vaccine data
hiv <- read.csv(text = getURL("https://raw.githubusercontent.com/benkeser/sllecture/master/hiv.csv"), header = TRUE)
We begin by illustrating the “default” functionality of the SuperLearner
function. Using the chspred
data, we are interested in predicting myocardial infarcation (mi
) using the available covariate data. Let’s take a quick peek at the data to see what variables we have:
head(chspred, 3)
## waist alcoh hdl beta smoke ace ldl bmi aspirin
## 1 110.16419 0.000000 66.49739 0 0 1 114.2162 27.99746 0
## 2 89.97632 0.000000 50.06524 0 0 0 103.7766 20.89314 0
## 3 106.19407 8.417438 40.50595 0 0 0 165.7158 28.45541 1
## gend age estrgn glu ins cysgfr dm fetuina whr
## 1 0 73.51795 0 159.9314 70.33431 75.00781 1 0.1751561 1.1689797
## 2 0 61.77225 0 153.3888 33.96945 82.74331 1 0.5716507 0.9011436
## 3 1 72.93120 0 121.7145 -17.30173 74.69888 0 0.3516775 1.1797129
## hsed race logcystat logtrig logcrp logcre health logkcal
## 1 1 1 -0.34202495 5.406297 2.0125961 -0.67385483 0 4.392559
## 2 0 0 -0.08465306 4.859234 3.2932809 -0.55509011 1 6.207056
## 3 0 1 -0.44510754 4.508810 0.3013225 -0.01151792 0 6.731968
## sysbp mi
## 1 177.1345 0
## 2 136.3742 0
## 3 135.1993 0
For the sake of computational expediency, we will initially consider only a simple library of algorithms: a main effects GLM and an unadjusted (i.e., intercept) model. Later, we will look at how these algorithms are constructed for useage with SuperLearner
# because cross-validation is used, we need to set to the seed to ensure reproducibility
set.seed(1234)
# execute the call to SuperLearner
sl1 <- SuperLearner(
# Y is the outcome variable
Y = chspred$mi,
# X is a dataframe of predictor variables, in this case
# everything in chspred except for mi
X = chspred[,-ncol(chspred)],
# newX will be discussed later, for now leave as NULL (default)
newX = NULL,
# family will be discussed in more detail when we see how wrappers
# are written, for now set to binomial() for 0/1 outcome
family = binomial(),
# SL.library (for now) is specified as a vector of names of functions
# that implement the desired algorithms. SL.glm and SL.mean
# are included in the Super Learner package
SL.library = c("SL.glm","SL.mean"),
# method specifies how the ensembling is done, for now we will use
# the \sum_{k=1}^K \alpha_k f_{k,n} method by using the deafult
# option for method (method.NNLS)
method = "method.NNLS",
# id specifies a unique subject identifier so that whole subjects
# are sampled in CV, not just rows of data. chspred only has one row
# per subject, so OK to leave as NULL (default)
id = NULL,
# verbose controls the printing of messages of SuperLearner's progress.
# We'll leave as FALSE (default) for now
verbose = FALSE,
# control contains options related to logistic ensemble (trimLogit)
# and whether to save the fit library to look at individual
# algorithms later. We will leave as default
control = list(saveFitLibrary = TRUE, trimLogit = 0.001),
# cvControl specifies parameters related to cross validation. Of note
# the default is for V=10-fold cross validation. See ?SuperLearner
# for more details
cvControl = list(V = 10L, stratifyCV = FALSE, shuffle = TRUE,
validRows = NULL)
)
sl1
##
## Call:
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], newX = NULL,
## family = binomial(), SL.library = c("SL.glm", "SL.mean"), method = "method.NNLS",
## id = NULL, verbose = FALSE, control = list(saveFitLibrary = TRUE,
## trimLogit = 0.001), cvControl = list(V = 10L, stratifyCV = FALSE,
## shuffle = TRUE, validRows = NULL))
##
##
## Risk Coef
## SL.glm_All 0.02658219 0.8871921
## SL.mean_All 0.03041508 0.1128079
From the output we see that SL.glm_All had the lowest cross-validated risk and is thus the Discrete Super Learner. We will discuss why the name of each algorithm has been augmented with the suffix _All
when we illustrate variable screening functions later in the document.
Predictions from the discrete and continuous Super Learner on the observed data can now be obtained as follows:
# default call to predict
slPred <- predict(sl1)
# slPred is a list with two components
# pred = continuous SL predictions
# library.predict = predictions from each algorithm
# store the continuous SL predictions
cslPred <- slPred$pred
# get the discrete SL predictions
dslPred <- slPred$library.predict[,which(sl1$cvRisk==min(sl1$cvRisk))]
We can also obtain predictions on a new observation:
# generate a new observation set to the mean of each variable
newObs <- data.frame(t(colMeans(chspred[,-ncol(chspred)])))
# all predictions on newObs
slPredNew <- predict(sl1,newdata=newObs)
# continuous SL prediction on newObs
cslPredNew <- slPredNew$pred
# discrete SL prediction on newObs
dslPredNew <- slPredNew$library.predict[,which(sl1$cvRisk==min(sl1$cvRisk))]
If one wishes to access the fitted object for any of the component algorithms (applied to all the data), this can be accessed through the fitLibrary
component of the SuperLearner
object. For example, to access the glm
object from the SL.glm
algorithm, we can use:
# obtain gamma GLM with log-link fit
glmObject <- sl1$fitLibrary$SL.glm$object
# summarize the fit
summary(glmObject)
##
## Call:
## glm(formula = Y ~ ., family = family, data = X, weights = obsWeights)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.4162 -0.2271 -0.1342 -0.0766 3.5835
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 5.4405844 1.9012226 2.862 0.00421 **
## waist -0.0064552 0.0071065 -0.908 0.36370
## alcoh 0.1127744 0.0110880 10.171 < 2e-16 ***
## hdl 0.0044309 0.0052140 0.850 0.39543
## beta 0.3328622 0.2747572 1.211 0.22571
## smoke 0.1358397 0.2704664 0.502 0.61550
## ace 0.3195770 0.2900528 1.102 0.27055
## ldl -0.0155763 0.0019280 -8.079 6.54e-16 ***
## bmi -0.1718827 0.0188935 -9.097 < 2e-16 ***
## aspirin 0.1921275 0.1940136 0.990 0.32204
## gend 0.1720266 0.2266058 0.759 0.44777
## age -0.0498989 0.0165272 -3.019 0.00253 **
## estrgn -0.2844683 0.3469420 -0.820 0.41226
## glu -0.0016521 0.0024212 -0.682 0.49503
## ins 0.0009482 0.0022495 0.422 0.67338
## cysgfr -0.0071902 0.0045757 -1.571 0.11609
## dm 0.4097270 0.2472677 1.657 0.09752 .
## fetuina -0.3428327 0.7446875 -0.460 0.64525
## whr 0.0397150 0.7534039 0.053 0.95796
## hsed 0.0805406 0.1955452 0.412 0.68043
## race 0.0936239 0.2876978 0.325 0.74486
## logcystat -0.0288129 0.3140621 -0.092 0.92690
## logtrig -0.0586191 0.1576144 -0.372 0.70996
## logcrp -0.0336396 0.0715266 -0.470 0.63813
## logcre -0.2222800 0.2643888 -0.841 0.40050
## health 0.2299788 0.2707350 0.849 0.39562
## logkcal 0.0325390 0.0410616 0.792 0.42810
## sysbp 0.0047877 0.0033635 1.423 0.15461
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1280.44 on 4589 degrees of freedom
## Residual deviance: 944.26 on 4562 degrees of freedom
## AIC: 1000.3
##
## Number of Fisher Scoring iterations: 7
We now discuss how to supply new algorithms to the SuperLearner
function. First, it is useful to check the algorithms that are included in the SuperLearner
by default:
listWrappers()
## All prediction algorithm wrappers in SuperLearner:
## [1] "SL.bayesglm" "SL.caret" "SL.caret.rpart"
## [4] "SL.cforest" "SL.earth" "SL.gam"
## [7] "SL.gbm" "SL.glm" "SL.glm.interaction"
## [10] "SL.glmnet" "SL.ipredbagg" "SL.knn"
## [13] "SL.leekasso" "SL.loess" "SL.logreg"
## [16] "SL.mean" "SL.nnet" "SL.nnls"
## [19] "SL.polymars" "SL.randomForest" "SL.ridge"
## [22] "SL.rpart" "SL.rpartPrune" "SL.step"
## [25] "SL.step.forward" "SL.step.interaction" "SL.stepAIC"
## [28] "SL.svm" "SL.template"
##
## All screening algorithm wrappers in SuperLearner:
## [1] "All"
## [1] "screen.corP" "screen.corRank" "screen.glmnet"
## [4] "screen.randomForest" "screen.SIS" "screen.template"
## [7] "screen.ttest" "write.screen.template"
Note that both “prediction” and “screening”" algorithms are shown. We focus first on prediction algorithms; screening algorithms are discussed in the next section. Let’s look at the guts of the SL.glm
algorithm:
SL.glm
## function (Y, X, newX, family, obsWeights, ...)
## {
## fit.glm <- glm(Y ~ ., data = X, family = family, weights = obsWeights)
## pred <- predict(fit.glm, newdata = newX, type = "response")
## fit <- list(object = fit.glm)
## class(fit) <- "SL.glm"
## out <- list(pred = pred, fit = fit)
## return(out)
## }
## <environment: namespace:SuperLearner>
Note that SL.glm
is a function that takes as input: Y
, X
, newX
, family
, obsWeights
, and other arguments via ...
. Note that the family
option allows one to use a single prediction function when the outcome is both binary and continuous. In this case, SL.glm
with family=gaussian()
will call glm
with family=gaussian()
(linear regression); with family=binomial()
it calls glm
with family=binomial()
(logistic regression). The output of the function is a list with components pred
, a vector of predictions computed on the newX
object (not X
! source of many errors in my life…), and fit
, which contains anything that is (1) required for predicting new values later; or (2) desired for later access via the fitLibrary
component of the SuperLearner
object. Because this fit
object may be used for prediction later, it is important to specify its class so that an S3 predict method can be used on the object later. Note that such a method is already included for SL.glm
:
predict.SL.glm
## function (object, newdata, ...)
## {
## pred <- predict(object = object$object, newdata = newdata,
## type = "response")
## pred
## }
## <environment: namespace:SuperLearner>
This input/output structure is all that is needed to define a new prediction algorithm for SuperLearner
.
As an illustration, we could write a new algorithm specifying a new glm
algorithm that uses the Poisson error distribution and log link (default of family = poisson()
):
require(splines)
SL.poisglm <- function(Y, X, newX, family, obsWeights, ...){
# Poisson regression can be used regardless of whether Y is
# continuous or binary, so we will not specify anything related to
# family
# fit glm with family=poisson()
fit.glm <- glm(Y ~ .,data=X, family=poisson())
# get predictions on newX object
pred <- predict(fit.glm, newdata=newX, type="response")
# save the fit object
fit <- list(object=fit.glm)
# because this is simply a different form of glm,
# we can use predict.SL.glm to get predictions back,
# i.e. no need to write a new predict function
class(fit) <- "SL.glm"
# out must be list with named objects pred (predictions on newX)
# and fit (anything needed to get predictions later)
out <- list(pred=pred, fit=fit)
return(out)
}
We have now defined a new algorithm for use in the SuperLearner.
These new algorithms can now be added to the library we used previously:
set.seed(1234)
sl2 <- SuperLearner(
Y = chspred$mi,
X = chspred[,-ncol(chspred)],
SL.library = c("SL.glm","SL.mean","SL.poisglm")
)
sl2
##
## Call:
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = c("SL.glm",
## "SL.mean", "SL.poisglm"))
##
##
## Risk Coef
## SL.glm_All 0.02819407 0.6554559
## SL.mean_All 0.03041508 0.1026581
## SL.poisglm_All 0.03585962 0.2418860
We can double check to make sure that predict.SL.glm
works for our the new algorithms we defined by attempting to predict on a new observation:
slPredNew2 <- predict(sl2,newdata=newObs)
slPredNew2
## $pred
## [,1]
## [1,] 0.02667622
##
## $library.predict
## SL.glm_All SL.mean_All SL.poisglm_All
## [1,] 0.03137255 0.03137255 0.01195707
We now discuss how screening algorithms can be utilized to create Super Learner libraries. As the name suggests, these are algorithms that define a screening step prior to the execution of the prediction algorithm. The SuperLearner
function will apply this screening step in each of the V folds. The combination of screening algorithm and prediction algorithm defines a new algorithm. We can look at how screening algorithms are constructed for use with the SuperLearner
package:
write.screen.template()
## screen.template <- function(Y, X, family, obsWeights, id, ...) {
## # load required packages
## # require('pkg')
## if (family$family == 'gaussian') {
##
## }
## if (family$family == 'binomial') {
##
## }
## # whichVariable is a logical vector,
## # TRUE indicates variable will be used
## whichVariable <- rep(TRUE, ncol(X))
## return(whichVariable)
## }
Screening algorithms take the same input as prediction algorithms, but output a logical vector with TRUE
indicating that a column of X
should be used in the prediction step. To illustrate why these functions are useful, in our running example, consider the possibility of an interaction between treatment and SOFA score. If we are unsure of the existence of this interaction, we may wish to include algorithms that both do and do not account for this interaction. To construct a new library that includes algorithms both with and without interactions, we can make use of screening algorithms.
Let’s write a screening algorithm that only includes demographic variables:
demographics <- function(X,...){
returnCols <- rep(FALSE, ncol(X))
returnCols[names(X) %in% c("age","gend","race","hsed")] <- TRUE
return(returnCols)
}
Now we can fit the SuperLearner using the two GLMs both with all variables and only demographic variables. The call to SuperLearner
is nearly identical; however, we now specify SL.library
as a list, where each component is a vector of the form c(predictionAlgorithm,screeningAlgorithm)
. To include all the covariates, we specify the All
screening algorithm that is included in the SuperLearner
package.
set.seed(1234)
# Fit the Super Learner
sl3 <- SuperLearner(
Y = chspred$mi,
X = chspred[,-ncol(chspred)],
SL.library=list(c("SL.glm","All"),c("SL.glm","denographics"),
c("SL.mean","All"), # not adjusted, so doesn't matter
c("SL.poisglm","All"),c("SL.poisglm","denographics"))
)
sl3
##
## Call:
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = list(c("SL.glm",
## "All"), c("SL.glm", "denographics"), c("SL.mean", "All"), c("SL.poisglm",
## "All"), c("SL.poisglm", "denographics")))
##
##
## Risk Coef
## SL.glm_All 0.02819407 0.0000000
## SL.glm_denographics 0.02819407 0.6554559
## SL.mean_All 0.03041508 0.1026581
## SL.poisglm_All 0.03585962 0.2418860
## SL.poisglm_denographics 0.03585962 0.0000000
Note that the output for sl3
lists five algorithms: the three original algorithms each with the interaction (_All
) and without (_noInt
). Note that this explains why the output for sl1
contained the “_All" addendum – by default SuperLearner
uses all the All
screening function to pass through all variables in X
to the prediction algorithms.
This flexibility in combining screening and prediction algorithms to generate new algorithms allows one to easily implement a library containing a large number of candidate algorithms. Check out listWrappers()
to see other screening functions that are useful for more high dimensional settings.
So far, we have been focusing on using mean-squared error loss, by virtue of using the default method=method.NNLS
. Because our outcome is binary, we may instead prefer the negative log-likelihood loss function instead. We can easily change our original call to SuperLearner
to this loss function:
set.seed(1234)
sl4 <- SuperLearner(
Y = chspred$mi,
X = chspred[,-ncol(chspred)],
SL.library=c("SL.glm","SL.mean"),
method = "method.NNloglik"
)
sl4
##
## Call:
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], SL.library = c("SL.glm",
## "SL.mean"), method = "method.NNloglik")
##
##
## Risk Coef
## SL.glm_All NaN 1
## SL.mean_All 0.1399198 0
We may wish instead to maximize AUC (equivalent to minimizing rank loss); for this, we can specify method=method.AUC
:
set.seed(1234)
sl5 <- SuperLearner(
Y = chspred$mi,
X = chspred[,-ncol(chspred)],
SL.library=c("SL.glm","SL.mean"),
method = "method.AUC",
family=binomial()
)
## Loading required package: cvAUC
## Loading required package: ROCR
## Loading required package: gplots
##
## Attaching package: 'gplots'
## The following object is masked from 'package:stats':
##
## lowess
## Loading required package: data.table
##
## cvAUC version: 1.1.0
## Notice to cvAUC users: Major speed improvements in version 1.1.0
##
sl5
##
## Call:
## SuperLearner(Y = chspred$mi, X = chspred[, -ncol(chspred)], family = binomial(),
## SL.library = c("SL.glm", "SL.mean"), method = "method.AUC")
##
##
## Risk Coef
## SL.glm_All 0.1583508 643.9684
## SL.mean_All 0.5996542 -798.7203
Or we can even write our own method. The package contains a template for doing so. It requires a function that returns a list with three components: (1) require
lists the packages needed to execute the functions; (2) computeCoef
is a function that takes a specific input and returns a cross validated risk estimate and vector of weight coefficients corresponding to the \(K\) algorithms in the library; (3) computePred
a function that computes the ensemble prediction.
method.template
## function ()
## {
## out <- list(require = NULL, computeCoef = function(Z, Y,
## libraryNames, obsWeights, control, verbose, ...) {
## cvRisk <- numeric()
## coef <- numeric()
## out <- list(cvRisk = cvRisk, coef = coef)
## return(out)
## }, computePred = function(predY, coef, control, ...) {
## out <- crossprod(t(predY), coef)
## return(out)
## })
## invisible(out)
## }
## <environment: namespace:SuperLearner>
We now illustrate an analysis of the statin data that combines Super Learning-derived estimates with one-step estimation and TMLE. As you learned in class, the canonical gradient of the parameter \(E_0\bigl\{E_0(Y | A=1, W) - E_0(Y | A=0, W)\bigr\}\) at \((Q,g)\) in a nonparametric model is given by \[ D^*(Q,g)(o) = \frac{2a - 1}{g^a_0(w)} (y - \bar{Q}^a_0(w)) + \bar{Q}^1_0(w) - \bar{Q}^0_0(w) - \psi_0 \ . \] The one-step estimator is defined as \[ \psi_n^+ := \Psi(Q_n) + P_n D^*(Q_n,g_n) \ . \] We will now use the statin data to illustrate how to construct this estimator using Super Learner-based estimates of the nuisance parameters.
Let’s first take a peek at the statin
data to familiarize ourselves with what variables are there
head(statins, 3)
## statin educ gend bmi hdl smoke age race chol ldl
## 1 0 0 1 24.09128 67.48161 0 75.40926 1 231.2540 94.5397
## 2 0 0 0 24.04775 59.35625 0 70.24054 1 306.1297 196.1082
## 3 0 0 1 33.48020 66.70018 1 73.49698 0 195.6578 102.1743
## hyper copd kcal death
## 1 0 0 0.00 1
## 2 1 0 1276.99 0
## 3 1 0 0.00 0
First we need to estimate both the outcome regression and the propensity regression using Super Learner. We start with the propensity:
# estimate g_0^1
sl.g1 <- SuperLearner(
# our outcome is now the statin variable
Y = statins$statin,
# our predictors are all variables except for death and statins
X = statins[,-c(1,ncol(statins))],
# outcome is binary, so let's use family = binomial()
family = binomial(),
# and nnloglik metho
method=method.NNloglik,
# simple library for computational efficiency
SL.library = c("SL.glm","SL.mean")
)
# get predicted probability that statin = 1
g1n <- sl.g1$SL.pred
# the predicted probability that statin = 0 is 1-g1n
g0n <- 1 - g1n
# get predicted probability that statin = observed value
gan <- ifelse(statins$statin==0, g0n[statins$statin==0], g1n[statins$statin==1])
We can now estimate the outcome regression and return predictions setting statin
equal to zero and one:
set.seed(1234)
# estimate \bar{Q}_0
sl.Q <- SuperLearner(
# our outcome is death
Y = statins$death,
# our predictors are all variables other than death
X = statins[,-ncol(statins)],
# outcome is binary, so let's use family = binomial()
family = binomial(),
# and nnloglik metho
method=method.NNloglik,
# simple library for computational efficiency
SL.library = c("SL.glm","SL.mean")
)
# set up a data frame where everyone has statin=1
statins1 <- statins[,-ncol(statins)]
statins1$statin <- 1
# get \bar{Q}_n^1
Q1n <- predict(sl.Q, newdata=statins1)$pred
# set up a data frame where everyone has statin=0
statins0 <- statins[,-ncol(statins)]
statins0$statin <- 0
# get \bar{Q}_n^0
Q0n <- predict(sl.Q, newdata=statins0)$pred
# get \bar{Q}_n^a
Qan <- ifelse(statins$statin==0, Q0n[statins$statin==0], Q1n[statins$statin==1])
We now have all the ingredients to construct the one-step estimator:
# naive plug-in estimator
psi.naive <- mean(Q1n - Q0n)
psi.naive
## [1] -0.0329666
# bias correction (P_n applied to canonical gradient)
PnDQngn <- mean((2*statins$statin - 1)/gan * (statins$death - Qan) + (Q1n - Q0n) - psi.naive)
# one step estimator
psi.os <- psi.naive + PnDQngn
psi.os
## [1] -0.01373641
We can also estimate \(\psi_0\) using TMLE combined with Super Learning. The tmle
package is already integrated with SuperLearner
, so we do not have to do as much work ourselves:
set.seed(1234)
tmleFit <- tmle(
Y = statins$death,
A = statins$statin,
W = statins[,-c(1,ncol(statins))],
Q.SL.library = c("SL.glm","SL.mean"),
g.SL.library = c("SL.glm","SL.mean")
)
tmleFit
## Additive Effect
## Parameter Estimate: -0.046762
## Estimated Variance: 0.0014712
## p-value: 0.22278
## 95% Conf Interval: (-0.12194, 0.028416)
The SuperLearner
package comes with an additional function to objectively evaluate the performance of the SuperLearner predictions relative to those from its component methods. This is achieved by adding an additional outer layer of V-fold cross-validation to the procedure. That is the data is split into, for example twenty equally sized pieces and each algorithm is trained on nine-tenths of the data – including the Super Learner, which itself uses 10-fold cross-validation – and evaluated on the remaining piece. Each piece of data serves as the evaluation set once and the cross-validated risk of the Super Learner and each component algorithm is computed.
We can use the CV.SuperLearer
function to evaluate our over-simplified library:
set.seed(1234)
# fit cross-validated super learner
cvsl1 <- CV.SuperLearner(
Y = statins$death,
X = statins[,-ncol(statins)],
# V specifies the number of outer CV layers used to evalute
# the Super Learner (which by default uses 10-fold CV)
V = 20,
family = binomial(),
method="method.NNLS",
SL.library = c("SL.glm","SL.mean")
)
The object itself is not all that informative:
cvsl1
##
## Call:
## CV.SuperLearner(Y = statins$death, X = statins[, -ncol(statins)], V = 20,
## family = binomial(), SL.library = c("SL.glm", "SL.mean"), method = "method.NNLS")
##
##
##
## Cross-validated predictions from the SuperLearner: SL.predict
##
## Cross-validated predictions from the discrete super learner (cross-validation selector): discreteSL.predict
##
## Which library algorithm was the discrete super learner: whichDiscreteSL
##
## Cross-validated prediction for all algorithms in the library: library.predict
However, there is a nice plotting function to display the results:
# plot cross-validated risk
plot(cvsl1)
The plot shows the ordered cross-validated risk estimates and 95% confidence intervals about these estimates for each of the candidate algorithms, in addition to the discrete and continuous Super Learner.
The Super Learner software can easily be used to implement a single method with many different tuning parameter values. As an example, consider using Random Forests to estimate the outcome regression in the statin example. The method requires two tuning parameters: the number of trees to build ntree
, the size of the trees nodesize
, and the number of randomly sampled covariates for each tree mtry
.
We can easily define a new algorithm that uses a single set of values for these parameters:
SL.randomForest_m5_nt1000_ns3 <- function(...,mtry=5,ntree=1000,nodesize=3){
SL.randomForest(...,mtry=mtry,ntree=ntree,nodesize=nodesize)
}
We can also use a loop to define functions over a grid of tuning parameter values:
tuneGrid <- expand.grid(mtry = c(3,5), ntree=c(500,1000), nodesize=c(1,3))
for(i in seq(nrow(tuneGrid))) {
eval(parse(text = paste0("SL.randomForest_m",tuneGrid[i,1],"_nt",tuneGrid[i,2],"_ns",tuneGrid[i,3],
"<- function(..., mtry = ", tuneGrid[i, 1], ", ntree = ", tuneGrid[i, 2],
", nodesize = ", tuneGrid[i,3],") { SL.randomForest(..., mtry = mtry, ntree = ntree, nodesize=nodesize)}")))
}
We have now created eight new prediction algorithms with each combination of tuning parameters specified in tuneGrid
. For example, we can look at the algorithm that uses mtry=3
, ntree=500
, and nodesize=1
:
SL.randomForest_m3_nt500_ns1
## function (..., mtry = 3, ntree = 500, nodesize = 1)
## {
## SL.randomForest(..., mtry = mtry, ntree = ntree, nodesize = nodesize)
## }
We can collect all of these algorithms by searching through R
objects with a similar name:
# get vector of all objects in R
allObjects <- ls()
# search names of objects for 'SL.randomForest_m'
myRfObjects <- grep("SL.randomForest_m",allObjects)
# get only objects with 'SL.randomForest_m' in their name
allRf <- allObjects[myRfObjects]
allRf
## [1] "SL.randomForest_m3_nt1000_ns1" "SL.randomForest_m3_nt1000_ns3"
## [3] "SL.randomForest_m3_nt500_ns1" "SL.randomForest_m3_nt500_ns3"
## [5] "SL.randomForest_m5_nt1000_ns1" "SL.randomForest_m5_nt1000_ns3"
## [7] "SL.randomForest_m5_nt500_ns1" "SL.randomForest_m5_nt500_ns3"
We can now use Super Learner to evaluate the performance of the Wang (2009) method using various tuning parameter values:
rf.sl <- SuperLearner(
Y = statins$death,
X = statins[,-ncol(statins)],
family = binomial(),
method="method.NNLS",
SL.library = allRf
)
rf.sl
##
## Call:
## SuperLearner(Y = statins$death, X = statins[, -ncol(statins)], family = binomial(),
## SL.library = allRf, method = "method.NNLS")
##
##
## Risk Coef
## SL.randomForest_m3_nt1000_ns1_All 0.1600278 0.51486107
## SL.randomForest_m3_nt1000_ns3_All 0.1600289 0.41608786
## SL.randomForest_m3_nt500_ns1_All 0.1612663 0.00000000
## SL.randomForest_m3_nt500_ns3_All 0.1602803 0.06905107
## SL.randomForest_m5_nt1000_ns1_All 0.1615840 0.00000000
## SL.randomForest_m5_nt1000_ns3_All 0.1619054 0.00000000
## SL.randomForest_m5_nt500_ns1_All 0.1621477 0.00000000
## SL.randomForest_m5_nt500_ns3_All 0.1618559 0.00000000
caret
packageThe caret
package provides a uniform approach to tuning parameter selection for a number of algorithms (for a full list, see http://topepo.github.io/caret/modelList.html). This provides a way to include algorithms in the Super Learner that implicitly use cross-validation. For example, one candidate algorithm in the Super Learner might use a gradient boosted machine with fixed values for tuning parameters, while another candidate algorithm uses a gradient boosted machine with tuning parameters determined adaptively (e.g., through twenty-fold cross-validation).
The SuperLearner
package provides an algorithm SL.caret
to use caret to train an algorithm. I have written a slight modification to this function that fixes an issue when one trains a gradient boosted machine with verbose=FALSE
(even with this option specified, SL.caret
still outputs a TON of useless information).
SL.caret1 <- function (Y, X, newX, family, obsWeights, method = "rf", tuneLength = 3,
trControl = trainControl(method = "cv", number = 20, verboseIter = FALSE),
metric,...)
{
if (length(unique(Y))>2){
if(is.matrix(Y)) Y <- as.numeric(Y)
metric <- "RMSE"
if(method=="gbm"){
suppressWarnings(
# pass verbose==FALSE directly to train (verboseIter doesn't
# suppress output)
fit.train <- caret::train(x = X, y = Y, weights = obsWeights,
metric = metric, method = method,
tuneLength = tuneLength,
trControl = trControl,verbose=FALSE)
)
}else{
suppressWarnings(
fit.train <- caret::train(x = X, y = Y, weights = obsWeights,
metric = metric, method = method,
tuneLength = tuneLength,
trControl = trControl)
)
}
pred <- predict(fit.train, newdata = newX, type = "raw")
}
if (length(unique(Y))<=2) {
metric <- "Accuracy"
Y.f <- as.factor(Y)
levels(Y.f) <- c("A0", "A1")
if(method=="gbm"){
suppressWarnings(
# pass verbose==FALSE directly to train (verboseIter doesn't
# suppress output)
fit.train <- caret::train(x = X, y = Y.f, weights = obsWeights,
metric = metric, method = method,
tuneLength = tuneLength,
trControl = trControl, verbose = FALSE)
)
}else{
suppressWarnings(
fit.train <- caret::train(x = X, y = Y, weights = obsWeights,
metric = metric, method = method,
tuneLength = tuneLength,
trControl = trControl)
)
}
pred <- predict(fit.train, newdata = newX, type = "prob")[,2]
}
fit <- list(object = fit.train)
out <- list(pred = pred, fit = fit)
class(out$fit) <- c("SL.caret")
return(out)
}
As a brief demonstration, we illustrate the implementation of a Super Learner that uses three gradient boosted machine algorithms: SL.gbm
uses fixed tuning parameters (gbm.trees=10000
and interaction.depth=2
); SL.gbm.caret1
uses twenty-fold cross validation (the default) to select these tuning parameters from a grid of three possible values; SL.gbm.caret2
uses ten-fold cross validation to select tuning parameters also from a grid of eight possible values. First we must define SL.gbm.caret1
and SL.gbm.caret2
:
SL.gbm.caret1 <- function(...,method="gbm",tuneLength=3){
SL.caret1(...,method=method,tuneLength=tuneLength)
}
SL.gbm.caret2 <- function(...,method="gbm",tuneLength=3, trControl=trainControl(method="cv",number=10,verboseIter=FALSE)){
SL.caret1(...,method=method,tuneLength=tuneLength,trControl=trControl)
}
We can now implement the Super Learner. Note that the run time on this Super Learner will be considerably longer than previous runs due to the additional layers of cross validation.
set.seed(123)
# fit super learner using three GBMs -- only use the first 150 rows
# for the sake of expediency
gbm.sl <- SuperLearner(
Y = statins$death[1:150],
X = statins[1:150,-ncol(statins)],
family = binomial(),
SL.library = c("SL.gbm","SL.gbm.caret1","SL.gbm.caret2")
)
gbm.sl
##
## Call:
## SuperLearner(Y = statins$death[1:150], X = statins[1:150, -ncol(statins)],
## family = binomial(), SL.library = c("SL.gbm", "SL.gbm.caret1", "SL.gbm.caret2"))
##
##
##
## Risk Coef
## SL.gbm_All 0.1592559 1
## SL.gbm.caret1_All 0.1751551 0
## SL.gbm.caret2_All 0.1874641 0