5 Methods
5.1 Outcomes
5.1.1 Single bnAb
If a single bnAb, or a combination of bnAbs that is measured directly in the CATNAP data, is requested (i.e., the nab option is a single string naming one bnAb or one measured combination from the CATNAP database), then the possible outcomes are:

- ic50 = \(\mbox{log}_{10}(\mbox{IC}_{50})\), where IC\(_{50}\) is the half-maximal inhibitory concentration;
- ic80 = \(\mbox{log}_{10}(\mbox{IC}_{80})\), where IC\(_{80}\) is the 80% maximal inhibitory concentration;
- iip = \((-1)\mbox{log}_{10}(1 - \mbox{IIP})\), where IIP (Shen et al. 2008; Wagh et al. 2016) is the instantaneous inhibitory potential, computed as \[ \frac{10^m}{\mbox{IC$_{50}$}^m + 10^m} \ , \] where \(m = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{IC}_{80}) - \mbox{log}_{10}(\mbox{IC}_{50}))\); and
- sens = sensitivity: the binary indicator that IC\(_{x}\) \(<\) sens_thresh, the user-specified sensitivity threshold. The value of \(x\) is determined by binary_outcomes, which defaults to "ic50" (i.e., \(x = 50\)) but may be set to "ic80" (i.e., \(x = 80\)).
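As a worked illustration of the iip outcome (hypothetical values; note that the constant 10 in the numerator plays the role of the concentration at which neutralization is evaluated): for a pseudovirus with measured IC\(_{50} = 1\) and IC\(_{80} = 4\), \[ m = \mbox{log}_{10}(4) / \{\mbox{log}_{10}(4) - \mbox{log}_{10}(1)\} = 1 \ , \qquad \mbox{IIP} = \frac{10^{1}}{1^{1} + 10^{1}} = \frac{10}{11} \approx 0.909 \ , \] so the reported outcome is \((-1)\mbox{log}_{10}(1 - 10/11) = \mbox{log}_{10}(11) \approx 1.04\).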
5.1.2 Multiple bnAbs
If multiple bnAbs are requested (i.e., the nab option is a semicolon-separated string of more than one bnAb from CATNAP), then the possible outcomes that can be requested are:

- ic50 = \(\mbox{log}_{10}(\mbox{estimated IC}_{50})\), where estimated IC\(_{50}\) is computed using the requested combination_method (see below);
- ic80 = \(\mbox{log}_{10}(\mbox{estimated IC}_{80})\), where estimated IC\(_{80}\) is computed using the requested combination_method (see below);
- iip = \((-1)\mbox{log}_{10}(1 - \mbox{IIP})\), where IIP is computed as \[ \frac{10^m}{\mbox{estimated IC$_{50}$}^m + 10^m} \ , \] where \(m = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{estimated IC}_{80}) - \mbox{log}_{10}(\mbox{estimated IC}_{50}))\);
- estsens = estimated sensitivity: the binary indicator that estimated IC\(_{x}\) (defined above) is less than sens_thresh (where \(x\) is determined by the value of binary_outcomes); and
- multsens = multiple sensitivity: the binary indicator that the measured IC\(_{x}\) is less than the sensitivity threshold (sens_thresh) for a number of bnAbs determined by multsens_nab (where \(x\) is determined by the value of binary_outcomes).
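For example, a run requesting several of these outcomes for a two-bnAb combination might look like the following sketch. It assumes slapnap is invoked as a Docker image (written here as slapnap/slapnap) with options passed as environment variables; the image name, output mount path, the name of the outcomes option, and the specific values are illustrative assumptions, and the combination method is described below.

```bash
# Hypothetical invocation with illustrative values: requests estimated IC-80
# and both sensitivity outcomes for a VRC01 + PGT121 combination, using the
# additive combination method described below. The image name (slapnap/slapnap),
# output mount point, and the name of the outcomes option are assumptions.
docker run \
  -v /path/to/output:/home/output \
  -e nab="VRC01;PGT121" \
  -e outcomes="ic80;estsens;multsens" \
  -e combination_method="additive" \
  -e binary_outcomes="ic80" \
  -e sens_thresh="1" \
  -e multsens_nab="2" \
  slapnap/slapnap
```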
Possible values of combination_method, used for computing the estimated IC\(_{50}\) and IC\(_{80}\), are:

- additive, the additive model of Wagh et al. (2016). For \(J\) bnAbs, \[ \mbox{estimated IC}_{50} = \left( \sum_{j=1}^J \mbox{IC}_{50,j}^{-1} \right)^{-1} \ , \] where IC\(_{50,j}\) denotes the measured IC\(_{50}\) for bnAb \(j\); and
- Bliss-Hill, the Bliss-Hill model of Wagh et al. (2016). For \(J\) bnAbs, the estimate is computed using Brent's algorithm (Brent 1971) as the concentration value \(c\) that minimizes \(\lvert f_J(c) - k \rvert\), where \(k\) denotes the desired neutralization fraction (50% for IC\(_{50}\) or 80% for IC\(_{80}\)), \[ f_J(c) = 1 - \prod_{j = 1}^J \{1 - f_j(c / J)\} \ , \qquad f_j(c_j) = \frac{c_j^{m_j}}{\mbox{IC}_{50,j}^{m_j} + c_j^{m_j}} \ , \] \(m_j = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{IC}_{80,j}) - \mbox{log}_{10}(\mbox{IC}_{50,j}))\), and IC\(_{50,j}\) and IC\(_{80,j}\) denote the measured IC\(_{50}\) and IC\(_{80}\) for bnAb \(j\), respectively.
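To make the additive model concrete, suppose (hypothetically) that two bnAbs have measured IC\(_{50}\) values of 0.1 and 0.4 against a given pseudovirus. Then \[ \mbox{estimated IC}_{50} = \left( \frac{1}{0.1} + \frac{1}{0.4} \right)^{-1} = (12.5)^{-1} = 0.08 \ , \] so the estimated combination IC\(_{50}\) is smaller than either individual IC\(_{50}\); in general, under the additive model it can be no larger than the smallest individual IC\(_{50}\).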
5.2 Learners
There are three possible learners available in slapnap: random forests (Breiman 2001), as implemented in the R package ranger (Wright and Ziegler 2017) and accessed in slapnap by including 'rf' in learners; the elastic net (Zou and Hastie 2005), as implemented in glmnet (Friedman, Hastie, and Tibshirani 2010) and accessed in slapnap by including 'lasso' in learners; and boosted trees (Friedman 2001; Chen and Guestrin 2016), as implemented in either xgboost (Chen et al. 2019) (accessed in slapnap by including 'xgboost' in learners) or H2O.ai (H2O.ai 2016) (accessed in slapnap by including 'h2oboost' in learners).
For each learner, there is a default choice of tuning parameters that is used when cvtune="FALSE". If instead cvtune="TRUE", then several choices of tuning parameters are evaluated using cross validation with nfold folds; see Table 5.1 for the full list of these learner variants.
Table 5.1: Candidate learners and their tuning parameters.

| learner | Tuning parameters |
|---|---|
| rf_default | mtry equal to the square root of the number of predictors |
| rf_1 | mtry equal to one-half times the square root of the number of predictors |
| rf_2 | mtry equal to two times the square root of the number of predictors |
| xgboost_default | maximum tree depth equal to 4 |
| xgboost_1 | maximum tree depth equal to 2 |
| xgboost_2 | maximum tree depth equal to 6 |
| xgboost_3 | maximum tree depth equal to 8 |
| h2oboost_default | max_depth in (2, 4, 5, 6), learn_rate in (0.05, 0.1, 0.2), and col_sample_rate in (0.1, 0.2, 0.3); optimal combination chosen via 5-fold CV |
| lasso_default | \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0 |
| lasso_1 | \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.25 |
| lasso_2 | \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.5 |
| lasso_3 | \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.75 |
Tuning parameters not mentioned in the table are set as follows:

- rf: num.trees = 500; min.node.size = 5 for continuous outcomes and 1 for binary outcomes.
- xgboost: nrounds = 1000, eta = 0.1, min_child_weight = 10, objective = "binary:logistic" for binary outcomes and objective = "reg:squarederror" for continuous outcomes.
- h2oboost: ntrees = 1000; for binary outcomes, distribution = "bernoulli", balance_classes = TRUE, fold_assignment = "Stratified", and stopping_metric = "AUC", while for continuous outcomes, distribution = "gaussian", balance_classes = FALSE, fold_assignment = "AUTO", and stopping_metric = "MSE"; in both cases, max_after_balance_size = 5, stopping_rounds = 3, stopping_tolerance = 0.001, and max_runtime_secs = 60.
5.3 Super learner
If multiple learners are specified, then a super learner ensemble (van der Laan, Polley, and Hubbard 2007) is constructed using nfold cross validation, as implemented in the R package SuperLearner (Polley et al. 2019). Specifically, the data are randomly partitioned into nfold chunks of approximately equal size. For binary outcomes, this partitioning ensures an approximately even number of sensitive and resistant pseudoviruses in each chunk. A so-called super learner library of candidate algorithms is then constructed from the selected learners:
- the algorithm mean, which returns the sample mean as the prediction for all observations, is always included;
- if cvtune="FALSE", then the default version of each learner (Section 5.2) is included;
- if cvtune="TRUE", then each choice of tuning parameters for the selected learners in Table 5.1 is included.
The cross-validated risk of each algorithm in the library is computed. For binary outcomes, mean negative log-likelihood loss is used; for continuous outcomes, mean squared-error loss is used. The single algorithm with the smallest cross-validated risk is reported as the cv selector
(also known as the discrete super learner). The super learner ensemble is constructed by selecting convex weights (i.e., each algorithm is assigned a non-negative weight and the weights sum to one) that minimize cross-validated risk.
When cvperf="TRUE" and a super learner is constructed, an additional layer of cross validation is used to evaluate the predictive performance of the super learner and of the cv selector.
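As an illustration of how such a run might be requested (a sketch only: the Docker image name, mount path, and specific bnAb are assumptions; the option names are those described in this and the preceding section):

```bash
# Hypothetical invocation: ensembles random forests, the elastic net, and
# xgboost over the tuning-parameter grid of Table 5.1, and adds an outer layer
# of cross validation to evaluate the super learner and the cv selector.
# Image name, mount path, and bnAb are illustrative assumptions.
docker run \
  -v /path/to/output:/home/output \
  -e nab="VRC01" \
  -e learners="rf;lasso;xgboost" \
  -e cvtune="TRUE" \
  -e cvperf="TRUE" \
  -e nfold="5" \
  slapnap/slapnap
```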
5.4 Variable importance
If importance_grp or importance_ind is specified, variable importance estimates are computed based on the learners. Both intrinsic and predictive importance can be obtained; we discuss each in the following two sections.
5.4.1 Intrinsic importance
Intrinsic importance may be obtained by specifying importance_grp, importance_ind, or both. We provide two types of intrinsic importance, marginal and conditional, accessed by passing "marg" and "cond", respectively, to one of the importance options. Both types of intrinsic importance are based on the population prediction potential of features (Williamson et al. 2020), as implemented in the R package vimp (B. D. Williamson, Simon, and Carone 2020). We measure prediction potential using nonparametric \(R^2\) for continuous outcomes [i.e., (estimated) IC\(_{50}\), (estimated) IC\(_{80}\), or IIP] and using the nonparametric area under the receiver operating characteristic curve (AUC) for binary outcomes [i.e., (estimated) sensitivity or multiple sensitivity]. Both marginal and conditional importance compare the population prediction potential including the feature(s) of interest to the population prediction potential excluding the feature(s) of interest; this comparison provides a measure of the intrinsic importance of the feature(s). The two types differ only in the adjustment variables considered: conditional importance compares the prediction potential of all features to the prediction potential of all features excluding the feature(s) of interest, and thus must be interpreted conditionally; whereas marginal importance compares the prediction potential of the feature(s) of interest plus geographic confounders to the prediction potential of the geographic confounders alone.
Both marginal and conditional intrinsic importance can be computed for groups of features or for individual features. The available feature groups are detailed in Section 7. Execution time may increase when intrinsic importance is requested, depending upon the other options passed to slapnap: a separate learner (or super learner ensemble) must be trained for each feature group (or individual feature) of interest. Marginal importance tends to be computed more quickly than conditional importance, but both types provide useful information about the population of interest and the underlying biology.
If intrinsic importance is requested, then point estimates, confidence intervals, and p-values (for a test of the null hypothesis that the intrinsic importance is equal to zero) will be computed and displayed for each feature or group of features of interest. All results are based on an internal sample-splitting procedure, whereby the algorithm including the feature(s) of interest is evaluated on independent data from the algorithm excluding the feature(s) of interest. This ensures that the procedure has a type I error rate of 0.05 (Williamson et al. 2020).
In the following command, we request marginal intrinsic importance for the feature groups defined in Section 7. We do not specify a super learner ensemble to reduce computation time; however, in most problems we recommend an ensemble to protect against model misspecification.
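For example (a minimal sketch: the Docker image name, mount path, bnAb, and the "report" entry passed to return are illustrative assumptions; importance_grp, learners, and return are the options described in this section):

```bash
# Hypothetical invocation: a single learner (no ensemble) with marginal
# intrinsic importance for the feature groups of Section 7. Passing "vimp"
# in return saves the raw importance objects as .rds files; the "report"
# entry and all specific values are illustrative assumptions.
docker run \
  -v /path/to/output:/home/output \
  -e nab="VRC01" \
  -e learners="lasso" \
  -e importance_grp="marg" \
  -e return="report;vimp" \
  slapnap/slapnap
```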
The raw R objects (saved as .rds files) containing the point estimates, confidence intervals, and p-values for intrinsic importance can be saved by passing "vimp" to return.
5.4.2 Predictive importance
Learner-level predictive importance may be obtained by including "pred" in the importance_ind option. If a single learner is fit, then the predictive importance is the default importance measure reported by the corresponding R package for that type of learner:

- rf: the impurity importance from ranger (Wright and Ziegler 2017) is returned. The impurity importance for a given feature is computed by taking a normalized sum of the decrease in impurity (i.e., Gini index for binary outcomes; mean squared error for continuous outcomes) over all nodes in the forest at which a split on that feature has been conducted.
- xgboost: the gain importance from xgboost (Chen et al. 2019) is returned. Interpretation is essentially the same as for rf's impurity importance.
- h2oboost: the gain importance (H2O.ai 2016) is returned. Interpretation is the same as for xgboost's gain importance.
- lasso: the absolute value of the estimated regression coefficient at the cross-validation-selected \(\lambda\) is returned.

Note that each of these importance measures has important limitations: the rf, h2oboost, and xgboost measures tend to favor features with many levels, while the lasso measure tends to favor features with few levels. Nevertheless, these commonly reported measures can provide some insight into how a given learner is making predictions.
If multiple learners are used, and thus a super learner is constructed, then the importance measures for the learner with the highest weight in the super learner are reported. If a single learner is used but cvtune="TRUE", then importance measures for the cv selector are reported.
In the following command, we request predictive importance for a simple scenario. Predictive importance is displayed for the top 15 features.
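For example (again a minimal sketch; the Docker image name, mount path, and bnAb are illustrative assumptions):

```bash
# Hypothetical invocation: a single default-tuned random forest, with
# learner-level predictive importance requested via importance_ind.
docker run \
  -v /path/to/output:/home/output \
  -e nab="VRC01" \
  -e learners="rf" \
  -e importance_ind="pred" \
  slapnap/slapnap
```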
References
Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.
Brent, Richard P. 1971. “An Algorithm with Guaranteed Convergence for Finding a Zero of a Function.” The Computer Journal 14 (4): 422–25.
Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. https://doi.org/10.1145/2939672.2939785.
Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2019. xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.
Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232. https://doi.org/10.1214/aos/1013203451.
Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.
H2O.ai. 2016. R Interface for H2O. https://github.com/h2oai/h2o-3.
Polley, Eric, Erin LeDell, Chris Kennedy, and Mark van der Laan. 2019. SuperLearner: Super Learner Prediction. https://CRAN.R-project.org/package=SuperLearner.
Shen, Lin, Susan Peterson, Ahmad R Sedaghat, Moira A McMahon, Marc Callender, Haili Zhang, Yan Zhou, et al. 2008. “Dose-Response Curve Slope Sets Class-Specific Limits on Inhibitory Potential of anti-HIV Drugs.” Nature Medicine 14 (7): 762–66. https://doi.org/10.1038/nm1777.
van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1). https://doi.org/10.2202/1544-6115.1309.
Wagh, Kshitij, Tanmoy Bhattacharya, Carolyn Williamson, Alex Robles, Madeleine Bayne, Jetta Garrity, Michael Rist, et al. 2016. “Optimal Combinations of Broadly Neutralizing Antibodies for Prevention and Treatment of HIV-1 Clade C Infection.” PLoS Pathogens 12 (3). https://doi.org/10.1371/journal.ppat.1005520.
Williamson, Brian D, Peter B Gilbert, Noah R Simon, and Marco Carone. 2020. “A Unified Approach for Inference on Algorithm-Agnostic Variable Importance.” arXiv Preprint. https://arxiv.org/abs/2004.03683.
Williamson, Brian D., Noah Simon, and Marco Carone. 2020. “vimp: Perform Inference on Algorithm-Agnostic Variable Importance.” https://CRAN.R-project.org/package=vimp.
Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.
Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.