5 Methods

5.1 Outcomes

5.1.1 Single bnAb

If a single bnAb, or a combination of bnAbs measured directly in the CATNAP data, is requested (i.e., the nab option is a single string specifying one bnAb or one measured combination from the CATNAP database), then the possible outcomes are:

  • ic50 = \(\mbox{log}_{10}(\mbox{IC}_{50})\), where IC\(_{50}\) is the half-maximal inhibitory concentration;
  • ic80 = \(\mbox{log}_{10}(\mbox{IC}_{80})\), where IC\(_{80}\) is the 80% maximal inhibitory concentration;
  • iip = \((-1)\mbox{log}_{10}(1 - \mbox{IIP})\), where IIP (Shen et al. 2008; Wagh et al. 2016) is the instantaneous inhibitory potential, computed as \[ \frac{10^m}{\mbox{IC$_{50}$}^m + 10^m} \ , \] where \(m = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{IC}_{80}) - \mbox{log}_{10}(\mbox{IC}_{50}))\); and
  • sens = sensitivity: the binary indicator that IC\(_{x}\) \(<\) sens_thresh, the user-specified sensitivity threshold. The value of \(x\) is determined by binary_outcomes, which defaults to "ic50" (i.e., \(x = 50\)) but may be set to "ic80" (i.e., \(x = 80\)).
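The IIP outcome above can be sketched numerically. The helper below is an illustration only (not slapnap's implementation); in particular, treating the fixed 10 in the numerator as an evaluation concentration `conc=10.0` is an assumption based on the formula as written.

```python
import math

def iip_outcome(ic50, ic80, conc=10.0):
    """Illustrative IIP outcome following the formulas above (a sketch,
    not slapnap's code); `conc=10.0` mirrors the fixed 10 in the formula."""
    # Hill slope: m = log10(4) / (log10(IC80) - log10(IC50))
    m = math.log10(4) / (math.log10(ic80) - math.log10(ic50))
    # neutralized fraction: conc^m / (IC50^m + conc^m)
    frac = conc ** m / (ic50 ** m + conc ** m)
    # outcome: (-1) * log10(1 - fraction)
    return -math.log10(1.0 - frac)
```

For example, with IC\(_{50}\) = 1 and IC\(_{80}\) = 4 the Hill slope is exactly 1, and the outcome reduces to \(\mbox{log}_{10}(11)\).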

5.1.2 Multiple bnAbs

If multiple bnAbs are requested (i.e., the nab option is a semicolon-separated string of more than one bnAb from CATNAP), then the possible outcomes that can be requested are:

  • ic50 = \(\mbox{log}_{10}(\mbox{estimated IC}_{50})\), where estimated IC\(_{50}\) is computed using the requested combination_method (see below);
  • ic80 = \(\mbox{log}_{10}(\mbox{estimated IC}_{80})\), where estimated IC\(_{80}\) is computed using the requested combination_method (see below);
  • iip = \((-1)\mbox{log}_{10}(1 - \mbox{IIP})\), where IIP is computed as \[ \frac{10^m}{\mbox{estimated IC$_{50}$}^m + 10^m} \ , \] where \(m = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{estimated IC}_{80}) - \mbox{log}_{10}(\mbox{estimated IC}_{50}))\);
  • estsens = estimated sensitivity: the binary indicator that estimated IC\(_{x}\) (defined above) is less than sens_thresh (where \(x\) is determined by the value of binary_outcomes); and
  • multsens = multiple sensitivity: the binary indicator that measured IC\(_{x}\) is less than the sensitivity threshold (sens_thresh) for a number of bnAbs defined by multsens_nab (where \(x\) is determined by the value of binary_outcomes).
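The two binary outcomes amount to simple indicator computations, sketched below. The function names are illustrative, and reading multsens_nab as "at least that many bnAbs" is an assumption based on the description above.

```python
def estsens(estimated_ic, sens_thresh):
    # indicator that the estimated IC value falls below the threshold
    return int(estimated_ic < sens_thresh)

def multsens(measured_ics, sens_thresh, multsens_nab):
    # indicator that at least `multsens_nab` of the measured IC values
    # fall below the threshold
    return int(sum(ic < sens_thresh for ic in measured_ics) >= multsens_nab)
```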

Possible values of combination_method used for computing estimated IC\(_{50}\) and IC\(_{80}\) are:

  • additive, the additive model of Wagh et al. (2016). For \(J\) bnAbs, \[ \mbox{estimated IC}_{50} = \left( \sum_{j=1}^J \mbox{IC}_{50,j}^{-1} \right)^{-1} \ , \] where IC\(_{50,j}\) denotes the measured IC\(_{50}\) for antibody \(j\);
  • Bliss-Hill, the Bliss-Hill model of Wagh et al. (2016). For \(J\) bnAbs, computed using Brent’s algorithm (Brent 1971) as the concentration value \(c\) that minimizes \(\lvert f_J(c) - k \rvert\), where \(k\) denotes the desired neutralization fraction (50% for IC\(_{50}\) or 80% for IC\(_{80}\)), \[ f_J(c) = 1 - \prod_{j = 1}^J \{1 - f_j(c / J)\}; \] \[ f_j(c_j) = c_j^{m_j} / (\mbox{IC}_{50,j}^{m_j} + c_j^{m_j}), \] \(m_j = \mbox{log}_{10}(4) / (\mbox{log}_{10}(\mbox{IC}_{80,j}) - \mbox{log}_{10}(\mbox{IC}_{50,j}))\), and IC\(_{50,j}\) and IC\(_{80,j}\) denote the measured IC\(_{50}\) and IC\(_{80}\) for bnAb \(j\), respectively.
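The two combination methods can be sketched as follows. This is an illustration under stated assumptions, not slapnap's implementation: a log-scale bisection stands in for Brent's algorithm, and each bnAb in the mixture is taken to be present at concentration \(c / J\).

```python
import math

def additive_ic(ics):
    # additive model: the combination IC is the inverse of the
    # sum of inverses of the single-bnAb ICs
    return 1.0 / sum(1.0 / ic for ic in ics)

def bliss_hill_ic(ic50s, ic80s, k):
    # Bliss-Hill model: find the total concentration c at which the
    # mixture (each of the J bnAbs present at c / J) neutralizes
    # fraction k of virions
    J = len(ic50s)

    def mixture_fraction(c):
        unneutralized = 1.0
        for ic50, ic80 in zip(ic50s, ic80s):
            m = math.log10(4) / (math.log10(ic80) - math.log10(ic50))
            cj = c / J
            unneutralized *= 1.0 - cj ** m / (ic50 ** m + cj ** m)
        return 1.0 - unneutralized

    # mixture_fraction is increasing in c; bisect on the log scale
    # (a stand-in for Brent's algorithm)
    lo, hi = 1e-8, 1e8
    for _ in range(200):
        mid = math.sqrt(lo * hi)
        if mixture_fraction(mid) < k:
            lo = mid
        else:
            hi = mid
    return math.sqrt(lo * hi)
```

A useful sanity check: for a single bnAb, the Bliss-Hill estimate with \(k = 0.5\) recovers the measured IC\(_{50}\).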

5.2 Learners

Three learners are available in slapnap:

  • random forests (Breiman 2001), as implemented in the R package ranger (Wright and Ziegler 2017) and accessed in slapnap by including 'rf' in learners;
  • elastic net (Zou and Hastie 2005), as implemented in glmnet (Friedman, Hastie, and Tibshirani 2010) and accessed in slapnap by including 'lasso' in learners; and
  • boosted trees (Friedman 2001; Chen and Guestrin 2016), as implemented in either xgboost (Chen et al. 2019) (accessed in slapnap by including 'xgboost' in learners) or H2O.ai (H2O.ai 2016) (accessed in slapnap by including 'h2oboost' in learners).

For each learner, there is a default choice of tuning parameters that is used if cvtune="FALSE". If instead cvtune="TRUE", then several choices of tuning parameters are evaluated using nfold cross-validation; Table 5.1 gives the full list of these learners and their tuning parameters.

Table 5.1: Labels for learners in the report and descriptions of their respective tuning parameters

  learner           Tuning parameters
  ----------------  --------------------------------------------------------------
  rf_default        mtry equal to the square root of the number of predictors
  rf_1              mtry equal to one-half times the square root of the number of predictors
  rf_2              mtry equal to two times the square root of the number of predictors
  xgboost_default   maximum tree depth equal to 4
  xgboost_1         maximum tree depth equal to 2
  xgboost_2         maximum tree depth equal to 6
  xgboost_3         maximum tree depth equal to 8
  h2oboost_default  max_depth in (2, 4, 5, 6), learn_rate in (0.05, 0.1, 0.2), and col_sample_rate in (0.1, 0.2, 0.3); optimal combination chosen via 5-fold CV
  lasso_default     \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0
  lasso_1           \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.25
  lasso_2           \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.5
  lasso_3           \(\lambda\) selected by 5-fold CV and \(\alpha\) equal to 0.75

Tuning parameters not mentioned in the table are set as follows:

  • rf: num.trees = 500, min.node.size = 5 for continuous outcomes and = 1 for binary outcomes;
  • xgboost: nrounds = 1000, eta = 0.1, min_child_weight = 10, objective = binary:logistic for binary outcomes and objective = reg:squarederror for continuous outcomes; and
  • h2oboost: ntrees = 1000; for binary outcomes, distribution = "bernoulli", balance_classes = TRUE, fold_assignment = "Stratified", stopping_metric = "AUC", while for continuous outcomes, distribution = "gaussian", balance_classes = FALSE, fold_assignment = "AUTO", stopping_metric = "MSE"; and max_after_balance_class_size = 5, stopping_rounds = 3, stopping_tolerance = 0.001, max_runtime_secs = 60.

5.3 Super learner

If multiple learners are specified, then a super learner ensemble (van der Laan, Polley, and Hubbard 2007) is constructed using nfold cross validation, as implemented in the R package SuperLearner (Polley et al. 2019). Specifically, the data are randomly partitioned into nfold chunks of approximately equal size. For binary outcomes, this partitioning is done in such a way as to ensure an approximately even number of sensitive/resistant pseudoviruses in each chunk. A so-called super learner library of candidate algorithms is constructed by including different learners:

  • the algorithm mean, which reports back the sample mean as the prediction for all observations, is always included;
  • if cvtune="FALSE", then the default version of each learner (Section 5.2) is included; and
  • if cvtune="TRUE", then each choice of tuning parameters for the selected learners in Table 5.1 is included.

The cross-validated risk of each algorithm in the library is computed. For binary outcomes, mean negative log-likelihood loss is used; for continuous outcomes, mean squared-error loss is used. The single algorithm with the smallest cross-validated risk is reported as the cv selector (also known as the discrete super learner). The super learner ensemble is constructed by selecting convex weights (i.e., each algorithm is assigned a non-negative weight and the weights sum to one) that minimize cross-validated risk.
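For intuition, the cv selector and the ensemble weighting can be sketched from a matrix of cross-validated predictions. This is a simplification under stated assumptions: a two-learner grid search stands in for the general convex optimization that the SuperLearner package performs, and mean squared error is used (the continuous-outcome case).

```python
import numpy as np

def cv_risk(pred, y):
    # mean squared-error risk (continuous-outcome case)
    return float(np.mean((pred - y) ** 2))

def discrete_sl(Z, y):
    # cv selector: index of the learner (column of the cross-validated
    # prediction matrix Z) with the smallest cross-validated risk
    return int(np.argmin([cv_risk(Z[:, k], y) for k in range(Z.shape[1])]))

def two_learner_weights(Z, y, step=0.01):
    # convex weights (w, 1 - w) minimizing cross-validated risk, found
    # by grid search; real implementations solve this for any number
    # of learners under the same simplex constraint
    grid = np.arange(0.0, 1.0 + step, step)
    risks = [cv_risk(w * Z[:, 0] + (1 - w) * Z[:, 1], y) for w in grid]
    w = float(grid[int(np.argmin(risks))])
    return (w, 1.0 - w)
```

If one learner's cross-validated predictions are clearly better, both the cv selector and the ensemble weights concentrate on that learner.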

When cvperf="TRUE" and a super learner is constructed, an additional layer of cross validation is used to evaluate the predictive performance of the super learner and of the cv selector.

5.4 Variable importance

If importance_grp or importance_ind is specified, variable importance estimates are computed based on the learners. Both intrinsic and prediction importance can be obtained; we discuss each in the following two sections.

5.4.1 Intrinsic importance

Intrinsic importance may be obtained by specifying importance_grp, importance_ind, or both. We provide two types of intrinsic importance: marginal and conditional, accessed by passing "marg" and "cond", respectively, to one of the importance options. Both types of intrinsic importance are based on the population prediction potential of features (Williamson et al. 2020), as implemented in the R package vimp (B. D. Williamson, Simon, and Carone 2020). We measure prediction potential using nonparametric \(R^2\) for continuous outcomes [i.e., (estimated) IC\(_{50}\), (estimated) IC\(_{80}\), or IIP] and using the nonparametric area under the receiver operating characteristic curve (AUC) for binary outcomes [i.e., (estimated) sensitivity or multiple sensitivity]. Both marginal and conditional importance compare the population prediction potential including the feature(s) of interest to the population prediction potential excluding the feature(s) of interest; this provides a measure of the intrinsic importance of the feature(s). The two types of intrinsic importance differ only in the other adjustment variables that we consider: conditional importance compares the prediction potential of all features to the prediction potential of all features excluding the feature(s) of interest, and thus importance must be interpreted conditionally; whereas marginal importance compares the prediction potential of the feature(s) of interest plus geographic confounders to the prediction potential of the geographic confounders alone.

Both marginal and conditional intrinsic importance can be computed for groups of features or individual features. The available feature groups are detailed in Section 7. Execution time may increase when intrinsic importance is requested, depending upon the other options passed to slapnap: a separate learner (or super learner ensemble) must be trained for each feature group (or individual feature) of interest. Marginal importance tends to be computed more quickly than conditional importance, but both types of importance provide useful information about the population of interest and the underlying biology.

If intrinsic importance is requested, then point estimates, confidence intervals, and p-values (for a test of the null hypothesis that the intrinsic importance is equal to zero) are computed and displayed for each feature or group of features of interest. All results are based on an internal sample-splitting procedure, whereby the algorithm including the feature(s) of interest and the algorithm excluding the feature(s) of interest are evaluated on independent portions of the data. This ensures that the procedure controls the type I error rate at 0.05 (Williamson et al. 2020).
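The core idea — compare predictiveness with and without the feature(s) of interest, evaluated on independent splits — can be illustrated with a simple sketch. Here ordinary least squares is a stand-in for the flexible learners slapnap actually uses, and all helper names are hypothetical.

```python
import numpy as np

def r_squared(y, pred):
    # R^2 measure of predictiveness: 1 - MSE / Var(y)
    return 1.0 - np.mean((y - pred) ** 2) / np.var(y)

def linear_fit_predict(X_train, y_train, X_test):
    # least squares with an intercept, as a stand-in learner
    A = np.column_stack([np.ones(len(X_train)), X_train])
    coef, *_ = np.linalg.lstsq(A, y_train, rcond=None)
    return np.column_stack([np.ones(len(X_test)), X_test]) @ coef

def conditional_importance(X, y, drop_idx, seed=0):
    # sample splitting: the full model is fit and evaluated on one half,
    # the reduced model (feature dropped) on the other half
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    a, b = idx[: len(y) // 2], idx[len(y) // 2 :]
    full = r_squared(y[a], linear_fit_predict(X[a], y[a], X[a]))
    X_red = np.delete(X, drop_idx, axis=1)
    reduced = r_squared(y[b], linear_fit_predict(X_red[b], y[b], X_red[b]))
    return full - reduced
```

A strongly predictive feature yields a large positive importance estimate, while a pure-noise feature yields an estimate near zero.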

In the following command, we request marginal intrinsic importance for the feature groups defined in Section 7. We do not specify a super learner ensemble to reduce computation time; however, in most problems we recommend an ensemble to protect against model misspecification.

docker run -v /path/to/local/dir:/home/output \
           -e importance_grp="marg" \
           slapnap/slapnap

The raw R objects (saved as .rds files) containing the point estimates, confidence intervals, and p-values for intrinsic importance can be saved by passing "vimp" to return.

5.4.2 Predictive importance

Learner-level predictive importance may be obtained by including "pred" in the importance_ind option. If a single learner is fit, then the predictive importance is the R default for that type of learner:

  • rf: the impurity importance from ranger (Wright and Ziegler 2017) is returned. The impurity importance for a given feature is computed by taking a normalized sum of the decrease in impurity (i.e., Gini index for binary outcomes; mean squared-error for continuous outcomes) over all nodes in the forest at which a split on that feature has been conducted.
  • xgboost: the gain importance from xgboost (Chen et al. 2019) is returned. Interpretation is essentially the same as for rf’s impurity importance.
  • h2oboost: the gain importance (H2O.ai 2016) is returned. Interpretation is the same as for xgboost’s gain importance.
  • lasso: the absolute value of the estimated regression coefficient at the cross-validation-selected \(\lambda\) is returned.

Note that these importance measures each have important limitations: the rf, h2oboost, and xgboost measures will tend to favor features with many levels, while the lasso variable importance will tend to favor features with few levels. Nevertheless, these commonly reported measures can provide some insight into how a given learner is making predictions.

If multiple learners are used, and thus a super learner is constructed, then the importance measures for the learner with the highest weight in the super learner are reported.

If a single learner is used but cvtune="TRUE", then importance measures for the cv selector are reported.

In the following command, we request predictive importance for a simple scenario. Predictive importance is displayed for the top 15 features.

docker run -v /path/to/local/dir:/home/output \
           -e importance_ind="pred" \
           slapnap/slapnap

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1): 5–32. https://doi.org/10.1023/A:1010933404324.

Brent, Richard P. 1971. “An Algorithm with Guaranteed Convergence for Finding a Zero of a Function.” The Computer Journal 14 (4): 422–25.

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. https://doi.org/10.1145/2939672.2939785.

Chen, Tianqi, Tong He, Michael Benesty, Vadim Khotilovich, Yuan Tang, Hyunsu Cho, Kailong Chen, et al. 2019. xgboost: Extreme Gradient Boosting. https://CRAN.R-project.org/package=xgboost.

Friedman, Jerome H. 2001. “Greedy Function Approximation: A Gradient Boosting Machine.” Annals of Statistics, 1189–1232. https://doi.org/10.1214/aos/1013203451.

Friedman, Jerome, Trevor Hastie, and Robert Tibshirani. 2010. “Regularization Paths for Generalized Linear Models via Coordinate Descent.” Journal of Statistical Software 33 (1): 1–22. https://doi.org/10.18637/jss.v033.i01.

H2O.ai. 2016. R Interface for H2O. https://github.com/h2oai/h2o-3.

Polley, Eric, Erin LeDell, Chris Kennedy, and Mark van der Laan. 2019. SuperLearner: Super Learner Prediction. https://CRAN.R-project.org/package=SuperLearner.

Shen, Lin, Susan Peterson, Ahmad R Sedaghat, Moira A McMahon, Marc Callender, Haili Zhang, Yan Zhou, et al. 2008. “Dose-Response Curve Slope Sets Class-Specific Limits on Inhibitory Potential of anti-HIV Drugs.” Nature Medicine 14 (7): 762–66. https://doi.org/10.1038/nm1777.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1). https://doi.org/10.2202/1544-6115.1309.

Wagh, Kshitij, Tanmoy Bhattacharya, Carolyn Williamson, Alex Robles, Madeleine Bayne, Jetta Garrity, Michael Rist, et al. 2016. “Optimal Combinations of Broadly Neutralizing Antibodies for Prevention and Treatment of HIV-1 Clade C Infection.” PLoS Pathogens 12 (3). https://doi.org/10.1371/journal.ppat.1005520.

Williamson, Brian D, Peter B Gilbert, Noah R Simon, and Marco Carone. 2020. “A Unified Approach for Inference on Algorithm-Agnostic Variable Importance.” arXiv Preprint. https://arxiv.org/abs/2004.03683.

Williamson, Brian D., Noah Simon, and Marco Carone. 2020. “vimp: Perform Inference on Algorithm-Agnostic Variable Importance.” https://CRAN.R-project.org/package=vimp.

Wright, Marvin N., and Andreas Ziegler. 2017. “ranger: A Fast Implementation of Random Forests for High Dimensional Data in C++ and R.” Journal of Statistical Software 77 (1): 1–17. https://doi.org/10.18637/jss.v077.i01.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2): 301–20. https://doi.org/10.1111/j.1467-9868.2005.00503.x.