1 Executive summary

The broadly neutralizing antibody (bNAb) studied in this analysis is VRC01. The analysis considered 2 measures of neutralization sensitivity: IC$_{80}$ and sensitivity. Sensitivity is defined by the binary indicator that IC$_{80}$ < 1. Based on this specification of bNAb and outcomes:

828 sequences were extracted from the CATNAP database (Yoon et al. 2015);
827 sequences had complete geographic and genetic sequence information;
572 of these sequences had measured IC$_{80}$;
out of the sequences with complete data, 223 were sensitive to the bNAb, while 349 were resistant.
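
As a minimal sketch of the sensitivity definition above (the binary indicator that IC$_{80}$ < 1), the following assumes a plain list of hypothetical IC$_{80}$ values; the function name `is_sensitive` and the example values are illustrative and are not taken from the analysis code.

```python
def is_sensitive(ic80, threshold=1.0):
    """Return 1 if IC80 is strictly below the threshold (sensitive), else 0 (resistant)."""
    return int(ic80 < threshold)


# Hypothetical IC80 values, for illustration only.
ic80_values = [0.12, 0.95, 1.0, 3.4, 50.0]

labels = [is_sensitive(v) for v in ic80_values]
n_sensitive = sum(labels)
n_resistant = len(labels) - n_sensitive
print(labels, n_sensitive, n_resistant)
```

Note that a value exactly equal to 1 is counted as resistant under the strict inequality.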

Prediction of each outcome was performed using a super learner ensemble (van der Laan, Polley, and Hubbard 2007) of several random forests (Breiman 2001), several gradient boosted trees (Chen and Guestrin 2016), and several elastic net regressions (Zou and Hastie 2005), each with varied tuning parameters, together with an intercept-only regression. Each algorithm (excepting xgboost) was additionally implemented in combination with variable pre-screening procedures that retained only binary features with at least 0, 4, or 8 minority variants. These screens retained a total of 6500/6513, 3841/6513, and 3074/6513 features, respectively.
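
The pre-screening step can be sketched as follows. This is an illustration only: it assumes that the number of minority variants of a binary feature is the count of observations carrying its less common value, and the feature names and data are hypothetical.

```python
def minority_count(column):
    """Number of observations carrying the less common value of a 0/1 feature."""
    ones = sum(column)
    return min(ones, len(column) - ones)


def screen_features(features, min_minority):
    """Names of binary features with at least `min_minority` minority variants."""
    return [name for name, col in features.items()
            if minority_count(col) >= min_minority]


# Hypothetical binary features measured on 10 sequences.
features = {
    "site_a": [1, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # 1 minority variant
    "site_b": [1, 1, 1, 1, 0, 0, 0, 0, 0, 0],  # 4 minority variants
    "site_c": [1, 1, 1, 1, 1, 0, 0, 0, 0, 0],  # 5 minority variants
    "site_d": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0],  # constant feature
}

# One screened feature set per threshold used in the report.
kept = {k: screen_features(features, k) for k in (0, 4, 8)}
print(kept)
```

Each screened feature set would then be passed to the corresponding learner, so that a stricter threshold yields a smaller set of candidate predictors.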

The specific algorithms used in the learning process are described in Table 1.1.

Table 1.1: Algorithms used in the super learner library. Each algorithm (excepting xgboost) was additionally implemented in combination with variable pre-screening procedures that retained only binary features with at least 0, 4, or 8 minority variants.
Label	Description
rf_tune1	random forest with `mtry` equal to one-half times square root of number of predictors
rf_default	random forest with `mtry` equal to square root of number of predictors
rf_tune2	random forest with `mtry` equal to two times square root of number of predictors
xgboost_default	boosted regression trees with maximum depth of 4
xgboost_tune3	boosted regression trees with maximum depth of 8
xgboost_tune4	boosted regression trees with maximum depth of 12
lasso_default	elastic net with $\lambda$ selected by CV and $\alpha$ equal to 0
lasso_tune1	elastic net with $\lambda$ selected by 5-fold CV and $\alpha$ equal to 0.25
lasso_tune2	elastic net with $\lambda$ selected by 5-fold CV and $\alpha$ equal to 0.5
lasso_tune3	elastic net with $\lambda$ selected by 5-fold CV and $\alpha$ equal to 0.75
mean	intercept only regression
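
To make the three random forest settings in Table 1.1 concrete, the sketch below computes `mtry` for a given number of predictors. Rounding down to an integer (with a minimum of 1) is an assumption made here for illustration; the table does not state how fractional values are handled. The count of 6513 is the total number of features reported above, and fewer predictors remain after screening.

```python
import math


def mtry_candidates(n_predictors):
    """mtry values for the three random forest variants listed in Table 1.1."""
    base = math.sqrt(n_predictors)
    multipliers = {"rf_tune1": 0.5, "rf_default": 1.0, "rf_tune2": 2.0}
    # Assumption: fractional values are rounded down, never below 1.
    return {label: max(1, math.floor(m * base)) for label, m in multipliers.items()}


mtry = mtry_candidates(6513)
print(mtry)
```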

The predictive ability of the learner was assessed using cross-validation. The estimated cross-validated $R^2$ of the learner for predicting IC$_{80}$ is shown in Table 1.2. The estimated cross-validated area under the receiver operating characteristic curve (AUC) of the learner for predicting sensitivity is shown in Table 1.3.

Table 1.2: Estimates of 5-fold cross-validated $R^2$ for predictions of IC$_{80}$ (n = 572).
	CV-R$^2$	Lower 95% CI	Upper 95% CI
IC$_{80}$	0.342	0.262	0.414

Table 1.3: Estimates of 5-fold cross-validated AUC for predictions of sensitivity (n = 572).
	CV-AUC	Lower 95% CI	Upper 95% CI
Sensitivity	0.777	0.676	0.852
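
The two performance measures reported in Tables 1.2 and 1.3 can be illustrated with a small sketch. It assumes that out-of-fold predictions have already been collected into plain lists and computes point estimates only; the estimators and confidence intervals used in the report may differ in detail, and the data below are hypothetical.

```python
def r_squared(observed, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def auc(labels, scores):
    """Probability that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Hypothetical out-of-fold predictions for a continuous outcome.
cv_r2 = r_squared([1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5])

# Hypothetical out-of-fold predicted probabilities for a binary outcome.
cv_auc = auc([1, 1, 0, 0, 1], [0.9, 0.4, 0.5, 0.2, 0.7])
print(round(cv_r2, 3), round(cv_auc, 3))
```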

We define the marginal biological importance of a subgroup of features as the difference in population predictiveness between the best possible prediction function based on the features under consideration plus geographic confounders versus only geographic confounders (Williamson et al. 2020). In Table 1.4, we display the groups of variables and their ranked marginal biological variable importance for predicting each outcome. The groups are displayed in order of decreasing average rank across outcomes. For variable group definitions, please refer to Table 4.1.

Table 1.4: Ranked marginal variable importance of groups relative to the group of geographic confounders for predicting each outcome. Importance is measured via $R^2$ for IC$_{80}$ and AUC for sensitivity. Stars next to ranks denote groups with p-value less than 0.05 from a hypothesis test with null hypothesis of zero importance. (n = 572; for estimating the prediction functions based on the feature group of interest, n = 286; for estimating the prediction functions based on the group of geographic confounders, n = 286)
Variable group	IC$_{80}$	Sensitivity
gp120 CD4 binding sites	1*	2
gp120 V2	2	1
gp120 V3	4	3
Region-specific counts of PNG sites	5	5
gp41 MPER	3	7
Cysteine counts	7	4
Viral geometry	6	6
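
The marginal importance defined above is a difference in predictiveness between two prediction functions, each estimated on its own half of the data (n = 286 each). The sketch below shows only the plug-in difference for $R^2$ with hypothetical predictions; it does not reproduce the inference (confidence intervals and p-values) of Williamson et al. (2020), and all names are illustrative.

```python
def r_squared(observed, predicted):
    """1 minus the ratio of residual to total sum of squares."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean_obs) ** 2 for y in observed)
    return 1 - ss_res / ss_tot


def marginal_importance(obs_a, pred_group_plus_geo, obs_b, pred_geo_only):
    """Predictiveness with the feature group plus confounders, minus confounders only."""
    return r_squared(obs_a, pred_group_plus_geo) - r_squared(obs_b, pred_geo_only)


# Hypothetical split: half A evaluates the feature group plus geographic
# confounders, half B evaluates geographic confounders only.
obs_a, pred_a = [1.0, 2.0, 3.0, 4.0], [1.5, 1.5, 3.5, 3.5]
obs_b, pred_b = [1.0, 2.0, 3.0, 4.0], [2.0, 2.0, 3.0, 3.0]

importance = marginal_importance(obs_a, pred_a, obs_b, pred_b)
print(round(importance, 3))
```

For the sensitivity outcome the same difference would be taken between two AUC values instead of two $R^2$ values.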

2 Results for IC$_{80}$

2.1 Descriptive statistics

A summary of the distribution of IC$_{80}$ for the selected bNAb is shown in Figure 2.1.

Figure 2.1: Histogram of IC$_{80}$ for the selected bNAb. Top row = original scale; bottom row = log$_{10}$ scale (n = 572 observations).

2.2 Super learner results

The weights assigned to each algorithm for Super Learner predicting IC$_{80}$ are shown in Table 2.1.

Table 2.1: Table of super learner weights for IC$_{80}$ (n = 572 observations).
Learner	Weight
rf_tune1_screen0	0.00
rf_default_screen0	0.00
rf_tune2_screen0	0.00
lasso_default_screen0	0.21
lasso_tune1_screen0	0.00
lasso_tune2_screen0	0.00
lasso_tune3_screen0	0.00
rf_tune1_screen4	0.00
rf_default_screen4	0.00
rf_tune2_screen4	0.01
lasso_default_screen4	0.00
lasso_tune1_screen4	0.00
lasso_tune2_screen4	0.09
lasso_tune3_screen4	0.00
rf_tune1_screen8	0.00
rf_default_screen8	0.00
rf_tune2_screen8	0.00
lasso_default_screen8	0.00
lasso_tune1_screen8	0.00
lasso_tune2_screen8	0.00
lasso_tune3_screen8	0.00
xgboost_default	0.38
xgboost_tune3	0.24
xgboost_tune4	0.06
mean	0.00
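
A super learner prediction is a weighted combination of the predictions of its constituent algorithms. The sketch below uses the non-zero weights from Table 2.1 with hypothetical per-learner predictions for a single sequence; the function name and the prediction values are illustrative assumptions.

```python
# Non-zero weights copied from Table 2.1 (rounded to two decimals in the report).
weights = {
    "lasso_default_screen0": 0.21,
    "rf_tune2_screen4": 0.01,
    "lasso_tune2_screen4": 0.09,
    "xgboost_default": 0.38,
    "xgboost_tune3": 0.24,
    "xgboost_tune4": 0.06,
}


def ensemble_prediction(weights, predictions):
    """Weighted sum of the individual learners' predictions."""
    return sum(w * predictions[name] for name, w in weights.items())


# Hypothetical predictions from each learner for one sequence.
predictions = {name: 1.0 for name in weights}

total_weight = sum(weights.values())
heaviest = max(weights, key=weights.get)
combined = ensemble_prediction(weights, predictions)
print(round(total_weight, 2), heaviest, round(combined, 2))
```

Because the tabulated weights are rounded, they sum to 0.99 rather than exactly 1.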

2.3 Predictive performance

The cross-validated $R^2$ of the super learner and its constituent algorithms (described in Table 1.1) for predicting IC$_{80}$ is shown in Figure 2.2.

Figure 2.2: Cross-validated $R^2$ for IC$_{80}$ (n = 572 observations)

Figure 2.3 shows cross-validated predictions of IC$_{80}$ plotted against observed values of IC$_{80}$, colored by cross-validation fold.

Figure 2.3: Cross-validated super learner predicted log$_{10}$(IC$_{80}$) plotted against observed value (n = 572 observations). Colors correspond to cross-validation folds.

2.4 Variable importance

2.4.1 Biological importance

We show the biological variable importance of groups of features (defined in Table 4.1) in predicting IC$_{80}$ in Figure 2.4. Importance is defined using the difference in $R^2$ values. The plot shows the marginal biological importance of the group relative to the null model with geographic confounders only.

Figure 2.4: Group biological variable importance for predicting IC$_{80}$. 95% confidence intervals and stars denoting p-values less than 0.05 are displayed in blue. (n = 572; for estimating the prediction function based on geographic confounders only, n = 286; for estimating the prediction function based on the feature group of interest plus geographic confounders, n = 286)

We show the biological variable importance of individual features in predicting IC$_{80}$ in Figure 2.5. Importance is defined using the difference in $R^2$ values. The plot shows the marginal biological importance of the feature relative to the null model with geographic confounders only.

Figure 2.5: Individual biological variable importance for predicting IC$_{80}$. 95% confidence intervals are displayed in blue. (n = 572; for estimating the prediction function based on geographic confounders only, n = 286; for estimating the prediction function based on the feature of interest plus geographic confounders, n = 286)

2.4.2 Predictive importance

Table 2.2 shows the top 20 features in terms of their predictive importance. Specifically, the algorithm with the largest weight in the super learner ensemble was selected, and the variable importance metrics associated with this algorithm are shown. In this case, the highest weight was assigned to an xgboost algorithm, and thus the variable importance measures presented are xgboost gain importance measures, with features shown in order of their rank. Gain measures the improvement in accuracy brought by a given feature to the tree branches on which it appears. The essential idea is that before a split on a given feature is added to a branch, some observations may be poorly predicted, while after the split is added, each resultant branch is more accurate. Gain measures this change in accuracy.
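
The idea behind gain can be illustrated with a simplified sketch that measures the reduction in squared error achieved by splitting on a binary feature. This is not the exact quantity computed by xgboost, which uses gradient statistics and regularization terms; the function names and data here are hypothetical.

```python
def sse(values):
    """Sum of squared deviations from the mean (0 for an empty branch)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_gain(feature, outcome):
    """Reduction in squared error from splitting the outcome on a 0/1 feature."""
    left = [y for x, y in zip(feature, outcome) if x == 0]
    right = [y for x, y in zip(feature, outcome) if x == 1]
    return sse(outcome) - (sse(left) + sse(right))


# Hypothetical outcome values and two candidate binary features.
outcome = [1.0, 1.2, 3.0, 3.2]
informative = [0, 0, 1, 1]    # separates low from high outcomes
uninformative = [0, 1, 0, 1]  # mixes low and high outcomes

gain_informative = split_gain(informative, outcome)
gain_uninformative = split_gain(uninformative, outcome)
print(round(gain_informative, 3), round(gain_uninformative, 3))
```

The informative feature yields a much larger reduction in error, which is why it would rank higher under a gain-based importance measure.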

Table 2.2: The top 20 important features for predicting IC$_{80}$ as measured by their algorithm-specific importance.
Feature	Importance
hxb2.456.R.1mer	hxb2.456.R.1mer
hxb2.459.G.1mer	hxb2.459.G.1mer
hxb2.234.sequon_actual.1mer	hxb2.234.sequon_actual.1mer
num.sequons.gp120	num.sequons.gp120
hxb2.364.H.1mer	hxb2.364.H.1mer
hxb2.471.G.1mer	hxb2.471.G.1mer
hxb2.268.E.1mer	hxb2.268.E.1mer
hxb2.279.D.1mer	hxb2.279.D.1mer
hxb2.65.A.1mer	hxb2.65.A.1mer
num.sequons.env	num.sequons.env
hxb2.853.A.1mer	hxb2.853.A.1mer
length.gp120	length.gp120
hxb2.106.T.1mer	hxb2.106.T.1mer
hxb2.403.T.1mer	hxb2.403.T.1mer
hxb2.154.M.1mer	hxb2.154.M.1mer
hxb2.463.D.1mer	hxb2.463.D.1mer
hxb2.363.P.1mer	hxb2.363.P.1mer
hxb2.202.A.1mer	hxb2.202.A.1mer
hxb2.223.F.1mer	hxb2.223.F.1mer
subtype.is.D	subtype.is.D

3 Results for sensitivity

3.1 Descriptive statistics

Out of the sequences with complete data, 223 were estimated to be sensitive to the bNAb, while 349 were estimated to be resistant, where sensitivity was defined as the indicator that IC$_{80}$ was less than 1.

3.2 Super learner results

The weights assigned to each algorithm for Super Learner predicting sensitivity are shown in Table 3.1.

Table 3.1: Table of super learner weights for sensitivity (n = 572 observations).
Learner	Weight
rf_tune1_screen0	0.00
rf_default_screen0	0.00
rf_tune2_screen0	0.00
lasso_default_screen0	0.00
lasso_tune1_screen0	0.00
lasso_tune2_screen0	0.00
lasso_tune3_screen0	0.00
rf_tune1_screen4	0.00
rf_default_screen4	0.19
rf_tune2_screen4	0.00
lasso_default_screen4	0.00
lasso_tune1_screen4	0.00
lasso_tune2_screen4	0.00
lasso_tune3_screen4	0.00
rf_tune1_screen8	0.00
rf_default_screen8	0.00
rf_tune2_screen8	0.25
lasso_default_screen8	0.34
lasso_tune1_screen8	0.00
lasso_tune2_screen8	0.00
lasso_tune3_screen8	0.00
xgboost_default	0.21
xgboost_tune3	0.00
xgboost_tune4	0.00
mean	0.00

3.3 Predictive performance

The cross-validated area under the ROC curve of super learner predictions of sensitivity relative to candidate algorithms is shown in Figure 3.1. Figure 3.2 shows cross-validated ROC curves for this endpoint.

The cross-validated area under the ROC curve of the learner with tuning parameters and optimal pre-screening selected via cross-validation and learners with each individual value of tuning parameters are shown in Figure 3.2.

Figure 3.1: Cross-validated AUC for predicting sensitivity (n = 572 observations).

Figure 3.2 shows the cross-validated ROC curve for predicting sensitivity.

Figure 3.2: Cross-validated ROC curve for the super learner, discrete super learner, and single best performing algorithm for predicting sensitivity (n = 572 observations).

Figure 3.3: Cross-validated predicted probabilities of sensitivity made by super learner, discrete super learner, and single best performing algorithm colored by cross-validation fold (n = 572 observations).

3.4 Variable importance

3.4.1 Biological importance

We show the biological variable importance of groups of features (defined in Table 4.1) in predicting sensitivity in Figure 3.4. Importance is defined using the difference in AUCs. The plot shows the marginal biological importance of the group relative to the null model with geographic confounders only.

Figure 3.4: Group biological variable importance for predicting sensitivity. 95% confidence intervals are displayed in blue. (n = 572; for estimating the prediction function based on geographic confounders only, n = 286; for estimating the prediction function based on the feature group of interest plus geographic confounders, n = 286)

We show the biological variable importance of individual features in predicting sensitivity in Figure 3.5. Importance is defined using the difference in AUCs. The plot shows the marginal biological importance of the feature relative to the null model with geographic confounders only.

Figure 3.5: Individual biological variable importance for predicting sensitivity. 95% confidence intervals and stars denoting p-values less than 0.05 are displayed in blue. (n = 572; for estimating the prediction function based on geographic confounders only, n = 286; for estimating the prediction function based on the feature of interest plus geographic confounders, n = 286)

3.4.2 Predictive importance

Table 3.2 shows the top 20 features in terms of their predictive importance. Specifically, the algorithm with the largest weight in the super learner ensemble was selected, and the variable importance metrics associated with this algorithm are shown. In this case, the highest weight was assigned to a lasso algorithm, and thus the variable importance measures presented correspond to the magnitude of the coefficient for the model with $\lambda$ chosen via cross-validation. Overall, 87 features had a non-zero coefficient in the final lasso fit.

Table 3.2: The top 20 important features for predicting sensitivity as measured by their algorithm-specific importance.
Feature	Importance
hxb2.459.G.1mer	1.092
hxb2.147.M.1mer	0.882
hxb2.463.D.1mer	0.569
hxb2.252.R.1mer	0.460
hxb2.460.sequon_actual.1mer	-0.448
hxb2.456.R.1mer	0.398
hxb2.149.N.1mer	-0.396
hxb2.172.T.1mer	0.396
hxb2.304.R.1mer	0.376
hxb2.719.T.1mer	-0.356
hxb2.805.R.1mer	0.338
hxb2.164.A.1mer	0.337
hxb2.236.T.1mer	-0.332
hxb2.403.T.1mer	-0.327
hxb2.463.R.1mer	-0.322
hxb2.234.T.1mer	0.291
hxb2.268.E.1mer	-0.278
hxb2.234.sequon_actual.1mer	-0.277
hxb2.471.G.1mer	0.273
hxb2.106.T.1mer	0.258
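
The ordering in Table 3.2 follows the magnitude of each coefficient, regardless of sign. A minimal sketch of that ranking, using a handful of coefficients copied from the table plus one hypothetical zero-coefficient entry (`unused_feature`) to show how unselected features drop out:

```python
# Coefficients copied from Table 3.2, except `unused_feature`, which is a
# hypothetical placeholder for a feature the lasso did not select.
coefficients = {
    "hxb2.460.sequon_actual.1mer": -0.448,
    "hxb2.252.R.1mer": 0.460,
    "hxb2.459.G.1mer": 1.092,
    "unused_feature": 0.0,
    "hxb2.463.D.1mer": 0.569,
    "hxb2.147.M.1mer": 0.882,
}


def rank_by_magnitude(coefficients, top=20):
    """Features with non-zero coefficients, sorted by decreasing absolute value."""
    nonzero = [(name, c) for name, c in coefficients.items() if c != 0]
    return sorted(nonzero, key=lambda item: abs(item[1]), reverse=True)[:top]


ranked = rank_by_magnitude(coefficients)
for name, coef in ranked:
    print(name, coef)
```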

4 Variable group definitions

Table 4.1 provides the individual HXB2 coordinates and variable names of the variables that make up each of the variable groups considered for biological importance.

Table 4.1: Individual variables within each variable group. Numeric codes followed by a single letter denote the presence of an amino acid (AA) residue at a given site (relative to HXB2). Other suffixes are: ‘sequon_actual’, referring to the site having a leading AA for the canonical N-linked glycosylation motif (N[!P][S/T]; in other words, this AA will be an N, and the following two AAs will conform to the motif); ‘gap’, referring to an observed gap at this site after alignment to maintain site-specific relevance; and ‘frameshift’, referring to a gap at this site that resulted in a frameshift. The prefix ‘num’ denotes the number (e.g., ‘num.cysteines’ refers to the number of cysteines), while the prefix ‘length’ denotes the length of the specified region (excluding gaps and frameshifts).
	Variables
gp120_cd4bs	61.F, 61.H, 61.I, 61.L, 61.Q, 61.T, 61.V, 61.Y, 62.A, 62.D, 62.E, 62.G, 62.H, 62.I, 62.K, 62.M, 62.N, 62.R, 62.S, 62.T, 62.V, 62.Y, 66.H, 66.R, 66.X, 120.I, 120.T, 120.V, 124.F, 124.H, 124.I, 124.P, 124.Y, 125.F, 125.I, 125.L, 125.M, 125.X, 127.I, 127.V, 182.A, 182.E, 182.H, 182.I, 182.K, 182.L, 182.M, 182.N, 182.Q, 182.S, 182.T, 182.V, 182.X, 197.D, 197.I, 197.K, 197.N, 197.R, 197.S, 197.T, 198.A, 198.I, 198.S, 198.T, 198.V, 204.A, 204.E, 204.S, 204.T, 204.V, 206.P, 206.S, 206.T, 209.N, 209.S, 209.T, 274.A, 274.C, 274.F, 274.G, 274.S, 274.T, 274.V, 274.gap, 276.D, 276.E, 276.H, 276.K, 276.N, 276.S, 276.X, 276.gap, 279.A, 279.C, 279.D, 279.E, 279.I, 279.K, 279.N, 279.Q, 279.R, 279.S, 280.A, 280.D, 280.N, 280.S, 280.T, 280.X, 281.A, 281.E, 281.G, 281.H, 281.I, 281.L, 281.R, 281.S, 281.T, 281.V, 282.E, 282.G, 282.H, 282.K, 282.N, 282.P, 282.Q, 282.R, 282.S, 282.Y, 283.A, 283.I, 283.N, 283.P, 283.S, 283.T, 283.V, 283.X, 304.E, 304.G, 304.I, 304.K, 304.L, 304.R, 304.S, 304.V, 304.W, 318.F, 318.H, 318.N, 318.Q, 318.R, 318.S, 318.V, 318.W, 318.Y, 326.A, 326.I, 326.M, 326.P, 326.S, 326.T, 362.A, 362.C, 362.D, 362.E, 362.F, 362.G, 362.K, 362.M, 362.N, 362.Q, 362.R, 362.S, 362.T, 362.V, 362.X, 362.gap, 363.A, 363.E, 363.G, 363.H, 363.I, 363.K, 363.L, 363.M, 363.N, 363.P, 363.Q, 363.R, 363.S, 363.T, 363.V, 363.X, 365.A, 365.G, 365.I, 365.L, 365.N, 365.P, 365.R, 365.S, 365.T, 365.V, 366.E, 366.G, 367.G, 367.S, 367.X, 369.A, 369.E, 369.I, 369.L, 369.P, 369.Q, 369.S, 369.T, 369.V, 370.E, 370.X, 374.F, 374.H, 374.L, 374.X, 374.Y, 386.D, 386.K, 386.N, 386.S, 386.T, 386.X, 386.Y, 392.D, 392.E, 392.F, 392.H, 392.I, 392.K, 392.L, 392.N, 392.P, 392.Q, 392.S, 392.T, 392.X, 392.Y, 392.gap, 425.K, 425.N, 425.R, 425.X, 426.A, 426.I, 426.K, 426.L, 426.M, 426.R, 426.S, 426.T, 426.V, 427.L, 427.W, 427.gap, 428.H, 428.I, 428.K, 428.M, 428.Q, 428.T, 428.V, 428.X, 429.A, 429.D, 429.E, 429.G, 429.K, 429.Q, 429.R, 429.S, 429.T, 430.A, 430.G, 430.I, 430.Q, 430.S, 430.T, 430.V, 430.X, 
431.A, 431.E, 431.G, 431.R, 431.V, 432.I, 432.K, 432.L, 432.Q, 432.R, 432.S, 432.X, 455.A, 455.D, 455.E, 455.I, 455.L, 455.Q, 455.S, 455.T, 455.V, 456.H, 456.L, 456.M, 456.N, 456.R, 456.S, 456.V, 456.W, 456.Y, 457.A, 457.D, 457.N, 457.S, 457.X, 458.A, 458.D, 458.E, 458.G, 458.K, 458.N, 458.Q, 458.S, 458.T, 458.Y, 459.A, 459.D, 459.E, 459.G, 459.I, 459.N, 459.P, 459.S, 459.T, 459.V, 459.X, 459.gap, 460.A, 460.C, 460.D, 460.E, 460.G, 460.H, 460.I, 460.K, 460.L, 460.M, 460.N, 460.P, 460.Q, 460.R, 460.S, 460.T, 460.V, 460.W, 460.X, 460.gap, 461.A, 461.D, 461.E, 461.F, 461.G, 461.H, 461.I, 461.K, 461.L, 461.M, 461.N, 461.P, 461.Q, 461.R, 461.S, 461.T, 461.V, 461.X, 461.Y, 461.gap, 462.A, 462.D, 462.E, 462.G, 462.H, 462.I, 462.K, 462.L, 462.M, 462.N, 462.P, 462.Q, 462.R, 462.S, 462.T, 462.V, 462.X, 462.Y, 462.gap, 463.A, 463.C, 463.D, 463.E, 463.G, 463.H, 463.I, 463.K, 463.L, 463.M, 463.N, 463.P, 463.Q, 463.R, 463.S, 463.T, 463.V, 463.X, 463.Y, 463.gap, 469.K, 469.R, 469.S, 469.Y, 469.gap, 471.A, 471.E, 471.G, 471.I, 471.L, 471.Q, 471.S, 471.T, 471.V, 474.D, 474.E, 474.N, 474.Y, 475.I, 475.M, 475.T, 475.V, 476.G, 476.K, 476.M, 476.Q, 476.R, 476.T, 476.V, 477.D, 477.G, 477.N, 197.sequon_actual, 276.sequon_actual, 363.sequon_actual, 386.sequon_actual, 392.sequon_actual, 460.sequon_actual, 461.sequon_actual, 462.sequon_actual, 463.sequon_actual
gp120_v2	121.E, 121.K, 121.M, 121.Q, 121.R, 121.X, 123.A, 123.T, 123.X, 124.F, 124.H, 124.I, 124.P, 124.Y, 127.I, 127.V, 157.C, 157.X, 158.D, 158.E, 158.S, 158.T, 159.D, 159.F, 159.L, 159.X, 159.Y, 160.D, 160.E, 160.H, 160.I, 160.K, 160.N, 160.R, 160.S, 160.T, 160.V, 160.X, 160.Y, 160.gap, 161.A, 161.I, 161.L, 161.M, 161.S, 161.T, 161.V, 161.X, 161.gap, 162.A, 162.H, 162.I, 162.N, 162.P, 162.Q, 162.S, 162.T, 162.X, 162.gap, 163.A, 163.G, 163.I, 163.K, 163.P, 163.R, 163.S, 163.T, 163.X, 163.gap, 164.A, 164.D, 164.E, 164.F, 164.G, 164.H, 164.I, 164.K, 164.L, 164.M, 164.N, 164.P, 164.Q, 164.R, 164.S, 164.T, 164.V, 164.X, 164.gap, 165.G, 165.I, 165.L, 165.M, 165.P, 165.Q, 165.R, 165.S, 165.T, 165.V, 165.W, 165.X, 166.A, 166.D, 166.G, 166.H, 166.I, 166.K, 166.M, 166.N, 166.Q, 166.R, 166.S, 166.T, 166.V, 166.W, 166.X, 167.D, 167.E, 167.G, 167.K, 167.N, 167.P, 167.Q, 167.R, 167.T, 167.X, 168.D, 168.E, 168.G, 168.I, 168.K, 168.L, 168.R, 168.S, 168.V, 168.X, 168.gap, 169.A, 169.E, 169.G, 169.H, 169.I, 169.K, 169.L, 169.M, 169.N, 169.P, 169.Q, 169.R, 169.S, 169.T, 169.V, 169.W, 169.X, 169.Y, 169.gap, 170.C, 170.E, 170.H, 170.K, 170.L, 170.N, 170.Q, 170.R, 170.S, 170.T, 170.X, 170.gap, 171.A, 171.D, 171.E, 171.G, 171.H, 171.K, 171.L, 171.M, 171.N, 171.P, 171.Q, 171.R, 171.S, 171.T, 171.V, 171.X, 171.gap, 172.A, 172.D, 172.E, 172.G, 172.I, 172.K, 172.M, 172.N, 172.Q, 172.R, 172.T, 172.V, 172.X, 172.Y, 173.A, 173.D, 173.E, 173.F, 173.G, 173.H, 173.K, 173.N, 173.Q, 173.R, 173.S, 173.T, 173.X, 173.Y, 174.A, 174.D, 174.G, 174.N, 174.S, 174.T, 174.V, 174.X, 174.Y, 175.A, 175.E, 175.F, 175.H, 175.I, 175.L, 175.M, 175.N, 175.Q, 175.S, 175.T, 175.V, 175.X, 175.Y, 176.F, 176.L, 176.S, 176.X, 177.A, 177.D, 177.F, 177.H, 177.N, 177.Q, 177.X, 177.Y, 178.A, 178.D, 178.E, 178.G, 178.I, 178.K, 178.L, 178.N, 178.R, 178.S, 178.T, 178.V, 178.X, 178.Y, 179.A, 179.E, 179.F, 179.I, 179.K, 179.L, 179.M, 179.P, 179.Q, 179.R, 179.S, 179.T, 179.V, 179.X, 179.Y, 180.D, 180.L, 180.S, 180.X, 181.D, 
181.I, 181.K, 181.L, 181.M, 181.T, 181.V, 181.X, 182.A, 182.E, 182.H, 182.I, 182.K, 182.L, 182.M, 182.N, 182.Q, 182.S, 182.T, 182.V, 182.X, 183.A, 183.D, 183.E, 183.H, 183.K, 183.L, 183.N, 183.P, 183.Q, 183.R, 183.S, 183.V, 183.X, 184.A, 184.F, 184.I, 184.L, 184.M, 184.N, 184.S, 184.T, 184.V, 184.X, 184.gap, 185.A, 185.D, 185.E, 185.F, 185.G, 185.H, 185.I, 185.K, 185.L, 185.N, 185.P, 185.Q, 185.R, 185.S, 185.T, 185.V, 185.X, 185.Y, 185.gap, 186.A, 186.D, 186.E, 186.G, 186.H, 186.I, 186.K, 186.L, 186.N, 186.P, 186.Q, 186.R, 186.S, 186.T, 186.V, 186.X, 186.gap, 187.A, 187.C, 187.D, 187.E, 187.G, 187.H, 187.I, 187.K, 187.N, 187.P, 187.Q, 187.R, 187.S, 187.T, 187.X, 187.Y, 187.gap, 188.A, 188.D, 188.E, 188.F, 188.G, 188.H, 188.I, 188.K, 188.N, 188.P, 188.Q, 188.R, 188.S, 188.T, 188.V, 188.W, 188.X, 188.Y, 188.gap, 189.A, 189.D, 189.E, 189.G, 189.H, 189.I, 189.K, 189.L, 189.M, 189.N, 189.P, 189.Q, 189.R, 189.S, 189.T, 189.X, 189.Y, 189.gap, 190.A, 190.D, 190.E, 190.F, 190.G, 190.H, 190.I, 190.K, 190.L, 190.M, 190.N, 190.P, 190.Q, 190.R, 190.S, 190.T, 190.V, 190.X, 190.Y, 191.F, 191.H, 191.S, 191.W, 191.Y, 192.A, 192.G, 192.I, 192.K, 192.M, 192.R, 192.S, 192.T, 192.V, 193.F, 193.I, 193.L, 193.M, 193.P, 194.I, 194.K, 194.L, 194.M, 194.R, 194.T, 194.V, 195.D, 195.H, 195.K, 195.N, 195.Q, 195.S, 195.T, 195.Y, 197.D, 197.I, 197.K, 197.N, 197.R, 197.S, 197.T, 202.A, 202.K, 202.P, 202.R, 202.S, 202.T, 203.K, 203.Q, 203.R, 312.A, 312.G, 312.V, 315.A, 315.G, 315.H, 315.K, 315.M, 315.Q, 315.R, 315.S, 315.T, 315.V, 160.sequon_actual, 171.sequon_actual, 173.sequon_actual, 174.sequon_actual, 185.sequon_actual, 186.sequon_actual, 187.sequon_actual, 188.sequon_actual, 189.sequon_actual, 195.sequon_actual, 197.sequon_actual
gp120_v3	296.C, 296.R, 297.A, 297.E, 297.I, 297.K, 297.L, 297.M, 297.N, 297.Q, 297.R, 297.S, 297.T, 297.V, 297.X, 298.G, 298.R, 299.E, 299.F, 299.H, 299.L, 299.N, 299.P, 299.T, 299.V, 300.A, 300.C, 300.D, 300.F, 300.G, 300.H, 300.N, 300.Q, 300.S, 300.T, 300.W, 300.X, 300.Y, 301.D, 301.E, 301.H, 301.I, 301.K, 301.N, 301.Q, 301.R, 301.T, 301.V, 301.X, 301.Y, 301.gap, 302.A, 302.G, 302.H, 302.K, 302.L, 302.N, 302.Q, 302.S, 302.Y, 303.E, 303.I, 303.K, 303.M, 303.Q, 303.R, 303.S, 303.T, 303.V, 304.E, 304.G, 304.I, 304.K, 304.L, 304.R, 304.S, 304.V, 304.W, 305.D, 305.E, 305.G, 305.H, 305.I, 305.K, 305.N, 305.Q, 305.R, 305.T, 305.X, 305.Y, 306.A, 306.D, 306.E, 306.G, 306.K, 306.Q, 306.R, 306.S, 306.X, 306.gap, 307.A, 307.E, 307.F, 307.H, 307.I, 307.L, 307.M, 307.T, 307.V, 307.X, 307.Y, 308.A, 308.G, 308.H, 308.K, 308.N, 308.P, 308.Q, 308.R, 308.S, 308.T, 308.W, 308.X, 309.F, 309.I, 309.L, 309.M, 309.R, 309.T, 309.V, 309.X, 310.G, 310.Q, 310.gap, 311.I, 311.R, 311.gap, 312.A, 312.G, 312.V, 313.A, 313.G, 313.L, 313.P, 313.Q, 313.S, 313.T, 313.V, 313.W, 314.A, 314.G, 314.M, 314.P, 314.X, 315.A, 315.G, 315.H, 315.K, 315.M, 315.Q, 315.R, 315.S, 315.T, 315.V, 316.A, 316.E, 316.G, 316.I, 316.L, 316.M, 316.R, 316.S, 316.T, 316.V, 316.W, 316.X, 316.gap, 317.F, 317.I, 317.L, 317.M, 317.R, 317.S, 317.V, 317.W, 317.X, 317.Y, 318.F, 318.H, 318.N, 318.Q, 318.R, 318.S, 318.V, 318.W, 318.Y, 319.A, 319.G, 319.I, 319.K, 319.L, 319.M, 319.N, 319.Q, 319.R, 319.S, 319.T, 319.V, 319.Y, 319.gap, 320.A, 320.E, 320.G, 320.H, 320.I, 320.K, 320.M, 320.N, 320.P, 320.Q, 320.R, 320.S, 320.T, 320.W, 320.X, 320.Y, 320.gap, 321.A, 321.D, 321.E, 321.F, 321.G, 321.H, 321.I, 321.K, 321.L, 321.N, 321.R, 321.S, 321.T, 321.V, 321.Y, 321.gap, 322.E, 322.G, 322.I, 322.K, 322.L, 322.N, 322.Q, 322.T, 322.V, 322.Y, 322.gap, 323.D, 323.G, 323.I, 323.K, 323.M, 323.N, 323.Q, 323.R, 323.S, 323.T, 323.V, 323.gap, 324.E, 324.G, 324.L, 324.N, 324.P, 324.R, 324.S, 324.T, 325.D, 325.E, 325.G, 325.I, 325.K, 325.N, 325.Q, 
325.R, 325.S, 325.T, 325.Y, 326.A, 326.I, 326.M, 326.P, 326.S, 326.T, 327.G, 327.K, 327.R, 328.A, 328.D, 328.E, 328.G, 328.H, 328.I, 328.K, 328.L, 328.M, 328.N, 328.P, 328.Q, 328.R, 328.S, 328.V, 329.A, 329.V, 329.X, 330.F, 330.H, 330.N, 330.Q, 330.R, 330.S, 330.Y, 331.C, 331.X, 332.D, 332.E, 332.H, 332.I, 332.K, 332.L, 332.N, 332.Q, 332.R, 332.S, 332.T, 332.V, 333.I, 333.L, 333.V, 333.Y, 334.A, 334.D, 334.E, 334.F, 334.G, 334.I, 334.K, 334.N, 334.R, 334.S, 334.T, 334.Y, 334.gap, 300.sequon_actual, 301.sequon_actual, 302.sequon_actual, 322.sequon_actual, 323.sequon_actual, 324.sequon_actual, 330.sequon_actual, 332.sequon_actual, 334.sequon_actual
gp41_mper	609.A, 609.F, 609.H, 609.K, 609.L, 609.P, 609.Q, 609.R, 609.S, 609.X, 609.Y, 657.E, 657.K, 657.V, 658.E, 658.H, 658.K, 658.L, 658.N, 658.Q, 658.R, 658.X, 659.A, 659.D, 659.E, 659.K, 659.N, 659.R, 659.S, 659.X, 661.F, 661.L, 661.S, 661.X, 662.A, 662.E, 662.G, 662.K, 662.Q, 662.S, 662.T, 663.F, 663.L, 663.M, 663.W, 664.D, 664.E, 664.G, 664.N, 664.S, 665.E, 665.H, 665.K, 665.N, 665.Q, 665.R, 665.S, 665.T, 665.X, 667.A, 667.D, 667.E, 667.G, 667.K, 667.N, 667.Q, 667.S, 667.T, 668.D, 668.F, 668.G, 668.H, 668.N, 668.Q, 668.S, 668.T, 668.X, 669.I, 669.L, 669.X, 671.D, 671.G, 671.K, 671.N, 671.S, 671.T, 672.L, 672.W, 673.F, 673.L, 673.S, 674.A, 674.D, 674.E, 674.G, 674.K, 674.N, 674.S, 674.T, 674.X, 674.Y, 675.I, 675.L, 675.M, 675.V, 676.A, 676.S, 676.T, 676.V, 677.E, 677.H, 677.K, 677.N, 677.Q, 677.R, 677.S, 677.T, 677.X, 679.I, 679.L, 680.G, 680.R, 680.S, 680.W, 681.D, 681.H, 681.S, 681.Y, 682.I, 682.T, 682.V, 683.K, 683.Q, 683.R, 683.X, 684.I, 684.L, 684.M, 684.T, 684.V, 684.X, 674.sequon_actual
glyco	num.sequons.env, num.sequons.gp120, num.sequons.v2, num.sequons.v3, num.sequons.v5
cysteines	num.cysteine.env, num.cysteine.gp120, num.cysteine.v2, num.cysteine.v3, num.cysteine.v5
geometry	length.env, length.gp120, length.v2, length.v3, length.v5
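
The N-linked glycosylation motif described in the caption of Table 4.1 (an N, followed by any residue other than P, followed by S or T) can be located with a short regular expression. The sketch below assumes an ungapped amino acid string and reports 0-based string positions rather than HXB2 coordinates; the example sequence is hypothetical.

```python
import re

# Lookahead so that overlapping motifs are each reported at their leading N.
SEQUON = re.compile(r"(?=N[^P][ST])")


def sequon_starts(sequence):
    """0-based positions of the leading N of each N-linked glycosylation motif."""
    return [match.start() for match in SEQUON.finditer(sequence)]


# Hypothetical amino acid fragment: NVT and NGS match, NPS does not.
fragment = "ANVTNPSNGS"
starts = sequon_starts(fragment)
num_sequons = len(starts)
print(starts, num_sequons)
```

Counting the matches within a region gives a quantity analogous to the ‘num.sequons’ variables listed in the table.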

References

Breiman, Leo. 2001. “Random Forests.” Machine Learning 45 (1). Springer: 5–32. doi:10.1023/A:1010933404324.

Chen, Tianqi, and Carlos Guestrin. 2016. “XGBoost: A Scalable Tree Boosting System.” In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–94. doi:10.1145/2939672.2939785.

van der Laan, Mark J, Eric C Polley, and Alan E Hubbard. 2007. “Super Learner.” Statistical Applications in Genetics and Molecular Biology 6 (1). De Gruyter. doi:10.2202/1544-6115.1309.

Williamson, Brian D, Peter B Gilbert, Noah R Simon, and Marco Carone. 2020. “A Unified Approach for Inference on Algorithm-Agnostic Variable Importance.” arXiv Preprint. https://arxiv.org/abs/2004.03683.

Yoon, Hyejin, Jennifer Macke, Anthony P West Jr, Brian Foley, Pamela J Bjorkman, Bette Korber, and Karina Yusim. 2015. “CATNAP: A Tool to Compile, Analyze and Tally Neutralizing Antibody Panels.” Nucleic Acids Research 43 (W1). Oxford University Press: W213–W219. doi:10.1093/nar/gkv404.

Zou, Hui, and Trevor Hastie. 2005. “Regularization and Variable Selection via the Elastic Net.” Journal of the Royal Statistical Society: Series B (Statistical Methodology) 67 (2). Wiley Online Library: 301–20. doi:10.1111/j.1467-9868.2005.00503.x.

SLAPNAP Report: VRC01

06 November, 2020