Full article: Predicting the clinical subcutaneous absorption rate constant of monoclonal antibodies using only the primary sequence: a machine learning approach

Formulae display: $MathJax Logo$ ?Mathematical formulae have been encoded as MathML and are displayed in this HTML version using MathJax in order to improve their display. Uncheck the box to turn MathJax off. This feature requires Javascript. Click on a formula to zoom.

ABSTRACT

Subcutaneous injections are an increasingly prevalent route of administration for delivering biological therapies including monoclonal antibodies (mAbs). Compared with intravenous delivery, subcutaneous injections reduce administration costs, shorten the administration time, and are strongly preferred from a patient experience point of view. An understanding of the absorption process of a mAb from the injection site to the systemic circulation is critical to the process of subcutaneous mAb formulation development. In this study, we built a model to predict the absorption rate constant (k_a), which denotes how fast a mAb is absorbed from the site of administration. Once trained, our model (enabled by the XGBoost algorithm in machine learning) can predict the k_a of a mAb following a subcutaneous injection using in silico molecular properties alone (generated from the primary sequence). Our model does not need clinically observed plasma concentration-time data; this is a novel capability not previously achieved in predictive pharmacokinetic models. The model also showed improved performance when benchmarked against a recently reported mechanistic model that relied on clinical data to predict subcutaneous absorption of mAbs. We further interpreted the model to understand which molecular properties affect the absorption rate and showed that our findings are consistent with previous studies evaluating subcutaneous absorption through direct experimentation. Taken altogether, this study reports the development, validation, benchmarking, and interpretation of a model that can predict the clinical k_a of a mAb using its primary sequence as the only input.

KEYWORDS:

Introduction

Following subcutaneous (SC) administration of a monoclonal antibody (mAb) to the hypodermis layer, the mAb traverses several paths within the body before reaching the systemic circulation. First, the mAb enters the net negatively charged interstitium primarily made of glycosaminoglycans, collagen fibers, and interstitial fluid.^Citation1 It then travels through the interstitium to reach the peripheral lymphatic capillaries which, without a complete basement layer as a barrier, act as the initial site of uptake into the lymphatic system to carry the mAb to the thoracic duct and ultimately to the systemic circulation.^{Citation1,Citation2} The absorption rate from the injection site to the systemic circulation is conventionally represented by the absorption rate constant (k_a) in pharmacokinetics (PK) studies assuming the absorption process as a first-order reaction.^Citation3 As such, the k_a is a parameter whose value typically remains unknown until after a clinical trial.

To construct a model that can calculate the k_a of a mAb, the model must use molecule-specific information as input to differentiate the k_a of one mAb from another. Most conventional PK models (both compartmental and physiologically based) achieve this by using as input plasma concentration-time profiles from clinical trials.^{Citation4–6} The construction of these models involves determining optimized model parameters by fitting against a training concentration-time profile of a given mAb, followed by the application of these optimized parameters to predict an unseen concentration-time profile of the same mAb.^{Citation7–9} Though conventional PK models can make accurate predictions,^{Citation7–9} the need to use clinical data inevitably limits their application, particularly during early stages of research and development when clinical data are unavailable, or limited at best. This gap prompted us to develop a model that only uses molecule information available in early stages of research and development and that does not rely on any clinical information to predict PK behavior.

To achieve the aforementioned goal, we sought to construct a model using the molecular properties of a mAb, which can be simulated based on the primary sequence alone^Citation10 and which past studies have identified to have the potential to influence the absorption rate.^{Citation11–15} For example, surface charge,^{Citation12–15} hydrophobicity,^{Citation11,Citation14} thermal stability,^{Citation11,Citation14} and size and shape^{Citation12,Citation14,Citation15} have all been found to play a role in transporting a molecule from the injection site to the systemic circulation. We reasoned that such physical connectivity between molecular properties and absorption could possibly replace the need for clinical data and be leveraged to develop a new model that relies solely on the primary sequence and the derived in silico inputs to predict the k_a of a mAb following SC administration.

On the basis of this thesis, we first generated and trained a “data-centric” model by correlating the molecular properties of various mAbs with their respective known k_a values generated from clinical trials. Once successfully trained, the resultant model, which no longer required any clinical data as input to predict the k_a of a given mAb and instead relied only on its molecular properties that can be obtained in silico, was validated using mAbs not used to generate the model. Finally, we interrogated the model further, including benchmarking with a previously reported PK model, as well as dissecting how the model uses molecular properties to make predictions and rationalizing the inclusion of these molecular properties in light of prior experimental literature that examined the impact of molecular properties on k_a.

The importance and relevance of this study are two-fold. First, the study explores, for the first time, a model that does not require any clinical or preclinical data for the prediction of in vivo clinical k_a of a mAb. To predict k_a, our model only needs the sequence information of a mAb as input whereas other existing models require extensive clinically derived data which are unavailable in early phase drug discovery and development, wherein the availability of a k_a prediction would positively impact the selection of mAbs with desirable absorption profiles. Notably, the model also showed improved performance when benchmarked against a recently reported mechanistic model that relied on clinical data to predict subcutaneous absorption of mAbs.^Citation16 Second, this study presents a quantitative approach to decipher molecular properties affecting the SC absorption rate. In particular, we evaluated, compared, and ranked a wide range of molecular properties with respect to their effect on k_a, and found our results to be consistent with experimentally derived results from previous studies.^{Citation11–13} Altogether, this study offers the first tool to predict and understand the rate of absorption of subcutaneously administered mAbs with nothing but the primary sequence of the molecule.

Methods

Compilation of the SC absorption rate constants

A survey of available publications and internal clinical reports was conducted to identify mAbs with reported clinical k_a values where the mAb was administered SC as a monotherapy. Formulations containing hyaluronidase were excluded due to its impact on SC absorption.^Citation17 We also excluded any data generated from preclinical animal studies, given the physiological and anatomical differences across species and lack of clear inter-species translatability for absorption. We note that the general approach reported in this study can be applied to k_a values derived from formulations containing hyaluronidase or from studies from a given preclinical animal species.

33 mAbs fit the selection criteria. Data sources included the Elsevier PharmaPendium^Citation18 database, which compiles information primarily from the Food and Drug Administration and European Medicines Agency approval packages. The reported k_a values were individually reviewed and, if needed, corrected to match the database-reported value with the source document that initially reported the k_a value. The error associated with those k_a values, which was reported as %RSE, SE, or 95% CI depending on the source document, was also compiled and can be found in the Rate Constants tab in Supplementary Information (%RSE = percent relative standard error, SE = standard error, and CI = confidence interval). The Lilly mAbs belong to the immunoglobulin G4 (IgG4) subclass; their identities and sequences were masked, but the k_a values and molecular properties associated are reported in the Rate Constants tab in Supplementary Information.

The dataset of k_a values assembled as detailed above contained values calculated mostly from population PK analysis with compartmental models built using systemically measured concentration-time data. Pertinent information on the subcutaneously administered mAbs was also collected, such as injection parameters and aggregated patient characteristics, but this information was available for a limited number of mAbs only, precluding the inclusion of these parameters in our model. More than one k_a were reported for a number of mAbs. In these cases, we calculated the mean k_a and the same is reported in . The raw, unaveraged data are also reported in the Rate Constants tab in Supplementary Information.

Figure 1. SC absorption rate constant overview (1a), across IgG subclasses (1b), and across clearance types (1c). For clarity, the two box plots show the 25%, 50%, and 75% quartiles with jittered points.

Among the 33 compiled mAbs, golimumab was excluded from the analysis as it exhibits unusually fast absorption; its k_a is approximately 4 times the average k_a value and 2 times the second highest k_a value reported for other mAbs. Emicizumab was also excluded as, being a bispecific with two different heavy chain sequences, its molecular properties could not be properly calculated using the Molecular Operating Environment (MOE) software detailed below. The k_a data from the remaining 31 mAbs formed the output variable of our prediction models.

Simulation of the molecular properties

We used the MOE 2019 software platform to generate in silico molecular properties for the 31 mAbs of interest.^Citation10 The software used the light chain and heavy chain sequences of the mAbs as inputs and simulated molecular properties using the antigen-binding fragment (Fab) region simulated at pH of 7.4, which represents a physiologically relevant value in the SC space.^Citation19 Other simulation parameters include setting the solution at 0.001 M NaCl, dielectric constant at 78, viscosity at 0.89 cP, and temperature at 300 K (27°C). We recognized that 37°C would reflect the actual in vivo condition more accurately, but we noted that the values of the molecular properties used in our study were not sensitive to this temperature change: they either remained unchanged or were comparable (<5% change in magnitude), when simulated at 27°C versus 37°C. We note that future simulations should be run at 37°C for biological relevance for human data and in case new parameters that are affected by temperature are explored in future studies.

MOE generated over a hundred molecular properties for a given mAb, but some of the properties were redundant or highly correlated. For instance, the property set contained five properties quantifying the five largest negative patch areas on a mAb; The same information, to some extent, could be captured by one property, the total negative patch area, which was also reported. These redundant properties as well as the inter-correlated ones describing the same molecular properties with a Spearman coefficient larger than 0.6 were filtered out. As a result, the number of properties included in the model was reduced, enabling a balance between dimensionality reduction and information preservation. Removing nonessential properties consistently improved the predicted k_a values to closely match the actual values across model setups. This observation is consistent with the observation that reducing the number of inputs in a high-dimensional model such as the one reported in this study would lead to better model performance.^Citation20 In this case, we determined the final 27 properties, henceforth simply called the properties, which could be classified into four main categories – charge, hydrophobicity, size, and bulk properties ().

Table 1. Properties used in the predictive models.

Download CSV Display Table

The sequence of each mAb was modeled into a structure 10 times and then the properties were calculated for each of the structures. The MOE simulation was repeated 10 times due to the variability in the output value generated by MOE for some of the properties. In particular, for structure-based properties derived from protein folding, MOE yielded a different value in each run despite the same input sequence. The variability in output values was due to the variability in simulating protein folding, an inherent feature of MOE. For the same input sequences some structure-based properties showed a percentage change as large as 50% from run to run, whereas sequence-based properties had a single value across runs. For a given mAb, we ran the simulation 10 times because, when we performed fewer than 10 runs, often a value outside of the sampled range would manifest when we performed additional runs to reach a total of 10 runs; when we performed more than 10 runs, the values obtained for runs past the tenth run always fell within the range sampled in the first 10 runs. Also, rather than taking the average of each property across the 10 runs per mAb, a simplification which on its own would fail to convey the degree of variability of each property, all 10 runs from each mAb were used as inputs in the predictive models. Therefore, the 310 runs of properties, consisting of 10 runs per mAb and 31 mAbs in total, formed the input variables of our regression models.

The following alternative methods for preparing property data as model inputs were also attempted: using the average or individual values of every property generated by MOE (over a hundred properties) over the 10 runs; and using the average values of the selected properties (27 properties) over the 10 runs as the input variables. However, each of these approaches resulted in model performance poorer than the one we reported in the study, i.e., using the individual values of the selected properties as the input variables resulted in the lowest root-mean-square error (RMSE) of 0.0023 h⁻¹ (). We noted that a study in the literature that attempted to predict the viscosity and clearance of a mAb based on its MOE-generated properties reported an alternative method that used the averaged values across a much larger number of runs (i.e., 100 runs).^Citation21 This approach may be of interest for future exercises involving MOE property calculations.

Table 2. RMSEs across all the predictive models.

Download CSV Display Table

Data preprocessing

Both the input and output variables were normalized to a mean of 0 and a standard deviation of 1 prior to constructing the models. The output variables were normalized to improve readability and, if desired, could be easily converted back to their raw value, while the input variables were normalized to improve accuracy of the traditional regression models. Although normalization would not affect the accuracy of XGBoost as a decision tree model (described in the next section), the same normalization procedure was utilized for XGBoost for consistency across model setups.

Generation and interpretation of the predictive models

We evaluated four different algorithms with the goal of identifying the one that can make the most accurate prediction. The first three, the principal component regression (PCR), partial least square (PLS) regression, and least absolute shrinkage and selection operator (LASSO) regression, are advanced variations of standard linear regression. The difference across the three algorithms lies in the specific methods used to calculate the coefficient for each input variable. The fourth, the XGBoost algorithm, uses a novel decision tree algorithm consisting of a pre-defined number of nodes (called “trees”) whose individual predictions summed to render a final, collective prediction.^Citation22 Figuratively, the XGBoost algorithm resembles a council where the members vote in order to decide on an issue together. Each member of the council (tree in this case) sequentially votes in a way that it considers to be correct (to minimize the error in the prediction) based on a given piece of information (a given molecular property). Multiple members may select the same information, and some information may be left out completely if considered unimportant. The final prediction is then the sum of the individual votes.^Citation22

To compare performance across the models enabled by these algorithms, we used minimization of the RMSE between the model-predicted and the known-clinical values as the benchmark. The RMSE was calculated by performing a leave-group-out cross validation. The group in each leave-group-out fold corresponded to a mAb with 10 runs of properties, ensuring that the model is always tested against an independent molecule that has no direct influence on itself. Each time one mAb was left out, the accuracy of the predictive model, trained with the remaining 30 mAbs, to determine the k_a of the excluded mAb was evaluated via RMSE. The cross validation was repeated 31 times so that each of the 31 mAbs was evaluated once; then we took the average RMSE. In the leave-group-out cross validation, we optimized the hyperparameters for each model, which are structural parameters defining a model such as the number of principle components for PCR and the maximum depth of trees for XGBoost (Table S.2 in Supplementary Information).

Next, we sought to interpret the resultant model with respect to the physical significance of the identified molecular properties. Interpreting traditional regression models simply equates to evaluating the sign and magnitude of the coefficient in front of a parameter. For example, a large negative coefficient preceding a parameter may indicate a strong negative correlation between the parameter and the predicted value. In contrast, deciphering a machine learning model such as XGBoost is less straightforward due to the absence of a coefficient associated with the parameter(s).^Citation22 As an XGBoost model cannot be interpreted directly, we linearized it using the Shapley Additive Explanations (SHAP) analysis.^Citation23 Essentially, the SHAP analysis decomposes a complicated black box model into an interpretable linear model by showing how influences from individual properties sum to yield a resultant k_a value as shown in Eq (1):

(1)

ka, predicted = \overline{k_{a}} + \sum_{Property} SHAP

(1)

where k_{a, predicted} is the predicted k_a value of a given mAb; $\overline{k_{a}}$ is the mean of all the compiled k_a values (the initial guess), and $\sum_{Property} SHAP$ is the sum of individual influences (also known as Shapley values) exerted by each molecular property. The linear form of EquationEq (1)(1) $ka, predicted = \overline{k_{a}} + \sum_{Property} SHAP$ (1) makes it possible to interpret how varying the value of each molecular property affected the k_a prediction and, critically, to assign a physical interpretation to our observations. For instance, we observed that a high total area of the negatively charged patches around the complementarity-determining regions (CDRs) was associated with faster absorption, which prompted us to hypothesize that local charges play a key role in SC absorption, a computational finding that turned out consistent with the conclusions of previous experimental studies.^{Citation12,Citation13} By using the SHAP analysis, our model allowed us to identify predictive parameters in an unbiased fashion and highlighted correlations that could guide future hypothesis-driven mechanistic studies.

The R software (version 4.1.2) was used as the modeling platform to process, analyze, and display the results from each of four models described above.^Citation24 The Caret software package (version 6.0.92)^Citation25 with XGBoost (version 1.6.0.1)^Citation22 was used for leave-group-out cross validation and hyperparameter tuning. The SHAPforXGboost package (version 0.1.1) was used to run and visualize the SHAP analysis.^Citation23 Except the SHAP plot, all other figures in this study were plotted by the ggplot2 package (version 3.3.5).^Citation26

Results

SC absorption rate constants follow a skewed distribution

The compiled k_a values are represented graphically in showing a skewed distribution with a median of 0.011 h⁻¹, a mean of 0.013 h⁻¹, and a standard deviation of 0.0043 h⁻¹. Further, the distribution is not log normal, has a long tail in the fast absorption direction, and tapers off quickly in the other direction. All the k_a values are within the range of equal to or less than 0.02 h⁻¹.

We sought to examine IgG subclass and drug clearance as potential co-variates that may lead to and potentially explain the differences in k_a as discussed below. Therapeutic mAbs are classified into IgG subclasses based on their structure^Citation27; as such, it is possible that structural variations among subclasses may lead to differences in k_a. We determined the subclasses of the mAbs using publicly available information. Among the 31 mAbs surveyed, 15 are IgG1; 5 are IgG2; and 11 are IgG4. The distribution of the k_a values among the three subclasses roughly reflects the actual distribution of approved mAb therapeutics on the market, with IgG1 being the most common (approximately 74% of approved antibodies) followed by IgG2 and IgG4 (approximately 13% each).^Citation28 As shown in , we did not observe any statistically significant differences in the k_a values across the three subclasses (p = 0.26 for IgG2 vs. IgG1, p = 0.22 for IgG2 vs. IgG4, p = 0.54 for IgG1 vs. IgG4). For assessing statistical significance, we used the Wilcoxon rank sum test, which does not assume normal distribution and is therefore an appropriate tool for analyzing skewed datasets.

In addition to the IgG subclass, we also considered drug clearance as another co-variate that could potentially explain variation in k_a. Among the 31 mAbs surveyed, 18 display linear clearance; 13 display non-linear clearance. Again, the Wilcoxon test did not show any statistically significant differences between the two groups (p = 0.11).

No single molecular property correlates strongly with k_a

We initially examined whether any individual molecular property could potentially explain the differences in k_a across mAbs. To do so, we used the Spearman correlation analysis, which measures how well the relationship between k_a and a property can be described using a monotonic function. None of the Spearman correlation coefficients exceeded 0.45. The top five properties, in the order of decreasing correlation, are Pro.affinity (with a Spearman’s coefficient of 0.36), Pro.mass (0.32), Pro.hyd.moment (0.31), Pro.patch.edr.neg.n (0.23), and Pro.patch.cdr.hyd (0.22). We also plotted and visually inspected the top five properties with the largest correlation coefficients against k_a, but failed to identify any meaningful result (R² < 0.1 in all cases). This lack of strong correlation again suggests that the SC absorption is a complicated process affected by many factors and in turn prompted us to devise and evaluate more sophisticated models as described below.

The XGBoost model achieved the lowest RMSE among peer models

The main challenge for a priori prediction of k_a is to construct a model that can provide accurate prediction for mAbs outside of the training dataset used during model validation. As detailed in the Methods section, we evaluated predictive models enabled by various algorithms and the results of the relative accuracies are shown in (cross-validation RMSEs of the respective models, wherein a small RMSE value close to 0 represents strong predictive power with respect to unseen validation data). The table shows that the XGBoost model achieves the highest predictive accuracy, with an RMSE of 0.0023 h⁻¹, an order of magnitude below the mean k_a of 0.013 h⁻¹. The fact that the XGBoost model outperformed the other regression models is consistent with prior literature establishing its utility in modeling based on tabular data.^Citation29 The accuracy of the XGBoost model can be partially attributed to an inherent regularization term in its algorithm setup, which enables the model to identify input variables that carry the most weight while avoiding overfitting.^Citation22

Next, we attempted to improve the accuracy of the XGBoost model. Several studies have demonstrated that dimensionality reduction via principal component analysis (PCA) before applying XGBoost can lead to better accuracy.^{Citation30,Citation31} Accordingly, we followed a similar methodology and attempted two approaches: PCA on either the entire property set or by the four main property groups (hydrophobicity, charge, size, and bulk properties). The top principal components explaining over 60% of the variance were chosen and fed into the XGBoost model. However, neither pre-processing step improved accuracy. Dimensionality reduction may have failed in this case because the molecular descriptors could not be effectively reduced to a small number of principal components. In particular, we found that, we needed too many principal components, i.e., around 10 of the original 27 properties, to account for even 60% of the variance, which is far short of the ideal value of >80% variance captured with a more modest number of components. Thus, when we fed the selected principal components to the machine learning algorithm, the remaining 40% of the variance was lost and consequently hampered the machine learning model. Hence, the XGBoost model by itself, with no PCA pre-processing, represented the most accurate model.

The performance of the XGBoost model when applied to the prediction of k_a values from data not previously used to construct the model is visualized in . The underlying data in numeric format are included in the Validation Results tab in Supplementary Information. Zheng et al. previously reported a multiscale PK model for predicting the absorption and clearance of subcutaneously injected biotherapeutics by constructing a realistic subcutaneous compartment in addition to other abstracted body compartments (thus called the mechanistic, “multiscale” PK model).^Citation16 Zheng’s model successfully predicted a priori several PK parameters, including k_a, of albumin with its physical properties alone as the model inputs. For mAbs whose physical properties are unknown, the authors only managed to predict the k_a values by using clinical data as inputs to the model. A head-to-head comparison of the prediction accuracy for the 12 overlapping mAbs predicted in both studies showed that the XGBoost model had superior performance with respect to predicting k_a (additional details are included in the Model Comparison tab in Supplementary Information). For those 12 mAbs, the XGBoost model yielded an RMSE of 0.0021 h⁻¹, and the multiscale model yielded an RMSE of 0.0034 h⁻¹. Therefore, the XGBoost model demonstrated the feasibility of predicting k_a for a given mAb based only on the sequence information of the said mAb. As such, this should serve as a valuable tool early in drug development where no, or at best limited, clinical data exist.

Figure 2. Comparison of the predicted and actual k_a values pooled from each cross validation run. The two dotted lines are one standard deviation away from the dashed identity line, and the enclosed area is highlighted in green. Each mAb has 10 runs of corresponding predictions shown as faded, red dots. Their means are shown as blue dots.

Physical basis for correlation between k_a and molecular properties identified in the XGBoost model

, the SHAP plot, illustrates how the molecular properties of a mAb affect its k_a prediction. A brief explanation on how to read the SHAP plot is provided in the legend, and a detailed tutorial has been published.^Citation32 Discussed below is the physical basis for correlation between k_a and molecular properties identified in the XGBoost model.

Figure 3. SHAP plot of the XGBoost model. It is a visual representation of how different molecular property values (also customarily called feature values in a SHAP plot) affect k_a prediction. Each row of the plot corresponds to a molecular property and is dotted with 310 points (corresponding to 310 runs obtained from 31 mAbs each run 10 times). The range of values for a given molecular property is color coded from purple (relatively high value for a given molecular property) to yellow (relatively low value for a given molecular property). The position of a colored point relative to the centerline shows the impact of a value. If a point lies to the right of the centerline (and thus with a positive SHAP value), it makes the model predict a higher k_a relative to an initial guess (starting at the k_a average across the 31 mAbs), and vice versa. For example, if yellow points cluster far away to the right of the center line, the interpretation will be that lower values of a given molecular property is likely related to a higher k_a. The magnitude of the SHAP values averaged across all data points for each molecular property is listed by the y-axis. Properties with an averaged SHAP value below or equal to 0.05 (unitless as k_a is normalized during data preprocessing) are considered unimportant, which, if converted back, is equivalent to the inability to change more than 0.0002 h⁻¹ from an initial guess. Considering that the k_a average across the 31 mAbs is two orders of magnitude larger and the smallest k_a is still one order of magnitude larger than 0.0002 h⁻¹, we consider it reasonable to exclude any molecular properties with below or equal to a SHAP value of 0.05 as unimportant.

First, charge properties played a role in the k_a prediction. Negative patches near the CDRs (“Pro.patch.cdr.neg”) were linked to a high k_a. This phenomenon has been reported in previous studies,^{Citation12,Citation13} and a number of explanations exist: localized negative charges may facilitate the transport of a mAb from the injection site through the negatively charged interstitium due to charge repulsion^Citation1 or reduce the likelihood for a mAb to be exposed to targets with negative charge that can degrade it.^Citation33 Interestingly, we did not see any meaningful correlation between local positive charges (“Pro.patch.cdr.pos”) and k_a, though a high dipole moment (“Pro.dipole.moment”), which is indicative of high charge concentration, correlated with faster absorption. In contrast to properties relating to local charges noted above, the overall charge distribution (“Pro.patch.neg” and “Pro.patch.pos”) showed no correlation with the rate of absorption. The isoelectric point (pI) (“Pro.pI.3D”) also showed limited impact on the k_a prediction except for some isolated cases. We note that two mAbs could have the same pI or overall charge as a bulk property, but vary greatly in their local charge properties, thus manifesting different electrostatic interactions with components of the interstitium. Previous studies, though not directly focusing on the absorption rate, found that pIs need to differ by more than one unit to produce a meaningful change in other PK parameters such as the clearance and tissue distribution of a mAb.^{Citation34,Citation35} Assuming that the same holds for k_a, it is perhaps not unexpected that we did not identify a meaningful correlation between the calculated pI (calculated using the Fab region as input) and k_a because of the narrow range of pI values for the mAbs surveyed in this study (pI 25^th quartile 7.7, pI 75^th quartile 8.4).

Second, hydrophobicity properties also affected the k_a prediction. Uneven distribution of hydrophobic residues, as determined by the hydrophobic moment (“Pro.hyd.moment”), was strongly linked to faster absorption. The hydrophobic moment was a weighted moment calculation where amino acid residues were assigned quantities associated with their intrinsic hydrophobicity.^Citation36 Further, large hydrophobic patches near the CDRs (“Pro.patch.cdr.hyd” and “Pro.patch.cdr.hyd.n”) were slightly correlated with faster absorption. A previous study has reported that a low global hydrophobicity measurement would lead to fast SC absorption.^Citation11 Taken together, our results suggest that including measures of fine detail of local hydrophobicity and hydrophobic moment can complement the use of global hydrophobicity for improving the accuracy of SC absorption rate predictions.

Third, molecule size was seen to affect k_a. In particular, a smaller radius of gyration (“Pro.r.gyr”) was linked to faster absorption. This correlation could be potentially attributed to reduced steric hindrance associated with a relatively smaller molecule size.^{Citation12,Citation14,Citation15}

Finally, we observed that the interaction energy between the variable heavy and variable light chains of a mAb (“Pro.affinity”), the mass of the antigen-binding region of a mAb (“Pro.mass”), and the extinction coefficient at 280 nm (“Pro.coeff.280”) played a role in the prediction. Though the trends of these molecular properties lacked a clear trend of monotonic ordering of colored points in the SHAP plot, rendering them challenging to interpret definitively, it is nonetheless possible to link these properties to appropriate physical parameters. The interaction energy term (“Pro.affinity”) indicates the strength of interaction within the variable domain of a mAb.^Citation10 This property is consistent with published literature reports wherein inherent thermal stability is associated with the propensity of a mAb to unfold, aggregate, and consequently result in slower or incomplete absorption.^{Citation11,Citation37} The mass term (“Pro.mass”) indicates the size of the antigen-binding region. As previously reported, the overall size of a molecule affects its steric hindrance in the absorption process^{Citation12,Citation14,Citation15}; we hypothesize that the size of the antigen-binding region may have a small but cascading effect. The extinction coefficient (“Pro.coeff.280”) is a function of the number of tryptophan, tyrosine, and cystine residues in a mAb, in the order of descending importance and with cystine having a negligible impact.^Citation38 Hence, it could be interpreted that the extinction coefficient is indicative of the number and identity of hydrophobic amino acid residues in a given mAb. We suspected that the challenge in interpreting the three properties above stemmed from having a small dataset, making model construction and interpretation difficult as would be the case for any machine learning model.^Citation39 Nonetheless, our model is more accurate than the multiscale model,^Citation16 and, by inference, any model not capable of making a k_a prediction when the primary sequence is the only available model input in early discovery and development, which prompted us to keep these properties in our model.

Discussion

Here, we report the development, validation, benchmarking, and interpretation of a model that can predict the clinical k_a of a mAb using its primary sequence as the only input. Unlike other PK models to predict mAb SC absorption,^Citation40 our model can make k_a predictions during early discovery and development when human and animal in vivo data are typically not available. The superior performance of our model against an existing state-of-the-art mechanistic model clearly demonstrates that machine learning is a viable option for predicting k_a.^Citation16 In fact, as echoed in another study that also applied machine learning to studying SC PK of mAbs,^Citation41 we also believe that machine learning approaches are appropriate for studying SC absorption, which by its inherent nature is challenging to deconstruct with traditional modeling methods and remain elusive even after decades of research.

The limitation of the data-centric model reported here is its inability to directly provide a physical basis for the observed correlations between molecular properties and k_a, which creates three distinct but inter-related challenges. First, discerning false model overfitting from true physical trends, particularly when the available dataset is small, is not straight-forward.^Citation39 We expect that the model’s use of the three molecular properties that were challenging to interpret because of a lack of clear, monotonic ordering of colored points in the SHAP plot, and the loss in accuracy if they are removed, will be addressed when more k_a data become available. Second, a manual assessment of the relevant molecular properties is necessary to assign physical meaning to the observed correlations. Although we provide our interpretations, which are consistent with prior literature, these would need to be verified through carefully designed experimental studies. Third, by the very attritive nature of drug development, only a subset of mAbs that are deemed to meet specific pre-defined criteria are tested clinically and therefore have clinical data available. Hence, the resultant skewed dataset inevitably biases the model. What is interesting is that, even when accounting for all the challenges, the model still makes the most accurate a priori prediction of clinical k_a with the least information needed as model input. One last limitation of the study is that there is inconsistency in how the source documents calculated k_a, ranging from using a variant of the compartmental model family to curve stripping by fitting the absorption half-life. Such inconsistency may explain why varying values exist even for the same parameter, as noted by a recent review.^Citation42

The model reported here can be improved in two ways. First, additional factors known to affect SC absorption can be added to the model to further improve its accuracy. In particular, our current model does not account for many factors known to affect SC absorption, such as formulation properties, injection and device parameters, disease state, and subject characteristics.^{Citation13,Citation17,Citation43–48} While molecular properties are likely convoluted with some of these parameters, the latter are not directly reflected in the former. Despite our best efforts, we could not collect this type of information even for a subset of the mAbs due to the lack of consistent and complete data reporting practices (for details, see Rate Constants tab in Supplementary Information). Gathering these absent parameters, if possible, will improve the accuracy of the model. Second, as more SC mAbs enter the clinical pipeline, additional k_a data will become available to be used in the model. Accessing this ever-growing dataset in the future can be another approach to improve the model.

Prior to our work and to the best of our knowledge, there are no preexisting models to predict the k_a of a mAb post clinical subcutaneous administration without relying on clinical or preclinical drug plasma concentration-time data. The proposed model, once trained, only requires in silico molecular properties as input. When compared to not only traditional regression models but also to a recently published mechanistic PK model, it achieved the lowest RMSE. By using the novel XGBoost algorithm, this study demonstrated the applicability of a machine learning approach to the study of SC absorption. Despite its superior accuracy, our model explains the observed variability in clinical k_a values across mAbs only in part; we believe that, as more k_a and relevant data become available, the accuracy of the prediction tool will increase further.

This study also demonstrates the potential to derive mechanistic knowledge from a data-centric model. The correlations between the properties and the predicted k_a in our model are consistent with the observations from previous experimental studies. The results were evaluated correlations and interpreted in the context of physicochemical principles, as well as the underlying biology of the absorption process. Though some properties remain challenging to interpret, the knowledge gained from this study offers a new perspective on the design of further hypothesis-driven mechanistic studies of SC absorption, contributing to the continued development of SC-administered mAbs.

Supplemental material

Supplemental Material

Download MS Word (310.6 KB)

Supplementary Information

Download MS Excel (104.8 KB)

Acknowledgments

The authors are grateful to Patricia L. Brown-Augsburger for compiling some of the absorption rate constants from the literature; Yuhao Lin, Peter Thomas, and Qing Chai for their help with generating the molecular properties; and Nagarajan R. Thyagarajapuram for his continuous support.

Disclosure statement

No potential conflict of interest was reported by the author(s).

Data availability statement

The datasets for the absorption rate constants and the molecular properties are available in the Supplementary Information.

Supplementary material

Supplemental data for this article can be accessed online at https://doi.org/10.1080/19420862.2024.2352887.

Additional information

Funding

This study was entirely funded by Eli Lilly & Company.

References

Swartz M. The physiology of the lymphatic system. Adv Drug Deliv Rev. 2001;50(1–2):3–11. doi:10.1016/S0169-409X(01)00150-8.
PubMed Web of Science ®Google Scholar
Supersaxo A, Hein WR, Steffen H. Effect of molecular weight on the lymphatic absorption of water-soluble compounds following subcutaneous administration. Pharm Res. 1990;7(2):167–69. doi:10.1023/A:1015880819328.
PubMed Web of Science ®Google Scholar
Gabrielsson J, Weiner D, Weiner D. Pharmacokinetic & pharmacodynamic data analysis: concepts and applications. In: Gabrielsson J, editors. Expanded.; PK/PD data analysis: concepts and applications; Apotekarsocieteten, Swedish Acedemy of Pharmaceutical Sciences. Stockholm: Swedish Pharmaceutical Press; 2006. p. 43.
Google Scholar
Shah DK, Betts AM. Towards a platform PBPK model to characterize the plasma and tissue disposition of monoclonal antibodies in preclinical species and human. J Pharmacokinet Pharmacodyn. 2012;39(1):67–86. doi:10.1007/s10928-011-9232-2.
PubMed Web of Science ®Google Scholar
Cao Y, Jusko WJ. Applications of minimal physiologically-based pharmacokinetic models. J Pharmacokinet Pharmacodyn. 2012;39(6):711–23. doi:10.1007/s10928-012-9280-2.
PubMed Web of Science ®Google Scholar
Jones H, Rowland-Yeo K. Basic concepts in physiologically based pharmacokinetic modeling in drug discovery and development. CPT Pharmacomet Syst Pharmacol. 2013;2(8):63. doi:10.1038/psp.2013.41.
PubMedGoogle Scholar
Li Z, Yu X, Li Y, Verma A, Chang HP, Shah DK. A two-pore physiologically based pharmacokinetic model to predict subcutaneously administered different-size antibody/antibody fragments. Aaps J. 2021;23(3):62. doi:10.1208/s12248-021-00588-8.
PubMed Web of Science ®Google Scholar
Hu S, D’Argenio DZ. Predicting monoclonal antibody pharmacokinetics following subcutaneous administration via whole-body physiologically-based modeling. J Pharmacokinet Pharmacodyn. 2020;47(5):385–409. doi:10.1007/s10928-020-09691-3.
PubMed Web of Science ®Google Scholar
Gill KL, Gardner I, Li L, Jamei M. A bottom-up whole-body physiologically based pharmacokinetic model to mechanistically predict tissue distribution and the rate of subcutaneous absorption of therapeutic proteins. Aaps J. 2016;18(1):156–70. doi:10.1208/s12248-015-9819-4.
PubMed Web of Science ®Google Scholar
Chemical Computing Group ULC. Molecular operating environment (MOE); 1010 Sherbooke St. West, Suite #910 Montreal, QC, Canada, H3A 2R7; 2022.
Google Scholar
Datta-Mannan A, Estwick S, Zhou C, Choi H, Douglass NE, Witcher DR, Lu J, Beidler C, Millican R. Influence of physiochemical properties on the subcutaneous absorption and bioavailability of monoclonal antibodies. Mabs-austin. 2020;12(1):1770028. doi:10.1080/19420862.2020.1770028.
PubMed Web of Science ®Google Scholar
Reddy ST, Berk DA, Jain RK, Swartz MA. A sensitive in vivo model for quantifying interstitial convective transport of injected macromolecules and nanoparticles. J Appl Physiol. 2006;101(4):1162–69. doi:10.1152/japplphysiol.00389.2006.
PubMed Web of Science ®Google Scholar
Richter WF, Bhansali SG, Morris ME. Mechanistic determinants of biotherapeutics absorption following SC administration. Aaps J. 2012;14(3):559–70. doi:10.1208/s12248-012-9367-0.
PubMed Web of Science ®Google Scholar
Datta-Mannan A. Mechanisms influencing the pharmacokinetics and disposition of monoclonal antibodies and peptides. Drug Metab Dispos. 2019;47(10):1100–10. doi:10.1124/dmd.119.086488.
PubMed Web of Science ®Google Scholar
Turner MR, Balu-Iyer SV. Challenges and opportunities for the subcutaneous delivery of therapeutic proteins. J Pharm Sci. 2018;107(5):1247–60. doi:10.1016/j.xphs.2018.01.007.
PubMed Web of Science ®Google Scholar
Zheng F, Hou P, Corpstein CD, Park K, Li T. Multiscale pharmacokinetic modeling of systemic exposure of subcutaneously injected biotherapeutics. J Controlled Release. 2021;337:407–16. doi:10.1016/j.jconrel.2021.07.043.
PubMed Web of Science ®Google Scholar
Wasserman RL. Overview of recombinant human hyaluronidase-facilitated subcutaneous infusion of IgG in primary immunodeficiencies. Immunotherapy. 2014;6(5):553–67. doi:10.2217/imt.14.34.
PubMed Web of Science ®Google Scholar
Yang L, Allan J. Pharmapendium. Amsterdam, Netherlands: Elsevier.
Google Scholar
Kinnunen HM, Mrsny RJ. Improving the outcomes of biopharmaceutical delivery via the subcutaneous route by understanding the chemical, physical and physiological properties of the subcutaneous injection site. J Controlled Release. 2014;182:22–32. doi:10.1016/j.jconrel.2014.03.011.
PubMed Web of Science ®Google Scholar
Velliangiri S, Alagumuthukrishnan S, Thankumar Joseph SI. A review of dimensionality reduction techniques for efficient computation. Procedia Comput Sci. 2019;165:104–11. doi:10.1016/j.procs.2020.01.079.
Google Scholar
Thorsteinson N, Gunn JR, Kelly K, Long W, Labute P. Structure-based charge calculations for predicting isoelectric point, viscosity, clearance, and profiling antibody therapeutics. Mabs-austin. 2021;13(1):1981805. doi:10.1080/19420862.2021.1981805.
PubMed Web of Science ®Google Scholar
Chen T, Guestrin C. Xgboost: a scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining; 2016; p. 785–94. 10.1145/2939672.2939785.
Google Scholar
Liu Y, Just A. Package Version 0.1.0.; 2020. Shapforxgboost: SHAP plots for “XGBoost”. R.
Google Scholar
R Core Team. R: a language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing; 2021.
Google Scholar
Kuhn M. Building predictive models in R using the caret package. J Stat Softw. 2008;28:5. doi:10.18637/jss.v028.i05.
Web of Science ®Google Scholar
Wickham H. Ggplot2. Cham: Use R!; Springer International Publishing; 2016. doi:10.1007/978-3-319-24277-4.
Google Scholar
Vidarsson G, Dekkers G, Rispens T. IgG subclasses and Allotypes: from structure to effector functions. Front Immunol. 2014;5. doi:10.3389/fimmu.2014.00520.
PubMed Web of Science ®Google Scholar
Tang Y, Cain P, Anguiano V, Shih JJ, Chai Q, Feng Y. Impact of IgG subclass on molecular properties of monoclonal antibodies. Mabs-austin. 2021;13(1):1993768. doi:10.1080/19420862.2021.1993768.
PubMed Web of Science ®Google Scholar
Shwartz-Ziv R, Armon A. Tabular data: deep learning is not all you need. arXiv November 23, 2021. (accessed 2022 07 14) http://arxiv.org/abs/2106.03253.
Google Scholar
Bhattacharya S, S SRK, Maddikunta PKR, Kaluri R, Singh S, Gadekallu TR, Alazab M, Tariq U. A novel PCA-Firefly based XGBoost classification model for intrusion detection in networks using GPU. Electronics. 2020;9(2):219. doi:10.3390/electronics9020219.
Web of Science ®Google Scholar
Nobre J, Neves RF. Combining Principal component analysis, discrete wavelet transform and XGBoost to trade in the financial markets. Expert Syst Appl. 2019;125:181–94. doi:10.1016/j.eswa.2019.01.083.
Web of Science ®Google Scholar
Molnar C. Interpretable machine learning: a guide for making black box models explainable. 2nd. Munich, Germany: Christoph Molnar; 2022.
Google Scholar
Tibbitts J, Canter D, Graff R, Smith A, Khawli LA. Key factors influencing ADME properties of therapeutic proteins: a need for ADME characterization in drug discovery and development. mAbs-austin. 2016;8(2):229–45. doi:10.1080/19420862.2015.1115937.
PubMed Web of Science ®Google Scholar
Khawli LA, Goswami S, Hutchinson R, Kwong ZW, Yang J, Wang X, Yao Z, Sreedhara A, Cano T, Tesar DB. et al. Charge variants in IgG1: isolation, characterization, in vitro binding properties and pharmacokinetics in rats. mAbs-austin. 2010;2(6):613–24. doi:10.4161/mabs.2.6.13333.
PubMed Web of Science ®Google Scholar
Boswell CA, Tesar DB, Mukhyala K, Theil F-P, Fielder PJ, Khawli LA. Effects of charge on antibody tissue distribution and pharmacokinetics. Bioconjug Chem. 2010;21(12):2153–63. doi:10.1021/bc100261d.
PubMed Web of Science ®Google Scholar
Kyte J, Doolittle RF. A simple method for displaying the hydropathic character of a protein. J Mol Biol. 1982;157(1):105–32. doi:10.1016/0022-2836(82)90515-0.
PubMed Web of Science ®Google Scholar
Koulakis JP, Rouch J, Huynh N, Wu HH, Dunn JCY, Putterman S. Tumescent injections in subcutaneous pig tissue disperse fluids volumetrically and maintain elevated local concentrations of additives for several hours, suggesting a treatment for drug resistant wounds. Pharm Res. 2020;37(3):51. doi:10.1007/s11095-020-2769-2.
PubMed Web of Science ®Google Scholar
Pace CN, Vajdos F, Fee L, Grimsley G, Gray T. How to measure and predict the molar absorption coefficient of a protein. Protein Sci. 1995;4(11):2411–23. doi:10.1002/pro.5560041120.
PubMed Web of Science ®Google Scholar
Ying X. An overview of overfitting and its solutions. J Phys Conf Ser. 2019;1168:022022. doi:10.1088/1742-6596/1168/2/022022.
Google Scholar
Kagan L. Pharmacokinetic modeling of the subcutaneous absorption of therapeutic proteins. Drug Metab Dispos. 2014;42(11):1890–905. doi:10.1124/dmd.114.059121.
PubMedGoogle Scholar
Lou H, Hageman MJ. Machine learning attempts for predicting human subcutaneous bioavailability of monoclonal antibodies. Pharm Res. 2021;38(3):451–60. doi:10.1007/s11095-021-03022-y.
PubMed Web of Science ®Google Scholar
Martin KP, Grimaldi C, Grempler R, Hansel S, Kumar S. Trends in industrialization of biotherapeutics: a survey of product characteristics of 89 antibody-based biotherapeutics. Mabs-austin. 2023;15(1):2191301. doi:10.1080/19420862.2023.2191301.
PubMed Web of Science ®Google Scholar
Porter C. Lymphatic transport of proteins after s.C. Injection: implications of animal model selection. Adv Drug Deliv Rev. 2001;50(1–2):157–71. doi:10.1016/S0169-409X(01)00153-3.
PubMed Web of Science ®Google Scholar
Thomas VA, Balthasar JP. Understanding inter-individual variability in monoclonal antibody disposition. Antibodies. 2019;8(4):56. doi:10.3390/antib8040056.
Web of Science ®Google Scholar
Gill KL, Machavaram KK, Rose RH, Chetty M. Potential sources of inter-subject variability in monoclonal antibody pharmacokinetics. Clin Pharmacokinet. 2016;55(7):789–805. doi:10.1007/s40262-015-0361-4.
PubMed Web of Science ®Google Scholar
Dirks NL, Meibohm B. Population pharmacokinetics of therapeutic monoclonal antibodies: Clin. Pharmacokinet. 2010;49(10):633–59. doi:10.2165/11535960-000000000-00000.
Google Scholar
Zou P, Wang F, Wang J, Lu Y, Tran D, Seo SK. Impact of injection sites on clinical pharmacokinetics of subcutaneously administered peptides and proteins. J Controlled Release. 2021;336:310–21. doi:10.1016/j.jconrel.2021.06.038.
PubMed Web of Science ®Google Scholar
Bown HK, Bonn C, Yohe S, Yadav DB, Patapoff TW, Daugherty A, Mrsny RJ. In vitro model for predicting bioavailability of subcutaneously injected monoclonal antibodies. J Controlled Release. 2018;273:13–20. doi:10.1016/j.jconrel.2018.01.015.
PubMed Web of Science ®Google Scholar

Predicting the clinical subcutaneous absorption rate constant of monoclonal antibodies using only the primary sequence: a machine learning approach

ABSTRACT

Introduction

Methods