Macroscopic pKa Prediction

Most molecules of practical interest have several places that can gain or lose a proton, making simple per-site macroscopic pKa models incapable of describing their properties accurately. Rowan's macro-pKa prediction workflow is built to reproduce the complexity of full macroscopic ensembles, while using machine learning to accelerate these complex calculations and make them fast enough for routine work. (For a more in-depth description of the difference between macroscopic and microscopic pKa prediction, see our full explanation.)

How It Works

Microstate Enumeration

Microstates are enumerated using a beam-search strategy within a given charge window, by default [2-2, +2+2]. The algorithm proceeds as follows:

  1. Initial state. The input molecule is sanitized with RDKit.
  2. Template-driven generation. New microstates are generated using a substructure-matching protocol. Atom-wise transformations were applied based on SMARTS patterns (adapted from the Uni-pKa data processing pipeline), and unreasonable structures (e.g. pentavalent carbon) were filtered out using a list of hard-coded substructure patterns (also from the Uni-pKa data processing pipeline).
  3. Scoring and pruning. Newly generated candidates are scored using AIMNet2 as an energy estimator, and the NN microstates with the lowest scores are retained (where NN represents the beam width).
  4. Convergence and termination. Previously visited microstates are tracked by canonical SMILES, which prevents revisiting identical microstates. The search terminates when (i) no new states are generated, (ii) all beams become stationary, or (iii) a maximum of ten iterations is reached.

Free-Energy Prediction

We used the Starling model, a lightweight retrained Uni-pKa model, to predict dimensionless free energies for each microstate. Conformers were generated via ETKDG followed by MMFF94 optimization as implemented in RDKit. Predictions for each conformer were aggregated using a log-sum-exponential (LSE) procedure to yield a Boltzmann-averaged microstate energy:

Emicro=logiexp(Ei)E_\text{micro} = -\log \sum_i \exp(-E_i)

where EiE_i is the predicted energy of conformer ii.

Energies were further adjusted by adding a per-charge offset to account for the free energy of solvation of a proton. This offset, approximately 14.96-14.96 in model units, was inferred from the original Uni-pKa training data and computed as 6.50×ln(10)-6.50 \times \ln(10), where 6.50-6.50 is the shift applied to the Uni-pKa model to align with experimental pKa data.

Macroscopic pKa Calculation

Macroscopic pKa values were computed by grouping microstates by formal charge and comparing free energies between adjacent charge states. For a charge transition cc+1c \rightarrow c+1, the macroscopic pKa was computed using:

pKa=1ln(10)[logic+1exp(Ei)logjcexp(Ej)]\text{p}K_\text{a} = \frac{1}{\ln(10)} \left[ \log \sum_{i \in c+1} \exp(-E_i) - \log \sum_{j \in c} \exp(-E_j) \right]

Microstate Populations and Derived Properties

By default, microstate populations were computed across pH 0–14 in 0.1 pH steps. The relative population wi(pH)w_i(\text{pH}) of microstate ii was calculated as:

wi(pH)=exp(Eiqiln(10)pH)Z(pH)w_i(\text{pH}) = \frac{\exp(-E_i - q_i \ln(10) \cdot \text{pH})}{Z(\text{pH})}

where qiq_i is the charge of microstate ii and Z(pH)Z(\text{pH}) is the pH-dependent partition function.

LogD Prediction

LogD(pH) was calculated using weighted averaging of logP values in linear space:

logD(pH)=log10(iwi(pH)10logPi)\log D(\text{pH}) = \log_{10} \left( \sum_i w_i(\text{pH}) \cdot 10^{\log P_i} \right)

LogP values were computed using the Crippen logP function in RDKit for neutral species and set to 2-2 for ionic species.

Isoelectric Point Calculations

The average net charge across pH was computed, and the pI was defined as the pH where the net charge crossed zero, using bisection search with a convergence tolerance of 10310^{-3}.

Accuracy

The Starling pKa-prediction workflow gives accuracy comparable to other state-of-the-art pKa-prediction tools, including the original Uni-pKa report (on which Starling is based) and commercial tools like ChemAxon and Epik. For a more in-depth discussion, see our publication.

MethodNovartis BaseNovartis AcidSAMPL6SAMPL7SAMPL8
Uni-pKa0.6531.0610.7160.7350.878
MolGpka1.0641.2870.7730.9801.150
ChemAxon Marvin1.1451.1441.2480.7081.511
Epik Classic1.1751.5310.9621.648---
Epik 7 (ensemble)------0.61------
QupKake------0.440.851.04
Starling0.7901.0831.1180.7341.142

pH-Dependent Aqueous Solubility

pH-dependent aqueous solubility can be predicted by enabling the "Predict pH-Dependent Aqueous Solubility?" toggle.

pH-dependent prediction begins by using our Kingfisher aqueous-solubility-prediction model to predict the aqueous solubility of the molecule at a neutral pH. We then compute the fraction of microstates where charge is 0 for each pH of interest during the microstate enumeration step of pKa prediction. We use this fraction to scale our neutral-pH solubility prediction down to intrinsic solubility, which is independent of pH. We use the fraction at all other desired pHs to scale the intrinsic solubility up to aqueous solubility for the desired pH.

Non-ideal behavior like aggregation is not modeled through this framework.