Macroscopic pK_a Prediction

Most molecules of practical interest have several places that can gain or lose a proton, making simple per-site macroscopic pK_a models incapable of describing their properties accurately. Rowan's macro-pK_a prediction workflow is built to reproduce the complexity of full macroscopic ensembles, while using machine learning to accelerate these complex calculations and make them fast enough for routine work. (For a more in-depth description of the difference between macroscopic and microscopic pK_a prediction, see our full explanation.)

How It Works

Microstate Enumeration

Microstates are enumerated using a beam-search strategy within a given charge window, by default [ $-2$ , $+2$ ]. The algorithm proceeds as follows:

Initial state. The input molecule is sanitized with RDKit.
Template-driven generation. New microstates are generated using a substructure-matching protocol. Atom-wise transformations were applied based on SMARTS patterns (adapted from the Uni-pKa data processing pipeline), and unreasonable structures (e.g. pentavalent carbon) were filtered out using a list of hard-coded substructure patterns (also from the Uni-pKa data processing pipeline).
Scoring and pruning. Newly generated candidates are scored using AIMNet2 as an energy estimator, and the $N$ microstates with the lowest scores are retained (where $N$ represents the beam width).
Convergence and termination. Previously visited microstates are tracked by canonical SMILES, which prevents revisiting identical microstates. The search terminates when (i) no new states are generated, (ii) all beams become stationary, or (iii) a maximum of ten iterations is reached.

Free-Energy Prediction

We used the Starling model, a lightweight retrained Uni-pKa model, to predict dimensionless free energies for each microstate. Conformers were generated via ETKDG followed by MMFF94 optimization as implemented in RDKit. Predictions for each conformer were aggregated using a log-sum-exponential (LSE) procedure to yield a Boltzmann-averaged microstate energy:

$E_\text{micro} = -\log \sum_i \exp(-E_i)$

where $E_i$ is the predicted energy of conformer $i$ .

Energies were further adjusted by adding a per-charge offset to account for the free energy of solvation of a proton. This offset, approximately $-14.96$ in model units, was inferred from the original Uni-pKa training data and computed as $-6.50 \times \ln(10)$ , where $-6.50$ is the shift applied to the Uni-pKa model to align with experimental pK_a data.

Macroscopic pK_a Calculation

Macroscopic pK_a values were computed by grouping microstates by formal charge and comparing free energies between adjacent charge states. For a charge transition $c \rightarrow c+1$ , the macroscopic pK_a was computed using:

$\text{p}K_\text{a} = \frac{1}{\ln(10)} \left[ \log \sum_{i \in c+1} \exp(-E_i) - \log \sum_{j \in c} \exp(-E_j) \right]$

Microstate Populations and Derived Properties

By default, microstate populations were computed across pH 0–14 in 0.1 pH steps. The relative population $w_i(\text{pH})$ of microstate $i$ was calculated as:

$w_i(\text{pH}) = \frac{\exp(-E_i - q_i \ln(10) \cdot \text{pH})}{Z(\text{pH})}$

where $q_i$ is the charge of microstate $i$ and $Z(\text{pH})$ is the pH-dependent partition function.

LogD Prediction

LogD(pH) was calculated using weighted averaging of logP values in linear space:

$\log D(\text{pH}) = \log_{10} \left( \sum_i w_i(\text{pH}) \cdot 10^{\log P_i} \right)$

LogP values were computed using the Crippen logP function in RDKit for neutral species and set to $-2$ for ionic species.

Isoelectric Point Calculations

The average net charge across pH was computed, and the pI was defined as the pH where the net charge crossed zero, using bisection search with a convergence tolerance of $10^{-3}$ .

Accuracy

The Starling pK_a-prediction workflow gives accuracy comparable to other state-of-the-art pK_a-prediction tools, including the original Uni-pKa report (on which Starling is based) and commercial tools like ChemAxon and Epik. For a more in-depth discussion, see our publication.

Method	Novartis Base	Novartis Acid	SAMPL6	SAMPL7	SAMPL8
Uni-pKa	0.653	1.061	0.716	0.735	0.878
MolGpka	1.064	1.287	0.773	0.980	1.150
ChemAxon Marvin	1.145	1.144	1.248	0.708	1.511
Epik Classic	1.175	1.531	0.962	1.648	---
Epik 7 (ensemble)	---	---	0.61	---	---
QupKake	---	---	0.44	0.85	1.04
Starling	0.790	1.083	1.118	0.734	1.142

pH-Dependent Aqueous Solubility

pH-dependent aqueous solubility can be predicted by enabling the "Predict pH-Dependent Aqueous Solubility?" toggle.

pH-dependent prediction begins by using our Kingfisher aqueous-solubility-prediction model to predict the aqueous solubility of the molecule at a neutral pH. We then compute the fraction of microstates where charge is 0 for each pH of interest during the microstate enumeration step of pK_a prediction. We use this fraction to scale our neutral-pH solubility prediction down to intrinsic solubility, which is independent of pH. We use the fraction at all other desired pHs to scale the intrinsic solubility up to aqueous solubility for the desired pH.

Non-ideal behavior like aggregation is not modeled through this framework.

Macroscopic pKa Prediction