Most molecules of practical interest have several sites that can gain or lose a proton, so simple per-site microscopic pKa models cannot describe their properties accurately. Rowan's macro-pKa prediction workflow is built to reproduce the complexity of full microstate ensembles, while using machine learning to make these calculations fast enough for routine work. (For a more in-depth description of the difference between macroscopic and microscopic pKa prediction, see our full explanation.)
Microstates are enumerated using a beam-search strategy within a given charge window, by default [, ]. The algorithm proceeds as follows:
We used the Starling model, a lightweight retrained Uni-pKa model, to predict dimensionless free energies for each microstate. Conformers were generated via ETKDG followed by MMFF94 optimization as implemented in RDKit. Predictions for each conformer were aggregated using a log-sum-exponential (LSE) procedure to yield a Boltzmann-averaged microstate energy:
$$G_{\text{microstate}} = -\ln \sum_{c} \exp(-G_c)$$
where $G_c$ is the predicted energy of conformer $c$.
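The LSE aggregation can be sketched in a few lines. This is a minimal illustration, assuming the conventional form $-\ln \sum_c \exp(-G_c)$ for dimensionless energies; the function name `boltzmann_lse` is hypothetical and not part of Rowan's code.

```python
import math


def boltzmann_lse(conformer_energies):
    """Combine dimensionless conformer energies into one microstate energy.

    Computes -ln(sum_c exp(-G_c)). The minimum energy is factored out first
    so the exponentials cannot overflow (assumed form, see lead-in).
    """
    if not conformer_energies:
        raise ValueError("need at least one conformer energy")
    g_min = min(conformer_energies)
    total = sum(math.exp(-(g - g_min)) for g in conformer_energies)
    return g_min - math.log(total)
```

The result is never higher than the lowest conformer energy, and each additional low-lying conformer lowers it further, which is the behavior expected of a Boltzmann-weighted ensemble free energy.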
Energies were further adjusted by adding a per-charge offset to account for the free energy of solvation of a proton. This offset, approximately in model units, was inferred from the original Uni-pKa training data and computed as , where is the shift applied to the Uni-pKa model to align with experimental pKa data.
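Macroscopic pKa values were computed by grouping microstates by formal charge and comparing free energies between adjacent charge states. For a charge transition $q \to q-1$, the macroscopic pKa was computed using:
$$\mathrm{p}K_a = \frac{G_{q-1} - G_q}{\ln 10}, \qquad G_q = -\ln \sum_{i \in q} \exp(-G_i)$$
where $G_q$ is the ensemble free energy of all microstates with formal charge $q$ and $G_i$ is the offset-adjusted energy of microstate $i$.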
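The grouping-and-differencing step can be sketched as follows. This is an illustration under stated assumptions, not Rowan's implementation: microstate energies are dimensionless and already include the per-charge proton offset, each charge group is combined by log-sum-exponential, and the pKa for $q \to q-1$ is the ensemble energy difference divided by $\ln 10$. The names `lse` and `macroscopic_pkas` are hypothetical.

```python
import math
from collections import defaultdict

LN10 = math.log(10.0)


def lse(energies):
    """Ensemble free energy -ln(sum exp(-G)) of a group of microstates."""
    g_min = min(energies)
    return g_min - math.log(sum(math.exp(-(g - g_min)) for g in energies))


def macroscopic_pkas(microstates):
    """Map each adjacent charge transition (q, q - 1) to a macroscopic pKa.

    `microstates` is a list of (formal_charge, dimensionless_energy) pairs;
    energies are assumed to include the per-charge proton offset.
    """
    by_charge = defaultdict(list)
    for charge, energy in microstates:
        by_charge[charge].append(energy)
    ensemble = {q: lse(gs) for q, gs in by_charge.items()}

    pkas = {}
    for q in sorted(ensemble, reverse=True):
        if q - 1 in ensemble:  # only adjacent charge states define a pKa
            pkas[(q, q - 1)] = (ensemble[q - 1] - ensemble[q]) / LN10
    return pkas
```

For example, `macroscopic_pkas([(1, 0.0), (0, 5 * LN10)])` describes a single cation/neutral pair whose transition falls at pH 5.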
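By default, microstate populations were computed across pH 0–14 in 0.1 pH steps. The relative population of microstate $i$ was calculated as:
$$p_i(\mathrm{pH}) = \frac{\exp\left(-G_i - q_i \ln(10)\,\mathrm{pH}\right)}{Z(\mathrm{pH})}, \qquad Z(\mathrm{pH}) = \sum_j \exp\left(-G_j - q_j \ln(10)\,\mathrm{pH}\right)$$
where $G_i$ is the offset-adjusted energy of microstate $i$, $q_i$ is the charge of microstate $i$, and $Z(\mathrm{pH})$ is the pH-dependent partition function.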
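A short sketch of the population calculation on the default pH grid follows. It assumes each microstate carries a Boltzmann weight of $\exp(-G_i - q_i \ln(10)\,\mathrm{pH})$, so that more highly protonated states are favored at low pH; this weighting and the function name `populations` are assumptions made for illustration.

```python
import math

LN10 = math.log(10.0)


def populations(microstates, ph):
    """Return the relative population of each microstate at a given pH.

    `microstates` is a list of (formal_charge, dimensionless_energy) pairs.
    Weights are normalized by the pH-dependent partition function Z(pH).
    """
    log_weights = [-g - q * LN10 * ph for q, g in microstates]
    shift = max(log_weights)  # subtract the maximum to avoid overflow
    weights = [math.exp(w - shift) for w in log_weights]
    z = sum(weights)
    return [w / z for w in weights]


# Default grid described in the text: pH 0 to 14 in steps of 0.1.
ph_grid = [round(0.1 * i, 1) for i in range(141)]

# Toy system: a cation and a neutral form that interconvert at pH 5.
toy = [(1, 0.0), (0, 5 * LN10)]
curve = [populations(toy, ph) for ph in ph_grid]
```

In the toy system the two populations are equal at pH 5, which is the same point at which the macroscopic pKa for that charge transition is defined.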
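LogD(pH) was calculated using weighted averaging of logP values in linear space:
$$\log D(\mathrm{pH}) = \log_{10} \sum_i p_i(\mathrm{pH}) \cdot 10^{\log P_i}$$
where $p_i(\mathrm{pH})$ is the population of microstate $i$ at the given pH and $\log P_i$ is its logP value.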
LogP values were computed using the Crippen logP function in RDKit for neutral species and set to for ionic species.
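The linear-space averaging can be illustrated with a small helper. This sketch takes populations and per-microstate logP values as plain lists; the function name `log_d` is hypothetical, and the ionic logP of `-2.0` used in the example is a placeholder chosen for illustration, not the value used by the workflow.

```python
import math


def log_d(pops, logp_values):
    """Population-weighted logD: average partition coefficients, then take log10."""
    if len(pops) != len(logp_values):
        raise ValueError("one logP value is needed per microstate")
    linear = sum(p * 10.0 ** lp for p, lp in zip(pops, logp_values))
    return math.log10(linear)


# Example: 10% neutral (logP 2.0) and 90% ionized (placeholder logP -2.0).
example = log_d([0.1, 0.9], [2.0, -2.0])
```

Averaging in linear space means the most lipophilic species dominates logD even when it is a minority population, which is why logD falls off slowly as a molecule ionizes.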
The average net charge across pH was computed, and the pI was defined as the pH where the net charge crossed zero, using bisection search with a convergence tolerance of .
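The isoelectric-point search can be sketched as a bisection on the population-weighted net charge. The microstate weighting $\exp(-G_i - q_i \ln(10)\,\mathrm{pH})$, the search bounds of pH 0 to 14, and the default tolerance `tol=1e-3` are all assumptions made for this illustration; the function names are hypothetical.

```python
import math

LN10 = math.log(10.0)


def net_charge(microstates, ph):
    """Population-weighted average charge at a given pH."""
    log_w = [-g - q * LN10 * ph for q, g in microstates]
    shift = max(log_w)
    weights = [math.exp(w - shift) for w in log_w]
    z = sum(weights)
    return sum(q * w for (q, _), w in zip(microstates, weights)) / z


def isoelectric_point(microstates, lo=0.0, hi=14.0, tol=1e-3):
    """Bisect for the pH where the net charge crosses zero.

    Returns None when the net charge has the same sign at both bounds.
    """
    c_lo = net_charge(microstates, lo)
    c_hi = net_charge(microstates, hi)
    if c_lo * c_hi > 0:
        return None
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        c_mid = net_charge(microstates, mid)
        if c_lo * c_mid <= 0:
            hi = mid  # the crossing lies in the lower half
        else:
            lo, c_lo = mid, c_mid  # the crossing lies in the upper half
    return 0.5 * (lo + hi)


# Toy ampholyte with macroscopic pKa values of 3 and 9, so the pI is near 6.
ampholyte = [(1, 0.0), (0, 3 * LN10), (-1, 12 * LN10)]
pi_value = isoelectric_point(ampholyte)
```

Bisection works here because the net charge decreases monotonically with pH, so a sign change between the bounds brackets exactly one crossing.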
The Starling pKa-prediction workflow gives accuracy comparable to other state-of-the-art pKa-prediction tools, including the original Uni-pKa report (on which Starling is based) and commercial tools like ChemAxon and Epik. For a more in-depth discussion, see our publication.
Method | Novartis Base | Novartis Acid | SAMPL6 | SAMPL7 | SAMPL8 |
---|---|---|---|---|---|
Uni-pKa | 0.653 | 1.061 | 0.716 | 0.735 | 0.878 |
MolGpka | 1.064 | 1.287 | 0.773 | 0.980 | 1.150 |
ChemAxon Marvin | 1.145 | 1.144 | 1.248 | 0.708 | 1.511 |
Epik Classic | 1.175 | 1.531 | 0.962 | 1.648 | --- |
Epik 7 (ensemble) | --- | --- | 0.61 | --- | --- |
QupKake | --- | --- | 0.44 | 0.85 | 1.04 |
Starling | 0.790 | 1.083 | 1.118 | 0.734 | 1.142 |
pH-dependent aqueous solubility can be predicted by enabling the "Predict pH-Dependent Aqueous Solubility?" toggle.
pH-dependent prediction begins by using our Kingfisher aqueous-solubility-prediction model to predict the aqueous solubility of the molecule at neutral pH. From the microstates enumerated during pKa prediction, we then compute the fraction of the population with charge 0 at each pH of interest. Multiplying the neutral-pH solubility by the neutral fraction at neutral pH scales the prediction down to the intrinsic solubility, which is independent of pH. Dividing the intrinsic solubility by the neutral fraction at any other desired pH scales it back up to the aqueous solubility at that pH.
Non-ideal behavior like aggregation is not modeled through this framework.
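The two scaling steps described above amount to one multiplication and one division. The sketch below assumes solubilities are expressed in linear units (for example mol/L rather than log S) and that the neutral fractions come from the microstate populations; the function and argument names are hypothetical.

```python
def ph_dependent_solubility(s_reference, f_neutral_reference, f_neutral_target):
    """Scale a reference-pH solubility to another pH via the neutral fraction.

    s_reference:          predicted solubility at the reference (neutral) pH
    f_neutral_reference:  fraction of charge-0 population at the reference pH
    f_neutral_target:     fraction of charge-0 population at the target pH
    """
    if not (0.0 < f_neutral_reference <= 1.0 and 0.0 < f_neutral_target <= 1.0):
        raise ValueError("neutral fractions must lie in (0, 1]")
    intrinsic = s_reference * f_neutral_reference  # scale down to intrinsic
    return intrinsic / f_neutral_target            # scale up to the target pH


# Example: half neutral at the reference pH, 5% neutral at the target pH.
s_target = ph_dependent_solubility(1e-3, 0.5, 0.05)
```

Because the neutral fraction appears in the denominator, predicted solubility grows rapidly at pH values where the molecule is almost fully ionized, which is one reason the absence of aggregation and other non-ideal effects matters at the extremes.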