Solubility

fastsolv

Rowan uses the fastsolv model developed by Lucas Attia, Jackson Burns, and co-workers at MIT and described in the preprint "Organic Solubility Prediction at the Limit of Aleatoric Uncertainty." (See also the authors' own fastsolv web interface.)

The fastsolv model was trained on BigSolDB, which contains 54,273 experimental solubility measurements spanning 839 solutes and 138 solvents across temperatures ranging from −30°C to 130°C.

Solutes and solvents are both input using the SMILES format, and the model is used to generate a prediction for each solute-solvent-temperature combination. Rowan has a number of solvents predefined on the front-end, each of which is associated with a SMILES string, and supports custom SMILES entry.
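As an illustration of how the solute-solvent-temperature combinations expand, here is a minimal Python sketch. It is not Rowan's API: `SolubilityQuery`, `SOLVENT_SMILES`, and `build_queries` are hypothetical names, and the small solvent table only stands in for the predefined front-end list.

```python
from itertools import product
from typing import Dict, Iterable, Iterator, NamedTuple


class SolubilityQuery(NamedTuple):
    """One solute-solvent-temperature combination (hypothetical container)."""
    solute_smiles: str
    solvent_smiles: str
    temperature_c: float


# Hypothetical name-to-SMILES table standing in for the predefined solvents.
SOLVENT_SMILES: Dict[str, str] = {
    "water": "O",
    "ethanol": "CCO",
    "acetone": "CC(C)=O",
}


def build_queries(
    solutes: Iterable[str],
    solvents: Iterable[str],
    temperatures_c: Iterable[float],
) -> Iterator[SolubilityQuery]:
    """Yield one query per solute-solvent-temperature combination."""
    for solute, solvent, temp in product(solutes, solvents, temperatures_c):
        # A known solvent name maps to its SMILES; anything else is
        # treated as a custom SMILES string and passed through unchanged.
        yield SolubilityQuery(solute, SOLVENT_SMILES.get(solvent, solvent), temp)


queries = list(
    build_queries(["CC(=O)Oc1ccccc1C(=O)O"], ["water", "CCCCCC"], [0.0, 25.0])
)
```

Here one solute, two solvents, and two temperatures expand to four queries, each of which would receive its own prediction.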

This workflow returns both predicted solubilities and standard deviations for each solvent-temperature combination. When viewed on Rowan's web GUI, these results are plotted on an interactive graph.

Aqueous Solubility

Rowan offers two methods for aqueous solubility prediction: Kingfisher and a reparameterized ESOL.

Both models were trained on an 80% Butina split of the Falcón-Cano et al. "reliable" dataset, which is a de-duplicated combination of the AqSolDB and Cui et al. datasets. We used 1024-bit Morgan fingerprints with radius 2 for Butina splitting, and our train split contains 10,043 experimental aqueous solubility measurements. All measurements were taken at room temperature in neutral-pH water.
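The idea behind a Butina split can be sketched in pure Python. This is an illustration under stated assumptions, not the code used to build the training set: fingerprints are represented as sets of on-bit indices (in practice they would come from a cheminformatics toolkit), the distance cutoff of 0.35 is an arbitrary example value, and the largest-cluster-first assignment to the train split is one common choice among several.

```python
from typing import FrozenSet, List, Sequence, Set, Tuple

Fingerprint = FrozenSet[int]  # indices of the on-bits of a bit-vector fingerprint


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto similarity between two sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def butina_clusters(fps: Sequence[Fingerprint], cutoff: float) -> List[List[int]]:
    """Butina clustering: molecules within `cutoff` Tanimoto distance are neighbours."""
    n = len(fps)
    neighbours: List[Set[int]] = [set() for _ in range(n)]
    for i in range(n):
        for j in range(i):
            if 1.0 - tanimoto(fps[i], fps[j]) <= cutoff:
                neighbours[i].add(j)
                neighbours[j].add(i)
    # Visit molecules with the most neighbours first; each unassigned one
    # becomes a centroid and claims its still-unassigned neighbours.
    order = sorted(range(n), key=lambda i: len(neighbours[i]), reverse=True)
    assigned: Set[int] = set()
    clusters: List[List[int]] = []
    for centroid in order:
        if centroid in assigned:
            continue
        members = [centroid] + sorted(neighbours[centroid] - assigned)
        assigned.update(members)
        clusters.append(members)
    return clusters


def cluster_split(
    clusters: Sequence[Sequence[int]], train_fraction: float = 0.8
) -> Tuple[List[int], List[int]]:
    """Assign whole clusters to train (largest first) up to the target size."""
    target = round(train_fraction * sum(len(c) for c in clusters))
    train: List[int] = []
    test: List[int] = []
    for cluster in sorted(clusters, key=len, reverse=True):
        if len(train) + len(cluster) <= target:
            train.extend(cluster)
        else:
            test.extend(cluster)
    return train, test


# Toy fingerprints (made-up bit indices, not real molecules).
toy_fps = [
    frozenset({1, 2, 3}), frozenset({1, 2, 3, 4}), frozenset({1, 2, 3, 5}),
    frozenset({10, 11}), frozenset({10, 11, 12}), frozenset({20, 21}),
]
toy_clusters = butina_clusters(toy_fps, cutoff=0.35)
toy_train, toy_test = cluster_split(toy_clusters, train_fraction=0.8)
```

Because whole clusters go to one side of the split, structurally similar molecules do not straddle the train and test sets, which is the point of splitting this way rather than at random.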

To reflect the training data, we restrict the inputs to these aqueous-solubility-prediction methods to the following:

  • Solvent: water (SMILES: O)
  • Temperature: 25°C (room temperature)

When using the GUI, you can click the "Set Default Settings" button to automatically select Kingfisher- and ESOL-compatible inputs.
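If you prepare inputs programmatically, the same restrictions can be checked before submitting a job. The sketch below is illustrative only; `check_aqueous_inputs` and its constants are hypothetical names, not part of Rowan's API.

```python
WATER_SMILES = "O"          # the only solvent accepted by Kingfisher and ESOL
ROOM_TEMPERATURE_C = 25.0   # the only temperature accepted, in degrees Celsius


def check_aqueous_inputs(solvent_smiles: str, temperature_c: float) -> None:
    """Raise ValueError if the inputs fall outside the aqueous-model restrictions."""
    if solvent_smiles != WATER_SMILES:
        raise ValueError(
            f"aqueous models require water (SMILES 'O'), got {solvent_smiles!r}"
        )
    if abs(temperature_c - ROOM_TEMPERATURE_C) > 1e-6:
        raise ValueError(
            f"aqueous models require {ROOM_TEMPERATURE_C} °C, got {temperature_c}"
        )
```

Note that the solvent check compares SMILES strings literally, so an equivalent but differently written SMILES for water would be rejected by this sketch.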

Kingfisher

Rowan uses the Kingfisher model for aqueous solubility prediction. Kingfisher is a fine-tuned version of the CheMeleon model developed by Jackson Burns and co-workers at MIT.

ESOL

Rowan uses a reparameterized ESOL model for low-cost aqueous solubility predictions. Our ESOL model was fit using the dataset developed for training Kingfisher.

ESOL is a multiple linear regression model that predicts aqueous solubility as a linear combination of:

  • molecular weight,
  • the logarithm of the octanol–water partition coefficient (logP),
  • the number of rotatable bonds,
  • the proportion of heavy atoms in the molecule that are aromatic,
  • and a bias term.
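The terms listed above combine as a plain weighted sum. The following Python sketch shows the functional form only: `EsolWeights` and `esol_predict` are hypothetical names, and no fitted coefficients are given here because the refit values are not stated in this document. In Delaney's original formulation the output is the base-10 logarithm of molar solubility; we assume the same convention here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EsolWeights:
    """Regression coefficients, one per descriptor plus a bias (hypothetical container)."""
    bias: float
    molecular_weight: float
    log_p: float
    rotatable_bonds: float
    aromatic_proportion: float


def esol_predict(
    weights: EsolWeights,
    molecular_weight: float,
    log_p: float,
    rotatable_bonds: int,
    aromatic_proportion: float,
) -> float:
    """Return the linear-model prediction for one molecule's descriptors."""
    return (
        weights.bias
        + weights.molecular_weight * molecular_weight
        + weights.log_p * log_p
        + weights.rotatable_bonds * rotatable_bonds
        + weights.aromatic_proportion * aromatic_proportion
    )
```

Because the model is linear in four cheap descriptors, a prediction costs only a handful of multiplications once the descriptors are computed, which is why ESOL suits low-cost screening.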

ESOL was developed by John S. Delaney at Syngenta. We refit the ESOL model using an RDKit-based implementation by Pat Walters.