Pipeline Pilot Spotlight: Reaction Informatics (and a Conference reminder)

BIOVIA Conference 2022 - Still Time To View The Talks!

Before talking about chemical reactions, I'd like to remind everyone that you can still log on and view the talks (and posters!) from the BIOVIA Conference 2022. Currently, though, this will only be possible until the end of the calendar year. So give yourself an early Christmas present and spend some time seeing all the excellent work being done with Pipeline Pilot.

Here is the link to the conference web page:

              https://bioviaconference-events.3ds.com/e/biovia-conference-2022

What follows is some work which I presented at the conference as a poster.

Can We Predict Why Reactions Fail?

In the world of cheminformatics, the chemical reaction has until recently been the poor relation of the molecule, but this is changing quickly. There is more reaction data, of better quality, and much of it is openly available. As a result, data scientists can try more ambitious things.

A good example is the work by cheminformaticians and data scientists at AstraZeneca and the University of Notre Dame:

               AstraZeneca/Notre Dame Paper

They report good (high-accuracy) models of reaction yield for two HTS datasets, but a poor model for a dataset coming from the AstraZeneca ELN, even though it contained a reasonably large set (750) of reactions of only one type (Buchwald-Hartwig). These data are openly available via the Open Reaction Database (ORD).

With BIOVIA Pipeline Pilot it is relatively simple to join in the chase for good models. I've been using it to:

  • Download datasets from the ORD.
  • Convert them to standard Pipeline Pilot reaction records.
  • Calculate descriptors (ECFP_6) for all reactants, reagents, solvents, catalysts, products, etc., using the Molecular Fingerprints component.
  • Use these descriptors to calculate the "modelability" of the subset of AZ ELN data which report zero yield (154 reactions), using the Modelability Index (MODI) component.

The real benefits of Pipeline Pilot come from the last two points, as they leverage the automation resulting from a long history of cheminformatics and data science tool development.

But the first two points are also easily automated. Indeed, here is a prototype ORD Reader component, which downloads the (binary) dataset from the ORD and converts it, using the Python Jupyter Notebook component now in Pipeline Pilot (together with the ord-schema Python API):

And here are the resulting records:

From here the data science is easy, if not pleasing. We also found that the AstraZeneca ELN data would not give a good model. In our case, though, we didn't directly model the yield (the 154 cases we focus on were all reported as zero yield). Instead, we used the procedure descriptions to (manually) classify the reason for each reaction failure, using both the ontology in the AZ/Notre Dame paper and that of the old Synopsys Failed Reactions Database.

In both cases the MODI scores were very low (less than 0.3). Compare this with the rule of thumb that a dataset needs a MODI value of at least 0.65 to have a chance of producing a good model.
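For readers who haven't met MODI before, here is a minimal Python sketch of the idea as it is usually defined: for each class, take the fraction of items whose nearest neighbour in descriptor space belongs to the same class, then average those fractions over the classes. This is not the code inside the Pipeline Pilot MODI component (I haven't shown its internals here). The function names, the use of sets of "on" bits to stand in for ECFP-style fingerprints, the choice of Tanimoto distance, and the tie-breaking by lowest index are all my assumptions for illustration.

```python
from collections import defaultdict


def tanimoto_distance(a, b):
    """Tanimoto (Jaccard) distance between two fingerprints given as sets of on-bits."""
    union = len(a | b)
    if union == 0:
        return 0.0  # two empty fingerprints are treated as identical
    return 1.0 - len(a & b) / union


def modi(fingerprints, labels):
    """Mean, over classes, of the fraction of items whose nearest neighbour shares their class."""
    if len(fingerprints) != len(labels):
        raise ValueError("fingerprints and labels must have the same length")
    if len(fingerprints) < 2:
        raise ValueError("need at least two items to find a nearest neighbour")

    same_class = defaultdict(int)  # per class: items whose nearest neighbour matches
    total = defaultdict(int)       # per class: number of items

    for i, (fp_i, label_i) in enumerate(zip(fingerprints, labels)):
        # Nearest neighbour excluding the item itself; ties go to the lowest index
        # (a simplification assumed for this sketch).
        nearest = min(
            (j for j in range(len(fingerprints)) if j != i),
            key=lambda j: tanimoto_distance(fp_i, fingerprints[j]),
        )
        total[label_i] += 1
        if labels[nearest] == label_i:
            same_class[label_i] += 1

    return sum(same_class[c] / total[c] for c in total) / len(total)


# Toy data only: four made-up fingerprints forming two tight pairs.
toy_fps = [{1, 2, 3}, {1, 2, 4}, {7, 8, 9}, {7, 8, 10}]
well_separated = modi(toy_fps, ["A", "A", "B", "B"])  # neighbours agree with labels
scrambled = modi(toy_fps, ["A", "B", "A", "B"])       # neighbours disagree with labels
```

With the toy labels that follow the two pairs, every nearest neighbour shares its class; with the scrambled labels, none does. Real failure-reason labels on the ELN reactions sit much closer to the second situation, which is what a MODI below 0.3 is telling us.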

No one said that doing data science and machine learning on chemical reactions was going to be easy, but at least it is easy to wrangle the data and apply the algorithms. There are many to thank: AZ/Notre Dame for sharing the data, the ORD team, the Python team and, last but not least, the Pipeline Pilot team.