A while back, I saw a presentation by Ansgar Schuffenhauer on compound clustering methods and how they often produce assignments that look non-intuitive to medicinal chemists. He (quite rightly) argued that this is often because the descriptors used in clustering approaches do not reflect how chemists actually group compounds. For example, not all features are equal: chemists often view compounds as cores with attachments and group them accordingly. Hence, Ansgar proposed a method of assigning compounds to classes based on their cores. One major advantage of this method is that it scales linearly, and you do not need to rebuild the whole clustering every time you add new entries. Later on, I saw Nicolas Triballeau propose a similar method as part of Galapagos' Turtle solution for chemists.
These two presentations got me thinking about developing something equivalent in Pipeline Pilot. The attached protocol uses the "Generate Fragments" component in Pipeline Pilot to take an input set of molecules and extract the "cores" from them. These cores are then used to assign all the compounds to core classes. The protocol is pretty quick. Indeed, you don't need to run the first pipe every time: you could run it once, retain the list of cores, and then re-use them in pipe 2 to rapidly process very large lists of compounds.
The key part of the protocol is deciding what constitutes the core, or scaffold, in a molecule. This concept of a scaffold varies considerably:
- Is it the feature that I can create my IP around?
- Is it the feature that I think I can do the most varied chemistry around?
- Is it the most decorated, most substituted heterocyclic ring system?
I've opted for option three: the biggest ring system, with the most heteroatoms and the most R-group substituents. This is managed in pipe 1, in the component "Rank core assemblies by feature preference". You can change this as you wish to create different rankings of cores. The core assignment in pipe 2 is then decided by the first core in the list that is a substructure match for the test compound. Again, you could modify this to find "all cores", the core with the most ring substituents, etc.
You are welcome to try it out and judge for yourself how well it works. Remember that the perception of a core can vary on a compound-by-compound basis, and no two users are going to agree all of the time!
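The first-match rule used in pipe 2 can be sketched outside Pipeline Pilot in a few lines of Python. This is only an illustration, not the protocol itself: the function `assign_to_first_core`, the compound and core names, and the lookup table standing in for a real substructure search are all hypothetical.

```python
from typing import Callable, Dict, List, Sequence, Set


def assign_to_first_core(
    compounds: Sequence[str],
    ranked_cores: Sequence[str],
    contains_core: Callable[[str, str], bool],
) -> Dict[str, List[str]]:
    """Group each compound under the first ranked core that matches it."""
    groups: Dict[str, List[str]] = {}
    for compound in compounds:
        # ranked_cores is already in preference order, so the first hit wins.
        core = next((c for c in ranked_cores if contains_core(compound, c)), None)
        label = core if core is not None else "unassigned"
        groups.setdefault(label, []).append(compound)
    return groups


# ASSUMPTION: a hand-written table replaces a real substructure search.
# Each compound maps to the set of cores it is taken to contain.
TOY_MATCHES: Dict[str, Set[str]] = {
    "cmpd-1": {"quinazoline", "benzene"},
    "cmpd-2": {"benzene"},
    "cmpd-3": {"quinazoline", "benzene", "piperidine"},
    "cmpd-4": set(),
}


def toy_contains_core(compound: str, core: str) -> bool:
    return core in TOY_MATCHES[compound]


# Hypothetical output of pipe 1: cores in ranked order, best first.
ranked = ["quinazoline", "piperidine", "benzene"]
groups = assign_to_first_core(list(TOY_MATCHES), ranked, toy_contains_core)
print(groups)
```

Because the ranked list is walked in order, "cmpd-3" lands under quinazoline even though it also contains piperidine and benzene. Collecting every matching core instead of stopping at the first would give the "all cores" variant.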
A.
Acknowledgements
This protocol exemplifies concepts originally presented by Ansgar Schuffenhauer (Novartis) and later, independently, by Nicolas Triballeau (Galapagos NV).
Details from protocol Help:
This protocol performs two tasks:
- Read in a set of molecules and identify putative cores (Pipe 1)
- Re-read the set of molecules and relocate into "clusters", based on identified core (Pipe 2)
The core assemblies are identified in a two-step operation, with a fallback for molecules without rings:
- Generate fragments based on Murcko Assemblies
- For each Murcko assembly, generate fragments, based on ring assemblies
- If no rings are detected, then it creates a set of chain assemblies
Cores are categorised and then ranked in order of putative novelty. The clustering step involves molecules being relocated into a core group, based on the first match identified.
Factors that influence the categorisation and sort order of a core include:
- First, give preference to cores with chain assemblies
- Next, retain cores with numbers of N, O, S > zero
- Keep those cores with either aromatic, or double bonds
- I.e., non-aromatic rings containing p-orbitals are also kept
- Finally, give preference to those systems with more than one R-group present
Cores are then sorted in each category according to six properties:
- number of rings
- N, O, S count
- number of aromatic rings
- number of double and aromatic bonds
- number of chains
- number of bridge-head atoms
Each is sorted in descending order.
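As a rough illustration of this within-category sort, the sketch below ranks a few cores on a tuple of the six properties. It assumes the properties are applied as successive sort keys in the order listed, which the protocol help does not state explicitly. The `Core` class, its field names, and the hand-entered counts are hypothetical and are not taken from the protocol.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Core:
    name: str
    num_rings: int
    nos_count: int  # combined N, O, S atom count
    num_aromatic_rings: int
    num_double_and_aromatic_bonds: int
    num_chains: int
    num_bridgehead_atoms: int


def sort_key(core: Core) -> Tuple[int, ...]:
    # ASSUMPTION: earlier properties take precedence over later ones.
    return (
        core.num_rings,
        core.nos_count,
        core.num_aromatic_rings,
        core.num_double_and_aromatic_bonds,
        core.num_chains,
        core.num_bridgehead_atoms,
    )


def rank_cores(cores: List[Core]) -> List[Core]:
    # reverse=True sorts every key in descending order.
    return sorted(cores, key=sort_key, reverse=True)


# Illustrative, hand-entered counts for four cores in one category.
cores = [
    Core("benzene", 1, 0, 1, 6, 0, 0),
    Core("indole", 2, 1, 2, 10, 0, 0),
    Core("pyridine", 1, 1, 1, 6, 0, 0),
    Core("quinazoline", 2, 2, 2, 11, 0, 0),
]
ranked = rank_cores(cores)
print([c.name for c in ranked])
```

With these values the bicyclic cores come first, and ties on ring count are broken by the N, O, S count, so quinazoline ranks above indole and pyridine above benzene.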