Iterative screening in ChEMBL

HP 2015-05-08

Suppose you don't have the resources to run a full HTS of the compounds in your corporate screening collection: for instance, one of the reagents needed in the assay is in limited supply, or you don't want to wait 6 months for an available slot in the HTS group.

One solution is to screen a subset of your file, build a model from the results, use this model to select a few more compounds to screen, and so on. What is the optimal method for this iterative process? Which modeling method works best? Should you use one modeling method or the consensus of several? Apply many small iterations or a few larger ones? Aim for maximal retrieval of actives in each iteration, or include in the selection some compounds that improve the model (an active learning strategy)? How many of the actives will you find, and how many will you miss? How do you ensure enough diversity in the actives that are found?

The attached protocol can help to investigate the above questions. In the example, ChEMBL plays the role of both your corporate screening collection and your assay. A random subset is chosen and an 'assay' is performed by looking up the activity for the desired target (thrombin) in ChEMBL (unknown equates to inactive). A model is built and the rest of ChEMBL is searched for more actives, and so on. The first 2 parts of the protocol create a flat file dump of ChEMBL and annotate the thrombin actives, respectively. You can skip these 2 parts and use the attached file ChEMBL_thrombin.tsv.gz in part 3. In pipeline 9 you can change the settings, which currently start with a random subset of 50 compounds followed by 10 iterations of 500 compounds picked by Tanimoto nearest neighbour search. Bayesian learning is also implemented. You can add your own methods in the subprotocol in pipeline 13.
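For readers who want to follow the logic outside the protocol, the loop described above can be sketched in plain Python. This is only an illustration under stated assumptions and not the attached protocol: fingerprints are assumed to be stored as Python sets of on-bit indices, the 'assay' is a list of booleans, and the function names (tanimoto, pick_nearest_neighbours, iterative_screen) are made up for this sketch.

```python
import random


def tanimoto(a, b):
    """Tanimoto similarity between two fingerprints stored as sets of on-bits."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)


def pick_nearest_neighbours(fps, known_actives, candidates, n):
    """Return the n candidates with the highest similarity to any known active."""
    def score(idx):
        return max(tanimoto(fps[idx], fps[a]) for a in known_actives)
    return sorted(candidates, key=score, reverse=True)[:n]


def iterative_screen(fps, is_active, n_start=50, n_iter=10, n_pick=500, seed=0):
    """Return the cumulative number of actives found after each round.

    fps       -- list of fingerprints (sets of ints), one per compound (assumed format)
    is_active -- list of bools standing in for the assay lookup
    Round 0 is the random starting subset; rounds 1..n_iter are model-driven picks.
    """
    rng = random.Random(seed)
    unscreened = set(range(len(fps)))
    batch = rng.sample(sorted(unscreened), min(n_start, len(unscreened)))
    found, history = [], []
    for round_no in range(n_iter + 1):
        # 'Screen' the current batch: look up the activity of each compound.
        unscreened.difference_update(batch)
        found.extend(i for i in batch if is_active[i])
        history.append(len(found))
        if round_no == n_iter or not unscreened:
            break
        candidates = sorted(unscreened)
        if found:
            batch = pick_nearest_neighbours(fps, found, candidates, n_pick)
        else:
            # No actives yet, so there is nothing to search around: pick at random.
            batch = rng.sample(candidates, min(n_pick, len(candidates)))
    return history
```

Calling iterative_screen(fps, is_active) with the default arguments mirrors the settings mentioned above (50 random compounds, then 10 iterations of 500). Swapping pick_nearest_neighbours for another ranking function is the place where a different selection method would plug in.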

Below is the enrichment plot for the 10 iterations of 500 by Bayesian and nearest neighbours. Just over half of the actives are found by screening 5500 compounds. Bayesian is slightly better at retrieving actives than nearest neighbours (but no analysis was done of how many series are found by each method). In each iteration the nearest neighbour search starts off close to a perfect search but then runs flat (suggesting that smaller iterations would work better for this method).
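If you want to compare your own method numerically, the points of an enrichment curve can be computed from the cumulative counts per iteration. The sketch below is an assumption about how one might do this in Python, not what the protocol does internally; the function names and any input numbers are hypothetical.

```python
def enrichment_curve(n_screened, n_found, n_actives_total):
    """Pair each cumulative screened count with the fraction of all actives found."""
    if n_actives_total <= 0:
        raise ValueError("n_actives_total must be positive")
    return [(s, f / n_actives_total) for s, f in zip(n_screened, n_found)]


def enrichment_factor(screened, found, n_actives_total, library_size):
    """Actives found relative to the number expected from random picking.

    Random picking of `screened` compounds is expected to find
    n_actives_total * screened / library_size actives.
    """
    return found * library_size / (screened * n_actives_total)
```

An enrichment factor of 1 means the selection did no better than random picking; a perfect search reaches library_size / screened for as long as there are still actives left to find.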

Experiment with adding your own method(s) and beat the number of actives found in the enrichment curves below. Please report back here!

Willem