Iterative Optimal Design for Free Wilson Analysis

DH 2011-08-02

Free-Wilson analysis (FWA) is a simple approach to QSAR that models activity as a function of substituents at specific sites of a core structure. It traditionally assumes additivity of effects, but is easily generalized to account for pairwise interactions as well (at the cost of more experiments). There are a couple of example protocols in Pipeline Pilot that illustrate the approach. These examples assume that you already have sufficient activity data to do the computations.

But suppose you either (1) have a core structure and a set of substituents in mind, but don't yet have any activity data, or (2) have some data, but not enough to do the FWA calculations. How do you decide which compounds to synthesize or measure in order to complete the FWA? The problem with just choosing a random set or the most easily synthesized set is that you may end up with a singular (rank-deficient) system, in which some or all of the substituent coefficients cannot be computed.
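To make the singularity problem concrete, here is a minimal sketch in Python with NumPy (not part of the attached protocol). The substituents and the helper name design_matrix are made up for illustration. It dummy-codes two sets of five compounds for an additive model with five coefficients and compares the rank of the resulting design matrices.

```python
import numpy as np

def design_matrix(compounds, levels):
    """Dummy-code compounds; the first level at each site is the baseline."""
    rows = []
    for compound in compounds:
        row = [1.0]  # intercept column = activity of the baseline compound
        for site, choice in enumerate(compound):
            # one 0/1 column per non-baseline substituent at this site
            row.extend(1.0 if choice == lvl else 0.0 for lvl in levels[site][1:])
        rows.append(row)
    return np.array(rows)

# Hypothetical substituents at two sites; "H" is the baseline at each site.
levels = [["H", "Cl", "Me"], ["H", "F", "OMe"]]
n_coefficients = 1 + sum(len(site) - 1 for site in levels)  # 5

# Cl at site 1 only ever appears together with F at site 2,
# so the Cl and F columns are identical and their effects are confounded.
bad = [("H", "H"), ("Cl", "F"), ("Me", "H"), ("Me", "OMe"), ("H", "OMe")]

# Changing one site at a time keeps every column independent.
good = [("H", "H"), ("Cl", "H"), ("Me", "H"), ("H", "F"), ("H", "OMe")]

bad_rank = np.linalg.matrix_rank(design_matrix(bad, levels))
good_rank = np.linalg.matrix_rank(design_matrix(good, levels))
print(n_coefficients, bad_rank, good_rank)
```

Both sets contain five distinct compounds, yet only the second has a rank equal to the number of coefficients. With the first set, the Cl and F effects cannot be separated.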

This question (inspired by a support request) led me to create the attached protocol, which uses an enhanced version of the prototype Design Optimal Experiment component in Pipeline Pilot 8.5. (The component in PP 8.5 handles only continuous variables. To address FWA, I needed to generalize it to handle categorical variables as well.)

The protocol is just a proof-of-concept template. In a production environment, you'd need to extend it and break it up to cover different parts of the workflow. Here's how it works:

Using SMILES strings in a delimited text format, we start by specifying the core structure, substitution sites, and possible substituents, with a code for each substituent. (Note that we need to include H as a "substituent," to which we assign the code "AAAH" so that it is always first in alphabetical order. This ensures that the R statistics software takes the unsubstituted site as the baseline case when calculating substituent effects.)
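The effect of the "AAAH" code can be illustrated with a few lines of Python. This is a sketch only, with made-up substituent codes. It relies on the fact that R sorts factor levels alphabetically by default and that its default treatment contrasts use the first level as the reference. Python sorts by code point while R's ordering depends on locale, but for these simple codes the two should agree.

```python
# Hypothetical substituent codes for one site.
substituents = ["Cl", "H", "Me", "Br"]

# With plain "H", alphabetical order would make Br the reference level.
naive_baseline = sorted(substituents)[0]

# Recoding H as "AAAH" pushes it to the front, so the unsubstituted
# site becomes the reference against which the other effects are measured.
coded = ["AAAH" if s == "H" else s for s in substituents]
baseline = sorted(coded)[0]

print(naive_baseline, baseline)
```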

Next, we define variables R1, R2, and R3 in the Design Optimal Experiment component. For each of these, we list the substituents we wish to allow at that site. We set the Model Type to "Linear," corresponding to the additivity assumption. (We could set it to "Linear with Interactions" if we wished to include pairwise interactions as well.) Since we wish to do the minimal number of experiments, we set Number of Runs to 1. This tells the component to generate the fewest possible experiments that support the model, while including a few additional compounds to allow us to test lack-of-fit -- i.e., to allow us to test the additivity assumption.
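For readers who want a feel for what an optimal design search does, here is a rough Python sketch of a simple exchange search for a D-optimal design over categorical variables. It is not the algorithm used by the component or by AlgDesign. The function names, the substituent lists, the n_extra parameter (standing in for the "few additional compounds"), and the small ridge term are all assumptions made for illustration.

```python
import itertools
import numpy as np

def encode(compound, levels):
    """Dummy-code one compound; the first level at each site is the baseline."""
    row = [1.0]
    for site, choice in enumerate(compound):
        row.extend(1.0 if choice == lvl else 0.0 for lvl in levels[site][1:])
    return row

def log_det(rows, ridge=1e-6):
    """log det(X'X + ridge*I); the ridge keeps singular designs comparable."""
    x = np.array(rows)
    return np.linalg.slogdet(x.T @ x + ridge * np.eye(x.shape[1]))[1]

def d_optimal(levels, n_extra=2, seed=0, max_passes=20):
    """Swap design points for candidates while the determinant improves."""
    candidates = list(itertools.product(*levels))  # full enumeration
    encoded = [encode(c, levels) for c in candidates]
    n_params = 1 + sum(len(site) - 1 for site in levels)
    n_runs = n_params + n_extra  # minimal runs plus a few for lack-of-fit
    rng = np.random.default_rng(seed)
    chosen = [int(i) for i in rng.choice(len(candidates), n_runs, replace=False)]
    best = log_det([encoded[i] for i in chosen])
    for _ in range(max_passes):
        improved = False
        for pos in range(n_runs):
            for cand in range(len(candidates)):
                if cand in chosen:
                    continue  # keep the design free of duplicates
                trial = chosen.copy()
                trial[pos] = cand
                score = log_det([encoded[i] for i in trial])
                if score > best + 1e-9:
                    chosen, best, improved = trial, score, True
        if not improved:
            break  # no single swap helps any more
    return [candidates[i] for i in chosen], n_params

# Hypothetical substituent lists for R1, R2, and R3.
levels = [["AAAH", "Cl", "Me"], ["AAAH", "F", "OMe"], ["AAAH", "Br"]]
design, n_params = d_optimal(levels)
design_rank = np.linalg.matrix_rank(np.array([encode(c, levels) for c in design]))
print(len(design), n_params, design_rank)
```

For an additive model, the number of coefficients is one (the baseline compound) plus the number of non-H substituents summed over all sites, so this is also the minimum number of compounds needed. A search like this only finds a local optimum, so real tools typically repeat it from several random starts.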

The next step would be to synthesize or acquire the compounds specified by the experimental design and then assay them. We'd then fit a model to the activity values, which yields the Free-Wilson coefficients. For the purpose of this example, I just included random activity values so that you can see how this would be done.
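Outside of Pipeline Pilot, the model-building step amounts to an ordinary least-squares fit on dummy-coded substituents. The Python sketch below shows the idea under stated assumptions: the compounds, the "true" effects, and the noise level are all invented, much like the random activities in the protocol, and the fit uses NumPy rather than R.

```python
import numpy as np

# Hypothetical sites and substituent codes; "AAAH" (H) is the baseline.
levels = {"R1": ["AAAH", "Cl", "Me"], "R2": ["AAAH", "F", "OMe"]}

def encode(compound):
    """Intercept plus one 0/1 column per non-baseline substituent."""
    row = [1.0]
    for site, choice in zip(levels, compound):
        row.extend(1.0 if choice == lvl else 0.0 for lvl in levels[site][1:])
    return row

names = ["baseline"] + [f"{site}={lvl}" for site in levels for lvl in levels[site][1:]]

# Compounds from a (made-up) experimental design.
designed = [("AAAH", "AAAH"), ("Cl", "AAAH"), ("Me", "F"), ("AAAH", "OMe"),
            ("Cl", "OMe"), ("Me", "AAAH"), ("AAAH", "F")]
X = np.array([encode(c) for c in designed])

# Simulated assay results: invented additive effects plus a little noise.
true_effects = np.array([5.0, 0.8, 0.5, 0.6, 0.4])
rng = np.random.default_rng(0)
activity = X @ true_effects + rng.normal(0.0, 0.05, len(designed))

# Least-squares fit; the solution vector holds the Free-Wilson coefficients.
coef, _, rank, _ = np.linalg.lstsq(X, activity, rcond=None)
free_wilson = dict(zip(names, coef))
for name, value in free_wilson.items():
    print(f"{name:10s} {value:6.2f}")
```

Each coefficient other than the baseline is the estimated change in activity from replacing H with that substituent at that site. The returned rank is worth checking: if it is smaller than the number of coefficients, the design was singular.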

Once we have a model and the FWA coefficients, we have several options, depending on the model quality and on any insight we gain as chemists from the relative values of the coefficients. One simple thing we can do, as shown in the example, is to use the model to predict the most active compounds. For an additive model with no interactions, the predicted most active compound is of course the one containing the substituent with the greatest coefficient at each site. But if we include interactions in the model, the most active compound may not be so obvious.
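A small Python sketch shows why interactions change the picture. All coefficient values here are invented for illustration. Without interactions, picking the best substituent per site is enough. With a pairwise term, we have to score every combination.

```python
import itertools

intercept = 5.0  # hypothetical activity of the unsubstituted core
main = {  # hypothetical main effects relative to H
    "R1": {"H": 0.0, "Cl": 0.8, "Me": 0.5},
    "R2": {"H": 0.0, "F": 0.6, "OMe": 0.4},
}
# Hypothetical pairwise term: Cl at R1 and F at R2 together lose activity.
interactions = {(("R1", "Cl"), ("R2", "F")): -1.0}

def predict(compound, use_interactions=False):
    """Predicted activity for a tuple of substituents, one per site."""
    labelled = list(zip(main, compound))  # e.g. [("R1", "Cl"), ("R2", "F")]
    value = intercept + sum(main[site][sub] for site, sub in labelled)
    if use_interactions:
        value += sum(interactions.get(pair, 0.0)
                     for pair in itertools.combinations(labelled, 2))
    return value

# Additive model: take the largest coefficient at each site independently.
additive_best = tuple(max(effects, key=effects.get) for effects in main.values())

# With interactions: enumerate every combination and rank by prediction.
all_compounds = list(itertools.product(*(effects.keys() for effects in main.values())))
interaction_best = max(all_compounds, key=lambda c: predict(c, use_interactions=True))

print(additive_best, interaction_best)
```

Here the additive model picks Cl and F, while the model with the interaction term picks Cl and OMe, because the Cl/F penalty outweighs the slightly larger main effect of F.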

If you're interested in Free-Wilson analysis, please take a look at the protocol. I'd like to include something like this as an example protocol in a future Pipeline Pilot release, so let me know what I could do to improve it.

Thanks,
Dana

P.S. To run the attached protocol, you need the R Statistics component collection, and the R package named AlgDesign must be installed. It should run in PP 8.0 and above.