Applicability Domain Support in PP 8.0

In my latest blog post, I discuss the new model applicability domain (MAD) support in the Pipeline Pilot 8.0 release. Here I give more details on the MAD-related research that I presented at last week's North American User Group Meeting.

Regression Models

I didn't study every possible MAD or distance measure, but I did consider many of them. Of those I looked at, the one that usually gave the best correlation with the model error in regression models was the distance from the test sample to its 3rd-nearest neighbor in the training data. (The distance to the 5th-nearest neighbor performed almost as well.) My guess is that the distance to the closest neighbor doesn't do as well because that neighbor might be an outlier (singleton). And perhaps the reason that the distance to the overall centroid of the data doesn't correlate as well with the error is that this measure doesn't capture the details of the local distribution of points in the training data. However, in some cases when the descriptors are continuous, the Mahalanobis distance (MD) from a test sample to the centroid of the training data does correlate well with model performance. Because calculating the distance from a test sample to its nearest neighbors is an O(N) operation, the MD may be a better option if the training set is large. With PP 8.0, you can compute either measure.
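To make the two measures concrete, here is a minimal sketch of both (this is not Pipeline Pilot code; the function names and the NumPy/SciPy implementation are my own, and a Euclidean metric is assumed for the neighbor search):

```python
import numpy as np
from scipy.spatial.distance import cdist

def kth_nn_distance(X_train, X_test, k=3):
    """Distance from each test sample to its k-th nearest training sample."""
    d = cdist(X_test, X_train)   # pairwise Euclidean distances, O(N) per test sample
    d.sort(axis=1)               # sort distances to the training set, ascending
    return d[:, k - 1]           # k-th smallest (1-based k)

def mahalanobis_to_centroid(X_train, X_test):
    """Mahalanobis distance from each test sample to the training centroid."""
    mu = X_train.mean(axis=0)
    cov_inv = np.linalg.inv(np.cov(X_train, rowvar=False))  # requires invertible covariance
    diff = X_test - mu
    # Quadratic form sqrt((x - mu)^T S^-1 (x - mu)) for each row
    return np.sqrt(np.einsum('ij,jk,ik->i', diff, cov_inv, diff))
```

Note that the MD needs only the training centroid and inverse covariance at prediction time, which is why it scales better than the neighbor search for large training sets.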

As for correlating the distance with the error: if we simply compute the correlation coefficient between the prediction error (or its magnitude) for each test-set sample and a MAD measure, we find a very weak correlation. The reason is that the error is random. On average, we expect the error to be greater for samples far from the training data than for those near it, but in any specific case the error could be high, medium, or low. Error bars represent the spread of an error distribution, and to measure that spread, we need to do some averaging. This means we need to divide our distances into bins and compute average test-set errors over each bin.

To reduce quite a bit of research (including much trial-and-error) to a single sentence: dividing the data by distance quartiles appears to be a good way to do this averaging. For example, for distances from 0 up to the first quartile (the first 1/4 of the data), we compute the root-mean-square error (RMSE), median absolute error, or any other error measure. We repeat for each of the remaining fourths of the test set. We then use these averaged values as our error bars when making predictions, assigning the appropriate error value depending on the bin into which a sample falls. Using quartiles (or any regularly spaced percentiles) ensures that each bin contains the same number of samples, as opposed to the case of fixed bin widths, in which some bins typically have many samples and others very few. Thus, for the fixed-width case, the quality of the statistics varies widely from bin to bin.
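The quartile-averaging scheme can be sketched in a few lines (again, illustrative code with my own function names, not Pipeline Pilot internals; RMSE is used as the error measure here):

```python
import numpy as np

def quartile_error_bars(distances, errors):
    """RMSE of test-set errors within each distance quartile.

    Returns the four per-bin RMSE values and the three quartile edges
    needed to assign future samples to a bin."""
    edges = np.percentile(distances, [25, 50, 75])
    bins = np.digitize(distances, edges)  # 0..3, roughly equal-count bins
    bars = np.array([np.sqrt(np.mean(errors[bins == b] ** 2)) for b in range(4)])
    return bars, edges

def error_bar_for(distance, edges, bars):
    """Look up the averaged error for the bin a new sample falls into."""
    return bars[np.digitize(distance, edges)]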

Sheridan et al. successfully used fixed-width but overlapping bins to get better statistics, and I experimented with overlapping distance "windows" (based on equal numbers of samples rather than equal distances). But for either of these approaches to be general to all types of data, the bin or window size needs to be an adjustable parameter, which is something I wished to avoid. So in spite of some initially promising results, I ultimately abandoned this approach. (Note: If you have any interest at all in MAD, I strongly recommend that you read the Sheridan paper.)

Classification Models

Given the promising results for regression models with quartile-based errors using the 3rd-nearest neighbor distance, I applied the same binning scheme to Bayesian classification models, in most cases using ECFP_4 as the descriptor and Tanimoto as the distance measure. For these models, I calculated the per-bin ROC AUC score rather than the RMSE. As was seen for regression models, model prediction quality systematically degraded with increasing distance.
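For classification, the only change to the recipe is the per-bin statistic. A sketch (function names are mine; the rank-based AUC below uses the Mann-Whitney formulation and ignores tied scores for simplicity):

```python
import numpy as np

def roc_auc(y_true, scores):
    """Rank-based ROC AUC (Mann-Whitney U formulation; ties not handled)."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def auc_by_distance_quartile(distances, y_true, scores):
    """AUC of the classifier scores within each distance quartile."""
    bins = np.digitize(distances, np.percentile(distances, [25, 50, 75]))
    return [roc_auc(y_true[bins == b], scores[bins == b]) for b in range(4)]
```

A systematic decline of the per-quartile AUC values mirrors the increase in per-quartile RMSE seen for regression models.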

One hypothesis I investigated, with mixed results, is whether the model performance as a true classifier (rather than a ranker) could be improved by using a score cutoff value that varies with the distance quartile. For the highly imbalanced MAO and NCI AIDS data sets included with Pipeline Pilot, I saw no overall improvement. But for the more-balanced Ames mutagenicity data, the geometric mean of sensitivity and specificity did improve with bin-specific cutoffs. I need to do some more research to see whether the difference in results comes from balanced-vs-imbalanced data or from some other difference in the data sets.
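The bin-specific cutoff idea amounts to optimizing the threshold separately on each distance quartile. A sketch of the per-bin optimization, using the geometric mean of sensitivity and specificity as the objective (illustrative code, not the procedure's official implementation):

```python
import numpy as np

def best_cutoff_by_gmean(y_true, scores):
    """Score cutoff maximizing sqrt(sensitivity * specificity) on one bin."""
    best_c, best_g = None, -1.0
    for c in np.unique(scores):          # candidate cutoffs: observed scores
        pred = scores >= c
        sens = np.mean(pred[y_true == 1])   # fraction of actives recovered
        spec = np.mean(~pred[y_true == 0])  # fraction of inactives rejected
        g = np.sqrt(sens * spec)
        if g > best_g:
            best_c, best_g = c, g
    return best_c, best_g
```

Applying this per quartile, rather than once globally, is what gave the improvement on the Ames data but not on the imbalanced sets.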

Caveat

The conclusion that model prediction quality monotonically degrades with increasing 3rd-nearest neighbor distance is based on averaging over hundreds of randomly sampled training/test set divisions. But in a typical model-building activity, there is only one training set and one test set. Obtaining accurate error measures to use in future predictions requires that this test set be large enough to get good statistics. Otherwise, the "error bars of the error bars" can be too large for them to be useful. It may be possible to use some other approach to get better statistics, such as multiple repeats of k-fold cross-validation, but I haven't yet pursued this.
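For anyone who wants to try the repeated cross-validation idea, the index bookkeeping is straightforward; a minimal sketch (my own helper, not a Pipeline Pilot component):

```python
import numpy as np

def repeated_kfold_indices(n, k=5, repeats=10, seed=0):
    """Yield (train_idx, test_idx) pairs for several reshuffled k-fold splits.

    Each repeat reshuffles the n samples, so pooling the per-bin errors
    across all repeats averages over many training/test divisions."""
    rng = np.random.default_rng(seed)
    for _ in range(repeats):
        perm = rng.permutation(n)
        for fold in np.array_split(perm, k):
            yield np.setdiff1d(perm, fold), np.sort(fold)
```

Pooling the quartile-binned errors over all `k * repeats` splits should shrink the "error bars of the error bars", though as noted I haven't yet tested this.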

How to Use MAD Support in PP 8.0

I focused above on just one MAD measure: the distance to the closest training set samples. But every learner in Pipeline Pilot 8.0 supports additional MAD-related properties as well. To make the greatest number of these available when making predictions, select the following two values for the Learn Options parameter of the learner component when building a model:
  • Save Training Properties
  • Perform OPS Analysis

This makes available to you the following output properties when making predictions with the resulting model (supposing the model is named "mymodel"):
  • mymodel_Applicability: Contains warnings for any property or OPS components whose values are outside the range seen in the training data. For molecular data, also contains warnings for any structural feature in a compound that was not seen in any of the training compounds.
  • mymodel_Applicability#MD: Contains the Mahalanobis distance to the centroid of the training data (as long as the covariance matrix of the training data can be inverted). The related mymodel_Applicability#MDpvalue property gives one minus the probability contour on which the sample lies, based on the assumption that the training data properties are normally distributed. The larger the MD and the smaller the MDpvalue, the more likely the sample is outside the MAD. (The normality assumption fails completely for binary fingerprint data, so you should just ignore the MDpvalue if using a fingerprint as one of your descriptors.)
  • mymodel_ClosestDistance: Array of distances to the k closest samples in the training data (up to 10).
  • mymodel_ClosestSample: Array of names of the k closest training samples.
  • mymodel_ClosestValue: Array of actual Y values for the k closest training samples.
  • mymodel_ClosestPredictedValue: Array of predicted Y values for the k closest training samples.
  • mymodel_: Any additional training data properties for the k closest samples that you chose to save when building the model. (You do this by specifying a list of properties for the Additional Properties parameter of the learner.) For example, you may wish to save a SMILES string for each compound.
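On the MDpvalue property: my reading of the normality assumption is that the squared Mahalanobis distance to the centroid is (approximately) chi-square distributed with p degrees of freedom, where p is the number of descriptors. Under that interpretation, a value like MDpvalue can be sketched as follows (this is my own reconstruction for illustration; Pipeline Pilot's exact formula may differ):

```python
from scipy.stats import chi2

def md_pvalue(md, n_features):
    """One minus the probability contour on which a sample lies.

    Assumes multivariate-normal training data, so MD^2 ~ chi-square with
    n_features degrees of freedom. Small values flag samples that lie
    outside the bulk of the training distribution."""
    return chi2.sf(md ** 2, df=n_features)  # survival function = 1 - CDF
```

As the post notes, this normality assumption fails completely for binary fingerprint descriptors, which is why the property should be ignored in that case.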

This post is already plenty long, so I will stop here. But I will be happy to respond to questions or elaborate on any of the above.