I just made a posting on the Accelrys Blog describing a few calculations I did on an Ames mutagenicity data set recently published and made available by Katja Hansen et al. Here, I provide some more details on how the calculations were done. The results may help you decide among different methods when you need to build a classification model.
The learner components I applied to the data were: Learn Good Molecules (Bayesian), Learn Cross-validated RP Tree Model (RP Tree), Learn RP Forest Model (RP Forest), Learn Molecular Property (kNN), and Learn R Support Vector Machine Model (R SVM). In all cases, I used the ECFP_4 fingerprint as the sole independent variable, largely because that's what Hansen et al. used for their Bayesian model. (Normally, I'd use a larger-radius fingerprint such as ECFP_6.)
For the RP Forest, I used 2000 trees, with the Number of Descriptors set to the fraction 0.25. This means that for each tree node, a random one-fourth of all the fingerprint features seen in the data are considered as potential splitting criteria. In a standard random forest model, the default is to use the square root of the number of descriptors as the number to consider, which for this particular data set with ECFP_4 corresponds to the fraction 0.025. However, because each "descriptor" here is a fingerprint feature bit whose only possible values are 0 or 1, the information per descriptor is lower than with a continuous property. Hence, it seemed appropriate to increase the fraction considered, and empirically, this gave a better model.
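To make the two fractions concrete, here is a small arithmetic sketch in Python. It assumes the quoted 0.025 is exactly the square-root rule (so the fraction equals 1/sqrt(n)) and has not been rounded much; the function names are my own and are not part of any Pipeline Pilot component.

```python
import math

def sqrt_rule_fraction(n_features: int) -> float:
    """Fraction of features tried per split under the sqrt(n) rule."""
    return math.sqrt(n_features) / n_features

def features_implied_by_fraction(fraction: float) -> float:
    """Invert the sqrt rule: fraction = 1/sqrt(n), so n = 1/fraction**2."""
    return 1.0 / fraction ** 2

# Assumption: 0.025 is taken at face value from the text above.
n = features_implied_by_fraction(0.025)   # roughly 1600 features
per_split_sqrt = math.sqrt(n)             # roughly 40 candidates per node
per_split_quarter = 0.25 * n              # roughly 400 candidates per node
```

Under that assumption, moving from the square-root default to the fraction 0.25 gives each node about ten times as many candidate features to choose from.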
For Learn Molecular Property, I used the "k-Nearest-Neighbor" model option, which uses Tanimoto similarity to predict a test compound's activity based on its proximity to compounds in the training data. Note that the purpose of Learn Molecular Property is to build regression models, yet we're dealing with a classification problem. But we can turn any regression model into a classifier simply by applying a cutoff. In this particular case, we set the property Activity to 1 for active compounds or 0 for inactive ones when building the model. The kNN prediction then represents the model's assessment of the probability that the test compound is active, and we can use this prediction to get a ROC score as for any other score-based classifier.
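The idea of scoring a test compound from its Tanimoto neighbors and then computing a ROC score can be sketched in plain Python. This is only an illustration under stated assumptions: fingerprints are modeled as sets of integer feature IDs, the prediction is an unweighted mean over the k nearest neighbors (the weighting the component actually uses is not described here), and the data are toy values.

```python
def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity between two sets of fingerprint feature IDs."""
    if not a and not b:
        return 0.0
    shared = len(a & b)
    return shared / (len(a) + len(b) - shared)

def knn_score(query, training, k=3):
    """Mean activity (0 or 1) of the k most similar training compounds."""
    ranked = sorted(training, key=lambda item: tanimoto(query, item[0]), reverse=True)
    nearest = ranked[:k]
    return sum(activity for _, activity in nearest) / len(nearest)

def roc_auc(scores, labels):
    """Chance that a random active outscores a random inactive (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy data: feature IDs are arbitrary integers, activities are 0/1.
training = [
    (frozenset({1, 2, 3}), 1),
    (frozenset({1, 2, 4}), 1),
    (frozenset({2, 3, 5}), 1),
    (frozenset({7, 8, 9}), 0),
    (frozenset({7, 8, 10}), 0),
    (frozenset({8, 9, 11}), 0),
]
test_set = [(frozenset({1, 2, 3, 4}), 1), (frozenset({7, 8, 9, 10}), 0)]
scores = [knn_score(fp, training, k=3) for fp, _ in test_set]
labels = [y for _, y in test_set]
auc = roc_auc(scores, labels)
```

Because the score is a number between 0 and 1 rather than a hard class label, no cutoff is needed to compute the ROC score; a cutoff only comes in when you want a yes/no call.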
With any of the learners in the R Statistics collection, it can be tricky to use sparse fingerprints (FP) such as ECFP_4. The reason is that R does not natively handle such fingerprints. To get the FP into a form that R can handle, we need to convert it such that there is one property for each FP feature in the data set. For data records where the feature is present, the property value is 1; for others, the value is 0. The problem is that for large data sets, the number of resulting properties can be so large that either R is overwhelmed or the time required to build the model in R becomes prohibitively long. In order to reduce the number of properties passed to R, we must either "fold" the fingerprint to a fixed size (such as 256 bits, corresponding to 256 binary properties in R) or perform feature selection on the FP. To do the latter, we use the Fingerprint to Properties component. This uses a Bayesian analysis to pre-process the data and keep only the N most important features (where N=200 or 400 in the runs I did).
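The two ways of shrinking a sparse fingerprint can be sketched as follows. This is a minimal Python illustration, not the Fingerprint to Properties component: folding is done by a simple modulo, and the feature selection ranks features by the difference in their frequency between actives and inactives, which is a deliberately simple stand-in for the Bayesian analysis mentioned above. All function names and data are hypothetical.

```python
from collections import Counter

def fold(features, n_bits=256):
    """Fold sparse feature IDs into a fixed-length 0/1 vector by modulo."""
    bits = [0] * n_bits
    for f in features:
        bits[f % n_bits] = 1   # Python's % is non-negative for negative IDs too
    return bits

def select_features(fingerprints, labels, n_keep):
    """Keep the n_keep features whose frequency differs most between classes.
    NOTE: a simple stand-in, not the Bayesian analysis used by the component."""
    n_act = sum(labels) or 1
    n_inact = (len(labels) - sum(labels)) or 1
    act, inact = Counter(), Counter()
    for fp, y in zip(fingerprints, labels):
        (act if y == 1 else inact).update(fp)

    def gap(f):
        return abs(act[f] / n_act - inact[f] / n_inact)

    all_features = sorted(set(act) | set(inact))
    return sorted(all_features, key=gap, reverse=True)[:n_keep]

def to_columns(features, kept):
    """One 0/1 property per kept feature, in the order given by kept."""
    return [1 if f in features else 0 for f in kept]

# Toy fingerprints (sets of feature IDs) with 0/1 activity labels.
fingerprints = [{10, 20, 300}, {10, 21, 300}, {11, 20, 555}, {11, 22, 555}]
labels = [1, 1, 0, 0]
kept = select_features(fingerprints, labels, n_keep=2)
rows = [to_columns(fp, kept) for fp in fingerprints]
folded = fold({1, 257}, n_bits=256)   # both IDs collide on bit 1
```

The last line shows the cost of folding: distinct features can land on the same bit, whereas feature selection keeps features distinct but discards most of them.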
The broader point to keep in mind here is that by comparison to native PP models, R models start at a disadvantage when using fingerprints as descriptors. So the relatively poor performance of the R SVM in the table below does not necessarily reflect a weakness in the SVM algorithm.
One other complication with the R SVM learner is that there are a couple of parameters -- Gamma and Cost -- that need to be tuned in order to get the best model. To help with this, the SVM component in PP has built-in cross-validation to automatically choose the best combination of Gamma and Cost from a list of values that you provide. Or if you're in a hurry, as I was, and already have an independent way to test the model quality, the SVM learner can use the all-data model rather than cross-validation to do the parameter selection. (For large data sets, the R svm() function is typically much slower than other learners, and I didn't want to wait the 2+ hours that cross-validation would have taken.)
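For readers who have not seen this kind of tuning, the cross-validated search over Gamma and Cost amounts to the loop sketched below in Python. It is not the PP component's implementation: `train_and_score` is a hypothetical caller-supplied function that would fit an SVM on the training indices and return a quality score on the held-out indices, and the grid values and dummy scorer are made up so that the sketch runs on its own.

```python
from itertools import product

def k_fold_indices(n_records, n_folds):
    """Split record indices 0..n_records-1 into n_folds interleaved test folds."""
    return [list(range(i, n_records, n_folds)) for i in range(n_folds)]

def grid_search(gammas, costs, n_records, n_folds, train_and_score):
    """Return (best_gamma, best_cost, best_mean_score) over the grid.

    train_and_score(train_idx, test_idx, gamma, cost) is a hypothetical
    function that fits a model on train_idx and scores it on test_idx.
    """
    folds = k_fold_indices(n_records, n_folds)
    best = None
    for gamma, cost in product(gammas, costs):
        fold_scores = []
        for test_idx in folds:
            held_out = set(test_idx)
            train_idx = [i for i in range(n_records) if i not in held_out]
            fold_scores.append(train_and_score(train_idx, test_idx, gamma, cost))
        mean_score = sum(fold_scores) / len(fold_scores)
        if best is None or mean_score > best[2]:
            best = (gamma, cost, mean_score)
    return best

# Dummy scorer so the sketch runs without an SVM library.
# It ignores the data and simply peaks at gamma=0.01, cost=10.
def dummy_scorer(train_idx, test_idx, gamma, cost):
    return 1.0 - abs(gamma - 0.01) - abs(cost - 10) / 100.0

best_gamma, best_cost, best_score = grid_search(
    gammas=[0.001, 0.01, 0.1], costs=[1, 10, 100],
    n_records=20, n_folds=5, train_and_score=dummy_scorer)
```

The loop fits one model per fold for every Gamma/Cost pair, which is why cross-validation multiplies the run time; the all-data shortcut fits just one model per pair.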
Here are the results. These are ROC scores averaged over the five training/test splits provided by Hansen et al. The standard error in each case is 0.01 or less.
Method       ROC Score
Bayesian     0.82
RP Tree      0.78
RP Forest    0.82
R SVM        0.72
kNN          0.84
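The averaging behind each number in the table can be sketched in a few lines of Python. I am assuming here that "standard error" means the sample standard deviation divided by the square root of the number of splits; the per-split values below are obviously artificial placeholders, not the scores from these runs.

```python
import math
import statistics

def mean_and_standard_error(values):
    """Mean and standard error of the mean (sample std dev / sqrt(n))."""
    mean = statistics.mean(values)
    sem = statistics.stdev(values) / math.sqrt(len(values))
    return mean, sem

# Placeholder per-split ROC scores, NOT the values from these runs.
placeholder_scores = [0.5, 0.6, 0.7, 0.8, 0.9]
mean, sem = mean_and_standard_error(placeholder_scores)
```

With only five splits, differences between methods that are comparable to the standard error should be read with some caution.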
These results largely speak for themselves, but I do want to point out one thing. Observe how good the kNN results are. If you read the paper, you'll see that Hansen et al. got significantly worse results with their kNN model. The main difference appears to be in the descriptors. They used DragonX descriptors with (apparently) a Euclidean distance, while I used ECFP_4 with a Tanimoto distance.
There's much more that could be done with this data set, but I need to get back to my day job, so I'll leave these as "exercises for the reader":
- Can we improve on these results with different descriptors -- either a different fingerprint or a combination of fingerprint and numeric properties?
- Can we improve the SVM results by including more FP features (at the price of more compute time) or by using folding instead of feature pre-selection?
- How do the other classification learners measure up, such as Learn R Logistic Regression Model, Learn R Linear Discriminant Analysis Model, Learn R Neural Net Model, and Learn R Mixture Discriminant Analysis Model?
- Given that we can use a regression model as a classifier, and given how well the kNN model did, what sort of results does a consensus model from Learn Molecular GFA Model give?
Finally, a question for readers: This posting is rather different from the typical one in which someone posts a problem and others respond to help solve it. This one is more like a mini-application note with a few hints and tips folded in. Do you find this type of posting useful or not?
