Simpson's Paradox

JM 2015-02-04

BIOVIA Pipeline Pilot

Has anyone investigated how well the various Pipeline Pilot / R stats learners deal with data sets in which Simpson's Paradox plays a significant role?

For example, has anyone tried a fake data set composed of two separate subsets whose individual trends run in the opposite direction to the trend of the entire data set? This can be recast as a "global" model vs. "local" model problem (see attached image, Figure 1, from Levick, S.R. and Rogers, K.H., Landscape Ecol., 26 (2011) 515). Will a decision tree model properly pull these two sets (and models) apart? What about other machine learning algorithms?
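
To make the question concrete, here is a minimal sketch of the kind of fake data set I have in mind, written in plain Python rather than as a Pipeline Pilot protocol or R script. All names, group centers, slopes, and noise levels are my own assumptions chosen for illustration and are not taken from the Levick and Rogers figure. Each subset has a negative local trend, while the pooled data show a positive global trend.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y on x (simple linear regression)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def make_group(rng, x_center, y_center, slope, n=200, noise=0.2):
    """Hypothetical generator: points scattered around a local line."""
    xs, ys = [], []
    for _ in range(n):
        x = x_center + rng.uniform(-1.0, 1.0)
        y = y_center + slope * (x - x_center) + rng.gauss(0.0, noise)
        xs.append(x)
        ys.append(y)
    return xs, ys


rng = random.Random(0)  # fixed seed so the sketch is repeatable

# Assumed setup: both subsets trend downward locally (slope -1),
# but subset B sits higher in both x and y than subset A.
xa, ya = make_group(rng, x_center=2.0, y_center=2.0, slope=-1.0)
xb, yb = make_group(rng, x_center=6.0, y_center=6.0, slope=-1.0)

slope_a = ols_slope(xa, ya)              # local model, subset A
slope_b = ols_slope(xb, yb)              # local model, subset B
slope_all = ols_slope(xa + xb, ya + yb)  # global model, pooled data

print("subset A slope:", round(slope_a, 2))
print("subset B slope:", round(slope_b, 2))
print("pooled slope:  ", round(slope_all, 2))
```

In this toy case the two subsets do not overlap in x (A lies in [1, 3], B in [5, 7]), so a single split near x = 4 would separate them. The open question is whether a given learner finds that split and then models the opposite local trends, and what happens when the subsets are distinguished only by some other descriptor.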

I would greatly appreciate hearing from those of you who have wrestled with this issue. What analysis techniques did you use?

Thank you.

Regards,

Jim Metz