Has anyone investigated the various Pipeline Pilot / R stats learners with respect to their ability to deal with
data sets for which Simpson's Paradox plays a significant role?
For example, has anyone tried a synthetic data set composed of two separate subsets whose
individual trends run opposite to the trend of the combined data set? This can be recast as a "global" model
vs "local" model problem (see attached image, Figure 1, from Levick, S.R. and Rogers, K.H.,
Landscape Ecol., 26 (2011) 515). Will a decision tree model properly pull these two sets (and
models) apart? What about other machine learning algorithms?
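To make the question concrete, here is a minimal sketch of the kind of fake data set I have in mind, written in plain Python rather than Pipeline Pilot or R so that it needs no extra packages. Everything in it is an assumption chosen for illustration: the two subsets, their slopes and intercepts, the noise level, and the helper names ols_slope, sse and best_split are hypothetical. The single split on x is only a stand-in for what a depth-1 regression tree would do, not any particular learner's implementation.

```python
import random


def ols_slope(xs, ys):
    """Least-squares slope of y regressed on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx


def sse(ys):
    """Sum of squared deviations of ys from their mean."""
    if not ys:
        return 0.0
    mean_y = sum(ys) / len(ys)
    return sum((y - mean_y) ** 2 for y in ys)


def best_split(xs, ys):
    """Threshold on x that minimises the summed SSE of y in the two halves,
    i.e. the choice a depth-1 regression tree would make."""
    pairs = sorted(zip(xs, ys))
    best_threshold, best_cost = None, float("inf")
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue  # cannot split between identical x values
        left = [y for _, y in pairs[:i]]
        right = [y for _, y in pairs[i:]]
        cost = sse(left) + sse(right)
        if cost < best_cost:
            best_threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
            best_cost = cost
    return best_threshold


# Hypothetical fake data: two subsets, each with a downward local trend,
# offset so that the pooled ("global") trend points upward.
rng = random.Random(0)
xa = [0.5 * i for i in range(9)]        # subset A: x from 0 to 4
ya = [10 - x + rng.gauss(0, 0.3) for x in xa]
xb = [6 + 0.5 * i for i in range(9)]    # subset B: x from 6 to 10
yb = [20 - x + rng.gauss(0, 0.3) for x in xb]

xs, ys = xa + xb, ya + yb

pooled_slope = ols_slope(xs, ys)        # global model
slope_a = ols_slope(xa, ya)             # local model, subset A
slope_b = ols_slope(xb, yb)             # local model, subset B
threshold = best_split(xs, ys)          # where a single tree split lands

# Refit a line in each leaf of the split (a crude "model tree").
left = [(x, y) for x, y in zip(xs, ys) if x <= threshold]
right = [(x, y) for x, y in zip(xs, ys) if x > threshold]
left_slope = ols_slope(*zip(*left))
right_slope = ols_slope(*zip(*right))

print("pooled slope:", pooled_slope)
print("subset slopes:", slope_a, slope_b)
print("split threshold:", threshold)
print("leaf slopes:", left_slope, right_slope)
```

In this sketch the pooled slope should come out positive while both subset slopes are negative, and the split should fall in the gap between the two subsets. That is a deliberately favorable case, because the subsets do not overlap in x. If they did overlap, and the variable that distinguishes them were not among the descriptors, I would not expect any split on x to recover them, which is really the heart of my question.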
I would greatly appreciate hearing from those of you who have wrestled with this issue. In
particular, what analysis techniques did you use?
Thank you.
Regards,
Jim Metz