I just put up a blog posting showing Pareto optimization applied to the problem of assigning subjects to groups in a laboratory trial. Here, I provide more details on what I did, as well as some thoughts on the applicability of the approach and how it might be generalized.
In optimization problems, one of the first questions we need to answer is: What is the entity being optimized? In this case, the entity is a partitioning of 50 subjects into 5 groups of 10. Such a partitioning can be represented simply as a 50-member GroupID array. The "population" for the purpose of a Pareto optimization is a set of alternative partitionings.
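This representation can be sketched in a few lines of Python (names here are my own, not taken from the protocol):

```python
import random

N_SUBJECTS, N_GROUPS, GROUP_SIZE = 50, 5, 10

def random_partitioning():
    """One candidate solution: a 50-member GroupID array in which each
    group ID appears exactly GROUP_SIZE times."""
    ids = [g for g in range(N_GROUPS) for _ in range(GROUP_SIZE)]
    random.shuffle(ids)
    return ids

# The "population" for the Pareto optimization is simply a set of
# alternative partitionings.
population = [random_partitioning() for _ in range(20)]
```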
In a nutshell, here's how the attached "Pareto Partitioning" protocol does the optimization: It starts by generating an initial set of completely random partitionings (corresponding to multiple applications of randomization). It then drives the system to partitionings that give the best tradeoff between reduced variance-of-mean and reduced variance-of-variance. At each iteration, the partitionings are mutated by randomly selecting two subjects from two different groups and swapping them. Then, only the Pareto-optimal (non-dominated) partitionings are retained, and the process is repeated.
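The core steps of that loop can be sketched as follows. This is a minimal illustration in Python, not the protocol's actual implementation; the objective functions are the variance across groups of the group means (variance-of-mean) and of the group variances (variance-of-variance) of body mass, and all names are my own:

```python
import random
from statistics import mean, pvariance

N_SUBJECTS, N_GROUPS, GROUP_SIZE = 50, 5, 10

def random_partitioning():
    ids = [g for g in range(N_GROUPS) for _ in range(GROUP_SIZE)]
    random.shuffle(ids)
    return ids

def objectives(part, mass):
    """(variance-of-mean, variance-of-variance) of body mass across groups."""
    groups = {}
    for gid, m in zip(part, mass):
        groups.setdefault(gid, []).append(m)
    means = [mean(v) for v in groups.values()]
    variances = [pvariance(v) for v in groups.values()]
    return pvariance(means), pvariance(variances)

def mutate(part):
    """Swap two randomly chosen subjects belonging to different groups."""
    p = part[:]
    i, j = random.sample(range(len(p)), 2)
    while p[i] == p[j]:
        i, j = random.sample(range(len(p)), 2)
    p[i], p[j] = p[j], p[i]
    return p

def pareto_front(pop, mass):
    """Retain only non-dominated partitionings: those for which no other
    partitioning is at least as good on both objectives and different."""
    scored = [(objectives(p, mass), p) for p in pop]
    return [p for s, p in scored
            if not any(o[0] <= s[0] and o[1] <= s[1] and o != s
                       for o, _ in scored)]
```

One iteration is then: mutate copies of the surviving partitionings, pool them with the survivors, and call `pareto_front` on the pooled set.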
For comparison purposes, pipeline 2 of the protocol assigns subjects to bins based on body mass and uses the minimization approach to achieve balance among the bins across groups.
Generalizing to More Properties
The Pareto Partitioning protocol considers just one property -- body mass -- in assigning subjects to groups. How might this be generalized to more properties? In principle, one could do a Pareto optimization with the variance-of-mean (VOM) and variance-of-variance (VOV) of each variable as properties to be optimized. In practice, this is not feasible, because the Pareto approach is impractical for more than about 4 optimization properties.
I think that principal component analysis (PCA) may yield a workable approach. This requires first computing the principal component values for the entire set of subjects, and then performing an optimization to minimize the following two sums:
Σ_PCs Σ_groups ( <PC_i>_j − <PC_i>_overall )²

Σ_PCs Σ_groups ( s²_ij − s²_i,overall )²

where PC_i is the ith principal component value; <>_j denotes an average over group j; <>_overall denotes an average over all subjects; s²_ij is the computed variance of PC_i over group j; and s²_i,overall is the variance of PC_i over all subjects. For centered principal components, <PC_i>_overall = 0, so the first sum reduces to Σ_PCs Σ_groups <PC_i>_j².
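These two sums might be computed as below. This Python/NumPy sketch is only a stand-in for the protocol's implementation (which uses the R Statistics Collection); all function names are mine:

```python
import numpy as np

def pca_scores(X):
    """Principal-component values for mean-centered, scaled data,
    via singular value decomposition."""
    Xc = (X - X.mean(axis=0)) / X.std(axis=0)
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return U * S  # PC scores: one row per subject, one column per PC

def balance_objectives(pc, part, n_groups):
    """The two sums to minimize: squared deviations of each PC's group
    means and group variances from their overall values."""
    part = np.asarray(part)
    overall_mean = pc.mean(axis=0)   # ~0 for centered PCs
    overall_var = pc.var(axis=0)
    sum_mean = sum_var = 0.0
    for j in range(n_groups):
        grp = pc[part == j]
        sum_mean += ((grp.mean(axis=0) - overall_mean) ** 2).sum()
        sum_var += ((grp.var(axis=0) - overall_var) ** 2).sum()
    return sum_mean, sum_var
```

These two values then play the same role in the Pareto optimization as the single-property VOM and VOV did before.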
The attached "Pareto Partitioning Multi" protocol uses this PCA-based method, which appears to give better balance than simple constrained randomization. (I didn't implement minimization in this case.) Binary properties such as Gender and Smoker are represented as 0 or 1 and included in the PCA (which mean-centers and scales the data first).
Suitability for Clinical Trials
By the nature of the approach, Pareto optimization needs to consider the experimental subjects as a set rather than individually. By contrast, it is my understanding that the typical clinical trial enrolls subjects sequentially, assigning each subject to a group as soon as the subject is enrolled. This is why I wrote in the blog that the Pareto approach may be more relevant to laboratory than to clinical studies.
(Note: The attached protocols require Pipeline Pilot 8.0 with the Advanced Data Modeling collection. "Pareto Partitioning Multi" requires the R Statistics Collection as well.)