Herding Cats

KS 2015-05-08

Greetings all. I wanted to share my plans for developing a richer web application using Pipeline Pilot that will allow users not only to view the results of several end-points, but also to actively add and remove samples and end-points. I call this effort 'Herding Cats' because it allows a collection of samples to be herded through a protocol, taking into account that not all samples behave well.

Part I - The problem

In the parlance of Pipeline Pilot, a 'sample' is equivalent to a 'record', and an 'end-point' is equivalent to one or more calculated properties. An ideal pipeline takes a set of samples, runs them through a series of calculations, and publishes the result in a report.

Sometimes calculations must be done using software that is not 100% bulletproof (is any software 100% bulletproof?). By default, when a record causes a failure in a protocol, the protocol stops and reports the error. All progress is lost, and after the problem is addressed, the entire input set must be resubmitted. It is possible instead to trap errors and set aside the problem records, so that results for the well-behaved records can still be collected. The problem records are then collected, dealt with individually, and resubmitted. If they work, their results are integrated with those of the previous samples. This is shown in the leaky-pipe diagram below.
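As a rough illustration of the error-trapping idea, here is a minimal Python sketch (not Pipeline Pilot code). The names `run_with_error_trap` and `fragile_calc` are hypothetical stand-ins for a protocol and an unreliable calculation; the point is only that one bad record is set aside instead of stopping the whole run.

```python
def run_with_error_trap(records, calculate):
    """Run `calculate` on every record, separating successes from failures."""
    results = {}    # record -> calculated value
    failures = {}   # record -> error message, to be dealt with individually
    for record in records:
        try:
            results[record] = calculate(record)
        except Exception as exc:
            # Trap the error so the remaining records are still processed.
            failures[record] = str(exc)
    return results, failures


def fragile_calc(sample):
    """Hypothetical calculation that fails on badly behaved samples."""
    if sample.startswith("bad"):
        raise ValueError(f"cannot process {sample}")
    return len(sample)


results, failures = run_with_error_trap(["s1", "bad2", "s3"], fragile_calc)
# `results` holds the good records; `failures` lists what to fix and resubmit.
```

Once the failed records are repaired, they can be passed through the same function again and the new results merged into `results`.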

You'll also notice in the diagram above another source of problems. When new samples appear, we would like to integrate them into the previously calculated results. Normally this results in a lot of book-keeping effort. Why should humans do book-keeping when computers are so good at it?

Part II - The proposal

I am planning to use Pipeline Pilot to build a web application that manages each of the samples individually. Progress will be tracked on a grid, where one dimension is the list of samples to be run and the other dimension is the series of calculations requested, as shown below:

Notice that both axes should be editable without affecting the other. For example, new samples can be added at the end without losing the previously calculated values. Likewise, additional outputs can be chosen at a later time, and will be appended to the previously calculated values.
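One way to picture the editable grid is to key every result by its (sample, end-point) pair and compute only the cells that are missing. The sketch below is plain Python under that assumption; `pending_cells` and the sample and end-point names are made up for illustration.

```python
def pending_cells(samples, end_points, results):
    """Return the (sample, end_point) cells that have no stored result yet."""
    return [
        (sample, end_point)
        for sample in samples
        for end_point in end_points
        if (sample, end_point) not in results
    ]


samples = ["A", "B"]
end_points = ["logP"]
results = {}  # (sample, end_point) -> value

# First pass: every cell is pending, so compute them all (dummy values here).
for cell in pending_cells(samples, end_points, results):
    results[cell] = "done"

# Adding a sample only creates work for the new row.
samples.append("C")
new_row = pending_cells(samples, end_points, results)

# Adding an end-point only creates work for the new column.
for cell in new_row:
    results[cell] = "done"
end_points.append("MW")
new_column = pending_cells(samples, end_points, results)
```

Here `new_row` contains only the cell for sample C, and `new_column` contains only the MW cells; nothing already calculated is touched.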

Obviously the framework for this type of web application is not batch-oriented, but rather database-driven. The result of each sample-property pair will be stored in a database, as will the status of the calculation (Queued, Running, OK, or Error). The database is then queried for an up-to-date view of the status of the whole project.
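To make the database idea concrete, here is a small sketch using Python's built-in `sqlite3` module. The table layout, column names, and helper functions are assumptions for illustration only; the actual schema and database have not been decided.

```python
import sqlite3


def open_store(path=":memory:"):
    """Create (if needed) a table with one row per sample/end-point pair."""
    conn = sqlite3.connect(path)
    conn.execute(
        """CREATE TABLE IF NOT EXISTS results (
               sample    TEXT NOT NULL,
               end_point TEXT NOT NULL,
               status    TEXT NOT NULL
                         CHECK (status IN ('Queued', 'Running', 'OK', 'Error')),
               value     TEXT,
               PRIMARY KEY (sample, end_point)
           )"""
    )
    return conn


def queue(conn, sample, end_point):
    """Queue a cell; an existing row for the same pair is left untouched."""
    conn.execute(
        "INSERT OR IGNORE INTO results (sample, end_point, status) "
        "VALUES (?, ?, 'Queued')",
        (sample, end_point),
    )


def set_status(conn, sample, end_point, status, value=None):
    """Record the outcome (or progress) of one calculation."""
    conn.execute(
        "UPDATE results SET status = ?, value = ? "
        "WHERE sample = ? AND end_point = ?",
        (status, value, sample, end_point),
    )


def status_summary(conn):
    """Count cells per status for an up-to-date view of the project."""
    rows = conn.execute(
        "SELECT status, COUNT(*) FROM results GROUP BY status"
    ).fetchall()
    return dict(rows)


conn = open_store()
for sample in ("A", "B"):
    queue(conn, sample, "logP")
set_status(conn, "A", "logP", "OK", "2.1")
set_status(conn, "B", "logP", "Error", "solver failed")
queue(conn, "C", "logP")   # a sample added later
queue(conn, "A", "logP")   # already present, so the stored result is kept
summary = status_summary(conn)
```

Because the (sample, end-point) pair is the primary key, re-queuing an existing cell cannot overwrite a finished result, which is what lets either axis grow safely.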

I'm happy to hear your thoughts on this, and if it would be of use to others.

Happy Holidays!

-Kip