Error when running larger jobs on grid

LM 2015-02-04

Hi,

We are running Pipeline Pilot on a Linux cluster using PBS/Torque as the grid engine.

I have a component in a protocol that I've set up to run in parallel on the grid in batches. With a batch size of 2000 records, each process takes around 70 seconds and everything completes OK. However, when I increase the batch size to 3000 or more records, I get an error message after about 80 seconds: "The job's server process halted unexpectedly". I have checked the queue, and the processes have not exited with an error; they complete successfully. I have also checked the memory size, stack size, etc., and these limits are all set to unlimited for scitegicuser (the user that runs the protocol).

I can also run the protocol with 3000 or more records per batch directly on the head node (without using the grid), and it completes successfully.

I think there is some sort of communication timeout between PP and PBS. I don't get much information from the PP logs other than that the job halted unexpectedly. The PBS job information sometimes shows "Job terminated at request of scitegicuser@etc".
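
As a rough sanity check on the timeout idea, here is a small Python sketch of the arithmetic. It assumes that per-batch runtime scales linearly with the number of records, which I have not verified; the function name and the constants are just for illustration, taken from the approximate timings above.

```python
def estimated_runtime(records, ref_records=2000, ref_seconds=70.0):
    """Estimate per-batch runtime in seconds.

    Assumption (not verified): runtime scales linearly with record count,
    calibrated on the observed ~70 s for a 2000-record batch.
    """
    return ref_seconds * records / ref_records


# Approximate time after which the error message appears.
FAILURE_AFTER_SECONDS = 80.0

for batch in (2000, 3000):
    t = estimated_runtime(batch)
    status = "exceeds" if t > FAILURE_AFTER_SECONDS else "within"
    print(f"{batch} records: ~{t:.0f} s ({status} ~{FAILURE_AFTER_SECONDS:.0f} s)")
```

Under that assumption, a 3000-record batch would need roughly 105 seconds, so it would still be running at the 80-second mark when the error appears, whereas a 2000-record batch finishes before then. That would fit a fixed time limit somewhere between 70 and 80 seconds, but it does not prove one.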

Does anyone have any ideas what may be going on?

Thanks for your help,

Liz