MPI errors on cluster

Hi,

Ever since Materials Studio moved from HP MPI / Platform MPI to Intel MPI, no Materials Studio jobs can run on more than one compute node on any of our compute clusters. Technical support couldn't do much because they cannot reproduce the issue. With version 6.x our users have coped so far, because they can still run their jobs on a single node with 12-16 cores. On a newer cluster with 16 cores per node, however, there is roughly a 20% chance that Discover / Forcite jobs fail with these errors:

[mpiexec@n057] control_cb (./pm/pmiserv/pmiserv_cb.c:674): assert (!closed) failed
[mpiexec@n057] HYDT_dmxu_poll_wait_for_event (./tools/demux/demux_poll.c:77): callback returned error status
[mpiexec@n057] HYD_pmci_wait_for_completion (./pm/pmiserv/pmiserv_pmci.c:388): error waiting for event
[mpiexec@n057] main (./ui/mpich/mpiexec.c:745): process manager error waiting for completion

This happens even though these are single-node jobs, so an interconnect issue seems unlikely. Does anybody know why we are getting this problem? It has become very severe since the upgrade to version 7. Version 6 doesn't have this problem (although it also doesn't work with multi-node jobs).
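For reference, this is roughly how we plan to gather more diagnostics on the failing single-node runs. The environment variables are Intel MPI's documented debug/fabric knobs; the process count and the `./hello_mpi` binary are placeholders, since the actual Discover/Forcite launch line is generated by the Materials Studio gateway:

```shell
# Force shared-memory transport on single-node runs, ruling out
# the InfiniBand stack, and turn on verbose Hydra / rank output.
export I_MPI_FABRICS=shm       # intra-node shared memory only
export I_MPI_DEBUG=5           # print fabric, pinning, and rank mapping info
export I_MPI_HYDRA_DEBUG=1     # trace the Hydra process manager (mpiexec)

# Placeholder launch: substitute the real Materials Studio job command.
mpiexec -n 16 ./hello_mpi
```

If the `assert (!closed)` failure still appears with `I_MPI_FABRICS=shm`, that would seem to confirm it is not interconnect-related.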

Brief system details (two separate clusters):

- Intel Xeon Westmere and Ivy Bridge CPUs (12 and 16 cores per node)

- Infiniband interconnect

- MS 6.x and 7.x with latest service pack installed

All other MPI applications have no problem running on our clusters.

Due to security policy, Accelrys support staff can't access our systems. WebEx is probably the only option, although the time zone difference makes this challenging.

Any advice is appreciated.