Myrinet test configuration

I have been running MS 5.0 for a while on a cluster of eight-core (2 x quad-core) Xeon machines (actually Mac Pros) connected by fast Ethernet. I have had no real trouble with these machines to date, but I would like to decrease the latency of the interconnects, so I am contemplating buying a Myrinet switch and Myrinet cards for each machine. For testing, a Myrinet vendor has lent me two cards; they are installed in two machines of the cluster with a direct link between them.

A description of the overall setup follows:
Fast Ethernet setup:
The MS 5.0 software was installed with the --cluster option on each Linux box, along with the included HP-MPI RPM. All machines are running SUSE 11.1 (Linux node1 2.6.27.45-0.1-default #1 SMP 2010-02-22 16:49:47 +0100 x86_64 x86_64 x86_64 GNU/Linux).
The cluster has a common NFS disk for submitting jobs, and the job directory on each node has a symbolic link to a jobs folder pointing to the shared NFS directory. In the Accelrys/MaterialsStudio50/share/data/machines.LINUX file I have specified lines to the effect of:

node1 150.29.xx.xx
node2 150.29.xx.xy

where the above entries refer to the fast Ethernet interfaces.

This all works quite well, but I noted that QMD jobs in CASTEP got bogged down, presumably because of the Ethernet latency/bandwidth.

To try to get around this I am thinking of installing a Myrinet switch and Myrinet cards in each box. For now I have the two Myri-10G cards from the vendor installed in two nodes, and these two nodes are connected directly by the appropriate cable. I have installed the driver, and each card is recognized as an additional Ethernet device (in my case eth0). I have defined the cards in the machines.LINUX file and in /etc/hosts:

/etc/hosts
150.29.xx.xx node1 #(dns registered name) -> eth3
150.29.xx.xy node2 #(dns registered name) -> eth3
192.168.1.1 myri-node1 # -> eth0 Myrinet10G
192.168.1.2 myri-node2 # -> eth0 Myrinet10G
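As a sanity check, the mapping between the myri-* names and the private 192.168.1.0/24 subnet can be verified by parsing the hosts file. A minimal sketch — the here-document stands in for /etc/hosts (public addresses masked as in the listing above), and the myri- name prefix is the convention from my entries:

```shell
# Sanity-check the Myrinet entries: every "myri-*" name should
# live on the private 192.168.1.0/24 subnet.
hosts_file=$(mktemp)
cat > "$hosts_file" <<'EOF'
150.29.xx.xx  node1
150.29.xx.xy  node2
192.168.1.1   myri-node1
192.168.1.2   myri-node2
EOF
# List name -> address for the Myrinet interfaces.
awk '$2 ~ /^myri-/ { print $2, $1 }' "$hosts_file"
# Flag any myri-* entry that is not on the private subnet.
awk '$2 ~ /^myri-/ && $1 !~ /^192\.168\.1\./ { print "WARNING:", $2, "is off-subnet" }' "$hosts_file"
```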



modprobe myri10ge # loads the driver

ifconfig eth0 gives
eth0 Link encap:Ethernet HWaddr 00:60:DD:46:BA:C8
inet addr:192.168.1.1 Bcast:192.168.1.255 Mask:255.255.255.0
inet6 addr: fe80::260:ddff:fe46:bac8/64 Scope:Link
UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
RX packets:3542012 errors:0 dropped:0 overruns:0 frame:0
TX packets:23452590 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:4095353265 (3905.6 Mb) TX bytes:34987404219 (33366.5 Mb)
Interrupt:243
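One thing that stands out in the ifconfig output is MTU:1500. 10-gigabit links are often run with jumbo frames (MTU 9000) to cut per-packet overhead, though whether that helps MPI latency here is a separate question. A quick check of the reported MTU — the here-document is just an illustrative copy of the relevant ifconfig line:

```shell
# Extract the MTU from a captured ifconfig report and warn if the
# interface is still at the default 1500 rather than jumbo frames.
ifconfig_out=$(mktemp)
cat > "$ifconfig_out" <<'EOF'
eth0  Link encap:Ethernet
      UP BROADCAST RUNNING MULTICAST MTU:1500 Metric:1
EOF
mtu=$(sed -n 's/.*MTU:\([0-9][0-9]*\).*/\1/p' "$ifconfig_out")
echo "eth0 MTU is $mtu"
if [ "$mtu" -lt 9000 ]; then
    # Only worth trying if both cards and the cable/switch permit it.
    echo "consider: ifconfig eth0 mtu 9000 (jumbo frames)"
fi
```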


My routing looks like:

Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
192.168.100.0 0.0.0.0 255.255.255.0 U 0 0 0 pan0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
150.29.54.0 0.0.0.0 255.255.254.0 U 0 0 0 eth3
169.254.0.0 0.0.0.0 255.255.0.0 U 0 0 0 eth3
127.0.0.0 0.0.0.0 255.0.0.0 U 0 0 0 lo
0.0.0.0 150.29.54.1 0.0.0.0 UG 0 0 0 eth3


The device eth3 is the gigabit Ethernet connection to the LAN through a gigabit switching hub. The 169.254.0.0 entry is just the zeroconf interface on SUSE.
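On a live node, `ip route get 192.168.1.2` confirms directly which interface the kernel will pick for the Myrinet traffic. As an offline sketch of the same longest-prefix lookup against the table above (the here-document copies the routing table; 192.168.1.2 falls in 192.168.1.0/24, so the answer should be eth0):

```shell
# Find the interface serving the 192.168.1.0/24 route in a captured
# copy of the kernel routing table.
route_table=$(mktemp)
cat > "$route_table" <<'EOF'
192.168.100.0 0.0.0.0 255.255.255.0 U 0 0 0 pan0
192.168.1.0 0.0.0.0 255.255.255.0 U 0 0 0 eth0
150.29.54.0 0.0.0.0 255.255.254.0 U 0 0 0 eth3
0.0.0.0 150.29.54.1 0.0.0.0 UG 0 0 0 eth3
EOF
# Column 8 of "route -n" output is the interface.
iface=$(awk '$1 == "192.168.1.0" { print $8 }' "$route_table")
echo "traffic to 192.168.1.2 should leave via $iface"
```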

Pinging and traceroute work fine (note that the second trace is from node2 to its own Myrinet address, hence the 0.000 ms times):
node2:~ # traceroute 192.168.1.1
traceroute to 192.168.1.1 (192.168.1.1), 30 hops max, 40 byte packets using UDP
1 myri-node1 (192.168.1.1) 3.861 ms 0.079 ms 0.085 ms
node2:~ # traceroute 192.168.1.2
traceroute to 192.168.1.2 (192.168.1.2), 30 hops max, 40 byte packets using UDP
1 myri-node2 (192.168.1.2) 0.000 ms 0.000 ms 0.000 ms


When I try to start a parallel job, things seem to get stuck, with no traffic over the eth0 link and a large system load; e.g. top yields

top - 11:03:55 up 19 min, 3 users, load average: 6.79, 3.43, 1.39
Tasks: 179 total, 9 running, 170 sleeping, 0 stopped, 0 zombie
Cpu(s): 0.0%us, 73.0%sy, 14.2%ni, 12.8%id, 0.0%wa, 0.0%hi, 0.0%si, 0.0%st
Mem: 20557444k total, 642984k used, 19914460k free, 11464k buffers
Swap: 2109340k total, 0k used, 2109340k free, 169720k cached

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
5893 matstudc 29 9 156m 53m 19m R 100 0.3 3:06.73 castepexe_mpi.e
5898 matstudc 29 9 151m 47m 13m R 100 0.2 3:08.93 castepexe_mpi.e
5894 matstudc 29 9 153m 49m 15m R 100 0.2 3:08.77 castepexe_mpi.e
5896 matstudc 29 9 151m 49m 15m R 100 0.2 3:08.90 castepexe_mpi.e
5897 matstudc 29 9 151m 49m 15m R 100 0.2 3:08.95 castepexe_mpi.e
5899 matstudc 29 9 152m 47m 13m R 100 0.2 3:08.82 castepexe_mpi.e
5900 matstudc 29 9 151m 47m 13m R 100 0.2 3:09.01 castepexe_mpi.e

Note the 73% sy time.

The CASTEP job is set for 16 processes and starts out fine. It runs until the initial SCF cycle is complete and writes a checkpoint file. After this point the sy time goes up and nothing further happens.
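For reference, two things I plan to try next (suggestions, not a diagnosis): first, finding where each rank is actually stuck; second, making sure HP-MPI really uses the Myrinet interface rather than falling back to eth3. The HP-MPI manual documents an MPI_IC_ORDER environment variable and a -netaddr mpirun option for steering TCP traffic onto a given subnet, but the exact spelling below is an assumption to be checked against the manual shipped with MS 5.0:

```shell
# On the stuck node, see which kernel function each CASTEP rank is
# waiting in ("0" or "running" suggests a busy-spin in user space):
for pid in $(pgrep castepexe); do
    printf '%s: ' "$pid"; cat /proc/$pid/wchan; echo
done
# Attach to one rank and watch its system calls (high %sy usually
# means a syscall loop, e.g. on a socket or an NFS file):
strace -p 5893

# Steer HP-MPI's TCP traffic onto the Myrinet subnet.  Option names
# taken from the HP-MPI documentation; verify before relying on them.
export MPI_IC_ORDER="tcp"
mpirun -TCP -netaddr 192.168.1.0 ...
```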

Any ideas on what is going on?

paul-fons@aist.go.jp