
MITgcm ECCO-GODAE 1 degree configuration on a Power5 GigEth cluster

June 29, 2010

Yes, such an imbalanced cluster (powerful nodes for their time, slow interconnect) does exist. I presume that when it was bought it was expected to serve as a high-throughput resource, or to run very latency-tolerant, low-bandwidth applications, so the interconnect was needed only for network file system access (which by itself can be reason enough for something better than plain Gigabit Ethernet, unless disk access is very infrequent). The compute nodes are 13 OpenPower 710 (2-socket) and 2 OpenPower 720 (4-socket) nodes with Power5 processors at 1.65 GHz running RHEL 5.2.

Parallel scaling of the standard 1-degree (360×160, spanning -80 to +80 degrees in latitude) ECCO-GODAE configuration is not particularly good on this system. Shown in the table below are the average (arithmetic mean) per-timestep timings on processor 0. Each timestep corresponds to a model hour, so 1 second per timestep corresponds to about 9.86 model years per compute day.
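
For reference, the conversion from wall-clock time per timestep to model years per compute day is simple arithmetic; here is a minimal Python sketch of it (the function name is mine, and it assumes one model hour per timestep and a 365-day model year):

```python
# Convert wall-clock seconds per timestep into model years per compute day,
# assuming one model hour per timestep and a 365-day model year.
def model_years_per_compute_day(seconds_per_timestep, model_hours_per_timestep=1.0):
    timesteps_per_day = 86400.0 / seconds_per_timestep   # wall-clock seconds in a day
    model_hours = timesteps_per_day * model_hours_per_timestep
    return model_hours / (24.0 * 365.0)

print(model_years_per_compute_day(1.0))   # ~9.86, as quoted above
print(model_years_per_compute_day(0.82))  # ~12.0 for the best 16-CPU run below
```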

Seconds per timestep (percentages are scaling relative to the 8-processor run):

                     MPICH 1.2.7p1                            OpenMPI 1.2.7
CPUs     I/O         user         system       wall           user         system       wall
4×2=8    root I/O    0.70 (100%)  0.24 (100%)  1.01 (100%)    0.70 (100%)  0.19 (100%)  0.96 (100%)
4×2=8    tile I/O    0.70 (100%)  0.23 (100%)  1.00 (100%)    0.70 (100%)  0.19 (100%)  0.95 (100%)
4×4=16   root I/O    0.42 (83%)   0.36 (33%)   0.83 (61%)     0.39 (90%)   0.36 (26%)   0.83 (58%)
4×4=16   tile I/O    0.40 (88%)   0.36 (32%)   0.81 (62%)     0.40 (88%)   0.35 (27%)   0.82 (58%)
6×4=24   root I/O    0.30 (78%)   0.38 (21%)   0.72 (47%)     0.28 (83%)   0.40 (16%)   0.75 (43%)
6×4=24   tile I/O    0.29 (80%)   0.39 (20%)   0.77 (43%)     0.28 (83%)   0.40 (16%)   0.74 (43%)
6×5=30   root I/O    0.29 (64%)   0.55 (12%)   0.88 (31%)     0.24 (78%)   0.52 (10%)   0.81 (32%)
6×5=30   tile I/O    0.27 (69%)   0.51 (12%)   0.84 (32%)     0.24 (78%)   0.53 (10%)   0.82 (31%)

The scaling is relative to the 8-processor parallel run; one can safely presume it would be worse if measured (as a strict definition of parallel scaling would require) against the best available serial run. User time nonetheless scales fairly decently. It is system time that really misbehaves on this system, increasing (!) instead of decreasing with processor count, and as a result the only parallel scaling that really matters, the one with respect to wall-clock time, is very bad. I would not waste CPU cycles running on more than 16 CPUs on this system, and even that is up for debate (0.82 sec/timestep vs. 0.95 for 8 CPUs). When I get a chance I'll try this benchmark configuration on the PPC970MP cluster at IU (Big Red), which uses Myrinet 2G as an interconnect and has both Lustre and GPFS filesystems, to see the difference this can make (granted, the CPU is different, though from the same family).
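
To be explicit about how the percentages in the table are computed (relative to the 8-processor run, not a serial one), here is a minimal Python sketch; the function name is mine:

```python
# Parallel efficiency of an N-processor run relative to the 8-processor baseline:
# speedup over the baseline divided by the increase in processor count.
def efficiency_pct(t_base, n_base, t, n):
    return 100.0 * (t_base * n_base) / (t * n)

# Example: MPICH, root I/O, wall-clock time at 16 CPUs vs. the 8-CPU run.
print(round(efficiency_pct(1.01, 8, 0.83, 16)))   # ~61, matching the table above
```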

All executables were built with (relatively recent) versions of the IBM XL Fortran (11.1) and C/C++ (9.0) compilers for RHEL, at -O3 with flags for tuning to the native processor and enabling inlining and high-order ("hot") transformations. -O5 seems to produce static binaries that fail in mysterious ways.

Both MPI implementations used are antiquated: OpenMPI is currently at version 1.4.2, and MPICH is no longer under development, having been superseded by MPICH2. Still, I very much doubt that newer implementations would make a big difference in performance, other than something like ParaStation V4 or V5, Scali MPI Connect (now Platform MPI), or an MX-aware MPI implementation over Open-MX, which can offer lower latencies over Gigabit Ethernet. An easier solution would be to take advantage of low prices on first-generation InfiniBand hardware (e.g. ~$125 for an SDR HCA and ~$2k for a 24-port SDR switch) and upgrade the network on this system.