Nektar++ 2.0 is out

Posted October 25, 2010 by adventuresinhpc
Categories: Uncategorized

and it can be found at http://www.nektar.info. I’m partial to this work as it is the evolution of the Nektar code I worked with during my Ph.D. thesis work, and the main developers are colleagues (and friends). Nektar++ is a generalized library of spectral element (also known as hp-element) methods that can be applied not just to flow problems (as was the case with Nektar, which has been used for incompressible and compressible Navier-Stokes (N-S) as well as MHD; for examples see the work at IC and at Brown) but to other application domains as well. As expected, given the expertise of the developers, the initial emphasis is still on basic PDEs (e.g. the advection-diffusion equation) and N-S. There is already one external adoption of the code, with application to coastal modeling in conjunction with the Cactus framework (which makes it triply interesting to me). Nektar++ is not yet as mature for production work as Nektar (more than 15 years of production use), which explains why at my graduate alma mater (Brown) Nektar still does the heavy lifting for flow research work, but the new codebase promises greater flexibility for the future:

  1. It offers a unified framework for Continuous, Discontinuous and Hybrid Discontinuous Galerkin methods – the latter is of interest to colleagues in the Mechanical Engineering Department at MIT
  2. It is written as an object-oriented toolkit, similar in some aspects to OpenFOAM, with all the advantages for code development and integration that such an approach entails
  3. The developers have put a lot of effort into efficiently computing operators

Well, after this shameless plug :-), over the next few posts I will detail my experiences/gripes with building Nektar++ on different platforms.
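As a teaser for those posts: the build is CMake-based, so a minimal sketch looks something like the following (the tarball name and options are assumptions on my part – check http://www.nektar.info for the actual release artifacts and the third-party libraries, e.g. Boost, that the build expects to find):

  # hypothetical tarball name for the 2.0 release
  tar xzf nektar++-2.0.tar.gz
  cd nektar++-2.0
  mkdir build && cd build
  cmake ..          # point CMake at Boost and other third-party libraries as needed
  make -j4
  make install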

MITgcm ECCO-GODAE 1 degree configuration on a Power5 GigEth cluster

Posted June 29, 2010 by adventuresinhpc
Categories: Benchmarking

Yes – such an imbalanced cluster (powerful nodes for their time, slow interconnect) does exist. I presume that when it was bought it was expected to serve as a high-throughput resource or to run very latency-tolerant, low-bandwidth applications, so the interconnect was needed only for network file system access (which can still be reason enough for something better than plain GigE, unless disk access is very infrequent). The compute nodes are 13 OpenPower 710 (2-socket) and 2 OpenPower 720 (4-socket) nodes with Power5 processors at 1.65 GHz running RHEL 5.2.

Parallel scaling of the standard 1-degree (360×160 grid, spanning -80 to +80 degrees in latitude) ECCO-GODAE configuration is not particularly great on this system. Shown below are the average (arithmetic mean) timings on processor 0 per timestep, in seconds. Each timestep corresponds to a model hour, so 1 second per timestep corresponds to 9.86 model years per compute day.
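For the record, the conversion behind that last figure works out as follows (using 365.25 days per model year):

\[
\frac{86400\ \mathrm{s/compute\ day}}{1\ \mathrm{s/timestep}}
= 86400\ \text{model hours} = 3600\ \text{model days} \approx 9.86\ \text{model years per compute day}.
\]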

Timings are in seconds per timestep on processor 0, with parallel efficiency relative to the corresponding 8-process run in parentheses.

                          MPICH 1.2.7p1                               OpenMPI 1.2.7
Decomposition  I/O mode   user         system       wall clock       user         system       wall clock
4×2=8          root I/O   0.70 (100%)  0.24 (100%)  1.01 (100%)      0.70 (100%)  0.19 (100%)  0.96 (100%)
               tile I/O   0.70 (100%)  0.23 (100%)  1.00 (100%)      0.70 (100%)  0.19 (100%)  0.95 (100%)
4×4=16         root I/O   0.42 (83%)   0.36 (33%)   0.83 (61%)       0.39 (90%)   0.36 (26%)   0.83 (58%)
               tile I/O   0.40 (88%)   0.36 (32%)   0.81 (62%)       0.40 (88%)   0.35 (27%)   0.82 (58%)
6×4=24         root I/O   0.30 (78%)   0.38 (21%)   0.72 (47%)       0.28 (83%)   0.40 (16%)   0.75 (43%)
               tile I/O   0.29 (80%)   0.39 (20%)   0.77 (43%)       0.28 (83%)   0.40 (16%)   0.74 (43%)
6×5=30         root I/O   0.29 (64%)   0.55 (12%)   0.88 (31%)       0.24 (78%)   0.52 (10%)   0.81 (32%)
               tile I/O   0.27 (69%)   0.51 (12%)   0.84 (32%)       0.24 (78%)   0.53 (10%)   0.82 (31%)

The scaling quoted is relative to the 8-processor parallel run – one can safely presume it would be worse if computed, as the strict definition of speedup requires, against the best available serial run. User time nonetheless scales reasonably well. It is system time that really misbehaves on this system, increasing (!) instead of decreasing with processor count, and as a result the only parallel scaling that really matters – that with respect to wall clock time – is very bad. I would not waste CPU cycles running on more than 16 CPUs on this system, and even that is up for debate (0.82 sec/timestep vs. 0.95 for 8 CPUs). When I get a chance I’ll try this benchmark configuration on the PPC970MP cluster at IU (Big Red), which uses Myrinet 2G as an interconnect and has both Lustre and GPFS filesystems, to see the difference these can make (OK, the CPU is different, though from the same family).
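To be explicit, the percentages in the table are parallel efficiencies with respect to the 8-process run:

\[
E(p) = \frac{8\,T_{8}}{p\,T_{p}}, \qquad
\text{e.g.}\quad E(16) = \frac{8 \times 1.01}{16 \times 0.83} \approx 61\% \quad \text{(MPICH, root I/O, wall clock)}.
\]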

All executables were built with (relatively recent) versions of the IBM XL Fortran (11.1) and C/C++ (9.0) compilers for RHEL at -O3, with flags tuning for the native processor and enabling inlining and high-order ("hot") transformations. -O5 seems to produce static binaries that fail in mysterious ways.
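Purely as an illustrative reconstruction (the exact build lines are not reproduced here), that flag combination corresponds to something along these lines with the XL compilers, where -qhot provides the high-order transformations and -Q requests inlining:

  # illustrative reconstruction, not the actual build line; source file names are placeholders
  xlf_r -O3 -qarch=auto -qtune=auto -qhot -Q -c some_routine.F
  xlc_r -O3 -qarch=auto -qtune=auto -qhot -Q -c some_code.c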

Both MPI implementations used are antiquated – OpenMPI is currently at version 1.4.2 and MPICH is no longer under development, having been superseded by MPICH2. Still, I very much doubt that newer implementations would make a big difference in performance – except perhaps for something like ParaStation V4 or V5, Scali MPI Connect (now Platform MPI), or an MX-aware MPI implementation running over Open-MX, all of which can offer lower latencies over Gigabit Ethernet. An easier solution would be to take advantage of the low prices of first-generation InfiniBand hardware (e.g. ~$125 for an SDR HCA and ~$2k for a 24-port SDR switch) and upgrade the network on this system.

How to get Sun ClusterTools 8.2.1 to work on MIT’s AFS Athena cluster

Posted March 9, 2010 by adventuresinhpc
Categories: Howtos

At the end of the 1990s MIT’s Athena cluster was mostly Sparc/Solaris with a small component of MIPS/IRIX boxes. In the 2000s this shifted to x86/x86_64 Linux with an ever-decreasing proportion of Sparc/Solaris machines; the last of them (quad-processor UltraSPARC IIIi based), which used to offer the old "dialup" service, are supposed to go away in the summer of 2010 – a great pity for us, as we run daily MITgcm validation tests on them.

In any case, I had installed Sun ClusterTools 8.2.1 (OpenMPI 1.3.4 based; since version 8.2.1c, which is based on OpenMPI 1.4.2, it is called the Oracle Message Passing Toolkit) on my Athena account for both x86/x86_64/Linux and Sparc/Solaris. Since the installation did not go into the default location, I have had to use the OPAL_PREFIX environment variable to point to the new installation root directory. Runs within a single node are fine under both operating systems. Running across multiple nodes, however, is a bit more complicated. In all cases one should use Kerberos and the Kerberized rsh with the right flags to allow connecting to the other nodes for startup (having of course obtained tickets first). AFS also creates problems when it comes to the remote nodes seeing files stored on AFS (more on this below). In the absence of a queuing system one uses a manual hostfile, and to allow for mixed MPI-OpenMP applications the value of the environment variable OMP_NUM_THREADS also needs to be propagated with the -x flag.
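In concrete terms, the per-session setup on an Athena (tcsh) login looks roughly like this; the ClusterTools install root below is a placeholder (only the hostfile path is the real one used in the runs further down):

  kinit                                               # obtain Kerberos tickets first
  aklog                                               # and AFS tokens, if your login scripts do not
  setenv OPAL_PREFIX /mit/13.715/clustertools-8.2.1   # placeholder: wherever ClusterTools was actually unpacked
  setenv OMP_NUM_THREADS 2                            # OpenMP threads per MPI task for the hybrid runs
  cat /mit/13.715/ompihosts-linux                     # the manual hostfile: one Athena hostname per line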

Under Linux this works fine (even for the mixed case) both with the GNU compilers (as a quick test I used the mixed MPI-OpenMP benchmark HOMB):

no-knife:benchmark/comms/HOMB-1.0% mpirun -report-bindings -wd $PWD -mca plm_rsh_agent "krb5-rsh -x -F" -hostfile /mit/13.715/ompihosts-linux -np 6 -pernode -display-map -x OPAL_PREFIX -x OMP_NUM_THREADS /tmp/homb.ex.gnu -NRC 3072 -NITER 10 -pc -s

 ========================   JOB MAP   ========================

 Data for node: Name: contents-vnder-pressvre   Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 0
 Data for node: Name: no-knife.mit.edu          Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 1
 Data for node: Name: mass-toolpike             Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 2
 Data for node: Name: home-on-the-dome          Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 3
 Data for node: Name: scrubbing-bubbles         Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 4
 Data for node: Name: all-night-tool            Num procs: 1
        Process OMPI jobid: [35507,1] Process rank: 5

 =============================================================

This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.

Number of Rows: 3072, Number of Columns: 3072, Number of Iterations: 10
Number of Tasks: 6, Number of Threads per Task: 2
Reduction at end of each iteration.

Summary Standard Ouput with Header
#==========================================================================================================#
#   Tasks   Threads     NR       NC     NITER    meanTime    maxTime     minTime    NstdvTime              #
#==========================================================================================================#
      6        2       3072     3072      10     0.376302    0.636142    0.151431    0.350631

and with the Sun compilers for Linux:

no-knife:benchmark/comms/HOMB-1.0% mpirun -report-bindings -wd $PWD -mca plm_rsh_agent "krb5-rsh -x -F" -hostfile /mit/13.715/ompihosts-linux -np 6 -pernode -display-map -x OPAL_PREFIX -x OMP_NUM_THREADS /tmp/homb.ex.sun.linux.m32 -NRC 3072 -NITER 10 -pc -s

 ========================   JOB MAP   ========================

 Data for node: Name: contents-vnder-pressvre   Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 0
 Data for node: Name: no-knife.mit.edu          Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 1
 Data for node: Name: mass-toolpike             Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 2
 Data for node: Name: home-on-the-dome          Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 3
 Data for node: Name: scrubbing-bubbles         Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 4
 Data for node: Name: all-night-tool            Num procs: 1
        Process OMPI jobid: [34033,1] Process rank: 5

 =============================================================

This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.
This rsh session is encrypting input/output data transmissions.

Number of Rows: 3072, Number of Columns: 3072, Number of Iterations: 10
Number of Tasks: 6, Number of Threads per Task: 2
Reduction at end of each iteration.

Summary Standard Ouput with Header
#==========================================================================================================#
#   Tasks   Threads     NR       NC     NITER    meanTime    maxTime     minTime    NstdvTime              #
#==========================================================================================================#
      6        2       3072     3072      10     0.040981    0.066348    0.015613    0.448715

Unfortunately, under Solaris things get ugly. One also needs to set LD_LIBRARY_PATH for both ClusterTools and the Sun compilers:

setenv LD_LIBRARY_PATH /mit/sunsoft_v12u1/sunstudio12.1/libdynamic/v9:/mit/sunsoft_v12u1/sunstudio12.1/libdynamic:$OPAL_PREFIX/lib/64:$OPAL_PREFIX/lib/32:/mit/gcc-4.0/lib:/mit/13.715/ActiveTcl-8.4/lib:/mit/13.715/ActiveTcl-8.4/lib/tclx8.4

and then specify -prefix as well as propagate LD_LIBRARY_PATH with the -x flag – and still the runs fail, both in 32-bit mode:

mpirun -prefix $OPAL_PREFIX -report-bindings -wd $PWD -mca plm_rsh_agent "rsh -x -F" -hostfile /mit/13.715/ompihosts-sunos -np 2 -pernode -display-map -x OPAL_PREFIX -x OMP_NUM_THREADS -x LD_LIBRARY_PATH /tmp/homb.ex.sun.sparc.m32 -NRC 2048 -NITER 10 -pc -s

[department-of-alchemy.mit.edu:14767] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[department-of-alchemy.mit.edu:14767] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[department-of-alchemy.mit.edu:14767] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 541

and in 64-bit mode:

mpirun -prefix $OPAL_PREFIX -report-bindings -wd $PWD -mca plm_rsh_agent "rsh -x -F" -np 2 -display-map -x OPAL_PREFIX -x OMP_NUM_THREADS -x LD_LIBRARY_PATH /tmp/homb.ex.sun.sparc.m64 -NRC 2048 -NITER 10 -pc -s

[department-of-alchemy.mit.edu:14671] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file ess_hnp_module.c at line 161
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[department-of-alchemy.mit.edu:14671] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file runtime/orte_init.c at line 132
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_set_name failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
[department-of-alchemy.mit.edu:14671] [[INVALID],INVALID] ORTE_ERROR_LOG: Not found in file orterun.c at line 541

The error messages are rather cryptic, but after some effort one discovers that they are caused by AFS – specifically, the partner node has trouble accessing files over AFS (including the ones providing the MPI runtime) because the tokens have not been generated/propagated properly – a problem not seen under Athena-Linux. The solution is to manually log in to all partner nodes (and run aklog if this is not done automatically for you). Then one can successfully use ClusterTools in 32-bit mode on Sparc/Solaris/Athena across multiple nodes.
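For the record, that manual login step amounts to roughly the following, repeated for each host in the hostfile (host name borrowed from the run below):

  rsh -x -F biohazard-cafe   # log in to the partner node once, using the Kerberized rsh
  aklog                      # obtain AFS tokens there, if your login scripts do not
  exit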

mpirun -prefix $OPAL_PREFIX -report-bindings -wd $PWD -mca plm_rsh_agent "rsh -x -F" -hostfile /mit/13.715/ompihosts-sunos -np 2 -display-map -pernode -x OPAL_PREFIX -x OMP_NUM_THREADS -x LD_LIBRARY_PATH /tmp/homb.ex.sun.sparc.m32 -NRC 2048 -NITER 10 -pc -s

 ========================   JOB MAP   ========================

 Data for node: Name: department-of-alchemy.mit.edu   Num procs: 1
        Process OMPI jobid: [2763,1] Process rank: 0
 Data for node: Name: biohazard-cafe                  Num procs: 1
        Process OMPI jobid: [2763,1] Process rank: 1

 =============================================================

This rsh session is encrypting input/output data transmissions.

Number of Rows: 2048, Number of Columns: 2048, Number of Iterations: 10
Number of Tasks: 2, Number of Threads per Task: 2
Reduction at end of each iteration.

Summary Standard Ouput with Header
#==========================================================================================================#
#   Tasks   Threads     NR       NC     NITER    meanTime    maxTime     minTime    NstdvTime              #
#==========================================================================================================#
      2        2       2048     2048      10     0.054501    0.058747    0.052440    0.036289

Trying to do the same in 64-bit mode still runs into trouble because of problematic resolution of the dynamic libraries – I use Sun Studio 12u1 and the remote node refuses to load the proper (64-bit) versions of them.
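One thing still worth trying (a hedged guess, not something I have verified here): Solaris lets 64-bit processes consult LD_LIBRARY_PATH_64 ahead of LD_LIBRARY_PATH, so propagating a 64-bit-only search path (built from the same directories used above) might coax the remote nodes into picking up the right libraries:

  setenv LD_LIBRARY_PATH_64 /mit/sunsoft_v12u1/sunstudio12.1/libdynamic/v9:$OPAL_PREFIX/lib/64
  # then add -x LD_LIBRARY_PATH_64 to the mpirun invocation above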

