Running jobs on hippolyta

In scientific computing, we need tools to coordinate scientists (tens to hundreds of them) with cluster computing resources, which can range from tens to millions of processors. We also need tools to wrangle a menagerie of (often conflicting) software libraries that support computational science applications.

Hippolyta uses Slurm to address the first challenge and Environment Modules to address the second.

Environment Modules

Do you need a specific version of a library (FFTW-2.1.5, for example) that conflicts with the version another project requires (FFTW-3.3.4)? [1] This is one of the problems that Environment Modules addresses.

Environment Modules works by manipulating the environment variables that tell your shell where to look for programs and libraries, like PATH and LD_LIBRARY_PATH.
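The mechanism can be illustrated without the module command at all. The sketch below is not what Environment Modules literally runs; it only mimics the effect of loading and unloading a package on PATH. The directory demo_dir and the program mytool are hypothetical names made up for the illustration.

```shell
# Create a throwaway directory holding a fake "mytool" executable (hypothetical names)
demo_dir=$(mktemp -d)
mkdir -p "$demo_dir/bin"
printf '#!/bin/sh\necho "mytool from demo_dir"\n' > "$demo_dir/bin/mytool"
chmod +x "$demo_dir/bin/mytool"

# Roughly the effect of "module load": prepend the package's bin directory to PATH
old_path=$PATH
export PATH="$demo_dir/bin:$PATH"
found=$(command -v mytool)    # the shell now finds mytool inside $demo_dir/bin
echo "$found"

# Roughly the effect of "module unload": put PATH back the way it was
export PATH=$old_path
```

LD_LIBRARY_PATH is handled the same way, except that it controls where the dynamic linker looks for shared libraries rather than where the shell looks for programs.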

Use the module command to list available packages, or to load them.

mullins@hippolyta > module avail

------------------------- /usr/share/Modules/modulefiles --------------------------
dot         module-git  module-info modules     null        use.own

-------------------------- /mnt/data/sw/etc/modulefiles ---------------------------
gcc/4.9.3      lammps         openmpi        python/2.7     spparks
hoomd          mpich          paraview/4.4.0 python/3.4

When multiple versions of a software package are installed and no version is specified, module loads the default version, which is normally the most recent one:

mullins@hippolyta > module load python
mullins@hippolyta > python
Python 3.4.3 (default, May 28 2015, 17:24:24)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>

To use Python 2 instead, unload the current module and load the specific version:

mullins@hippolyta > module unload python
mullins@hippolyta > module load python/2.7
mullins@hippolyta > python
Python 2.7.10 (default, May 29 2015, 10:08:46)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>

Modules can depend on or conflict with other modules; the lammps module is a good example. Running module load lammps without first loading its dependencies results in an error. To simplify this, load the lammps-all module, which loads the lammps module along with its dependencies.

brian@hippolyta > module load lammps-all
gcc version 4.9.3 loaded
mpich2 version 3.1.4 loaded
lammps loaded
brian@hippolyta > which lammps
/mnt/data/sw/lammps/bin/lammps

For commonly used modules, it can be convenient to load them automatically at login time by placing the relevant module load command in your shell startup file (~/.bashrc for bash, the default shell on Hippolyta).
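A startup-file fragment along these lines is one way to do that. It is a sketch, not a recommended configuration: the module names are taken from the module avail listing above, and the guard around them is an added assumption so that the file still works in shells where the module command has not been set up.

```shell
# ~/.bashrc fragment (sketch): load frequently used modules at login.
# The guard skips the block in shells where "module" is not defined.
if command -v module >/dev/null 2>&1; then
    module load gcc/4.9.3
    module load python/3.4
fi
```

Pinning explicit versions here, rather than relying on the default, keeps your environment stable if a newer version is installed later.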

Slurm

Hippolyta uses Slurm, a free and open source resource manager and scheduler developed at Lawrence Livermore National Laboratory. Slurm is increasingly used at U.S. national labs and supercomputing facilities, and on many of the TOP500 systems. If you're used to a PBS system, you may find this comparison of Slurm commands to PBS commands helpful, along with this more extensive reference sheet.

Hippolyta currently has three compute node partitions (or job queues):

  • test - a small partition for short debugging jobs (6-hour limit)
  • batch - a large partition for production calculations (7-day limit)
  • holm515 - a dedicated partition for Professor Holm's 27-515 Intro to Computational MSE course

Slurm allows users to request resources from these processor partitions to run interactive commands with srun, or to run batch scripts with sbatch. These commands must be run from a directory that the whole cluster has access to. On Hippolyta, this means that jobs must be run from inside your ~/data directory, which is mounted on a shared Network File System (NFS) volume.
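As a minimal sketch of the batch workflow, the snippet below writes a small job script and submits it if sbatch is available. The file name hello_job.sh, the job name, and the resource numbers are illustrative assumptions; the #SBATCH options themselves are standard sbatch options. On Hippolyta, run it from a directory under ~/data.

```shell
# Write a small batch script (hypothetical name: hello_job.sh)
cat > hello_job.sh <<'EOF'
#!/bin/bash
#SBATCH --job-name=hello         # name shown in squeue (illustrative)
#SBATCH --partition=test         # one of the partitions listed above
#SBATCH --nodes=1                # number of nodes to allocate
#SBATCH --ntasks=4               # number of tasks (processes) to start
#SBATCH --time=00:10:00          # wall-clock limit, within the partition's limit
#SBATCH --output=hello-%j.out    # %j is replaced by the job ID

# Each of the 4 tasks prints the name of the node it runs on
srun hostname
EOF

# Submit the script only where Slurm is installed
if command -v sbatch >/dev/null 2>&1; then
    sbatch hello_job.sh
else
    echo "sbatch not found; submit hello_job.sh from the cluster login node"
fi
```

For a quick interactive check, the same command can be run directly with something like srun --partition=test --ntasks=1 hostname, which blocks until the resources are granted and prints the result to your terminal.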

Use the sinfo command to check the status of the compute partitions, including time limits and availability.

brian@hippolyta > sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 7-00:00:00      3  down* cacophony[14,16,21]
batch*       up 7-00:00:00      1    mix cacophony23
batch*       up 7-00:00:00     22   idle cacophony[00-13,15,17-20,22,24-25]
holm515      up    1:00:00      2   idle cacophony[26-27]
test         up    6:00:00      1  down* cacophony30
test         up    6:00:00      3   idle cacophony[28-29,31]
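If you only want to know how many nodes are free, the sinfo output can be summarized with standard text tools. The sketch below runs awk over the sample output shown above so that it works anywhere; on the cluster you would pipe sinfo itself into the same awk command. It assumes the default column layout shown here.

```shell
# Sample sinfo output copied from above; on the cluster, use: sinfo | awk ...
sinfo_output='PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 7-00:00:00      3  down* cacophony[14,16,21]
batch*       up 7-00:00:00      1    mix cacophony23
batch*       up 7-00:00:00     22   idle cacophony[00-13,15,17-20,22,24-25]
holm515      up    1:00:00      2   idle cacophony[26-27]
test         up    6:00:00      1  down* cacophony30
test         up    6:00:00      3   idle cacophony[28-29,31]'

# Sum the NODES column (field 4) over rows whose STATE (field 5) is "idle"
idle_summary=$(printf '%s\n' "$sinfo_output" |
    awk 'NR > 1 && $5 == "idle" {
             name = $1
             sub(/\*$/, "", name)      # drop the "*" that marks the default partition
             idle[name] += $4
         }
         END { for (name in idle) print name, idle[name] }' |
    sort)
echo "$idle_summary"
```

For the sample above this prints batch 22, holm515 2, and test 3, one partition per line.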

Use the squeue command to check on running and queued jobs:

brian@hippolyta > squeue
             JOBID PARTITION     NAME     USER ST       TIME  NODES NODELIST(REASON)
              1837     batch     iso1   philip PD       0:00      1 (PartitionTimeLimit)
2558_[17-1000%128]     batch AGGmodel    brian PD       0:00      4 (Resources)
 2575_[1-1000%128]     batch AGGmodel    brian PD       0:00      4 (Dependency)
 2576_[1-1000%128]     batch AGGmodel    brian PD       0:00      4 (Dependency)
 2577_[1-1000%128]     batch AGGmodel    brian PD       0:00      4 (Dependency)
 2578_[1-1000%128]     batch AGGmodel    brian PD       0:00      4 (Dependency)
              2579      test snote-br    brian PD       0:00      1 (Dependency)
              2580      test snote-AG    brian PD       0:00      1 (Dependency)
           2558_13     batch AGGmodel    brian  R       5:33      4 cacophony[00-03]
           2558_14     batch AGGmodel    brian  R       5:33      4 cacophony[00-03]
           2558_15     batch AGGmodel    brian  R       5:33      4 cacophony[00-03]
           2558_16     batch AGGmodel    brian  R       5:33      4 cacophony[00-03]
            2558_1     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_2     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_3     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_4     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_5     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_6     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_7     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_8     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
            2558_9     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
           2558_10     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
           2558_11     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
           2558_12     batch AGGmodel    brian  R       5:34      4 cacophony[00-03]
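On a busy cluster the full queue gets long, so it is often more useful to filter it. The helper below is a hypothetical convenience function, not something installed on Hippolyta; it relies only on the standard --user option of squeue and passes any further options through unchanged.

```shell
# Hypothetical helper: list only your own jobs, forwarding extra options to squeue
my_jobs() {
    squeue --user="$USER" "$@"
}
```

With this defined in your shell, my_jobs --states=PENDING lists only your queued jobs and my_jobs --partition=batch restricts the listing to the batch partition. A job that is no longer needed can be removed with scancel followed by its job ID.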


  1. FFTW is a fast Fourier transform library, and we need both version 2.1 and version 3.X on Hippolyta.