Running jobs on hippolyta
In scientific computing, we need tools to coordinate scientists (tens to hundreds of them) with cluster computing resources, which can range from tens to millions of processors. We also need tools to wrangle a menagerie of (often conflicting) software libraries that support computational science applications.
Hippolyta uses slurm to address the first challenge, and Environment Modules to address the second.
Environment Modules
Do you need a specific version of a library (FFTW-2.1.5, for example) that conflicts with the version another project requires (FFTW-3.3.4)? This is one of the problems that Environment Modules addresses.
Environment Modules works by manipulating the environment variables that tell your shell where to look for programs and libraries, like PATH and LD_LIBRARY_PATH.
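For example, loading a module typically prepends the package's install directory to your PATH. As a sketch (the gcc install path shown here is illustrative, not necessarily where Hippolyta actually keeps it):

mullins@hippolyta > echo $PATH
/usr/local/bin:/usr/bin:/bin
mullins@hippolyta > module load gcc
mullins@hippolyta > echo $PATH
/mnt/data/sw/gcc/4.9.3/bin:/usr/local/bin:/usr/bin:/bin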
Use the module command to list available packages, or to load them.
mullins@hippolyta > module avail
------------------------- /usr/share/Modules/modulefiles --------------------------
dot          module-git   module-info  modules      null         use.own
-------------------------- /mnt/data/sw/etc/modulefiles ---------------------------
gcc/4.9.3       lammps          openmpi         python/2.7      spparks
hoomd           mpich           paraview/4.4.0  python/3.4
When multiple versions of a software package are installed, module will load the most recent version if a version is not specified:
mullins@hippolyta > module load python
mullins@hippolyta > python
Python 3.4.3 (default, May 28 2015, 17:24:24)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>>
If you want python2 instead:
mullins@hippolyta > module unload python
mullins@hippolyta > module load python/2.7
mullins@hippolyta > python
Python 2.7.10 (default, May 29 2015, 10:08:46)
[GCC 4.4.7 20120313 (Red Hat 4.4.7-11)] on linux2
Type "help", "copyright", "credits" or "license" for more information.
>>>
Modules can depend on or conflict with other modules – the lammps module is a great example.
Trying to module load lammps without first loading its dependencies will result in an error.
To make this simpler, you can load the lammps-all module, which loads the lammps module along with its dependencies.
brian@hippolyta > module load lammps-all
gcc version 4.9.3 loaded
mpich2 version 3.1.4 loaded
lammps loaded
brian@hippolyta > which lammps
/mnt/data/sw/lammps/bin/lammps
For commonly used modules, it can be convenient to load them automatically at login time by placing the relevant module load command in your shell startup file (~/.bashrc for bash, the default shell on Hippolyta).
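A minimal sketch of such a startup file (the modules listed are just examples; substitute the ones you actually use every day):

# ~/.bashrc -- load frequently used modules at login
module load gcc
module load python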
slurm
Hippolyta uses slurm, a free and open source resource manager and scheduler developed at Lawrence Livermore National Laboratory. Slurm is increasingly used at U.S. national labs and supercomputing facilities, and on many of the TOP500 systems. If you’re used to a PBS system, you may find this comparison of slurm commands to PBS commands helpful, along with this more extensive reference sheet.
Hippolyta currently has three compute node partitions (or job queues):
test - a small partition for short debugging jobs (6 hour limit)
batch - a large partition for production calculations (7 day limit)
holm515 - a dedicated partition for Professor Holm’s 27-515 Intro to Computational MSE course
Slurm allows users to request resources from these processor partitions to run interactive commands with srun, or to run batch scripts with sbatch.
These commands must be run from a directory that the whole cluster has access to.
On Hippolyta, this means that jobs must be run from inside your ~/data directory, which is mounted on a shared Network File System (NFS) volume.
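For instance, you can request an interactive shell on the test partition with srun:

brian@hippolyta > cd ~/data
brian@hippolyta > srun --partition=test --ntasks=1 --pty bash

For production work, a minimal batch script might look like the sketch below; the job name, resource requests, and input file are placeholders rather than recommended values:

#!/bin/bash
#SBATCH --job-name=example    # placeholder job name
#SBATCH --partition=batch     # production partition (7 day limit)
#SBATCH --nodes=1             # illustrative resource request
#SBATCH --ntasks=16
#SBATCH --time=01:00:00       # wall time request, well under the partition limit

module load lammps-all        # lammps plus its dependencies, as above
srun lammps -in in.example    # in.example is a placeholder input file

Save this (as job.sh, say) inside ~/data and submit it with sbatch job.sh; by default, slurm writes the job’s output to slurm-<jobid>.out in the submission directory.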
Use the sinfo command to check the status of the compute partitions, including time limits and availability.
brian@hippolyta > sinfo
PARTITION AVAIL  TIMELIMIT  NODES  STATE NODELIST
batch*       up 7-00:00:00      3  down* cacophony[14,16,21]
batch*       up 7-00:00:00      1    mix cacophony23
batch*       up 7-00:00:00     22   idle cacophony[00-13,15,17-20,22,24-25]
holm515      up    1:00:00      2   idle cacophony[26-27]
test         up    6:00:00      1  down* cacophony30
test         up    6:00:00      3   idle cacophony[28-29,31]
Use the squeue command to check on running and queued jobs:
brian@hippolyta > squeue
             JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
              1837     batch     iso1 philip PD  0:00     1 (PartitionTimeLimit)
2558_[17-1000%128]     batch AGGmodel  brian PD  0:00     4 (Resources)
 2575_[1-1000%128]     batch AGGmodel  brian PD  0:00     4 (Dependency)
 2576_[1-1000%128]     batch AGGmodel  brian PD  0:00     4 (Dependency)
 2577_[1-1000%128]     batch AGGmodel  brian PD  0:00     4 (Dependency)
 2578_[1-1000%128]     batch AGGmodel  brian PD  0:00     4 (Dependency)
              2579      test snote-br  brian PD  0:00     1 (Dependency)
              2580      test snote-AG  brian PD  0:00     1 (Dependency)
           2558_13     batch AGGmodel  brian  R  5:33     4 cacophony[00-03]
           2558_14     batch AGGmodel  brian  R  5:33     4 cacophony[00-03]
           2558_15     batch AGGmodel  brian  R  5:33     4 cacophony[00-03]
           2558_16     batch AGGmodel  brian  R  5:33     4 cacophony[00-03]
            2558_1     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_2     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_3     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_4     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_5     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_6     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_7     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_8     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
            2558_9     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
           2558_10     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
           2558_11     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
           2558_12     batch AGGmodel  brian  R  5:34     4 cacophony[00-03]
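squeue also accepts filters; -u, for example, restricts the listing to a single user’s jobs. The output below is just the corresponding subset of the listing above:

brian@hippolyta > squeue -u philip
             JOBID PARTITION     NAME   USER ST  TIME NODES NODELIST(REASON)
              1837     batch     iso1 philip PD  0:00     1 (PartitionTimeLimit)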