This cluster is intended for small and medium jobs (32 x86_64 cores, 256 GB memory). It offers excellent memory bandwidth (8x DDR4 channels) but only 10GbE between the nodes. Since 2020 it is also used for the preparation of GPU jobs. Access is possible via ssh (secure shell) login to hpc18.urz.uni-magdeburg.de (reachable from the intra-university network only). The operating system is Linux CentOS. This cluster is not suited for personal data.
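A minimal login sketch (your-account is a hypothetical placeholder for your URZ account name):

  ssh your-account@hpc18.urz.uni-magdeburg.de                         # log in from inside the university network
  scp input.dat your-account@hpc18.urz.uni-magdeburg.de:~/project/    # copy input files before submitting jobs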
#!/bin/bash
# this is a first draft (bad CPU pinning is a known problem)
#
#SBATCH -J jobname1
#SBATCH -N 2 # number of nodes, 110 GB per node, node range: 1..12
# options cpus-per-task and cpu-bind added on 2024-11-11
#SBATCH --ntasks-per-node 1 # 1 for multi-threaded codes (using 32 cores)
#SBATCH --cpus-per-task 32 # 1 task/node * 32 threads/task = 32 threads/node
##SBATCH --ntasks-per-node 2 # 2 for hybrid codes, 2 tasks * 16 cores/task
##SBATCH --cpus-per-task 16 # 2 tasks/node * 16 threads/task = 32 threads/node
##...
##SBATCH --ntasks-per-node 32 # 32 for pure MPI code or 32 single-core apps
##SBATCH --cpus-per-task 1 # 32 tasks/node * 1 thread/task = 32 threads/node
##SBATCH --cpu-bind=threads # binding: 111... 222... 333... per node
#SBATCH --time 0:59:00 # set 59 min walltime
#SBATCH --mem 110000 # please use max. 110 GB for better prioritisation
#
exec 2>&1 # send errors into stdout stream
echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
#env | grep -e MPI -e SLURM
echo "DEBUG: host=$(hostname) pwd=$(pwd) ulimit=$(ulimit -v) \$1=$1 \$2=$2"
scontrol show Job $SLURM_JOBID # show the full job settings and more for debugging
/usr/local/bin/quota_all # show quotas (added 2022-02), but the node NFS mounts see no quota!
echo "ulimit -v = $(ulimit -v)" # may be relevant when debugging low-memory problems
module load mpi/openmpi
module list
HOSTFILE=slurm-$SLURM_JOBID.hosts
scontrol show hostnames $SLURM_JOB_NODELIST > $HOSTFILE # one entry per host
awk '{print $1,"slots=32"}' $HOSTFILE > $HOSTFILE.2
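# note: $HOSTFILE.2 uses the OpenMPI hostfile format ("<node> slots=32" per line);
# it is not referenced again below (presumably kept for manual mpirun --hostfile runs)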
echo "DEBUG: taskset= $(taskset -p $$)"
NPERNODE=$SLURM_NTASKS_PER_NODE
if [ -z "$NPERNODE" ];then NPERNODE=32; fi # default 32 mpi-tasks/node
echo "DEBUG: NPERNODE= $NPERNODE"
export OMP_NUM_THREADS=$((32/NPERNODE)) # threads per task = 32 cores/node divided by tasks/node
export OMP_WAIT_POLICY="PASSIVE" # reduces OMP energy consumption
export OMPI_MCA_mpi_yield_when_idle=1 # untested, may lower OpenMPI idle energy consumption (???)
export OMPI_MCA_hwloc_base_binding_policy=none # workaround for the pinning problem (???)
# the default core binding is bad: two tasks get bound to the 2 hyperthreads of the same core
# but this helps for srun only, not for direct mpirun (binding is not part of the MPI standard!?)
# obsolete if hyperthreading is disabled in the BIOS
TASKSET="taskset 0xffffffff" # 2019-01 ok for 1*32t, 32*1t
# try to fix 2 most simple cases here:
if [ $NPERNODE == 32 ];then
export SLURM_CPU_BIND=v,map_cpu:1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
fi
if [ $NPERNODE == 1 ];then
TASKSET="taskset 0xffffffff" # 32bit bask
fi
#mpirun -np 1 --report-bindings --oversubscribe -v --npernode $NPERNODE bash -c "taskset -p $$;ps aux"
# run the hybrid (MPI + multi-threaded) binary; replace mpi-binary and its -t option with your application
mpirun --report-bindings --oversubscribe -v --npernode $NPERNODE $TASKSET mpi-binary -t$OMP_NUM_THREADS
Run with: sbatch jobfile
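A minimal submission and monitoring sketch (the job ID 12345 is a placeholder printed by sbatch):

  sbatch jobfile           # submit; prints "Submitted batch job 12345"
  squeue -u $USER          # list your queued and running jobs
  scontrol show job 12345  # detailed job information (nodes, memory, state)
  less slurm-12345.out     # job output (Slurm's default output file name)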
2018-02 installation of 12 nodes
2018-04 bad CPU pinning by slurm(?)/openmpi(?), workaround script
2018-05 set MTU=9000 (default 1500) to improve 10GbE network speed (200%)
2019-02 +5 nodes
2019-05-14 set overcommit_memory=2 to avoid Linux crashes under memory pressure
2019-06-12 power loss due to a short circuit during work on the room's electrical installation
2020-03-03 upgrade from 1GbE to 10GbE for the ssh access
2020-10-14 system reconfiguration in progress due to unstable 10GbE
2021-06-15 slurm partition reconfiguration (about one week testing)
2021-07-07 network and config problems after system update
2021-08-09 remote shutdown tests for 2021-08-16
2021-08-16 planned maintenance, no air conditioning, 12 h downtime
2022-02-23 quota_all added, showing user quotas
2022-05-16 downtime due to slurm security update
2023-02-13 node[08,16] down (node16 CE-ECC CPU1-G2, node08 no SEL)
2023-06-21 fix default routing issues on node01 + node03
2023-06-21 outgoing worldwide traffic is now firewalled, ask the admin if needed (security)
2023-06-21 dbus.service and rpcbind deactivated on the nodes (security, stability)
2023-11-08 cluster down for maintenance, network upgrade, node01 memory upgrade to 1 TB;
           estimated downtime 24 to 48 hours
2023-11-09 further 24h downtime to fix performance and software issues
2023-11-20 investigating instability of node01;
           the instability of node02 and node08 was identified as a memory problem and fixed;
           node13-17 have corrupt SEL entries "08/31/2018 02:20:10 Unknown #0xff"
           interspersed with other entries dated after 2020-01
2023-11-22 instability of node08 partly reappeared, similar to node01
           (does not power on after an off/reset until the DIMM slots are changed, no error logs,
           sometimes hangs at DXE--CPU Initialization,
           also with the 10G+100Gb cards removed), happens with a single DIMM too,
           sometimes every second reset fails, not DIMM-dependent;
           btw. the BMC works without any DIMM installed, so it must have its own local memory
2023-12-12 node01 1TB memory available (sbatch -N1 --mem 960000 ...)
2024-04-23 SMT re-enabled (+30% speed, but do not use it for MPI)
2024-09-03 19:42 node01 Linux kernel panic/crash
2024-10-09 15:44 node01 Linux crashed, reboot + memory testing
2024-10-17 node01: found single-bit errors on a 64GB-DDR4-DIMM, DIMM removed,
           still crashes (more frequently, C-state related?),
           testing minfreq=1800MHz instead of the lowest 1200MHz to get a higher CPU core
           voltage (still stable after 13 days; aged CPU?)
2024-10-29 node01: replacement 64GB-DDR4-DIMM installed
2024-11-06 new Slurm SelectType=select/cons_res, allows shared node resources