This cluster is intended for small and medium jobs (32 x86_64 cores, 256 GB memory). It offers excellent memory bandwidth (8x DDR4 channels) but only 10GbE between the nodes. Since 2020 it is also used for the preparation of GPU jobs. Access is possible via ssh (secure shell) login to hpc18.urz.uni-magdeburg.de (reachable from the intra-university network only). The operating system is Linux CentOS. This cluster is not suited for personal data.
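A minimal login sketch (your-account is a hypothetical placeholder for your URZ account name):

  ssh your-account@hpc18.urz.uni-magdeburg.de                         # log in from inside the university network
  scp input.dat your-account@hpc18.urz.uni-magdeburg.de:~/project/    # copy input files before submitting jobs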
#!/bin/bash
# this is a first draft (bad CPU pinning is a known problem)
#
#SBATCH -J jobname1
#SBATCH -N 2 # number of nodes, 110 GB per node, node range: 1..12
# options cpus-per-task and cpu-bind added on 2024-11-11
#SBATCH --ntasks-per-node 1 # 1 for multi-threaded codes (using 32 cores)
#SBATCH --cpus-per-task 32 # 1 task/node * 32 threads/task = 32 threads/node
##SBATCH --ntasks-per-node 2 # 2 for hybrid codes, 2 tasks * 16 cores/task
##SBATCH --cpus-per-task 16 # 2 tasks/node * 16 threads/task = 32 threads/node
##...
##SBATCH --ntasks-per-node 32 # 32 for pure MPI code or 32 single-core apps
##SBATCH --cpus-per-task 1 # 32 tasks/node * 1 thread/task = 32 threads/node
##SBATCH --cpu-bind=threads # binding: 111... 222... 333... per node
#SBATCH --time 0:59:00 # set 59 min walltime
#SBATCH --mem 110000 # please use max. 110 GB for better prioritisation
#
exec 2>&1 # send errors into stdout stream
echo "DEBUG: SLURM_JOB_NODELIST=$SLURM_JOB_NODELIST"
echo "DEBUG: SLURM_NNODES=$SLURM_NNODES"
echo "DEBUG: SLURM_TASKS_PER_NODE=$SLURM_TASKS_PER_NODE"
#env | grep -e MPI -e SLURM
echo "DEBUG: host=$(hostname) pwd=$(pwd) ulimit=$(ulimit -v) \$1=$1 \$2=$2"
scontrol show Job $SLURM_JOBID # show the full job settings and more for debugging
/usr/local/bin/quota_all # show quotas (added 2022-02), but the node NFS mounts see no quota!
echo "ulimit -v = $(ulimit -v)" # may be relevant when debugging low-memory problems
module load mpi/openmpi
module list
HOSTFILE=slurm-$SLURM_JOBID.hosts
scontrol show hostnames $SLURM_JOB_NODELIST > $HOSTFILE # one entry per host
awk '{print $1,"slots=32"}' $HOSTFILE > $HOSTFILE.2
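# note: $HOSTFILE.2 uses the OpenMPI hostfile format ("<node> slots=32" per line);
# it is not referenced again below (presumably kept for manual mpirun --hostfile runs)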
echo "DEBUG: taskset= $(taskset -p $$)"
NPERNODE=$SLURM_NTASKS_PER_NODE
if [ -z "$NPERNODE" ];then NPERNODE=32; fi # default 32 mpi-tasks/node
echo "DEBUG: NPERNODE= $NPERNODE"
export OMP_NUM_THREADS=$((32/NPERNODE)) # threads per task = 32 cores/node divided by tasks/node
export OMP_WAIT_POLICY="PASSIVE" # reduces OMP energy consumption
export OMPI_MCA_mpi_yield_when_idle=1 # untested, may lower OpenMPI idle energy consumption (???)
export OMPI_MCA_hwloc_base_binding_policy=none # workaround for the pinning problem (???)
# the default core binding is bad: two tasks get bound to the 2 hyperthreads of the same core
# but this helps for srun only, not for direct mpirun (binding is not part of the MPI standard!?)
# obsolete if hyperthreading is disabled in the BIOS
TASKSET="taskset 0xffffffff" # 2019-01 ok for 1*32t, 32*1t
# try to fix 2 most simple cases here:
if [ $NPERNODE == 32 ];then
export SLURM_CPU_BIND=v,map_cpu:1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31
fi
if [ $NPERNODE == 1 ];then
TASKSET="taskset 0xffffffff" # 32bit bask
fi
#mpirun -np 1 --report-bindings --oversubscribe -v --npernode $NPERNODE bash -c "taskset -p $$;ps aux"
# run the hybrid (MPI + multi-threaded) binary; replace mpi-binary and its -t option with your application
mpirun --report-bindings --oversubscribe -v --npernode $NPERNODE $TASKSET mpi-binary -t$OMP_NUM_THREADS
Run with: sbatch jobfile
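A minimal submission and monitoring sketch (the job ID 12345 is a placeholder printed by sbatch):

  sbatch jobfile           # submit; prints "Submitted batch job 12345"
  squeue -u $USER          # list your queued and running jobs
  scontrol show job 12345  # detailed job information (nodes, memory, state)
  less slurm-12345.out     # job output (Slurm's default output file name)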
2018-02 installation of 12 nodes
2018-04 bad CPU pinning by slurm(?)/openmpi(?), workaround script
2018-05 set MTU=9000 (default 1500) to improve 10GbE network speed (200%)
2019-02 +5 nodes
2019-05-14 set overcommit_memory=2 to avoid Linux crashes under memory pressure
2019-06-12 power loss due to a short circuit during work on the room's electrical installation
2020-03-03 upgrade from 1GbE to 10GbE for the ssh access
2020-10-14 system reconfiguration in progress due to unstable 10GbE
2021-06-15 slurm partition reconfiguration (about one week testing)
2021-07-07 network and config problems after system update
2021-08-09 remote shutdown tests for 2021-08-16
2021-08-16 planned maintenance, no air conditioning, 12 h downtime
2022-02-23 quota_all added, showing user quotas
2022-05-16 downtime due to slurm security update
2023-02-13 node[08,16] down (node16 CE-ECC CPU1-G2, node08 no SEL)
2023-06-21 fix default routing issues on node01 + node03
2023-06-21 outgoing worldwide traffic is now firewalled, ask the admin if needed (security)
2023-06-21 dbus.service and rpcbind deactivated on the nodes (security, stability)
2023-11-08 cluster down for maintenance, network upgrade, node01 memory upgrade to 1 TB;
           estimated downtime 24 to 48 hours
2023-11-09 further 24h downtime to fix performance and software issues
2023-11-20 investigating instability of node01;
           the instability of node02 and node08 was identified as a memory problem and fixed;
           node13-17 have corrupt SEL entries "08/31/2018 02:20:10 Unknown #0xff"
           interspersed with other entries dated after 2020-01
2023-11-22 instability of node08 partly reappeared, similar to node01
           (does not power on after an off/reset until the DIMM slots are changed, no error logs,
           sometimes hangs at DXE--CPU Initialization,
           also with the 10G+100Gb cards removed), happens with a single DIMM too,
           sometimes every second reset fails, not DIMM-dependent;
           btw. the BMC works without any DIMM installed, so it must have its own local memory
2023-12-12 node01 1TB memory available (sbatch -N1 --mem 960000 ...)
2024-04-23 SMT re-enabled (+30% speed, but do not use it for MPI)
2024-09-03 19:42 node01 Linux kernel panic/crash
2024-10-09 15:44 node01 Linux crashed, reboot + memory testing
2024-10-17 node01: found single-bit errors on a 64GB-DDR4-DIMM, DIMM removed,
           still crashes (more frequently, C-state related?),
           testing minfreq=1800MHz instead of the lowest 1200MHz to get a higher CPU core
           voltage (still stable after 13 days; aged CPU?)
2024-10-29 node01: replacement 64GB-DDR4-DIMM installed
2024-11-06 new Slurm SelectType=select/cons_res, allows shared node resources