Usage of GPGPU cluster

Resources

Faculty Open HPC Laboratory

  • GPGPU HPC Cluster
  • Cloud computing
  • Data storage
  • Processing platform

GPGPU Cluster

  • 2 x GPGPU nodes
  • Model: Dell PowerEdge XE8545
  • CPUs: 2x AMD EPYC 7413 24-Core Processor
  • Memory: 1 TB
  • OS: Ubuntu
  • 4x NVidia HGX A100 80 GB GPGPU
  • NVLink Redstone
  • SLURM cluster management
  • Job submission

Cloud computing

Openstack cloud

  • 3 x OpenStack Cloud Controllers
    • Model: Dell PowerEdge R650xs
    • CPUs: 1x 20 core Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
    • Memory: 128 GB
    • OS: Ubuntu
  • 10 x OpenStack Compute node
    • Model: Dell PowerEdge R750
    • CPUs: 2x 32 core Intel(R) Xeon(R) Gold 6338 CPU @ 2.00GHz
    • Memory: 1.5 TB
    • OS: Ubuntu
  • Total
    • Computing power: 640 cores
    • Memory: 15 TB

Data storage

Data storage nodes

  • 4 x Ceph Storage nodes
    • Model: Dell PowerEdge R750xs
    • CPUs: 2x 20 core Intel(R) Xeon(R) Silver 4316 CPU @ 2.30GHz
    • Memory: 512 GB
    • OS: Ubuntu
    • Storage:
      • 2x 480 GB SSD (OS)
      • 2x 960 GB mixed-use SSD
      • 4x 400 GB write-intensive SSD
  • 4 x JBOD External Expander
    • SAS Disk Enclosures with a capacity of 84 disks
    • SAS, SATA and SSD intermix allowed

Interconnection

  • 2 x 48-port 10 Gbit switches interconnected with two 100 Gbit connections – Cloud/Storage network
  • 2 x 48-port 1 Gbit switches interconnected with two 40 Gbit connections – External network

Lab overview

Software

  • SLURM scheduler
  • Nextcloud
  • Datahub
  • Soon
    • Kubernetes cluster
    • JupyterHub
    • Spark cluster
    • Deployment of on demand tools

Access to resources

Application process

  • Describe your project and the resources it requires; the Lab council will review your application.

Usage of GPGPU resources

Process

Access and login

  1. Generate an SSH key or use your existing SSH key
  2. Log in to https://osc-lab.finki.ukim.mk/ and add your SSH public key in your profile
  3. You will receive information about the username created for you on the GPGPU cluster.
  4. Connect to the EDUVPN service for OpenLab
  5. Use an SSH client to log in to 10.10.65.6 (GPU cluster front node) using your SSH key
    • ssh -i <PATH_TO_PRIVATE_KEY> USERNAME@10.10.65.6
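
As a sketch of the whole flow (the key path is only an example; any supported key type works):

ssh-keygen -t ed25519 -f ~/.ssh/openlab_key        # generate a new key pair; the public key is written to ~/.ssh/openlab_key.pub
cat ~/.ssh/openlab_key.pub                         # copy this public key into your profile on https://osc-lab.finki.ukim.mk/
ssh -i ~/.ssh/openlab_key USERNAME@10.10.65.6      # log in to the front node once the EDUVPN connection is active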

Selection of docker image

The easiest way to run GPGPU code is to use a dockerized version of the software/libraries that your code is based on. We use the NVIDIA NGC catalog as a source of Docker images with preinstalled software/libraries/models that we use in our code.
For example, if our code trains a model using a PyTorch-based library, we can use the following Docker image (replace XX.XX with the needed version available on NGC):

docker image pull nvcr.io/nvidia/pytorch:XX.XX-py3

Ex.:
docker image pull nvcr.io/nvidia/pytorch:23.05-py3
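
After the pull, you can verify that the image is available locally (the exact output will differ):

docker image ls nvcr.io/nvidia/pytorch    # list the locally available PyTorch NGC images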

Preparation of code and datasets

In your home directory under /home/hpc/users/, create a directory for the job that will contain the code, data and job script.
The code, data and results directories will be bind-mounted into your Singularity container:

JOB_FOLDER_NAME
   Dockerfile (optional)
   code
   data
   results
   …
   job.sh

The following steps need to be executed from this directory.
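
For example (JOB_FOLDER_NAME is a placeholder; pick any name that describes your experiment):

mkdir -p JOB_FOLDER_NAME/code JOB_FOLDER_NAME/data JOB_FOLDER_NAME/results   # create the job directory and its subdirectories
cd JOB_FOLDER_NAME                                                            # all following steps are run from here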

Creation of code containers

Singularity is a containerization platform that enables running containers with different pieces of software in a portable and reproducible way. Singularity was created to enable the execution of complex applications on HPC systems by favouring integration over isolation, which makes it easier to use GPUs, cluster file systems, etc.
Containers created with the Singularity platform are files with a .sif extension.

Creation of code containers – step one

If you need to install additional libraries or packages (besides what is installed in the image you download from the NVIDIA NGC catalog), you need to create a Dockerfile in your experiment directory.
The contents of the Dockerfile should be as follows:

FROM {NVIDIA_NGC_IMAGE}
RUN {install commands here, e.g. pip install torch-geometric}

Once the Dockerfile is ready, we need to build a new image:
docker build -t DOCKER_IMAGE_TAG .
e.g.
docker build -t pyg .
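
A minimal sketch of such a Dockerfile, assuming the PyTorch NGC image pulled earlier and torch-geometric as the extra dependency:

# example Dockerfile; the base image tag and the extra package are taken from the examples above
FROM nvcr.io/nvidia/pytorch:23.05-py3
RUN pip install torch-geometric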

Creation of code containers – step two

If you do not need to install additional libraries or packages (other than what is installed in the image that you download from the NVIDIA NGC catalog), proceed directly to creating the .sif image with the following commands:

docker save DOCKER_IMAGE_NAME -o ARCHIVED_DOCKER_IMAGE_NAME.tar
singularity build SIF_IMAGE_NAME.sif docker-archive://ARCHIVED_DOCKER_IMAGE_NAME.tar

DOCKER_IMAGE_NAME can be the name of the image downloaded from NVIDIA NGC or the name of the image you created by building your Dockerfile.
ARCHIVED_DOCKER_IMAGE_NAME and SIF_IMAGE_NAME are chosen by you and may refer to the library/model you are using.
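
For example, assuming the PyTorch NGC image pulled earlier (the archive and .sif names are placeholders):

docker save nvcr.io/nvidia/pytorch:23.05-py3 -o pytorch_23.05.tar        # export the pulled (or locally built, e.g. pyg) image to a tar archive
singularity build pytorch_23.05.sif docker-archive://pytorch_23.05.tar   # convert the archive into a Singularity .sif image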

Creation of job script

In the directory created for your experiment/job, add a shell script named job.sh.
The content of the script:

#!/bin/bash
#SBATCH --job-name CudaJob
#SBATCH --output result.out   ## filename of the output; %j in the name expands to the job ID; default is slurm-[jobID].out
#SBATCH --partition=openlab-queue  ## the partitions to run in (comma separated)
#SBATCH --ntasks=1  ## number of tasks (analyses) to run
#SBATCH --gres=gpu:3  ## request 3 GPUs for the job
##SBATCH --mem-per-gpu=100M  ## memory allocated per GPU
##SBATCH --time=0-00:10:00  ## time limit for the job (day-hour:min:sec)

# parse out number of GPUs and CPU cores assigned to your job
env | grep -i slurm
N_GPUS=`echo $SLURM_JOB_GPUS | tr "," " " | wc -w`
N_CORES=${SLURM_NTASKS}

#export SINGULARITYENV_CUDA_VISIBLE_DEVICES=${CUDA_VISIBLE_DEVICES}
singularity exec \
  -B ./code:/code \
  -B ./data:/data \
  -B ./results:/results \
  --nv SIF_IMAGE_NAME.sif \
  python3 code/PYTHON_SCRIPT_NAME.py
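
Before submitting, you can optionally check that the container sees the GPUs with a short interactive run (a sketch, assuming interactive srun jobs are allowed on this partition; the image name follows the example above):

srun --partition=openlab-queue --gres=gpu:1 --pty \
  singularity exec --nv SIF_IMAGE_NAME.sif nvidia-smi   # should list one allocated A100 GPU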

Code remarks and guidelines

Job execution

Jobs are executed in exactly the same way as on the already existing SLURM cluster (instructions available here).

The sbatch command queues the job.
Ex. sbatch job.sh
(Note: you still need to be in the directory created for the experiment.)
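
After submission, you can follow the output file defined by --output in job.sh, for example:

squeue -u $USER      # check whether the job is still pending or already running
tail -f result.out   # follow the job output as it is written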

Job status

  • squeue (check the status of all queued jobs)
  • sinfo (check the status of nodes and partitions in the cluster)
  • scancel (cancel a job, or multiple jobs, by specifying the job ID)
  • sacct (information about completed and running jobs, as well as the users who started them)
  • sstat (resource-usage information for currently running jobs)
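
A few typical invocations (JOBID is a placeholder for the ID printed by sbatch):

squeue -u $USER    # show only your own jobs
sinfo              # overview of partitions and node states
scancel JOBID      # cancel the job with the given ID
sacct -j JOBID     # accounting information for a completed or running job
sstat -j JOBID     # live resource usage of a running job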