Convergence is composed of one frontend and ten compute nodes:
Computer | Model | Memory | Processor | Cores | GPUs |
---|---|---|---|---|---|
front | DELL PowerEdge R650xs | 125 GB | 2 x Intel Xeon Silver 4310 | 24 cores / 48 threads @ 2.10 GHz | |
node01 | DELL PowerEdge XE8545 | 2 TB | 2 x AMD EPYC 7543 | 64 cores / 128 threads @ 2.80 GHz | 4 x NVIDIA A100 80 GB SXM |
node[02-06] | DELL PowerEdge R750xa | 2 TB | 2 x Intel Xeon Gold 6330 | 56 cores / 112 threads @ 2.00 GHz | 4 x NVIDIA A100 80 GB PCIe |
node[07-10] | DELL PowerEdge R750xa | 1 TB | 2 x Intel Xeon Gold 6330 | 56 cores / 112 threads @ 2.00 GHz | 4 x NVIDIA A100 80 GB PCIe |
On each node, 4 cores (8 threads) and 4 GB of RAM are reserved for the system and slurm.
By default, when you reserve a GPU, slurm allocates you 4 cores (8 threads) and 64 GB of RAM.
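For example (a sketch; the option values are illustrative), you can accept or override these defaults with standard slurm options:

```
# Default allocation: one MIG GPU plus 4 cores (8 threads) and 64 GB of RAM
salloc --gpus=a100_3g.40gb:1 --time=60

# Explicitly request more CPUs and memory alongside the same GPU
salloc --gpus=a100_3g.40gb:1 --cpus-per-task=16 --mem=128G --time=60
```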
MIG is used to partition the A100 80 GB GPUs into smaller GPUs. Each compute node presents:
- 2 GPUs kept whole (a100_7g.80gb),
- 2 GPUs each split into two a100_3g.40gb instances (4 instances in total).
/home (300 TB) is hosted by front (DELL ME5084 disk array - SAS 12 Gb - 28 x HDD 16 TB) and exported to the compute nodes through NFS.
Each compute node has a local storage space mounted at /scratch (1.6 TB on NVMe).
Access to front is done through a 10 Gb/s Ethernet link.
Compute nodes and front are interconnected by a 200 Gb/s InfiniBand network (Mellanox QM8700).
To access Convergence, you need to establish an ssh connection to the cluster's frontend (front.convergence.lip6.fr).
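For example (mylogin is a placeholder for your actual login):

```
ssh mylogin@front.convergence.lip6.fr
```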
LIP6 members automatically get access to Convergence.
Users who do not belong to LIP6 can request an account at convergence@lip6.fr.
You can access compute resources through the slurm resource manager (see https://slurm.schedmd.com/).
Using the sinfo command, you can list the cluster's partitions and the state of their nodes:
```
root@front:~# sinfo -O "partition:13,available:8,nodelist:18,defaulttime:13,time:13,nodeai:10"
PARTITION    AVAIL   NODELIST          DEFAULTTIME  TIMELIMIT    NODES(A/I)
convergence* up      node[01-10]       1:00:00      30-00:00:00  1/9
```

Explanation for the above output: there is a single partition named convergence (the * marks it as the default); it is up and contains node[01-10]; the default job duration is 1 hour, the maximum job duration is 30 days, and 1 node is allocated (A) while 9 are idle (I).
```
root@front:~# sinfo -p convergence --Node -O "nodelist:13,features:8,socketcorethread:8,cpusstate:15,memory:8,allocmem:10,gres:60,gresused:60,statelong:20,reason:20"
NODELIST  AVAIL_FEATURES  S:C:T   CPUS(A/I/O/T)  MEMORY   ALLOCMEM  GRES                                             GRES_USED                                              STATE  REASON
node01    intel           2:28:2  0/112/0/112    2048000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node02    intel           2:28:2  0/112/0/112    2048000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node03    intel           2:28:2  0/112/0/112    2048000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node04    intel           2:28:2  0/112/0/112    2048000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node05    intel           2:28:2  0/112/0/112    2048000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node06    intel           2:28:2  0/112/0/112    1024000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node07    intel           2:28:2  0/112/0/112    1024000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node08    intel           2:28:2  0/112/0/112    1024000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node09    intel           2:28:2  0/112/0/112    1024000  0         gpu:a100_7g.80gb:2,gpu:a100_3g.40gb:4            gpu:a100_7g.80gb:0,gpu:a100_3g.40gb:0                  idle~  none
node10    amd             2:32:2  16/112/0/128   2048000  524288    gpu:a100_7g.80gb:2(S:0),gpu:a100_3g.40gb:4(S:1)  gpu:a100_7g.80gb:0(IDX:N/A),gpu:a100_3g.40gb:1(IDX:2)  mixed  none
```

Explanation for the above output: for each compute node, this command displays its features (intel or amd processors), its socket:core:thread layout (S:C:T), the state of its CPUs (allocated/idle/other/total), its total and allocated memory in MB, its generic resources (GRES, here the MIG GPU instances), the GRES currently in use, its state (idle~ means idle and powered down, mixed means partially allocated) and the reason for that state. Here a job is running on node10: it uses 16 CPUs, 524288 MB of RAM and one a100_3g.40gb GPU.
Using the squeue command, you can list running jobs and get their identifiers.
```
root@front:~# squeue
JOBID  PARTITION  NAME  USER    ST  TIME   NODES  NODELIST(REASON)
   68  convergen  test  leroux  R   28:06  1      node10
```

Explanation for the above output: there is one job; its identifier is 68, its name is test, it was started by user leroux, it has been running (R) for 28 minutes and 6 seconds, and it is using resources on node10. The different states of a job are described in the man page of squeue.
Using the sacct command, you can get more details about a job.
```
root@front:~# sacct -j 68 --format="JobID,JobName,User,Account,NodeList,AllocTres%80,Start,End,State,Reason" -X
JobID  JobName  User    Account  NodeList  AllocTRES                                                  Start                End      State    Reason
-----  -------  ------  -------  --------  ---------------------------------------------------------  -------------------  -------  -------  ------
68     test     leroux  lip6     node10    billing=16,cpu=16,gres/gpu:a100_3g.40gb=1,mem=512G,node=1  2023-04-19T15:31:12  Unknown  RUNNING  None
```

Explanation for the above output: job 68, named test, was started by user leroux under the lip6 account; it runs on node10 with 16 CPUs, one a100_3g.40gb GPU, 512 GB of RAM and 1 node allocated; it started at 2023-04-19T15:31:12 and is still RUNNING, so its end time is not yet known.
You can use the salloc command to get an interactive session. You will get a shell on the frontend from which you will be able to run commands on reserved resources with srun. If you close the shell, the job is terminated.
```
leroux@front:~$ salloc --nodes=2 --gpus-per-node=a100_3g.40gb:1 --time=60
salloc: Granted job allocation 60
salloc: Waiting for resource configuration
salloc: Nodes node[01,10] are ready for job
leroux@front:~$ srun nvidia-smi -L
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-6bfd077d-a528-62e5-5ffd-f5ccf9e5a557)
  MIG 3g.40gb     Device  0: (UUID: MIG-a5a1e127-c156-5892-ae71-8518fcd84332)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b90dfade-11bc-8d5b-321b-9f6f6284b497)
  MIG 3g.40gb     Device  0: (UUID: MIG-81ca0f5d-30f9-5a88-9f4a-1cec8fd84f6c)
leroux@front:~$ srun hostname
node10
node01
leroux@front:~$ exit
salloc: Relinquishing job allocation 60
salloc: Job allocation 60 has been revoked.
leroux@front:~$
```
You can use salloc's --x11 option to enable graphical display forwarding through slurm (you also need ssh's -X option when connecting to the frontend).
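For example (a minimal sketch; xclock stands in for any graphical application):

```
# From your workstation: connect to the frontend with X forwarding
ssh -X front.convergence.lip6.fr

# On the frontend: allocate resources with X11 forwarding, then run a graphical program
salloc --x11 --gpus=a100_3g.40gb:1 --time=60
srun xclock
```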
You can use salloc's --no-shell option to allocate resources without having to keep a shell open on the frontend for the duration of your job. You can then access the allocated resources with srun's --jobid option, or directly by ssh.
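For example (a sketch; replace <JOBID> with the job identifier printed by salloc):

```
# Allocate resources without keeping a shell open on the frontend
salloc --no-shell --gpus=a100_3g.40gb:1 --time=60

# Later, run a command on the allocated resources
srun --jobid=<JOBID> nvidia-smi -L
```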
You can connect by ssh to compute nodes on which salloc allocated resources for your job.
The sbatch command allows you to submit a script which will be executed non-interactively. You can configure the reservation by adding #SBATCH directives at the beginning of the script.
Example of sbatch script:
```
leroux@front:~$ cat batch1.sh
#!/bin/bash
#SBATCH --job-name=exemple
#SBATCH --nodes=1
#SBATCH --constraint=amd
#SBATCH --cpus-per-task=16
#SBATCH --mem=512G
#SBATCH --gpus=a100_3g.40gb:1
#SBATCH --time=5
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

nvidia-smi -L
sleep 300
leroux@front:~$ sbatch batch1.sh
Submitted batch job 20
leroux@front:~$ cat exemple-20.out
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b90dfade-11bc-8d5b-321b-9f6f6284b497)
  MIG 3g.40gb     Device  0: (UUID: MIG-81ca0f5d-30f9-5a88-9f4a-1cec8fd84f6c)
```
If your script reserves resources on several compute nodes, the script itself runs on the first allocated node.
slurm defines many 'SLURM_*' environment variables that you can use in your scripts, as in the sketch below.
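A minimal sketch (SLURM_JOB_ID, SLURM_JOB_NAME, SLURM_JOB_NODELIST and SLURM_JOB_NUM_NODES are standard slurm variables; the full list is in the sbatch man page):

```
#!/bin/bash
#SBATCH --job-name=env_demo
#SBATCH --nodes=2
#SBATCH --time=5

# These variables are set by slurm for every job
echo "Job id:          ${SLURM_JOB_ID}"
echo "Job name:        ${SLURM_JOB_NAME}"
echo "Node list:       ${SLURM_JOB_NODELIST}"
echo "Number of nodes: ${SLURM_JOB_NUM_NODES}"
```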
You can connect by ssh to compute nodes on which sbatch allocated resources for your job.
You can use the --constraint= option of salloc or sbatch to specify additional characteristics of the compute nodes you want.
This option lets you select compute nodes by their features; see the output of sinfo to list these features.
For example, you can choose a compute node with Intel processors (--constraint=intel) or AMD processors (--constraint=amd), as in the sketch below.
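A sketch combining a constraint with a GPU reservation:

```
# Reserve one MIG GPU on a node with Intel processors
salloc --constraint=intel --gpus=a100_3g.40gb:1 --time=60
```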
You can use the srun command to run commands simultaneously on compute nodes on which salloc or sbatch allocated resources for your job.
In a job, each call to srun is a step.
srun inherits the reservation directives given to salloc or sbatch.
```
leroux@front:~$ cat batch3.sh
#!/bin/bash
#SBATCH --job-name=exemple
#SBATCH --nodes=2
#SBATCH --gpus-per-node=a100_3g.40gb:1
#SBATCH --time=1
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

srun hostname
srun nvidia-smi -L
leroux@front:~$ sbatch batch3.sh
Submitted batch job 24
leroux@front:~$ cat exemple-24.out
node01
node10
GPU 0: NVIDIA A100 80GB PCIe (UUID: GPU-6bfd077d-a528-62e5-5ffd-f5ccf9e5a557)
  MIG 3g.40gb     Device  0: (UUID: MIG-a5a1e127-c156-5892-ae71-8518fcd84332)
GPU 0: NVIDIA A100-SXM4-80GB (UUID: GPU-b90dfade-11bc-8d5b-321b-9f6f6284b497)
  MIG 3g.40gb     Device  0: (UUID: MIG-81ca0f5d-30f9-5a88-9f4a-1cec8fd84f6c)
```
A call to srun is blocking. You have to wait for the command to finish on every node before executing the next command.
You can use the shell's & operator to execute several srun commands in parallel. In that case, you have to tell srun which resources each step will consume, so that slurm can run the steps in parallel.
```
#!/bin/bash
#SBATCH --nodes=2
#SBATCH --gpus-per-node=a100_3g.40gb:3
#SBATCH --time=5

srun -n2 -c8 --gpus-per-node=a100_3g.40gb:1 bash tache1.sh &
srun -n2 -c8 --gpus-per-node=a100_3g.40gb:1 bash tache2.sh &
srun -n2 -c8 --gpus-per-node=a100_3g.40gb:1 bash tache3.sh
wait
```
If a job you started on a node (via salloc or sbatch) has resources currently allocated, you can connect to that node directly via ssh. Your session will be restricted to these resources.
Compute nodes are not reachable from the internet. To access them, you must first pass through the frontend front:

```
ssh -J front.convergence.lip6.fr node01.convergence.lip6.fr
```
Using the scancel command, you can cancel a job:
```
leroux@front:~$ scancel 177
```
The module command can be used to configure your environment to use specific versions of software:
```
leroux@front:~$ module avail
----------------------------------- /etc/environment-modules/modules -----------------------------------
cuda/11.0  cuda/11.1  maple/2019.0  maple/2020.0  mathematica/12.1  matlab/R2019b  matlab/R2020a  python/anaconda3
leroux@front:~$ module load cuda/11.1 python/anaconda3
leroux@front:~$ module list
Currently Loaded Modulefiles:
 1) cuda/11.1   2) python/anaconda3
leroux@front:~$ module unload cuda
leroux@front:~$ module purge
```
Thanks to Sorbonne Universités' site licences, maple, mathematica and matlab are available on the cluster via the module command.
You can use the conda command, available by loading module python/anaconda3, to manage your own python environments.
Your shell needs to be initialized before you can use conda. You can use conda init to permanently modify your .bashrc so that your shell is automatically initialized for conda in interactive sessions. Scripts executed by slurm do not run in an interactive session, so there you need to initialize your shell with eval "$(conda shell.bash hook)" (see the jupyter example below).
conda's documentation is available on its official website.
To use jupyter, first create a conda environment and install the notebook package in it:
```
leroux@front:~$ conda create -n myenv
leroux@front:~$ conda install -n myenv notebook
```
Then submit an sbatch script that activates the environment and starts jupyter:

```
#!/bin/bash
#SBATCH --job-name=test_jupyter
#SBATCH --nodes=1
#SBATCH --gpus-per-node=a100_3g.40gb:1
#SBATCH --time=60
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

module purge                     # Clean up the environment
module load python/anaconda3     # Load the anaconda3 module
eval "$(conda shell.bash hook)"  # Initialize the shell for conda
conda activate myenv             # Activate your python environment
jupyter notebook                 # Start jupyter
```
Once the job is running, retrieve the notebook's URL and token from the job's error file:

```
cat test_jupyter-226.err
...
    or http://127.0.0.1:8888/?token=97c4066cee8dcc55cb40b7311bcf1240cb503a6872c88038
...
```
Then, from your workstation, create an ssh tunnel to the compute node running jupyter and open the URL in your browser:

```
ssh -J front.convergence.lip6.fr -L 8888:localhost:8888 node01.convergence.lip6.fr
```
On Convergence, you can run docker-like containers thanks to the pyxis slurm plugin. This plugin uses enroot to run the containers.
Example of sbatch script:
```
#!/bin/bash -x
#SBATCH --job-name=test
#SBATCH --nodes=1
#SBATCH --gpus=a100_3g.40gb:1
#SBATCH --container-image nvcr.io\#nvidia/pytorch:23.04-py3
#SBATCH --container-mount-home
#SBATCH --time=60
#SBATCH --mail-type=ALL
#SBATCH --output=%x-%j.out
#SBATCH --error=%x-%j.err

hostname
python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count()); print(torch.cuda.get_device_name())"

# The default jupyter configuration file creates wrong URLs
echo "" > /tmp/jupyter_notebook_config.py
jupyter notebook --config=/tmp/jupyter_notebook_config.py
```
Script commands are executed inside the container defined by the --container-image option.
The user's home directory can be mounted inside the container with the --container-mount-home option.
The reserved GPU can be used in the container.
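For example (a sketch; pyxis also adds the --container-* options to srun), you can check GPU visibility from inside a container interactively:

```
srun --gpus=a100_3g.40gb:1 \
     --container-image=nvcr.io#nvidia/pytorch:23.04-py3 \
     --container-mount-home \
     nvidia-smi -L
```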
You can get a shell inside of the container:
```
# Get the PID of your process
leroux@node10:~$ ps aux|grep leroux
leroux  3435  0.2  0.0   5784   3268 ?      S   14:25  0:00 /bin/bash -x /var/spool/slurmd/job00128/slurm_script
leroux  5838  2.8  0.0 808204 105396 ?      Sl  14:30  0:01 /usr/bin/python /usr/local/bin/jupyter-notebook
root    5961  0.0  0.0  46596  12436 ?      Ss  14:30  0:00 sshd: leroux [priv]
leroux  6001  0.1  0.0  46596   8960 ?      S   14:30  0:00 sshd: leroux@pts/0
leroux  6002  0.0  0.0  18004   5844 pts/0  Ss  14:30  0:00 -bash
leroux  6065  0.0  0.0  19160   3612 pts/0  R+  14:31  0:00 ps aux
leroux  6066  0.0  0.0   6608   2260 pts/0  S+  14:31  0:00 grep --color=auto leroux

# Start a shell inside of the container
leroux@node10:~$ enroot exec 5838 bash

# A simple test using pytorch in the container
leroux@node10:/workspace$ python -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.device_count()); print(torch.cuda.get_device_name())"
True
1
NVIDIA A100-SXM4-80GB MIG 3g.40gb

# Exit the container
leroux@node10:/workspace$ exit
```
Send any requests about Convergence to convergence@lip6.fr.
To get news about Convergence, you should subscribe to the convergence-news@listes.lip6.fr mailing list. Non-LIP6 users are automatically added to this list when they get an account.