infra:howto:runjob, last modified 2026/03/11 10:25 by marc (previous revision 2025/02/27 12:49 by remi)
====== How to run a job on our computational resources ======
  
  
  
===== Connection =====

Currently, you have to SSH directly into one of the compute nodes, either server [[infra:chacha|Chacha]] or [[infra:disco|Disco]]:

  $ ssh firstname.lastname@disco.hevs.ch
  $ ssh firstname.lastname@chacha.hevs.ch

**NOTE: You have to connect from the school network or through the HEVS VPN to be able to reach those servers.**

TODO: Change this documentation when the jump host is ready.
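To avoid retyping the full login each time, you can declare the hosts in your SSH client configuration. A minimal ~/.ssh/config sketch (hostnames taken from above; the User value is a placeholder for your own firstname.lastname):

```
Host disco
    HostName disco.hevs.ch
    User firstname.lastname

Host chacha
    HostName chacha.hevs.ch
    User firstname.lastname
```

With this in place, ssh disco is enough to log in.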
  
===== Environment =====
  
  - On either server [[infra:chacha|Chacha]] or [[infra:disco|Disco]], you have a symlink **datasets** in your home directory that points to the local storage of the server: its purpose is to give you a proper space for all the data you will be working on.
  - You also have another symlink **shared_dataset** for jobs that need to run on several nodes: this filesystem is shared between the nodes.
  - Your .bashrc / .zshrc contains by default the variable **APPTAINER_TMPDIR** set to **/home/user.name/.apptainer/**: this allows you to build containers without using the system /tmp, which is restricted by a low quota, and to use your larger dataset quota instead.
  - By default, you are the only one who can see your data. If you are working as a team on these data, **please ask for a group creation** so we can add members to it and apply suitable permissions.
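As a quick sanity check once logged in (paths taken from the defaults above; the exact symlink targets depend on the server):

```shell
# Resolve where the two dataset symlinks point (names as described above)
readlink -f ~/datasets ~/shared_dataset

# Confirm the Apptainer build temp directory is set (prints "unset" if not)
echo "${APPTAINER_TMPDIR:-unset}"
```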
  
  
Line 28: Line 37:
To avoid having everyone install their libraries on the system, or under their user directly on the physical servers, we need you to keep them cleanly packed in a container: that way you can install whatever you want inside the container, and you can do it without needing any root privilege on a server you share with other researchers.
  
**For examples, see: [[infra:howto:apptainer_sample|How to create a simple apptainer container]]**
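For orientation, an Apptainer container is described by a definition file along these lines (a minimal sketch; the base image and packages are illustrative, not site requirements — see the linked page for the real walkthrough):

```
Bootstrap: docker
From: ubuntu:22.04

%post
    # Install your libraries inside the container; no root needed on the host
    apt-get update && apt-get install -y python3 python3-pip
    pip3 install numpy

%runscript
    exec python3 "$@"
```

You then build it on the server with apptainer build application.sif application.def; the **APPTAINER_TMPDIR** variable mentioned above keeps the build out of the restricted system /tmp.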
  
  
===== Run your application via SLURM =====
  
To be able to run a job on the ISC Compute Center, you **MUST** run it under [[https://slurm.schedmd.com/overview.html|SLURM]]. Resource usage on this cluster is managed by Slurm.

**For examples, see: [[infra:howto:slurm_sample|How to create a simple SLURM job]]**

Now that you have your container ready in the Apptainer file format (for example application.sif), you need to create an sbatch script to run your job via SLURM.

(Note: for now we don't have Modules or LMod installed, we might add them later, hence the commented-out module commands below.)
==== Example for a serial job ====

<code>
#!/bin/bash
#SBATCH --job-name=application   # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem=4G                 # total memory per node (4 GB per cpu-core is the default)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=email@domain.org
#module purge
apptainer run application.sif <arg-1> <arg-2> ... <arg-N>
</code>

==== Example for a parallel MPI code ====

<code>
#!/bin/bash
#SBATCH --job-name=solar         # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=4               # total number of tasks across all nodes
#SBATCH --cpus-per-task=1        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is the default)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=email@domain.org
#module purge
#module load openmpi/gcc/4.1.2
srun apptainer exec solar.sif /opt/ray-kit/bin/solar inputs.dat
</code>

==== Example using GPUs ====

<code>
#!/bin/bash
#SBATCH --job-name=tensorflow    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is the default)
#SBATCH --time=00:05:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=gpu:1             # number of gpus per node
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=email@domain.org
#module purge
apptainer exec --nv ./tensorflow.sif python3 tensor.py
</code>

Note the **--nv** flag that allows you to use the GPU from your container without being root.

==== Example using GPU sharding ====

Sharding is a generic SLURM mechanism for using fragments of a GPU for a job, leaving room for other researchers, or for running several jobs that each need only a fragment of a GPU.

<code>
#!/bin/bash
#SBATCH --partition=Dance        # partition to run the job on
#SBATCH --job-name=tensorflow    # create a short name for your job
#SBATCH --nodes=1                # node count
#SBATCH --ntasks=1               # total number of tasks across all nodes
#SBATCH --cpus-per-task=4        # cpu-cores per task (>1 if multi-threaded tasks)
#SBATCH --mem-per-cpu=4G         # memory per cpu-core (4G per cpu-core is the default)
#SBATCH --time=02:00:00          # total run time limit (HH:MM:SS)
#SBATCH --gres=shard:24          # number of gpu shards to use
#SBATCH --mail-type=begin        # send email when job begins
#SBATCH --mail-type=end          # send email when job ends
#SBATCH --mail-user=email@domain.org
#module purge
apptainer exec --nv ./tensorflow.sif python3 tensor.py
</code>

Note the **--nv** flag that allows you to use the GPU from your container without being root. Here, the **Dance** partition is requested so the job runs specifically on the Disco node, which has at most 80 shards; on Chacha the limit is 96 shards.

==== Example using an interactive shell ====

**NOTE: Debugging your application directly on the ISC compute servers is to be avoided.**

To debug your application on a test Slurm + Apptainer infrastructure, you can use srun with the --pty argument to run your container:

<code>
# For a simple compute container:
srun --cpus-per-task=12 --time=3:00:00 --mem=24G --pty apptainer shell /home/user.name/example_apptainer.sif

# Or a container using GPUs:
srun -G 1 --cpus-per-task=12 --time=3:00:00 --mem=24G --pty apptainer shell --nv /home/user.name/example_apptainer.sif
</code>
==== Execute your batch file ====

Then you can run your sbatch script:

  sbatch ./application_sbatch.sh
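Once submitted, you can follow the job with the standard SLURM client commands (a sketch; 12345 is a placeholder job ID, and the guard simply stops the script cleanly on machines without SLURM):

```shell
# Stop with a note if the SLURM client tools are not on this machine
command -v squeue >/dev/null 2>&1 || { echo "SLURM tools not found"; exit 0; }

squeue -u "$USER"        # list your pending and running jobs
sacct -j 12345           # accounting info for a job (placeholder ID)
# scancel 12345          # cancel a job you no longer need (placeholder ID)
```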
  
  
More information is available on our **[[infra:compute:slurmconfig|SLURM cluster]]** page.

===== Storage considerations =====
  
**NOTE:** The filer storage is slow: it is not advised to run a job directly against data stored there.
  
**TODO:** set up auto-cleaning for old data on each filesystem


==== Quotas ====

  - The root filesystem quota (/, including /home) is 20GB for every researcher, for convenience. For students, this quota is lowered to 10GB to encourage proper infrastructure usage and coding within a lower threshold.
  - On the local and shared datasets, quotas will be set on a case-by-case basis.
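To see how close you are to these limits, generic disk-usage commands are enough (a sketch; the datasets path follows the symlink described in the Environment section):

```shell
# Free space and usage of the filesystem holding your home directory
df -h ~

# Size of your dataset directory, if present
du -sh ~/datasets 2>/dev/null || true
```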
  
==== Cleaning ====
  * [[https://github.com/microsoft/vscode-cpptools/issues/5362]]
  * [[https://learn.microsoft.com/en-us/answers/questions/1221136/visual-studio-2022-clear-local-caches]]
  - Currently your vscode caches are automatically removed from our servers.

==== Automation ====

If you need to use Cron to schedule something, ask for your user to be added to the /etc/cron.allow whitelist.
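Once whitelisted, crontab -e lets you add entries. A sketch of a crontab line (the schedule and the rsync command are placeholders, not site defaults):

```
# m  h  dom mon dow   command
30   2  *   *   *     rsync -a /home/user.name/datasets/results/ /home/user.name/shared_dataset/backup/
```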