====== SLURM configuration ======

This page describes the current SLURM configuration of the ISC Computational Center.

===== Installation =====

SLURM has been installed from the official tarball, version 24.11.0.

All official plugins are installed:
  * libnvidia-ml
  * TODO

Install / upgrade process: TODO

===== Architecture =====

==== Chacha ====
  * Client (slurm-smd-client)
  * Worker (slurm-smd, slurm-smd-slurmd)
  * Controller (slurm-smd-slurmctld)
  * Accounting DB (slurm-smd-slurmdbd)

==== Disco ====
  * Client (slurm-smd-client)
  * Worker (slurm-smd, slurm-smd-slurmd)

==== Schema ====
TODO

===== Partitions =====

==== Dance ====
This is the default partition: it is currently composed of Chacha and Disco.

==== Chacha and Disco ====
These partitions can be used to restrict an account to a single server.

More details on partition usage: TODO (an example job submission is sketched below, after the Scheduling section).

===== Accounts =====

Accounts have been created in two groups:
  * Premium Researchers (premium_rs)
  * Standard Researchers (standard_rs): all users who cannot contribute financially to the project. Students are also part of this group.

There are two other groups, Test and temp: Test is only for administration purposes, and temp is a locked group used either to migrate someone from another account (an account cannot be deleted while a user still has it as their default account) or to prevent someone from submitting jobs (MaxSubmitJobs=0).

TODO

===== QOS and Limits =====

Current limits on the QOS and accounts, applied to each project account (see the sacctmgr sketch after the Scheduling section):

  * **premium_rs:**
    * MaxCPUs=44
    * MaxNodes=2
    * MaxTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G
    * GrpTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G
    * GrpWall=3-00:00:00
    * MaxWall=3-00:00:00
  * **standard_rs:**
    * MaxCPUs=24
    * MaxNodes=1
    * MaxTRES=gres/gpu=1,gres/shard=96,cpu=24,mem=256G
    * GrpTRES=gres/gpu=1,gres/shard=96,cpu=24,mem=256G
    * GrpWall=1-00:00:00
    * MaxWall=1-00:00:00

**TODO:** complete / adjust until the limits are finalised.

===== Scheduling =====

Fairshare is one of the mechanisms used to prioritize jobs in the queue:

  * premium_rs:
    * Fairshare: 750
  * standard_rs:
    * Fairshare: 250

Within each account, every user has a fairshare of 100: SLURM then computes priorities from the resulting percentages, combining the user's share inside their project account with the share of the parent account (premium or standard).
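As referenced in the Partitions section above, here is a minimal job submission sketch. The job name, resource requests and workload are placeholders, not recommended values; jobs land on the default dance partition unless another partition is requested explicitly.

<code bash>
#!/bin/bash
#SBATCH --job-name=example        # placeholder name
#SBATCH --partition=dance         # default partition; use chacha or disco to pin the job to one server
#SBATCH --cpus-per-task=4         # placeholder resource request
#SBATCH --mem=8G                  # placeholder memory request
#SBATCH --time=01:00:00           # must stay within the account's MaxWall limit

srun hostname                     # replace with the real workload
</code>

Submit the script with sbatch; the partition can also be overridden on the command line with --partition=chacha or --partition=disco.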
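How the accounts and limits were actually created is not detailed on this page. As a rough sketch of how the limits above could be applied with sacctmgr, assuming they live on the project account's association (they may equally live on a QOS); the project account proj_example and the user jdoe are hypothetical names:

<code bash>
# Create a hypothetical project account under the premium researchers group
sacctmgr add account name=proj_example parent=premium_rs description="Example project"

# Apply the premium limits listed above to the project account
sacctmgr modify account name=proj_example set \
    GrpTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G \
    MaxTRESPerJob=gres/gpu=1,gres/shard=96,cpu=44,mem=500G \
    GrpWall=3-00:00:00 MaxWall=3-00:00:00

# Add a user to the project account (several users may share one project account)
sacctmgr add user name=jdoe account=proj_example defaultaccount=proj_example

# The temp group blocks all job submission
sacctmgr modify account name=temp set MaxSubmitJobs=0
</code>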
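Similarly, the fairshare weights from the Scheduling section translate roughly into the following sacctmgr calls (again a sketch, with jdoe as a hypothetical user); sshare shows the resulting share tree and how recorded usage affects priorities.

<code bash>
# Group-level shares, as listed in the Scheduling section
sacctmgr modify account name=premium_rs set fairshare=750
sacctmgr modify account name=standard_rs set fairshare=250

# Every user gets a share of 100 inside their project account (hypothetical user)
sacctmgr modify user name=jdoe set fairshare=100

# Inspect the share tree, raw shares, normalised shares and effective usage
sshare -a
</code>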
===== User creation =====

Users fill in the form on the [[infra:compute|ISC Computational Center]] page; their information is then used to create their SSH access and their SLURM user in its project account, according to the SLA / QOS we can provide: Premium or Standard. Several users can share the same project account to work as a team (the maximum limits apply to each user individually, while the group limits apply to the account as a whole).

The [[infra:staff|staff]] creates and configures user accounts: [[infra:compute:tooling:adduser|User creation process]]

===== Backups =====

==== Cluster configuration ====
Currently, all of the SLURM configuration is backed up manually (configuration files and the accounting database).

TODO: automate this and redirect it to the backup server when it is ready (a possible starting point is sketched at the end of this page).

==== User data ====
For user data, compute users currently have to request space on the filer01.hevs.ch server from the Sinf: there is a [[https://hessoit.sharepoint.com/sites/VS-Intranet-SInf|request form in "Demande de service"]] under **"Comptes et accès > Obtention d'accès réseau"**. More documentation is available from the [[https://servicedesk.hevs.ch/hesso_portal?sys_kb_id=4811f7f18759da1077bbc9140cbb35a7&id=kb_article_view&sysparm_rank=1&sysparm_tsqueryId=b40006138751ae10b87e43740cbb355e|Sinf here]]; for researchers, the filesystem to ask for is generally **fs_projets**.

Consider all space on the compute center to be short-lived: it is not meant for storing data or backups, only as temporary space to run computations and retrieve results.
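For example, results can be copied off the cluster as soon as a job finishes; this is only an illustration, and the host name, user and paths below are placeholders:

<code bash>
# Pull results from the cluster to a workstation or a network share (placeholder host and paths)
rsync -avz jdoe@chacha:my_project/results/ /path/to/local/or/fs_projets/my_project/
</code>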
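As a possible starting point for automating the cluster-configuration backups mentioned in the Backups section above, here is a hedged sketch; the MariaDB/MySQL backend, the slurm_acct_db database name, the /etc/slurm path, the credentials and the destination (which should eventually point at the backup server) are all assumptions.

<code bash>
#!/bin/bash
# Nightly SLURM configuration backup sketch -- all paths and names are assumptions
set -euo pipefail

DEST=/srv/backups/slurm/$(date +%F)      # placeholder until the backup server is ready
mkdir -p "$DEST"

# Configuration files (slurm.conf, gres.conf, cgroup.conf, ...)
tar czf "$DEST/slurm-etc.tar.gz" /etc/slurm

# Accounting database used by slurmdbd (assumes a MariaDB/MySQL backend
# and credentials available in ~/.my.cnf)
mysqldump --single-transaction slurm_acct_db | gzip > "$DEST/slurm_acct_db.sql.gz"
</code>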