====== SLURM configuration ======

This page describes the current SLURM configuration of the ISC Computational Center.

===== Installation =====

SLURM has been installed from the official tarball, version 24.11.0.

All official plugins are installed:
  * libnvidia-ml
  * TODO

Install / upgrade process: TODO

===== Architecture =====

==== Chacha ====
  * Client (slurm-smd-client)
  * Worker (slurm-smd, slurm-smd-slurmd)
  * Controller (slurm-smd-slurmctld)
  * Accounting DB (slurm-smd-slurmdbd)

==== Disco ====
  * Client (slurm-smd-client)
  * Worker (slurm-smd, slurm-smd-slurmd)

==== Schema ====
TODO

===== Partitions =====

==== Dance ====
This is the default partition: it is currently composed of Chacha and Disco.

==== Chacha and Disco ====
These partitions can be used to restrict an account to a single server.

More details on partition usage: TODO (an example job submission is sketched below, after the Scheduling section).

===== Accounts =====

Accounts have been created in two groups:
  * Premium Researchers (premium_rs)
  * Standard Researchers (standard_rs): all users who cannot contribute financially to the project. Students are also part of this group.

There are two other groups, Test and temp: Test is only for administration purposes, and temp is a locked group used either to migrate someone from another account (an account cannot be deleted while a user still has it as their default account) or to prevent someone from submitting jobs (MaxSubmitJobs=0).

TODO

===== QOS and Limits =====

Current limits on the QOS and accounts, applied to each project account (see the sacctmgr sketch after the Scheduling section):

  * **premium_rs:**
    * MaxCPUs=44
    * MaxNodes=2
    * MaxTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G
    * GrpTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G
    * GrpWall=3-00:00:00
    * MaxWall=3-00:00:00
  * **standard_rs:**
    * MaxCPUs=24
    * MaxNodes=1
    * MaxTRES=gres/gpu=1,gres/shard=96,cpu=24,mem=256G
    * GrpTRES=gres/gpu=1,gres/shard=96,cpu=24,mem=256G
    * GrpWall=1-00:00:00
    * MaxWall=1-00:00:00

**TODO:** complete / adjust until the limits are finalised.

===== Scheduling =====

Fairshare is one of the mechanisms used to prioritize jobs in the queue:

  * premium_rs:
    * Fairshare: 750
  * standard_rs:
    * Fairshare: 250

Within each account, every user has a fairshare of 100: SLURM then computes priorities from the resulting percentages, combining the user's share inside their project account with the share of the parent account (premium or standard).
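As referenced in the Partitions section above, here is a minimal job submission sketch. The job name, resource requests and workload are placeholders, not recommended values; jobs land on the default dance partition unless another partition is requested explicitly.

<code bash>
#!/bin/bash
#SBATCH --job-name=example        # placeholder name
#SBATCH --partition=dance         # default partition; use chacha or disco to pin the job to one server
#SBATCH --cpus-per-task=4         # placeholder resource request
#SBATCH --mem=8G                  # placeholder memory request
#SBATCH --time=01:00:00           # must stay within the account's MaxWall limit

srun hostname                     # replace with the real workload
</code>

Submit the script with sbatch; the partition can also be overridden on the command line with --partition=chacha or --partition=disco.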
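How the accounts and limits were actually created is not detailed on this page. As a rough sketch of how the limits above could be applied with sacctmgr, assuming they live on the project account's association (they may equally live on a QOS); the project account proj_example and the user jdoe are hypothetical names:

<code bash>
# Create a hypothetical project account under the premium researchers group
sacctmgr add account name=proj_example parent=premium_rs description="Example project"

# Apply the premium limits listed above to the project account
sacctmgr modify account name=proj_example set \
    GrpTRES=gres/gpu=1,gres/shard=96,cpu=44,mem=500G \
    MaxTRESPerJob=gres/gpu=1,gres/shard=96,cpu=44,mem=500G \
    GrpWall=3-00:00:00 MaxWall=3-00:00:00

# Add a user to the project account (several users may share one project account)
sacctmgr add user name=jdoe account=proj_example defaultaccount=proj_example

# The temp group blocks all job submission
sacctmgr modify account name=temp set MaxSubmitJobs=0
</code>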
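Similarly, the fairshare weights from the Scheduling section translate roughly into the following sacctmgr calls (again a sketch, with jdoe as a hypothetical user); sshare shows the resulting share tree and how recorded usage affects priorities.

<code bash>
# Group-level shares, as listed in the Scheduling section
sacctmgr modify account name=premium_rs set fairshare=750
sacctmgr modify account name=standard_rs set fairshare=250

# Every user gets a share of 100 inside their project account (hypothetical user)
sacctmgr modify user name=jdoe set fairshare=100

# Inspect the share tree, raw shares, normalised shares and effective usage
sshare -a
</code>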
===== User creation =====

Users fill in the form on the [[infra:compute|ISC Computational Center]] page; their information is then used to create their SSH access and their SLURM user in its project account, according to the SLA / QOS we can provide: Premium or Standard. Several users can share the same project account to work as a team (the maximum limits apply to each user individually, while the group limits apply to the account as a whole).

The [[infra:staff|staff]] creates and configures user accounts: [[infra:compute:tooling:adduser|User creation process]]

===== Backups =====

==== Cluster configuration ====
Currently, all of the SLURM configuration is backed up manually (configuration files and the accounting database).

TODO: automate this and redirect it to the backup server when it is ready (a possible starting point is sketched at the end of this page).

==== User data ====
For user data, compute users currently have to request space on the filer01.hevs.ch server from the Sinf: there is a [[https://hessoit.sharepoint.com/sites/VS-Intranet-SInf|request form in "Demande de service"]] under **"Comptes et accès > Obtention d'accès réseau"**. More documentation is available from the [[https://servicedesk.hevs.ch/hesso_portal?sys_kb_id=4811f7f18759da1077bbc9140cbb35a7&id=kb_article_view&sysparm_rank=1&sysparm_tsqueryId=b40006138751ae10b87e43740cbb355e|Sinf here]]; for researchers, the filesystem to ask for is generally **fs_projets**.

Consider all space on the compute center to be short-lived: it is not meant for storing data or backups, only as temporary space to run computations and retrieve results.
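For example, results can be copied off the cluster as soon as a job finishes; this is only an illustration, and the host name, user and paths below are placeholders:

<code bash>
# Pull results from the cluster to a workstation or a network share (placeholder host and paths)
rsync -avz jdoe@chacha:my_project/results/ /path/to/local/or/fs_projets/my_project/
</code>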
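As a possible starting point for automating the cluster-configuration backups mentioned in the Backups section above, here is a hedged sketch; the MariaDB/MySQL backend, the slurm_acct_db database name, the /etc/slurm path, the credentials and the destination (which should eventually point at the backup server) are all assumptions.

<code bash>
#!/bin/bash
# Nightly SLURM configuration backup sketch -- all paths and names are assumptions
set -euo pipefail

DEST=/srv/backups/slurm/$(date +%F)      # placeholder until the backup server is ready
mkdir -p "$DEST"

# Configuration files (slurm.conf, gres.conf, cgroup.conf, ...)
tar czf "$DEST/slurm-etc.tar.gz" /etc/slurm

# Accounting database used by slurmdbd (assumes a MariaDB/MySQL backend
# and credentials available in ~/.my.cnf)
mysqldump --single-transaction slurm_acct_db | gzip > "$DEST/slurm_acct_db.sql.gz"
</code>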