administratif:todo — last revised 2025/09/26 13:59 by remi (previous revision 2025/01/20 11:46)
===== What still needs to be fixed =====
  
FIXME
  
==== Apptainer ====
  
<del>Make that everyone exports (after adaptation of course) : ''export APPTAINER_CACHEDIR=/scratch/gpfs/$USER/APPTAINER_CACHE'' ''export APPTAINER_TMPDIR=/tmp''</del> ✔ **done** : created /data/apptainer/user.name/.apptainer, migrated the existing .apptainer user dirs, and added a symlink in their homes.

  - To prevent quota explosion : installed the quota package on disco and chacha. Should we enforce quota at the filesystem level ? Everyone who said they would use less than 100 GB is using more than they said (e.g. martin.barry at 350 GB). Applied quota on the filesystem, not yet on datasets.
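Enforcing the filesystem-level quota mentioned above could look like the sketch below. The username, mount point, and limits are examples only, and ''setquota'' requires quotas to be enabled on the mount first.

```shell
# Convert a GB limit into the 1 KiB blocks that setquota expects.
gb_to_blocks() {
  echo $(( $1 * 1024 * 1024 ))
}

SOFT=$(gb_to_blocks 100)   # 100 GB soft limit
HARD=$(gb_to_blocks 110)   # 110 GB hard limit

# Print the command to run as root once quotas are enabled on the mount
# (user and mount point are placeholders):
echo "setquota -u martin.barry $SOFT $HARD 0 0 /data"
```

The inode limits are left at 0 (unlimited) here; only block usage is capped.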
==== Content ====
  - Explain to PA how to create a proper structure.
  - Where do we put the content of this file, as some information is not intended for the general public ?
  - Construct the **tools for teachers** sections with the existing information
  
==== Monitoring ====
  - Alerting and messaging on various metrics (disk space, CPU usage, ...) for the various computational resources : <del>chacha, disco</del> **done** ; calypso & others
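A minimal sketch of the kind of disk-usage check such alerting would run; the threshold and the watched filesystem are example values, not an existing script.

```shell
# Warn when a filesystem crosses a usage threshold; df -P gives a
# stable POSIX column layout that awk can parse reliably.
FS=/            # filesystem to watch (e.g. /data on disco)
THRESH=90       # alert above 90% usage

usage=$(df -P "$FS" | awk 'NR==2 { gsub("%", "", $5); print $5 }')

if [ "$usage" -ge "$THRESH" ]; then
  echo "ALERT: $FS is at ${usage}% (threshold ${THRESH}%)"
else
  echo "OK: $FS is at ${usage}%"
fi
```

A check like this would typically be wrapped as a node-exporter textfile script or a cron job that mails on the ALERT branch.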
  
==== Server room ====
  - Make something nice there. Posters on the walls, screens, stuff.
  - Why is there still a box for a server in the networking lab room ? : **Because we need at least one to send back in case of support / we needed one for the network labs to hide the CTF network setup**
  - Rename the networking lab room and change the replacement for Darko as well
  - Why only 10 Gb/s for the fiber ?
  - I don't want a patch panel inside the server rack but outside of it. Space will be at a premium there soon and we don't know where to put the server rack : <del>Search for and buy a patch panel to put in the room</del> Done.
  - Find a proper layout for the server room to accommodate a water-cooled rack, and maybe another one in a couple of months
  - If we have Rumba running there, we need some UPS solution.
  - Choose new R630 and R730 / R740 for RUMBA main. Budget 3 kFr
  - Do we need a file server from the guys downstairs (baignoire) ?
-  - Remove again the big oven : check with Hervé Girard to store it in 23N322 : this is where a student used it last time (but nicely returned it to N307), why not keep it there ? EDIT: the RoL of N322 is Thomas Sterren, I sent him a message for the oven. (Rémi) / EDIT2 : answer is "we share the room so it will stay in 307. period." : need to find a place to store it ourselves.+  - <del>Remove again the big oven : check with Hervé Girard to store it in 23N322 : this is where a student used it last time (but nicely returned it to N307), why not keep it there ? EDIT: the RoL of N322 is Thomas Sterren, I sent him a message for the oven. (Rémi) / EDIT2 : answer is "we share the room so it will stay in 307. period." : need to find a place to store it ourselves.</del> Moved to 23N321 : They want it back to N307, but for no valid reason : so it will stay in N321 where it is less annoying or if it leaves, it just leaves our rooms for good. EDIT 2025-04-03 : The oven will go in 23N111, the Rol is Cedric (Clivaz?), he should contact us to install the oven in its lab.
  
==== Slurm on chacha or disco ====
  - <del>Make both GPUs available in gres/slurmd confs</del> ✔ **done**
  - <del>Make emails work for start/end of jobs, using an emailer</del> ✔ **done**
  - <del>Find how to do the resource partitioning with billing credits by user / account</del> ✔ **done** (but still needs tests and real jobs to see how to tweak)
  - Discuss how to allocate credits for users : what about students ?
  - <del>Note everywhere to either remove sshfs for VSCode, and give links to configure it properly, or no VSCode at all :</del> Noted on [[infra:howto:runjob|runjob]] <del>and started a script to check for .vscode in homedirs : auto-rm in crontab directly ?</del> **Done**
  - For the future jump server, we need to test how to restrict ssh access to the other servers : via Slurm, users might recreate their authorized_keys by running a job that writes .ssh/authorized_keys on the server where the job runs (change the .ssh/ permissions to prevent them from chmod'ing this dir ?)
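One way to defeat the authorized_keys rewrite is to move the keys out of user-writable homes entirely : sshd's ''AuthorizedKeysFile'' option supports a root-owned path such as ''/etc/ssh/authorized_keys/%u''. The helper below is a hypothetical sketch (not an existing script); paths are taken as arguments so it can be tested without root.

```shell
# Copy one user's key file into a root-owned directory that sshd reads
# instead of ~/.ssh/authorized_keys, so a Slurm job rewriting the latter
# has no effect on ssh access.
migrate_keys() {
  user=$1; home=$2; keydir=$3
  mkdir -p "$keydir"
  if [ -f "$home/.ssh/authorized_keys" ]; then
    cp "$home/.ssh/authorized_keys" "$keydir/$user"
    chmod 644 "$keydir/$user"
  fi
}
```

On the jump server this would run as root with ''keydir=/etc/ssh/authorized_keys'', followed by setting ''AuthorizedKeysFile /etc/ssh/authorized_keys/%u'' in sshd_config, an ''sshd -t'' syntax check, and a reload.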
  
  
==== Calypso ====
  - Reinstall slurm by compiling with all the necessary plugins, then package using debuild : https://slurm.schedmd.com/quickstart_admin.html#debuild , then deploy the .deb with Ansible
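The debuild route from the linked quickstart_admin page boils down to roughly the following; the version number is an example, and the exact steps are in the SchedMD docs.

```shell
# Fetch a Slurm release, then build Debian packages with debuild
# (requires build-essential, devscripts, and Slurm's build deps).
SLURM_VERSION=24.05.4            # example version, check schedmd.com
TARBALL="slurm-${SLURM_VERSION}.tar.bz2"
URL="https://download.schedmd.com/slurm/${TARBALL}"

echo "$URL"
# wget "$URL"
# tar -xaf "$TARBALL" && cd "slurm-${SLURM_VERSION}"
# debuild -b -uc -us            # unsigned binary packages (.deb)
# ls ../slurm-smd*.deb          # artifacts for the Ansible deploy
```

The resulting .deb files land one directory up from the source tree, which is what the Ansible deploy step would pick up.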
  
  
==== Rumba ====
  - Turn on Rumba and install a proper env for us, mainly based on Docker, as a limited number of members will use it
  - Test backup and replicate ISC / Learn on Rumba
  - Have VPS and cloud coder there, please.
  
==== Hannibal ====
  - Backup DokuWiki : **✔ done already, Hannibal has /srv/www completely backed up on the Synology NAS DS923**
  - Add the Ingegamez website on WordPress
  
==== Site ====
  - Proper CSS for titles, also for the alignment, which is ugly (look at this page !)
  - Editor with no tabs
  - Why is there a search box with the same text ?
  - Rights done properly for every ISC member


==== TODOs from Remi's papers ====

Rumba :
  - Create laptop backup space on the Rumba NAS, make the share visible from the WireGuard subnet (Volume2)
  - Re-create the rsync script for Hannibal on the Rumba Synology NAS : test it
  - Proxmox to install on the EPYC / check Windows VM install / licensing for VDIs
  - Install and test OpenProject
  - Migrate Marks-Crawler (Streamlit) from Disco to Rumba

Dance New :
  - Check spares for the EPYC servers / order some disks, fans, power supplies
  - Finish the Mellanox switch IP configuration, to put in the new Sinf subnet 10.5.1.148/24 / GW 10.5.1.1 / DNS 10.130.0.11,10.130.1.11
  - Test and configure BeeGFS on the EPYC 48 TB storage
  - Configure the storage InfiniBand network
  - Create a SLURM test partition : using shard on Disco, or put the current Rumba Dell 7920 with all 3 Nvidia RTX GPUs and put the test partition on it
  - Change the creation script to make the "Test" partition the default when a researcher arrives on the ISC Compute, then assign the "Dance" partition when they are ready to run
  - Apply data quota on all ISC compute users, not yet the case for everyone
  - Check the BeeGFS quota mechanism to migrate the EXT4 quota from the current Disco/Chacha to the new EPYC storage
  - Script a wrapper around Apptainer to check the execution context and refuse to run directly bare-bone : same for python or other executables, to avoid runs outside of SLURM
  - Migrate the current NVMe data disks from Disco/Chacha to BeeGFS once it is tested and ready on the EPYC as well
  - Automate file deletion for Standard (Premium too ?) researchers to avoid scratch partitions full of old test files / decide in a meeting what TTL we want : 2 weeks standard TTL ? More for Premium ?
  - Rename the filesystems : datasets -> workspace ? local_workspace ? chacha_workspace ? local_scratch ? / shared -> network_workspace ? remote_workspace ? remote_scratch ?
  - Migrate Prometheus from Chacha to the new EPYC server / add some alerting on common checks, disks, jobs outside of slurm, etc.
  - Move NVMe disks ? Disco 7 TB to Chacha ? / NVMe 3 TB from Calypso storage to Disco ? Format as BeeGFS
  - Check for a Modules installation ? Or is Apptainer already fine ? : install Lmod to allow dynamic lib loading : where to put the terabytes of libraries for Dance ?
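The Apptainer-wrapper item above could start from a guard like this. It is only a sketch : the assumption is that ''SLURM_JOB_ID'' (which Slurm exports in every job environment) reliably marks "inside a job", and that the real binary gets moved aside, e.g. to ''apptainer.real''.

```shell
# Refuse to continue unless we are inside a Slurm job: Slurm sets
# SLURM_JOB_ID in every job environment, so its absence means a
# bare-metal invocation on the login or compute node.
require_slurm() {
  if [ -z "${SLURM_JOB_ID:-}" ]; then
    echo "error: run this through Slurm (srun/sbatch), not directly" >&2
    return 1
  fi
}

# In the real wrapper installed as /usr/bin/apptainer, this would be
# followed by:
#   require_slurm || exit 1
#   exec /usr/bin/apptainer.real "$@"
```

The same guard would work for the python case mentioned above, wrapped around whichever executables should only run under SLURM.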

Calypso :
  - Finish the local DNS as a forwarder, either to 172.30.7.1 or another external DNS (port 53 was already opened by YB) : test that when it is given as the default DNS in the WireGuard config it doesn't defeat the split-tunneling purpose by sending all DNS requests into the tunnel ? Or if only DNS is sent through the tunnel, maybe live with that
  - Test MAAS to auto-provision the Calypso[0-14] slave servers; calypso12-14 are not yet installed, we can test on them
  - Change CUDA 12.2 to 12.6 or 13.0 ?
  - Add the FS switch to the stack, to allow wiring all iDRAC and host cables to all Calypso servers
  - Migrate the current 10 Gb router to a 25 Gb router ? The NAS can only use 10 Gb ethernet x2
  - Set the UID/GID plan somewhere suitable (currently only in users.yml in the Ansible playbooks) :
    - user UIDs : 1000 : ubuntu / 1001 pmudry / 1002 remi / 1003-1100 teachers / 10000-10500 ISC students / 10500-11000 other students
    - services : 7000-8000 services like prometheus / 8000-9000 custom groups for students / 9000-10000 researcher groups
    - Maybe either integrate one day into the HES Active Directory (no direct control of group and user management) or recreate an LDAP server on Rumba ?

Backups :
  - Make a coherent backup architecture : which data has to be kept, and for how long
  - Prod backups from Hannibal to put also on the Synology SSD NAS (Rumba)
  - Teacher laptop backup space to create
  - Rumba services backups
  - Clean Moodle courses to make backups smaller : did the auto-backup reduction option work ? Did the maximum number of kept versions (5 ?) work ?
  - Finish testing the rsync script with ACL permissions / extended attributes, then replace the hannibal.sh / marcellus.sh scripts on the desktop NAS / set it up on the SSD Rumba NAS, either directly from the source at another moment in the night, or as a copy of the first backup to avoid more I/O on the prod ?
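For the rsync-with-ACLs item, the flag set under test presumably looks like the sketch below; the paths and target host are placeholders, not the real hannibal.sh contents.

```shell
# -a archive, -A ACLs, -X extended attributes, -H hardlinks;
# --numeric-ids avoids UID/GID remapping between source and NAS.
SRC="/srv/www/"
DST="backup@rumba-nas:/volume2/hannibal/"   # placeholder target

CMD="rsync -aAXH --delete --numeric-ids $SRC $DST"
echo "$CMD"
# Dry-run first to verify ACL/xattr handling:
#   $CMD --dry-run --itemize-changes
```

Note that ''-A'' and ''-X'' require rsync on both ends to be built with ACL/xattr support, which is worth verifying on the Synology side before replacing the existing scripts.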

19.SS09 :
  - Remove the blades from the Submer pod ? Isn't it worse to put this expensive opened hardware on cloths ? When does Deepsquare pick up their servers ?
  - Find something to do with the pod when everything else is out

Other :
  - 23N218 : poster to hide the transparent door from people in the hall / corridor
  - 23N218 : add a fridge for tupperwares / a coffee machine

Post Mortem Followup :
  - Monitoring ISC/Learn : where to install it ? Rumba ? / dashboard to create / email alerting to set up
  - Finish the DRP wiki part : https://wiki.isc-vs.ch/doku.php?id=administratif:processes:drp
  - Create a PDF + paper DRP to make sure instructions are available even when nothing works : hannibal, wiki, and HES network down
  - Finish the learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of need

Playbooks :
  - Finish the isc_compute system playbook, especially the Tango login node part, and the new configs to avoid direct connection to the compute nodes
  - Finish the slurm_calypso and slurm_research_TODO playbooks
  - Finish the prometheus playbook
  - Finish the learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of need
  - Finish the k8s playbook, from the currently "manual script" playbook_a_faire_k8s_calypso.txt