What still needs to be fixed
Apptainer
- Make sure that everyone exports the following (after adapting it, of course): ✔ done: created /data/apptainer/user.name/.apptainer, migrated the current users' .apptainer dirs and added a symlink in their home.
  export APPTAINER_CACHEDIR=/scratch/gpfs/$USER/APPTAINER_CACHE
  export APPTAINER_TMPDIR=/tmp
- To prevent quota explosion: installed the quota package on disco and chacha. Should we enforce quotas at the filesystem level? Everyone who said they would use less than 100 GB is using more than they claimed (e.g. martin.barry at 350 GB). Applied quota on the / filesystem, not yet on datasets.
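If we do decide to enforce quotas at the filesystem level, a minimal sketch with the standard Linux quota tools (assuming / is ext4 and mounted with usrquota; the 100/110 GB limits reuse the numbers above and martin.barry is just the example user):

  # one-time setup: build the quota index on / and turn quotas on
  sudo quotacheck -cum /
  sudo quotaon -v /
  # 100 GB soft / 110 GB hard block limit for one user (values are in 1 KiB blocks)
  sudo setquota -u martin.barry 100000000 110000000 0 0 /
  # summary report to spot who is over
  sudo repquota -s /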
Content
- Explain to PA how to create a proper structure.
- Where do we put the content of this file, as some of the information is not intended for the general public?
- How can we make animations on the Wiki using JS + SVG and stuff ? Snow ?
- Create a proper structure, starting with the infrastructures ✔ done (after discussion: merge into docs, for both groups)
- With a section for students
- With a section for teachers
- Make a limited-access location with the critical information ✔ done
- Construct the tools for the teacher sections with the existing information
Monitoring
- Alerting and messaging on various metrics (disk space, CPU usage, …) for the various computational resources (chacha, disco ✔ done; calypso & others)
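Prometheus alerting (mentioned further down for the EPYC migration) is the proper long-term answer; in the meantime, a minimal cron-able sketch of a disk-space check, assuming mail(1) is available and treating the 90 % threshold and the recipient address as placeholders:

  #!/usr/bin/env bash
  # disk_alert.sh - email a warning when any mounted filesystem exceeds the threshold
  THRESHOLD=90                    # percent used, placeholder
  RECIPIENT="admins@example.org"  # placeholder address
  df -P -x tmpfs -x devtmpfs | awk -v t="$THRESHOLD" 'NR>1 && $5+0 > t {print $6" is at "$5}' |
    while read -r line; do
      echo "$line on $(hostname)" | mail -s "Disk space alert: $(hostname)" "$RECIPIENT"
    done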
Server room
- Make something nice there. Posters on the walls, screens, stuff.
- Why is there still a server box in the networking lab room? Because we need at least one for sending hardware back in case of a support case, and we needed one in the network labs to hide the CTF network setup.
- Rename the networking lab room and change the designated substitute for Darko as well
- Why only 10 Gb/s on the fiber?
- I don't want a patch panel inside the server rack but outside of it. Space will soon be at a premium there and we don't know where to put the server rack.
- Search for and buy a patch panel to put in the room ✔ done
- Find a proper layout for the server room to accommodate a water-cooled rack, and maybe another one in a couple of months
- If we have Rumba running there, we need some UPS solution.
- Draw a schematic of the future rack, notably to define a proper Rumba failover policy
- Choose a new R630 and R730 / R740 for RUMBA main. Budget: 3 kFr
- Do we need a file server from the guys downstairs (baignoire)?
- Remove the big oven again: check with Hervé Girard about storing it in 23N322, which is where a student last used it (but nicely returned it to N307), so why not keep it there? EDIT: the RoL of N322 is Thomas Sterren; I sent him a message about the oven. (Rémi) EDIT 2: the answer is “we share the room so it will stay in 307. period.”, so we need to find a place to store it ourselves. Moved to 23N321: they want it back in N307, but for no valid reason, so it will stay in N321 where it is less annoying, or if it leaves, it just leaves our rooms for good. EDIT 2025-04-03: the oven will go to 23N111; the RoL is Cedric (Clivaz?), he should contact us to install the oven in his lab.
Slurm on chacha or disco
- Make both GPUs available in the gres/slurmd confs ✔ done
- Make emails work for job start/end, using an emailer ✔ done
- Find out how to do resource partitioning with billing credits by user / account ✔ done (but it still needs tests and real jobs to see how to tweak it)
- Discuss how to allocate credits for users: what about students? (see the billing sketch after this list)
- Note everywhere to either remove sshfs for VS Code and give links to configure it properly, or no VS Code at all: noted on runjob, and started a script to check for .vscode in home dirs: auto-rm directly in crontab? ✔ done
- For the future jump server, we need to test how to restrict SSH access to the other servers: via SLURM, users could recreate their authorized_keys by running a job that writes .ssh/authorized_keys on the server where the job runs (change the .ssh/ permissions so they cannot chmod this dir?); see the sshd sketch below
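For the billing-credits item above, a sketch of one way to wire it up in SLURM, assuming accounting limit enforcement is enabled; the partition line, weights, account names and limits are placeholders to be discussed:

  # slurm.conf (excerpt): charge a weighted "billing" TRES per job
  #   PartitionName=main Nodes=chacha,disco Default=YES TRESBillingWeights="CPU=1.0,Mem=0.25G,GRES/gpu=4.0"
  # cap accounts on accumulated billing minutes (needs AccountingStorageEnforce=limits)
  sacctmgr -i add account students Description="student jobs"
  sacctmgr -i modify account students set GrpTRESMins=billing=50000
  sacctmgr -i modify account researchers set GrpTRESMins=billing=500000
  # see what each account / user has consumed
  sreport -t Hours cluster AccountUtilizationByUser start=2025-01-01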
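For the jump-server / authorized_keys concern above, one direction worth testing, sketched under the assumption that the compute nodes run OpenSSH: point sshd at a root-owned key directory so that a job rewriting ~/.ssh/authorized_keys has no effect (the key file name and user name below are placeholders):

  # /etc/ssh/sshd_config on the compute nodes (excerpt):
  #   AuthorizedKeysFile /etc/ssh/authorized_keys/%u
  # keys live in a root-owned directory, one file per user, managed by us only
  sudo install -d -m 0755 -o root -g root /etc/ssh/authorized_keys
  sudo install -m 0644 -o root -g root jump_host_key.pub /etc/ssh/authorized_keys/some.user
  sudo systemctl reload ssh   # the service is named sshd on some distros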
Calypso
- Reinstall Slurm by compiling it with all the necessary plugins, then package it using debuild: https://slurm.schedmd.com/quickstart_admin.html#debuild , then deploy the .deb with Ansible
Rumba
- Turn on Rumba and install a proper env for us, mainly based on Docker, as only a limited number of members will use it
- Test backup and replicate ISC / Learn on Rumba
- Migrate the wiki there
- Migrate ISC / Learn there ? TBD
- Have VPS and cloud coder there, please.
Hannibal
- Backup DokuWiki: ✔ done already, Hannibal has /srv/www completely backed up on the Synology NAS DS923
- Add the Ingegamez website on WordPress
Site
- Proper CSS for the titles, and also for the alignment, which is ugly (look at this page!)
- Editor with no tabs
- Why is there a search box with the same text ?
- Rights done properly for every ISC member
TODOs from Rémi's papers
Rumba :
- Create a laptop backup space on the Rumba NAS, and make the share visible from the WireGuard subnet (Volume2)
- Re-create the rsync script for Hannibal on the Rumba Synology NAS: test it
- Proxmox to install on the EPYC / Check Windows VM install / Licensing for VDIs
- Install and test OpenProject
- Migrate Marks-Crawler (streamlit) from Disco to Rumba
Dance New :
- Check spares for the EPYC servers / order some disks, fans, power supplies
- Finish the Mellanox switch IP configuration, to put in the new Sinf subnet 10.5.1.148/24 / GW 10.5.1.1 / DNS 10.130.0.11,10.130.1.11
- Test and configure BeeGFS on the EPYC 48TB storage
- Configure the storage infiniband network
- Create a SLURM Test partition: either using shard on Disco, or take the current Rumba Dell 7920 with all 3 Nvidia RTX GPUs and put the test partition on it
- Change the creation script so that the “Test” partition is the default partition when a researcher arrives on the ISC Compute, then assign them the “Dance” partition once they are ready to run real jobs (see the partition sketch after this list)
- Apply data quotas to all ISC Compute users; this is not yet the case for everyone
- Check the BeeGFS quota mechanism to migrate the ext4 quotas from the current Disco/Chacha to the new EPYC storage
- Script a wrapper around Apptainer that checks the execution context and refuses to run bare-metal; same for python or other executables, to prevent runs outside of SLURM (see the wrapper sketch after this list)
- Migrate the current NVMe data disks from Disco/Chacha to BeeGFS once it is tested and ready on EPYC as well
- Automate file deletion for Standard (Premium too?) researchers, to avoid a scratch partition full of old test files / decide in a meeting what TTL we want: 2-week standard TTL? More for Premium? (see the cleanup sketch after this list)
- Rename the Filesystem : datasets → workspace? local_workspace? chacha_workspace ? local_scratch ? / shared → network_workspace ? remote_workspace ? remote_scratch ?
- Migrate Prometheus from Chacha to the new EPYC server / add some alerting on common checks: disks, jobs outside of SLURM, etc.
- Move NVMe disks ? Disco 7TB to Chacha ? / NVMe 3TB from Calypso storage to Disco ? Format as BeeGFS
- Check for a Modules installation? Or is Apptainer already fine? Install Lmod to allow dynamic library loading: where do we put the terabytes of libraries for Dance?
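For the Test/Dance default-partition item above, a sketch of what the creation script could do, assuming partition access is controlled through SLURM accounts; the partition lines, node lists and account names are placeholders:

  # slurm.conf (excerpt): Test is the cluster default, Dance is restricted to its own account
  #   PartitionName=test  Nodes=disco          Default=YES MaxTime=04:00:00
  #   PartitionName=dance Nodes=dance[01-04]   AllowAccounts=dance
  # creation script: a new researcher only gets the test account at first
  sacctmgr -i add user name="$NEW_USER" account=test
  # later, once they are ready to run real jobs:
  sacctmgr -i add user name="$NEW_USER" account=dance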
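For the Apptainer wrapper item above, a minimal sketch that refuses to run outside of a SLURM allocation, assuming the real binary has been moved aside to /usr/local/bin/apptainer.real (paths and message are placeholders; the same pattern would work for python):

  #!/usr/bin/env bash
  # /usr/local/bin/apptainer - refuse bare-metal execution outside of SLURM
  REAL=/usr/local/bin/apptainer.real   # assumed location of the renamed real binary
  if [[ -z "${SLURM_JOB_ID:-}" ]]; then
      echo "apptainer must be run inside a SLURM job (srun/sbatch), not directly on the node." >&2
      exit 1
  fi
  exec "$REAL" "$@"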
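For the scratch TTL item above, a sketch of the cleanup cron job, with the 2-week TTL taken from the open question and the scratch path and Premium-exemption group treated as placeholders:

  #!/usr/bin/env bash
  # scratch_ttl.sh - delete files untouched for TTL_DAYS, then prune empty directories
  SCRATCH=/scratch   # placeholder path
  TTL_DAYS=14        # 2-week standard TTL, to be confirmed in the meeting
  # Premium users could be exempted with e.g. "! -group premium" (group name is a placeholder)
  find "$SCRATCH" -xdev -type f -mtime +"$TTL_DAYS" -delete
  find "$SCRATCH" -xdev -mindepth 1 -type d -empty -delete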
Calypso :
- Finish the local DNS as a forwarder, either to 172.30.7.1 or to another external DNS (port 53 was already opened by YB): test that, when it is given as the default DNS in the WireGuard config, it does not defeat the split-tunneling purpose by sending all DNS requests through the tunnel. Or, if only DNS is sent through the tunnel, maybe live with that (see the test commands after this list)
- Test MAAS to auto-provision the Calypso[0-14] slave servers; calypso12-14 are not yet installed, so we can test on them
- Change CUDA 12.2 to 12.6 or 13.0?
- Add the FS switch to the stack, to allow wiring all the iDRAC and host cables of all Calypso servers
- Migrate the current 10 Gb router to a 25 Gb one? The NAS can only use 2×10 Gb Ethernet
- Write the UID/GID plan down somewhere suitable (currently only in users.yml in the Ansible playbooks); see the check script after this list:
- user UIDs: 1000 ubuntu / 1001 pmudry / 1002 remi / 1003-1100 teachers / 10000-10500 ISC students / 10500-11000 other students
- services and groups: 7000-8000 services like prometheus / 8000-9000 custom groups for students / 9000-10000 researcher groups
- Maybe one day either integrate into the HES Active Directory (no direct control over groups and user management) or recreate an LDAP server on Rumba?
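For the DNS / split-tunneling question above, a few commands to run from a WireGuard client once Calypso is set as the default DNS, assuming systemd-resolved on the client; the interface name wg0 and the forwarder address are placeholders:

  # which resolver each interface really uses
  resolvectl status wg0
  # does the Calypso forwarder answer? (placeholder address for its tunnel-side IP)
  dig +short wiki.isc-vs.ch @10.0.0.1
  # watch whether all DNS traffic, or only some of it, leaves through the tunnel
  sudo tcpdump -ni wg0 udp port 53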
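Until the UID/GID plan above lives somewhere more official than users.yml, a small sanity check can be scripted; a sketch assuming getent reflects all managed accounts and using the ranges listed above:

  #!/usr/bin/env bash
  # uid_plan_check.sh - flag accounts whose UID falls outside the documented ranges
  getent passwd | while IFS=: read -r name _ uid _; do
      (( uid < 1000 || uid >= 65000 )) && continue    # skip system accounts and nobody
      in_plan=0
      (( uid <= 1100 )) && in_plan=1                  # ubuntu, pmudry, remi, teachers
      (( uid >= 7000 && uid < 8000 )) && in_plan=1    # service accounts (prometheus, ...)
      (( uid >= 10000 && uid < 11000 )) && in_plan=1  # ISC students and other students
      (( in_plan )) || echo "UID outside plan: $name ($uid)"
  done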
Backups :
- Design a coherent backup architecture: which data has to be kept, and for how long
- Prod backups from Hannibal should also go onto the Synology SSD NAS (Rumba)
- Create a backup space for teachers' laptops
- Rumba services backups
- Clean Moodle courses to make backups smaller: did the auto-backup reduction option work? Did limiting the maximum number of versions kept (to 5?) work?
- Finish testing the rsync script with ACL permissions / extended attributes, then replace the hannibal.sh / marcellus.sh scripts on the Desktop NAS / set it up on the SSD Rumba NAS, either directly from the source at another time in the night, or as a copy of the first backup to avoid more I/O on the prod?
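A sketch of the rsync invocation the new script would be built around, assuming both ends support ACLs and extended attributes; the NAS hostname, destination path and log file are placeholders (only /srv/www is taken from the Hannibal item above):

  # -aHAX = archive + hard links + ACLs + extended attributes; --delete mirrors removals
  rsync -aHAX --delete --numeric-ids \
        --log-file=/var/log/backup_hannibal.log \
        /srv/www/ backup@rumba-nas:/volume2/backups/hannibal/
  # add -n (dry run) first to compare against what hannibal.sh currently produces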
19.SS09 :
- Remove the blades from the Submer pod? Isn't it worse to put this expensive, opened hardware on cloths? When will DeepSquare pick up their servers?
- Find something to do with the pod once everything else is out
Other :
- 23N218: poster to hide the transparent door from everyone in the hall / corridor
- 23N218: add a fridge for tupperware / a coffee machine
Post Mortem Followup :
- Monitoring for ISC/Learn: where to install it? Rumba? / Dashboard to create / email alerting to set up
- Finish the DRP wiki part : https://wiki.isc-vs.ch/doku.php?id=administratif:processes:drp
- Create a PDF + paper DRP to make sure the instructions are available even when nothing works: Hannibal, wiki and HES network all down
- Finish the learn playbook, as a DRP system to recreate Hannibal from scratch quickly in case of disaster
Playbooks :
- Finish the isc_compute system playbook, especially the Tango login node part, and the new configs to avoid direct connections to the compute nodes
- Finish slurm_calypso and slurm_research_TODO playbooks
- Finish prometheus playbook
- Finish the learn playbook, as a DRP system to recreate Hannibal from scratch quickly in case of disaster
- Finish the k8s playbook, from the current “manual script” playbook_a_faire_k8s_calypso.txt
