Differences
This shows you the differences between two versions of the page.
| administratif:todo [2025/01/20 11:46] – remi | administratif:todo [2025/09/26 13:59] (current) – [TODOs from Remi's papers] remi | ||
|---|---|---|---|
| Line 1: | Line 1: | ||
| - | ==== What still needs to be fixed ==== | + | ===== What still needs to be fixed ===== |
| FIXME | FIXME | ||
| - | === Apptainer === | + | ==== Apptainer |
| - | < | + | < |
| - | - '' | + | - To prevent quota explosion : installed the quota package on disco and chacha. Should we enforce quota at the FS level ? Everyone who said they would use less than 100GB is using more than they said (e.g. martin.barry at 350GB) : Applied quota on / filesystem, not yet on datasets. |
| - | - '' | + | |
| - | Created a / | + | |
| - | - To prevent quota explosion : installed the quota package on disco and chacha. Should we enforce quota at the FS level ? Everyone who said they would use less than 100GB is using more than they said (e.g. martin.barry at 350GB) | + | ==== Content |
| - | + | ||
| - | === Content === | + | |
| - Explain to PA how to create a proper structure. | - Explain to PA how to create a proper structure. | ||
| - Where do we put the content of this file, as some information is not intended for the general public | - Where do we put the content of this file, as some information is not intended for the general public | ||
| Line 23: | Line 19: | ||
| - Construct the **tools for teacher** sections with the existing information | - Construct the **tools for teacher** sections with the existing information | ||
| - | === Monitoring === | + | ==== Monitoring |
| - | - alerting and messaging on various metrics (disk space, cpu usage, ...) for the various computational resources (chacha, disco, calypso & others) | + | - alerting and messaging on various metrics (disk space, cpu usage, ...) for the various computational resources |
| - | === Server room === | + | ==== Server room ==== |
| - Make something nice there. Posters on the walls, screens, stuff. | - Make something nice there. Posters on the walls, screens, stuff. | ||
| - Why is there still a box for a server in the networking lab room ? : ** Because we need at least one for sending back in case of support / we needed one for network labs to hide the CTF network setup ** | - Why is there still a box for a server in the networking lab room ? : ** Because we need at least one for sending back in case of support / we needed one for network labs to hide the CTF network setup ** | ||
| - Rename networking lab room and change the substitute (remplaçant) for Darko as well | - Rename networking lab room and change the substitute (remplaçant) for Darko as well | ||
| - Why only 10 GB for the fiber | - Why only 10 GB for the fiber | ||
| - | - I don't want a patch panel inside the server rack but outside of it. Space will be premium soon there and we don't know where to put the server rack : Search and buy a patch panel to put in the room | + | - I don't want a patch panel inside the server rack but outside of it. Space will be premium soon there and we don't know where to put the server rack : <del>Search and buy a patch panel to put in the room</ |
| - Find a proper layout for the server room for accommodating a water-cooled rack and maybe another one in a couple of months | - Find a proper layout for the server room for accommodating a water-cooled rack and maybe another one in a couple of months | ||
| - If we have Rumba running there, we need some UPS solution. | - If we have Rumba running there, we need some UPS solution. | ||
| Line 37: | Line 33: | ||
| - Choose new R630 and R730 / R740 for RUMBA main. Budget 3 kFr | - Choose new R630 and R730 / R740 for RUMBA main. Budget 3 kFr | ||
| - Do we need a file server from the guys downstairs (baignoire) | - Do we need a file server from the guys downstairs (baignoire) | ||
| - | - Remove again the big oven : check with Hervé Girard to store it in 23N322 : this is where a student used it last time (but nicely returned it to N307), why not keep it there ? EDIT: the RoL of N322 is Thomas Sterren, I sent him a message for the oven. (Rémi) / EDIT2 : answer is "we share the room so it will stay in 307. period." | + | - <del>Remove again the big oven : check with Hervé Girard to store it in 23N322 : this is where a student used it last time (but nicely returned it to N307), why not keep it there ? EDIT: the RoL of N322 is Thomas Sterren, I sent him a message for the oven. (Rémi) / EDIT2 : answer is "we share the room so it will stay in 307. period." |
| - | === Slurm on chacha or disco === | + | |
| + | ==== Slurm on chacha or disco ==== | ||
| - < | - < | ||
| - < | - < | ||
| - | - Find how to do the resource partitioning with billing credits by user / account | + | - <del>Find how to do the resource partitioning with billing credits by user / account</ |
| - Discuss how to allocate credits for users : what about students ? | - Discuss how to allocate credits for users : what about students ? | ||
| - | - < | + | - < |
| + | - For the future jump server, need to test how to restrict ssh access to other servers : via SLURM they might recreate their authorized_keys by running a job writing a .ssh/ | ||
| - | === Calypso === | + | ==== Calypso |
| - Reinstall slurm by compiling with all necessary plugins, | - Reinstall slurm by compiling with all necessary plugins, | ||
| - | === Rumba === | + | ==== Rumba ==== |
| - Turn on Rumba and install a proper env for us, mainly based on docker as a limited number of members will use it | - Turn on Rumba and install a proper env for us, mainly based on docker as a limited number of members will use it | ||
| - Test backup and replicate ISC / Learn on Rumba | - Test backup and replicate ISC / Learn on Rumba | ||
| Line 58: | Line 56: | ||
| - Have VPS and cloud coder there, please. | - Have VPS and cloud coder there, please. | ||
| - | === Hannibal === | + | ==== Hannibal |
| - Backup DokuWiki : **✔ Done already, Hannibal has /srv/www completely backed up on the Synology NAS DS923** | - Backup DokuWiki : **✔ Done already, Hannibal has /srv/www completely backed up on the Synology NAS DS923** | ||
| - Add Ingegamez website on wordpress | - Add Ingegamez website on wordpress | ||
| - | === Site === | + | ==== Site ==== |
| - Proper CSS for title, also for the alignment which is ugly (look at this page!) | - Proper CSS for title, also for the alignment which is ugly (look at this page!) | ||
| - Editor with no tabs | - Editor with no tabs | ||
| - Why is there a search box with the same text ? | - Why is there a search box with the same text ? | ||
| - Rights done properly for every ISC member | - Rights done properly for every ISC member | ||
| + | |||
| + | |||
| + | |||
| + | ==== TODOs from Remi's papers ==== | ||
| + | |||
| + | Rumba : | ||
| + | - Create laptops backup space on the Rumba NAS, make the share visible from the wireguard subnet (Volume2) | ||
| + | - Re-create the Rsync script for Hannibal on the Rumba Synology NAS : test | ||
| + | - Proxmox to install on the EPYC / Check Windows VM install / Licensing for VDIs | ||
| + | - Install and test OpenProject | ||
| + | - Migrate Marks-Crawler (streamlit) from Disco to Rumba | ||
| + | |||
| + | Dance New : | ||
| + | - Check spares for EPYC servers / order some discs, fans, power supplies | ||
| + | - Finish the Mellanox switch IP configuration, | ||
| + | - Test and configure BeeGFS on the EPYC 48TB storage | ||
| + | - Configure the storage infiniband network | ||
| + | - Create SLURM Test Partition : using shard on Disco, or put the current Rumba Dell 7920 with all 3 Nvidia RTX GPUS, and put the test partition on it | ||
| + | - Change the creation script to make the " | ||
| + | - Apply Data quota on all ISC compute users, not the case for everyone yet | ||
| + | - Check the BeeGFS quota mechanism to migrate EXT4 quota from current Disco/ |
| + | - Script a wrapper around Apptainer to check the execution context and refuse to run directly bare-bone : same for python or other executables, to avoid runs outside of SLURM | ||
| + | - Migrate current NVMe data disks from Disco/ | ||
| + | - Automate file deletion for Standard (Premium too?) researchers to avoid a scratch partition full of old test files / Set in meeting what TTL we want : 2 weeks standard TTL ? More for Premium ? | ||
| + | - Rename the Filesystem : datasets -> workspace? local_workspace? | ||
| + | - Migrate Prometheus from Chacha to the new EPYC server / Add some alerting on common checks, disks, jobs outside of slurm etc... | ||
| + | - Move NVMe disks ? Disco 7TB to Chacha ? / NVMe 3TB from Calypso storage to Disco ? Format as BeeGFS | ||
| + | - Check for a Modules installation ? Or Apptainer is already fine ? : Install LMOD to allow dynamic lib loading : where to put the terabytes of libraries for Dance ? | ||
| + | |||
| + | |||
| + | Calypso : | ||
| + | - Finish the local DNS as forwarder either to 172.30.7.1 or another external DNS (port 53 was already opened by YB) : test that when it is given as the default DNS in the Wireguard config it doesn't defeat the Split-Tunneling purpose by sending all DNS requests through the tunnel ? Or if only the DNS is sent through the tunnel, maybe live with the fact | ||
| + | - Test MAAS to auto-provision Calypso[0-14] slave servers; calypso12-14 are not yet installed, so we can test on them | ||
| + | - Change CUDA12.2 to 12.6 or 13.0 ? | ||
| + | - Add the FS switch to the stack, to allow wiring all IDRACs and hosts cables to all Calypso servers | ||
| + | - Migrate the current 10GB router to 25GB router ? NAS can only use ethernet 10GBx2 | ||
| + | - Set somewhere suitable the UID/GID plan (currently only on users.yml in Ansible playbooks) : | ||
| + | - users UIDs : 1000 : ubuntu / 1001 pmudry / 1002 remi / 1003-1100 teachers / 10000-10500 ISC students / 10500-11000 other students | ||
| + | - services : 7000-8000 services like prometheus / 8000-9000 custom groups for students / 9000-10000 researchers groups | ||
| + | - Maybe one day either integrate into HES Active Directory (no direct control of the groups and user management) or recreate an LDAP server in Rumba ? | ||
| + | |||
| + | Backups : | ||
| + | - Make a coherent backup architecture : which data has to be kept how long | ||
| + | - Prod backups from Hannibal to put also on the Synology SSD NAS (Rumba) | ||
| + | - Teachers laptops backup space to create | ||
| + | - Rumba services backups | ||
| + | - Clean Moodle courses to make backups smaller : Did the auto-backups reduction option work ? Did the maximum number of kept versions (5?) work ? | ||
| + | - Finish testing the Rsync script with ACL permissions / Extended attributes, then replace the hannibal.sh / marcellus.sh scripts on the Desktop NAS / set up on the SSD Rumba NAS, either directly from source at another moment in the night, or as a copy of the first backup to avoid more I/Os on the prod ? | ||
| + | |||
| + | 19.SS09 : | ||
| + | - Remove blades from the Submer pod ? Isn't it worse to put this expensive opened hardware on cloths ? When does Deepsquare pick up their servers ? | ||
| + | - Find something to do with the pod when everything else will be out | ||
| + | |||
| + | Other : | ||
| + | - 23N218 poster to hide the transparent door to all people in the hall / corridor | ||
| + | - 23N218 add a fridge for tupperwares / coffee machine | ||
| + | |||
| + | Post Mortem Followup : | ||
| + | - Monitoring ISC/Learn : where to install ? Rumba ? / Dashboard to create / email alerting to set up | ||
| + | - Finish the DRP wiki part : https:// | ||
| + | - Create a PDF + paper DRP to make sure instructions are available even when nothing works : hannibal, wiki, and HES network down | ||
| + | - Finish learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of | ||
| + | |||
| + | Playbooks : | ||
| + | - Finish the isc_compute system playbook, especially the Tango login node part, and new configs to avoid direct connection on compute nodes | ||
| + | - Finish slurm_calypso and slurm_research_TODO playbooks | ||
| + | - Finish prometheus playbook | ||
| + | - Finish learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of | ||
| + | - Finish the k8s playbook, from the currently " | ||
| + | |||
| + | |||
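
The UID/GID plan listed under Calypso could be kept next to the Ansible playbooks as a small lookup, for example to sanity-check new entries in users.yml. The following is only a sketch: the ranges are copied from the plan above, but the names `ID_PLAN` and `classify_id` are hypothetical, and because the plan does not say which range owns a shared boundary (8000, 9000, 10500), the sketch assumes every upper bound is exclusive.

```python
# Ranges copied from the UID/GID plan in the TODO list.
# ASSUMPTION: upper bounds are exclusive, so 8000, 9000 and 10500 belong to
# the range that starts there (the plan itself leaves this open).
ID_PLAN = [
    (1000, 1001, "ubuntu"),
    (1001, 1002, "pmudry"),
    (1002, 1003, "remi"),
    (1003, 1100, "teachers"),
    (7000, 8000, "services"),
    (8000, 9000, "custom student groups"),
    (9000, 10000, "researcher groups"),
    (10000, 10500, "ISC students"),
    (10500, 11000, "other students"),
]


def classify_id(value: int) -> str:
    """Return the plan category for a UID/GID, or 'unassigned' if none matches."""
    for low, high, label in ID_PLAN:
        if low <= value < high:
            return label
    return "unassigned"


if __name__ == "__main__":
    for candidate in (1002, 1050, 7500, 10500, 500):
        print(candidate, classify_id(candidate))
```

If the boundaries are meant to be inclusive instead, only the tuples in `ID_PLAN` need to change; the lookup itself stays the same.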
