==== Differences ====

This page shows the differences between two revisions of administratif:todo: 2025/04/03 09:55 (remi) and 2025/09/26 13:59 (current, remi).

- Choose new R630 and R730 / R740 for RUMBA main. Budget: 3 kFr
- Do we need a file server from the guys downstairs (baignoire)?
- <

- Why is there a search box with the same text?
- Rights done properly for every ISC member

==== TODOs from Remi's papers ====

Rumba:
- Create a laptop backup space on the Rumba NAS; make the share visible from the WireGuard subnet (Volume2)
- Re-create the rsync script for Hannibal on the Rumba Synology NAS: test it
- Install Proxmox on the EPYC / check the Windows VM install / licensing for the VDIs
- Install and test OpenProject
- Migrate Marks-Crawler (Streamlit) from Disco to Rumba

Dance New:
- Check spares for the EPYC servers / order some disks, fans, power supplies
- Finish the Mellanox switch IP configuration,
- Test and configure BeeGFS on the EPYC 48TB storage
- Configure the storage InfiniBand network
- Create a SLURM test partition: use shard on Disco, or take the current Rumba Dell 7920 with all 3 NVIDIA RTX GPUs and put the test partition on it
- Change the creation script to make the "
- Apply data quotas to all ISC compute users; not yet the case for everyone
- Check the BeeGFS quota mechanism to migrate the EXT4 quotas from the current Disco/
- Script a wrapper around Apptainer that checks the execution context and refuses to run directly bare-bone; same for Python or other executables, to avoid runs outside of SLURM
- Migrate the current NVMe data disks from Disco/
- Automate file deletion for Standard (Premium too?) researchers to avoid a scratch partition full of old test files / decide in a meeting what TTL we want: 2 weeks standard TTL? More for Premium?
- Rename the filesystem: datasets -> workspace? local_workspace?
- Migrate Prometheus from Chacha to the new EPYC server / add some alerting on common checks: disks, jobs outside of SLURM, etc.
- Move NVMe disks? Disco 7TB to Chacha? / NVMe 3TB from Calypso storage to Disco? Format as BeeGFS
- Check for a Modules installation, or is Apptainer already fine? Install Lmod to allow dynamic lib loading; where to put the terabytes of libraries for Dance?
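
The Apptainer-wrapper item above can start from a small guard like this sketch (the paths and messages are placeholders, not the real setup): srun and sbatch export SLURM_JOB_ID, so a wrapper placed ahead of the real binary in PATH can refuse to continue when that variable is missing.

```shell
# Hypothetical SLURM-context guard for an apptainer (or python) wrapper.

# require_slurm: succeed only inside a SLURM allocation; srun/sbatch
# export SLURM_JOB_ID, so its absence means a bare login-node run.
require_slurm() {
    if [ -z "${SLURM_JOB_ID:-}" ]; then
        echo "refusing to run outside of SLURM; use srun or sbatch" >&2
        return 1
    fi
}

# A wrapper installed as e.g. /usr/local/bin/apptainer would then do:
#   require_slurm || exit 1
#   exec /usr/libexec/apptainer.real "$@"
```

The same guard covers the "same for Python" case; only the exec target changes.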
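The scratch-TTL automation above reduces to a nightly find pass. A minimal sketch, assuming a cron-driven cleanup (the path and the 2-week TTL are placeholders pending the meeting decision):

```shell
# scratch_ttl_clean DIR DAYS: delete regular files in DIR not modified
# for more than DAYS days; directories are kept so the tree survives.
scratch_ttl_clean() {
    dir=$1
    days=$2
    find "$dir" -type f -mtime +"$days" -print -delete
}

# A wrapper script around this function would run nightly from cron,
# e.g. on /scratch/standard with the proposed 14-day TTL.
```

A longer Premium TTL is then just a second invocation with different arguments.
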
Calypso:
- Finish the local DNS as a forwarder, either to 172.30.7.1 or another external DNS (port 53 was already opened by YB): test that when it is given as the default DNS in the WireGuard config it doesn't defeat the split-tunneling purpose by sending all DNS requests through the tunnel? Or, if only DNS is sent through the tunnel, maybe live with the fact
- Test MAAS to auto-provision the Calypso[0-14] slave servers; calypso12-14 are not yet installed, so we can test on them
- Change CUDA 12.2 to 12.6 or 13.0?
- Add the FS switch to the stack, to allow wiring all the iDRAC and host cables to all Calypso servers
- Migrate the current 10Gb router to a 25Gb router? The NAS can only use 2x10Gb Ethernet
- Set the UID/GID plan somewhere suitable (currently only in users.yml in the Ansible playbooks):
  - user UIDs: 1000 ubuntu / 1001 pmudry / 1002 remi / 1003-1100 teachers / 10000-10500 ISC students / 10500-11000 other students
  - services: 7000-8000 services like prometheus / 8000-9000 custom groups for students / 9000-10000 researcher groups
- Maybe either one day integrate into the HES Active Directory (no direct control of group and user management) or recreate an LDAP server in Rumba?
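
For the DNS / split-tunnel question above, the behaviour can be reasoned about from the client config. A sketch with made-up addresses, keys and endpoint (not the real values): wg-quick installs the `DNS =` address as the system resolver, so every DNS request does go to it, but that traffic only enters the tunnel because 172.30.7.1 falls inside `AllowedIPs`; everything else keeps using the local uplink, so split tunneling itself is preserved.

```ini
[Interface]
# Example client config; all values below are placeholders.
PrivateKey = <client private key>
Address = 10.0.7.2/32
# wg-quick sets this as the system resolver for ALL lookups:
DNS = 172.30.7.1

[Peer]
PublicKey = <server public key>
Endpoint = vpn.example.org:51820
# Split tunnel: only these subnets are routed into the tunnel.
# 172.30.7.1 is inside them, so DNS queries reach the forwarder
# through the tunnel while ordinary traffic stays outside it.
AllowedIPs = 172.30.7.0/24, 10.0.7.0/24
```
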
Backups:
- Design a coherent backup architecture: which data has to be kept, and for how long
- Put the prod backups from Hannibal on the Synology SSD NAS (Rumba) as well
- Create a backup space for teachers' laptops
- Rumba services backups
- Clean the Moodle courses to make backups smaller: did the auto-backup reduction option work? Does the maximum number of versions kept (5?) work?
- Finish testing the rsync script with ACL permissions / extended attributes, then replace the hannibal.sh / marcellus.sh scripts on the Desktop NAS / set it up on the SSD Rumba NAS, either directly from the source at another moment in the night, or as a copy of the first backup to avoid more I/O on the prod?
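
For the rsync ACL/xattr item above, the flag set being validated would look roughly like this (the function name, hosts and paths are made-up examples):

```shell
# Hypothetical sync helper for the hannibal.sh / marcellus.sh replacement.
# -a archive mode; -A POSIX ACLs; -X extended attributes; -H hard links;
# --numeric-ids keeps UID/GID numbers stable across the NASes;
# --delete mirrors removals so the backup matches the source.
sync_share() {
    src=$1
    dst=$2
    rsync -aAXH --numeric-ids --delete "$src" "$dst"
}

# Example with placeholder paths:
#   sync_share /volume1/prod/ rumba-nas:/volume2/prod-mirror/
```

Running it from the SSD NAS against the first backup instead of the prod source is the same call with the backup path as `src`, which keeps the extra I/O off the prod volume.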
19.SS09:
- Remove the blades from the Submer pod? Isn't it worse to put this expensive opened hardware on cloths? When does Deepsquare pick up their servers?
- Find something to do with the pod once everything else is out

Other:
- 23N218: poster to hide the transparent door from the people in the hall / corridor
- 23N218: add a fridge for tupperware / a coffee machine

Post-mortem follow-up:
- Monitoring for ISC/Learn: where to install it? Rumba? / dashboard to create / email alerting to set up
- Finish the DRP wiki part: https://
- Create a PDF + paper DRP to make sure the instructions are available even when nothing works: hannibal, the wiki, and the HES network all down
- Finish the learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of

Playbooks:
- Finish the isc_compute system playbook, especially the Tango login node part, and the new configs to avoid direct connections to the compute nodes
- Finish the slurm_calypso and slurm_research_TODO playbooks
- Finish the prometheus playbook
- Finish the learn playbook, as a DRP system to recreate Hannibal from scratch fast in case of
- Finish the k8s playbook, from the currently "