Calypso architecture
Physical Infrastructure
It is currently composed of the following machines:
- 1 DELL R740XD (Master)
- 15 DELL R630 (11 currently active)
- 3 DELL R630 (spares)
Network
Calypso is located in an isolated network inside the school. It can be accessed via a WireGuard VPN (ask the staff for access).
Inside this network you have a simple setup:
- 192.168.88.0/24 : network appliances subnet
- 192.168.89.0/24 : VPN subnet
- 192.168.90.0/24 : servers' iDRAC subnet
- 192.168.91.0/24 : servers subnet
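As an illustration, once the staff has given you a WireGuard configuration, bringing the tunnel up from a Linux client could look like the sketch below (the file name calypso.conf is a placeholder; keys, endpoint and allowed IPs come from the staff):

# Install the configuration provided by the staff (placeholder name: calypso.conf)
sudo install -m 600 calypso.conf /etc/wireguard/calypso.conf

# Bring the tunnel up; your client gets an address in the 192.168.89.0/24 VPN subnet
sudo wg-quick up calypso

# Check the tunnel and reach a node in the servers subnet
sudo wg show
ping -c 1 192.168.91.10    # calypso0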
Currently, 11 servers in the cluster are accessible to students for their labs:
- calypso0 : 192.168.91.10
- calypso1 : 192.168.91.11
- calypso2 : 192.168.91.12
- calypso3 : 192.168.91.13
- calypso4 : 192.168.91.14
- calypso5 : 192.168.91.15
- calypso6 : 192.168.91.16
- calypso7 : 192.168.91.17
- calypso8 : 192.168.91.18
- calypso9 : 192.168.91.19
- calypso10 : 192.168.91.20
The calypsomaster node is not accessible to students; it hosts the Kubernetes control plane:
- calypsomaster : 192.168.88.248
- calypsomaster iDRAC : 192.168.88.249
To work on the cluster, check with your teacher which nodes are allocated to you; if you need to run jobs on all nodes (via SLURM), you can pick any of them to submit from.
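For example, assuming your login follows the firstname.lastname scheme used for the home directories (the staff will confirm your actual username), connecting to an allocated node looks like:

# Replace firstname.lastname with your own login and pick a node allocated to you
ssh firstname.lastname@192.168.91.10    # calypso0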
NAS NFS share
On the Calypso infrastructure there is a shared storage, mainly for SLURM needs: it gives students a common filesystem across all Calypso nodes, so jobs can run on any node and read their inputs / write their results in one place.
- nas (NAS appliance) : 192.168.88.250
Each student's shared directory /exports/firstname.lastname/ is mounted from nas:/volume1/calypso_homes/homes/firstname.lastname.
In each student's home directory there is a symlink to it: ~/nas_home
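A quick way to check that the share behaves as expected (paths taken from above; the test file name is arbitrary):

# On any Calypso node, the symlink points to the NFS-backed directory
ls -ld ~/nas_home    # ... nas_home -> /exports/firstname.lastname

# A file written here from one node...
echo "hello from $(hostname)" > ~/nas_home/test.txt

# ...is visible from every other node, which is what SLURM jobs rely on
cat ~/nas_home/test.txt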
DNS server / Gateway
In this isolated network, the Sinf provides only its gateway, which is also the only DNS server:
- DNS / Gateway : 172.30.7.1
Software Architecture
User accounts
User access is SSH-based and managed by the staff.
Container Runtimes
To run containers on Calypso, you can use:
- containerd (used by Kubernetes)
- Docker Engine + docker-compose (installed on the servers for testing, see the Kubernetes section below)
SLURM Cluster
There is a SLURM cluster on all Calypso worker nodes:
- calypsomaster : no SLURM
- calypso0 : SLURM controller + accounting DB
- calypso[0-10] : SLURM workers
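As a minimal sketch (the script name hello.sbatch and the options are indicative; the partition is left to the default since the configuration section below is still TODO), a batch job could look like this:

#!/bin/bash
#SBATCH --job-name=hello
#SBATCH --output=%x-%j.out    # job-name-jobid.out, written in the submission directory
#SBATCH --nodes=1
#SBATCH --ntasks=1

srun hostname

Submit it from ~/nas_home (sbatch hello.sbatch, then squeue -u $USER to watch it) so that the output file lands on the shared filesystem and is readable from any node.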
Configuration
TODO
Kubernetes Cluster
The Kubernetes control plane runs on calypsomaster. You don't have access to it, and it can't run workload pods; it is purely administrative for the operation of the cluster. All the other nodes can run pods.
# kubectl get nodes
NAME            STATUS   ROLES           AGE    VERSION
calypso0        Ready    <none>          324d   v1.31.12
calypso1        Ready    <none>          324d   v1.31.12
calypso10       Ready    <none>          19d    v1.31.12
calypso2        Ready    <none>          323d   v1.31.12
calypso3        Ready    <none>          323d   v1.31.12
calypso4        Ready    <none>          323d   v1.31.12
calypso5        Ready    <none>          323d   v1.31.12
calypso6        Ready    <none>          120d   v1.31.12
calypso7        Ready    <none>          19d    v1.31.12
calypso8        Ready    <none>          19d    v1.31.12
calypso9        Ready    <none>          19d    v1.31.12
calypsomaster   Ready    control-plane   324d   v1.31.12
There are 2 usable namespaces as of now: the default one, where teachers can do tests, and the isc3 namespace for students; the others are administrative.
kubectl get namespaces
NAME              STATUS
default           Active
isc3              Active
kube-flannel      Active
kube-node-lease   Active
kube-public       Active
kube-system       Active
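Students therefore work in the isc3 namespace: either pass -n isc3 on every command, or set it as the default for your current context, for example:

# One-off: target the student namespace explicitly
kubectl get pods -n isc3

# Or make isc3 the default namespace for the current context
kubectl config set-context --current --namespace=isc3
kubectl get pods    # now implicitly targets isc3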
The K8s cluster was created with containerd as the main container runtime; Docker Engine and docker-compose are also installed on the servers for testing.
Users
Teachers are administrators: they hold ClusterRole permissions (cluster-admin-role) and broad Role permissions on the default and isc3 namespaces (default-admin-role, isc-admin-role).
Students only have a Role named student-role, which allows all the needed operations on pods, services, deployments, etc.
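If you are unsure whether an operation is covered by student-role, kubectl can tell you directly; the expected answers below follow from the roles described above:

# Check a few permissions in the student namespace
kubectl auth can-i create deployments -n isc3    # expected: yes
kubectl auth can-i list pods -n isc3             # expected: yes
kubectl auth can-i delete namespaces             # expected: no (cluster-scoped, not part of student-role)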
Storage
The PersistentVolumes (PV) have already been created locally to reflect the local storage capacity, which is NVMe SSD storage. The trade-off is that, since PersistentVolumeClaims (PVC) are bound 1:1 to a PV, you can't create additional PVCs: they would never be bound to a PV, because the existing PVs are already claimed statically. This is due to the local volume type, which requires the binding to use the WaitForFirstConsumer policy.
kubectl get storageclasses
NAME                 PROVISIONER                    RECLAIMPOLICY   VOLUMEBINDINGMODE      ALLOWVOLUMEEXPANSION
local-node-storage   kubernetes.io/no-provisioner   Delete          WaitForFirstConsumer   false

kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS         VOLUMEATTRIBUTESCLASS
local-node-volume   300G       RWO,RWX        Retain           Bound    default/claim300g   local-node-storage   <unset>
The PVCs were created with access modes ReadWriteOnce and ReadWriteMany, which allows pods created by several people to reuse the same PV concurrently.
kubectl get pvc
NAME        STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS         VOLUMEATTRIBUTESCLASS
claim300g   Bound    local-node-volume   300G       RWO,RWX        local-node-storage   <unset>
Rules for persistent storage:
- All the nodes have up to 300G of disk space allocated to K8s. So to create a pod requiring, for example, a non-ephemeral database, you need to specify the PersistentVolumeClaim claim300g in the deployment YAML, and it will be scheduled on a corresponding node (see the sketch below the output).
- The pods will automatically launch on the node that has the right PV/PVC pair.
kubectl get pods -n isc3 -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName
NAME   STATUS    NODE
www    Running   calypso0
www    Running   calypso4
www2   Running   calypso3
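As a minimal sketch of the rule above (names and image are placeholders; only the volume wiring to claim300g matters), a deployment using the shared claim could look like this, applied in the namespace where the PVC exists:

# Placeholder deployment mounting the pre-created claim300g PVC
kubectl apply -f - <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-db
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
      - name: demo-db
        image: postgres:16
        env:
        - name: POSTGRES_PASSWORD
          value: change-me    # placeholder only
        volumeMounts:
        - name: data
          mountPath: /var/lib/postgresql/data
      volumes:
      - name: data
        persistentVolumeClaim:
          claimName: claim300g
EOF

With the WaitForFirstConsumer binding mode and the local volume type, the scheduler places the pod on the node that holds the matching local PV, which is the behaviour shown in the output above.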
GPU Resources
Each node in the cluster has the NVIDIA Container Toolkit installed, which lets containers use its NVIDIA GPUs:
- calypso[0-9] are equipped with NVIDIA A2 GPUs
- calypso[10-14] are equipped with NVIDIA Tesla T4 GPUs
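To actually use a GPU from a pod, request the nvidia.com/gpu resource (this assumes the NVIDIA device plugin exposes that resource, which is the usual companion of the Container Toolkit; the CUDA image tag is indicative):

# One-shot pod requesting a single GPU and printing what it sees
kubectl apply -n isc3 -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check
spec:
  restartPolicy: Never
  containers:
  - name: gpu-check
    image: nvidia/cuda:12.4.1-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
EOF

# Once completed, the log should show an A2 or a Tesla T4 depending on the node
kubectl logs -n isc3 gpu-check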
Docker Registry
To avoid saturating the network link with builds, there is a local registry on calypsomaster that can be used: the daemon.json of each node is already configured to accept it as an insecure registry: "insecure-registries": ["192.168.88.248:5000"]
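A typical build-and-push cycle against this registry (image name and tag are placeholders) then looks like:

# Build on a node, tag for the local registry, push over the internal network
docker build -t 192.168.88.248:5000/myapp:0.1 .
docker push 192.168.88.248:5000/myapp:0.1

# The image can then be referenced by the same name, e.g. in a pod spec:
#   image: 192.168.88.248:5000/myapp:0.1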
TODO: recreate the registry with a certificate from the local PKI, and distribute the CA / registry certificates to every node so they trust the local authority.
