  
  * 1 DELL R740XD (Master)
  * 15 DELL R630 (11 currently active)
  * 3 DELL R630 (spares)
  
===== Network =====
  
Calypso is located in an isolated network inside the school. It can be accessed via a [[infra:wireguard|Wireguard VPN]] (ask [[infra:staff|the staff]] for access).
  
Inside this network you have a simple setup:
   192.168.91.0/24 : Servers subnet   192.168.91.0/24 : Servers subnet
  
-Currently there are servers running in the cluster that are accessible to students for their labs :+Currently there are 11 servers running in the cluster that are accessible to students for their labs :
  
  calypso0 : 192.168.91.10
  calypso1 : 192.168.91.11
  calypso2 : 192.168.91.12
  calypso3 : 192.168.91.13
  calypso4 : 192.168.91.14
  calypso5 : 192.168.91.15
  calypso6 : 192.168.91.16
  calypso7 : 192.168.91.17
  calypso8 : 192.168.91.18
  calypso9 : 192.168.91.19
  calypso10 : 192.168.91.20
  
The calypsomaster node is not available for connection to students; it hosts the Kubernetes control plane:
  
  calypsomaster : 192.168.88.248
  calypsomaster IDRAC : 192.168.88.249
To work on the nodes, check with your teacher which ones are allocated to you; if you need to run jobs on all nodes (via SLURM), you can pick any of them.
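A SLURM job can be sketched as a small batch script; everything here except the SLURM directives and commands (job name, output file) is a made-up example:

<code bash>
#!/bin/bash
#SBATCH --job-name=demo          # hypothetical job name
#SBATCH --output=demo-%j.out     # %j is replaced by the job ID
#SBATCH --nodes=1                # run on a single node

hostname                         # the actual work goes here
</code>

Submit it with ''sbatch demo.sh'' and monitor it with ''squeue -u $USER''.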
  
  
==== NAS NFS share ====

On the Calypso infrastructure there is a shared storage, mainly for SLURM needs: it gives students a common filesystem across all Calypso nodes, so jobs can run on any node and results can be collected in one place.

  nas (NAS appliance) : 192.168.88.250

The filesystem is mounted from **nas:/volume1/calypso_homes/homes/firstname.lastname** to each student's shared directory **/exports/firstname.lastname/**.

In each student's home there is a symlink to it: **~/nas_home**
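As a quick sanity check (a sketch, assuming your account follows the firstname.lastname scheme above):

<code bash>
# The symlink in your home points at the shared NFS directory
ls -ld ~/nas_home

# Anything written there is visible from every Calypso node
echo "hello from $(hostname)" > ~/nas_home/check.txt
</code>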
==== DNS server / Gateway ====
  
In this isolated network, the Sinf provides their gateway as the only DNS server:
  
  DNS / Gateway : 172.30.7.1
  
  
==== User accounts ====
  
User access is SSH based, managed by [[infra:staff|the staff]].
  
  
  calypsomaster : no SLURM
  calypso0 : SLURM controller + accounting DB
  calypso[0-10] : SLURM workers
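From any node you are allocated, the standard SLURM client commands can be used to inspect the cluster and launch work; a minimal sketch (no partition name is assumed):

<code bash>
sinfo                      # list partitions and node states
srun -N1 hostname          # run a one-off task on a free node
sbatch --wrap="hostname"   # the same as a batch job
squeue -u $USER            # watch your own jobs
</code>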
  
==== Configuration ====
  
TODO
==== Kubernetes Cluster ====
  
The Kubernetes control plane is on calypsomaster. You don't have access to it, and this node can't run pods; it exists only to operate the cluster. All the other nodes are capable of running pods.
  
<code>
# kubectl get nodes
NAME            STATUS   ROLES           AGE    VERSION
calypso0        Ready    <none>          324d   v1.31.12
calypso1        Ready    <none>          324d   v1.31.12
calypso10       Ready    <none>          19d    v1.31.12
calypso2        Ready    <none>          323d   v1.31.12
calypso3        Ready    <none>          323d   v1.31.12
calypso4        Ready    <none>          323d   v1.31.12
calypso5        Ready    <none>          323d   v1.31.12
calypso6        Ready    <none>          120d   v1.31.12
calypso7        Ready    <none>          19d    v1.31.12
calypso8        Ready    <none>          19d    v1.31.12
calypso9        Ready    <none>          19d    v1.31.12
calypsomaster   Ready    control-plane   324d   v1.31.12
</code>
  
<code>
kubectl get pv
NAME                CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM               STORAGECLASS         VOLUMEATTRIBUTESCLASS
local-node-volume   300G       RWO,RWX        Retain           Bound    default/claim300g   local-node-storage   <unset>
</code>
<code>
kubectl get pvc
NAME        STATUS   VOLUME              CAPACITY   ACCESS MODES   STORAGECLASS         VOLUMEATTRIBUTESCLASS
claim300g   Bound    local-node-volume   300G       RWO,RWX        local-node-storage   <unset>
</code>
  
Rules for persistent storage:
  * All the nodes have up to 300G of disk space allocated to K8s. To create a pod that needs non-ephemeral storage (for example a database), specify the PersistentVolumeClaim **claim300g** in the deployment yaml, and the pod will be dispatched to a corresponding node.
  * The pods will automatically launch on the node that has the right PV/PVC pair.
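To illustrate, a minimal Deployment using the claim; only **claim300g** comes from the cluster, every other name (**demo-db**, image, mount path) is a made-up example:

<code yaml>
apiVersion: apps/v1
kind: Deployment
metadata:
  name: demo-db                    # hypothetical name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: demo-db
  template:
    metadata:
      labels:
        app: demo-db
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: claim300g   # the cluster-provided PVC
</code>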
  
<code>
kubectl get pods -n isc3 -o=custom-columns=NAME:.metadata.name,STATUS:.status.phase,NODE:.spec.nodeName
NAME      STATUS    NODE
www       Running   calypso0
www       Running   calypso4
www2      Running   calypso3
</code>
==== GPU Resources ====
  
Each node in the cluster has the Nvidia Container Toolkit installed, which lets you use the node's Nvidia GPUs from containers.

Calypso[0-9] are equipped with Nvidia A2 GPUs.

Calypso[10-14] are equipped with Nvidia Tesla T4 GPUs.
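A container requests a GPU through the standard Kubernetes device-plugin resource; a sketch, assuming the Nvidia device plugin exposes ''nvidia.com/gpu'' on these nodes (pod name and image tag are illustrative):

<code yaml>
apiVersion: v1
kind: Pod
metadata:
  name: gpu-check              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]  # prints the GPU visible to the container
      resources:
        limits:
          nvidia.com/gpu: 1    # one A2 or T4, depending on the node
</code>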
  
  