Or when the issue is not visible yet but will impact users soon
What to do
Assess the problem
What is impacted, and how serious is the issue? Spend at most 15 minutes on this, then start communicating even if you do not yet fully understand what is going on
Communication
Prepare and send the first incident notification within the next 10 minutes if possible: (incident start template link)
It contains:
What service is impacted?
The scope of the issue: how is the service impacted? Complete downtime or degraded service?
Who is impacted? → Add those people to the email, if one is sent
How long will it be down? Or, at least, an explanation of what makes the resolution time uncertain
What has been started?
What are the next debugging steps?
State that the next communication will follow in 1H
Note the current time and schedule the next communication mark 1H from now
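The content of the first notification can be sketched as a small data structure. This is only an illustration, not the actual template (which lives behind the template link): the class name, field names and output wording are assumptions, and only the list of items and the 1H interval come from the procedure above.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Interval taken from the procedure: next communication in 1H.
NEXT_COMMUNICATION = timedelta(hours=1)


@dataclass
class IncidentNotification:
    """Items of the first notification (hypothetical field names)."""
    service: str            # what service is impacted
    scope: str              # complete downtime or degraded service
    impacted: str           # who is impacted
    expected_duration: str  # estimate, or why the resolution time is uncertain
    started: str            # what has been started
    next_steps: str         # next debugging steps

    def render(self, sent_at: datetime) -> str:
        """Return the notification text, ending with the next communication mark."""
        next_mark = sent_at + NEXT_COMMUNICATION
        lines = [
            f"Service impacted: {self.service}",
            f"Scope: {self.scope}",
            f"Who is impacted: {self.impacted}",
            f"Expected duration: {self.expected_duration}",
            f"Started so far: {self.started}",
            f"Next steps: {self.next_steps}",
            f"Next communication: {next_mark:%H:%M}",
        ]
        return "\n".join(lines)
```

Calling render() with the sending time also gives the mark to put in the calendar, so the two steps (send, then schedule) stay in sync.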
Debugging
Continue debugging.
Either you solve the issue:
Take 10 minutes to send the Incident Solved notification: (Incident Solved template link), preferably highlighting the last resolution steps in green
Or it is still not solved: 10 minutes before the scheduled mark, start appending the current state of the debugging to your last communication (on the Teams group, or in the last email, which you update, preferably with the new elements highlighted in orange)
Repeat the last debugging and communication steps until the issue is solved, or until 4H have passed
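As a rough aid, the communication marks of this loop can be computed in advance. The sketch below assumes that the 4H are counted from the first notification, which the procedure does not state explicitly; the function name is hypothetical.

```python
from datetime import datetime, timedelta

COMMUNICATION_INTERVAL = timedelta(hours=1)  # one communication mark every 1H
PREPARATION_LEAD = timedelta(minutes=10)     # start writing 10 minutes before the mark
ESCALATION_LIMIT = timedelta(hours=4)        # stop the loop when 4H are reached


def update_schedule(first_notification: datetime) -> list[tuple[datetime, datetime]]:
    """Return (start_writing_at, send_at) pairs up to the 4H limit.

    Assumption: the 4H are measured from the first notification.
    """
    schedule = []
    mark = first_notification + COMMUNICATION_INTERVAL
    while mark - first_notification <= ESCALATION_LIMIT:
        schedule.append((mark - PREPARATION_LEAD, mark))
        mark += COMMUNICATION_INTERVAL
    return schedule


if __name__ == "__main__":
    for start_writing, send in update_schedule(datetime(2024, 1, 1, 9, 0)):
        print(f"start writing at {start_writing:%H:%M}, send at {send:%H:%M}")
```

The loop ends as soon as the incident is solved; the schedule only lists the marks that would apply if it is not.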
When the issue takes more than 4H
After 4H without resolution:
Notify anyone else? (management?)
Get some external help on the issue? (it is best to keep a ready list of contacts with their areas of expertise, when available)
Decide to switch to a workaround? For example, restore from backups or reinstall a server
After a day without resolution:
Notify anyone else? (management?)
Switch to a complete workaround? (Reinstall a server from scratch, create a new VPS in Infomaniak and install manually without backups)
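The two thresholds above can be summarised as a lookup from elapsed time to the questions to ask. This is a minimal sketch: the stage names and the function are hypothetical, and treating exactly 4H (or exactly one day) as already past the threshold is an assumption.

```python
from datetime import timedelta

FOUR_HOURS = timedelta(hours=4)
ONE_DAY = timedelta(days=1)

# Questions to ask at each stage, taken from the procedure.
QUESTIONS = {
    "debug-and-communicate": [
        "Continue debugging and send hourly updates",
    ],
    "after-4h": [
        "Notify anyone else? (management?)",
        "Get some external help on the issue?",
        "Switch to a workaround? (restore from backups, reinstall a server)",
    ],
    "after-a-day": [
        "Notify anyone else? (management?)",
        "Switch to a complete workaround? (reinstall from scratch, new VPS)",
    ],
}


def escalation_stage(elapsed: timedelta) -> str:
    """Map the time spent without resolution to a stage name (hypothetical names)."""
    if elapsed >= ONE_DAY:
        return "after-a-day"
    if elapsed >= FOUR_HOURS:
        return "after-4h"
    return "debug-and-communicate"


if __name__ == "__main__":
    stage = escalation_stage(timedelta(hours=5))
    for question in QUESTIONS[stage]:
        print(question)
```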