Issues updating from Coder 2.13.3 to 2.22.1
Hi guys,
We're having issues updating Coder, where the update never completes, or fails somewhere without producing any logs.
In our setup we're deploying Coder in our AWS EKS cluster via the existing Helm chart.
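For context, our helmfile release looks roughly like the sketch below. The repository name, namespace, and values file are simplified/illustrative, not our exact setup; the chart and target version reflect the upgrade we're attempting.

```yaml
# Rough sketch of a helmfile release for the Coder chart (illustrative).
# Repo/namespace/values names are placeholders; the version is the
# upgrade target (2.13.3 -> 2.22.1).
repositories:
  - name: coder-v2
    url: https://helm.coder.com/v2

releases:
  - name: coder
    namespace: coder
    chart: coder-v2/coder
    version: 2.22.1
    values:
      - values.yaml
```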
We tried the update in different staging environments and it succeeded with no issues, but in production something unexpected happens. We're not running HA mode, so we have one pod running (the active pod), and when we run the helmfile update command, a new pod spawns (the update pod). The update pod seems to be stuck in a Running state, constantly restarting because of the liveness probe. No logs are shown in the update pod other than the standard header and the web UI URL ("WARN: CODER_TELEMETRY is deprecated, please use CODER_TELEMETRY_ENABLE instead. Started HTTP listener at http://0.0.0.0:8080"). The active pod also becomes unreachable (you cannot reach it via the web UI).
We also tried scaling the deployment to 0 replicas and then performing the update, but without success. The same issue happens.
After we roll back the Helm release, everything goes back to normal.
Thank you in advance!
Category: Help needed
Product: Coder (v2)
Platform: Linux
The upgrade should be fine. But given it's a huge version bump, I would advise upgrading gradually and checking for any breaking changes in each release.
Also, any logs can help.
The thing is, there were no logs; the only bit of information we could gather was from describing the update pod (in the scenario where I scaled the original deployment down to 0 replicas and then performed the update), and the events were just about the liveness and readiness probes failing.
Which we thought was weird, because it seems the Coder application didn't even start.
To our surprise, manually bumping the deployment's liveness probe to a higher timeout seems to have fixed the issue, and the update succeeded. It looks like the probe was firing faster than the database migration could finish, due to some internal state. Is this possible? We checked the database and there were no connections to it.
This specifically is weird, because the length of the database migration seems pretty non-deterministic.
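For anyone hitting the same thing, below is a minimal sketch of the kind of livenessProbe override that worked for us on the coder Deployment. The health endpoint, port name, and exact timings are illustrative assumptions, not the chart's documented defaults; the point is giving the pod enough headroom to finish migrations before Kubernetes restarts it.

```yaml
# Sketch of a relaxed livenessProbe for the coder container (illustrative).
# Path and port name are assumptions about the chart's health endpoint;
# tune the timings to how long your database migration actually takes.
livenessProbe:
  httpGet:
    path: /healthz
    port: http
  initialDelaySeconds: 60   # delay before the first probe runs
  periodSeconds: 10         # probe interval
  timeoutSeconds: 5
  failureThreshold: 10      # ~100s of failed probes before a restart
```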