Handling Blue-Green Deployment Issues with Effect Cluster and ECS
Hey! Question about blue-green deployments:
We are using Effect Cluster in production with ECS and SST for deployments. Our setup includes a shard manager service and multiple runner services deployed as separate ECS tasks (very close to how https://github.com/sellooh/effect-cluster-via-sst does it).
During blue-green deployments, we're experiencing a critical issue where the cluster becomes non-functional:
1. New shard manager deploys and starts up
2. Existing runners remain connected to the OLD shard manager instance
3. Old shard manager terminates as part of blue-green → runners get deregistered
4. Runners DON'T attempt to re-register with the new shard manager
5. Database shows zero registered runners, cluster is broken
The issue only happens when runners outlive the shard manager replacement. If runners happen to start fresh after the new shard manager is ready, everything works perfectly.
So here's the question:
Is automatic re-registration expected behavior when runners lose their shard manager connection? Or do we need to handle this scenario differently?
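If we do need to handle it ourselves, the pattern we'd reach for is an application-level retry around registration, so a runner that loses its shard manager keeps attempting to re-register with backoff until the new one is reachable. This is a framework-agnostic sketch under our own assumptions, not Effect Cluster's actual API — `register` stands in for whatever performs runner registration:

```typescript
// Hypothetical re-registration helper (NOT Effect Cluster's real API):
// retries `register` with exponential backoff until it succeeds or the
// attempt budget is exhausted, and reports how many attempts were used.
async function registerWithBackoff(
  register: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await register(); // e.g. connect + register with the shard manager
      return attempt;
    } catch {
      if (attempt === maxAttempts) throw new Error("registration gave up");
      // back off: 100ms, 200ms, 400ms, ... before the next attempt
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  return maxAttempts; // unreachable, keeps the type checker happy
}
```

In practice the delay cap and attempt budget would need to cover the blue-green cutover window, so runners keep retrying for at least as long as a shard manager replacement takes.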
Note:
We tried adding `dependsOn: [shardManagerService]` and `wait: true` to the runner service config, but this doesn't actually solve the problem, since it doesn't wait for the OLD shard manager to fully terminate before the runners connect.
Should runners inherently handle re-registering when they lose connection, or is there a recommended pattern for handling shard manager replacements during blue-green deployments?
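For context, the wiring we tried looks roughly like this (a simplified sketch — service names are illustrative and the exact option placement may differ across SST versions). The point is that `wait: true` only ensures the NEW shard manager is stable before runners deploy; it says nothing about the old task draining:

```typescript
// sst.config.ts (illustrative excerpt, not our exact config)
const shardManagerService = new sst.aws.Service("ShardManager", {
  cluster,
  wait: true, // wait until the NEW shard manager reaches steady state
});

new sst.aws.Service("Runner", {
  cluster,
  wait: true,
  // Orders runner deployment after the new shard manager exists, but
  // during a blue-green replacement the OLD shard manager task is still
  // running, so already-started runners stay attached to it until it
  // terminates — and then never re-register.
  dependsOn: [shardManagerService],
});
```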
