Handling Blue-Green Deployment Issues with Effect Cluster and ECS
Hey! Question about blue-green deployments:
We are using Effect Cluster in production with ECS and SST for deployments. Our setup includes a shard manager service and multiple runner services deployed as separate ECS tasks (very close to how https://github.com/sellooh/effect-cluster-via-sst does it).
During blue-green deployments, we're experiencing a critical issue where the cluster becomes non-functional:
1. New shard manager deploys and starts up
2. Existing runners remain connected to the OLD shard manager instance
3. Old shard manager terminates as part of blue-green → runners get deregistered
4. Runners DON'T attempt to re-register with the new shard manager
5. Database shows zero registered runners, cluster is broken
The issue only happens when runners outlive the shard manager replacement. If runners happen to start fresh after the new shard manager is ready, everything works perfectly.
So here's the question:
Is automatic re-registration expected behavior when runners lose their shard manager connection? Or do we need to handle this scenario differently?
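If we do need to handle it ourselves, the pattern we'd reach for is an application-level retry around registration, so a runner that loses its shard manager keeps attempting to re-register with backoff until the new one is reachable. This is a framework-agnostic sketch under our own assumptions, not Effect Cluster's actual API — `register` stands in for whatever performs runner registration:

```typescript
// Hypothetical re-registration helper (NOT Effect Cluster's real API):
// retries `register` with exponential backoff until it succeeds or the
// attempt budget is exhausted, and reports how many attempts were used.
async function registerWithBackoff(
  register: () => Promise<void>,
  maxAttempts = 5,
  baseDelayMs = 100,
): Promise<number> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      await register(); // e.g. connect + register with the shard manager
      return attempt;
    } catch {
      if (attempt === maxAttempts) throw new Error("registration gave up");
      // back off: 100ms, 200ms, 400ms, ... before the next attempt
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** (attempt - 1)));
    }
  }
  return maxAttempts; // unreachable, keeps the type checker happy
}
```

In practice the delay cap and attempt budget would need to cover the blue-green cutover window, so runners keep retrying for at least as long as a shard manager replacement takes.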
Note:
We tried adding `dependsOn: [shardManagerService]` and `wait: true` to the runner service config, but this doesn't actually solve the problem, since it doesn't wait for the OLD shard manager to fully terminate before the runners connect.
Should runners inherently handle re-registering when they lose connection, or is there a recommended pattern for handling shard manager replacements during blue-green deployments?
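For context, the wiring we tried looks roughly like this (a simplified sketch — service names are illustrative and the exact option placement may differ across SST versions). The point is that `wait: true` only ensures the NEW shard manager is stable before runners deploy; it says nothing about the old task draining:

```typescript
// sst.config.ts (illustrative excerpt, not our exact config)
const shardManagerService = new sst.aws.Service("ShardManager", {
  cluster,
  wait: true, // wait until the NEW shard manager reaches steady state
});

new sst.aws.Service("Runner", {
  cluster,
  wait: true,
  // Orders runner deployment after the new shard manager exists, but
  // during a blue-green replacement the OLD shard manager task is still
  // running, so already-started runners stay attached to it until it
  // terminates — and then never re-register.
  dependsOn: [shardManagerService],
});
```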
