How Giant Swarm Live‑Migrated Hundreds of Kubernetes Clusters to Cluster API (Without Downtime)


If you’ve ever tried to migrate “just one” production Kubernetes cluster, you know the universe immediately responds with: that’s cute.

Now scale that up to hundreds of clusters, across enterprise customer environments, with no downtime, no data loss, and without the luxury of “we’ll just rebuild it and restore from backup.” That’s the story Giant Swarm tells in its April 1, 2026 blog post Live migrating hundreds of Kubernetes clusters to Cluster API (written by The Team @ Giant Swarm, and based on a talk by Joe Salisbury at KCD UK 2025).

This article is my expanded, independently researched take on what happened, why it matters to the broader platform engineering world, and what lessons other teams can steal (legally, ethically, and preferably without forking security products in the middle of the night).

What Giant Swarm actually migrated (and why it’s non-trivial)

Giant Swarm operates managed Kubernetes for enterprise customers, often in the customers’ own cloud accounts. Their “product” isn’t merely a cluster; it’s the whole platform experience: provisioning, upgrades, lifecycle management, security posture, and the ability to repeat that reliably across many customer environments.

Historically, Giant Swarm did what many early Kubernetes vendors and internal platform teams did: they built their own cluster management system. In their case it was a set of operators/controllers plus a REST API layer, with provider-specific controllers such as an aws-operator that reconciled a cluster custom resource into AWS infrastructure via CloudFormation.

That homegrown stack worked—until it became the kind of success that turns into operational debt: every new Kubernetes release, every provider change, every edge case in cluster lifecycle became something they had to own forever.

Cluster API: the upstream “we should all stop reinventing this” project

Cluster API (CAPI) is a Kubernetes subproject that provides declarative APIs and tooling to provision and manage Kubernetes clusters using Kubernetes-style resources (Clusters, Machines, etc.). The basic idea is delightfully recursive: you run CAPI controllers in a management cluster, and they create and manage workload clusters for you.

In CAPI’s model, the “core” API is infrastructure-agnostic, while infrastructure providers (AWS, Azure, etc.) implement the actual cloud integrations. For AWS specifically, that’s Cluster API Provider AWS (CAPA).

Giant Swarm’s realization was blunt and practical: their custom system and Cluster API were solving essentially the same problem with similar topology, and the open-source ecosystem was maturing fast.

Why migrate at all? Three motivations that will sound familiar

Giant Swarm frames the decision to adopt Cluster API around three core drivers:

  • Reduce maintenance burden of their own controllers (especially around new Kubernetes releases).
  • Gain features from a growing upstream ecosystem rather than building everything in-house.
  • Speed up new infrastructure provider support. They note that under the old system, standing up a new provider could take roughly six months.

That last one matters more than it sounds. Provider support isn’t just a “checkbox feature.” It’s a multi-year commitment: new instance types, load balancer quirks, IAM edge cases, API deprecations, network primitives, and—if you operate for regulated enterprises—an endless parade of security and compliance requirements. Anything you can outsource to upstream without losing differentiation is a gift.

Why live migration instead of “blue/green” replacement?

In Kubernetes land, one of the most popular upgrade and migration strategies is still “build a new cluster, move workloads.” It’s often safer than in-place upgrades, and it provides a rollback path—if you can afford it.

Giant Swarm evaluated three approaches for their AWS fleet:

  • A manual migration CLI
  • An automated migration operator
  • A blue/green cluster migration (spin up replacement clusters and move workloads)

They chose the CLI, describing it as practical for their fleet size and easier to iterate on when things broke. They rejected blue/green for a very enterprise reason: some customers had severely constrained IP address space, making it impractical to double production clusters even temporarily.

That’s a detail worth underlining. Platform engineering decisions frequently get summarized as “engineering preference,” but the real constraint is often something unglamorous: IP allocation, routing rules, firewall change processes, or the fact that a particular VPC was designed in 2016 and has the address space of a small apartment.

Where they started: a custom operator stack and per-region management clusters

Giant Swarm’s original architecture (as described in the post) looked like:

  • A REST API layer
  • Provider-specific operators/controllers
  • Cloud infrastructure below

On AWS, the aws-operator (dating back to February 2017) watched a cluster custom resource and reconciled it into CloudFormation stacks, effectively “compiling” desired cluster state into AWS resources.

They also used the common multi-cluster pattern: one management cluster per cloud region or data center, each managing multiple workload clusters, with no shared infrastructure between management clusters (and therefore no shared infrastructure between customers).

From a blast-radius perspective, that’s conservative—and in enterprise managed services, conservative is usually code for “we have scars.”

The migration strategy in one sentence: move cluster ownership without changing cluster identity

The core challenge wasn’t “create new clusters with Cluster API.” That part is well-documented and widely practiced. The challenge was: take an existing running production cluster and move it under a different management plane without breaking identity, networking, DNS, certificates, or customer expectations.

Giant Swarm decided not to mix old-system resources and Cluster API resources in the same management cluster, to simplify CRD and operator management. Instead, each workload cluster would be migrated from its old management cluster to a new Cluster API management cluster, transforming resources along the way.

If you’ve used upstream Cluster API tooling, you’ll recognize the theme: moving cluster-defining objects between management clusters is a known concept. For example, clusterctl move exists to move Cluster API objects from one management cluster to another, and it pauses reconciliation during the move to prevent controllers from racing each other.

Giant Swarm’s case is harder because they weren’t moving CAPI objects between CAPI management clusters; they were moving from a custom system to CAPI while keeping the workload clusters live.

The two phases of the migration mechanics

Giant Swarm breaks the actual migration mechanics into two phases:

Phase 1: Custom resource and secret migration (the “paperwork” phase)

This phase included:

  • Fetching all cluster custom resources from the old system’s management cluster
  • Migrating secrets (more on that in a moment)
  • Stopping reconciliation on old controllers via a pause annotation
  • Generating equivalent Cluster API custom resources
  • Applying them to the Cluster API management cluster

In other words: get the new management plane to understand the cluster, without yet disrupting the live control plane and worker nodes.
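The Phase 1 flow can be sketched in a few lines. This is a minimal illustration, not Giant Swarm’s actual CLI: the legacy resource shape and the pause annotation key are hypothetical, and the generated CAPI objects are trimmed to the fields that matter for the handoff.

```python
# Sketch of the Phase 1 "paperwork" handoff: pause the legacy resource
# so its old controller stops reconciling, then derive Cluster API-style
# resources from it. Resource shapes here are illustrative only.

PAUSE_ANNOTATION = "example.io/pause-reconciliation"  # hypothetical key

def pause_old_resource(resource: dict) -> dict:
    """Annotate the legacy resource so the old controller ignores it."""
    annotations = resource.setdefault("metadata", {}).setdefault("annotations", {})
    annotations[PAUSE_ANNOTATION] = "true"
    return resource

def to_capi_resources(legacy: dict) -> tuple[dict, dict]:
    """Translate a legacy cluster object into a CAPI Cluster + AWSCluster pair."""
    name = legacy["metadata"]["name"]
    namespace = legacy["metadata"].get("namespace", "default")
    infra = {
        "apiVersion": "infrastructure.cluster.x-k8s.io/v1beta2",
        "kind": "AWSCluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {"region": legacy["spec"]["region"]},
    }
    cluster = {
        "apiVersion": "cluster.x-k8s.io/v1beta1",
        "kind": "Cluster",
        "metadata": {"name": name, "namespace": namespace},
        "spec": {
            "infrastructureRef": {
                "apiVersion": infra["apiVersion"],
                "kind": infra["kind"],
                "name": name,
            }
        },
    }
    return cluster, infra

legacy = {"metadata": {"name": "prod-1"}, "spec": {"region": "eu-west-1"}}
paused = pause_old_resource(legacy)
cluster, aws_cluster = to_capi_resources(paused)
```

In the real flow, the generated objects would then be applied to the Cluster API management cluster, along with the migrated secrets.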

Phase 2: Node transition (the “replace the engine while driving” phase)

This is where live migration becomes a contact sport. Giant Swarm describes three major subproblems:

  • Control plane transition (hard)
  • etcd transition (harder, because it’s always etcd)
  • Worker node transition (comparatively straightforward)

They started with an HA etcd cluster across three control plane nodes. When the first Cluster API control plane node came up, its etcd member joined the existing cluster. Then Cluster API’s logic removed the old nodes it didn’t recognize. More CAPI control plane nodes joined, resulting in a fully CAPI-managed etcd cluster with the same data on new nodes.

This lines up with what many kubeadm-based HA workflows support: joining and removing etcd members is a known (if delicate) operation. Kubernetes documentation even includes a kubeadm reset phase specifically for removing a local etcd member on a control plane node.
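The add-then-remove choreography is easy to get wrong, so it helps to state the invariant explicitly: a new member joins before any old member leaves, and membership never drops below the original HA size. A toy simulation of that invariant (a pure model, no real etcd involved):

```python
# Simulate replacing an HA etcd membership one node at a time:
# join a new member first, then remove one old member, and assert
# the membership never shrinks below the original HA size.

def replace_members(members: list[str], new_members: list[str]) -> list[list[str]]:
    """Return every intermediate membership state of the rollout."""
    members = list(members)
    original_size = len(members)          # e.g. 3 for a 3-node HA cluster
    history = [list(members)]
    for new in new_members:
        members.append(new)               # new member joins first...
        history.append(list(members))
        old = next(m for m in members if m not in new_members)
        members.remove(old)               # ...then one old member leaves
        history.append(list(members))
        assert len(members) >= original_size  # invariant: never under-sized
    return history

states = replace_members(
    ["old-a", "old-b", "old-c"],
    ["capi-a", "capi-b", "capi-c"],
)
final = states[-1]
```

The real operation also has to wait for each new member to sync and stay inside quorum limits while doing so; the sketch only captures the ordering discipline.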

For the Kubernetes API control plane itself, they kept API availability stable by adding new CAPI control plane nodes into the cluster’s existing AWS ELB target set as they became ready, keeping healthy targets throughout. Once the CAPI control plane was running, they stopped control plane components on old nodes, drained, and deleted them.
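The same add-before-remove discipline applies to the load balancer’s target set. A small sketch of the rotation (hypothetical node names, no real AWS API calls) that keeps healthy targets available throughout:

```python
# Sketch of the API server load balancer rotation: every new control
# plane node is registered as a target before any old target is
# withdrawn, so the healthy target count never drops below the start.

def rotate_targets(targets: set[str], old: list[str], new: list[str]) -> list[int]:
    """Swap old targets for new ones; return the target count over time."""
    counts = [len(targets)]
    for node in new:                 # register every new node first
        targets.add(node)
        counts.append(len(targets))
    for node in old:                 # only then deregister old nodes
        targets.discard(node)
        counts.append(len(targets))
    return counts

targets = {"old-1", "old-2", "old-3"}
counts = rotate_targets(
    targets,
    old=["old-1", "old-2", "old-3"],
    new=["capi-1", "capi-2", "capi-3"],
)
```

In production you would also gate each registration on the node actually passing health checks before counting it; the sketch assumes new targets are healthy immediately.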

Workers were done nodepool-by-nodepool: create new CAPI workers, then drain and delete old ones. Their CLI supported configurable batch sizes for constrained IP environments.
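Batch sizing under IP pressure is simple arithmetic, but it is worth making explicit: each in-flight batch temporarily needs one spare address per new node. A sketch of how such a planner might split a nodepool (the function and its behavior are illustrative, not Giant Swarm’s CLI):

```python
# Sketch of nodepool batch planning under IP constraints: each batch
# creates new nodes first, then drains and deletes the same number of
# old nodes, so a batch can never be larger than the spare IP headroom.

def plan_batches(worker_count: int, free_ips: int) -> list[int]:
    """Split a nodepool rollout into batch sizes that fit spare IPs."""
    if free_ips < 1:
        raise ValueError("need at least one spare IP to roll a nodepool")
    batch = min(worker_count, free_ips)
    batches = []
    remaining = worker_count
    while remaining > 0:
        size = min(batch, remaining)
        batches.append(size)   # +size new nodes, then -size old nodes
        remaining -= size
    return batches

# 10 workers with only 3 spare IPs -> four batches: [3, 3, 3, 1]
batches = plan_batches(10, 3)
```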

The outcome they claim: networking and DNS were preserved throughout, and workloads weren’t interrupted.

The wildest part: “Forking Vault for fun and certificates”

Every migration has its “this is why my hair turned gray” moment. In Giant Swarm’s narrative, it’s PKI.

Under the old system, Giant Swarm used HashiCorp Vault for PKI. Provider operators issued certificates, distributed them to nodes, and used a separate PKI root per cluster. During migration, they wanted to preserve cluster identity, which required access to the same certificate root—specifically the root CA signing key.

The problem: Vault doesn’t provide an API to extract that root signing key (for good reason). Giant Swarm’s solution was, essentially: fork Vault, add an API route that bypasses the security model, extract the certificate material, then feed it into kubeadm on the Cluster API side. They swapped in their patched Vault before each migration batch, pulled the cert material, then restored normal operation.

There are two important takeaways here:

  • Migrations often require one-off, controlled violations of your own rules. The trick is to do it deliberately, temporarily, and with as much auditability as possible.
  • Identity is sticky. Cluster identity isn’t just DNS names. It’s certificates, trust roots, service account issuers, and the expectations of every component that has ever cached a credential.

That last point shows up again in Giant Swarm’s post-migration documentation. Their guidance for clusters migrated from “vintage” to Cluster API discusses keeping multiple service account token issuers for a time, because old tokens need validation while new tokens should be issued under the new issuer configuration.
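The dual-issuer idea can be modeled in a few lines: new tokens are always minted under the new issuer, while validation accepts both old and new. The issuer URLs below are hypothetical. (In kube-apiserver terms, this corresponds to passing --service-account-issuer more than once: the first value is used for signing new tokens, and all values are accepted during validation.)

```python
# Sketch of the dual-issuer window after migration: tokens minted under
# the old issuer stay valid while new tokens carry the new issuer, so
# workloads with cached credentials keep working. URLs are hypothetical.

NEW_ISSUER = "https://new.example/cluster-1"                      # signs new tokens
ACCEPTED_ISSUERS = [NEW_ISSUER, "https://old.example/cluster-1"]  # both validate

def issue_token(subject: str) -> dict:
    """New tokens are always minted under the new issuer."""
    return {"iss": NEW_ISSUER, "sub": subject}

def validate_token(token: dict) -> bool:
    """A token passes if its issuer is any currently accepted issuer."""
    return token.get("iss") in ACCEPTED_ISSUERS

old_token = {"iss": "https://old.example/cluster-1",
             "sub": "system:serviceaccount:ns:sa"}
new_token = issue_token("system:serviceaccount:ns:sa")
```

Once every workload has re-fetched a token under the new issuer, the old issuer can be dropped from the accepted list, closing the window.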

Context: why this story matters beyond Giant Swarm

It’s tempting to file this under “cool vendor blog post,” nod appreciatively, and go back to arguing about Ingress vs Gateway API. But the story is bigger:

  • Cluster API is no longer a science project. Major operators are betting their managed services on it, which is a strong signal about maturity and ecosystem momentum.
  • Platform teams are consolidating on declarative lifecycle management. The same pattern that made Kubernetes popular for apps (“declare desired state; controllers reconcile”) is being extended to clusters themselves.
  • Multi-cluster operations are the norm, not the edge case. Whether for isolation, compliance, latency, or organizational boundaries, many enterprises run dozens to hundreds of clusters. And yes, that means upgrades and migrations need industrial tooling, not heroics.

Enterprise reality check: why “no downtime” is so hard to deliver

“No downtime” is often used casually, as if it’s a checkbox. In practice, it means:

  • The Kubernetes API server endpoint stays reachable
  • etcd quorum stays healthy
  • Node replacement doesn’t violate PodDisruptionBudgets or overload remaining nodes
  • Networking and DNS remain consistent
  • Certificates and authentication flows don’t break during transition

Giant Swarm’s approach touches several of these explicitly: they preserved networking and DNS, carefully managed control plane target membership in the ELB, and treated etcd membership as the high-wire act it is.

It’s also a reminder that “live migration” here doesn’t mean moving running containers between clusters in the VM live migration sense. It means live-moving cluster management ownership while keeping workloads running.

Comparisons: Cluster API vs custom controllers vs managed Kubernetes

Custom controllers (Giant Swarm’s original approach)

Pros:

  • Maximum control over architecture and workflows
  • Tight integration with your product and release model

Cons:

  • You own every edge case forever
  • Kubernetes release cadence becomes your release cadence
  • Provider changes can become existential

The fact that Giant Swarm’s aws-operator dates back to 2017 is both impressive and a hint at how long they carried that burden.

Cluster API + provider implementations (their new approach)

Pros:

  • Upstream community investment and shared maintenance
  • Consistent APIs across providers
  • Growing ecosystem of tooling (including move semantics and lifecycle workflows)

Cons:

  • Not a drop-in replacement; maturity varies by provider
  • You may still need to add enterprise-grade hardening, integration, or operational guardrails

Giant Swarm explicitly notes that Cluster API wasn’t a drop-in replacement and required “a thousand small features” worth of productionization work.

Fully managed Kubernetes (EKS/GKE/AKS)

Some readers will ask: “Why not just use EKS and call it a day?”

For many organizations, that’s a good answer. But managed services don’t eliminate platform work; they change it. You still deal with:

  • Multi-account IAM and identity
  • Network and IP design
  • Add-ons, policy enforcement, and compliance
  • Multi-cluster fleet operations

And if your business model is “we deliver a consistent platform stack across customer environments,” you may need a control plane that is consistent across clouds, regions, and customer accounts—not just whatever the cloud provider’s managed service happens to support.

What broke (implicitly) and what to watch out for if you try this

Giant Swarm’s post promises “what broke along the way,” but the most useful insights are often the structural ones rather than a list of individual bugs. Here are the big risk zones their story highlights:

1) Secrets and PKI are migration-critical

If your cluster identity changes, everything that depends on trust chains can go sideways. Giant Swarm’s Vault fork anecdote is extreme, but it’s a symptom of a common truth: certificate roots are the keys to the kingdom, and migrations often need continuity.

2) etcd is a quorum-based mood swing

Any strategy that replaces control plane nodes needs a plan for etcd membership and quorum. kubeadm provides documented mechanisms for removing etcd members during reset flows, which is helpful, but operationally it’s still an area where “one wrong step” can become a very long day.

3) Load balancers are the real control-plane API surface

By treating the AWS ELB target set as the stability anchor—adding new control plane nodes as targets before removing old ones—Giant Swarm aligned with a practical reality: for most clients, the control plane is “whatever is behind the LB DNS name.” Keep that stable, and you have a chance.

4) IP space constraints can kill otherwise “clean” designs

Blue/green migrations are elegant until you meet the enterprise VPC that can’t spare the extra addresses. Giant Swarm’s CLI batching to handle constrained IP environments is a great example of building for the real world.

Organizational lessons: migrations are not just an engineering project

Two organizational choices in the Giant Swarm story stand out:

The “hive sprint”

Early on, Giant Swarm ran a month-long internal “hive sprint” where they suspended normal structures and had the whole company hack on Cluster API. They didn’t finish the migration in that month, but they credit it with accelerating organizational understanding, surfacing blockers, and kickstarting architectural discussions.

This is a useful pattern when you’re changing the foundation of your product. If only one small team “gets it,” everything else becomes friction: sales promises the old world, support documents the old world, and engineering quietly tries to replace the engine mid-flight.

Splitting maintenance from the migration mandate

They initially had one team doing both old-system maintenance and Cluster API development. Maintenance won (because production always wins). Eventually they split into two teams: one in maintenance mode building the “capstone” release that enabled migration, and one with a clear mandate to make Cluster API migration-ready.

If you’re a platform leader reading this: that’s your hint. If you don’t structurally protect the migration work, it will lose to today’s pager noise every time.

Implications for the Cluster API ecosystem (and for the rest of us)

Giant Swarm’s successful migration has a few implications worth paying attention to:

  • Validation of CAPI as a long-term control plane for fleet management. When a managed Kubernetes provider adopts it at scale, it’s a confidence signal for the ecosystem.
  • More upstream contributions and enterprise hardening. Giant Swarm explicitly talks about contributing what they needed upstream and productionizing CAPI for enterprise use.
  • Pressure on other vendors and internal platforms. If your differentiation is “we have custom cluster lifecycle tooling,” the market is telling you that might not be differentiation for long.

Also, the moment you stop maintaining bespoke cluster controllers, you free up engineering time for things customers actually notice. Giant Swarm says the reclaimed capacity is being invested into higher-value initiatives like hybrid edge/industrial IoT platforms and agentic AI platform work.

Practical guidance: questions to ask before you attempt a similar migration

If you’re considering a move to Cluster API (or any new cluster lifecycle system), here’s a checklist inspired by the Giant Swarm story and upstream realities:

  • What must remain stable? API endpoints, DNS names, VPCs/subnets, load balancers, certificate roots, OIDC issuers, etc.
  • How will you pause old reconciliation? You need a clean handoff to prevent dueling controllers. (CAPI tooling uses pausing semantics during object moves; mimic the principle even if your source system isn’t CAPI.)
  • How will you preserve identity? If your trust roots or service account issuers change, you need a staged strategy to validate old tokens while issuing new ones.
  • What is your etcd strategy? Node replacement patterns must keep quorum and data integrity intact.
  • Do you have IP headroom? If not, you’ll need batching or alternative migration patterns, and blue/green may be a non-starter.
  • How will you test at scale? The hard bugs show up at “hundreds of clusters” not “one staging cluster.”
  • Who owns the work? If it’s “everyone when they have time,” it’s no one.
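The “pause old reconciliation” item deserves a concrete shape. Cluster API marks resources with the cluster.x-k8s.io/paused annotation so its controllers skip them; a non-CAPI source system needs an equivalent flag that its own controllers honor. A minimal sketch of the pattern:

```python
# Sketch of a clean-handoff guard: before touching a cluster's objects,
# mark them paused so the source system's controllers stop reconciling.
# The reconcile() below stands in for any well-behaved controller loop.

PAUSED = "cluster.x-k8s.io/paused"  # annotation CAPI uses for this idea

def pause(obj: dict) -> dict:
    """Set the paused annotation on a resource."""
    obj.setdefault("metadata", {}).setdefault("annotations", {})[PAUSED] = "true"
    return obj

def is_paused(obj: dict) -> bool:
    return obj.get("metadata", {}).get("annotations", {}).get(PAUSED) == "true"

def reconcile(obj: dict) -> str:
    """A well-behaved controller skips paused objects entirely."""
    if is_paused(obj):
        return "skipped"
    return "reconciled"

cluster = {"metadata": {"name": "prod-1"}}
before = reconcile(cluster)          # normal operation
after = reconcile(pause(cluster))    # paused for the handoff
```

The important property is that pausing is checked at the top of every reconcile loop, so a migration tool can rely on it taking effect before it starts mutating state.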

A quick note on “move” tooling and why Giant Swarm used a CLI

In upstream Cluster API, moving cluster-defining resources between management clusters is often done using clusterctl move. The command exists specifically to migrate Cluster API objects and pauses reconciliation to prevent conflicts.

Giant Swarm didn’t simply run clusterctl move because their source objects weren’t Cluster API objects. Their decision to write a dedicated CLI makes sense: they needed transformation logic, secret handling, and provider-specific handoffs. A CLI is also easier to iterate on than a full automation operator when you’re discovering new failure modes weekly.

So… should you migrate to Cluster API?

If you operate more than a handful of clusters, or you need consistent lifecycle management across environments, Cluster API is increasingly the default answer—especially if you don’t want your team to be the only ones on Earth maintaining your cluster lifecycle logic.

But Giant Swarm’s experience is also a warning label:

  • You will not get this “for free.” Productionizing upstream tooling for enterprise fleets requires serious engineering and operational discipline.
  • You need a migration plan that respects identity and constraints. The best architecture diagram loses to an IP shortage and a root CA you can’t extract.
  • You need organizational alignment. Their hive sprint and team split are the kind of moves that make “impossible” projects merely “painful.”

Still, the payoff can be huge: less bespoke maintenance, faster provider support, and more time spent on the platform features customers actually buy.

Sources

  • Giant Swarm, “Live migrating hundreds of Kubernetes clusters to Cluster API,” April 1, 2026 (by The Team @ Giant Swarm, based on a talk by Joe Salisbury at KCD UK 2025)

Bas Dorland, Technology Journalist & Founder of dorland.org