Running EKS on a /26: Why We Ripped Out the AWS VPC CNI for Cilium

We swapped out the CNI on a production EKS cluster in GovCloud. Sounds like a one-line change. Was not a one-line change.

The setup: AWS GovCloud, a hard compliance posture, and a subnet allocation from the networking team that was, frankly, comically small. A /26. Sixty-four IP addresses total, fifty-nine usable once you back out AWS’s five reserved per subnet, to run an entire Kubernetes cluster on.
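The arithmetic behind that number, as a small sketch. The 10.0.0.0/26 network below is a placeholder (our real subnet is not shown here); only the prefix length matters.

```python
import ipaddress

# AWS reserves five addresses in every subnet: the network address,
# the VPC router, DNS, one held for future use, and the broadcast address.
AWS_RESERVED_PER_SUBNET = 5


def usable_addresses(cidr: str) -> int:
    """Addresses AWS leaves assignable in a subnet of the given CIDR."""
    subnet = ipaddress.ip_network(cidr)
    return subnet.num_addresses - AWS_RESERVED_PER_SUBNET


# Placeholder CIDR, chosen only because it is a /26.
print(usable_addresses("10.0.0.0/26"))  # 59
```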

Why /26 breaks EKS

GovCloud isn’t fundamentally different from commercial AWS at the API level, but the operational reality is. Services lag behind, some features don’t exist at all, and the surrounding network and compliance posture tends to be tighter because the customers operating there have rules they cannot bend.

On EKS with the default AWS VPC CNI, every pod gets a real, routable VPC IP. So do the nodes. So do load balancer ENIs, NAT gateways, and a handful of internal AWS-managed interfaces. With 59 usable addresses, the math falls apart immediately.

On EKS, instance size doesn’t just determine your CPU and memory budget. It determines your maximum pod count, because the AWS VPC CNI hands out pod IPs from the secondary IP slots on a node’s ENIs. An m5.large supports 3 ENIs with 10 IPs each. That’s 30 total, minus the primary IP on each ENI, leaving 27 secondary IPs available for pods. (AWS’s published max-pods value for that instance type is 29, which adds 2 back for hostNetwork pods that don’t consume an ENI IP.) Even with memory to spare, you hit a hard ceiling on pod density that is purely a function of IP availability.
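The same calculation as a short sketch. The function name and the default allowance of 2 are illustrative; the formula is the one described above.

```python
def max_pods(enis: int, ips_per_eni: int, host_network_allowance: int = 2) -> int:
    """Pod ceiling under the AWS VPC CNI in secondary-IP mode.

    Each ENI gives up its primary IP, the remaining slots go to pods,
    and a small allowance is added for hostNetwork pods.
    """
    secondary_ips = enis * (ips_per_eni - 1)
    return secondary_ips + host_network_allowance


# m5.large: 3 ENIs with 10 IPv4 addresses each.
print(max_pods(3, 10))     # 29
print(max_pods(3, 10, 0))  # 27 secondary IPs, before the allowance
```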

In our /26 world, that was catastrophic. We weren’t running out of CPU or memory. We were running out of addresses, and there was no instance type we could pick that would help, because every IP a node consumed came out of the same 59-address pool that everything else needed.
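A rough way to see why instance type does not help: count how many subnet addresses a single node holds once all of its ENI slots are in use. This is a simplified sketch. It ignores load balancers and other consumers of the subnet, and the VPC CNI's warm-pool settings can shift real consumption, so treat the numbers as an upper-bound illustration rather than a measurement from our cluster.

```python
USABLE_ADDRESSES = 59  # the /26 after AWS's five reserved addresses

# (ENIs, IPv4 addresses per ENI) as published by AWS for each instance type.
INSTANCE_LIMITS = {
    "m5.large": (3, 10),
    "m5.xlarge": (4, 15),
}


def subnet_cost_at_full_density(enis: int, ips_per_eni: int) -> int:
    """Subnet addresses one node holds when every ENI slot is in use.

    Primary IPs count too: they come out of the same subnet.
    """
    return enis * ips_per_eni


for name, (enis, per_eni) in INSTANCE_LIMITS.items():
    cost = subnet_cost_at_full_density(enis, per_eni)
    full_nodes = USABLE_ADDRESSES // cost
    print(f"{name}: {cost} addresses per full node, "
          f"{full_nodes} full node(s) fit in {USABLE_ADDRESSES}")
```

Under these assumptions a single fully packed m5.large eats 30 of the 59 addresses, and a fully packed m5.xlarge would need 60, which is more than the subnet has.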

Why Cilium

The way out was simple in the abstract: stop using VPC IPs for pods. If pods live in a synthetic address space the VPC doesn’t have to know about, the /26 only has to be big enough for nodes and load balancers, both of which we could count on two hands.

This is what overlay CNIs are for. The two we seriously considered were Calico in IPIP mode and Cilium. We picked Cilium for the eBPF-based dataplane, the network policy and Hubble observability story, and the fact that it had clear momentum as a project. The compliance side of the house cared about the policy primitives. I cared about making sure it… worked.

The plan was Cilium in VXLAN overlay mode with Cluster Pool IPAM, a private /16 for pods that had no relationship to the VPC CIDR, and Cilium handling masquerading on egress so external traffic looked like it was coming from the node’s real IP.
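Before committing to a pod pool, it is worth a sanity check that it really has no relationship to the VPC, and a look at how many per-node blocks it yields. The sketch below uses the overlay /16 mentioned later in this post; the VPC CIDR is a made-up placeholder, and the /24 per-node block size is an assumption (I believe it matches Cilium's cluster-pool default, but verify against your version).

```python
import ipaddress
from typing import Iterable


def check_pod_pool(pod_cidr: str, vpc_cidrs: Iterable[str],
                   per_node_mask: int = 24) -> int:
    """Return how many per-node blocks the pool yields.

    Raises ValueError if the pool overlaps any VPC range, since an
    overlapping overlay would shadow real VPC destinations.
    """
    pool = ipaddress.ip_network(pod_cidr)
    for cidr in vpc_cidrs:
        if pool.overlaps(ipaddress.ip_network(cidr)):
            raise ValueError(f"pod pool {pod_cidr} overlaps VPC range {cidr}")
    if per_node_mask < pool.prefixlen:
        raise ValueError("per-node block is larger than the pool itself")
    return 2 ** (per_node_mask - pool.prefixlen)


# "10.20.0.0/26" is a hypothetical VPC subnet, not our real one.
print(check_pod_pool("10.244.0.0/16", ["10.20.0.0/26"]))  # 256
```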

The migration

Switching CNIs on a live cluster is not a supported operation in any meaningful sense. The AWS VPC CNI runs as a DaemonSet, has already assigned IPs to every pod in the cluster, and has populated kernel routing tables on every node with rules tied to those IPs. You cannot just kubectl apply Cilium on top and walk away.

Draining nodes to force pod re-creation

I wrote a script that walked the cluster node by node, cordoned each one, drained the pods off it, and then either let it recycle or replaced it outright. The goal was to force every workload to be re-scheduled and re-IP’d under Cilium’s IPAM rather than the VPC CNI’s. For most application workloads this worked cleanly. They came back on overlay IPs and were happy.
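The original script is not reproduced here. What follows is a minimal sketch of the same cordon-and-drain loop, assuming kubectl is on the PATH and already pointed at the right cluster. The node names, the 10-minute timeout, and the function names are all hypothetical, and the node replacement step is left out because it depends on how your node groups are managed. It defaults to a dry run that only prints commands.

```python
import subprocess
from typing import Callable, Iterable, List

Runner = Callable[[List[str]], None]


def node_commands(node: str, timeout: str = "10m") -> List[List[str]]:
    """kubectl invocations to cordon and then drain a single node."""
    return [
        ["kubectl", "cordon", node],
        ["kubectl", "drain", node,
         "--ignore-daemonsets",       # DaemonSet pods are not evicted
         "--delete-emptydir-data",    # accept losing emptyDir contents
         f"--timeout={timeout}"],
    ]


def dry_run(cmd: List[str]) -> None:
    """Print the command instead of running it."""
    print("DRY RUN:", " ".join(cmd))


def real_run(cmd: List[str]) -> None:
    """Run the command and stop the whole walk on the first failure."""
    subprocess.run(cmd, check=True)


def roll_nodes(nodes: Iterable[str], run: Runner = dry_run) -> None:
    """Walk the nodes one at a time so only one is out of service at once."""
    for node in nodes:
        for cmd in node_commands(node):
            run(cmd)
        # Recycling or replacing the node would happen here.


# Hypothetical node names; pass run=real_run to actually execute.
roll_nodes(["node-a", "node-b"])
```

Keeping the runner injectable makes it easy to review the exact commands before letting the script touch a production cluster.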

Manually killing system components

Some pods didn’t cooperate. CoreDNS was the worst offender. It would come back up after a drain, but with stale iptables rules and endpoint references pointing at IPs from the old IPAM, and cluster DNS would silently break. The fix, ugly but reliable, was to delete the CoreDNS pods (and sometimes kube-proxy) outright and let them be recreated fresh. A few other system components needed the same treatment. There is no document anywhere that lists exactly which ones. You find out by watching what breaks.
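One way to shortlist candidates, sketched under assumptions: take the JSON that kubectl get pods -A -o json produces and flag any pod that is not on hostNetwork and whose IP is outside the overlay CIDR. This only catches pods still holding an old address. It will not catch a pod that has a fresh overlay IP but stale state around it, which was part of the CoreDNS problem, so it narrows the search rather than replacing it. The sample data below is invented for illustration.

```python
import ipaddress
from typing import List


def pods_off_overlay(pod_list: dict, overlay_cidr: str) -> List[str]:
    """Return namespace/name for pods whose IP is outside the overlay.

    hostNetwork pods are skipped because they are expected to carry
    the node's VPC address.
    """
    overlay = ipaddress.ip_network(overlay_cidr)
    flagged = []
    for pod in pod_list.get("items", []):
        spec = pod.get("spec", {})
        ip = pod.get("status", {}).get("podIP")
        if spec.get("hostNetwork") or not ip:
            continue
        if ipaddress.ip_address(ip) not in overlay:
            meta = pod["metadata"]
            flagged.append(f'{meta["namespace"]}/{meta["name"]}')
    return flagged


# Invented sample in the shape of `kubectl get pods -A -o json`.
sample = {"items": [
    {"metadata": {"namespace": "kube-system", "name": "coredns-old"},
     "spec": {}, "status": {"podIP": "10.20.0.17"}},
    {"metadata": {"namespace": "kube-system", "name": "coredns-new"},
     "spec": {}, "status": {"podIP": "10.244.1.5"}},
    {"metadata": {"namespace": "kube-system", "name": "kube-proxy-x"},
     "spec": {"hostNetwork": True}, "status": {"podIP": "10.20.0.9"}},
]}
print(pods_off_overlay(sample, "10.244.0.0/16"))  # ['kube-system/coredns-old']
```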

We kept kube-proxy rather than enabling Cilium’s kubeProxyReplacement. On AL2 with cgroup v1, flipping that on would have meant per-node kubelet changes on top of the CNI swap, plus removing the EKS-managed kube-proxy add-on at the same time. This was not the migration to layer that on top of.

hostNetwork: true as a transition crutch

Nobody warns you about this part. During the transition window, some pods are on the old IP space and some are on the new, and routing between them is not clean. Anything that had to keep working through the cutover (ingress controllers, monitoring agents, log forwarders) got moved to hostNetwork: true so it rode on the node’s actual VPC IP and bypassed the CNI question entirely.
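A sketch of what that move can look like as a kubectl patch. The workload kind, name, and namespace are hypothetical. Setting dnsPolicy to ClusterFirstWithHostNet is my addition, not something described above: without it, a hostNetwork pod resolves names with the node's resolver instead of cluster DNS, which may or may not be what you want.

```python
import json
from typing import List


def host_network_patch() -> str:
    """Strategic-merge patch that moves a workload's pods onto the host network."""
    patch = {
        "spec": {
            "template": {
                "spec": {
                    "hostNetwork": True,
                    # Assumption: keep cluster DNS resolution for these pods.
                    "dnsPolicy": "ClusterFirstWithHostNet",
                }
            }
        }
    }
    return json.dumps(patch)


def patch_command(kind: str, name: str, namespace: str) -> List[str]:
    """Build the kubectl invocation; nothing is executed here."""
    return ["kubectl", "-n", namespace, "patch", kind, name,
            "--type=strategic", "-p", host_network_patch()]


# Hypothetical workload, printed for review rather than run.
print(" ".join(patch_command("daemonset", "log-forwarder", "logging")))
```

Keep in mind that hostNetwork pods share the node's port space, so two of them binding the same port on one node will conflict.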

The Cilium agent itself runs with hostNetwork: true for a deeper reason. It can’t depend on pod networking to start, because it is what provides pod networking. Same chicken-and-egg with kube-proxy and a few other infra-tier components. Once you internalize the pattern, you see it everywhere. Anything that has to be running while the network is being assembled has to live outside the network it’s assembling.

Some of those hostNetwork: true workloads stayed that way permanently. If it works and it’s not causing problems, there’s no urgency to migrate it back.

Egress and masquerading

The last gotcha was external connectivity. Pods in the new overlay (10.244.0.0/16 in our case) needed to reach external services. AWS APIs, internal corporate endpoints, the usual. But the VPC has no idea that overlay CIDR exists. Without proper masquerading, pods send SYN packets out and the return traffic has nowhere to go.

The fix was tuning Cilium’s masquerading so that any traffic leaving the cluster got SNAT’d to the node’s real VPC IP. enable-ipv4-masquerade: true turned masquerading on. ipv4-native-routing-cidr, set to the overlay /16, drew the boundary: traffic destined outside the overlay got SNAT’d, traffic inside it was left alone. Until you get that right, you see one-way connectivity issues that are genuinely confusing to debug if you don’t know to look for them.
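The boundary rule can be modeled in a few lines. This is a deliberately simplified mental model of the decision, not Cilium's actual datapath logic, which has more cases than a single CIDR check. The VPC and external addresses below are placeholders.

```python
import ipaddress


def should_masquerade(dst_ip: str, native_routing_cidr: str) -> bool:
    """True if traffic to dst_ip would be SNAT'd to the node IP.

    Simplified model: destinations inside the native-routing CIDR are
    left alone, everything else is masqueraded.
    """
    dst = ipaddress.ip_address(dst_ip)
    return dst not in ipaddress.ip_network(native_routing_cidr)


OVERLAY = "10.244.0.0/16"

print(should_masquerade("10.244.3.17", OVERLAY))   # False: pod to pod, stays in the overlay
print(should_masquerade("10.20.0.12", OVERLAY))    # True: hypothetical VPC address
print(should_masquerade("203.0.113.10", OVERLAY))  # True: external (documentation range)
```

If a destination you expected to be SNAT'd comes back False in a model like this, the CIDR boundary is the first thing to check.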

If you’re about to do this

A few things that I think generalize:

Overlay networking is the right answer when you’re IP-starved. Don’t waste calendar time trying to negotiate a bigger VPC if the org won’t bend. Decouple pod IPs from VPC IPs and move on.
If you can do this as blue/green with a fresh cluster, do it. A live cutover is significantly more painful than spinning up an adjacent cluster and migrating workloads to it.
Expect to manually intervene with CoreDNS, kube-proxy, and a handful of other system components. Drain-and-recycle gets you 90% of the way. The rest needs hands.
hostNetwork: true is fine. Use it for anything that needs to keep working while the network is in flux, and clean up later if you feel like it.
Get masquerading right before you cut traffic over. The external egress issues are non-obvious at first.

“Swapped a CNI” doesn’t read like much on a resume. The actual work is in the dozens of small, environment-specific problems that don’t appear in any documentation, and the only reason any of this is written down is that I want the next person staring at a /26 in GovCloud to spend less time figuring out which CoreDNS pod to kill than I did.