Why NoSNAT Breaks Pod Egress in GKE Autopilot


I recently spent several hours debugging what initially looked like a very simple networking problem in GKE Autopilot.

A virtual machine in my VPC could reach an internal system without any issue. The system lived in the 172.x.x.x network and had been working fine for years. However, pods running inside a GKE Autopilot cluster in the same VPC could not reach it at all.

There were no errors. No obvious firewall denies. No dropped packets. The connections from the pods just hung until they timed out. At first glance everything looked correct, which made the issue extremely confusing.

The root cause turned out to be a single Kubernetes object.

```yaml
apiVersion: networking.gke.io/v1
kind: EgressNATPolicy
spec:
  action: NoSNAT
  destinations:
  - 10.0.0.0/8
```
An EgressNATPolicy had been configured with action: NoSNAT for the entire 10.0.0.0/8 range. That one line was enough to silently break all pod egress to the internal network.

The Setup That Looked Perfect

The environment was straightforward. The VPC subnets were in the 172.x.x.x range. Firewall rules allowed traffic from those subnets. A GKE Autopilot cluster was deployed in the same VPC, but its pod CIDR was in the 10.x.x.x range.

The VM lived in 172.x.x.x, so when it connected to the internal system, the source IP was trusted and everything worked.

The pods, however, had IPs like 10.92.1.15. Under normal circumstances in Autopilot, that is not a problem, because the pod IP is never exposed to the network. Kubernetes performs source NAT so that external systems see the node IP instead.

But that mechanism had been disabled.

What IP Masquerading Actually Does

IP masquerading in Kubernetes is simply source NAT. When a pod sends traffic outside the cluster, Kubernetes rewrites the source IP from the pod IP to the node IP.

So instead of the network seeing:

10.92.1.15 → 172.30.4.10

It sees:

172.20.5.7 → 172.30.4.10
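The rewrite above is the whole mechanism. Here is a minimal Python sketch of that per-packet decision — illustrative only, since the real masquerading is implemented as iptables rules on the node; the function name and IPs are examples, not anything from GKE itself:

```python
import ipaddress

def egress_source_ip(pod_ip, node_ip, dest_ip, no_snat_cidrs):
    """Decide which source IP the network sees for pod egress traffic.

    If the destination falls inside a configured no-SNAT range, the packet
    keeps the pod's source IP; otherwise it is rewritten to the node IP.
    """
    dest = ipaddress.ip_address(dest_ip)
    if any(dest in ipaddress.ip_network(cidr) for cidr in no_snat_cidrs):
        return pod_ip   # NoSNAT: the network sees the raw pod IP
    return node_ip      # SNAT: the network sees the trusted node IP

# Default behavior: pod traffic is masqueraded to the node IP.
print(egress_source_ip("10.92.1.15", "172.20.5.7", "172.30.4.10", []))
# → 172.20.5.7

# With a broad NoSNAT range, any destination inside it sees the pod IP.
print(egress_source_ip("10.92.1.15", "172.20.5.7", "10.20.0.9", ["10.0.0.0/8"]))
# → 10.92.1.15
```

The second call is exactly the failure mode in this incident: the destination matched a NoSNAT range, so the packet left the cluster with an IP the rest of the network had never heard of.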

This is absolutely critical in enterprise environments. Firewalls almost always trust subnets, not individual pod ranges. Routing tables almost never include dynamic pod CIDRs. Return paths are built around known infrastructure networks.

SNAT is what makes Kubernetes compatible with real-world networks.

Without it, pods behave like unknown devices on random IP ranges.

Why This Failed So Quietly

Once NoSNAT was applied, pods started leaving the cluster with their real IP addresses. The internal system saw traffic coming from 10.x.x.x and had no idea what to do with it.

From Kubernetes’ point of view, the packets were sent successfully. From the network’s point of view, the source IP was untrusted and unroutable. So return traffic never came back.

Nothing was explicitly blocked. No firewall rule fired. The TCP connections simply hung.

This is the most dangerous type of failure: everything is technically working, but the system is architecturally broken.

Why Autopilot Makes This Worse

In standard GKE you can sometimes get away with routed pod networks. You can control node pools, advertise routes, tune the ip-masq-agent, and integrate with on-prem routing.
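In standard GKE, that tuning happens through the ip-masq-agent ConfigMap. A sketch of what a safely scoped configuration looks like — the CIDR values are illustrative, and none of this applies to Autopilot:

```yaml
# Standard GKE only: ip-masq-agent reads this ConfigMap; destinations in
# nonMasqueradeCIDRs keep the pod source IP, everything else is SNATed.
apiVersion: v1
kind: ConfigMap
metadata:
  name: ip-masq-agent
  namespace: kube-system
data:
  config: |
    nonMasqueradeCIDRs:
    - 10.92.0.0/14     # the cluster's own pod range, not all of 10.0.0.0/8
    resyncInterval: 60s
    masqLinkLocal: false
```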

Autopilot gives you none of that control.

You do not manage the nodes. You do not control the pod routing fabric. You cannot inject custom routes. You cannot monitor low-level network behavior.

Autopilot is built on a very strong assumption: pods are not first-class network citizens. Nodes are.

Once you disable SNAT, you violate that assumption, and you inherit all the complexity of building a real routed network without having any of the tools to manage it.

The One-Line Fix

The fix was embarrassingly simple.

We deleted the EgressNATPolicy.
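If you hit the same symptom, the check and the fix are each one kubectl command — the policy name below is a placeholder; list first to find yours:

```shell
# List EgressNATPolicy objects (networking.gke.io API group).
kubectl get egressnatpolicies.networking.gke.io

# Delete the offending policy; "no-snat-internal" is a hypothetical name.
kubectl delete egressnatpolicy no-snat-internal
```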

Immediately pods started SNATing again. The internal systems saw traffic from trusted 172.x.x.x node IPs. Return paths worked. Everything recovered instantly.

No redeployments. No restarts. No code changes.

One object caused the entire incident.

When NoSNAT Actually Makes Sense

Disabling SNAT is not inherently wrong. It is required in some advanced architectures: service mesh gateways, on-prem Kubernetes with BGP, full pod-level identity models.

But those environments have full control over routing, firewall policy, network observability, and security boundaries.

GKE Autopilot is the opposite of that. It is designed for simplicity, not network engineering experiments.

Using broad RFC1918 ranges like 10.0.0.0/8 in NoSNAT effectively turns your cluster into a routed network device. Most enterprise networks are not built to support that, and Autopilot does not give you the knobs to make it safe.
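If a NoSNAT exception is genuinely required, scope it as narrowly as possible. A hypothetical policy in the same shape as the one that caused this incident, limited to a single /24 — the name and CIDR are illustrative, so check the EgressNATPolicy schema for your GKE version:

```yaml
apiVersion: networking.gke.io/v1
kind: EgressNATPolicy
metadata:
  name: no-snat-partner-subnet  # illustrative name
spec:
  action: NoSNAT
  destinations:
  - 10.45.8.0/24  # only the one subnet that must see pod IPs
```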

The Real Lesson

This incident was not caused by a bug. It was caused by a wrong mental model.

I treated GKE Autopilot like a traditional Kubernetes cluster with advanced networking capabilities. It is not. It is a managed abstraction layer where pods live in a private bubble and nodes represent the cluster to the outside world.

The correct model for Autopilot is simple:

Pods behave like VMs.
Nodes are the only network identity.
SNAT is not optional — it is fundamental.

Once you internalize that, most Autopilot networking problems become trivial to reason about.

And once you forget it, a single YAML object can take down your entire platform without leaving a single error behind.

Want help with GKE or platform engineering?

If you're working on GKE or platform engineering at scale, I help teams design production-grade cloud and DevSecOps platforms.