
The Reminiscences of a Government Kubernetes Operator

I was twenty-eight years old the first time a cluster tried to kill me.

Not literally, of course. The cluster was just doing what clusters do when they smell blood. But at 2:14 a.m. on a wet Tuesday in a windowless SCIF somewhere in the Home Counties, with the citizen tax portal bleeding red across every monitoring dashboard in the kingdom, it felt personal. The kind of personal that makes a man question every life choice that led him to this terminal.

I had been a cloud architect for exactly eleven months. Green as grass, cocky as a new trader with his first margin account. I thought I knew Kubernetes. I could recite the pod lifecycle in my sleep. Then the real market, the UK government cloud, opened its books and showed me what a margin call actually looks like when the stakes are not money but national service availability and a possible career-ending data-sovereignty breach.

Let me take you back to that night.

The alerts hit like a short squeeze. Worker nodes ghosting the API server. Pods evaporating. The control plane had gone strangely, terrifyingly quiet. I punched in the first command any sane operator reaches for, kubectl get nodes, and the terminal spat back a wall of 403 Forbidden. Not one node. Every single one. My stomach dropped the way it does when you watch your entire position reverse in the final fifteen minutes of trading.

I sat there in the half-dark, heart hammering, and did the only thing a man can do when the market turns against him: I stopped. I forced myself to remember the first rule I had learned the hard way, months earlier, during my baptism in the CrashLoopBackOff.

That lesson had come on a smaller deployment, but the terror was the same. A brand-new microservice, fresh from the CI pipeline, hit the cluster and immediately began its death spiral. CrashLoopBackOff. The pod would start, panic, die, wait a little longer, try again. Kubernetes wasn’t being cruel; it was being wise. Exponential backoff is the system’s way of saying, “Son, you flooded the engine. Stop turning the key every three seconds or you’ll burn the whole car down.” The control plane was protecting itself from the very thing I was trying to force-feed it, broken code that would have consumed every CPU cycle on the node if left unchecked.
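
If you have never watched that death spiral from the terminal, it reads something like this. The pod and namespace names are invented for the telling:

    # The RESTARTS column climbs while the status flips between
    # Error and CrashLoopBackOff, each retry further apart.
    kubectl get pods -n citizen-portal --watch

    # The panic itself lives in the logs of the *previous* attempt.
    kubectl logs tax-calculator-7d4b9 -n citizen-portal --previous

    # And the Events section records every lengthening backoff interval.
    kubectl describe pod tax-calculator-7d4b9 -n citizen-portal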

But in the government cloud, the diagnosis was never simple. My pod wasn’t alone in its container. There was the Istio sidecar enforcing mTLS, the Fluent Bit sidecar shipping logs, and the security scanner that never slept. kubectl logs without the -c flag? Pointless. The command would fail with the polite indifference of a bouncer who has seen your kind before. And even if I got the right container, direct log access was often forbidden by RBAC. Logs had to flow through the immutable pipeline to Splunk or Elasticsearch. No shortcuts. No “just this once.”
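
For the record, the dance goes roughly like this; the container names are the usual suspects in a mesh like ours, not gospel:

    # First, find out who is actually living in the pod.
    kubectl get pod tax-calculator-7d4b9 -n citizen-portal \
      -o jsonpath='{.spec.containers[*].name}'
    # app istio-proxy fluent-bit

    # Then ask for the container you actually mean.
    kubectl logs tax-calculator-7d4b9 -n citizen-portal -c app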

I learned to love that constraint the way a trader learns to love the tape. It forced clarity.

The next killer I met was ImagePullBackOff.

Imagine you are trying to get a package delivered to a classified facility. The courier is holding a perfectly innocent box of paper clips. Doesn’t matter. If that courier isn’t on the approved list, he never gets past the gate. That is exactly what happens when your kubelet tries to reach Docker Hub from inside the air-gapped perimeter. The cluster doesn’t throw a helpful “network error.” It simply sits there in ImagePullBackOff, looking innocent while your deployment dies.

I once spent forty panicked minutes chasing phantom network issues before I remembered the gospel according to the hardened registry. Harbor. Scanned at every layer. Signed with cosign. Admission controllers ready to reject anything that smelled of the public internet. The fix was never “just whitelist Docker Hub.” That would have been punching a hole clean through the zero-trust perimeter. Instead, I learned to check the imagePullSecrets on the pod spec, verify the VPC endpoint routing, and make sure the worker nodes could actually speak to the internal registry without crossing the wrong subnet. Supply-chain security is not paranoia when the alternative is inviting state actors to ride your own scale-up events straight into the heart of the citizen database.
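
The triage, once I finally learned it, fits on an index card. Registry, pod, and secret names below are placeholders for whatever your enclave blesses:

    # The events name the image, the registry, and the exact refusal.
    kubectl describe pod pdf-generator-5f6c8 -n citizen-portal

    # Does the pod actually reference a pull secret?
    kubectl get pod pdf-generator-5f6c8 -n citizen-portal \
      -o jsonpath='{.spec.imagePullSecrets[*].name}'

    # Does that secret point at the internal Harbor, not docker.io?
    kubectl get secret harbor-creds -n citizen-portal \
      -o jsonpath='{.data.\.dockerconfigjson}' | base64 -d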

Then there was the night the pods simply refused to schedule. Pending. Perfectly healthy images, flawless manifests, and yet the scheduler kept them in limbo like a hotel clerk who has rooms but won’t give you the key.

I ran kubectl describe pod and scrolled, as you must always scroll, to the Events section at the very bottom, where the scheduler keeps its diary. Zero of five nodes available. Five with insufficient CPU. Or worse: the taint that said “this node is for TOP SECRET only” and my pod had no toleration. Affinities and taints are not architectural showing-off. They are the biometric scanners on every internal door. Cross-contamination is not a risk; it is a career-ender.
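
The diary entry, and the toleration that answers it, look something like this. The taint key is a stand-in for whatever your classification scheme uses:

    # From the Events section of kubectl describe pod:
    #   0/5 nodes are available: 3 node(s) had untolerated taint
    #   {classification: top-secret}, 2 Insufficient cpu.

    # The matching toleration in the pod spec:
    tolerations:
    - key: "classification"
      operator: "Equal"
      value: "top-secret"
      effect: "NoSchedule"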

I learned to read the scheduler the way Livermore read the tape: not for what it said, but for what it refused to say.

But the invisible killer, the one that still wakes me up sometimes, was networking.

Pods healthy. CPU fine. Memory stable. And yet nothing talked to anything. The classic invisible wall.

I had a service. I had endpoints. Or so I thought. One missing hyphen in a label selector and the controller manager wrote an empty Endpoints object. Kube-proxy read zero rules into iptables and the packets simply evaporated. No error. No log. Just silence.
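
The five-minute check that would have saved me hours, with service and label names invented for illustration:

    # What does the Service think it selects?
    kubectl get svc tax-api -n citizen-portal -o jsonpath='{.spec.selector}'

    # What labels do the pods actually carry?
    kubectl get pods -n citizen-portal --show-labels

    # If the two disagree, this comes back empty, and so does the wire.
    kubectl get endpoints tax-api -n citizen-portal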

Then came CoreDNS. The search domains. The query amplification. One innocent database lookup fanning out into five UDP queries down the search list, the namespace domain, svc.cluster.local, cluster.local, until the CoreDNS pods in kube-system were drowning in their own receive queues. I learned to spin up an Alpine debug pod and run nslookup before I touched anything else. Because in a government cluster, DNS failure masquerades as application failure the way a margin call masquerades as bad luck.
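
The ritual, roughly, assuming your hardened registry carries an image with DNS tooling; alpine here is a stand-in:

    # A throwaway pod, gone the moment you exit.
    kubectl run dns-debug --rm -it --image=alpine -- sh

    # Inside: the search domains driving the amplification...
    cat /etc/resolv.conf
    #   search citizen-portal.svc.cluster.local svc.cluster.local cluster.local
    #   options ndots:5

    # ...and a fully qualified name (note the trailing dot) that
    # skips the search list entirely.
    nslookup tax-db.citizen-portal.svc.cluster.local.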

And always, always, the network policies. Default deny. Every pod isolated by design. Ingress rule, egress rule, or the packet dies at the kernel level. Zero trust is not a slogan when the alternative is lateral movement from a compromised PDF generator straight into the citizen database.
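
Default deny, in its entirety, is barely ten lines of YAML; everything after it must be an explicit grant. The namespace name is illustrative:

    apiVersion: networking.k8s.io/v1
    kind: NetworkPolicy
    metadata:
      name: default-deny-all
      namespace: citizen-portal
    spec:
      podSelector: {}        # selects every pod in the namespace
      policyTypes:
      - Ingress
      - Egress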

I remember the night the entire control plane locked me out.

kubectl get events returned nothing but 403s. My first instinct was panic: the API server is dead, etcd has lost quorum, the cluster is toast. Then I forced myself to run kubectl auth can-i get events --namespace=default. The answer came back: no.

The zero-trust architecture was simply doing its job. My OIDC token had expired. Or my ClusterRoleBinding was too narrow. The system wasn’t broken; I was unauthorized. In a startup you blast through with cluster-admin and fix it later. In government you stop, refresh credentials through the proper bastion host, and document every step like your pension depends on it. Because it does.
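
The two commands that separate panic from diagnosis, before you touch anything else:

    # Is it me, or is it the cluster?
    kubectl auth can-i get events --namespace=default
    # no

    # With fresh credentials: what am I actually allowed to do?
    kubectl auth can-i --list --namespace=default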

Once inside the perimeter I learned the real discipline: never exfiltrate logs to your laptop. That stack trace might contain unredacted PII. Downloading it turns an outage into a reportable breach. Instead you authenticate to the jump box, query the centralized SIEM from inside the enclave, and keep the data where it belongs.

And when the running pod was misbehaving, HTTP 500s, dropped packets, the works, the old instinct screamed kubectl exec -it. Every security briefing I ever sat through, the source material of my nightmares, called it the golden rule violation. Exec mutates the immutable. It contaminates the sealed artifact. Admission controllers like OPA Gatekeeper or Kyverno kill the request before it ever reaches the node. Good. Because production containers are operating rooms. You do not walk in wearing muddy boots.
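
For the curious, the shape of such a guardrail in Kyverno is only a few lines. This is a sketch of the pattern, not our production policy:

    apiVersion: kyverno.io/v1
    kind: ClusterPolicy
    metadata:
      name: deny-pod-exec
    spec:
      validationFailureAction: Enforce
      background: false
      rules:
      - name: deny-exec
        match:
          any:
          - resources:
              kinds:
              - PodExecOptions    # the exec subresource itself
        validate:
          message: "exec into production pods is forbidden; use kubectl debug."
          deny: {}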

The solution that saved me more nights than I can count was the ephemeral container, the sterile observation bubble. You attach it through the pod’s ephemeralcontainers sub-resource, which is what kubectl debug drives under the hood. It joins the pod’s network namespace, and, when you target a container, its process namespace too, so you can run tcpdump, strace, dnsutils, everything you need, but it keeps its own isolated mount namespace. You see everything. You change nothing. The production container remains pristine. The audit trail stays immaculate. Security is happy. Engineers are no longer flying blind at 3 a.m.
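
The invocation that replaced exec in my muscle memory. The debug image is whatever your registry holds; netshoot is a common choice, named here as an assumption:

    # An ephemeral container sharing the pod's network namespace and,
    # via --target, the process namespace of the app container.
    kubectl debug -it tax-calculator-7d4b9 -n citizen-portal \
      --image=registry.internal/netshoot:latest \
      --target=app -- sh

    # Inside: tcpdump, strace, dig. Observe everything. Mutate nothing.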

I have watched clusters die for every reason the textbooks warn about and a few they don’t. I have felt the cold sweat when the events section of kubectl describe tells you the real story. I have learned that self-healing is a beautiful lie told by level-triggered controllers. The controller manager will keep trying to reconcile desired state with actual state until it runs out of fuel, exactly like the thermostat that keeps blasting the furnace while the window stays broken. The events are the breadcrumbs. Ignore them at your peril.
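
And when you want the breadcrumbs in order, oldest grievance first:

    # The controllers' confession, sorted chronologically.
    kubectl get events --all-namespaces --sort-by='.lastTimestamp'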

So here I sit, years later, a little greyer, a lot more cautious, staring at another pristine-but-broken deployment. The pods are green. The metrics are flat. And somewhere in the dark, two microservices are refusing to speak.

I smile now. Because I know the checklist. I know the order. I know the discipline.

And I know that the cluster is not the enemy.

The cluster is the market.

It is always telling you the truth, if you have the stomach to listen.

And if you don’t… well, the 403 will be waiting for you at 2:14 a.m., right when the nation needs its tax portal to work.

That is the only lesson that ever mattered.
