Spent the better part of the day debugging sporadic network failures in a Kubernetes cluster.
TIL:
- k8s uses lots of iptables magic under the hood.
- iptables has a mechanism to apply rules based on probability and that’s how k8s does load balancing (e.g., if you have a service that points to several pods): https://man.archlinux.org/man/iptables-extensions.8#statistic
- The root cause of our sporadic failures was stale iptables rules: some of them pointed to pods that no longer existed (but because probabilities are involved, they didn’t always trigger).
- This isn’t Sparta, this is madness. And probably a k8s bug.
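To make the probability trick above concrete: kube-proxy emits one `statistic`-module rule per backend, each tried in order with probability 1/(number of remaining backends), which works out to a uniform 1/n chance per pod. A small sketch of that selection logic (the rule comments and pod names are illustrative, not copied from a real cluster):

```python
import random
from collections import Counter

# For a service with three backends, kube-proxy generates rules shaped
# roughly like this (probabilities shown rounded):
#   -m statistic --mode random --probability 0.33333 -j KUBE-SEP-...  # pod 1
#   -m statistic --mode random --probability 0.50000 -j KUBE-SEP-...  # pod 2
#   -j KUBE-SEP-...                                                   # pod 3, catch-all
# Rules are evaluated top to bottom; a packet that "misses" one rule
# falls through to the next.

def pick_backend(pods, rng=random.random):
    """Pick a pod the way the cascading iptables rules would."""
    n = len(pods)
    for i, pod in enumerate(pods[:-1]):
        # Probability 1/(n - i): 1/3, then 1/2 of the remaining traffic, ...
        if rng() < 1.0 / (n - i):
            return pod
    return pods[-1]  # final rule has no statistic match: it always fires

counts = Counter(pick_backend(["pod-a", "pod-b", "pod-c"]) for _ in range(30_000))
print(counts)  # each pod should land near 10_000
```

A stale rule is then easy to picture: one of those `KUBE-SEP` jumps still targets a deleted pod, so roughly 1/n of connections fail while the rest succeed, which is exactly the sporadic pattern we saw.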