Kubernetes
Index
- 1. Introduction to Kubernetes
- 2. Kubernetes Architecture
- 3. Setting Up Kubernetes
- 4. Workload Resources
- 5. Configuration and Secrets Management
- 6. Networking
- 7. Storage
- 8. Scheduling and Resource Management
- 9. Security
- 10. Observability
- 11. Helm and Package Management
- 12. CI/CD and GitOps
- 13. Service Mesh
- 14. Extending Kubernetes
- 15. Multi-Cluster and Federation
- 16. Cluster Operations and Maintenance
- 17. Advanced Topics
1. Introduction to Kubernetes
Kubernetes is the de facto standard for deploying, scaling, and managing containerized applications in production environments. This chapter builds the conceptual foundation you need before touching a single kubectl command — covering why container orchestration exists, where Kubernetes came from, and the core vocabulary and mental models that underpin everything else in the platform.
1.1 Container Orchestration Fundamentals
Container orchestration is the discipline of automating the lifecycle of containers at scale. Before diving into Kubernetes specifically, this subchapter explains the problem space it was built to solve and how we arrived here historically.
What is Container Orchestration
Theory
A container is a lightweight, isolated unit of software that packages an application together with everything it needs to run: its runtime, libraries, and configuration. Running one container on one machine is trivial. Running hundreds or thousands of containers across dozens of machines — reliably, efficiently, and continuously — is not.
Container orchestration is the automated management of that complexity. An orchestration system is responsible for:
- Deciding where each container runs (scheduling)
- Ensuring containers that crash are restarted (self-healing)
- Distributing network traffic across healthy container instances (load balancing)
- Scaling the number of containers up or down based on demand
- Rolling out updates without causing downtime
- Providing containers a way to discover and communicate with each other
A useful analogy: think of an orchestrator as an air traffic controller. Individual planes (containers) know how to fly, but the controller manages the overall system — assigning runways (nodes), redirecting traffic around problems, and ensuring the whole airport (cluster) operates without collision.
Example
Without orchestration, a simple deployment might look like this manual process:
# Without orchestration: manually SSH into each machine and run containers
ssh user@server-01 "docker run -d --name api-v2 my-api:2.0"
ssh user@server-02 "docker run -d --name api-v2 my-api:2.0"
ssh user@server-03 "docker run -d --name api-v2 my-api:2.0"
# If server-02 goes down, you must notice it yourself and intervene manually
ssh user@server-04 "docker run -d --name api-v2 my-api:2.0"
With orchestration (Kubernetes), you declare what you want and the system handles it:
# With orchestration: declare desired state, let the system manage reality
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 3 # "I want 3 copies running at all times"
  selector:
    matchLabels:
      app: api
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: my-api:2.0
The orchestrator reads this file, places containers on available nodes, monitors them continuously, and replaces any that fail — without further human input.
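The "monitor and replace" behaviour described here can be pictured as a control loop that repeatedly compares desired state with observed state. The following is a minimal sketch of one pass of such a loop, not Kubernetes code: the function `reconcile`, the tuple-based pod list, and the node names are all hypothetical and exist only for illustration.

```python
import itertools

def reconcile(desired_replicas, pods, healthy_nodes, new_names):
    """One pass of a control loop: compare desired state with observed state, then act.

    pods          -- list of (pod_name, node_name) tuples currently known
    healthy_nodes -- set of node names that are still reachable
    new_names     -- iterator that yields names for replacement pods
    """
    # Observe: pods on unreachable nodes are treated as lost.
    running = [(name, node) for name, node in pods if node in healthy_nodes]

    # Track how many pods each healthy node holds so replacements spread out.
    load = {node: 0 for node in healthy_nodes}
    for _, node in running:
        load[node] += 1

    # Act: start replacements on the least-loaded node until the count matches.
    while len(running) < desired_replicas and load:
        target = min(sorted(load), key=lambda n: load[n])
        running.append((next(new_names), target))
        load[target] += 1

    # Act: if there are too many pods, keep only the desired number.
    return running[:desired_replicas]


# Hypothetical cluster state: three pods, one per node.
pods = [("api-0", "node-a"), ("api-1", "node-b"), ("api-2", "node-c")]
names = (f"api-{i}" for i in itertools.count(3))

# node-b has failed; the loop replaces api-1 on a surviving node.
after_failure = reconcile(3, pods, {"node-a", "node-c"}, names)
print(after_failure)
```

The key design point is that the function never receives an instruction such as "restart api-1". It only receives the desired count and the observed state, and works out the difference itself, which is what makes the declarative model robust to failures nobody anticipated.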
Exercises
- (Beginner) Name three responsibilities of a container orchestration system without looking back at the text.
- (Beginner) In your own words, explain the difference between a container and a container orchestration system.
- (Intermediate) A web service runs as 10 container instances across 5 nodes. One node loses power. Describe, step by step, what an orchestration system should do automatically in response.
- (Interview) "Why can't we just use a shell script to restart failed containers?" — What limitations does this approach have at scale? (Hint: think about state tracking, distributed coordination, and network topology.)
Answers
- Any three from: scheduling, self-healing/restart, load balancing, scaling, rolling updates, service discovery. Full list: deciding where containers run, restarting crashed containers, distributing traffic, scaling up/down, rolling out updates, service discovery.
- A container packages an application and its dependencies into an isolated, portable unit. A container orchestration system manages the lifecycle of many containers across many machines — handling placement, scaling, health, and networking — so operators do not have to do this manually.
- The orchestrator detects that the node is unreachable (via health checks) and marks it as NotReady. It identifies the container instances that were running on the failed node (2 of them, assuming the 10 instances were spread evenly across the 5 nodes). It schedules 2 replacement instances on the remaining 4 healthy nodes. The load balancer is updated to remove the failed node's instances from the pool and add the new instances' IPs. End users experience little or no interruption, provided the 8 surviving instances can absorb the load while the replacements start.
- Shell scripts are single-point solutions: they run on one machine and have no awareness of other machines. Limitations include: (a) they cannot detect failures on remote hosts without complex polling logic; (b) they have no consistent view of global state across nodes; (c) they cannot make scheduling decisions based on available resources; (d) they do not handle network reconfiguration; (e) concurrent execution across nodes leads to race conditions; (f) they provide no audit trail or rollback mechanism.
Problems Solved by Orchestration
Theory
To appreciate orchestration, it helps to enumerate the concrete operational problems that arise when running containers in production without it.
| Problem | Without Orchestration | With Orchestration |
|---|---|---|
| Container crash | Manual restart by an operator | Automatic restart by the system |
| Node failure | Manual rescheduling to other machines | Automatic rescheduling |
| Traffic spikes | Manual scaling (more containers by hand) | Automatic horizontal scaling |
| Deployment update | Manual, often requires downtime | Rolling update with zero downtime |
| Service discovery | Hardcoded IPs or manual DNS updates | Dynamic service registry |
| Resource waste | Containers placed arbitrarily, nodes over/under-used | Bin-packing based on CPU/memory requests |
| Secret management | Passed as environment variables in scripts | Secrets stored centrally and injected at runtime (optionally encrypted at rest) |
| Configuration drift | Config files diverge across machines over time | Declarative config enforced continuously |
Each of these problems is manageable with one or two services. At hundreds of services and thousands of containers, manual management is not operationally viable.
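To make the "rolling update with zero downtime" row concrete, here is a small simulation of a surge-style rollout: new instances are started before old ones are retired, so the number of running instances never drops below the desired count. This is a sketch under simplifying assumptions (new instances are treated as healthy immediately), and `rolling_update` and its `max_surge` parameter are illustrative names, loosely modelled on the `maxSurge` setting of a Kubernetes Deployment.

```python
def rolling_update(old, new, replicas, max_surge=1):
    """Yield the running instance versions after every step of the rollout."""
    if max_surge < 1:
        raise ValueError("max_surge must be at least 1")
    running = [old] * replicas
    yield list(running)
    if old == new:
        return  # Nothing to roll out.
    while old in running:
        # Surge: start a batch of new instances before touching the old ones.
        batch = min(max_surge, running.count(old))
        running.extend([new] * batch)
        yield list(running)
        # Retire the same number of old instances (assumes the new ones are healthy).
        for _ in range(batch):
            running.remove(old)
        yield list(running)


for step in rolling_update("my-api:2.0", "my-api:2.1", replicas=3):
    print(len(step), step)
```

Every printed step contains at least three instances, which is the property that lets traffic keep flowing during the update. A real orchestrator would additionally wait for health checks to pass before retiring each old instance.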
Example
Consider the bin-packing problem. Without orchestration, you might manually assign containers to nodes without considering resource utilization:
Node A (8 CPU, 16GB RAM), placed manually:
[4 CPU / 4GB] [idle: 4 CPU, 12GB RAM]
Node B (8 CPU, 16GB RAM), placed manually:
[2 CPU / 4GB] [2 CPU / 8GB] [idle: 4 CPU, 4GB RAM]
Total waste: 8 idle CPUs and 16GB idle RAM across 2 nodes (one full node's worth)
An orchestrator uses resource requests to pack containers efficiently:
Node A (8 CPU, 16GB RAM) — after orchestrator placement:
[4 CPU / 4GB] [2 CPU / 4GB] [2 CPU / 8GB] <- fully utilized
Node B is now free for other workloads or can be decommissioned,
saving cloud infrastructure cost.
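The packing step can be sketched with a classic heuristic, first-fit decreasing: sort the requests from largest to smallest and put each one on the first node that still has room. This is an illustration of bin-packing in general, not the algorithm the Kubernetes scheduler uses (which filters and scores nodes), and the container names and the `Node` class are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    cpu_free: int
    mem_free_gb: int
    containers: list = field(default_factory=list)

def first_fit_decreasing(requests, nodes):
    """Place each (name, cpu, mem_gb) request on the first node with enough room.

    Returns the names of requests that did not fit anywhere.
    """
    unplaced = []
    # Largest requests first: big items are the hardest to fit later.
    ordered = sorted(requests, key=lambda r: (r[1], r[2]), reverse=True)
    for name, cpu, mem_gb in ordered:
        for node in nodes:
            if cpu <= node.cpu_free and mem_gb <= node.mem_free_gb:
                node.cpu_free -= cpu
                node.mem_free_gb -= mem_gb
                node.containers.append(name)
                break
        else:
            unplaced.append(name)  # No node had enough free CPU and memory.
    return unplaced


nodes = [Node("A", 8, 16), Node("B", 8, 16)]
requests = [("api", 4, 4), ("worker", 2, 4), ("cache", 2, 8)]
unplaced = first_fit_decreasing(requests, nodes)
for node in nodes:
    print(node.name, node.containers, node.cpu_free, node.mem_free_gb)
```

With these three requests everything lands on node A and node B stays empty, matching the placement in the diagram. The heuristic only works because each container declares its CPU and memory needs up front, which is why resource requests matter.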
Exercises
- (Beginner) What is "service discovery" and why does it become a problem at scale without orchestration?
- (Beginner) Match each problem to the orchestration feature that solves it: (a) container crash, (b) traffic spike, (c) deployment update. Features: rolling update, self-healing, autoscaling.
- (Intermediate) Explain "configuration drift" with a concrete example involving three servers and a configuration file.
- (Interview) Your team argues that since you only have 5 microservices, you don't need Kubernetes. What are the strongest arguments for and against adopting it at this scale? (Hint: consider operational overhead vs. future growth and team learning curve.)
Answers
- Service discovery is the mechanism by which one service finds the network address (IP and port) of another. Without orchestration, containers may be assigned dynamic IPs when restarted or moved. Hardcoding IPs breaks whenever a container moves. At scale, with hundreds of services restarting frequently, keeping IP mappings current manually is impossible. Orchestrators provide a stable DNS name or virtual IP that always resolves to healthy instances.
- (a) crash -> self-healing; (b) traffic spike -> autoscaling; (c) deployment update -> rolling update.
- Example: three servers initially all have config.yaml with log_level: INFO. An operator SSHes into server-1 to debug an issue and changes it to log_level: DEBUG, then forgets to revert. A week later, another operator SSHes into server-2 and adds a new database connection string. Server-3 remains at the original config. Now all three servers have different configurations, causing inconsistent behavior that is hard to diagnose. Orchestration solves this by enforcing a single source of truth: the declared config in the cluster's API is applied to all instances continuously.
- Arguments for: (a) Kubernetes skills are valuable and the team learns on a smaller, lower-risk system before it becomes critical; (b) growth from 5 to 50 services is easier if the infrastructure is already in place; (c) rolling updates, health checks, and self-healing are valuable even at small scale. Arguments against: (a) Kubernetes has significant operational overhead, since the control plane itself must be maintained; (b) simpler tools (Docker Compose, a single VM with systemd) may suffice; (c) debugging Kubernetes networking and scheduling adds cognitive complexity; (d) small teams may find the learning curve slows delivery.
Evolution from VMs to Containers to Orchestration
Theory
Modern container orchestration did not appear in isolation — it is the product of decades of progress in how we isolate and deploy software. Understanding this lineage clarifies why each layer of the stack exists.
Physical Servers (pre-2000s)
Early software ran directly on physical hardware. One application per server was common, because applications often conflicted in their library versions and OS dependencies. Utilization was poor — a server bought for peak load sat idle most of the time.
Virtual Machines (2000s)
Hypervisors (VMware, KVM, Xen) allowed a single physical machine to host multiple isolated virtual machines, each with its own OS. This improved utilization and allowed different applications with conflicting dependencies to coexist. However, VMs are heavyweight: each carries a full OS kernel (gigabytes of disk, hundreds of megabytes of RAM), and boot times are measured in minutes.
Containers (2010s)
Containers use Linux kernel features — namespaces (for isolation) and cgroups (for resource limiting) — to create isolated environments that share the host OS kernel. This eliminates the overhead of a full OS per unit: containers are megabytes in size and start in milliseconds. Docker (2013) made containers accessible to developers by providing a simple toolchain and a public image registry.
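One way to see that namespaces and cgroups are ordinary kernel features rather than container magic is to inspect them for any running process. The sketch below reads the standard Linux `/proc` entries for the current process; it assumes a Linux host with `/proc` mounted and returns empty results elsewhere. The helper names `namespace_ids` and `cgroup_memberships` are illustrative.

```python
import os
from pathlib import Path

def namespace_ids(pid="self"):
    """Map namespace type (e.g. 'pid', 'net', 'mnt') to its kernel identifier."""
    ns_dir = Path("/proc") / str(pid) / "ns"
    ids = {}
    if not ns_dir.is_dir():
        return ids  # Not Linux, no such process, or /proc is not mounted.
    try:
        entries = sorted(ns_dir.iterdir())
    except OSError:
        return ids
    for entry in entries:
        try:
            # Each entry is a symlink whose target names the namespace instance.
            ids[entry.name] = os.readlink(entry)
        except OSError:
            continue  # Entry not readable with current privileges.
    return ids

def cgroup_memberships(pid="self"):
    """Return the raw lines of /proc/<pid>/cgroup (empty list if unavailable)."""
    path = Path("/proc") / str(pid) / "cgroup"
    try:
        return path.read_text().splitlines()
    except OSError:
        return []


print(namespace_ids())
print(cgroup_memberships())
```

Running this on the host and again inside a container typically shows different namespace identifiers and a different cgroup path, while both processes are still served by the same kernel.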
Container Orchestration (mid-2010s onward)
As organizations adopted containers at scale, the need to manage them systematically became acute. Running 10 containers manually is feasible. Running 10,000 across a dynamic fleet of machines is not. Orchestration platforms emerged to automate the placement, scaling, networking, and lifecycle management of containers across clusters.
Timeline of the evolution:
Physical Server
|
| Problem: resource waste, conflicts, slow provisioning
v
Virtual Machine (VMware, KVM)
|
| Problem: heavyweight (full OS per VM), slow boot, large footprint
v
Container (Docker, runc, OCI standard)
|
| Problem: managing thousands of containers manually doesn't scale
v
Container Orchestration (Kubernetes, Docker Swarm, Nomad)
Example: Comparing Resource Footprints
+---------------------+---------------------+---------------------+
| Physical Server     | Virtual Machine     | Container           |
+---------------------+---------------------+---------------------+
| Full hardware       | Virtualized HW      | Process + libs      |
| One OS              | Own OS kernel       | Shared OS kernel    |
| Days to reprovision | Minutes to boot     | Milliseconds        |
| GB of overhead      | 1-4 GB overhead     | 10-200 MB image     |
| Near-zero density   | ~10-20 VMs/host     | 100s/host           |
+---------------------+---------------------+---------------------+
A concrete size comparison for running an nginx web server:
Full VM image running nginx: ~2,000 MB (OS + libraries + nginx)
Docker image (nginx:alpine): ~23 MB (just what nginx needs)
Startup time (VM): 60-120 seconds
Startup time (container): < 1 second
Exercises
- (Beginner) What Linux kernel features underpin container isolation? Name both and describe what each does.
- (Beginner) Why are VMs considered "heavyweight" compared to containers?
- (Intermediate) A company runs 500 VMs on a fleet of physical servers. An engineer proposes migrating to containers. What are three concrete benefits they should expect, and one risk they must manage?
- (Interview) "Containers are just lightweight VMs." Is this statement accurate? Explain the technical distinction. (Hint: consider the role of the OS kernel in each model.)
Answers
- Namespaces provide isolation: each container sees its own view of the file system (mount namespace), process tree (PID namespace), network interfaces (network namespace), and hostname (UTS namespace). cgroups (control groups) enforce resource limits: they cap how much CPU, memory, and disk I/O a container can consume, preventing one container from starving others. (Network bandwidth is typically shaped by separate traffic-control tooling rather than capped by cgroups directly.)
- VMs are heavyweight because each includes a full, independent OS kernel and OS userland (init system, system libraries, drivers). This typically adds 1–4 GB of disk space and hundreds of MB of RAM per VM, plus minutes of boot time as the kernel initializes. Containers share the host OS kernel, so they only include the application and its userspace dependencies.
- Benefits: (a) Higher density — a host that ran 20 VMs might run 200 containers, reducing infrastructure cost; (b) faster deployment — new container instances start in under a second versus minutes for a VM; (c) consistent environments — container images bundle all dependencies, eliminating "works on my machine" problems. Risk: (a) shared kernel security boundary — a kernel exploit could affect all containers on a host simultaneously, whereas VMs provide stronger isolation; teams must apply OS patches to the host promptly and consider additional sandboxing (gVisor, Kata Containers) for untrusted workloads.
- The statement is inaccurate. VMs virtualize hardware: each VM runs a complete, independent OS kernel on top of a hypervisor. The kernel itself is isolated. Containers do not virtualize the kernel — they share the host OS kernel and use kernel-level isolation primitives (namespaces, cgroups) to separate processes. This makes containers faster and lighter but means they are OS-kernel-sharing processes, not isolated machines. A kernel-level bug on the host affects all containers; an equivalent bug in a guest OS affects only that VM.
Kubernetes vs Docker Swarm vs Nomad
Theory
Multiple container orchestration platforms exist, each with different design philosophies and trade-offs. The three most widely discussed are Kubernetes, Docker Swarm, and HashiCorp Nomad.
Kubernetes
Originally developed at Google and inspired by its internal cluster manager, Borg, Kubernetes is the most feature-rich and widely adopted orchestrator. It is highly extensible via Custom Resource Definitions (CRDs) and has a massive ecosystem. This power comes with complexity: Kubernetes has a steep learning curve and significant operational overhead.
Docker Swarm
Docker Swarm is Docker's built-in orchestration mode. It is significantly simpler to set up than Kubernetes: a cluster can be initialized with a single command. It uses the same Docker Compose file format many developers already know. However, it has fewer features (no built-in autoscaling, limited scheduling options), a smaller ecosystem, and Docker Inc. has deprioritized its development in favor of Kubernetes integration.
HashiCorp Nomad
Nomad is a general-purpose workload orchestrator from HashiCorp. Unlike Kubernetes, it is not container-specific: it can schedule Docker containers, Java JARs, binaries, and virtual machines using the same scheduler. It is architecturally simpler than Kubernetes and integrates naturally with other HashiCorp tools (Consul for service discovery, Vault for secrets). It is a strong choice for heterogeneous workloads but has a smaller community than Kubernetes.
| Feature | Kubernetes | Docker Swarm | Nomad |
|---|---|---|---|
| Workload types | Containers (OCI) | Docker containers | Containers, binaries, VMs |
| Setup complexity | High | Low | Medium |
| Learning curve | Steep | Gentle | Moderate |
| Autoscaling | Built-in (HPA); VPA available as an add-on | Not built-in | Via Nomad autoscaler |
| Ecosystem/plugins | Very large (CNCF) | Small | Medium (HashiCorp) |
| Multi-tenancy | Strong (Namespaces, RBAC) | Basic | Moderate (Namespaces, ACLs) |
| Production adoption | Dominant | Declining | Niche but growing |
| Community size | Very large | Small | Medium |
| Best for | Large-scale, complex workloads | Simple Docker deployments | Mixed workloads, HashiCorp shops |
Example: Initializing a Cluster in Each System
# --- Docker Swarm: 2 commands to get a cluster running ---
docker swarm init                                # Initialize manager node
docker swarm join --token <TOKEN> <HOST>:2377    # Worker joins cluster (2377 is the default Swarm management port)
# --- Nomad: run an agent on each node; agents find each other via server_join config or Consul ---
nomad agent -config=/etc/nomad.d/                # Start agent (dev mode: nomad agent -dev)
# --- Kubernetes: significantly more steps (simplified with kubeadm) ---
kubeadm init --pod-network-cidr=10.244.0.0/16    # Initialize control plane
# Then: configure kubectl, install a CNI plugin, join worker nodes
kubectl apply -f https://raw.githubusercontent.com/flannel-io/flannel/master/Documentation/kube-flannel.yml
# Worker joins cluster; the CA cert hash is printed by "kubeadm init"
kubeadm join <HOST>:6443 --token <TOKEN> --discovery-token-ca-cert-hash sha256:<HASH>
The setup complexity difference is real and operationally significant.
Exercises
- (Beginner) Name one scenario where Docker Swarm would be a more appropriate choice than Kubernetes.
- (Beginner) What makes Nomad distinct from both Kubernetes and Docker Swarm in terms of workload types it can manage?
- (Intermediate) A startup has a team of 3 engineers, runs 8 microservices as Docker containers, and needs to deploy to production quickly. Recommend an orchestrator and justify your choice considering trade-offs.
- (Interview) A candidate says: "Kubernetes won the orchestration wars, so there's no reason to ever choose Nomad or Swarm." Present a technical counter-argument. (Hint: think about operational simplicity, heterogeneous workloads, and total cost of ownership.)
Answers
- Docker Swarm is appropriate when: the team is small and already fluent with Docker Compose; the number of services is low (< 20); setup must be done quickly with minimal operational overhead; and advanced features like autoscaling and complex scheduling are not required. A small internal tooling deployment is a clear fit.
- Nomad is workload-agnostic: it can schedule Docker containers, raw binaries/executables, Java applications (JARs), and virtual machines (via the QEMU driver) using the same scheduler and job specification format. Kubernetes and Docker Swarm are both container-specific — they can only orchestrate OCI-compatible containers.
- Recommendation: Docker Swarm or a managed Kubernetes service (e.g., AWS EKS, Google GKE). Justification for Swarm: a 3-engineer team deploying 8 services has low complexity, and the overhead of operating a self-managed Kubernetes cluster (etcd maintenance, control plane upgrades, CNI management) would consume significant engineering time. Swarm's Docker Compose compatibility means a lower learning curve. Justification for managed Kubernetes: if the team expects rapid growth, starting on Kubernetes (managed) avoids a painful migration later; the managed control plane removes most operational burden. The key trade-off is operational overhead today vs. migration cost later.
- Counter-argument: "Kubernetes won" applies to the general market, not to every specific use case. (a) Operational cost: Kubernetes requires significant expertise to operate safely. A small team may spend more time managing Kubernetes than building product; Swarm or Nomad reduces this cost. (b) Heterogeneous workloads: organizations running a mix of legacy Java applications, native binaries, and containers benefit from Nomad's unified scheduler — Kubernetes cannot natively schedule non-containerized workloads without significant workarounds. (c) HashiCorp integration: teams already using Consul, Vault, and Terraform have a cohesive operational model with Nomad that is harder to replicate with Kubernetes. The right tool depends on organizational context, not market share.
1.2 Kubernetes Overview
Kubernetes is more than a piece of software — it is a project, a community, and an ecosystem. This subchapter traces Kubernetes from its origins inside Google to its current position as the foundation of modern cloud-native infrastructure.
History and Origin of Kubernetes
Theory
Kubernetes was not invented in a vacuum. For over a decade before its public release, Google ran nearly all of its services — Search, Gmail, YouTube — on an internal cluster management system called Borg. Borg scheduled containers (using Linux cgroups, though the term "container" was not yet common) across Google's global fleet of machines, automatically handling failures, resource packing, and rolling updates at a scale no other organization matched.
In 2013, Docker made containers accessible to the broader industry. Google recognized that the industry would soon face the same scale problems Borg had already solved. A small team — including Joe Beda, Brendan Burns, and Craig McLuckie — began building an open-source system inspired by Borg's lessons but designed for the broader ecosystem. This project was initially named "Project Seven" (a reference to the Star Trek character Seven of Nine, a nod to a "more humane Borg").
Kubernetes (from the Greek "kubernetes," meaning helmsman or pilot) was announced publicly in June 2014, around the first DockerCon, and version 1.0 was released in July 2015. Alongside the 1.0 release, Google partnered with the Linux Foundation to form the Cloud Native Computing Foundation (CNCF), a vendor-neutral foundation, and contributed Kubernetes as its seed technology.
Key milestones:
| Year | Event |
|---|---|
| 2003 | Google's Borg system begins internal development |
| 2013 | Docker is released publicly; containers go mainstream |
| 2014 | Kubernetes announced publicly (June, around the first DockerCon) |
| 2015 | Kubernetes 1.0 released; donated to CNCF (July) |
| 2016 | Kubernetes becomes the most popular container orchestrator |
| 2018 | Kubernetes graduates from CNCF incubation |
| 2022+ | Kubernetes is the foundation of nearly all major cloud platforms |
Example: The Borg Legacy in Kubernetes Design
Many Kubernetes concepts map directly to Borg concepts, demonstrating the lineage:
Borg Concept -> Kubernetes Equivalent
---------------------------------------------------
Borgmaster -> kube-apiserver (control plane)
Borg scheduler -> kube-scheduler
Borglet -> kubelet (node agent)
Task -> Container
Job -> Deployment / StatefulSet
Alloc -> Pod (the unit of co-scheduling)
BNS (naming) -> Kubernetes Services + DNS
The paper "Large-scale cluster management at Google with Borg" (2015, Verma et al.) formally documented Borg and is worth reading alongside Kubernetes documentation.
Exercises
- (Beginner) What is the name of Google's internal cluster management system that inspired Kubernetes?
- (Beginner) What does the word "Kubernetes" mean, and what is the significance of the name?
- (Intermediate) Why did Google choose to open-source Kubernetes and donate it to the CNCF rather than keep it proprietary or sell it as a product?
- (Interview) What lessons from the Borg paper are reflected in Kubernetes's design? (Hint: think about the alloc/Pod abstraction and the value of declarative job specifications.)
Answers
- Google's internal cluster management system is called Borg. It has been running Google's production workloads since approximately 2003.
- "Kubernetes" is a Greek word meaning helmsman or pilot — the person who steers a ship. The name reflects the system's role in steering containerized applications across infrastructure. The "k8s" abbreviation replaces the eight letters between "k" and "s."
- Google does not make its money from licensing infrastructure software, and it wanted to grow its cloud business. By open-sourcing Kubernetes: (a) the broader industry would adopt containers, growing the overall market for cloud infrastructure; (b) a strong open-source community would contribute improvements that benefit Google at no cost; (c) the ecosystem of tools built around Kubernetes (monitoring, logging, CI/CD) would also benefit Google Cloud customers; (d) a vendor-neutral foundation (CNCF) would prevent any single vendor from controlling the standard, which also served Google's interests in competing with Microsoft and Amazon.
- Key Borg lessons reflected in Kubernetes: (a) The alloc (Pod) abstraction: Borg learned that grouping multiple tasks that must run together on the same machine (e.g., a web server and a log shipper) required a first-class scheduling unit — the alloc. Kubernetes adopted this as the Pod. (b) Declarative specifications: Borg operators specified what they wanted (replicas, resources, health checks), not a procedure, allowing the system to reconcile reality to the spec. Kubernetes inherits this as its declarative model. (c) Importance of naming and discovery: Borg's naming service (BNS) became Kubernetes Services and DNS. (d) Priority and preemption: Borg's workload classes (production vs. batch) informed Kubernetes's QoS classes and priority classes.
CNCF and the Kubernetes Ecosystem
Theory
The Cloud Native Computing Foundation (CNCF) is a vendor-neutral open-source foundation, itself a part of the Linux Foundation. It was founded in 2015 with Kubernetes as its seed project. The CNCF's mission is to foster and sustain the ecosystem of open-source, cloud-native technologies.
"Cloud-native" refers to an approach to building and running applications that exploits the advantages of the cloud computing model: elasticity, distributed systems, automation, and resilience. Cloud-native applications are typically containerized, dynamically orchestrated, and composed of microservices.
The CNCF Landscape catalogs over 1,000 projects and products organized into categories, only some of which are hosted by the CNCF itself. Projects hosted by the CNCF go through a maturity lifecycle:
- Sandbox: Early-stage, experimental projects
- Incubating: Growing adoption, defined governance
- Graduated: Proven at scale, stable API, strong governance (Kubernetes, Prometheus, Envoy, etc.)
Notable CNCF graduated projects that commonly accompany Kubernetes in production:
| Project | Category | Purpose |
|---|---|---|
| Kubernetes | Orchestration | Container lifecycle management |
| Prometheus | Monitoring | Metrics collection and alerting |
| Envoy | Service Proxy | Layer-7 proxy, sidecar pattern |
| Helm | Package Management | Kubernetes application packaging |
| Fluentd | Logging | Log collection and routing |
| containerd | Container Runtime | Low-level container execution |
| Argo | CI/CD | GitOps and workflow orchestration |
| Jaeger | Tracing | Distributed request tracing |
The CNCF ecosystem is often visualized as the "CNCF Landscape" — a sprawling map of tools covering every layer of cloud-native infrastructure.
Example: A Typical Kubernetes + CNCF Production Stack
+---------------------------------------------------------------+
| Application Layer |
| Your Microservices (running as Kubernetes Deployments) |
+---------------------------------------------------------------+
| Kubernetes (Orchestration) |
| Scheduling | Self-healing | Scaling | Service Discovery |
+---------------------------------------------------------------+
| Helm (packaging) | Argo CD (GitOps deployment) |
+---------------------------------------------------------------+
| Envoy/Istio (service mesh, mTLS, traffic management) |
+---------------------------------------------------------------+
| Prometheus + Grafana (metrics) | Jaeger (tracing) |
| Fluentd + Elasticsearch (logs) |
+---------------------------------------------------------------+
| containerd (container runtime) |
+---------------------------------------------------------------+
| Linux Nodes (VMs or bare metal) |
+---------------------------------------------------------------+
Each layer solves a distinct problem. Kubernetes alone does not provide a complete production system — the surrounding CNCF ecosystem completes it.
Exercises
- (Beginner) What does CNCF stand for, and what is its stated purpose?
- (Beginner) What are the three maturity stages of a CNCF project?
- (Intermediate) Why is Helm described as a "package manager for Kubernetes"? What problem does it solve that plain YAML files do not?
- (Interview) A CTO says: "We're adopting Kubernetes — that handles everything we need for production." What gaps in a production system does Kubernetes alone not address, and what tools fill them? (Hint: think about observability, security, and packaging.)
Answers
- CNCF stands for Cloud Native Computing Foundation. Its purpose is to foster and sustain the ecosystem of open-source, cloud-native technologies by providing a vendor-neutral home for important projects and promoting cloud-native best practices.
- The three maturity stages are: Sandbox (experimental, early adoption), Incubating (growing, defined governance), and Graduated (production-proven, stable, strong governance).
- Helm is a package manager for Kubernetes because it solves the problem of deploying complex applications that require many related Kubernetes resources (Deployments, Services, ConfigMaps, Secrets, RBAC rules, etc.). Without Helm, each resource must be managed as a separate YAML file with no built-in mechanism for versioning, parameterization (e.g., different values for dev vs. prod), or atomic installation/upgrade/rollback. A Helm Chart is a versioned, parameterizable bundle of all related resources that can be installed, upgraded, and rolled back as a single unit with helm install / helm upgrade / helm rollback.
- Gaps and tools: (a) Observability/Metrics: Kubernetes does not collect or store application metrics. Prometheus (metrics) + Grafana (visualization) fill this. (b) Distributed tracing: no built-in request tracing across services; Jaeger or Tempo provide this. (c) Log aggregation: Kubernetes provides raw container logs but no long-term storage or search; Fluentd + Elasticsearch + Kibana (or Loki + Grafana) fill this. (d) Application packaging: raw YAML files are not parameterizable or versioned as a unit; Helm solves this. (e) GitOps/CD: Kubernetes has no built-in continuous deployment; Argo CD or Flux provide this. (f) Service-to-service security (mTLS): Kubernetes networking does not encrypt traffic between pods by default; a service mesh (Istio, Linkerd) adds mutual TLS. (g) Secret management beyond the basics: Kubernetes Secrets are base64-encoded (not encrypted at rest by default); Vault or Sealed Secrets provide stronger secret management.
Kubernetes Use Cases and Adoption
Theory
Kubernetes has been adopted across a remarkably diverse range of organizations and use cases, from small startups to global enterprises. Understanding the use cases helps clarify when Kubernetes is the right choice and when it may be overkill.
Primary use cases:
- Microservices deployment: Running and managing dozens to hundreds of independently deployable services
- Batch and data processing: Running periodic or event-driven workloads (CronJobs, one-off Jobs)
- CI/CD pipelines: Spinning up ephemeral build environments and test runners
- Machine learning workflows: Managing GPU workloads, training jobs, and model serving (Kubeflow)
- Edge computing: Running Kubernetes on smaller clusters at edge locations (k3s)
- Multi-cloud and hybrid cloud: Providing a consistent API across cloud providers and on-premises
Who uses Kubernetes:
Organizations across industries run Kubernetes in production: Airbnb, Spotify, Pinterest, The New York Times, Goldman Sachs, Walmart, CERN, and many Fortune 500 companies. Recent CNCF annual surveys have reported that a large majority of respondents use or are evaluating Kubernetes; the exact percentage varies from year to year.
When Kubernetes may be the wrong choice:
- Single monolithic application with no plans to decompose
- Very small team with no existing container expertise
- Stateful applications with complex data requirements that are not yet containerized
- Workloads that do not benefit from horizontal scaling
Example: Mapping Use Cases to Kubernetes Features
Use Case -> Kubernetes Feature Used
----------------------------------------------------
Long-running API server -> Deployment + Service
Scheduled batch job -> CronJob
Database (stateful) -> StatefulSet + PersistentVolumeClaim
One-off migration job -> Job
A/B testing -> Ingress + canary Deployment
GPU-based ML training -> Job with GPU resource limits
Autoscaling web app -> Deployment + HorizontalPodAutoscaler
Configuration by env -> ConfigMap + Secret
Multi-tenant SaaS -> Namespaces + RBAC + ResourceQuota
Exercises
- (Beginner) Name two Kubernetes use cases beyond running a web API server.
- (Beginner) What is the difference between a Kubernetes Job and a CronJob?
- (Intermediate) An e-commerce company has a single monolithic application written in PHP, deployed on 3 VMs, and serving 10,000 users per day. Their engineering manager wants to "move everything to Kubernetes." Evaluate whether this is advisable.
- (Interview) Kubernetes is sometimes described as both "the best and worst solution" for microservices. What does this mean? (Hint: think about the operational value it provides versus the complexity it introduces.)
Answers
- Any two from: batch and data processing, CI/CD pipeline environments, machine learning workload management, edge computing, multi-cloud deployments, A/B testing, scheduled database migrations, GPU workloads.
- A Job runs a containerized task to completion once (or a specified number of times with optional parallelism) and then terminates. A CronJob is a higher-level object that creates Jobs on a recurring schedule defined by a cron expression (e.g., 0 2 * * * for 2am daily); it creates a new Job object each time its schedule fires.
- This is likely inadvisable without significant prerequisites. A PHP monolith on 3 VMs is a well-understood system with predictable failure modes. Moving it to Kubernetes without first containerizing the application, addressing state management (session data, file uploads), understanding Kubernetes networking, and training the team introduces enormous risk for uncertain benefit. Kubernetes provides the most value for decomposed services that need independent scaling. The recommendation should be: (a) first containerize the monolith and validate it works in Docker; (b) evaluate whether the operational problems (scaling, downtime during deployments) justify Kubernetes; (c) consider a managed Kubernetes service to reduce operational overhead; (d) plan a migration path rather than a big-bang lift-and-shift.
- Kubernetes is "the best solution" for microservices because it directly solves the hardest operational problems at scale: scheduling, health management, service discovery, rolling updates, and resource efficiency across large fleets. It is "the worst solution" because it introduces significant accidental complexity: a new networking model, a new security model (RBAC), a new way to think about storage, a large surface area of YAML configuration, and an operational dependency on the Kubernetes control plane itself. Organizations that adopt Kubernetes prematurely — before they have the scale problems it solves or the expertise to operate it — often find that the tool creates more problems than it resolves. The value proposition becomes clear at scale; below that threshold, the cost dominates.
Kubernetes Release Cycle and Versioning
Theory
Kubernetes follows a structured, predictable release process. Understanding it matters for operators: running an unsupported version in production creates security and stability risks, and upgrades require planning.
Release cadence: Kubernetes publishes approximately 3 minor releases per year, roughly every 4 months (recent releases have landed around April, August, and December, though exact dates vary). Each minor release is designated 1.X (e.g., 1.28, 1.29, 1.30).
Version scheme:
v1.30.2
^ ^ ^
| | |
| | Patch version (bug fixes, security patches — no API changes)
| Minor version (new features, may deprecate APIs)
Major version (breaking changes; has been 1 since 2015)
Support window: Each minor version receives patch releases for approximately 14 months from its release date. At any given time, approximately 3 minor versions are actively supported. This means clusters must be upgraded roughly every 14 months to stay within support.
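The version scheme and support window can be made concrete with a few lines of Python. This is a minimal sketch, not an official tool: the helper names are hypothetical, and the constant assumes the "approximately 3 supported minor versions" rule of thumb stated above rather than any published end-of-life dates.

```python
import re
from typing import NamedTuple


class KubeVersion(NamedTuple):
    major: int
    minor: int
    patch: int


def parse_version(text: str) -> KubeVersion:
    """Parse a string such as 'v1.30.2' into its three numeric components."""
    match = re.fullmatch(r"v?(\d+)\.(\d+)\.(\d+)", text.strip())
    if match is None:
        raise ValueError(f"not a major.minor.patch version: {text!r}")
    return KubeVersion(*(int(part) for part in match.groups()))


# Assumption for illustration: only the newest 3 minor versions are supported.
SUPPORTED_MINOR_COUNT = 3


def minor_versions_behind(cluster: str, latest: str) -> int:
    """Number of minor releases separating the cluster from the latest release."""
    return parse_version(latest).minor - parse_version(cluster).minor


def is_within_support(cluster: str, latest: str) -> bool:
    """True if the cluster's minor version is among the newest supported ones."""
    if parse_version(cluster).major != parse_version(latest).major:
        return False
    return 0 <= minor_versions_behind(cluster, latest) < SUPPORTED_MINOR_COUNT
```

For example, `minor_versions_behind("v1.25.0", "v1.30.2")` evaluates to 5, and `is_within_support("v1.25.0", "v1.30.2")` is `False` under the stated assumption.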
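API deprecation policy: Kubernetes has a formal policy for removing deprecated API versions. Beta API versions must remain available for at least 9 months or 3 minor releases (whichever is longer) after they are deprecated. GA API versions can be marked deprecated but are not removed within the current major version. Alpha API versions can be changed or removed in any release without a deprecation period. Operators must track deprecations when upgrading.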
Feature maturity gates:
| Stage | Stability | Default Enabled | API Change Risk |
|---|---|---|---|
| Alpha (v1alpha1) | Experimental | No (feature gate required) | May be removed without notice |
| Beta (v1beta1) | Mostly stable | Yes (in most releases) | May change; deprecated with notice |
| GA/Stable (v1) | Production-ready | Yes | Breaking changes only in major versions |
Exercises
- (Beginner) How many minor Kubernetes releases are published per year, and approximately how long is each minor version supported?
- (Beginner) Given the version string v1.29.4, identify the major, minor, and patch version numbers and explain what each change typically represents.
- (Intermediate) Your production cluster runs Kubernetes v1.25. The current release is v1.30. How many versions behind are you, and what is the risk of remaining on v1.25?
- (Interview) A developer wants to use an alpha API feature in production. What are the risks, and how would you advise them? (Hint: consider the alpha stability guarantee and the production impact of an API being removed in the next release.)
Answers
- Kubernetes publishes approximately 3 minor releases per year. Each minor version is supported for approximately 14 months (receiving patch releases during that window). At any given time, roughly 3 minor versions are in the support window simultaneously.
- v1.29.4: Major version 1. The major version has been 1 since 2015; a major version bump would indicate breaking API changes across the project. Minor version 29. Minor versions introduce new features and may deprecate APIs; upgrading minor versions requires reviewing changelogs for deprecations. Patch version 4. Patch releases contain only bug fixes and security patches; they do not change APIs and are generally safe to apply without significant testing.
- v1.25 is 5 minor versions behind v1.30. Since Kubernetes supports approximately the last 3 minor versions, v1.25 is well outside the support window and has been for some time. Risks include: (a) no further security patches: known CVEs in v1.25 will not be fixed; (b) API removals: upgrading through v1.26 to v1.30 crosses several removed API versions (policy/v1beta1 PodSecurityPolicy was removed in v1.25 itself, and subsequent versions removed other beta APIs); (c) incompatibility with newer tooling and Helm charts that require newer API versions; (d) inability to use new features (autoscaling improvements, scheduling enhancements, security hardening). Upgrading 5 versions requires a sequential process (the control plane is upgraded one minor version at a time; minor versions cannot be skipped) and should be treated as a significant project.
- Using alpha APIs in production carries several critical risks: (a) No stability guarantee: alpha APIs may be changed or removed entirely in the next release with no deprecation period; (b) Not enabled by default: alpha features require explicit feature gates, which may complicate cluster configuration and upgrades; (c) Untested at scale: alpha features have not been through the community's broad testing process; production behavior may differ from expectations; (d) Upgrade blocking: if the alpha API is removed in the next release, your production workloads that depend on it will break during the cluster upgrade. Advice: use the GA (v1) or beta (v1beta1) equivalent if one exists. If no stable version exists, treat the feature as a prototype, document the dependency clearly, establish a plan for migration when the API graduates, and thoroughly test every upgrade. Do not use alpha APIs in production systems where uptime is critical.
1.3 Core Concepts and Terminology
Kubernetes has a rich, precise vocabulary. Misunderstanding these foundational terms causes confusion when reading documentation, debugging issues, and discussing architecture. This subchapter defines the essential concepts and the mental models that connect them.
Clusters, Nodes, and Pods
Theory
These three terms form the basic hierarchy of a Kubernetes deployment. Each level is a composition of the one below it.
Cluster
A Kubernetes cluster is the top-level administrative unit — the complete set of machines and control software that Kubernetes manages as a single system. When you run kubectl commands, you are talking to a cluster's API server. An organization typically has multiple clusters (development, staging, production; or regional clusters for geographic distribution).
Node
A node is a single machine (physical server or virtual machine) that is a member of a cluster. Nodes are the workers — they run the actual containers. Each node runs a small set of Kubernetes system processes (kubelet, kube-proxy, a container runtime) that allow the control plane to manage it.
There are two roles a node can play:
- Control plane node (formerly "master"): runs the Kubernetes management components (API server, scheduler, controller manager, etcd)
- Worker node: runs user workloads (your containers)
Pod
A Pod is the smallest deployable unit in Kubernetes. It is not a container — it is a wrapper around one or more containers that should always run together on the same node, share the same network namespace (same IP address), and optionally share storage volumes.
Most Pods contain a single container. The multi-container Pod pattern is used for specific cases, such as:
- Sidecar: a helper container alongside the main container (e.g., a log shipper)
- Ambassador: a proxy that handles external communication on behalf of the main container
- Init containers: containers that run to completion before the main container starts
Cluster
|
+-- Node (worker-1, e.g., 8 CPU / 32GB RAM)
| |
| +-- Pod (api-pod-abc)
| | |-- Container: api-server (nginx)
| | |-- Container: log-sidecar (fluentd)
| |
| +-- Pod (auth-pod-xyz)
| |-- Container: auth-service (go binary)
|
+-- Node (worker-2, e.g., 8 CPU / 32GB RAM)
|
+-- Pod (api-pod-def) <- another replica of api
|-- Container: api-server (nginx)
|-- Container: log-sidecar (fluentd)
Example: Inspecting Cluster, Nodes, and Pods
# View all nodes in the cluster and their status
kubectl get nodes
# Example output:
# NAME STATUS ROLES AGE VERSION
# master-1 Ready control-plane 30d v1.30.0
# worker-1 Ready <none> 30d v1.30.0
# worker-2 Ready <none> 30d v1.30.0
# View all Pods across all namespaces
kubectl get pods --all-namespaces
# Describe a specific Pod to see containers, node assignment, and events
kubectl describe pod api-pod-abc -n default
A minimal Pod definition in YAML:
apiVersion: v1
kind: Pod
metadata:
  name: my-api-pod              # Unique name within the namespace
  labels:
    app: my-api                 # Labels used for selection by Services/Deployments
spec:
  containers:
    - name: api                 # Name of the container within the Pod
      image: my-api:1.0         # Container image to run
      ports:
        - containerPort: 8080   # Port the container listens on (informational only)
      resources:
        requests:
          cpu: "250m"           # Request 0.25 CPU cores
          memory: "128Mi"       # Request 128 MiB of RAM
        limits:
          cpu: "500m"           # CPU usage is throttled above 0.5 cores
          memory: "256Mi"       # The container is OOM-killed if it exceeds 256 MiB
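The resource quantities in the manifest use Kubernetes suffix notation: "m" means thousandths of a CPU core, and "Mi" means mebibytes. The sketch below shows how such strings map to plain numbers. It is illustrative only: the function names are hypothetical, and it handles just a small subset of the quantity syntax (millicore and whole-core CPU values, and the Ki/Mi/Gi binary memory suffixes).

```python
# Binary (power-of-two) memory suffixes handled by this sketch.
BINARY_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def cpu_to_millicores(quantity: str) -> int:
    """Convert a CPU quantity such as '250m' or '0.5' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(round(float(quantity) * 1000))


def memory_to_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '128Mi' to a number of bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # No suffix: the value is already in bytes.


requests_cpu = cpu_to_millicores("250m")    # a quarter of one core
limits_memory = memory_to_bytes("256Mi")    # 256 * 1024 * 1024 bytes
```

Because "Mi" is a power-of-two unit, 256Mi is 268,435,456 bytes, slightly more than 256 decimal megabytes.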
Exercises
- (Beginner) What is the relationship between a Cluster, a Node, and a Pod? Write one sentence describing how they are related.
- (Beginner) Why would you put two containers in the same Pod instead of separate Pods?
- (Intermediate) A Pod has two containers: app and sidecar. The app container writes logs to /var/log/app/app.log. How would you configure the sidecar to read those logs? What Kubernetes feature enables this?
- (Interview) "You should always use one container per Pod." Is this a rule or a guideline? When is it correct to violate it, and when should you not? (Hint: consider tight coupling, the sidecar pattern, and scaling implications.)
Answers
- A Cluster is the complete Kubernetes system; it consists of multiple Nodes (machines); each Node runs one or more Pods; each Pod contains one or more containers that run together on that Node.
- Two containers belong in the same Pod when they are tightly coupled and must: (a) always run on the same node; (b) share the same network namespace (communicate via localhost rather than the network); (c) share a filesystem volume (e.g., one container writes files that another reads). The sidecar pattern, a main application container plus a helper (logging agent, proxy, secrets refresher), is the canonical example.
- You would use a shared emptyDir volume. Both containers mount the volume at the same path (or different paths that serve the same logical purpose). The app container writes to /var/log/app/, which is backed by the volume. The sidecar container mounts the same volume and reads from it. An emptyDir volume is created when a Pod is assigned to a node and deleted when the Pod is removed; it enables intra-Pod file sharing.
```yaml
spec:
  volumes:
    - name: log-volume               # Shared emptyDir volume
      emptyDir: {}
  containers:
    - name: app
      image: my-app:1.0
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app    # App writes logs here
    - name: sidecar
      image: fluentd:latest
      volumeMounts:
        - name: log-volume
          mountPath: /var/log/app    # Sidecar reads logs from the same path
```
- "One container per Pod" is a guideline, not an absolute rule. It should be followed in the vast majority of cases because: (a) it enables independent scaling: scaling a Deployment scales all containers in the Pod together; if app and db are in the same Pod, you cannot scale app without also scaling db; (b) it enables independent lifecycle management: separate Pods can be updated independently. It is correct to use multiple containers per Pod specifically for the sidecar pattern (log shippers, proxy agents, secrets injectors) where the helper container is a per-instance concern tightly coupled to the main container's lifecycle. It is incorrect when the containers have different scaling requirements or could be independently deployed services.
Control Plane vs Data Plane
Theory
Kubernetes separates its responsibilities into two distinct planes of operation. This separation is a fundamental architectural decision that enables Kubernetes to be both highly available and scalable.
Control Plane
The control plane is the "brain" of the cluster. It makes global decisions about the cluster (scheduling, scaling, responding to failures) and is responsible for maintaining the desired state. The control plane components are:
| Component | Responsibility |
|---|---|
| kube-apiserver | The front door to Kubernetes: all communication goes through it. Exposes the Kubernetes API over HTTPS. Validates and processes API requests. |
| etcd | A distributed key-value store that holds all cluster state. The single source of truth for desired configuration. |
| kube-scheduler | Watches for unscheduled Pods and assigns them to appropriate nodes based on resource availability, affinity rules, and policies. |
| kube-controller-manager | Runs control loops (controllers) that watch cluster state and take action to converge reality to desired state (e.g., the ReplicaSet controller ensures the right number of Pod replicas exist). |
| cloud-controller-manager | (Optional) Integrates with cloud provider APIs (AWS, GCP, Azure) to manage load balancers, volumes, and node lifecycle. |
Data Plane
The data plane is where the actual workloads run. It consists of the worker nodes and the components running on them:
| Component | Responsibility |
|---|---|
| kubelet | The node agent. Receives Pod specs from the API server and ensures the containers described are running and healthy. Reports node and Pod status back to the control plane. |
| kube-proxy | Manages network rules on each node to implement Kubernetes Services, ensuring traffic destined for a Service IP is forwarded to the appropriate Pod(s). |
| Container Runtime | The software that actually runs containers (e.g., containerd, CRI-O). The kubelet delegates to it via the Container Runtime Interface (CRI). |
+---------------------------+ +---------------------------+
| CONTROL PLANE | | DATA PLANE |
| | | |
| kube-apiserver <--------+--+-> kubelet (each node) |
| | | | | |
| etcd (cluster state) | | container runtime |
| kube-scheduler | | kube-proxy |
| controller-manager | | |
| cloud-controller-manager | | [Your Pods run here] |
+---------------------------+ +---------------------------+
^
|
kubectl (your CLI)
In production, the control plane typically runs on dedicated nodes (separate from worker nodes) and is often replicated (3 or 5 instances of each component) for high availability.
Example: Control Plane in Action — Pod Scheduling
When you apply a Deployment, a chain of control plane events occurs:
1. You run: kubectl apply -f deployment.yaml
2. kube-apiserver:
   - Receives the request
   - Authenticates and authorizes your identity (RBAC)
   - Validates the object against the API schema
   - Writes the Deployment object to etcd
3. kube-controller-manager (Deployment controller):
   - Notices a new Deployment object
   - Creates a ReplicaSet object representing the desired Pods
   - ReplicaSet controller creates Pod objects (with no node assigned)
   - The API server stores the Pods in etcd with nodeName: "" (unscheduled)
4. kube-scheduler:
   - Watches for Pods with no nodeName
   - Evaluates all nodes: resource availability, taints/tolerations, affinity
   - Selects the best node and binds the Pod through the API server, which records nodeName: "worker-2" in etcd
5. kubelet on worker-2:
   - Watches for Pods assigned to its node
   - Reads the Pod spec
   - Instructs containerd to pull the image and start the container
   - Monitors the container; reports status back to kube-apiserver
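Step 4 above can be pictured as a filter-then-score decision. The following is a toy sketch under loud assumptions: the `Node` and `Pod` classes and the `schedule` function are hypothetical, only CPU and memory are considered, and the "most free CPU wins" scoring rule is invented for illustration. The real kube-scheduler also evaluates taints, tolerations, affinity, and many other policies.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Node:
    name: str
    free_cpu_millicores: int
    free_memory_mib: int


@dataclass
class Pod:
    name: str
    cpu_request_millicores: int
    memory_request_mib: int
    node_name: Optional[str] = None  # None means "unscheduled"


def schedule(pod: Pod, nodes: List[Node]) -> Optional[str]:
    """Pick a node for the Pod, or return None if no node fits."""
    # Filter: keep only nodes with enough free CPU and memory.
    feasible = [
        n for n in nodes
        if n.free_cpu_millicores >= pod.cpu_request_millicores
        and n.free_memory_mib >= pod.memory_request_mib
    ]
    if not feasible:
        return None  # The Pod stays unscheduled (Pending).
    # Score: simplified rule, prefer the node with the most free CPU.
    best = max(feasible, key=lambda n: (n.free_cpu_millicores, n.free_memory_mib))
    # Bind: record the decision and reserve the requested resources.
    best.free_cpu_millicores -= pod.cpu_request_millicores
    best.free_memory_mib -= pod.memory_request_mib
    pod.node_name = best.name
    return best.name
```

A Pod whose requests fit on no node keeps `node_name` as `None`, which mirrors a Pod that remains Pending until capacity becomes available.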
Exercises
- (Beginner) Name the two planes in Kubernetes and state the primary responsibility of each.
- (Beginner) What is etcd and why is it critical to the cluster?
- (Intermediate) If kube-scheduler crashes, what happens to existing running Pods? What happens if you try to create a new Pod while the scheduler is down?
- (Interview) Why is it important to run an odd number of etcd nodes (e.g., 3 or 5) in a production cluster? (Hint: consider distributed consensus and what happens when a node becomes unreachable.)
Answers
- Control plane: makes decisions about the cluster (scheduling, scaling, reconciliation); maintains desired state. Data plane: executes workloads; runs the actual containers as directed by the control plane.
- etcd is a distributed, consistent key-value store. It is the single source of truth for all Kubernetes cluster state: every object (Pods, Deployments, ConfigMaps, Secrets, etc.) is stored in etcd. Every change to the cluster is committed to etcd before being acted upon. If etcd data is lost without a backup, the cluster state is unrecoverable. This makes etcd the most critical component to back up and make highly available.
- Existing Pods: unaffected. The kubelet on each node manages running containers locally. Containers that were already started continue running; the kubelet restarts them if they crash (it does not need the scheduler for this, because it already knows which Pods it should be running). New Pods: they are created as objects in etcd (by the controller manager) with no node assignment, and they remain in Pending state until the scheduler recovers and assigns them to nodes. No new workload can be placed on nodes while the scheduler is down.
- etcd uses the Raft consensus algorithm, which requires a majority quorum of nodes to agree before any write is committed. With n nodes, quorum is floor(n/2) + 1, so the cluster tolerates floor((n-1)/2) node failures and still achieves quorum. With 3 nodes: tolerates 1 failure (2 of 3 remain). With 5 nodes: tolerates 2 failures (3 of 5 remain). With an even number (e.g., 4 nodes): you tolerate only 1 failure (need 3 of 4), the same as with 3 nodes, but you pay for an extra node with no additional fault tolerance. An even-sized cluster can also be partitioned into two equal halves, in which case neither half holds a majority and the cluster stops accepting writes until the partition heals. (Raft's majority rule prevents a true split-brain, but availability is lost.) Therefore, 3 and 5 are the standard production configurations.
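The quorum arithmetic in the last answer is easy to check with a short calculation. This is a minimal sketch of the majority rule only (hypothetical function names); it does not model Raft itself.

```python
def quorum_size(members: int) -> int:
    """Smallest number of members that forms a strict majority."""
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """Members that can fail while a majority still remains."""
    return members - quorum_size(members)


# Print a small comparison table for cluster sizes 1 through 7.
for n in range(1, 8):
    print(f"members={n}  quorum={quorum_size(n)}  tolerated={failures_tolerated(n)}")
```

Going from 3 to 4 members raises the quorum from 2 to 3 but leaves the tolerated failures at 1, which is why the extra member adds cost without adding fault tolerance.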
Declarative vs Imperative Model
Theory
These two terms describe fundamentally different ways of interacting with a system.
Imperative model: You tell the system how to achieve a goal, step by step.
"Start 3 copies of my container on these specific nodes.
Then, if one fails, restart it on node-2."
This is the model used by direct shell scripting, Docker CLI commands, and similar tools. You specify the procedure.
Declarative model: You tell the system what you want the end state to be. You do not describe how to get there — the system figures that out.
"I want 3 copies of my container running at all times."
The system examines current reality, compares it to what you declared, and makes whatever changes are necessary to close the gap. You describe the desired state; the system manages the procedure.
Kubernetes is primarily declarative. You write YAML manifests describing what you want (Deployments, Services, ConfigMaps), apply them to the cluster, and Kubernetes continuously works to ensure reality matches your declarations.
Why declarative?
- Idempotency: applying the same manifest multiple times produces the same result. Running the same imperative command twice might create duplicates or fail with a conflict.
- Self-documentation: the YAML files in your repository are always a current record of what should be running. There is no need for a separate runbook describing the current setup.
- GitOps: storing declarative manifests in git creates a complete audit trail of every change. Rolling back means reverting a git commit.
- Self-healing: the system continuously reconciles toward the declared state, which means it can automatically correct drift caused by failures.
Kubernetes does support imperative commands (e.g., kubectl run, kubectl delete, kubectl scale), but these are generally used for debugging and exploration, not for managing production systems.
Example: Imperative vs Declarative Side by Side
# --- IMPERATIVE: step-by-step instructions ---
# Create a deployment
kubectl create deployment nginx --image=nginx:1.25 --replicas=3
# Scale it up
kubectl scale deployment nginx --replicas=5
# Update the image
kubectl set image deployment/nginx nginx=nginx:1.26
# Problem: the cluster's actual state may diverge from what you think it is.
# There is no canonical file representing "what should be running."
# --- DECLARATIVE: describe desired state in a file ---
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx
spec:
  replicas: 5                  # Desired state: 5 replicas
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
        - name: nginx
          image: nginx:1.26    # Desired state: version 1.26
# Apply the desired state. Kubernetes handles the rest.
kubectl apply -f deployment.yaml
# Apply again with replicas changed to 3 — Kubernetes scales down.
# Apply again with the same file — nothing changes (idempotent).
# A Pod crashes — Kubernetes restarts it without any further input.
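The idempotency difference between the two styles can be shown with a toy model. This sketch is an illustration, not how kubectl works internally: the cluster is a plain dictionary, the `scale_up_by` and `apply` helpers are hypothetical, and the real `kubectl apply` performs a more sophisticated merge than a whole-object replacement.

```python
def scale_up_by(cluster: dict, name: str, extra: int) -> None:
    """Imperative step: the outcome depends on the state before the call."""
    cluster[name]["replicas"] += extra


def apply(cluster: dict, manifest: dict) -> bool:
    """Declarative apply: make the stored object equal to the manifest.

    Returns True if anything changed, False if the state already matched.
    """
    name = manifest["name"]
    if cluster.get(name) == manifest:
        return False
    cluster[name] = dict(manifest)
    return True


cluster = {"nginx": {"name": "nginx", "replicas": 3, "image": "nginx:1.25"}}

# Repeating the imperative step keeps changing the result: 3 -> 5 -> 7.
scale_up_by(cluster, "nginx", 2)
scale_up_by(cluster, "nginx", 2)
replicas_after_imperative = cluster["nginx"]["replicas"]

# Repeating the declarative apply converges: the second call is a no-op.
manifest = {"name": "nginx", "replicas": 5, "image": "nginx:1.26"}
first_apply_changed = apply(cluster, manifest)
second_apply_changed = apply(cluster, manifest)
```

After both applies the stored object equals the manifest regardless of what imperative changes came before, which is the property that makes re-running a pipeline safe.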
Exercises
- (Beginner) In one sentence each, define imperative and declarative in the context of Kubernetes.
- (Beginner) What does "idempotent" mean, and why is it important for declarative systems?
- (Intermediate) A colleague runs kubectl scale deployment api --replicas=10 directly on the production cluster to handle a traffic spike. An hour later, someone runs kubectl apply -f deployment.yaml (which declares replicas: 3). What happens? What does this illustrate about imperative changes to declaratively managed resources?
- (Interview) What is GitOps, and how does Kubernetes's declarative model enable it? (Hint: think about what git provides, namely versioning, audit, and rollback, and how it maps to managing declared cluster state.)
Answers
- Imperative: you tell Kubernetes the specific steps to perform (run this command, scale to this number, update this image). Declarative: you tell Kubernetes the desired end state (a file describing what should exist), and Kubernetes determines and executes the steps to achieve it.
- Idempotent means that performing the same operation multiple times produces the same result as performing it once. For declarative systems, this is critical because kubectl apply -f deployment.yaml can be run repeatedly (in CI/CD pipelines, after operator error, after a network failure) and it will always converge to the declared state without creating duplicate resources or causing errors if nothing has changed.
- The kubectl apply overwrites the Deployment's replicas field back to 3, and Kubernetes scales down from 10 to 3 Pods. The imperative kubectl scale change was overwritten by the declarative apply. This illustrates a critical operational rule: never mix imperative changes with declarative management for the same resource. Imperative changes bypass the source of truth (the YAML file), creating invisible state drift. When the file is applied next (by a human or a CI/CD pipeline), the imperative changes are silently reverted. The correct approach would have been to update the YAML file to replicas: 10, apply it, and then revert to 3 in a tracked commit.
- GitOps is an operational practice where the desired state of infrastructure (and applications) is stored entirely in a git repository. Deployments happen by committing changes to git, and an automated operator (Argo CD, Flux) continuously synchronizes the cluster's actual state to match what is in the git repository. Kubernetes's declarative model enables GitOps because: (a) desired state can be fully expressed as YAML files that can be committed to git; (b) these files are version-controlled, so every change has an author, a timestamp, and a diff; (c) rollback means reverting a git commit, which automatically triggers re-synchronization to the previous desired state; (d) git's access control (branch protection, pull request review) becomes the control gate for infrastructure changes.
Desired State and Reconciliation Loop
Theory
The desired state and reconciliation loop are the two halves of the central operating principle of Kubernetes. Together, they explain why Kubernetes behaves the way it does.
Desired State
Desired state is what you have declared should exist in the cluster — the contents of your Kubernetes objects stored in etcd. It is the answer to the question: "What should the world look like?"
Examples of desired state:
- "There should be 3 running replicas of the api Deployment"
- "The api Service should route traffic to Pods with label app: api"
- "The ConfigMap app-config should contain the key log_level: INFO"
Actual State (Observed State)
Actual state is the current reality of the cluster: what containers are actually running, what their health is, what nodes exist. It is observed continuously by the control plane components.
Reconciliation Loop
The reconciliation loop (also called the control loop or watch loop) is the mechanism that closes the gap between desired and actual state. Every Kubernetes controller runs a reconciliation loop:
- Observe the current actual state
- Compare it to the desired state
- Act to reduce the difference (if any)
- Repeat
This loop runs continuously. It is not a one-time operation. This is what gives Kubernetes its self-healing property: if a container crashes and the actual state diverges from desired, the controller notices on the next loop iteration and restarts the container.
An analogy: a home thermostat is a reconciliation loop. The desired state is the temperature you set (22°C). The actual state is the current room temperature. The thermostat continuously observes, compares, and acts (turn heating on/off) to close the gap.
+-------------------+
| Desired State | <- stored in etcd
| (replicas: 3) |
+-------------------+
|
| compare
v
+-------------------+
| Actual State | <- observed from kubelet reports
| (2 Pods running) |
+-------------------+
|
| difference detected: 1 Pod missing
v
+-------------------+
| Controller Acts |
| (create 1 Pod) |
+-------------------+
|
+--------> Loop repeats (all 3 now running)
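The observe, compare, act cycle in the diagram can be sketched as a single function that is called repeatedly. This is a simplified model, not the ReplicaSet controller's actual code: Pods are represented as names in a list, and `reconcile` and `name_source` are hypothetical.

```python
import itertools
from typing import Iterator, List


def reconcile(desired_replicas: int, running_pods: List[str],
              name_source: Iterator[str]) -> List[str]:
    """Run one loop iteration and return the actions that were taken."""
    actions = []
    # Observe and compare: how far is actual state from desired state?
    difference = desired_replicas - len(running_pods)
    # Act: create missing Pods or remove surplus ones.
    if difference > 0:
        for _ in range(difference):
            pod = next(name_source)
            running_pods.append(pod)
            actions.append(f"create {pod}")
    elif difference < 0:
        for _ in range(-difference):
            pod = running_pods.pop()
            actions.append(f"delete {pod}")
    return actions


names = (f"api-{i}" for i in itertools.count(1))
pods = ["api-a", "api-b"]                      # actual state: 2 Pods
first_pass = reconcile(3, pods, names)         # one Pod is created
second_pass = reconcile(3, pods, names)        # nothing left to do
```

The second pass returns an empty list because the gap has already been closed; a real controller keeps running the same check indefinitely.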
Level-triggered vs edge-triggered
Kubernetes controllers are level-triggered, not edge-triggered. This means they respond to the current state of the world (the level), not to the event that caused a change (the edge). If a controller misses an event (e.g., due to a crash), it still converges to correct state on the next loop because it always observes and responds to current reality. This makes the system robust to transient failures.
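To see why the distinction matters, consider a scenario where two Pods are deleted but the controller only receives one of the two delete events. The sketch below contrasts the two strategies; both functions are hypothetical simplifications written for this comparison and do not reflect real controller code.

```python
DESIRED = 3  # Desired number of Pods.


def edge_triggered(current_pods: int, events_seen: list) -> int:
    """React only to the delete events that were actually received."""
    replacements = sum(1 for event in events_seen if event == "pod-deleted")
    return current_pods + replacements


def level_triggered(current_pods: int) -> int:
    """Ignore event history; compare the current count with the desired count."""
    missing = max(0, DESIRED - current_pods)
    return current_pods + missing


# Two of three Pods were deleted, but one event was lost during a restart.
pods_after_failures = 1
events_received = ["pod-deleted"]

edge_result = edge_triggered(pods_after_failures, events_received)   # 2 Pods
level_result = level_triggered(pods_after_failures)                  # 3 Pods
```

The edge-triggered version ends one Pod short because it can only repair what it heard about, while the level-triggered version reaches the desired count without needing any event history.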
Example: Reconciliation in Practice
# Desired state: 3 replicas in the Deployment
kubectl apply -f deployment.yaml # replicas: 3
# Check actual state
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-7d9c-abc 1/1 Running 0 5m
# api-7d9c-def 1/1 Running 0 5m
# api-7d9c-ghi 1/1 Running 0 5m
# Simulate a failure: manually delete one Pod
kubectl delete pod api-7d9c-ghi
# Actual state is now: 2 Pods (diverged from desired: 3)
# The ReplicaSet controller's reconciliation loop detects this.
# Within seconds, it creates a replacement Pod.
kubectl get pods
# NAME READY STATUS RESTARTS AGE
# api-7d9c-abc 1/1 Running 0 6m
# api-7d9c-def 1/1 Running 0 6m
# api-7d9c-jkl 0/1 ContainerCreating 0 2s <- new Pod
The operator did nothing after deleting the Pod. The reconciliation loop detected the drift and corrected it autonomously.
Exercises
- (Beginner) Describe the three steps of a Kubernetes reconciliation loop in your own words.
- (Beginner) If you manually delete a Pod that was created by a Deployment, what happens and why?
- (Intermediate) A Kubernetes cluster loses its connection to etcd for 30 seconds, then reconnects. During this 30 seconds, a node goes down, taking 2 Pods with it. Describe what happens during the outage and after connectivity is restored.
- (Interview) Explain the difference between "level-triggered" and "edge-triggered" systems. Why does Kubernetes use level-triggered reconciliation, and what resilience properties does this provide? (Hint: consider what happens when a controller crashes and restarts mid-operation.)
Answers
- A reconciliation loop: (1) Observes the current actual state of the cluster (e.g., how many Pods are running); (2) Compares it to the desired state stored in etcd (e.g., how many should be running); (3) Acts to close any gap (e.g., creates a new Pod if too few are running, or deletes a Pod if too many). This loop repeats continuously.
- When you delete a Pod that was created by a Deployment (via a ReplicaSet), the ReplicaSet controller's reconciliation loop detects that actual state (fewer Pods than declared) diverges from desired state (the replicas count). It creates a new Pod to replace the deleted one. This is by design: Kubernetes treats Pod deletion as a signal that desired state is not met, not as a final instruction. If you truly want to scale down, you must reduce the replicas count in the Deployment, not delete individual Pods.
- During the 30-second outage: the API server cannot read from or write to etcd, so the controllers in the control plane cannot process new reconciliation actions. The kubelets on worker nodes continue running existing containers (they cache their current Pod assignments locally). The scheduler cannot place new Pods. When the node goes down, the kubelet on that node stops sending heartbeats, but the control plane cannot yet record the node as NotReady because it cannot write to etcd. After connectivity is restored: the controllers re-read cluster state through the API server. They observe that the node's heartbeat has been missing and eventually mark it NotReady. The Pod eviction mechanism (controlled by the node lifecycle controller) detects Pods on the failed node. After a configurable eviction timeout (default 5 minutes for the node.kubernetes.io/unreachable taint), it marks the Pods as Terminating and the ReplicaSet controller creates replacement Pods on healthy nodes. The 30-second etcd outage delayed this process but did not prevent correct reconciliation, which demonstrates level-triggered resilience.
- An edge-triggered system responds to events (state transitions): "a Pod was deleted" triggers a create action. If the system misses the event (crashes before processing it), the action is never taken and the system is stuck in an incorrect state. A level-triggered system responds to state (the current level of reality): "there are 2 Pods, but there should be 3" triggers a create action. It does not matter how the system got to that state or whether any events were missed: every time the loop runs, it observes current reality and acts. Kubernetes uses level-triggered reconciliation because: (a) if a controller crashes and restarts, it simply re-observes current state and acts on what it finds, with no need to replay missed events; (b) if an action fails mid-execution, the loop will retry on the next iteration because the desired/actual gap still exists; (c) this makes the system robust against partial failures, network partitions, and controller restarts without needing event replay or persistent event queues. The resilience property is eventual consistency: regardless of intermediate failures, the cluster will eventually converge to desired state.
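The difference between the two trigger styles can be made concrete with a toy model. This is a minimal sketch, not Kubernetes code: the function names, the Pod naming scheme, and the in-memory sets are all assumptions made for illustration.

```python
from itertools import count

_next_id = count(1)


def new_pod() -> str:
    """Return a fresh, unique (hypothetical) Pod name."""
    return f"pod-{next(_next_id)}"


def edge_triggered(pods, delete_events):
    """React only to delivered deletion events: one replacement per event."""
    return set(pods) | {new_pod() for _ in delete_events}


def level_triggered(pods, desired):
    """Ignore history: compare the observed count to the desired count."""
    pods = set(pods)
    while len(pods) < desired:
        pods.add(new_pod())
    while len(pods) > desired:
        pods.pop()
    return pods


desired = 3
running = {new_pod() for _ in range(desired)}

# Two Pods disappear while the controller is down, so it receives no events.
survivors = set(sorted(running)[:1])
missed_events = []

after_edge = edge_triggered(survivors, missed_events)
after_level = level_triggered(survivors, desired)
print(len(after_edge), len(after_level))  # 1 3
```

The edge-triggered handler stays at one Pod because the deletions were never delivered to it, while the level-triggered pass converges to three from whatever state it finds.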
2. Kubernetes Architecture
Every Kubernetes cluster is split into two planes: a control plane that makes global decisions and stores the cluster's desired state, and a set of worker nodes that actually run the containers. This chapter dissects each component, traces how they communicate through the API server, and covers the add-ons that turn a bare cluster into a usable platform. Understanding this architecture is what separates someone who can run kubectl apply from someone who can debug why a Pod will not start.
2.1 Control Plane Components
The control plane is the cluster's brain. It exposes the API, persists state, schedules workloads, and runs the control loops that keep reality matching intent. This subchapter examines each control plane process in turn.
kube-apiserver
Theory
Imagine a company where every decision — hiring, spending, scheduling — must go through a single front desk that validates the request, records it in the official ledger, and notifies the relevant departments. Nothing is "real" until the front desk writes it down. In Kubernetes, that front desk is the kube-apiserver.
The API server is the only component that talks to etcd (the datastore). Every other component — the scheduler, controllers, kubelets, and your kubectl — communicates through the API server, never directly with each other or with storage. This hub-and-spoke design means the API server is the single point of authentication, authorization, validation, and audit.
Key properties:
- RESTful: resources (Pods, Services, etc.) are exposed as HTTP endpoints you can GET, POST, PUT, PATCH, and DELETE.
- Stateless: the API server holds no state itself; all state lives in etcd. This means you can run multiple API server replicas behind a load balancer for high availability.
- Declarative: clients submit the desired state of an object; the API server validates and stores it.
Example
# kubectl is just an HTTP client for the API server.
# This flag shows the actual REST calls kubectl makes:
kubectl get pods -v=8
# You can talk to the API server directly with curl via a proxy:
kubectl proxy --port=8080 &
curl http://localhost:8080/api/v1/namespaces/default/pods # list Pods as JSON
# Inspect the API server's own health endpoint:
curl http://localhost:8080/healthz # returns "ok" when healthy
            +-------------------------------------------+
            |              kube-apiserver               |
kubectl  -->|  authn -> authz -> admission -> validate  |--> etcd
kubelet  -->|         (the ONLY writer to etcd)         |
scheduler ->+-------------------------------------------+
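The request path in the diagram can be sketched as a single function. This is a toy model under stated assumptions: the token table, the permission set, and the in-memory dictionary standing in for etcd are hypothetical, and the real API server's admission and validation stages are far richer. Only the ordering of the stages and the /registry key layout come from this chapter.

```python
TOKENS = {"token-alice": "alice"}          # hypothetical credential store
ALLOWED = {("alice", "create", "pods")}    # hypothetical authorization rules
store = {}                                 # stands in for etcd


def handle(token, verb, resource, obj):
    """Run one write request through the stages and return an HTTP status."""
    user = TOKENS.get(token)               # 1. authentication: who is calling?
    if user is None:
        return 401
    if (user, verb, resource) not in ALLOWED:   # 2. authorization: may they?
        return 403
    obj = {"namespace": "default", **obj}  # 3. admission: apply a default
    if not obj.get("name"):                # 4. validation: is the object well-formed?
        return 422
    key = f"/registry/{resource}/{obj['namespace']}/{obj['name']}"
    store[key] = obj                       # 5. persist; only now is it "real"
    return 201


print(handle("wrong-token", "create", "pods", {"name": "my-pod"}))  # 401
print(handle("token-alice", "delete", "pods", {"name": "my-pod"}))  # 403
print(handle("token-alice", "create", "pods", {}))                  # 422
print(handle("token-alice", "create", "pods", {"name": "my-pod"}))  # 201
print(sorted(store))  # ['/registry/pods/default/my-pod']
```

A request rejected at any stage never reaches the store, which is the point of putting every check in one place.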
Exercises
- (Beginner) Which Kubernetes component is the only one that reads from and writes to etcd directly?
- (Beginner) Why can you run multiple replicas of the kube-apiserver without a leader election, but not multiple active schedulers?
- (Intermediate) Using kubectl get pods -v=8, identify the HTTP method and path used to list Pods in the default namespace.
- (Interview) The API server is described as "stateless." Where is cluster state actually stored, and what advantage does keeping the API server stateless provide for availability? (Hint: think about horizontal scaling and load balancing.)
Answers
- The kube-apiserver is the only component that communicates with etcd directly. All other components go through the API server.
- The API server is stateless — every replica reads/writes the same etcd backend, so requests can be load-balanced across replicas without coordination. The scheduler and controllers, by contrast, take actions (creating Pods, binding them to nodes); running several active copies would cause duplicate or conflicting actions, so they use leader election to ensure only one is active at a time.
- The request is GET /api/v1/namespaces/default/pods. The -v=8 output shows a line like GET https://<host>:6443/api/v1/namespaces/default/pods?limit=500.
- Cluster state is stored in etcd. Because the API server holds no state of its own, any replica can serve any request, and you can place multiple replicas behind a load balancer (active-active). If one fails, traffic simply routes to another with no failover delay or state migration. This is the foundation of control plane high availability.
etcd as the cluster store
Theory
If the API server is the front desk, etcd is the official ledger locked in the vault. It is a distributed, consistent key-value store that holds the entire state of the cluster: every Pod, Service, ConfigMap, Secret, and the desired-state specs you submit. If etcd is lost and unrecoverable, the cluster's state is gone.
etcd uses the Raft consensus algorithm to stay consistent across multiple replicas. Raft elects a leader; all writes go through the leader, which replicates them to followers. A write is only acknowledged once a majority (quorum) of members have persisted it. This is why etcd clusters use an odd number of members (3, 5, 7): a 3-node cluster tolerates 1 failure, a 5-node cluster tolerates 2.
Because consistency requires a quorum, etcd prioritizes consistency over availability (CP in CAP terms). If quorum is lost, etcd stops accepting writes rather than risk diverging state.
Example
# etcd stores everything under /registry. Inspect a key directly:
ETCDCTL_API=3 etcdctl get /registry/pods/default/my-pod \
--endpoints=https://127.0.0.1:2379 \
--cacert=/etc/kubernetes/pki/etcd/ca.crt \
--cert=/etc/kubernetes/pki/etcd/server.crt \
--key=/etc/kubernetes/pki/etcd/server.key
# Check cluster health and which member is the leader:
ETCDCTL_API=3 etcdctl endpoint status --write-out=table \
--endpoints=https://127.0.0.1:2379 ...
Raft quorum math (why odd numbers):
3 members -> quorum = 2 -> tolerates 1 failure
5 members -> quorum = 3 -> tolerates 2 failures
4 members -> quorum = 3 -> tolerates 1 failure (no better than 3, more cost)
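The quorum arithmetic above is easy to check mechanically. The following is a small sketch of the majority rule only; the function names are made up for illustration and nothing here talks to a real etcd cluster.

```python
def quorum(members: int) -> int:
    """Smallest majority of a cluster with the given number of members."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """How many members can fail while a majority is still reachable."""
    return members - quorum(members)


def can_accept_writes(members: int, alive: int) -> bool:
    """Writes need acknowledgement from a majority of the full membership."""
    return alive >= quorum(members)


for n in (3, 4, 5, 7):
    print(n, quorum(n), tolerated_failures(n))
# 3 2 1
# 4 3 1
# 5 3 2
# 7 4 3

print(can_accept_writes(5, alive=2))  # False
```

Note that quorum is computed from the configured membership, not from the number of members currently alive, which is why losing 3 of 5 members blocks writes.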
Exercises
- (Beginner) What kind of data store is etcd, and what does it hold for a Kubernetes cluster?
- (Beginner) Why are etcd clusters deployed with an odd number of members?
- (Intermediate) A 5-member etcd cluster loses 3 members simultaneously. Can it still accept writes? Explain using the concept of quorum.
- (Interview) etcd is a CP system in CAP terms. What does it sacrifice when a network partition occurs, and why is that the correct trade-off for a cluster datastore? (Hint: consider what would happen if two halves of a partition both accepted writes.)
Answers
- etcd is a distributed, strongly consistent key-value store. It holds the complete state of the cluster: all API objects (Pods, Services, Deployments, Secrets, ConfigMaps, etc.) and their desired/observed state.
- An odd number maximizes fault tolerance per node. Quorum is floor(N/2)+1. A 3-node cluster (quorum 2) tolerates 1 failure; a 4-node cluster (quorum 3) also only tolerates 1 failure but costs an extra node. Odd numbers avoid paying for capacity that does not improve tolerance.
- No. A 5-member cluster has quorum = 3. Losing 3 members leaves only 2 available, which is below quorum. etcd will stop accepting writes (reads may be served stale depending on configuration) until quorum is restored, to avoid split-brain.
- During a network partition, etcd sacrifices availability on the minority side: the partition without a quorum stops serving writes. This is correct because allowing both sides to accept writes independently would create divergent, conflicting cluster states (split-brain) — e.g., two API servers each believing different Pods should exist. Consistency is non-negotiable for a datastore that is the single source of truth, so refusing writes is safer than diverging.
kube-scheduler
Theory
When you create a Pod, it does not immediately run anywhere — it is born "unscheduled," with no node assigned. The kube-scheduler is the matchmaker that watches for these unassigned Pods and decides which node each should run on. It does not start the container itself; it merely writes the chosen node into the Pod's spec.nodeName. The kubelet on that node then takes over.
Scheduling happens in two phases:
- Filtering (Predicates): eliminate nodes that cannot run the Pod. A node is filtered out if it lacks sufficient CPU/memory, fails to match a nodeSelector, has a taint the Pod does not tolerate, or has port conflicts.
- Scoring (Priorities): rank the remaining feasible nodes. Factors include spreading Pods across nodes, packing onto fuller nodes, image locality, and affinity preferences. The highest-scoring node wins (ties broken randomly).
Example
# This Pod requests resources and a node label; the scheduler must find
# a node that satisfies BOTH before binding.
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  nodeSelector:
    disktype: ssd        # filtering: only nodes labeled disktype=ssd qualify
  containers:
  - name: web
    image: nginx
    resources:
      requests:
        cpu: "500m"      # filtering: node must have >= 0.5 CPU free
        memory: "256Mi"
# Watch the scheduler's decision in the Pod's events:
kubectl describe pod web | grep -A2 Events
# Normal Scheduled ... Successfully assigned default/web to node-2
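The two phases can be sketched as a filter step followed by a scoring step. This is a simplified illustration, not the kube-scheduler's algorithm: the Node and Pod classes are hypothetical, taints are reduced to plain strings, and the scoring rule (prefer the node with the most CPU left after placement) is an assumption chosen only to make the ranking visible.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    free_cpu_m: int                 # free CPU in millicores
    free_mem_mi: int                # free memory in MiB
    labels: dict = field(default_factory=dict)
    taints: set = field(default_factory=set)


@dataclass
class Pod:
    cpu_m: int
    mem_mi: int
    node_selector: dict = field(default_factory=dict)
    tolerations: set = field(default_factory=set)


def feasible(pod: Pod, node: Node) -> bool:
    """Filtering: can this node run the Pod at all?"""
    fits = node.free_cpu_m >= pod.cpu_m and node.free_mem_mi >= pod.mem_mi
    selected = all(node.labels.get(k) == v for k, v in pod.node_selector.items())
    tolerated = node.taints <= pod.tolerations
    return fits and selected and tolerated


def score(pod: Pod, node: Node) -> int:
    """Scoring (assumed rule): more CPU left after placement ranks higher."""
    return node.free_cpu_m - pod.cpu_m


def schedule(pod: Pod, nodes: list):
    """Return the chosen node name, or None if the Pod would stay Pending."""
    candidates = [n for n in nodes if feasible(pod, n)]
    if not candidates:
        return None
    return max(candidates, key=lambda n: score(pod, n)).name


nodes = [
    Node("node-1", 2000, 4096, {"disktype": "hdd"}),               # wrong label
    Node("node-2", 1000, 2048, {"disktype": "ssd"}),               # feasible
    Node("node-3", 300, 2048, {"disktype": "ssd"}),                # too little CPU
    Node("node-4", 4000, 8192, {"disktype": "ssd"}, {"dedicated"}),  # untolerated taint
]
web = Pod(cpu_m=500, mem_mi=256, node_selector={"disktype": "ssd"})
print(schedule(web, nodes))  # node-2
```

Filtering is a hard constraint and scoring is a preference: node-4 has the most free capacity but is never scored, because it was eliminated first.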
Exercises
- (Beginner) After you create a Pod, what does the scheduler actually write to make it run on a node?
- (Beginner) Name the two phases of the scheduling process and describe what each accomplishes.
- (Intermediate) A Pod stays in Pending state forever. List three scheduler-related reasons this could happen and how you would diagnose each.
- (Interview) The scheduler only sets spec.nodeName; it never starts the container. Explain the separation of concerns between the scheduler and the kubelet, and why this decoupling is valuable. (Hint: think about which component owns node-local execution.)
Answers
- The scheduler writes the chosen node's name into the Pod's spec.nodeName field (via a Binding subresource). It does not launch the container.
- Filtering (predicates): remove nodes that cannot run the Pod (insufficient resources, taint mismatch, selector mismatch, port conflict). Scoring (priorities): rank the feasible nodes by preference (spreading, affinity, resource balance, image locality); the highest-scoring node is selected.
- (a) Insufficient resources: no node has enough free CPU/memory; check kubectl describe pod events for "Insufficient cpu/memory". (b) Unsatisfiable nodeSelector/affinity: no node has the required labels; check node labels with kubectl get nodes --show-labels. (c) Taints without tolerations: all nodes are tainted (e.g., control-plane taint); check kubectl describe node for taints. The Pod's events from kubectl describe pod usually state the exact reason.
- The scheduler owns the global placement decision (which node), while the kubelet owns local execution (pulling images, starting containers, reporting status) on its own node. Decoupling them means: the scheduler can be replaced or extended without touching node execution; node failures are isolated to the kubelet; and the scheduler remains a pure decision-maker working only through the API server, keeping it stateless and replaceable. The kubelet reacts to the bound Pod via the level-triggered watch model.
kube-controller-manager
Theory
Kubernetes is built on controllers — independent control loops that each watch a resource type and work to make actual state match desired state. Rather than running dozens of separate processes, Kubernetes bundles the core controllers into a single binary, the kube-controller-manager, for operational simplicity. Inside it run the Node controller, ReplicaSet controller, Deployment controller, Job controller, EndpointSlice controller, ServiceAccount controller, and many more.
Each controller follows the same reconciliation pattern: observe current state via the API server's watch mechanism, compare it to desired state, and act to close the gap. For example, the ReplicaSet controller notices a ReplicaSet wants 3 Pods but only 2 exist, and creates one more.
In a high-availability setup you run multiple controller-manager replicas, but only one is active at a time via leader election — because controllers take actions, and duplicate actions would cause chaos.
Example
kube-controller-manager (one process, many loops)
├── Node controller watches Nodes, marks NotReady, evicts Pods
├── ReplicaSet controller maintains desired Pod count
├── Deployment controller manages ReplicaSets for rollouts
├── Job controller runs Pods to completion
├── EndpointSlice controller keeps Service endpoints current
└── ServiceAccount controller creates default SAs / tokens
# Leader election is recorded in a Lease object:
kubectl -n kube-system get lease kube-controller-manager
# NAME HOLDER AGE
# kube-controller-manager control-plane-1_a1b2c3... 10d
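Lease-based leader election can be sketched as a single rule: a candidate may take or renew the lease only if it already holds it or the previous holder has stopped renewing. This is a toy model with explicit timestamps; the Lease class, the replica names, and the 15-second duration are illustrative assumptions, and the real mechanism additionally relies on the API server rejecting conflicting concurrent updates, which is omitted here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Lease:
    holder: Optional[str] = None
    renew_time: float = 0.0
    duration: float = 15.0   # seconds; illustrative value


def try_acquire_or_renew(lease: Lease, candidate: str, now: float) -> bool:
    """Return True if the candidate is the leader after this attempt."""
    expired = lease.holder is None or now - lease.renew_time > lease.duration
    if lease.holder == candidate or expired:
        lease.holder = candidate
        lease.renew_time = now
        return True
    return False


lease = Lease()
print(try_acquire_or_renew(lease, "cm-1", now=0))    # True: lease was free
print(try_acquire_or_renew(lease, "cm-2", now=5))    # False: cm-1 holds it
print(try_acquire_or_renew(lease, "cm-1", now=10))   # True: holder renews
# cm-1 crashes after t=10 and stops renewing.
print(try_acquire_or_renew(lease, "cm-2", now=20))   # False: not yet expired
print(try_acquire_or_renew(lease, "cm-2", now=30))   # True: lease expired
print(lease.holder)                                  # cm-2
```

The standby replica does nothing but retry, so failover takes at most roughly one lease duration after the leader stops renewing.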
Exercises
- (Beginner) What is a "controller" in Kubernetes, in one sentence?
- (Beginner) Why are many controllers packaged into a single kube-controller-manager binary?
- (Intermediate) Name three controllers that run inside the controller-manager and describe what each one keeps in sync.
- (Interview) Why do controller-manager replicas use leader election while the API server does not? (Hint: distinguish components that decide/act from components that serve.)
Answers
- A controller is a control loop that continuously watches a resource and reconciles actual cluster state toward the declared desired state.
- For operational simplicity: bundling them into one process reduces the number of binaries to deploy, configure, secure, and monitor, while each controller still runs as a logically independent loop.
- Examples: Node controller keeps node health/status in sync and evicts Pods from dead nodes; ReplicaSet controller keeps the running Pod count equal to the desired replicas; Job controller ensures a Job's Pods run to successful completion. (Also acceptable: Deployment, EndpointSlice, ServiceAccount controllers.)
- Controllers act on the cluster (create/delete Pods, evict, update endpoints). Two active controller-managers would issue duplicate or conflicting actions, so leader election ensures exactly one is active. The API server only serves and stores requests statelessly, so multiple replicas can run active-active behind a load balancer without conflict.
cloud-controller-manager
Theory
Kubernetes needs to interact with the underlying cloud provider for certain operations: provisioning a load balancer when you create a LoadBalancer Service, attaching storage volumes, or learning a node's region/zone. Originally this cloud-specific code lived inside the core components, which made the project hard to maintain and bound to specific clouds. The cloud-controller-manager (CCM) extracts all cloud-provider-specific logic into a separate, pluggable binary.
This separation means cloud vendors can develop and release their integration on their own schedule, and the core Kubernetes binaries stay vendor-neutral. The CCM runs controllers such as:
- Node controller (cloud part): checks the cloud API to confirm whether a node that stopped responding was actually deleted from the cloud.
- Route controller: configures network routes in the cloud's network for Pod traffic.
- Service controller: creates/updates/deletes cloud load balancers for LoadBalancer Services.
On a self-managed bare-metal cluster with no cloud provider, the CCM is simply absent.
Example
# Creating this Service on a cloud cluster triggers the CCM's Service
# controller to provision an actual cloud load balancer (ELB, GLB, etc.).
apiVersion: v1
kind: Service
metadata:
  name: frontend
spec:
  type: LoadBalancer   # CCM sees this and calls the cloud's LB API
  selector:
    app: frontend
  ports:
  - port: 80
    targetPort: 8080
kubectl get svc frontend
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# frontend LoadBalancer 10.96.0.40 a1b2c3.elb.aws... 80:31234/TCP
# The EXTERNAL-IP is populated by the CCM after the cloud LB is ready.
Exercises
- (Beginner) What problem does separating cloud-specific logic into the cloud-controller-manager solve?
- (Beginner) On a bare-metal cluster with no cloud provider, is the cloud-controller-manager present?
- (Intermediate) You create a type: LoadBalancer Service on a self-managed bare-metal cluster and the EXTERNAL-IP stays <pending> forever. Explain why, and name a tool that addresses this.
- (Interview) Why was moving cloud-provider code out of the core kube-controller-manager beneficial for both the Kubernetes project and cloud vendors? (Hint: think about release cadence and code ownership.)
Answers
- It decouples vendor-specific integration (load balancers, volumes, node metadata, routes) from the core Kubernetes codebase, keeping the core vendor-neutral and letting cloud providers maintain their own integrations independently.
- No. With no cloud provider configured, there is no cloud-controller-manager — its controllers have nothing to integrate with.
- There is no cloud provider to fulfill the LoadBalancer request, so no external load balancer is provisioned and EXTERNAL-IP never populates. A tool like MetalLB (or kube-vip) implements LoadBalancer Services on bare metal by assigning IPs from a configured pool and announcing them via ARP/BGP.
- The core project no longer has to carry, test, and release code for every cloud, reducing maintenance burden and avoiding vendor lock-in. Cloud vendors gain ownership of their integration and can fix bugs or add features on their own release cadence without waiting for a Kubernetes release, improving velocity for both sides.
2.2 Worker Node Components
Worker nodes are where containers actually run. Each node runs a small set of components that register the node, execute Pods, wire up networking, and talk to a container runtime. This subchapter covers them.
kubelet
Theory
The kubelet is the primary node agent — the on-the-ground supervisor present on every node, including control plane nodes. Its job is narrow but critical: ensure that the containers described in the Pods assigned to its node are running and healthy. It is the bridge between the cluster's desired state and the actual processes on the machine.
The kubelet watches the API server for Pods bound to its node (spec.nodeName matches). For each such Pod, it:
- Pulls the required container images.
- Asks the container runtime (via the CRI) to create and start the containers.
- Runs the Pod's liveness, readiness, and startup probes.
- Reports the Pod's status and the node's health (heartbeats) back to the API server.
Importantly, the kubelet only manages Pods it knows about from the API server (plus "static Pods" defined by local manifest files). It does not manage arbitrary containers you start manually with docker run.
Example
# Static Pods: the kubelet runs any manifest placed in this directory,
# even without a working API server. Control plane components often run this way.
ls /etc/kubernetes/manifests/
# etcd.yaml kube-apiserver.yaml kube-controller-manager.yaml kube-scheduler.yaml
# The kubelet sends node heartbeats; a missed heartbeat eventually marks
# the node NotReady:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# node-1 Ready <none> 12d v1.30.2
# node-2 NotReady <none> 12d v1.30.2 <- kubelet stopped reporting
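The gap between "the kubelet went silent" and "the Pods are replaced" can be laid out as a timeline. This is a rough sketch: the 40-second grace period is an assumed, illustrative value that varies by version and configuration, while the 300 seconds corresponds to the 5-minute default eviction timeout described in this chapter's answers.

```python
GRACE_PERIOD_S = 40    # assumption: illustrative heartbeat grace period
TOLERATION_S = 300     # the 5-minute default eviction timeout


def control_plane_view(seconds_since_heartbeat: float) -> str:
    """What the control plane believes, given how long the kubelet has been silent."""
    if seconds_since_heartbeat <= GRACE_PERIOD_S:
        return "Ready"
    if seconds_since_heartbeat <= GRACE_PERIOD_S + TOLERATION_S:
        return "NotReady, Pods still bound to the node"
    return "NotReady, Pods evicted and replaced elsewhere"


for t in (10, 120, 400):
    print(t, control_plane_view(t))
# 10 Ready
# 120 NotReady, Pods still bound to the node
# 400 NotReady, Pods evicted and replaced elsewhere
```

The middle state is the one that surprises people: the node is already NotReady, yet nothing has been rescheduled, and containers on the node may still be running.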
Exercises
- (Beginner) On which nodes does the kubelet run?
- (Beginner) Does the kubelet manage containers you start manually with docker run? Why or why not?
- (Intermediate) What is a "static Pod," where is it defined, and why are control plane components often run as static Pods?
- (Interview) A node shows NotReady, but the application Pods on it appear to keep serving traffic for several minutes. Explain what is happening from the kubelet's and control plane's perspectives. (Hint: heartbeats vs. running containers vs. eviction timeout.)
Answers
- The kubelet runs on every node in the cluster, including control plane nodes (so they can run static Pods like the API server).
- No. The kubelet only manages Pods assigned to its node by the API server, plus static Pods from its manifest directory. Containers started directly via the runtime or docker run are outside Kubernetes' knowledge, so the kubelet ignores them.
- A static Pod is a Pod defined by a manifest file in the kubelet's watched directory (commonly /etc/kubernetes/manifests, the kubeadm default). The kubelet runs it directly without involving the scheduler or controllers, and it works even before or without the API server, which is exactly why bootstrap control plane components (apiserver, etcd, scheduler, controller-manager) run as static Pods.
- The kubelet on the failing node has stopped sending heartbeats (e.g., it crashed or lost network), so the node controller marks the node NotReady. However, the containers themselves may still be running locally and reachable, so traffic continues. The control plane waits for the eviction timeout (default ~5 minutes via the unreachable taint) before marking the Pods for deletion and rescheduling them elsewhere. During that window the Pods are still bound to the NotReady node in the API and no replacements have been created, which is why the original containers keep serving until eviction kicks in.
kube-proxy
Theory
Pods are ephemeral — they come and go, each with its own IP. A Service gives a stable virtual IP (ClusterIP) that fronts a changing set of Pods. But something has to actually route traffic sent to that virtual IP to a real Pod. That something is kube-proxy, a network agent running on every node.
kube-proxy watches the API server for Services and their EndpointSlices (the current set of healthy backing Pod IPs) and programs the node's networking rules so that traffic to a Service's ClusterIP is load-balanced to one of the backend Pods. It typically operates in iptables mode (writes iptables rules) or IPVS mode (uses the kernel's IP Virtual Server for better performance at scale). It does not proxy packets in userspace in modern modes — it configures the kernel to do the routing.
Example
Request to Service ClusterIP 10.96.0.10:80
|
v (kube-proxy programmed kernel rules on this node)
DNAT to one of: 10.244.1.5:8080
10.244.2.7:8080 <- chosen pod (random/round-robin)
10.244.3.9:8080
# In iptables mode, you can see the rules kube-proxy created for a Service:
iptables -t nat -L KUBE-SERVICES -n | grep 10.96.0.10
# Each Service ClusterIP maps to a chain that DNATs to backend Pod IPs.
# Check kube-proxy is running on every node (it's a DaemonSet):
kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
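The rule-programming idea can be modeled as a lookup table that is rewritten whenever a Service's endpoints change. This is a conceptual sketch only: the ServiceTable class and its method names are hypothetical, and real kube-proxy writes iptables or IPVS rules into the kernel rather than choosing backends in a Python process.

```python
import random


class ServiceTable:
    """Toy stand-in for the per-node rules kube-proxy maintains."""

    def __init__(self, rng=None):
        self._backends = {}
        self._rng = rng or random.Random()

    def sync(self, cluster_ip_port: str, endpoints: list) -> None:
        """Replace the rules for one Service when its endpoints change."""
        self._backends[cluster_ip_port] = list(endpoints)

    def dnat(self, destination: str):
        """Pick a backend for a connection to a ClusterIP, or None if no rule matches."""
        backends = self._backends.get(destination)
        if not backends:
            return None
        return self._rng.choice(backends)


table = ServiceTable()
table.sync("10.96.0.10:80", ["10.244.1.5:8080", "10.244.2.7:8080", "10.244.3.9:8080"])
print(table.dnat("10.96.0.10:80"))   # one of the three Pod addresses

# A Pod is replaced: the Service IP is unchanged, only the rules are rewritten.
table.sync("10.96.0.10:80", ["10.244.1.5:8080", "10.244.4.2:8080"])
print(table.dnat("10.96.0.10:80"))   # one of the two current Pod addresses
print(table.dnat("10.96.0.99:80"))   # None
```

Clients keep using the same ClusterIP throughout; only the table behind it changes, which is the insulation from Pod churn that the Service abstraction promises.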
Exercises
- (Beginner) What problem does kube-proxy solve, given that Pod IPs change constantly?
- (Beginner) Name the two common modes kube-proxy operates in.
- (Intermediate) Traffic to a Service's ClusterIP is failing, but the backend Pods are healthy and reachable by their Pod IPs. How would you investigate kube-proxy as the cause?
- (Interview) Modern kube-proxy modes program the kernel (iptables/IPVS) rather than proxying packets in userspace. Why is that a significant performance improvement? (Hint: think about context switches and per-packet handling.)
Answers
- kube-proxy implements the Service abstraction at the network level: it programs each node so that traffic to a Service's stable virtual IP is forwarded and load-balanced to the current set of healthy backend Pod IPs, insulating clients from Pod churn.
- iptables mode (programs iptables NAT rules) and IPVS mode (uses the kernel's IP Virtual Server, more efficient with many Services).
- Verify kube-proxy is running on the relevant node(s) (kubectl -n kube-system get pods -l k8s-app=kube-proxy), check its logs, confirm the Service has populated EndpointSlices (kubectl get endpointslices), and inspect the kernel rules (iptables -t nat -L -n | grep <clusterIP> or ipvsadm -Ln). Missing rules or empty endpoints point to kube-proxy or selector/label issues.
- Userspace proxying copies every packet between the kernel and a userspace process, incurring a context switch and data copy per packet, which is expensive and a bottleneck under load. Programming iptables/IPVS lets the kernel route packets directly in the network stack with no userspace round trip, dramatically reducing per-packet overhead and scaling to far more connections and Services.
Container runtime interface (CRI)
Theory
The kubelet does not know how to start a container by itself — it delegates that to a container runtime. Early on, Kubernetes had hard-coded support for Docker. To support multiple runtimes cleanly, the project defined the Container Runtime Interface (CRI): a standard gRPC API the kubelet uses to talk to any compliant runtime. This is the same decoupling philosophy as the CRI's cousins, CNI (networking) and CSI (storage).
The CRI defines two gRPC services:
- RuntimeService: lifecycle of Pods (sandboxes) and containers — create, start, stop, remove, exec, list.
- ImageService: pulling, listing, and removing container images.
Because of the CRI, the kubelet is runtime-agnostic. As long as a runtime implements the CRI, the kubelet can use it. This is what allowed Kubernetes to remove the built-in "dockershim" in v1.24 — Docker itself does not implement the CRI directly.
Example
+---------+ CRI (gRPC) +--------------+ OCI runtime +-------+
| kubelet | -------------> | containerd / | -------------> | runc | -> container
| | RuntimeSvc | CRI-O | +-------+
| | ImageSvc +--------------+
+---------+
# crictl is a CRI-compliant CLI for debugging at the node level
# (analogous to docker, but speaks CRI directly to the runtime):
crictl ps # list running containers known to the runtime
crictl images # list images
crictl pods # list Pod sandboxes
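The value of the CRI is that the kubelet's logic depends on an interface, not on a particular runtime. The sketch below illustrates that pattern with an abstract base class and two fake runtimes. Everything in it is hypothetical: the real CRI is a gRPC API with many more calls, and the method names here are simplified stand-ins rather than the actual RPC names.

```python
from abc import ABC, abstractmethod


class Runtime(ABC):
    """Heavily simplified stand-in for the CRI's image and runtime services."""

    @abstractmethod
    def pull_image(self, image: str) -> None: ...

    @abstractmethod
    def start_container(self, name: str, image: str) -> str: ...


class FakeContainerd(Runtime):
    def __init__(self):
        self.images = set()

    def pull_image(self, image: str) -> None:
        self.images.add(image)

    def start_container(self, name: str, image: str) -> str:
        return f"containerd://{name}"


class FakeCrio(Runtime):
    def __init__(self):
        self.images = set()

    def pull_image(self, image: str) -> None:
        self.images.add(image)

    def start_container(self, name: str, image: str) -> str:
        return f"cri-o://{name}"


def run_pod(runtime: Runtime, name: str, image: str) -> str:
    """Kubelet-side logic: written once, against the interface only."""
    runtime.pull_image(image)
    return runtime.start_container(name, image)


print(run_pod(FakeContainerd(), "web", "nginx"))  # containerd://web
print(run_pod(FakeCrio(), "web", "nginx"))        # cri-o://web
```

Swapping the runtime changes which class is passed in, not the body of run_pod, which is the property that made removing a built-in runtime integration possible.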
Exercises
- (Beginner) What does the CRI let the kubelet do that hard-coded runtime support did not?
- (Beginner) Name the two gRPC services the CRI defines and what each handles.
- (Intermediate) Why was Kubernetes able to remove "dockershim" in v1.24 without breaking the ability to run containers? What did clusters switch to?
- (Interview) The CRI, CNI, and CSI all follow the same architectural pattern. Describe that pattern and explain why Kubernetes favors it. (Hint: stable interface vs. swappable implementation.)
Answers
- The CRI lets the kubelet talk to any compliant container runtime through a single standard gRPC API, making the kubelet runtime-agnostic instead of tied to one specific runtime.
- RuntimeService handles Pod sandbox and container lifecycle (create/start/stop/remove/exec/list); ImageService handles image operations (pull/list/remove).
- Docker does not implement the CRI natively; the kubelet used a shim ("dockershim") to translate. Maintaining that shim in-tree was a burden, so it was removed. Clusters switched to CRI-native runtimes — most commonly containerd (which Docker already uses under the hood) or CRI-O — so containers still run normally.
- The pattern is a stable, well-defined interface with swappable implementations: Kubernetes core depends only on the interface (CRI for runtimes, CNI for networking, CSI for storage), while vendors provide pluggable implementations behind it. This keeps the core small and vendor-neutral, lets implementations evolve and ship independently, and allows operators to choose the best implementation for their needs without changing Kubernetes itself.
containerd and CRI-O
Theory
With the CRI defining the contract, two runtimes dominate real clusters: containerd and CRI-O. Both implement the CRI and both ultimately use a low-level OCI runtime (usually runc) to actually create the container process via Linux namespaces and cgroups.
- containerd: a general-purpose, industry-standard runtime originally extracted from Docker and donated to the CNCF. It is broader in scope than just Kubernetes (it can be embedded in other tools), supports a plugin model, and is the default runtime in most managed Kubernetes offerings. Its CRI support is provided by a built-in plugin.
- CRI-O: a lightweight runtime built specifically for Kubernetes by Red Hat and the community. It implements only what the CRI requires — nothing more — which keeps it minimal and tightly aligned to Kubernetes releases. It is the default in OpenShift.
In practice both are production-grade; the choice often comes down to distribution defaults and operational preference.
Example
| Aspect | containerd | CRI-O |
|---|---|---|
| Origin | Extracted from Docker, CNCF graduated | Built by Red Hat for Kubernetes |
| Scope | General-purpose, embeddable | Kubernetes-only |
| CRI support | Via built-in CRI plugin | Native, the entire purpose |
| OCI runtime | runc (default) | runc (default) |
| Common in | GKE, EKS, AKS, kind, k3s | OpenShift, many on-prem |
| CLI for debugging | ctr, crictl | crictl |
# Find out which runtime a node uses:
kubectl get nodes -o wide
# ... CONTAINER-RUNTIME
# ... containerd://1.7.13
Exercises
- (Beginner) What do both containerd and CRI-O ultimately use to create the container process?
- (Beginner) Which runtime was built specifically and only for Kubernetes?
- (Intermediate) How would you determine which container runtime a given node is using?
- (Interview) containerd has broader scope than CRI-O, while CRI-O is minimal. Discuss the trade-offs of a general-purpose runtime versus a purpose-built one for a Kubernetes platform team. (Hint: think features and reuse vs. surface area and alignment.)
Answers
- Both use a low-level OCI runtime, typically runc, which sets up the Linux namespaces and cgroups for the container process.
- CRI-O was built specifically and exclusively for Kubernetes (it implements only the CRI).
- Run kubectl get nodes -o wide and read the CONTAINER-RUNTIME column (e.g., containerd://1.7.13 or cri-o://1.30.0). On the node itself, crictl version also reports the runtime.
- A general-purpose runtime (containerd) offers broader functionality, a plugin ecosystem, and reuse outside Kubernetes, but carries more surface area and features a Kubernetes-only platform may never use. A purpose-built runtime (CRI-O) has a smaller footprint, fewer moving parts, and tracks Kubernetes releases tightly, which reduces attack surface and version drift, at the cost of not being reusable elsewhere. Teams optimizing for minimalism and tight K8s alignment lean CRI-O; teams valuing ecosystem breadth and broad tooling lean containerd.
2.3 Cluster Communication
This subchapter zooms out to how the components actually talk to each other: everything flows through the API server, using an efficient watch mechanism, and every request passes through authentication, authorization, and admission before it is stored.
API server as the central hub
Theory
A recurring theme is worth stating explicitly: Kubernetes uses a hub-and-spoke communication model with the API server at the center. No other component talks directly to another: the scheduler does not call the kubelet, and the controller-manager does not call etcd (only the API server reads and writes etcd). Instead, every component reads from and writes to the API server, and reacts to changes it observes there.
This design has profound benefits:
- Single security boundary: authn, authz, admission, and audit happen in exactly one place.
- Loose coupling: components don't need to know about each other's network addresses or even existence; they only know the API server.
- Extensibility: a new controller or operator just watches and updates objects — no changes to existing components required.
The flip side: the API server is on the critical path for everything, so its availability and performance are paramount.
Example
+--------------------+
| kube-apiserver |<---- kubectl (you)
+---------+----------+
^ ^ ^ ^
| | | | (everyone watches/writes here, never each other)
+----+ +-+ +-+ +----+
|sched| |cm| |kubelet x N
+-----+ +--+ +-------+
# Example of the indirection: the scheduler "tells" the kubelet to run a
# Pod by writing spec.nodeName via the API server — it never calls the kubelet.
kubectl get pod web -o jsonpath='{.spec.nodeName}' # node-2
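The indirection above can be imitated in a few lines of Python. This is only a toy model under stated assumptions: the ApiServer, Scheduler, and Kubelet classes are hypothetical stand-ins, callbacks replace real watch connections, and the placement logic is deliberately trivial. The point it illustrates is that the scheduler and kubelet hold a reference to the hub, never to each other.

```python
from typing import Callable


class ApiServer:
    """Toy hub: stores Pod objects and notifies every registered watcher."""

    def __init__(self) -> None:
        self.pods: dict[str, dict] = {}
        self.watchers: list[Callable[[dict], None]] = []

    def watch(self, callback: Callable[[dict], None]) -> None:
        self.watchers.append(callback)

    def write(self, pod: dict) -> None:
        self.pods[pod["name"]] = dict(pod)
        for notify in list(self.watchers):
            notify(dict(pod))  # each watcher gets its own copy


class Scheduler:
    def __init__(self, api: ApiServer, nodes: list[str]) -> None:
        self.api, self.nodes = api, nodes
        api.watch(self.on_pod)

    def on_pod(self, pod: dict) -> None:
        if pod.get("nodeName") is None:       # unscheduled Pod observed
            pod["nodeName"] = self.nodes[0]   # trivial placement decision
            self.api.write(pod)               # bind through the hub only


class Kubelet:
    def __init__(self, api: ApiServer, node: str) -> None:
        self.node = node
        self.running: list[str] = []
        api.watch(self.on_pod)

    def on_pod(self, pod: dict) -> None:
        if pod.get("nodeName") == self.node and pod["name"] not in self.running:
            self.running.append(pod["name"])  # "start" the Pod locally


api = ApiServer()
kubelet = Kubelet(api, "node-2")
Scheduler(api, ["node-2"])
api.write({"name": "web", "nodeName": None})   # like kubectl creating a Pod
print(kubelet.running)                         # ['web']
```

Adding another controller to this model means registering one more watcher; neither the scheduler nor the kubelet class changes, which mirrors the extensibility benefit listed above.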
Exercises
- (Beginner) In the hub-and-spoke model, what is the hub?
- (Beginner) Does the scheduler ever directly contact the kubelet? How does its decision reach the kubelet?
- (Intermediate) Give two concrete benefits of routing all communication through the API server instead of letting components talk peer-to-peer.
- (Interview) The hub-and-spoke model makes the API server a critical dependency. What architectural measures keep this from being a single point of failure? (Hint: stateless replicas, etcd quorum, load balancing.)
Answers
- The kube-apiserver is the hub; all other components are spokes.
- No, the scheduler never contacts the kubelet directly. It writes the binding (spec.nodeName) to the API server; the kubelet on that node observes the change via its watch and acts on it.
- Any two: a single place to enforce authentication/authorization/admission/audit; loose coupling (components only need to know the API server, not each other); easy extensibility (new controllers just watch/update objects); consistent validation of all writes.
- Run multiple stateless API server replicas behind a load balancer (active-active), back them with a multi-member etcd cluster that maintains quorum (tolerating member failures), and place the control plane across failure domains/availability zones. Because the API server holds no state, any replica can serve any request, so losing one causes no data loss and minimal disruption.
Watch and event-driven updates
Theory
If every component had to constantly poll the API server ("any changes yet? any changes yet?"), the load would be enormous and updates would lag. Instead Kubernetes uses watches: a client opens a long-lived connection and the API server pushes notifications whenever a matching object is created, updated, or deleted. This is the event-driven backbone of the whole system.
Under the hood:
- Each object has a resourceVersion (derived from etcd's revision). A watch can resume from a given resourceVersion, so a client that reconnects does not miss changes.
- Clients typically use an informer (from client-go): it does an initial LIST to build a local cache, then a WATCH to keep the cache updated incrementally. Controllers read from this local cache instead of hammering the API server.
- Watches make the system level-triggered in spirit but efficient in practice: controllers are woken by events but always reconcile against current state.
Example
# Observe the watch stream yourself — this blocks and prints changes live:
kubectl get pods --watch
# NAME READY STATUS AGE
# web 0/1 Pending 0s
# web 0/1 ContainerCreating 1s
# web 1/1 Running 4s <- pushed as state changes
LIST + WATCH pattern (informer):
1. LIST all Pods at resourceVersion=1000 -> populate local cache
2. WATCH from resourceVersion=1000 -> receive deltas (ADD/UPDATE/DELETE)
3. On disconnect, re-WATCH from last seen resourceVersion (no missed events)
Exercises
- (Beginner) Why is the watch mechanism preferable to having every component poll the API server?
- (Beginner) What is the purpose of an object's resourceVersion in the context of watches?
- (Intermediate) Describe the LIST-then-WATCH pattern an informer uses and why the initial LIST is necessary.
- (Interview) A controller's watch connection drops for 10 seconds and several objects change during the gap. Explain how the controller avoids missing those changes when it reconnects. (Hint: resourceVersion and re-list/resync.)
Answers
- Watches push changes to clients over a long-lived connection, so clients learn of changes immediately without repeated polling. Polling would create high, constant load on the API server and introduce latency between a change and its detection.
- The resourceVersion is a monotonic marker (from etcd) identifying a point in the object's change history. A watch can start/resume from a specific resourceVersion so the client receives all subsequent changes and does not miss or duplicate events across reconnects.
- The informer first issues a LIST to fetch the full current set of objects and build a local cache (and learn the current resourceVersion). It then opens a WATCH from that resourceVersion to receive incremental deltas. The initial LIST is necessary because a watch only delivers future changes; without it the cache would start empty and the controller would lack a baseline of current state.
- On reconnect the client resumes the watch from the last observed resourceVersion, and the API server replays the changes that occurred since then (as long as they are still within the server's history window). If the resourceVersion is too old to replay ("too old" error), the informer performs a fresh LIST to resync the cache, then re-establishes the watch. Either way, level-triggered reconciliation against current state ensures correctness even if individual events were missed.
Authentication and authorization flow
Theory
Before any request changes the cluster, the API server runs it through a gauntlet. The first two stages are authentication (who are you?) and authorization (are you allowed to do this?).
- Authentication (authn): the API server verifies the caller's identity using one or more configured methods: X.509 client certificates, bearer tokens, ServiceAccount tokens, or OIDC. The result is a username and group memberships. If no authenticator succeeds, the request is rejected with 401 Unauthorized. Note: Kubernetes has no built-in "user" object; users are external identities.
- Authorization (authz): given the authenticated identity, the API server asks the configured authorizers (most commonly RBAC) whether this identity may perform this verb (get/list/create/update/delete) on this resource in this namespace. If none allows it, the request is rejected with 403 Forbidden.
Only after passing both does the request continue to admission control.
Example
Request --> [ Authentication ] --> [ Authorization ] --> [ Admission ] --> etcd
               who are you?        are you allowed?     mutate/validate
               401 if fail         403 if fail          4xx if rejected
# "Can I do this?" — query the authorization layer without performing the action:
kubectl auth can-i create deployments --namespace dev
# yes
kubectl auth can-i delete nodes
# no
# Check as another identity (impersonation, if you have permission):
kubectl auth can-i list secrets --as=system:serviceaccount:dev:builder -n dev
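The ordering of the first two stages can be sketched in Python. This is a simplified model, not the API server's implementation: the token table, the rule set, and the function names are all hypothetical, and real RBAC rules are richer than the flat tuples used here. What it does mirror is that an unknown caller fails with 401 before authorization is consulted, and that authorization is allow-only (no matching rule means 403).

```python
# Hypothetical credentials: bearer token -> (username, groups)
TOKENS = {"token-alice": ("alice", {"dev-team"})}

# Hypothetical allow rules: (group, verb, resource, namespace)
ALLOW_RULES = {("dev-team", "create", "deployments", "dev")}


def authenticate(token: str):
    """Return (username, groups) for a known token, else None."""
    return TOKENS.get(token)


def authorize(groups: set, verb: str, resource: str, namespace: str) -> bool:
    """Allow only if some rule matches; there are no deny rules."""
    return any((g, verb, resource, namespace) in ALLOW_RULES for g in groups)


def handle_request(token: str, verb: str, resource: str, namespace: str) -> int:
    identity = authenticate(token)
    if identity is None:
        return 401  # authentication failed: identity unknown
    _user, groups = identity
    if not authorize(groups, verb, resource, namespace):
        return 403  # authenticated, but not permitted
    return 200      # would continue to admission control


print(handle_request("bad-token", "create", "deployments", "dev"))
print(handle_request("token-alice", "delete", "nodes", ""))
print(handle_request("token-alice", "create", "deployments", "dev"))
```

The three calls correspond to an unauthenticated caller, an authenticated caller without permission, and a permitted request.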
Exercises
- (Beginner) What question does authentication answer, and what question does authorization answer?
- (Beginner) Which HTTP status code indicates an authentication failure, and which indicates an authorization failure?
- (Intermediate) Use a single kubectl command to check whether the ServiceAccount dev:builder can list Secrets in the dev namespace.
- (Interview) Kubernetes has no built-in user objects, yet it authenticates users. How is this possible, and what are two mechanisms used to establish user identity? (Hint: certificates and external identity providers.)
Answers
- Authentication answers "who are you?" (establishes identity). Authorization answers "are you allowed to do this?" (checks permissions for the requested action).
- Authentication failure -> 401 Unauthorized. Authorization failure -> 403 Forbidden.
- kubectl auth can-i list secrets --as=system:serviceaccount:dev:builder -n dev
- Users are external to Kubernetes; there is no User API object. Identity is asserted by credentials the API server validates: e.g., (a) X.509 client certificates where the certificate's Common Name is the username and Organization fields are groups, signed by a CA the API server trusts; and (b) OIDC tokens from an external identity provider (Google, Okta, Azure AD, Dex), where the API server validates the JWT and extracts username/groups from claims. (Bearer tokens and authenticating webhooks are also valid.) ServiceAccounts, by contrast, are in-cluster objects for in-cluster identities.
Admission controllers pipeline
Theory
After authn and authz, but before the object is persisted, the request passes through the admission control pipeline — a chain of plugins that can inspect, modify, or reject the request based on policy. This is where the cluster enforces rules that go beyond "who can do what" into "what is allowed to exist and how."
Admission runs in two ordered phases:
- Mutating admission: plugins may change the object (e.g., inject a sidecar, set default values, add labels). The MutatingAdmissionWebhook plugin calls external webhooks for this.
- Validating admission: plugins may accept or reject but not modify (e.g., enforce that no privileged Pods are created). The ValidatingAdmissionWebhook plugin and the newer ValidatingAdmissionPolicy (CEL-based) handle this.
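If any admission plugin rejects the request, it never reaches etcd and the client gets a 4xx error. The exact status depends on the plugin: built-in plugins usually return 403 Forbidden, while webhooks and policies set their own code (commonly 400, 403, or 422). Built-in controllers include NamespaceLifecycle, LimitRanger, ResourceQuota, and PodSecurity.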
Example
            +--------------------- Admission ----------------------+
authn/authz | Mutating (can modify) --> Validating (accept/reject) | --> etcd
            +------------------------------------------------------+
              e.g. inject sidecar,      e.g. block
              set defaults              privileged Pods
# A MutatingWebhookConfiguration wires an external webhook into the pipeline:
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata:
  name: sidecar-injector
webhooks:
  - name: inject.example.com
    rules:
      - operations: ["CREATE"]
        apiGroups: [""]
        apiVersions: ["v1"]
        resources: ["pods"]   # intercept Pod creation to inject a sidecar
    clientConfig:
      service:
        name: injector
        namespace: sidecar-system
        path: /mutate
    admissionReviewVersions: ["v1"]
    sideEffects: None
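Why the mutating phase runs first can be shown with a small Python sketch. It is an illustration only: Pods are plain dictionaries, and inject_sidecar, default_privileged, deny_privileged, and admit are hypothetical helpers, not real plugin names. Validators see the object only after every mutator has run, so they judge the final form that would be persisted.

```python
import copy


class Rejected(Exception):
    """Raised by a validator; the object is never persisted."""


def inject_sidecar(pod: dict) -> dict:
    """Mutating step: add a 'proxy' container if it is missing."""
    pod = copy.deepcopy(pod)
    if all(c["name"] != "proxy" for c in pod["containers"]):
        pod["containers"].append({"name": "proxy"})
    return pod


def default_privileged(pod: dict) -> dict:
    """Mutating step: fill in privileged=False where unset."""
    pod = copy.deepcopy(pod)
    for container in pod["containers"]:
        container.setdefault("privileged", False)
    return pod


def deny_privileged(pod: dict) -> None:
    """Validating step: may reject, must not modify."""
    for container in pod["containers"]:
        if container.get("privileged"):
            raise Rejected(f"container {container['name']} is privileged")


def admit(pod: dict, mutators: list, validators: list) -> dict:
    for mutate in mutators:        # phase 1: mutating admission
        pod = mutate(pod)
    for validate in validators:    # phase 2: validating admission
        validate(pod)
    return pod                     # would now be written to etcd


stored = admit(
    {"containers": [{"name": "app"}]},
    [inject_sidecar, default_privileged],
    [deny_privileged],
)
print([c["name"] for c in stored["containers"]])  # ['app', 'proxy']
```

Because deny_privileged runs after inject_sidecar, the injected container is checked too; with the phases reversed it would slip past validation.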
Exercises
- (Beginner) At what point in request handling do admission controllers run — before or after the object is stored in etcd?
- (Beginner) What is the key difference between a mutating and a validating admission controller?
- (Intermediate) Why must the mutating phase run before the validating phase? Give an example where the order matters.
- (Interview) Admission webhooks let you enforce arbitrary organizational policy. What is a major operational risk of a poorly configured mutating/validating webhook, and how can you mitigate it? (Hint: think about failurePolicy and what happens when the webhook endpoint is down.)
Answers
- Admission controllers run after authentication and authorization but before the object is persisted to etcd.
- A mutating controller can modify the incoming object (set defaults, inject containers/labels). A validating controller can only accept or reject it — it cannot change it.
- Mutating runs first so that validating sees the final, post-mutation object. Example: a mutating webhook injects a sidecar container and security context defaults; a validating webhook (or PodSecurity) then checks the complete Pod against policy. If validation ran first, it would judge an incomplete object and could pass something the mutation later makes non-compliant — or reject something the mutation would have fixed.
- If a webhook is configured with failurePolicy: Fail and its endpoint becomes unavailable, the API server cannot complete admission and will reject the affected requests, potentially blocking all Pod creation cluster-wide (an outage). Mitigations: scope the webhook narrowly with precise rules and namespaceSelector/objectSelector (exclude kube-system), run the webhook service highly available, set sensible timeoutSeconds, and consider failurePolicy: Ignore for non-critical policies (accepting that some requests bypass the check during outages). Always test webhook downtime scenarios.
2.4 Add-ons and Extensions
A freshly bootstrapped cluster is missing pieces most workloads assume exist: in-cluster DNS, metrics, a dashboard, and a network implementation. These are provided by add-ons — themselves running as Kubernetes workloads. This subchapter covers the most important ones.
CoreDNS
Theory
Services need names, not just IPs. CoreDNS is the cluster's DNS server, deployed as a Deployment in the kube-system namespace. It gives every Service a stable DNS name so Pods can find each other by name (e.g., payments.default.svc.cluster.local) instead of hard-coding ClusterIPs that could change.
When a Pod makes a DNS query, it is routed (via the Pod's /etc/resolv.conf, configured by the kubelet) to the CoreDNS Service. CoreDNS watches the API server for Services and Endpoints and answers queries from that live data. It is configured via a Corefile stored in a ConfigMap, using a plugin chain (the kubernetes plugin handles cluster records; forward sends external queries upstream).
Example
# Every Service gets a name of the form:
# <service>.<namespace>.svc.cluster.local
kubectl run test --rm -it --image=busybox -- nslookup payments.default
# Server: 10.96.0.10
# Address: 10.96.0.10:53
# Name: payments.default.svc.cluster.local
# Address: 10.96.43.12
# CoreDNS Corefile (in the coredns ConfigMap):
.:53 {
    kubernetes cluster.local in-addr.arpa ip6.arpa {   # serve cluster records
        pods insecure
    }
    forward . /etc/resolv.conf   # forward everything else upstream
    cache 30
}
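A short name such as payments.default resolves because the resolver expands it using the search list in the Pod's /etc/resolv.conf. The Python sketch below models that expansion. It assumes the search list and the ndots:5 option that the kubelet typically writes for the default ClusterFirst DNS policy, and it leaves out any search domains inherited from the node. The function names are hypothetical.

```python
CLUSTER_DOMAIN = "cluster.local"  # assumed default cluster domain


def service_fqdn(service: str, namespace: str) -> str:
    """Build <service>.<namespace>.svc.<cluster-domain>."""
    return f"{service}.{namespace}.svc.{CLUSTER_DOMAIN}"


def candidate_names(name: str, pod_namespace: str, ndots: int = 5) -> list[str]:
    """Names a resolver would try, in order, for a lookup made inside a Pod."""
    if name.endswith("."):
        return [name]  # already fully qualified; no search expansion
    search = [
        f"{pod_namespace}.svc.{CLUSTER_DOMAIN}",
        f"svc.{CLUSTER_DOMAIN}",
        CLUSTER_DOMAIN,
    ]
    expanded = [f"{name}.{domain}" for domain in search]
    if name.count(".") < ndots:
        return expanded + [name]   # few dots: try the search list first
    return [name] + expanded       # many dots: try the name as-is first


for candidate in candidate_names("payments.default", "default"):
    print(candidate)
```

For payments.default queried from the default namespace, the second candidate is payments.default.svc.cluster.local, the name CoreDNS answers for. The same model shows why external names with few dots cost several failed lookups before the as-is query is tried.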
Exercises
- (Beginner) What does CoreDNS provide to the cluster, and in which namespace does it run?
- (Beginner) Write the fully qualified DNS name for a Service named cache in the prod namespace.
- (Intermediate) A Pod cannot resolve payments.default. List three things you would check to diagnose the DNS failure.
- (Interview) CoreDNS answers from data it watches in the API server rather than a static zone file. Why is this essential in a dynamic cluster? (Hint: Services and endpoints change constantly.)
Answers
- CoreDNS provides in-cluster DNS (resolving Service and Pod names to IPs) and runs in the kube-system namespace (typically as the coredns Deployment fronted by the kube-dns Service).
- cache.prod.svc.cluster.local
- Any three: confirm CoreDNS Pods are running and healthy (kubectl -n kube-system get pods -l k8s-app=kube-dns); check the Pod's /etc/resolv.conf points at the cluster DNS IP; verify the target Service exists and has endpoints (kubectl get svc,endpointslices); test resolution from a debug Pod (nslookup); inspect CoreDNS logs and the Corefile; check NetworkPolicies aren't blocking port 53.
- Services and the Pods backing them are created, deleted, and rescheduled constantly, changing IPs frequently. A static zone file would be stale within seconds. By watching the API server, CoreDNS always answers with the current set of Services and endpoints, so name resolution stays correct as the cluster changes, which is the whole point of stable names over ephemeral IPs.
Kubernetes Dashboard
Theory
The Kubernetes Dashboard is an optional web-based UI for the cluster. It lets you view and manage workloads, inspect logs, see resource usage, and create/edit objects through a browser instead of kubectl. It is useful for newcomers, for quick visual inspection, and for operators who prefer a GUI for some tasks.
The Dashboard is not installed by default and runs as a workload in the cluster (commonly the kubernetes-dashboard namespace). Because it can be a powerful entry point, it must be secured carefully: it authenticates users and acts on their behalf via the API server, respecting their RBAC permissions. Historically, misconfigured Dashboards (granted cluster-admin and exposed publicly) have been a notable attack vector — so it should be exposed via kubectl proxy or an authenticated Ingress, never wide open with broad privileges.
Example
# Install (community manifest) and access securely via the API server proxy:
kubectl apply -f https://raw.githubusercontent.com/kubernetes/dashboard/.../recommended.yaml
kubectl proxy
# Then browse to:
# http://localhost:8001/api/v1/namespaces/kubernetes-dashboard/services/https:kubernetes-dashboard:/proxy/
# Create a token for a ServiceAccount to log in (scoped by RBAC):
kubectl -n kubernetes-dashboard create token admin-user
Exercises
- (Beginner) Is the Kubernetes Dashboard installed by default in a standard cluster?
- (Beginner) What does the Dashboard use under the hood to read and modify cluster objects?
- (Intermediate) Describe a safe way to access the Dashboard without exposing it publicly.
- (Interview) Why is granting the Dashboard's ServiceAccount cluster-admin and exposing it via a public LoadBalancer dangerous? (Hint: think about the blast radius if the endpoint is reached by an attacker.)
Answers
- No. The Dashboard is an optional add-on that must be installed explicitly.
- It uses the kube-apiserver (the standard Kubernetes API), acting on behalf of the logged-in identity and constrained by that identity's RBAC permissions.
- Install it, then access it through kubectl proxy (which tunnels to the API server over your authenticated kubeconfig) on localhost, or place it behind an authenticated Ingress with TLS and SSO. Log in with a token tied to a least-privilege ServiceAccount. Do not expose it via a public LoadBalancer.
- If the Dashboard's ServiceAccount has cluster-admin and the UI is publicly reachable, anyone who reaches it (or bypasses weak auth) can perform any action in the cluster: read all Secrets, create privileged Pods, exfiltrate data, or take over nodes. The blast radius is the entire cluster. Least privilege (scoped RBAC) plus no public exposure limits damage even if the endpoint is reached.
Metrics Server
Theory
To answer "how much CPU and memory is this Pod using right now?" — and to drive the Horizontal Pod Autoscaler — the cluster needs live resource metrics. The Metrics Server is a lightweight add-on that collects CPU/memory usage from each node's kubelet (via the Summary API), aggregates it in memory, and exposes it through the Metrics API (metrics.k8s.io).
Crucially, Metrics Server is for real-time, short-lived metrics used by kubectl top and autoscalers — it is not a monitoring system. It keeps only the latest readings in memory (no historical storage), so for dashboards, trends, and alerting you use Prometheus instead. Without Metrics Server installed, kubectl top and CPU/memory-based HPA will not work.
Example
# Metrics Server powers these commands:
kubectl top nodes
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# node-1 210m 10% 1200Mi 31%
kubectl top pods -n default
# NAME CPU(cores) MEMORY(bytes)
# web 5m 48Mi
# An HPA consumes the Metrics API to scale on CPU:
kubectl autoscale deployment web --cpu-percent=70 --min=2 --max=10
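How the HPA turns a utilization reading into a replica count can be sketched in Python. The formula below (desired = ceil(current replicas x current metric / target metric), skipped when the ratio is within a 10% tolerance) follows the algorithm described in the upstream HPA documentation rather than anything stated in this chapter, and the sketch ignores stabilization windows, unready Pods, and missing metrics. The function name is hypothetical.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_utilization: float,   # average CPU utilization across Pods, in percent
    target_utilization: float,    # e.g. 70 for --cpu-percent=70
    min_replicas: int,
    max_replicas: int,
    tolerance: float = 0.1,       # assumed default tolerance
) -> int:
    ratio = current_utilization / target_utilization
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas   # close enough to target: do nothing
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


# Matches the command above: target 70%, between 2 and 10 replicas.
print(desired_replicas(2, 140, 70, 2, 10))  # 4
print(desired_replicas(4, 72, 70, 2, 10))   # 4
print(desired_replicas(8, 140, 70, 2, 10))  # 10
```

The first call doubles the replicas because utilization is twice the target, the second stays put because 72% is within tolerance of 70%, and the third is capped by --max=10.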
Exercises
- (Beginner) What two commands stop working if the Metrics Server is not installed?
- (Beginner) Where does the Metrics Server get its data from?
- (Intermediate) Why is the Metrics Server unsuitable as your monitoring/alerting solution, and what would you use instead?
- (Interview) The Horizontal Pod Autoscaler depends on the Metrics API. Trace the path from a CPU spike in a Pod to the HPA adding a replica. (Hint: kubelet -> Metrics Server -> Metrics API -> HPA controller.)
Answers
- kubectl top nodes and kubectl top pods (and CPU/memory-based HPA scaling) stop working.
- It scrapes resource usage from each node's kubelet (the kubelet's Summary API, sourced from cAdvisor), then aggregates it.
- Metrics Server holds only the most recent readings in memory with no historical retention, so it cannot provide trends, long-term dashboards, or alerting. Use Prometheus (with kube-state-metrics, Grafana, and Alertmanager) for monitoring, history, and alerts.
- The Pod's container CPU usage rises; the node's kubelet/cAdvisor records it. Metrics Server scrapes the kubelet and exposes the aggregated value via the Metrics API (metrics.k8s.io). The HPA controller (in the controller-manager) periodically queries the Metrics API, computes the current CPU utilization against the target (e.g., 70%), calculates the desired replica count, and updates the Deployment's replicas. The Deployment/ReplicaSet controller then creates the new Pod, which the scheduler places on a node.
Container Network Interface (CNI) plugins
Theory
Kubernetes defines that every Pod must get its own IP and be able to reach every other Pod, but it deliberately does not implement the networking itself. That job is delegated to a CNI (Container Network Interface) plugin. CNI is a CNCF standard: when the kubelet creates a Pod, the container runtime (acting on the kubelet's CRI request) calls the configured CNI plugin to set up the Pod's network namespace, assign an IP, and wire up routes.
This is the same "stable interface, swappable implementation" pattern as CRI and CSI. Different CNI plugins make different trade-offs:
- Flannel: simple overlay networking, easy to set up, minimal features.
- Calico: high performance, supports NetworkPolicy enforcement, can run without an overlay (BGP routing).
- Cilium: eBPF-based, advanced observability, network policy, and service mesh features.
Without a CNI plugin installed, Pods stay stuck in ContainerCreating / Pending because their network cannot be set up — a very common "fresh kubeadm cluster" gotcha.
Example
# A brand-new kubeadm cluster has nodes NotReady until a CNI is installed:
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# node-1 NotReady control-plane 2m v1.30.2 <- no CNI yet
# Install a CNI (Calico example), then nodes go Ready:
kubectl apply -f https://raw.githubusercontent.com/projectcalico/calico/.../calico.yaml
kubectl get nodes
# node-1 Ready control-plane 5m v1.30.2
CNI config lives on each node, e.g. /etc/cni/net.d/10-calico.conflist
kubelet --(CRI)--> container runtime --(CNI ADD)--> plugin --> assigns Pod IP, sets up veth + routes
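The "assigns Pod IP" step can be pictured with a toy address allocator in Python. This is a rough sketch, not how any particular plugin works: PodIpAllocator is a hypothetical class, the per-node Pod CIDR 10.244.1.0/24 is an example value, and reserving the first address as the gateway is an assumption modeled on simple bridge setups. Real plugins also create the veth pair and routes, which are omitted here.

```python
import ipaddress


class PodIpAllocator:
    """Hands out Pod IPs from one node's Pod CIDR (a toy host-local IPAM)."""

    def __init__(self, pod_cidr: str) -> None:
        hosts = iter(ipaddress.ip_network(pod_cidr).hosts())
        self.gateway = next(hosts)   # assumption: first address is the gateway
        self._free = hosts
        self.assigned: dict[str, str] = {}

    def add(self, pod: str) -> str:
        """Roughly the IP part of a CNI ADD; idempotent per Pod."""
        if pod not in self.assigned:
            self.assigned[pod] = str(next(self._free))
        return self.assigned[pod]

    def delete(self, pod: str) -> None:
        """CNI DEL: forget the Pod (this sketch does not recycle addresses)."""
        self.assigned.pop(pod, None)


ipam = PodIpAllocator("10.244.1.0/24")   # example per-node Pod CIDR
print(ipam.add("web"))   # 10.244.1.2
print(ipam.add("db"))    # 10.244.1.3
```

Giving each node its own CIDR is what lets every Pod receive a cluster-unique IP without the nodes coordinating on individual addresses.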
Exercises
- (Beginner) Why doesn't Kubernetes ship with its own built-in Pod networking implementation?
- (Beginner) What symptom do you see on a fresh cluster that has no CNI plugin installed?
- (Intermediate) Compare Flannel and Calico in terms of features, particularly around NetworkPolicy.
- (Interview) The CNI follows the same design philosophy as CRI and CSI. State that philosophy and explain why offloading networking to plugins is strategically smart for the Kubernetes project. (Hint: ecosystem, neutrality, innovation.)
Answers
- Kubernetes specifies the networking model (every Pod gets an IP; all Pods can reach all Pods) but leaves the implementation to pluggable CNI plugins, so it stays neutral and lets specialized projects provide networking suited to different environments.
- Nodes stay NotReady and Pods stay Pending/ContainerCreating because the kubelet cannot set up Pod networking without a CNI plugin.
- Flannel provides simple overlay networking (e.g., VXLAN) and is easy to deploy, but historically does not enforce NetworkPolicy on its own. Calico is higher-performance, can route without an overlay (BGP), and natively enforces Kubernetes NetworkPolicy (plus richer Calico-specific policies). For network segmentation/policy, Calico (or Cilium) is the choice.
- The philosophy is a standard interface with swappable implementations: Kubernetes depends only on the CNI contract, and vendors supply implementations behind it. This keeps the core small and vendor-neutral, fosters a competitive ecosystem (Calico, Cilium, Flannel, Weave) that can innovate independently (e.g., eBPF in Cilium), and lets operators pick the networking stack that fits their performance, security, and topology needs without forking Kubernetes.
3. Setting Up Kubernetes
Before you can deploy anything, you need a cluster — and where that cluster runs depends entirely on your goal. This chapter covers the full spectrum: lightweight local clusters for development, production-grade provisioning on bare metal and managed clouds, and the kubectl command-line tool you will use to drive any of them. Choosing the right setup for the right context saves enormous time and frustration.
3.1 Local Development Clusters
You do not need a fleet of servers to learn or develop on Kubernetes. Several tools spin up a real, functional cluster on a single laptop. This subchapter compares the main options.
Minikube setup and usage
Theory
Minikube is the original and most beginner-friendly tool for running a local single-node (optionally multi-node) Kubernetes cluster. Think of it as a "Kubernetes in a box" — it provisions a small cluster inside a VM or container on your machine, complete with a working control plane and a node, so you can practice realistic workflows without any cloud account.
Minikube's defining strength is its addons system and driver flexibility. It can run on top of several "drivers" (Docker, VirtualBox, Hyper-V, KVM, QEMU) and bundles one-command addons for things newcomers usually struggle to set up: the dashboard, an Ingress controller, the metrics-server, and a LoadBalancer simulator (minikube tunnel). This makes it ideal for learning and demos.
Example
minikube start --driver=docker --nodes=2 # start a 2-node cluster on Docker
minikube status # show control plane / kubelet health
kubectl get nodes # minikube auto-configures kubectl
minikube addons enable ingress # one command to get an Ingress controller
minikube addons enable metrics-server # enable kubectl top
minikube dashboard # open the web UI in a browser
minikube service web --url # get a reachable URL for a Service
minikube delete # tear the cluster down
Exercises
- (Beginner) What does Minikube create on your machine, and what is its main intended audience?
- (Beginner) Name two Minikube addons and what each provides.
- (Intermediate) You created a type: LoadBalancer Service in Minikube and its EXTERNAL-IP is <pending>. Which Minikube command makes it reachable, and why is it needed?
- (Interview) Minikube supports multiple "drivers" (docker, virtualbox, hyperv, kvm). Why does abstracting the driver matter for a cross-platform local tool? (Hint: think about what differs between macOS, Windows, and Linux hosts.)
Answers
- Minikube creates a small, local, single- or multi-node Kubernetes cluster (inside a VM or container) on your own machine. Its main audience is learners and developers who want a real cluster locally without a cloud account.
- Any two: ingress (NGINX Ingress controller), metrics-server (enables kubectl top and HPA), dashboard (web UI), registry (local image registry).
- Run minikube tunnel (in a separate terminal). Minikube has no cloud to provision a real load balancer, so the tunnel creates a network route on the host that assigns the LoadBalancer Service a reachable IP. Alternatively, minikube service <name> --url exposes it directly.
- Different host operating systems provide different virtualization/container backends (Hyper-V on Windows, HyperKit/QEMU on macOS, KVM on Linux, or Docker everywhere). Abstracting the driver lets the same minikube start workflow run on any OS by selecting an appropriate backend, so users get a consistent experience regardless of platform.
kind (Kubernetes in Docker)
Theory
kind ("Kubernetes IN Docker") runs each Kubernetes node as a Docker container rather than a VM. Each "node" is a container running its own kubelet and container runtime, and the cluster is bootstrapped with kubeadm inside those containers. Because containers start in seconds and consume little memory, kind clusters are extremely fast to create and destroy.
This speed and reproducibility are why kind is the de facto standard for CI pipelines and testing Kubernetes itself. You can define a multi-node cluster in a small YAML file, spin it up in a GitHub Actions job, run your tests, and tear it down — all in under a minute. It is less focused on a polished local-dev UX (no built-in dashboard/tunnel like Minikube) and more on disposable, scriptable clusters.
Example
# kind-config.yaml: a 1 control-plane + 2 worker cluster
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
kind create cluster --name dev --config kind-config.yaml
kubectl get nodes # three "nodes" that are actually containers
docker ps # you can see them as Docker containers
kind load docker-image myapp:latest --name dev # push a local image into the cluster
kind delete cluster --name dev
Exercises
- (Beginner) How does kind represent a Kubernetes node?
- (Beginner) What is kind's most common use case?
- (Intermediate) Your locally built image myapp:latest shows ErrImagePull in a kind cluster even though it exists on your machine. Why, and how do you fix it?
- (Interview) Why is kind preferred over Minikube for continuous integration, despite Minikube being more feature-rich for local development? (Hint: startup time, footprint, scriptability, nested virtualization in CI.)
Answers
- kind runs each node as a Docker container (running kubelet + a runtime inside), bootstrapped with kubeadm.
- Disposable clusters for CI/CD pipelines and testing (including testing Kubernetes itself), where fast create/teardown and reproducibility matter most.
- The image lives in your host's Docker daemon, not inside the kind node containers' runtime, so the cluster cannot find it. Load it in with kind load docker-image myapp:latest --name dev (or push to a registry the cluster can reach). Because the :latest tag makes imagePullPolicy default to Always, also set imagePullPolicy: IfNotPresent (or use a specific tag) so the kubelet uses the loaded image instead of trying to pull it.
- CI runners often cannot do nested hardware virtualization (so VM-based tools are awkward), while Docker is universally available. kind nodes are containers that start in seconds with a small memory footprint, are fully defined in YAML (scriptable/reproducible), and tear down cleanly, which is ideal for ephemeral test jobs. Minikube's addons and tunnel are valuable for interactive local dev but add weight and setup that CI does not need.
k3s and k3d for lightweight clusters
Theory
k3s (by Rancher/SUSE) is a fully certified Kubernetes distribution stripped down to a single binary under ~100 MB, designed for resource-constrained environments: edge devices, IoT, CI, and ARM boards like a Raspberry Pi. It achieves this by removing legacy/optional features, replacing etcd with an embedded SQLite datastore by default (etcd is still an option for HA), and bundling components into one process. Despite the trimming, it passes the CNCF conformance tests — it is real Kubernetes.
k3d is a wrapper that runs k3s inside Docker (analogous to how kind runs upstream Kubernetes in Docker). It gives you the fast, disposable, multi-node local-cluster experience of kind, but built on the lightweight k3s, so it starts even faster and uses less RAM.
Example
# k3s: install directly on a Linux host/edge device (one command)
curl -sfL https://get.k3s.io | sh -
sudo k3s kubectl get nodes # k3s ships its own kubectl
# k3d: k3s-in-Docker for local multi-node clusters
k3d cluster create dev --servers 1 --agents 2
kubectl get nodes
k3d cluster delete dev
Why k3s is small:
single binary | SQLite default (no external etcd) | legacy/cloud
<100 MB       | lower memory footprint            | features removed
Exercises
- (Beginner) What makes k3s "lightweight" compared to a standard Kubernetes install?
- (Beginner) What is the relationship between k3s and k3d?
- (Intermediate) k3s uses SQLite by default instead of etcd. What is the consequence for high availability, and how do you get HA with k3s?
- (Interview) k3s removes features yet remains CNCF-conformant. What does "conformance" guarantee, and why does it matter that an edge distribution still passes it? (Hint: portability of workloads and APIs.)
Answers
- k3s ships as a single small binary (<100 MB), bundles components into one process, uses an embedded SQLite datastore by default instead of external etcd, and drops legacy/optional in-tree features — drastically reducing footprint and operational overhead.
- k3d runs k3s inside Docker containers, providing a fast, disposable, multi-node local cluster experience built on the lightweight k3s distribution.
- SQLite is a single-file, single-node datastore, so the default k3s setup is not highly available (one control-plane datastore). For HA, configure k3s with an embedded etcd datastore (or an external SQL datastore) across multiple server nodes so control-plane state is replicated.
- CNCF conformance certifies that a distribution implements the standard Kubernetes APIs and behaviors. This guarantees that workloads and manifests written for conformant Kubernetes run the same way on k3s. For an edge distribution it matters because teams can develop against standard Kubernetes and deploy to constrained edge clusters without rewriting manifests or relying on non-portable behavior.
Docker Desktop Kubernetes
Theory
Docker Desktop (on macOS and Windows) includes a built-in, single-node Kubernetes cluster you can enable with a single checkbox in its settings. For developers who already use Docker Desktop, this is the lowest-friction option: no extra tools to install, and kubectl is configured automatically with a docker-desktop context.
The trade-offs: it is single-node only (no multi-node topology testing), tied to Docker Desktop's lifecycle and resource limits, and Docker Desktop requires a paid license for larger organizations. It is excellent for quick local iteration when you are already in the Docker ecosystem, but less suited to testing multi-node behavior (scheduling spread, DaemonSets across nodes) than kind or Minikube.
Example
Docker Desktop -> Settings -> Kubernetes -> [x] Enable Kubernetes -> Apply
kubectl config get-contexts # docker-desktop appears automatically
kubectl config use-context docker-desktop
kubectl get nodes
# NAME STATUS ROLES AGE VERSION
# docker-desktop Ready control-plane 1m v1.30.2
Exercises
- (Beginner) How do you enable Kubernetes in Docker Desktop?
- (Beginner) How many nodes does Docker Desktop's Kubernetes provide?
- (Intermediate) You need to verify that a DaemonSet schedules one Pod per node across three nodes. Is Docker Desktop a suitable tool? If not, what would you use?
- (Interview) For a developer already using Docker Desktop daily, what are the pros and cons of using its built-in Kubernetes versus installing kind? (Hint: friction vs. flexibility and multi-node testing.)
Answers
- Open Docker Desktop's Settings, go to the Kubernetes section, check "Enable Kubernetes," and apply — Docker Desktop provisions a single-node cluster and configures kubectl.
- A single node (it is single-node only).
- No. Docker Desktop is single-node, so it cannot demonstrate per-node DaemonSet spread. Use a multi-node tool like kind (nodes: with multiple workers), Minikube (--nodes=3), or k3d (--agents).
- Pros of Docker Desktop K8s: zero extra installation, one-click enable, and an automatic kubectl context, which is the lowest friction for someone already in Docker Desktop. Cons: single-node only (cannot test multi-node scheduling, spread, or DaemonSets realistically), bound to Docker Desktop's resource settings and release cadence, and licensing costs for larger orgs. kind adds a small install step but gives scriptable, multi-node, disposable clusters better suited to realistic and CI testing.
3.2 Production Cluster Provisioning
Local tools are for development; production demands different tooling that addresses high availability, security, upgrades, and scale. This subchapter covers the main paths to a production cluster.
kubeadm cluster bootstrap
Theory
kubeadm is the official, low-level tool for bootstrapping a production-grade Kubernetes cluster on machines you control. It does not provision infrastructure (you bring your own Linux hosts), but it handles the hard, error-prone parts of standing up a conformant cluster: generating certificates, writing static Pod manifests for the control plane, configuring the kubelet, and producing join tokens for worker nodes.
The mental model is two commands: kubeadm init on the first control-plane node sets up the control plane and prints a kubeadm join command; you run that join command on each worker (and additional control-plane) node. kubeadm deliberately leaves some choices to you — most importantly, you must install a CNI plugin yourself afterward, and you manage the underlying OS, load balancer, and etcd topology. It is the foundation that many higher-level tools (like kubespray) build upon.
Example
# On the first control-plane node:
sudo kubeadm init --pod-network-cidr=10.244.0.0/16
# kubeadm prints a "kubeadm join ..." command — save it.
# Configure kubectl for your user:
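mkdir -p $HOME/.kube && sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config # the copy is root-owned; make it readable by your user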
# Install a CNI (required — nodes stay NotReady without it):
kubectl apply -f https://.../calico.yaml
# On each worker node, run the printed join command:
sudo kubeadm join 10.0.0.1:6443 --token abcdef.0123456789 \
--discovery-token-ca-cert-hash sha256:<hash>
Exercises
- (Beginner) What does kubeadm do, and what does it deliberately not do?
- (Beginner) After kubeadm init, why do nodes remain NotReady until you take one more step?
- (Intermediate) Describe the two-phase workflow of using kubeadm to create a cluster with one control-plane and two worker nodes.
- (Interview) kubeadm leaves CNI, OS management, and load balancer setup to the operator. Why is this "batteries not included" philosophy appropriate for a production bootstrap tool? (Hint: flexibility, environment-specific choices, separation of concerns.)
Answers
- kubeadm bootstraps a conformant cluster: it generates certificates, brings up the control-plane components (as static Pods), configures the kubelet, and creates join tokens. It does not provision machines/infrastructure, install a CNI plugin, manage the host OS, or set up an external load balancer — those are left to the operator.
- The control plane is up, but Pod networking is not: kubeadm does not install a CNI plugin. Until you apply one (e.g., Calico/Flannel), the kubelet cannot set up Pod networking, so nodes report NotReady.
- Phase 1: on the first control-plane node, run kubeadm init (sets up control plane, outputs a join command); configure kubectl; install a CNI. Phase 2: on each worker, run the kubeadm join command with the token and CA cert hash to register the node with the control plane. Repeat join for each worker.
- Production environments differ widely in networking, OS, storage, and HA topology. By not hard-coding these, kubeadm stays flexible and composable: operators choose the CNI that fits their performance/policy needs, manage OS patching with their own tooling, and integrate their own load balancer. This separation of concerns keeps kubeadm focused on correctly bootstrapping the cluster while letting environment-specific decisions be made by those who know the environment.
Managed Kubernetes: EKS, GKE, AKS
Theory
Running your own control plane (etcd backups, API server HA, upgrades, certificate rotation) is significant ongoing work. Managed Kubernetes services — Amazon EKS, Google GKE, Azure AKS — run and operate the control plane for you. You consume the Kubernetes API; the cloud provider handles control-plane availability, patching, and (often) etcd. You typically pay for the worker nodes plus a small per-cluster control-plane fee.
The key division of responsibility: the provider owns the control plane; you own your workloads, node configuration (unless using fully serverless node options like GKE Autopilot or EKS Fargate), and your application security. Managed offerings integrate tightly with their cloud's IAM, load balancers, and storage (via the cloud-controller-manager and CSI drivers). This is the default choice for most organizations because it removes the hardest operational burden.
Example
# GKE
gcloud container clusters create demo --num-nodes=3 --region=us-central1 # regional cluster: --num-nodes applies per zone
# EKS (eksctl simplifies the many underlying resources)
eksctl create cluster --name demo --nodes 3 --region us-east-1
# AKS
az aks create --resource-group rg --name demo --node-count 3 --generate-ssh-keys
# All three: fetch credentials so kubectl points at the managed cluster
aws eks update-kubeconfig --name demo --region us-east-1
gcloud container clusters get-credentials demo --region us-central1
az aks get-credentials --resource-group rg --name demo
Exercises
- (Beginner) In a managed Kubernetes service, who is responsible for operating the control plane?
- (Beginner) Name the managed Kubernetes offering of AWS, Google Cloud, and Azure.
- (Intermediate) Describe the shared responsibility split between you and the cloud provider when using EKS/GKE/AKS.
- (Interview) When would a company still choose to self-manage Kubernetes (kubeadm/kubespray) instead of a managed service? (Hint: on-prem/air-gapped, regulatory, cost at scale, control-plane customization.)
Answers
- The cloud provider operates the control plane (API server, scheduler, controller-manager, and usually etcd) — including its availability, patching, and upgrades.
- AWS -> EKS (Elastic Kubernetes Service); Google Cloud -> GKE (Google Kubernetes Engine); Azure -> AKS (Azure Kubernetes Service).
- The provider manages the control plane and its uptime/upgrades. You manage your workloads (Deployments, Services), application security and RBAC, and (unless using serverless node modes) worker node pools, their OS/upgrades, scaling, and resource configuration. Networking, storage, and load balancing integrate via the provider's CCM/CSI but you configure how you use them.
- Reasons to self-manage: on-premises or air-gapped environments with no cloud option; strict regulatory/data-sovereignty requirements; the need to customize the control plane (custom admission, schedulers, API flags) beyond what managed offerings allow; potential cost savings at very large scale or with existing hardware; and avoiding cloud lock-in. The trade-off is taking on the full operational burden of running the control plane.
Bare-metal provisioning with kubespray
Theory
When you have a fleet of bare-metal or VM hosts and want a production, highly-available cluster without a cloud's managed service, kubespray automates the job. It is a collection of Ansible playbooks that install and configure Kubernetes (using kubeadm under the hood) across many nodes — handling the control-plane HA, etcd cluster, CNI installation, and add-ons that you would otherwise wire up by hand.
kubespray's value is repeatability and scale: you declare your inventory (which hosts are control-plane, which are workers, etcd topology) and configuration in files, then run a playbook to provision the whole cluster consistently. It supports many CNIs and OSes, and can also perform upgrades and scaling operations. The cost is the complexity of operating Ansible and understanding the many configuration knobs.
Example
# inventory/mycluster/hosts.ini — declare node roles
[kube_control_plane]
node-1
node-2
node-3
[etcd]
node-1
node-2
node-3
[kube_node]
node-4
node-5
# Run the cluster-provisioning playbook against the inventory:
ansible-playbook -i inventory/mycluster/hosts.ini cluster.yml -b
# Scale later by editing the inventory and running scale.yml / upgrade-cluster.yml
Exercises
- (Beginner) What automation technology does kubespray use to provision clusters?
- (Beginner) What does kubespray use under the hood to actually bootstrap each cluster?
- (Intermediate) What is the main advantage of declaring your cluster topology in an inventory file rather than running kubeadm manually on each host?
- (Interview) A team can already use kubeadm. What does kubespray add on top, and when is that extra layer worth the added complexity? (Hint: many nodes, HA etcd, repeatable upgrades, fleet consistency.)
Answers
- Ansible (kubespray is a set of Ansible playbooks).
- kubeadm — kubespray orchestrates kubeadm-based bootstrap across all the nodes.
- Repeatability and consistency at scale: the inventory + config is declarative and version-controllable, so the entire multi-node, HA cluster is provisioned identically every time. Manual kubeadm across many hosts is error-prone, hard to reproduce, and tedious to keep consistent.
- kubeadm bootstraps one cluster but leaves orchestration across many nodes, HA control-plane and etcd setup, CNI/add-on installation, and upgrades to you. kubespray automates all of that across a fleet via Ansible, with repeatable provisioning, scaling, and upgrade playbooks. The extra complexity is worth it when you manage many bare-metal/VM nodes, need HA, and want consistent, automatable lifecycle operations — versus a small or throwaway cluster where plain kubeadm suffices.
High availability control plane setup
Theory
A single control-plane node is a single point of failure: lose it and you lose the ability to schedule, heal, and manage the cluster (existing Pods keep running, but nothing can change). A highly available (HA) control plane runs multiple control-plane nodes so the cluster keeps operating through a node failure.
Two things must be made redundant:
- The API servers: run 3+ stateless API server replicas behind a load balancer that distributes client traffic and removes failed instances. Because they are stateless, this is straightforward active-active.
- etcd: run a quorum-based etcd cluster (3 or 5 members). There are two topologies — stacked (etcd runs on the same nodes as the control plane) and external (etcd on dedicated machines). Stacked is simpler and cheaper; external isolates etcd failures from control-plane failures and is preferred for larger/critical clusters.
The other control-plane components (scheduler, controller-manager) also run as replicas but use leader election so only one is active at a time.
Example
              +------------------+
clients ----> |  Load Balancer   |   (e.g., HAProxy, cloud LB on :6443)
              +--------+---------+
         +-------------+-------------+
         v             v             v
    apiserver-1   apiserver-2   apiserver-3   (active-active, stateless)
         |             |             |
      etcd-1 <----- etcd-2 -----> etcd-3      (Raft quorum = 2 of 3)
scheduler/cm replicas everywhere, but only ONE leader each (leader election)
# kubeadm HA: init the first node with a shared control-plane endpoint (the LB):
sudo kubeadm init --control-plane-endpoint "lb.example.com:6443" --upload-certs
# Then join additional control-plane nodes with the printed "--control-plane" join cmd.
Exercises
- (Beginner) What still works and what stops working if your only control-plane node fails?
- (Beginner) Why can API server replicas run active-active while scheduler/controller-manager replicas use leader election?
- (Intermediate) Compare stacked vs. external etcd topology for an HA control plane.
- (Interview) For an HA control plane you choose 3 etcd members rather than 2 or 4. Justify the number using quorum math and fault tolerance. (Hint: floor(N/2)+1.)
Answers
- Existing Pods/containers keep running (kubelets operate from cached state), but the cluster cannot make changes: no new scheduling, no self-healing of failed Pods, no API access for kubectl, and no controller actions, because the control plane is down.
- API servers are stateless request handlers backed by shared etcd, so any replica can serve any request simultaneously without conflict. The scheduler and controller-manager take actions (binding Pods, creating/deleting resources); multiple active instances would issue conflicting/duplicate actions, so leader election ensures exactly one acts at a time.
- Stacked: etcd members run on the same machines as the control-plane components — fewer machines, simpler, cheaper, but a node failure removes both an API server and an etcd member, and they compete for resources. External: etcd runs on dedicated hosts — more machines and complexity, but isolates etcd from control-plane failures/resource contention and is recommended for large or critical clusters.
- Quorum is floor(N/2)+1. With 3 members, quorum is 2, tolerating 1 failure. With 2 members, quorum is 2, tolerating 0 failures (worse than a single node for write availability, because either member failing blocks writes). With 4 members, quorum is 3, still tolerating only 1 failure: the same tolerance as 3 but with more cost and more cross-member coordination. So 3 gives the best fault tolerance per node for a small HA cluster; 5 (quorum 3, tolerates 2) is used when more resilience is needed.
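The quorum arithmetic in the last answer can be checked with a small shell sketch. The function names quorum and tolerance are illustrative helpers, not part of etcd or kubeadm; they rely only on integer division to get floor(N/2).

```shell
# quorum N: members required for a majority, floor(N/2) + 1
quorum() { echo $(( $1 / 2 + 1 )); }

# tolerance N: members that can fail while a majority still survives
tolerance() { echo $(( $1 - ($1 / 2 + 1) )); }

# Print the trade-off for common cluster sizes.
for n in 1 2 3 4 5; do
  echo "members=$n quorum=$(quorum "$n") tolerates=$(tolerance "$n")"
done
```

Even member counts add cost without adding tolerance, which is why odd sizes such as 3 and 5 are the usual choices.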
3.3 kubectl CLI
kubectl is the primary way humans and scripts interact with a cluster. Mastering its configuration and core commands pays off every single day. This subchapter covers installation, contexts, everyday commands, and extending it with plugins.
Installing and configuring kubectl
Theory
kubectl is the official command-line client for the Kubernetes API. It translates human-friendly commands into the REST calls the API server understands. It is a standalone binary you install on your workstation (or CI runner) — separate from any cluster — and it can talk to any cluster you have credentials for.
A key compatibility rule: kubectl supports a version skew of ±1 minor version relative to the cluster's API server. So a v1.30 kubectl works with v1.29, v1.30, and v1.31 clusters. Using a kubectl far newer or older than the cluster can cause subtle failures, so it is good practice to match (or stay within one minor of) your cluster version. After installing, kubectl version and kubectl cluster-info confirm connectivity.
Example
# Install on Linux (this fetches the latest stable release; replace the $(...) part with a version such as v1.30.0 to match your cluster):
curl -LO "https://dl.k8s.io/release/$(curl -L -s https://dl.k8s.io/release/stable.txt)/bin/linux/amd64/kubectl"
sudo install -o root -g root -m 0755 kubectl /usr/local/bin/kubectl
kubectl version --client # show kubectl's own version
kubectl cluster-info # verify it can reach the cluster's control plane
# Enable shell autocompletion (huge productivity boost):
source <(kubectl completion bash)
alias k=kubectl # common convenience alias
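The ±1 minor skew rule is easy to check mechanically. The following is a minimal sketch: skew_ok is a hypothetical helper, it assumes both versions share the same major number and look like v1.30.2, and it compares only the minor component.

```shell
# skew_ok CLIENT SERVER: succeeds when the minor versions differ by at most 1.
skew_ok() {
  client_minor=$(echo "$1" | cut -d. -f2)   # "v1.30.2" -> "30"
  server_minor=$(echo "$2" | cut -d. -f2)
  diff=$(( client_minor - server_minor ))
  [ "${diff#-}" -le 1 ]                     # strip a leading "-" to get the absolute value
}

skew_ok v1.30.0 v1.31.2 && echo "supported"
skew_ok v1.28.0 v1.30.0 || echo "outside supported skew"
```

In practice the two version strings would come from kubectl version; they are passed in as arguments here so the check itself stays independent of any cluster.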
Exercises
- (Beginner) What is kubectl, and is it part of the cluster or installed separately?
- (Beginner) Which command verifies that kubectl can reach the cluster?
- (Intermediate) Your cluster runs v1.30. Within what range of kubectl minor versions are you officially supported, and why does this matter?
- (Interview) Why does kubectl enforce a version skew policy with the API server, and what kinds of problems can arise from a large skew? (Hint: API fields/resources added or removed between versions.)
Answers
- kubectl is the official CLI client for the Kubernetes API. It is installed separately on your workstation/CI, not part of the cluster, and can connect to any cluster you have credentials for.
- kubectl cluster-info (and kubectl version / kubectl get nodes) confirms connectivity to the control plane.
- Supported skew is ±1 minor version, so for a v1.30 cluster, kubectl v1.29 through v1.31 are supported. It matters because client and server must agree on available API resources and fields; staying within one minor avoids incompatibilities.
- The Kubernetes API evolves between minor versions — resources and fields are added, promoted, deprecated, or removed. A kubectl too new may send fields/resources the server doesn't understand; a kubectl too old may not know about resources or may serialize objects in ways the server handles incorrectly. The ±1 skew policy bounds these differences to changes both sides can tolerate, preventing failed or silently incorrect operations.
kubeconfig and context management
Theory
How does kubectl know which cluster to talk to and as whom? Through a kubeconfig file (default ~/.kube/config). It is a YAML file with three lists that get tied together by contexts:
- clusters: API server URLs and their CA certificates.
- users: credentials (client certs, tokens, exec plugins).
- contexts: a named combination of (cluster + user + default namespace).
The current-context determines what every kubectl command targets by default. This design lets one engineer manage many clusters (dev, staging, prod across multiple clouds) from one config, switching with a single command. You can merge multiple kubeconfig files via the KUBECONFIG environment variable (a colon-separated list on Linux and macOS, semicolon-separated on Windows).
Example
kubectl config get-contexts # list all contexts; * marks the current one
kubectl config current-context # show the active context
kubectl config use-context prod # switch the default target to "prod"
kubectl config set-context --current --namespace=team-a # change default namespace
# Merge two kubeconfig files for this shell session:
export KUBECONFIG=~/.kube/config:~/work/eks-config
kubectl config view --flatten > ~/.kube/merged && mv ~/.kube/merged ~/.kube/config
# Simplified kubeconfig structure:
contexts:
- name: prod
  context:
    cluster: prod-cluster   # which clusters[] entry
    user: alice             # which users[] entry
    namespace: payments     # default namespace for this context
current-context: prod
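When there are more than two kubeconfig files, the colon-separated KUBECONFIG value can be built rather than typed by hand. This is a sketch under assumptions: kubeconfig_path is a hypothetical helper, and the layout of one .yaml file per cluster in a single directory is chosen only for illustration.

```shell
# kubeconfig_path DIR: print every *.yaml file in DIR joined with ":".
kubeconfig_path() {
  dir=$1
  result=""
  for f in "$dir"/*.yaml; do
    [ -f "$f" ] || continue            # skip when the glob matches nothing
    result="${result:+$result:}$f"     # add ":" only between entries
  done
  printf '%s\n' "$result"
}

# Example use (directory name is an assumption):
# export KUBECONFIG="$(kubeconfig_path "$HOME/.kube/configs")"
```

kubectl then sees the contexts from all listed files, and kubectl config get-contexts shows the merged view.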
Exercises
- (Beginner) What three things does a kubeconfig "context" tie together?
- (Beginner) Which command switches the active cluster you are operating on?
- (Intermediate) You have separate kubeconfig files for a GKE and an EKS cluster. How do you use both from a single kubectl session?
- (Interview) An engineer accidentally ran a destructive command against prod instead of dev. What kubeconfig/context practices reduce the chance of this happening? (Hint: context naming, namespace defaults, prompt indicators, separate credentials.)
Answers
- A context ties together a cluster (API server + CA), a user (credentials), and a default namespace.
- kubectl config use-context <name> switches the current context (the default target cluster/user/namespace).
- Point KUBECONFIG at both files (colon-separated): export KUBECONFIG=~/.kube/gke:~/.kube/eks. kubectl merges them, exposing all contexts; switch between clusters with kubectl config use-context. Optionally flatten/merge them into one file with kubectl config view --flatten.
- Use clear, unambiguous context names (e.g., prod-payments vs dev-payments); set per-context default namespaces; display the current context/namespace in your shell prompt (tools like kube-ps1, kubectx/kubens); keep prod credentials separate and require an explicit switch; consider read-only or restricted RBAC for day-to-day prod access; and add confirmation steps or pass --context explicitly on destructive commands so the target is never implicit.
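One way to implement the "confirmation step" idea is a small guard that a shell wrapper consults before running a command. This is only a sketch: the helper names, the list of verbs treated as destructive, and the convention that production contexts start with "prod" are all assumptions for illustration.

```shell
# is_destructive VERB: succeeds for verbs this sketch treats as destructive.
is_destructive() {
  case "$1" in
    delete|drain|replace) return 0 ;;
    *) return 1 ;;
  esac
}

# needs_confirm CONTEXT VERB: succeeds when a destructive verb targets a
# context whose name starts with "prod" (assumed naming convention).
needs_confirm() {
  case "$1" in
    prod*) is_destructive "$2" ;;
    *) return 1 ;;
  esac
}

needs_confirm prod-payments delete && echo "ask before running"
needs_confirm dev-payments delete || echo "no prompt needed"
```

A wrapper function could pass the output of kubectl config current-context as the first argument and prompt the user whenever needs_confirm succeeds.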
Basic kubectl commands
Theory
A small set of kubectl verbs covers the vast majority of daily work. They fall into a few mental categories:
- Viewing: get (list), describe (detailed human-readable view + events), logs (container output), top (resource usage).
- Creating/changing: apply (declarative: apply a manifest), create, edit, scale, set image, delete.
- Interacting: exec (run a command in a container), port-forward (tunnel a local port to a Pod/Service), cp.
- Debugging: describe, logs --previous, get events.
The most important habit is preferring declarative apply over imperative commands for anything you want to keep: kubectl apply -f makes your manifests the source of truth and is idempotent and GitOps-friendly. Imperative commands (create, run, expose) are great for quick experiments. The -o/--output flag (-o yaml, -o json, -o jsonpath=..., -o wide) and -n/--namespace are used constantly.
Example
kubectl get pods -n web -o wide # list Pods with node/IP columns
kubectl describe pod web-abc -n web # detailed status + recent events
kubectl logs -f web-abc -c app # stream logs of container "app"
kubectl logs web-abc --previous # logs from the crashed prior container
kubectl apply -f deployment.yaml # declarative create/update (preferred)
kubectl scale deployment web --replicas=5 # imperative quick scale
kubectl set image deployment/web app=app:2.0 # trigger a rolling update
kubectl exec -it web-abc -- sh # open a shell in the container
kubectl port-forward svc/web 8080:80 # reach the Service at localhost:8080
kubectl get pod web-abc -o jsonpath='{.status.podIP}' # extract a single field
Exercises
- (Beginner) Which command shows a Pod's detailed status and recent events?
- (Beginner) How do you view the logs of a container that has already crashed and restarted?
- (Intermediate) Explain the difference between kubectl apply and kubectl create, and when you would use each.
- (Interview) Why is declarative kubectl apply -f generally preferred over imperative commands in production workflows? (Hint: source of truth, idempotency, reviewability, GitOps.)
Answers
- kubectl describe pod <name>: it shows detailed spec/status plus the recent Events that explain scheduling/pull/probe issues.
- kubectl logs <pod> --previous (optionally -c <container>) shows logs from the previous, crashed container instance.
- create is imperative and fails if the object already exists; it creates once. apply is declarative: it creates the object if absent and updates it to match the manifest if present (tracking changes via the last-applied-configuration), and it is idempotent. Use create for one-off imperative actions; use apply to manage resources from version-controlled manifests.
- apply -f treats version-controlled manifests as the single source of truth: it is idempotent (safe to re-run), produces reviewable diffs in code review, supports updates the same way as creates, and fits GitOps where Git is authoritative and tooling reconciles the cluster to it. Imperative commands mutate state ad hoc with no recorded source of truth, making changes hard to audit, reproduce, or roll back.
kubectl plugins and krew
Theory
kubectl is extensible: any executable on your PATH named kubectl-<something> becomes a subcommand kubectl <something>. This plugin mechanism lets the community add capabilities without modifying kubectl itself. krew is the official plugin manager for kubectl (itself a kubectl plugin) — it provides a curated index, and handles discovering, installing, and updating plugins across platforms.
Plugins fill gaps in the core CLI. Popular ones include kubectx/kubens (fast context/namespace switching), kubectl-neat (clean up get -o yaml output), stern (multi-Pod log tailing), and kubectl-tree (show ownership hierarchies). The pattern mirrors Kubernetes' broader extensibility philosophy: a small stable core plus a plugin ecosystem.
Example
# Install krew (the plugin manager), then use it to install plugins:
kubectl krew install ctx ns neat tree # install several plugins
kubectl krew list # show installed plugins
kubectl krew update && kubectl krew upgrade
# Now use them as native-feeling subcommands:
kubectl ctx prod # switch context (kubectx)
kubectl ns team-a # switch namespace (kubens)
kubectl tree deployment web # show owned ReplicaSets/Pods
# A plugin is just an executable named kubectl-<name> on PATH:
echo '#!/bin/sh
echo hello from my plugin' > kubectl-hello && chmod +x kubectl-hello
sudo mv kubectl-hello /usr/local/bin/ # the file must sit in a directory on PATH
kubectl hello # -> "hello from my plugin"
Exercises
- (Beginner) What naming convention turns an executable into a kubectl plugin?
- (Beginner) What is krew?
- (Intermediate) Write the steps to create a trivial custom kubectl plugin called kubectl hello that prints a message.
- (Interview) The kubectl plugin model and krew reflect a broader Kubernetes design philosophy. Describe that philosophy and one benefit it brings to the CLI specifically. (Hint: small stable core + extensible ecosystem.)
Answers
- An executable named kubectl-<name> on your PATH becomes the subcommand kubectl <name> (dashes in the filename map to spaces/subcommands).
- krew is the official plugin manager for kubectl: a kubectl plugin that provides a curated plugin index and installs, updates, and manages other plugins across platforms.
- Create a file named kubectl-hello containing a script (e.g., #!/bin/sh then echo "hello"), make it executable (chmod +x kubectl-hello), and place it on your PATH. Then kubectl hello runs it.
- The philosophy is a small, stable core with a pluggable, extensible ecosystem (the same pattern as CRI/CNI/CSI and CRDs). For the CLI, this means the kubectl maintainers keep the core lean and well-tested, while the community can add specialized capabilities (context switching, log tailing, visualization) independently. Users get rich functionality without bloating or destabilizing core kubectl, and plugins evolve on their own schedules.
4. Workload Resources
Pods are the atom of execution in Kubernetes, but you rarely create them directly. Instead you use higher-level workload resources — Deployments, StatefulSets, DaemonSets, Jobs — that manage Pods for you, providing self-healing, scaling, ordering, and lifecycle guarantees. This chapter starts at the Pod and works up the abstraction ladder, showing which controller to reach for given the shape of your workload.
4.1 Pods
A Pod is the smallest deployable unit in Kubernetes — one or more containers that share a network and storage context and are always scheduled together. This subchapter dissects the Pod inside and out.
Pod lifecycle and phases
Theory
A Pod is not simply "on" or "off." It moves through a defined set of phases that summarize where it is in its life. Understanding these phases is the foundation of debugging — most "why isn't my app working?" questions start by reading the phase and the container states beneath it.
The status.phase values are:
| Phase | Meaning |
|---|---|
| Pending | Accepted by the cluster but not yet running: being scheduled, or images are pulling. |
| Running | Bound to a node; at least one container is running (or starting/restarting). |
| Succeeded | All containers terminated successfully (exit 0) and will not restart. |
| Failed | All containers terminated, and at least one failed (non-zero exit) or was killed. |
| Unknown | The node's state can't be obtained (usually a node/communication failure). |
Beneath the phase, each container has its own state: Waiting (with a reason like ContainerCreating, ImagePullBackOff, CrashLoopBackOff), Running, or Terminated (with an exit code and reason like OOMKilled, Completed, Error). The phase is a high-level summary; the container states tell you the real story.
Example
kubectl get pod web -o jsonpath='{.status.phase}' # -> Running
kubectl get pod web -o jsonpath='{.status.containerStatuses[0].state}'
# {"waiting":{"reason":"CrashLoopBackOff","message":"back-off 40s ..."}}
Typical happy-path progression:
Pending (scheduling) <- waiting for a node
Pending (ContainerCreating) <- pulling image, setting up network/volumes
Running <- container(s) up
Succeeded (for batch Pods) OR stays Running (for services)
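Because the container state reason usually tells you where to look next, it can help to keep a small lookup of first steps. The sketch below is illustrative only: next_step is a hypothetical helper, the hints are suggestions rather than an official mapping, and <pod> is a placeholder you would substitute.

```shell
# next_step REASON: print a suggested first check for a container state reason.
next_step() {
  case "$1" in
    ContainerCreating) echo "kubectl describe pod <pod>      # Events: volume or network setup" ;;
    ImagePullBackOff)  echo "kubectl describe pod <pod>      # Events: image name, tag, registry access" ;;
    CrashLoopBackOff)  echo "kubectl logs <pod> --previous   # output of the crashed container" ;;
    OOMKilled)         echo "kubectl describe pod <pod>      # compare memory limits with usage" ;;
    *)                 echo "kubectl describe pod <pod>      # start from the Events section" ;;
  esac
}

next_step CrashLoopBackOff
```

The reason itself can be read with the jsonpath query shown above and passed to the function.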
Exercises
- (Beginner) List the five Pod phases and give a one-line meaning for each.
- (Beginner) What is the difference between a Pod phase and a container state?
- (Intermediate) A Pod is in Pending for a long time. Where do you look to find out why, and name two possible causes.
- (Interview) A Pod shows phase Running but your application is not serving traffic. Explain how that is possible and what you would inspect. (Hint: container restart counts, readiness probes, CrashLoopBackOff under a still-"Running" phase.)
Answers
- Pending: accepted but not yet running (scheduling/pulling); Running: on a node with at least one container running; Succeeded: all containers exited 0, no restart; Failed: all containers terminated and at least one failed; Unknown: node state cannot be determined.
- The phase is a coarse, cluster-level summary of the whole Pod's lifecycle position. A container state (Waiting/Running/Terminated, with reasons and exit codes) describes each individual container and carries the detailed diagnostic information.
- Look at kubectl describe pod <name> (the Events section) and the container statuses. Causes include: no node has sufficient resources; unsatisfiable nodeSelector/affinity or taints; image pull failure (ImagePullBackOff); unbound PersistentVolumeClaim; or no CNI/network setup.
- The phase Running only means a container process started, not that the app is ready. The container may be crash-looping (restarting repeatedly; check RESTARTS and --previous logs), failing its readiness probe (so it is excluded from Service endpoints despite "running"), or listening on the wrong port. Inspect restart counts, readiness/liveness probe status, logs, and whether the Service's EndpointSlices include the Pod.
Single-container vs multi-container pods
Theory
The most common Pod has exactly one container; this one-container-per-Pod model is the Kubernetes default. But a Pod can hold multiple containers that are tightly coupled and must share fate, network, and storage. The key principle: containers go in the same Pod only when they are a single cohesive unit that should always be co-located and co-scheduled; otherwise they belong in separate Pods.
All containers in a Pod share:
- The network namespace: same IP, same port space; they reach each other over localhost.
- Volumes: they can mount the same volume to share files.
- Lifecycle coupling: they are scheduled together onto one node and live/die together as a unit.
Multi-container patterns (named in Google's "Design Patterns for Container-Based Distributed Systems" paper) include sidecar (augment the main app), ambassador (proxy outbound connections), and adapter (transform output). If two containers do not need to share a host, network, or storage, put them in separate Pods so they can scale and schedule independently.
Example
apiVersion: v1
kind: Pod
metadata:
  name: web-with-logger
spec:
  containers:
  - name: app                    # main container writes logs to a shared volume
    image: myapp:1.0
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  - name: log-shipper            # sidecar reads the same volume and ships logs
    image: fluent-bit:2.0
    volumeMounts:
    - name: logs
      mountPath: /var/log/app    # SAME volume -> shares files with "app"
  volumes:
  - name: logs
    emptyDir: {}                 # shared scratch space, lives with the Pod
Exercises
- (Beginner) What do all containers in the same Pod share?
- (Beginner) How do two containers in the same Pod communicate over the network?
- (Intermediate) Give one example where two containers belong in the same Pod and one where they belong in separate Pods.
- (Interview) What is the guiding principle for deciding whether to add a container to an existing Pod versus creating a new Pod? (Hint: shared fate, co-scheduling, and independent scalability.)
Answers
- They share the network namespace (one IP and port space), can share volumes, and share lifecycle/scheduling (placed on the same node, started and stopped together).
- Over localhost (and distinct ports), since they share the same network namespace and IP.
- Same Pod: an app container plus a log-shipping sidecar that reads the app's logs from a shared volume (tightly coupled, must be co-located). Separate Pods: a frontend and a backend service that communicate over the network and need to scale independently; they should be different Deployments/Pods fronted by Services.
- Co-locate in one Pod only when the containers form a single cohesive unit that must share fate, node, network, and/or storage and should always be scheduled and scaled together. If they can or should scale, schedule, upgrade, or fail independently, put them in separate Pods. Over-stuffing a Pod couples lifecycles that ought to be independent.
Init containers
Theory
Sometimes a Pod's main container should not start until some setup is done: waiting for a dependency to be ready, downloading config, running database migrations, or setting file permissions. Init containers solve this. They are special containers that run to completion, one at a time, in order, before any of the Pod's normal (app) containers start.
Properties that make them distinct:
- They run sequentially; each must exit 0 before the next starts.
- If an init container fails, the kubelet restarts it until it succeeds, unless the Pod's restartPolicy is Never, in which case the Pod is marked failed. The app containers do not start until all init containers succeed, so the Pod robustly blocks on its prerequisites.
- They can use a different image with tools the main app does not need (e.g., git, curl), keeping the app image minimal.
Example
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  initContainers:
  - name: wait-for-db            # block until the database Service resolves/answers
    image: busybox:1.36
    command: ['sh', '-c', 'until nc -z db 5432; do echo waiting; sleep 2; done']
  containers:
  - name: app
    image: myapp:1.0             # only starts AFTER wait-for-db exits 0
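The sequencing rules for init containers can be sketched as a small Python model. This is a toy under stated assumptions, not how the kubelet is implemented: run_init_phase is a hypothetical function, each init container is reduced to a name and an exit code, and the model covers a single pass rather than the retry loop.

```python
def run_init_phase(init_containers, restart_policy="Always"):
    """Model one pass over a Pod's init containers.

    init_containers: list of (name, exit_code) pairs, in declaration order.
    Returns (names_that_ran, outcome).
    """
    ran = []
    for name, exit_code in init_containers:
        ran.append(name)
        if exit_code != 0:
            # A failure stops the sequence; later init containers never run.
            if restart_policy == "Never":
                return ran, "pod-failed"
            return ran, "retry-init"  # the failed init container is restarted
    # Only when every init container exited 0 may the app containers start.
    return ran, "start-app-containers"


# Hypothetical container names, chosen only for the example.
print(run_init_phase([("wait-for-db", 0), ("migrate", 1), ("warm-cache", 0)]))
print(run_init_phase([("wait-for-db", 0), ("migrate", 0)]))
```

In the first call the third init container never runs and the app containers stay blocked; in the second, the app containers are allowed to start.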
Exercises
- (Beginner) When do init containers run relative to the app containers?
- (Beginner) In what order do multiple init containers execute?
- (Intermediate) Why might you use a different (larger, tool-rich) image for an init container than for the app container?
- (Interview) An init container is stuck failing repeatedly. What happens to the Pod's app containers, and how does this behavior help enforce dependencies? (Hint: app containers don't start; restartPolicy.)
Answers
- Init containers run (and must all complete successfully) before any of the Pod's normal app containers start.
- Sequentially and in the order listed; each init container must exit 0 before the next one begins.
- The init container can carry tools needed only for setup (e.g., git, curl, migration utilities) without bloating or adding attack surface to the app image. The app image stays minimal while setup still has what it needs.
- The app containers do not start at all until every init container succeeds; a failing init container is restarted per the Pod's restartPolicy, so the Pod effectively blocks (often in Init:Error/Init:CrashLoopBackOff) until the prerequisite is met. This enforces dependencies declaratively: the app can assume its prerequisites are satisfied because Kubernetes guarantees it never starts otherwise.
Sidecar containers
Theory
A sidecar is a helper container that runs alongside the main application container in the same Pod to augment or support it — handling cross-cutting concerns like logging, monitoring, proxying, or syncing files, so the main app doesn't have to. The classic examples: a log-shipping agent, a service-mesh proxy (Envoy), or a file-sync container pulling content for a web server.
Historically, sidecars were just ordinary additional containers in the Pod, which created problems for Jobs and startup ordering: a long-running sidecar would keep a Job from completing, and there was no guarantee the sidecar started before the app. Newer Kubernetes releases (enabled by default since v1.29, stable in v1.33) provide native sidecar containers, declared as init containers with restartPolicy: Always. These start before app containers (like init containers), keep running alongside them (unlike normal init containers), and are shut down after the app containers, fixing both the ordering and Job-completion issues.
Example
apiVersion: v1
kind: Pod
metadata:
  name: web
spec:
  initContainers:
  - name: log-shipper            # a NATIVE sidecar: init container that keeps running
    image: fluent-bit:2.0
    restartPolicy: Always        # <- this makes it a sidecar, not a normal init container
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  containers:
  - name: app
    image: myapp:1.0
    volumeMounts:
    - name: logs
      mountPath: /var/log/app
  volumes:
  - name: logs
    emptyDir: {}
Exercises
- (Beginner) What is a sidecar container, in one sentence?
- (Beginner) How are native sidecar containers declared in modern Kubernetes?
- (Intermediate) Why did running a sidecar as an ordinary container cause problems for Jobs?
- (Interview) Compare a native sidecar (init container with restartPolicy: Always) to a plain additional app container. What ordering and lifecycle guarantees does the native sidecar add? (Hint: starts before, stops after, doesn't block Job completion.)
Answers
- A sidecar is a helper container that runs in the same Pod as the main app to provide supporting/cross-cutting functionality (logging, proxying, syncing, monitoring).
- As an init container with restartPolicy: Always (enabled by default since v1.29, stable in v1.33). This signals Kubernetes to start it before the app containers but keep it running alongside them.
- An ordinary container runs for the life of the Pod, but a Job is considered complete only when its containers terminate. A long-running sidecar never exits on its own, so the Job's Pod never reaches Succeeded; the sidecar keeps it "running" forever.
- A native sidecar starts before the app containers (so the app can rely on it, e.g., a proxy being up), keeps running during the app's life, and is terminated after the app containers stop. Crucially, the Pod/Job can complete based on the main containers finishing; the sidecar does not block completion. A plain app container has none of these ordering or completion guarantees.
Pod spec anatomy
Theory
Every Pod (and every workload that creates Pods) is described by a Pod spec — the spec section that declares what to run and how. Reading and writing this fluently is essential. The major regions of a Pod spec are:
- containers / initContainers: the images, commands, ports, env, resources, probes, and volume mounts.
- volumes: storage available to the Pod's containers.
- Scheduling controls: nodeSelector, affinity, tolerations, topologySpreadConstraints.
- Security: securityContext (Pod- and container-level), serviceAccountName.
- Lifecycle/behavior: restartPolicy, terminationGracePeriodSeconds, dnsPolicy, hostNetwork.
Within a container, the most-used fields are name, image, command/args, env/envFrom, ports, resources (requests/limits), volumeMounts, and the three probes (livenessProbe, readinessProbe, startupProbe).
Example
apiVersion: v1
kind: Pod
metadata:
  name: anatomy
  labels: { app: anatomy }
spec:
  serviceAccountName: app-sa            # identity for API access
  restartPolicy: Always                 # Always | OnFailure | Never
  terminationGracePeriodSeconds: 30     # time for graceful shutdown
  containers:
  - name: app
    image: myapp:1.0
    command: ["/bin/app"]               # overrides image ENTRYPOINT
    args: ["--port=8080"]               # overrides image CMD
    ports:
    - containerPort: 8080
    env:
    - name: LOG_LEVEL
      value: "info"
    resources:                          # scheduling + limits (see Chapter 8)
      requests: { cpu: "100m", memory: "128Mi" }
      limits: { cpu: "500m", memory: "256Mi" }
    readinessProbe:                     # gate traffic until ready (see Chapter 10)
      httpGet: { path: /healthz, port: 8080 }
    volumeMounts:
    - name: data
      mountPath: /data
  volumes:
  - name: data
    emptyDir: {}
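How command and args interact with the image's ENTRYPOINT and CMD is easy to get wrong, so here is a small Python sketch of the resolution rules. effective_argv is a hypothetical helper written for this illustration, and the image values in the calls are made up; the rules themselves follow the Kubernetes documentation on defining a command and arguments for a container.

```python
def effective_argv(image_entrypoint, image_cmd, command=None, args=None):
    """Return the argv a container would run, given image defaults and Pod overrides."""
    if command is not None:
        # command replaces ENTRYPOINT, and the image's CMD is ignored entirely.
        return list(command) + list(args or [])
    if args is not None:
        # Only args set: keep the image ENTRYPOINT, replace its CMD.
        return list(image_entrypoint) + list(args)
    # Nothing set in the Pod spec: use the image defaults.
    return list(image_entrypoint) + list(image_cmd)


# Hypothetical image defaults: ENTRYPOINT ["/entry"], CMD ["--default"].
ep, cmd = ["/entry"], ["--default"]
print(effective_argv(ep, cmd))                                          # image defaults
print(effective_argv(ep, cmd, command=["/bin/app"]))                    # CMD dropped
print(effective_argv(ep, cmd, args=["--port=8080"]))                    # ENTRYPOINT kept
print(effective_argv(ep, cmd, command=["/bin/app"], args=["--port=8080"]))
```

The case that most often surprises people is the second one: setting command alone discards the image's CMD rather than appending it.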
Exercises
- (Beginner) What is the difference between a container's command/args and the image's ENTRYPOINT/CMD?
- (Beginner) Name the three valid values of restartPolicy.
- (Intermediate) Where in a Pod spec do volumes get declared versus mounted, and why are these two separate steps?
- (Interview) The same Pod spec template appears inside Deployments, StatefulSets, DaemonSets, and Jobs. Why is reusing one Pod template across all controllers a powerful design decision? (Hint: consistency, composability, learn-once.)
Answers
- command overrides the image's ENTRYPOINT and args overrides its CMD. If both are omitted, the image's built-in ENTRYPOINT/CMD are used; if only command is set, the image's CMD is ignored. They let you change what runs without rebuilding the image.
- Always, OnFailure, and Never.
- Volumes are declared once under spec.volumes (defining the storage source, e.g., emptyDir, configMap, PVC) and mounted per-container under containers[].volumeMounts (defining where each container sees it). Separating them lets multiple containers mount the same declared volume at different paths and decouples "what storage exists" from "who uses it and where."
- Every workload controller embeds the identical Pod template, so once you understand the Pod spec you understand how to configure all of them. It composes cleanly (affinity, probes, resources, and security all work the same regardless of controller), reducing cognitive load, enabling consistent tooling/validation, and letting controllers focus only on how many Pods and with what lifecycle, not on redefining what a Pod is.
Ephemeral containers for debugging
Theory
How do you debug a running Pod whose container image is a minimal, distroless image with no shell, no curl, no ps? You cannot kubectl exec a shell that does not exist. Ephemeral containers solve this: they are temporary containers you inject into a running Pod for debugging, sharing the Pod's namespaces so they can inspect the existing containers' processes, network, and (optionally) filesystem.
Key characteristics:
- They are added to a live Pod via kubectl debug and cannot be specified in the Pod spec at creation time.
- They have no resource guarantees, no probes, and no ports; they are strictly for interactive troubleshooting.
- They are never restarted automatically, and once added they cannot be removed from the Pod: the container simply exits when your session ends and stays listed until the Pod itself is deleted. They don't alter the original containers.
This is the modern, supported way to debug — far better than baking debug tools into production images.
Example
# Attach a debug container (with a full shell + tools) to a running Pod:
kubectl debug -it web-abc --image=busybox:1.36 --target=app
# --target=app shares the process namespace of container "app" so you can see its PIDs
# Inside, you can inspect the app's network and processes:
/ # wget -qO- localhost:8080/healthz
/ # ps aux
# Or create a copy of the Pod with a debug container (non-disruptive variant):
kubectl debug web-abc -it --image=busybox --copy-to=web-debug
Exercises
- (Beginner) What problem do ephemeral containers solve for minimal/distroless images?
- (Beginner) Which kubectl command injects an ephemeral debug container into a running Pod?
- (Intermediate) Why can't ephemeral containers have resource requests/limits, ports, or probes?
- (Interview) Compare baking debugging tools into a production image versus using ephemeral containers. What are the security and operational advantages of the ephemeral approach? (Hint: image size, attack surface, on-demand tooling.)
Answers
- Minimal/distroless images often lack a shell and tools, so kubectl exec has nothing to run. Ephemeral containers inject a temporary, tool-rich container into the running Pod, sharing its namespaces so you can debug the existing containers without modifying their images.
- kubectl debug (e.g., kubectl debug -it <pod> --image=busybox --target=<container>).
- They are designed purely for short-lived, interactive debugging and are added to an already-running Pod. Allowing resource requests would affect scheduling/QoS of a Pod that is already placed; probes and ports imply a managed, long-lived service role. Ephemeral containers intentionally have none of these so they never disturb the Pod's guaranteed resources or service behavior.
- Baking tools into production images increases image size, expands the attack surface (every tool is a potential exploit vector), and means you ship debugging capabilities to every running instance permanently. Ephemeral containers keep production images minimal and hardened, and provide debugging tools on demand only when needed, in a temporary container that exits when the session ends. The result is a better security posture and smaller, faster images, with no loss of debuggability.
4.2 ReplicaSets and Deployments
You almost never create bare Pods in production — they have no self-healing or rollout machinery. Deployments (built on ReplicaSets) give you declarative, versioned, self-healing application management. This subchapter covers both.
ReplicaSet purpose and behavior
Theory
A bare Pod is mortal: if it (or its node) dies, it is gone for good. A ReplicaSet is the controller that fixes this. Its single responsibility is to maintain a stable set of replica Pods running at any given time — if you ask for 3 and only 2 exist, it creates one; if 4 exist, it deletes one. This is reconciliation (Chapter 1) applied to Pod count.
A ReplicaSet identifies the Pods it owns through a label selector plus ownerReferences on the Pods. This is important: a ReplicaSet will adopt any existing Pod matching its selector that lacks a controller, and conversely it only manages Pods carrying the matching labels. In practice you rarely create ReplicaSets directly — you create Deployments, which create and manage ReplicaSets for you (adding rollout and rollback on top). Knowing the ReplicaSet exists underneath is essential for understanding how Deployments work.
Example
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web
spec:
  replicas: 3                    # desired count the controller maintains
  selector:
    matchLabels:
      app: web                   # which Pods this RS owns
  template:                      # the Pod template used to create missing Pods
    metadata:
      labels:
        app: web                 # MUST match the selector above
    spec:
      containers:
      - name: web
        image: nginx:1.25
kubectl get rs web
# NAME DESIRED CURRENT READY AGE
# web 3 3 3 2m
kubectl delete pod -l app=web --field-selector ... # delete one; RS recreates it
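The ReplicaSet's reconciliation step can be sketched in a few lines of Python. This is a simplified model rather than the controller's actual code: reconcile is a hypothetical function, Pods are plain dictionaries, and ownerReferences and adoption are left out so that only the label-selector count remains.

```python
def reconcile(desired: int, pods: list[dict], selector: dict) -> tuple[str, int]:
    """Compare desired replicas with the Pods matching the selector.

    Returns the action the controller would take and how many Pods it affects.
    """
    matching = [
        p for p in pods
        if all(p["labels"].get(key) == value for key, value in selector.items())
    ]
    diff = desired - len(matching)
    if diff > 0:
        return ("create", diff)
    if diff < 0:
        return ("delete", -diff)
    return ("noop", 0)


selector = {"app": "web"}
pods = [{"labels": {"app": "web"}}, {"labels": {"app": "web"}}, {"labels": {"app": "db"}}]

print(reconcile(3, pods, selector))  # one matching Pod is missing
print(reconcile(1, pods, selector))  # one matching Pod too many
```

The model also shows why the template labels must match the selector: a Pod created with non-matching labels would never be counted in matching, so every reconciliation pass would ask for yet another Pod.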
Exercises
- (Beginner) What is the one job of a ReplicaSet?
- (Beginner) How does a ReplicaSet know which Pods belong to it?
- (Intermediate) Why do you rarely create ReplicaSets directly, and what do you create instead?
- (Interview) The Pod template's labels must match the ReplicaSet's selector. What goes wrong if they don't, and why is this constraint enforced? (Hint: orphaned Pods, runaway creation.)
Answers
- To maintain a specified number of identical replica Pods running at all times, creating or deleting Pods to converge actual count to the desired replicas.
- Via its label selector (which Pods match) combined with ownerReferences set on Pods it creates. It manages Pods that match the selector and adopts matching controller-less Pods.
- Because Deployments manage ReplicaSets and add rollout, rollback, and revision history on top. You create a Deployment; it creates/manages ReplicaSets automatically, which is almost always what you want.
- Kubernetes rejects a ReplicaSet whose template labels don't match its selector (selector does not match template labels). If it were allowed, the Pods the RS creates would not match its own selector, so it would never "see" them as satisfying the count and would create Pods endlessly (runaway creation), while the created Pods would be effectively orphaned. The constraint guarantees the controller can recognize the Pods it produces.
Deployment creation and management
Theory
A Deployment is the workhorse controller for stateless applications. It manages ReplicaSets, which manage Pods — a three-level hierarchy. The Deployment adds what a raw ReplicaSet lacks: declarative updates with versioned rollouts and rollbacks. When you change a Deployment's Pod template (e.g., a new image), the Deployment creates a new ReplicaSet and gradually shifts Pods from the old ReplicaSet to the new one, keeping a history of revisions.
You manage a Deployment declaratively with kubectl apply -f, and observe rollouts with kubectl rollout status. The Deployment continuously reconciles: it ensures the right number of replicas of the right version are running, replacing failed Pods and completing in-progress rollouts. For the vast majority of services (web servers, APIs, workers), the Deployment is the correct resource.
Example
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
      - name: web
        image: nginx:1.25
kubectl apply -f web-deploy.yaml
kubectl get deploy,rs,pods -l app=web # see the Deploy -> RS -> Pods hierarchy
kubectl rollout status deployment/web # wait for the rollout to finish
kubectl scale deployment web --replicas=5
Exercises
- (Beginner) What is the three-level hierarchy a Deployment manages?
- (Beginner) What does a Deployment add on top of a plain ReplicaSet?
- (Intermediate) When you change a Deployment's image, what does the Deployment create, and what happens to the old ReplicaSet?
- (Interview) Why is the Deployment the recommended resource for stateless apps, but not for databases? (Hint: identity, ordering, and storage — foreshadowing StatefulSets.)
Answers
- Deployment -> ReplicaSet -> Pods. The Deployment manages ReplicaSets, which manage Pods.
- Declarative, versioned rollouts and rollbacks with revision history (plus pause/resume and rollout strategies), which a bare ReplicaSet does not provide.
- It creates a new ReplicaSet for the new template and scales it up while scaling the old ReplicaSet down (a rolling update). The old ReplicaSet is kept (scaled to 0) for rollback history, up to revisionHistoryLimit.
- Deployment Pods are interchangeable and have no stable identity or ordered lifecycle, which is perfect for stateless apps where any replica is as good as another. Databases need stable network identities, ordered startup/shutdown, and stable per-Pod persistent storage, which Deployments do not provide. Those guarantees come from StatefulSets.
Rolling updates and rollbacks
Theory
The whole point of a Deployment is zero-downtime updates. A rolling update replaces old Pods with new ones incrementally: bring up some new Pods, wait until they are ready, then terminate some old Pods, repeating until the new version fully replaces the old. Because old Pods keep serving until new ones are ready, users see no outage.
Two parameters control the pace, both under strategy.rollingUpdate:
- maxUnavailable: how many Pods (count or %) may be unavailable during the update — controls how aggressively old Pods are removed.
- maxSurge: how many extra Pods (count or %) may be created above the desired count — controls how many new Pods come up at once.
If a new version is broken, you roll back to a previous revision — Kubernetes keeps the old ReplicaSets, so rollback is just scaling the old RS back up and the bad one down. kubectl rollout undo does exactly this.
Example
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1          # never drop below desired-1 ready Pods
      maxSurge: 1                # allow at most 1 extra Pod during the rollout
kubectl set image deployment/web web=nginx:1.26 # trigger a rolling update
kubectl rollout status deployment/web # watch progress
kubectl rollout history deployment/web # list revisions
kubectl rollout undo deployment/web # roll back to previous revision
kubectl rollout undo deployment/web --to-revision=3
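The effect of maxUnavailable and maxSurge on a rollout can be worked out with simple arithmetic, sketched here in Python. The function names are hypothetical; the rounding rule (percentages round down for maxUnavailable and up for maxSurge) follows the Kubernetes Deployment documentation.

```python
def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or a percentage string such as "25%" into a Pod count."""
    if isinstance(value, str) and value.endswith("%"):
        scaled = int(value[:-1]) * replicas
        # Integer arithmetic avoids floating-point surprises when rounding.
        return -(-scaled // 100) if round_up else scaled // 100
    return int(value)


def rollout_bounds(replicas: int, max_unavailable="25%", max_surge="25%") -> tuple[int, int]:
    """Return (minimum available Pods, maximum total Pods) during a rollout."""
    unavailable = resolve(max_unavailable, replicas, round_up=False)  # rounds down
    surge = resolve(max_surge, replicas, round_up=True)               # rounds up
    return replicas - unavailable, replicas + surge


print(rollout_bounds(4, max_unavailable=1, max_surge=1))  # (3, 5)
print(rollout_bounds(10))                                 # defaults of 25%: (8, 13)
```

With 10 replicas and the 25% defaults, 2.5 rounds down to 2 unavailable and up to 3 surge, so the rollout keeps at least 8 Pods available and never runs more than 13.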
Exercises
- (Beginner) Why does a rolling update avoid downtime?
- (Beginner) What do maxUnavailable and maxSurge control?
- (Intermediate) With replicas: 4, maxUnavailable: 1, maxSurge: 1, what is the minimum number of ready Pods and the maximum total Pods during a rollout?
- (Interview) How is a rollback able to be so fast and safe in Kubernetes? Explain what actually happens at the ReplicaSet level. (Hint: old ReplicaSets are retained, not deleted.)
Answers
- New Pods are brought up and confirmed ready before old Pods are removed, so a working set of Pods is always serving traffic — there is never a moment with zero available replicas (given sane settings and readiness probes).
- maxUnavailable caps how many Pods may be unavailable during the update (how fast old Pods are torn down); maxSurge caps how many extra Pods above the desired count may be created (how fast new Pods are added).
- Minimum ready = replicas - maxUnavailable = 4 - 1 = 3. Maximum total = replicas + maxSurge = 4 + 1 = 5.
- When a Deployment updates, it does not delete the old ReplicaSet; it scales it to 0 and keeps it in history. A rollback simply scales the previous (known-good) ReplicaSet back up and the current (bad) one down, which is the same rolling mechanism in reverse. Because the old Pods' exact spec is preserved in the retained ReplicaSet, the rollback is fast, deterministic, and requires no rebuild or re-fetch of configuration.
Deployment strategies (Recreate, RollingUpdate)
Theory
A Deployment supports two built-in update strategies, chosen via spec.strategy.type:
- RollingUpdate (the default): incrementally replace old Pods with new ones, as described above. Both versions run simultaneously during the transition. This gives zero downtime but requires your app to tolerate two versions running at once (e.g., compatible database schemas).
- Recreate: terminate all old Pods first, then create the new ones. This causes a brief downtime window (no Pods serving during the switch) but guarantees only one version runs at a time. Use it when two versions cannot coexist — for example, an app that holds an exclusive lock, or a schema migration incompatible with the old code.
More advanced strategies (canary, blue-green) aren't native Deployment types; they're built using multiple Deployments/Services or tools like Argo Rollouts (covered in Chapter 12).
Example
# Recreate: full stop, then full start (accepts downtime, avoids version overlap)
spec:
  strategy:
    type: Recreate
---
# RollingUpdate (default): gradual, zero-downtime, versions overlap
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%
      maxSurge: 25%
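To see the difference between the two strategies step by step, here is a rough Python simulation of the (old, new) Pod counts over time. It is a deliberately simplified model and not the Deployment controller's algorithm: both functions are hypothetical, and every new Pod is assumed to become Ready the instant it is created.

```python
def recreate_steps(replicas: int) -> list[tuple[int, int]]:
    """(old, new) Pod counts for Recreate: stop everything, then start everything."""
    return [(replicas, 0), (0, 0), (0, replicas)]


def rolling_update_steps(replicas: int, max_unavailable: int, max_surge: int) -> list[tuple[int, int]]:
    """(old, new) Pod counts for a simplified RollingUpdate (new Pods are instantly Ready)."""
    if max_unavailable == 0 and max_surge == 0:
        raise ValueError("maxUnavailable and maxSurge cannot both be 0")
    old, new = replicas, 0
    steps = [(old, new)]
    while old > 0 or new < replicas:
        # Scale up the new version without exceeding replicas + maxSurge in total.
        new += max(0, min(replicas - new, replicas + max_surge - (old + new)))
        if steps[-1] != (old, new):
            steps.append((old, new))
        # Scale down the old version without dropping below replicas - maxUnavailable.
        old -= max(0, min(old, (old + new) - (replicas - max_unavailable)))
        if steps[-1] != (old, new):
            steps.append((old, new))
    return steps


print(recreate_steps(3))
print(rolling_update_steps(3, max_unavailable=1, max_surge=1))
```

Recreate passes through (0, 0), the downtime window in which nothing serves traffic. The rolling sequence never does, but in its intermediate states both versions run side by side, which is exactly the compatibility requirement described above.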
Exercises
- (Beginner) What are the two built-in Deployment strategy types?
- (Beginner) Which strategy causes downtime, and which avoids it?
- (Intermediate) Give a concrete scenario where Recreate is the correct choice despite the downtime.
- (Interview) RollingUpdate runs two versions of your app simultaneously. What application-level requirement does this impose, and how does it relate to database schema changes? (Hint: backward/forward compatibility.)
Answers
- RollingUpdate (default) and Recreate.
- Recreate causes downtime (all old Pods stop before new ones start). RollingUpdate avoids downtime (gradual replacement).
- When the old and new versions cannot run at the same time, e.g., the app requires an exclusive resource/lock that only one version may hold, or a non-backward-compatible schema migration means the old code would corrupt or fail against the new schema. Recreate guarantees only one version exists at a time.
- RollingUpdate means old and new Pods serve traffic concurrently, so both versions must be mutually compatible (the new version works with old data/clients and vice versa). For databases this requires backward/forward-compatible, expand-then-contract migrations: first deploy schema changes that both versions tolerate (additive), roll out the new code, then remove old columns/fields in a later step — never a breaking change in a single rollout.
Pausing and resuming deployments
Theory
Sometimes you want to make several changes to a Deployment without triggering a separate rollout for each one. By default, every edit to the Pod template kicks off a new rollout immediately. Pausing a Deployment lets you batch changes: while paused, the Deployment records your spec changes but does not act on them. When you resume, it performs a single rollout that incorporates all the accumulated changes.
This is also useful for canary-style manual control: you can pause mid-rollout to observe the new Pods before allowing the rollout to continue, then resume (or roll back) based on what you see. kubectl rollout pause and kubectl rollout resume control this.
Example
kubectl rollout pause deployment/web # stop reacting to spec changes
# Make several changes — none triggers a rollout yet:
kubectl set image deployment/web web=nginx:1.26
kubectl set resources deployment/web -c=web --limits=cpu=500m,memory=256Mi
kubectl set env deployment/web LOG_LEVEL=debug
kubectl rollout resume deployment/web # ONE rollout applies all changes
kubectl rollout status deployment/web
Exercises
- (Beginner) By default, what happens each time you change a Deployment's Pod template?
- (Beginner) What is the benefit of pausing a Deployment before making several changes?
- (Intermediate) How can pause/resume be used to manually gate a rollout for observation?
- (Interview) When a Deployment is paused, where do your changes go, and what guarantees do you have that they will be applied correctly on resume? (Hint: desired state is still recorded in the spec.)
Answers
- It triggers a new rollout immediately (a new ReplicaSet is created and Pods are rolled).
- You can batch multiple edits (image, env, resources) and have them applied in a single rollout on resume, instead of one rollout per change — fewer disruptions and faster overall.
- Trigger a change so a rollout begins, then kubectl rollout pause partway through; observe the new Pods/canary behavior; if healthy, kubectl rollout resume to complete it, or kubectl rollout undo to abort. The pause halts further progression while the partial new version is observable.
- The changes are written to the Deployment's spec (desired state) and stored in etcd as normal; pausing only stops the Deployment controller from acting on the spec, not from recording it. On resume, the controller reconciles actual state to the now-current desired spec via its normal rolling mechanism, so all accumulated changes are applied together and correctly, with the usual rollout guarantees (readiness gating, surge/unavailable limits).
4.3 StatefulSets
Some applications — databases, message queues, clustered stores — need stable identities and storage that Deployments cannot provide. StatefulSets are built for exactly these stateful workloads. This subchapter explains how and when to use them.
StatefulSet use cases
Theory
Deployments treat Pods as cattle — interchangeable and disposable. Some workloads need pets — each instance has a distinct, persistent identity. A StatefulSet manages Pods that each require one or more of: a stable, unique network identity, stable persistent storage that follows the Pod across restarts/rescheduling, and ordered, graceful deployment and scaling.
Typical use cases: databases (PostgreSQL, MySQL), distributed datastores (Cassandra, MongoDB, Elasticsearch), message brokers (Kafka, RabbitMQ), and any clustered system where members must know each other by stable names and keep their own data. The defining question: does each replica need its own persistent identity and storage? If yes, StatefulSet; if no (replicas are interchangeable), Deployment.
Example
Deployment Pods (interchangeable):      StatefulSet Pods (stable identity):
web-7d9c-a1b2 (random suffix)           db-0   <- ordinal identity
web-7d9c-c3d4                           db-1
web-7d9c-e5f6                           db-2
any can be replaced by any other        db-0 always "db-0", own storage
Exercises
- (Beginner) What three guarantees can a StatefulSet provide that a Deployment cannot?
- (Beginner) Give two example workloads that are appropriate for a StatefulSet.
- (Intermediate) What single question best determines whether to use a StatefulSet or a Deployment?
- (Interview) Explain the "pets vs. cattle" analogy and how it maps onto Deployments vs. StatefulSets. (Hint: identity and replaceability.)
Answers
- Stable, unique network identity per Pod; stable persistent storage that stays with each Pod across rescheduling; and ordered, graceful deployment, scaling, and deletion.
- Any two: PostgreSQL/MySQL databases, Cassandra/MongoDB/Elasticsearch clusters, Kafka/RabbitMQ/ZooKeeper.
- "Does each replica need its own stable identity and persistent storage?" If yes -> StatefulSet; if replicas are interchangeable and stateless -> Deployment.
- "Cattle" are interchangeable and individually disposable: you don't name them, and if one fails you replace it with an identical one. That is a Deployment's Pods. "Pets" have names and individual care: each is unique and not freely replaceable. That is a StatefulSet's Pods, where each has a stable ordinal identity (db-0, db-1) and its own persistent data, so it must be recreated as itself, not swapped for an arbitrary replica.
Stable network identities
Theory
A StatefulSet gives each Pod a stable, predictable name and DNS record that persists across restarts and rescheduling. Pods are named <statefulset-name>-<ordinal> — db-0, db-1, db-2 — rather than getting random suffixes. Crucially, if db-1 dies and is recreated (even on another node), it comes back as db-1, not a new random name.
Combined with a headless Service (covered below), each Pod also gets a stable DNS name: <pod-name>.<service-name>.<namespace>.svc.cluster.local, e.g., db-0.cassandra.default.svc.cluster.local. This lets cluster members address each other directly and reliably — essential for systems where nodes must form a quorum, replicate to specific peers, or maintain a membership list. Without stable identities, a restarted Pod would be unrecognizable to its peers.
Example
# StatefulSet Pods have ordinal names, created/addressed predictably:
kubectl get pods -l app=db
# NAME READY STATUS AGE
# db-0 1/1 Running 5m
# db-1 1/1 Running 4m
# db-2 1/1 Running 3m
# Each Pod is directly addressable by stable DNS (via a headless Service "db"):
kubectl run t --rm -it --image=busybox -- nslookup db-0.db.default.svc.cluster.local
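Because the per-Pod DNS name follows a fixed pattern, a peer list can be generated rather than hard-coded. The following is a minimal shell sketch, not a kubectl feature: the function name sts_pod_fqdn is hypothetical, and it assumes the default cluster domain cluster.local. It only assembles strings and does not query DNS.

```shell
# Hypothetical helper: build the stable DNS name of one StatefulSet Pod.
# Usage: sts_pod_fqdn <statefulset> <ordinal> <headless-service> <namespace>
# Assumption: the cluster uses the default domain "cluster.local".
sts_pod_fqdn() {
  printf '%s-%s.%s.%s.svc.cluster.local\n' "$1" "$2" "$3" "$4"
}

# Peer list for the 3-replica StatefulSet "db" behind the headless Service "db":
for ordinal in 0 1 2; do
  sts_pod_fqdn db "$ordinal" db default
done
```

A clustered database's startup script could feed such a list into its seed or peer configuration.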
Exercises
- (Beginner) What is the naming pattern for StatefulSet Pods?
- (Beginner) If db-1 is rescheduled to a different node, what name does it come back with?
- (Intermediate) Write the full DNS name for Pod db-2 of a StatefulSet fronted by a headless Service named cassandra in the data namespace.
- (Interview) Why are stable per-Pod DNS names essential for clustered databases, but unnecessary for a stateless web Deployment? (Hint: peer membership, replication targets, quorum.)
Answers
- <statefulset-name>-<ordinal>, with ordinals starting at 0 (e.g., db-0, db-1, db-2).
- The same name, db-1: the identity is stable across rescheduling.
- db-2.cassandra.data.svc.cluster.local (form: <pod>.<headless-service>.<namespace>.svc.cluster.local). The Pod name comes from the StatefulSet (db-2), not from the Service name.
- Clustered databases form a membership where nodes must reach specific peers by a durable address (for replication, leader election, and quorum) and must recognize a restarted member as the same node holding the same data. Stable DNS names provide that durable addressing. A stateless web Deployment's replicas are interchangeable and addressed collectively through a single Service ClusterIP/load balancer; no replica needs to find a specific peer, so per-Pod identity adds no value.
Ordered pod creation and deletion
Theory
StatefulSets manage Pods in order, which matters for systems with initialization dependencies (e.g., a primary must be up before replicas join). By default:
- Creation/scale-up proceeds in order from ordinal 0 upward: db-0 is created and becomes Ready before db-1 starts, which becomes Ready before db-2, and so on.
- Deletion/scale-down proceeds in reverse order, from the highest ordinal down: db-2 is terminated before db-1, before db-0.
- Updates (RollingUpdate) also go in reverse ordinal order, one Pod at a time.
This ordering can be relaxed with podManagementPolicy: Parallel (start/stop all at once) when your app does not need it. Parallel affects scaling operations only; rolling updates are not changed by it. The default OrderedReady is what provides the dependency guarantees. Ordered, one-at-a-time operations make scaling slower but safe for stateful systems.
Example
spec:
  podManagementPolicy: OrderedReady   # default: one at a time, in order
  # podManagementPolicy: Parallel     # alternative: all at once (no ordering)
  updateStrategy:
    type: RollingUpdate               # updates go highest-ordinal -> lowest
Scale up 0 -> 3: create db-0 (wait Ready) -> db-1 (wait Ready) -> db-2
Scale down 3 -> 1: delete db-2 -> db-1 -> (keep db-0)
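The ordering rules can be traced with a small shell sketch. The function names are hypothetical, and the loops only print Pod names in the order the default OrderedReady policy would act on them; nothing here contacts a cluster.

```shell
# Hypothetical helpers that print Pod names in the order the default
# OrderedReady policy acts on them. They do not contact a cluster.

# Usage: sts_scale_up_order <name> <current-replicas> <target-replicas>
sts_scale_up_order() {
  i=$2
  while [ "$i" -lt "$3" ]; do
    echo "$1-$i"        # created only after the previous Pod is Running and Ready
    i=$((i + 1))
  done
}

# Usage: sts_scale_down_order <name> <current-replicas> <target-replicas>
sts_scale_down_order() {
  i=$(($2 - 1))
  while [ "$i" -ge "$3" ]; do
    echo "$1-$i"        # the highest ordinal is removed first
    i=$((i - 1))
  done
}

sts_scale_up_order db 0 3      # prints db-0, db-1, db-2 (one per line)
sts_scale_down_order db 3 1    # prints db-2, db-1 (one per line)
```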
Exercises
- (Beginner) In what order are StatefulSet Pods created during scale-up?
- (Beginner) In what order are they deleted during scale-down?
- (Intermediate) What does podManagementPolicy: Parallel change, and when would you use it?
- (Interview) Why is ordered, one-at-a-time scaling important for a primary/replica database, even though it is slower than parallel? (Hint: a replica needs the primary to exist before it can join.)
Answers
- In ascending ordinal order starting at 0, and each Pod must be Running and Ready before the next is created (db-0, then db-1, then db-2).
- In descending order, highest ordinal first (db-2, then db-1, then db-0).
- Parallel makes the StatefulSet create and delete all Pods simultaneously instead of waiting for ordering/readiness. Use it when Pods have no startup interdependencies and you want faster scaling, while still wanting stable identities and per-Pod storage.
- Many stateful systems have initialization dependencies: a replica must connect to an already-running primary to bootstrap and begin replication, and quorum-based systems benefit from members joining in a controlled sequence. Ordered, ready-gated startup guarantees the prerequisite member (e.g., db-0 as primary) exists and is ready before dependents (db-1, db-2) try to join, avoiding race conditions and failed bootstraps. The slower rollout is an acceptable cost for correctness.
Persistent storage with StatefulSets
Theory
The feature that truly distinguishes StatefulSets is per-Pod persistent storage via volumeClaimTemplates. Instead of all Pods sharing one volume, the StatefulSet creates a separate PersistentVolumeClaim for each Pod, named deterministically (e.g., data-db-0, data-db-1). Each Pod gets its own volume that stays bound to that ordinal identity for the life of the StatefulSet.
The critical guarantee: if db-1 is deleted and recreated, it re-attaches to the same PVC (data-db-1) and thus the same data. Even on a different node, db-1 keeps its data. Deleting the StatefulSet does not delete the PVCs by default — this protects data from accidental loss, but means you must clean up PVCs manually when you truly want to discard the data. This is exactly the behavior a database needs.
Example
apiVersion: apps/v1
kind: StatefulSet
metadata: { name: db }
spec:
  serviceName: db
  replicas: 3
  selector: { matchLabels: { app: db } }
  template:
    metadata: { labels: { app: db } }
    spec:
      containers:
        - name: db
          image: postgres:16
          volumeMounts:
            - name: data
              mountPath: /var/lib/postgresql/data
  volumeClaimTemplates:              # one PVC PER Pod, auto-created
    - metadata: { name: data }
      spec:
        accessModes: ["ReadWriteOnce"]
        resources: { requests: { storage: 10Gi } }
kubectl get pvc
# NAME STATUS VOLUME CAPACITY ACCESS MODES
# data-db-0 Bound pv-... 10Gi RWO <- one per ordinal
# data-db-1 Bound pv-... 10Gi RWO
# data-db-2 Bound pv-... 10Gi RWO
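Since PVC names are deterministic, the claims belonging to a StatefulSet can be enumerated without querying the cluster. A minimal sketch, assuming a hypothetical helper named sts_pvc_names and the naming pattern visible in the output above (template name, StatefulSet name, ordinal):

```shell
# Hypothetical helper: list the PVC names a StatefulSet generates.
# Pattern: <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>
# Usage: sts_pvc_names <template> <statefulset> <replicas>
sts_pvc_names() {
  i=0
  while [ "$i" -lt "$3" ]; do
    echo "$1-$2-$i"
    i=$((i + 1))
  done
}

sts_pvc_names data db 3
# When the data should really be discarded, the names could be handed to kubectl:
#   kubectl delete pvc $(sts_pvc_names data db 3)
```

Generating the list explicitly makes the manual cleanup step visible, which fits the default of never deleting PVCs automatically.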
Exercises
- (Beginner) What field in a StatefulSet creates a separate volume per Pod?
- (Beginner) When db-2 is recreated, does it get a new empty volume or its old data?
- (Intermediate) What happens to the PVCs when you delete a StatefulSet, and why is that the default behavior?
- (Interview) Contrast how storage is handled in a Deployment versus a StatefulSet, and explain why the StatefulSet model is required for databases. (Hint: shared/ephemeral vs. per-identity durable.)
Answers
- volumeClaimTemplates: the StatefulSet generates one PVC per Pod from the template (named <template>-<statefulset>-<ordinal>).
- It re-attaches to its existing PVC (data-db-2) and keeps its old data. The volume is bound to the ordinal identity, not to a particular Pod instance.
- By default the PVCs (and their data) are retained, not deleted, when the StatefulSet is deleted. This protects against accidental data loss for stateful workloads; you must delete the PVCs explicitly to reclaim the storage. (The newer persistentVolumeClaimRetentionPolicy field can change this behavior.)
- A Deployment typically uses ephemeral or shared volumes: all replicas are interchangeable and have no individually durable storage; if a Pod is replaced, its local data is gone. A StatefulSet provisions a dedicated, durable PVC per Pod identity that survives Pod restarts and rescheduling. Databases require each member to keep its own data persistently and reattach to it after any restart; only the per-identity durable storage of a StatefulSet provides that.
Headless services for StatefulSets
Theory
A normal Service gives a single ClusterIP that load-balances across all backing Pods — which hides individual Pods. StatefulSets need the opposite: clients must reach specific Pods by name. A headless Service (clusterIP: None) provides this. Instead of a single virtual IP, it creates DNS records for each backing Pod, returning the Pods' individual IPs.
A StatefulSet references a headless Service via spec.serviceName. This is what enables the stable per-Pod DNS names (db-0.db.<namespace>.svc.cluster.local). The headless Service does no load balancing; it is purely a DNS mechanism for discovering and addressing the individual stable Pods. (You can still create an additional normal Service on top if some clients want load-balanced access to the set, e.g., for read replicas.)
Example
apiVersion: v1
kind: Service
metadata:
  name: db
spec:
  clusterIP: None          # <- headless: no virtual IP, DNS per Pod instead
  selector: { app: db }
  ports:
    - port: 5432
# A headless Service returns each Pod's IP, not a single VIP:
kubectl run t --rm -it --image=busybox -- nslookup db.default.svc.cluster.local
# Address: 10.244.1.5 (db-0)
# Address: 10.244.2.7 (db-1)
# Address: 10.244.3.9 (db-2)
Exercises
- (Beginner) What field makes a Service "headless"?
- (Beginner) Does a headless Service load-balance traffic? What does it provide instead?
- (Intermediate) How does a StatefulSet associate itself with its headless Service?
- (Interview) Why must a StatefulSet use a headless Service rather than a standard ClusterIP Service to achieve stable per-Pod addressing? (Hint: a single VIP hides individual Pods.)
Answers
- clusterIP: None.
- No, it does not load-balance. It provides DNS records that resolve to the individual Pod IPs (and, with a StatefulSet, stable per-Pod DNS names), enabling direct addressing of specific Pods.
- Through the StatefulSet's spec.serviceName, which names the headless (governing) Service; this drives the per-Pod DNS subdomain (<pod>.<serviceName>.<ns>.svc.cluster.local).
- A standard ClusterIP Service exposes one virtual IP and load-balances to a random backing Pod, deliberately hiding which Pod you reach, which is the opposite of what stateful members need. A headless Service has no VIP and instead publishes per-Pod DNS records, so clients (and cluster members) can resolve and connect to a specific stable Pod (e.g., the primary, or a particular replica). That direct, identity-based addressing is required for replication, quorum, and membership.
4.4 DaemonSets
When you need exactly one copy of a Pod on every node — a log collector, a node monitor, a CNI agent — a DaemonSet is the tool. This subchapter covers node-level workloads.
DaemonSet use cases
Theory
Most controllers answer "how many replicas do I want?" A DaemonSet answers a different question: "I want this Pod running on every node (or every node matching a selector)." As nodes join the cluster, the DaemonSet automatically adds the Pod to them; as nodes leave, their Pods are removed. The count of Pods equals the count of (matching) nodes, not a fixed number.
This is the right model for node-level infrastructure — agents that must run once per machine to do their job:
- Log collectors (Fluentd, Fluent Bit) that gather every node's container logs.
- Node monitoring agents (node-exporter, Datadog agent).
- Networking components (CNI agents like Calico/Cilium, kube-proxy itself).
- Storage daemons (CSI node plugins).
If your workload is "one per node, tied to the node's resources or local data," it is a DaemonSet.
Example
apiVersion: apps/v1
kind: DaemonSet
metadata: { name: node-exporter }
spec:
  selector: { matchLabels: { app: node-exporter } }
  template:
    metadata: { labels: { app: node-exporter } }
    spec:
      containers:
        - name: node-exporter
          image: prom/node-exporter:latest
          ports: [ { containerPort: 9100 } ]
kubectl get daemonset node-exporter
# NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR
# node-exporter 3 3 3 3 3 <none>
# DESIRED == number of (matching) nodes, not a fixed replica count
Exercises
- (Beginner) What question does a DaemonSet answer that a Deployment does not?
- (Beginner) Name two real-world workloads that fit the DaemonSet model.
- (Intermediate) When three new nodes join the cluster, what does a DaemonSet do automatically?
- (Interview) Why is a DaemonSet, rather than a Deployment with many replicas, the correct way to run a per-node log collector? (Hint: placement guarantee vs. replica count, and node coverage.)
Answers
- "On which nodes should this Pod run?" (one per node, automatically tracking node membership) — versus a Deployment's "how many interchangeable replicas do I want?".
- Any two: log collectors (Fluentd/Fluent Bit), node metrics agents (node-exporter), CNI/network agents (Calico, Cilium), kube-proxy, CSI node plugins.
- The DaemonSet controller automatically schedules its Pod onto each newly joined (matching) node, so the three new nodes each get a copy without any manual change.
- A log collector must run on every node to capture that node's container logs. A Deployment only guarantees a count of Pods placed wherever the scheduler likes — it might put two replicas on one node and none on another, leaving nodes uncovered. A DaemonSet guarantees exactly one Pod per node and adapts automatically as nodes are added/removed, which is precisely the coverage guarantee a per-node agent requires.
Node-level workloads
Theory
DaemonSet Pods often need deeper access to the node itself than ordinary application Pods, because their job is to observe or manage the host. Such workloads commonly use:
- hostPath volumes to read node directories (e.g., /var/log/containers for logs, /proc and /sys for metrics).
- hostNetwork: true to use the node's network namespace directly (common for networking/monitoring agents that must see host interfaces or bind to host ports).
- Elevated privileges / securityContext (capabilities or privileged: true) when they manipulate kernel or device state (CNI, storage).
Because they run everywhere with elevated access, DaemonSets are powerful and security-sensitive — a compromised DaemonSet runs on every node with host access. They also typically set resource requests carefully and tolerate node taints so they can run even on specialized or control-plane nodes (covered next).
Example
spec:
  template:
    spec:
      hostNetwork: true              # use the node's network stack
      containers:
        - name: agent
          image: monitoring-agent:1.0
          securityContext:
            privileged: true         # node-level access (use sparingly!)
          volumeMounts:
            - name: varlog
              mountPath: /var/log
              readOnly: true
      volumes:
        - name: varlog
          hostPath:
            path: /var/log           # read the node's log directory
Exercises
- (Beginner) Why do DaemonSet Pods often use hostPath volumes?
- (Beginner) What does hostNetwork: true give a Pod?
- (Intermediate) Why are DaemonSets considered more security-sensitive than typical application Deployments?
- (Interview) A monitoring DaemonSet needs to read /proc and bind to a host port. Which Pod spec features enable this, and what is the security trade-off? (Hint: hostPath, hostNetwork/hostPort, privileges, blast radius.)
Answers
- Because their purpose is to observe or manage the node, they mount host directories (logs in /var/log, metrics from /proc and /sys, container runtime sockets) directly via hostPath.
- It places the Pod in the node's network namespace, so it shares the host's interfaces and IP and can see host-level traffic and bind host ports directly (instead of getting its own isolated Pod network).
- They run on every node (broad coverage) and frequently require elevated access (hostPath, hostNetwork, privileged/capabilities). A compromise of a DaemonSet therefore potentially yields host-level access across the entire cluster, a much larger blast radius than a single app Pod.
- Reading /proc uses a hostPath volume (or hostPID), and binding a host port uses hostNetwork: true (or hostPort); kernel-level access may need securityContext capabilities or privileged: true. The trade-off: these break Pod isolation and give host-level visibility/control, so a vulnerability in the agent can lead to node compromise on every node. Mitigate by granting the minimum needed (specific capabilities over full privilege, read-only mounts, restricted hostPaths), strict RBAC, and trusted images.
DaemonSet update strategies
Theory
Updating a DaemonSet means replacing the Pod on every node with a new version. DaemonSets support two updateStrategy types:
- RollingUpdate (the default): replace Pods gradually, node by node, controlled by maxUnavailable (how many node Pods may be down at once) and optionally maxSurge. This limits how much of your per-node coverage is disrupted simultaneously — important when the DaemonSet is critical infrastructure (e.g., the CNI), where taking all of them down at once would break the whole cluster.
- OnDelete: the controller does not automatically replace Pods on update. New Pods are only created when you manually delete the old ones. This gives operators full manual control over exactly when and where each node's Pod is upgraded — useful for highly sensitive node agents.
Example
spec:
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1      # update one node's Pod at a time
# --- or ---
spec:
  updateStrategy:
    type: OnDelete           # update a node's Pod only when you delete it manually
kubectl rollout status daemonset/node-exporter # watch a rolling update progress
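As a rough way to reason about maxUnavailable, the number of update batches can be estimated with integer arithmetic. This is a simplified sketch with a hypothetical function name: it assumes an integer maxUnavailable, no maxSurge, and that Pods are replaced in discrete batches, which the real controller does not strictly do.

```shell
# Rough estimate only: how many batches a DaemonSet rolling update needs.
# Assumptions: integer maxUnavailable, no maxSurge, discrete batches.
# Usage: ds_update_batches <node-count> <maxUnavailable>
ds_update_batches() {
  # Integer ceiling division: ceil(nodes / maxUnavailable)
  echo $(( ($1 + $2 - 1) / $2 ))
}

ds_update_batches 30 1    # 30 batches: slowest, least coverage lost at once
ds_update_batches 30 5    # 6 batches: faster, but 5 nodes lack the agent at a time
```

The estimate makes the trade-off concrete: a larger maxUnavailable shortens the rollout at the cost of more nodes being uncovered simultaneously.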
Exercises
- (Beginner) What are the two DaemonSet update strategies?
- (Beginner) Under OnDelete, how does a node's Pod get the new version?
- (Intermediate) Why does maxUnavailable matter especially for a CNI DaemonSet?
- (Interview) When would you deliberately choose OnDelete over RollingUpdate for a DaemonSet? (Hint: critical node agents, manual scheduling of disruption, coordination with maintenance windows.)
Answers
- RollingUpdate (default) and OnDelete.
- Only when you manually delete the existing Pod on that node; the DaemonSet controller then creates a new (updated) Pod to replace it. Without deletion, the old Pod keeps running.
- A CNI DaemonSet provides Pod networking on every node; if too many are unavailable simultaneously, networking breaks across those nodes. A small maxUnavailable (e.g., 1) ensures only one node's networking is briefly disrupted at a time, preserving cluster-wide connectivity during the upgrade.
- Choose OnDelete for critical or fragile node agents where you must control exactly when and on which node a disruption occurs (e.g., coordinating with maintenance windows, validating the new version on one node before proceeding, or where automatic rolling could risk cluster stability). It trades automation for precise operator control over the rollout.
Tolerations with DaemonSets
Theory
Many nodes carry taints that repel ordinary Pods — control-plane nodes are tainted so workloads don't land on them, and special nodes (GPU, dedicated) may be tainted too. But node-level DaemonSets often must run even on these nodes (e.g., a monitoring or networking agent should cover the control plane too). To do so, the DaemonSet's Pods carry tolerations that allow them to be scheduled onto tainted nodes.
The DaemonSet controller automatically adds several tolerations to its Pods so they keep running through common node conditions (e.g., node.kubernetes.io/not-ready, unreachable, disk-pressure, memory-pressure) — because you generally want node agents to persist even when a node is unhealthy. To also run on control-plane nodes, you explicitly add a toleration for the control-plane taint. (Taints and tolerations are covered in depth in Chapter 8.)
Example
spec:
  template:
    spec:
      tolerations:
        # run on control-plane nodes despite their taint:
        - key: node-role.kubernetes.io/control-plane
          operator: Exists
          effect: NoSchedule
        # tolerate node problems so the agent keeps running on unhealthy nodes:
        - key: node.kubernetes.io/not-ready
          operator: Exists
          effect: NoExecute
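The rule that decides whether a toleration covers a taint can be sketched as a small shell predicate. This is a simplified illustration, not the scheduler's implementation: the function name tolerates is hypothetical, only operator: Exists is modelled, and taint values and tolerationSeconds are ignored.

```shell
# Simplified sketch of toleration matching (operator Exists only).
# Usage: tolerates <tol-key> <tol-operator> <tol-effect> <taint-key> <taint-effect>
# Pass "" for an empty toleration key or an empty toleration effect.
tolerates() {
  # Only "Exists" is modelled; "Equal" would also have to compare values.
  [ "$2" = "Exists" ] || return 1
  # An empty key together with Exists matches every taint key.
  [ -z "$1" ] || [ "$1" = "$4" ] || return 1
  # An empty effect matches every taint effect.
  [ -z "$3" ] || [ "$3" = "$5" ] || return 1
  return 0
}

if tolerates node-role.kubernetes.io/control-plane Exists NoSchedule \
             node-role.kubernetes.io/control-plane NoSchedule; then
  echo "agent may be scheduled onto control-plane nodes"
fi
```

A Pod is only schedulable onto a node when every NoSchedule taint on that node is matched by at least one of the Pod's tolerations.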
Exercises
- (Beginner) What is a taint's purpose, and what lets a Pod ignore one?
- (Beginner) Why might a DaemonSet need a toleration for the control-plane taint?
- (Intermediate) Which tolerations does the DaemonSet controller add automatically, and why?
- (Interview) A monitoring DaemonSet is missing from your control-plane nodes. What is the most likely cause and the fix? (Hint: taints vs. tolerations.)
Answers
- A taint marks a node to repel Pods that don't tolerate it (controlling what may schedule there). A matching toleration on a Pod lets that Pod be scheduled onto (or remain on) the tainted node.
- Control-plane nodes are tainted (e.g., node-role.kubernetes.io/control-plane:NoSchedule) to keep regular workloads off them. A DaemonSet that should cover all nodes (e.g., a metrics/log agent) needs a toleration for that taint to be placed on the control-plane nodes too.
- It automatically adds tolerations for node-condition taints such as not-ready, unreachable, disk-pressure, memory-pressure, pid-pressure, and unschedulable. This is so node agents continue running even when a node becomes unhealthy or is cordoned, which is exactly when monitoring/networking is most needed.
- The control-plane nodes carry a NoSchedule taint and the DaemonSet's Pods lack a matching toleration, so they are never scheduled there. The fix is to add a toleration for the control-plane taint (key: node-role.kubernetes.io/control-plane, operator: Exists, effect: NoSchedule) to the DaemonSet's Pod template.
4.5 Jobs and CronJobs
Not every workload runs forever. Batch tasks — migrations, backups, report generation — need to run to completion and stop. Jobs and CronJobs manage these finite and scheduled workloads. This subchapter covers them.
Job types: non-parallel, parallel, indexed
Theory
A Job runs one or more Pods until a specified number of them successfully complete, then stops. Unlike a Deployment (which keeps Pods running indefinitely), a Job is about finishing work. Two fields shape its behavior:
- completions: how many successful Pod completions are required for the Job to be done.
- parallelism: how many Pods may run at once.
These combine into three patterns:
| Pattern | Settings | Use case |
|---|---|---|
| Non-parallel | completions: 1, parallelism: 1 | A single one-off task (a migration, a backup). |
| Parallel with fixed completions | completions: N, parallelism: M | Process N work items, M at a time (a work queue). |
| Indexed | completionMode: Indexed | Each Pod gets a unique index (0..N-1) so it can process a specific shard. |
Indexed Jobs are powerful for partitioned/batch workloads: each Pod reads its JOB_COMPLETION_INDEX to know which slice of work it owns (e.g., process file partition #3).
Example
apiVersion: batch/v1
kind: Job
metadata: { name: process-shards }
spec:
  completions: 5                # need 5 successful completions
  parallelism: 2                # run at most 2 Pods at once
  completionMode: Indexed       # each Pod gets a unique index 0..4
  template:
    spec:
      restartPolicy: OnFailure
      containers:
        - name: worker
          image: shard-processor:1.0
          command: ["sh", "-c", "echo processing shard $JOB_COMPLETION_INDEX"]
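Inside the worker, the index usually has to be turned into a concrete slice of work. One possible mapping is sketched below; the function name shard_range and the contiguous-range scheme are illustrative assumptions, and the sketch assumes the item count divides evenly by the number of completions.

```shell
# Hypothetical worker logic: map a completion index to a contiguous item range.
# Usage: shard_range <index> <total-items> <completions>   -> prints "<first> <last>"
# Assumption: total-items is evenly divisible by completions.
shard_range() {
  per_shard=$(( $2 / $3 ))
  first=$(( $1 * per_shard ))
  last=$(( first + per_shard - 1 ))
  echo "$first $last"
}

# Inside an Indexed Job's Pod, Kubernetes sets JOB_COMPLETION_INDEX.
# The fallback to 0 exists only so the sketch also runs outside a cluster.
shard_range "${JOB_COMPLETION_INDEX:-0}" 100 5
```

Because the mapping is a pure function of the index, a retried Pod with the same index picks up exactly the same slice.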
Exercises
- (Beginner) How does a Job differ from a Deployment in terms of when it stops?
- (Beginner) What do completions and parallelism control?
- (Intermediate) You must process 100 files, 10 at a time, with each Pod handling a specific file partition. Which Job pattern and settings fit?
- (Interview) How does an Indexed Job let each Pod know which portion of work to do, and why is this better than coordinating via an external queue for partitioned batch jobs? (Hint: JOB_COMPLETION_INDEX, no external coordination.)
Answers
- A Deployment keeps Pods running indefinitely (long-lived services). A Job runs Pods until the required number complete successfully, then stops — it is for finite, completing work.
- completions = number of successful Pod completions needed for the Job to be considered done; parallelism = maximum number of Pods running concurrently.
- An Indexed parallel Job with completions: 100, parallelism: 10, completionMode: Indexed. Each Pod uses its index (0–99) to select the file partition it should process.
- In Indexed mode, Kubernetes assigns each Pod a unique completion index exposed via the JOB_COMPLETION_INDEX env var (and also through a Pod annotation and the Pod hostname). Each Pod deterministically maps its index to a work slice (partition N), so no external queue or locking is needed to divide the work. This is simpler and more robust for statically partitionable workloads: no coordination service to run or fail, no risk of two Pods claiming the same item, and the assignment is reproducible.
Job completion and backoff policies
Theory
Jobs must handle failure deliberately. Two mechanisms govern this:
- restartPolicy (in the Pod template) must be OnFailure or Never for Jobs (not Always). OnFailure restarts the container in place; Never lets the Pod fail and the Job creates a new Pod.
- backoffLimit: the number of retries before the Job is marked Failed. Each failed Pod increments a counter; retries use exponential backoff (10s, 20s, 40s, … capped at 6 minutes). When retries exceed backoffLimit (default 6), the Job gives up and is marked Failed.
Additional controls: activeDeadlineSeconds caps the total wall-clock time the Job may run (regardless of retries), and ttlSecondsAfterFinished auto-deletes the Job (and its Pods) some time after it finishes, preventing accumulation of completed Jobs. Newer podFailurePolicy lets you treat specific exit codes as retryable or fatal.
Example
apiVersion: batch/v1
kind: Job
metadata: { name: migrate }
spec:
  backoffLimit: 4                    # retry up to 4 times, then mark Failed
  activeDeadlineSeconds: 600         # hard cap: fail if running > 10 minutes total
  ttlSecondsAfterFinished: 3600      # auto-delete 1h after completion
  template:
    spec:
      restartPolicy: Never           # failed Pod -> new Pod (not in-place restart)
      containers:
        - name: migrate
          image: migrator:2.0
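The retry delays described in the theory can be listed with a short loop. This sketch simply applies the stated pattern (start at 10s, double each time, cap at 6 minutes); the function name is hypothetical and the real controller's timing may differ in detail.

```shell
# Print the delay before each retry: 10s, doubling, capped at 360s (6 minutes).
# Usage: job_backoff_delays <number-of-retries>
job_backoff_delays() {
  delay=10
  n=0
  while [ "$n" -lt "$1" ]; do
    echo "${delay}s"
    delay=$(( delay * 2 ))
    if [ "$delay" -gt 360 ]; then
      delay=360                      # cap reached: stay at 6 minutes
    fi
    n=$(( n + 1 ))
  done
}

job_backoff_delays 7    # prints 10s 20s 40s 80s 160s 320s 360s (one per line)
```

Summing the delays gives a feel for how long a Job with a given backoffLimit can keep retrying, which helps when choosing activeDeadlineSeconds.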
Exercises
- (Beginner) Which two restartPolicy values are valid for a Job's Pods?
- (Beginner) What does backoffLimit do when reached?
- (Intermediate) Explain the difference between backoffLimit and activeDeadlineSeconds.
- (Interview) Why is ttlSecondsAfterFinished valuable in a cluster that runs many Jobs, and what problem does it prevent? (Hint: accumulation of completed Job/Pod objects in etcd.)
Answers
- OnFailure and Never (not Always).
- When the number of failed Pod retries exceeds backoffLimit, the Job stops retrying and is marked Failed.
- backoffLimit caps the number of retries (failures) before the Job is failed; retries use exponential backoff. activeDeadlineSeconds caps the total elapsed time the Job may be active regardless of how many retries have occurred; if exceeded, the Job (and its running Pods) is terminated and marked Failed. One bounds attempts; the other bounds wall-clock duration.
- Completed/failed Jobs and their Pods persist as API objects until deleted, consuming etcd storage and cluttering kubectl get output. In a cluster running many (e.g., scheduled) Jobs, these accumulate indefinitely without cleanup. ttlSecondsAfterFinished automatically deletes a Job and its Pods a set time after it finishes, preventing unbounded object accumulation and the associated etcd bloat and operational noise.
CronJob scheduling syntax
Theory
A CronJob creates Jobs on a repeating schedule — it is Kubernetes' cron. It is the right tool for periodic tasks: nightly backups, hourly report generation, periodic cleanup. A CronJob's spec.schedule uses standard cron syntax: five fields for minute, hour, day-of-month, month, and day-of-week.
┌───────────── minute (0–59)
│ ┌───────────── hour (0–23)
│ │ ┌───────────── day of month (1–31)
│ │ │ ┌───────────── month (1–12)
│ │ │ │ ┌───────────── day of week (0–6, Sun=0)
│ │ │ │ │
* * * * *
Examples: 0 2 * * * = every day at 02:00; */15 * * * * = every 15 minutes; 0 0 * * 0 = midnight every Sunday. The CronJob controller checks the schedule and creates a Job (which creates Pods) at each scheduled time. You can also set a timeZone field (stable since Kubernetes 1.27) so schedules aren't tied to the time zone of the controller's clock.
Example
apiVersion: batch/v1
kind: CronJob
metadata: { name: nightly-backup }
spec:
  schedule: "0 2 * * *"            # every day at 02:00
  timeZone: "Europe/Rome"          # interpret schedule in this zone
  jobTemplate:
    spec:
      template:
        spec:
          restartPolicy: OnFailure
          containers:
            - name: backup
              image: backup-tool:1.0
              command: ["/backup.sh"]
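To build intuition for how a schedule is evaluated, the matching of a single cron field can be sketched in plain shell. This is a deliberately simplified illustration, not the parser the CronJob controller uses: the function name is hypothetical, and names (MON, JAN) and stepped ranges (9-17/2) are not handled.

```shell
# Simplified matcher for ONE cron field. Supports "*", "*/n", "a-b", "n"
# and comma-separated lists of those. "*/n" is treated as "value divisible
# by n", which is right for fields that start at 0 (minute, hour, day of week).
# Usage: cron_field_matches <field> <value>     (value without leading zeros)
cron_field_matches() (
  value=$2
  set -f                 # keep "*" literal while splitting on commas
  IFS=,
  for part in $1; do
    case $part in
      '*')
        exit 0 ;;
      '*/'*)
        step=${part#??}  # drop the leading "*/"
        if [ $(( value % step )) -eq 0 ]; then exit 0; fi ;;
      *-*)
        if [ "$value" -ge "${part%-*}" ] && [ "$value" -le "${part#*-}" ]; then exit 0; fi ;;
      *)
        if [ "$value" -eq "$part" ]; then exit 0; fi ;;
    esac
  done
  exit 1
)

# Would "*/15 9-17 * * 1-5" fire at 10:45 on a Wednesday (day of week 3)?
if cron_field_matches '*/15' 45 && cron_field_matches '9-17' 10 \
   && cron_field_matches '1-5' 3; then
  echo "fires"
fi
```

The function body runs in a subshell, so the changes to IFS and globbing do not leak into the calling script.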
Exercises
- (Beginner) What does a CronJob create on each scheduled tick?
- (Beginner) Write the cron expression for "every day at midnight."
- (Intermediate) Translate */15 9-17 * * 1-5 into plain English.
- (Interview) Why does specifying a timeZone on a CronJob matter, and what subtle bug can arise without it? (Hint: controller clock vs. business hours, DST.)
Answers
- A Job (which in turn creates the Pod(s) that run the task).
- 0 0 * * *.
- Every 15 minutes during the hours 09 through 17 (first run at 09:00, last run at 17:45), Monday through Friday; that is, every 15 minutes within business hours on weekdays.
- Without timeZone, the schedule is interpreted in the time zone of the CronJob controller (historically UTC), which may differ from the intended business/local zone, so a "2 AM backup" could run at the wrong local time. It also makes daylight-saving transitions confusing. Setting timeZone pins the schedule to a specific zone so the Job fires at the intended local time consistently, including across DST changes.
ConcurrencyPolicy and history limits
Theory
Two practical concerns arise with recurring Jobs: what if a scheduled run is still going when the next is due, and how many old Jobs should be kept?
concurrencyPolicy governs overlap:
- Allow (default): permit concurrent runs; a new Job starts even if the previous is still running.
- Forbid: skip the new run if the previous Job is still active (no overlap).
- Replace: cancel the currently running Job and start a new one.
History limits govern retention of finished Jobs:
- successfulJobsHistoryLimit (default 3) and failedJobsHistoryLimit (default 1) cap how many completed/failed Jobs the CronJob keeps for inspection; older ones are garbage-collected.
Also relevant: startingDeadlineSeconds controls how long a missed run may still be started after its scheduled time (e.g., if the controller was down), and suspend: true pauses a CronJob without deleting it.
Example
spec:
  schedule: "*/5 * * * *"
  concurrencyPolicy: Forbid            # don't start a new run if one is still going
  startingDeadlineSeconds: 120         # allow a late start within 2 minutes
  successfulJobsHistoryLimit: 5        # keep last 5 successful Jobs
  failedJobsHistoryLimit: 3            # keep last 3 failed Jobs
  suspend: false                       # set true to pause scheduling
  jobTemplate: { ... }
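The three concurrencyPolicy values boil down to a small decision table, sketched here as a shell function. The function name and the printed wording are illustrative assumptions; the sketch only encodes the behavior described in the theory and ignores startingDeadlineSeconds and suspend.

```shell
# Decision table for one scheduled tick of a CronJob.
# Usage: cronjob_tick_action <concurrencyPolicy> <previous-run-active: yes|no>
cronjob_tick_action() {
  if [ "$2" = "no" ]; then
    echo "start new Job"             # nothing is running: every policy starts a Job
    return 0
  fi
  case $1 in
    Allow)   echo "start new Job alongside the running one" ;;
    Forbid)  echo "skip this run" ;;
    Replace) echo "cancel the running Job, then start a new one" ;;
    *)       echo "unknown policy: $1" >&2; return 1 ;;
  esac
}

cronjob_tick_action Forbid yes
cronjob_tick_action Replace yes
```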
Exercises
- (Beginner) What are the three values of concurrencyPolicy and what does each do?
- (Beginner) What do the two history-limit fields control?
- (Intermediate) A backup Job takes 8 minutes but is scheduled every 5 minutes. Which concurrencyPolicy prevents overlapping backups, and what is the consequence?
- (Interview) Explain startingDeadlineSeconds and a scenario where setting it too low (or leaving it default) causes scheduled runs to be silently skipped. (Hint: controller downtime and missed schedules.)
Answers
- Allow: permit concurrent runs (default); Forbid: skip the new run if the previous is still active; Replace: terminate the running Job and start the new one.
- successfulJobsHistoryLimit and failedJobsHistoryLimit cap how many completed successful and failed Jobs (and their Pods) the CronJob retains for inspection before garbage-collecting older ones.
- Forbid prevents overlap: a new backup won't start while the previous 8-minute backup is still running. The consequence is that scheduled runs which fall during an in-progress backup are skipped, so backups effectively run roughly every 8–10 minutes instead of every 5, reducing frequency.
- startingDeadlineSeconds is the window after a scheduled time during which a Job may still be started if it could not start on time (e.g., the CronJob controller was down or busy). If the controller is down longer than this deadline when a run was due, that run is missed and skipped rather than started late. If set too low (or with the default and many missed schedules), runs that the controller couldn't launch in time are silently dropped. Additionally, if more than 100 schedules are missed without a deadline, the controller stops scheduling and logs an error, so a too-restrictive or unset deadline combined with downtime leads to skipped executions.
5. Configuration and Secrets Management
Hard-coding configuration and credentials into container images is inflexible and insecure — you would rebuild the image to change a setting, and secrets would leak into your registry. Kubernetes separates configuration from code with ConfigMaps (non-sensitive config) and Secrets (sensitive data), plus environment variables and the Downward API for injecting values at runtime. This chapter shows how to externalize all configuration cleanly.
5.1 ConfigMaps
A ConfigMap stores non-confidential configuration as key-value pairs, decoupled from your Pods so the same image runs in dev, staging, and prod with different settings. This subchapter covers creating and consuming them.
Creating ConfigMaps from literals, files, and directories
Theory
A ConfigMap is an API object holding non-sensitive configuration data as key-value pairs (and larger text blobs). Its purpose is the separation of configuration from application code: the same container image can be deployed everywhere, with environment-specific values supplied by a ConfigMap. This follows the twelve-factor app principle of storing config in the environment.
You can create ConfigMaps several ways:
- From literals: --from-literal=key=value for individual values.
- From files: --from-file=path, where each file becomes a key (filename) with the file contents as the value. Great for whole config files (app.conf, nginx.conf).
- From directories: --from-file=dir/, where every file in the directory becomes a key.
- Declaratively: a YAML manifest with a data map (and binaryData for binary values).
ConfigMaps have a 1 MiB size limit (they are stored in etcd), so they are for configuration, not bulk data.
Example
# From literals:
kubectl create configmap app-config \
  --from-literal=LOG_LEVEL=info \
  --from-literal=MAX_CONNECTIONS=100
# From a file (key = filename, value = contents):
kubectl create configmap nginx-config --from-file=nginx.conf
# From a whole directory (each file becomes a key):
kubectl create configmap settings --from-file=./config-dir/
# Declarative equivalent:
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:
  LOG_LEVEL: "info"
  MAX_CONNECTIONS: "100"
  app.properties: |   # a multi-line value (a whole config file)
    timeout=30
    retries=3
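Because of the 1 MiB cap, it can help to check a file's size before turning it into a ConfigMap. The following is a minimal sketch, assuming a POSIX shell; the helper name fits_in_configmap and the throwaway files are illustrative, and the check ignores the small overhead added by keys and metadata.

```shell
# 1 MiB expressed in bytes (1024 * 1024).
CONFIGMAP_LIMIT_BYTES=1048576

# Hypothetical helper: succeeds if the file is small enough for a ConfigMap value.
fits_in_configmap() {
  size=$(wc -c < "$1" | tr -d ' ')   # byte count; tr strips padding some wc builds add
  [ "$size" -le "$CONFIGMAP_LIMIT_BYTES" ]
}

# Demo with two throwaway files: a small config and a 5 MiB blob.
workdir=$(mktemp -d)
printf 'timeout=30\nretries=3\n' > "$workdir/app.properties"
head -c 5242880 /dev/zero > "$workdir/data.bin"

for f in "$workdir/app.properties" "$workdir/data.bin"; do
  if fits_in_configmap "$f"; then
    echo "$(basename "$f"): OK for a ConfigMap"
  else
    echo "$(basename "$f"): too large, store it elsewhere"
  fi
done
```

A check like this fits naturally in a CI step that runs before kubectl create configmap.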
Exercises
- (Beginner) What is the primary purpose of a ConfigMap?
- (Beginner) When you create a ConfigMap with --from-file=nginx.conf, what becomes the key and what becomes the value?
- (Intermediate) Why are ConfigMaps unsuitable for storing a 5 MB data file, and what is the limit?
- (Interview) How does using ConfigMaps support the goal of building a container image once and deploying it to multiple environments? (Hint: config in the environment, not the image.)
Answers
- To store non-sensitive configuration separately from application code/images, so configuration can change without rebuilding the image and the same image runs across environments.
- The key is the filename (nginx.conf) and the value is the file's contents.
- ConfigMaps are stored in etcd and capped at 1 MiB. They are meant for configuration, not bulk data; a 5 MB file exceeds the limit and would also bloat etcd. Use a volume, object storage, or an init container to fetch large data instead.
- The image contains only code; all environment-specific values (URLs, log levels, feature flags) live in ConfigMaps that are injected at runtime as env vars or mounted files. You build and test one immutable image, then deploy it to dev/staging/prod with different ConfigMaps — eliminating per-environment image variants and "config baked into the image" drift, consistent with twelve-factor config.
Consuming ConfigMaps as environment variables
Theory
The simplest way to use a ConfigMap is to inject its values as environment variables in a container. Two approaches:
- Individual keys via valueFrom.configMapKeyRef: map one ConfigMap key to one env var (with control over the env var's name).
- All keys at once via envFrom.configMapRef: import every key in the ConfigMap as env vars (the keys become the variable names, optionally with a prefix).
The important caveat: environment variables are read once at container startup. If the ConfigMap changes later, the running container's env vars do not update — the Pod must be restarted to pick up new values. This is a key difference from volume-mounted ConfigMaps (next topic). Env-var injection is ideal for simple, rarely-changing settings.
Example
spec:
  containers:
  - name: app
    image: myapp:1.0
    env:
    - name: LOG_LEVEL          # pick one specific key
      valueFrom:
        configMapKeyRef:
          name: app-config
          key: LOG_LEVEL
    envFrom:                   # OR import ALL keys at once
    - configMapRef:
        name: app-config
      # prefix: APP_           # optional prefix for imported var names
Exercises
- (Beginner) What is the difference between configMapKeyRef and envFrom.configMapRef?
- (Beginner) If you update a ConfigMap, do a running container's environment variables change?
- (Intermediate) You imported a ConfigMap via envFrom but one key is named max-connections (with a dash). Why might it not appear as an env var?
- (Interview) Given that env-var injection is read-once at startup, how do teams typically force Pods to pick up changed ConfigMap values? (Hint: rollout restart, config hash annotations.)
Answers
- configMapKeyRef injects a single chosen key as one env var (you control the env var name). envFrom.configMapRef imports all keys in the ConfigMap as env vars at once (key names become variable names, optionally prefixed).
- No. Env vars are set at container start and do not update when the ConfigMap changes; the Pod/container must be restarted to see new values.
- Env var names must be valid C identifiers (letters, digits, underscores; not starting with a digit). A key like max-connections contains a dash, which is not a valid env var name, so envFrom skips it (it is reported as a skipped/invalid key). Use a valid key name or map it explicitly with configMapKeyRef to a valid env var name.
- Because env vars are read-once, teams trigger a rollout to restart Pods, e.g., kubectl rollout restart deployment/<name>. To make it automatic, they add a checksum/hash of the ConfigMap as a Pod-template annotation (common in Helm, as checksum/config). When the ConfigMap changes, the hash changes, the Pod template changes, and the Deployment performs a rolling update so new Pods read the updated values.
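The key-name rule from the exercise can be checked before a ConfigMap is applied. The following is a minimal sketch, assuming the C-identifier rule described in the answer (newer Kubernetes releases may relax it); the function name is_valid_env_name is illustrative.

```shell
# Hypothetical helper: succeeds if the name is a C identifier
# (letters, digits, underscores; must not start with a digit).
is_valid_env_name() {
  printf '%s\n' "$1" | grep -Eq '^[A-Za-z_][A-Za-z0-9_]*$'
}

for key in LOG_LEVEL MAX_CONNECTIONS max-connections 9LIVES; do
  if is_valid_env_name "$key"; then
    echo "$key: usable with envFrom"
  else
    echo "$key: would be skipped by envFrom"
  fi
done
```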
Consuming ConfigMaps as volumes
Theory
The alternative to env vars is mounting a ConfigMap as a volume: each key becomes a file in a directory, with the key's value as the file's contents. This is the natural fit for whole configuration files (e.g., mounting an nginx.conf or application.yaml into the path the app reads).
The major advantage over env vars: volume-mounted ConfigMaps update automatically. The kubelet periodically syncs the mounted files when the ConfigMap changes (with a propagation delay of up to ~1 minute, depending on kubelet sync period and cache TTL). The catch is that the application must re-read the file to use the new value — Kubernetes updates the file on disk, but it cannot make your app reload its config. You can mount specific keys to specific paths with items, and use subPath to mount a single file (though subPath mounts do not receive updates).
Example
spec:
  containers:
  - name: web
    image: nginx:1.25
    volumeMounts:
    - name: cfg
      mountPath: /etc/nginx/conf.d   # each ConfigMap key becomes a file here
  volumes:
  - name: cfg
    configMap:
      name: nginx-config
      items:                         # optionally map specific keys -> paths
      - key: nginx.conf
        path: default.conf
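One way to picture why whole-volume mounts pick up changes while subPath mounts do not: the kubelet projects keys through symlinks and repoints a single link when the ConfigMap changes. The sketch below imitates that layout in a temporary directory as a simplified model. The directory names ..v1 and ..v2 are made up (the real ones are timestamped), and ln -sfn stands in for the kubelet's atomic rename.

```shell
vol=$(mktemp -d)   # stands in for the mountPath inside the container

# Initial projection: a versioned directory, a ..data link to it, and one link per key.
mkdir "$vol/..v1"
printf 'worker_processes 1;\n' > "$vol/..v1/nginx.conf"
ln -s ..v1 "$vol/..data"
ln -s ..data/nginx.conf "$vol/nginx.conf"
before=$(cat "$vol/nginx.conf")

# ConfigMap update: write a new versioned directory, then repoint ..data.
mkdir "$vol/..v2"
printf 'worker_processes 4;\n' > "$vol/..v2/nginx.conf"
ln -sfn ..v2 "$vol/..data"
after=$(cat "$vol/nginx.conf")

echo "before: $before"
echo "after:  $after"
```

A subPath mount binds the file that nginx.conf resolved to when the container started (the copy under ..v1 in this model), so it keeps showing the old content after the swap.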
Exercises
- (Beginner) When a ConfigMap is mounted as a volume, what does each key become?
- (Beginner) Which consumption method (env var or volume) reflects ConfigMap updates without a Pod restart?
- (Intermediate) You mount a ConfigMap key with subPath. Why do later updates to the ConfigMap not appear in the container?
- (Interview) Volume-mounted ConfigMaps update the file on disk automatically, yet the application may still use stale config. Why, and how is this typically solved? (Hint: the app must reload; file watchers or SIGHUP.)
Answers
- Each key becomes a file in the mounted directory, with the key's value as that file's contents.
- Volume-mounted ConfigMaps (the kubelet syncs updated files into the mount; env vars do not update live).
- subPath mounts are resolved to a specific file when the container starts and are not kept in sync by the kubelet: the projection that enables live updates applies only to the whole-volume mount, not to individual subPath entries. So changes to the ConfigMap won't propagate to a subPath-mounted file.
- Kubernetes updates the file on disk, but it cannot make a running process re-read its configuration, so the app keeps using whatever it loaded at startup. Solutions: have the app watch the file for changes (inotify) and reload, send a reload signal (e.g., SIGHUP) to the process, use a sidecar that detects changes and triggers a reload, or simply roll the Deployment so Pods restart and re-read config.
ConfigMap update propagation
Theory
Bringing the previous two topics together: how do configuration changes actually reach your running workloads? It depends entirely on how the ConfigMap is consumed:
| Consumption method | Live update? | How to apply changes |
|---|---|---|
| Env var (configMapKeyRef/envFrom) | No | Restart the Pod (rollout restart). |
| Volume mount (whole volume) | Yes (up to ~1 min delay) | Kubelet syncs files; app must reload them. |
| Volume mount with subPath | No | Restart the Pod. |
Two important details: (1) ConfigMaps marked immutable: true cannot be updated at all (you must create a new one) — this improves performance (the kubelet stops watching them) and prevents accidental changes. (2) The common production pattern to make any config change trigger a safe rolling update is to embed a hash of the ConfigMap in the Pod template annotations; when the config changes, the template changes, and the Deployment rolls automatically.
Example
# Immutable ConfigMap: cannot be edited; must be replaced.
apiVersion: v1
kind: ConfigMap
metadata: { name: app-config-v2 }
immutable: true
data:
  LOG_LEVEL: "warn"
# Pattern: config hash annotation forces a rollout when config changes
spec:
  template:
    metadata:
      annotations:
        checksum/config: "9f86d081..."   # hash of the ConfigMap content
kubectl rollout restart deployment/web   # force Pods to re-read env-var config
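The checksum/config value can be any stable digest of the rendered configuration. The following is a minimal sketch using sha256sum from GNU coreutils (on macOS, shasum -a 256 is the usual substitute); the helper name config_hash and the file contents are placeholders.

```shell
cfg=$(mktemp)

# Hypothetical helper: print the SHA-256 hex digest of a file.
config_hash() {
  sha256sum "$1" | cut -d ' ' -f 1
}

printf 'LOG_LEVEL=info\n' > "$cfg"
hash_before=$(config_hash "$cfg")

printf 'LOG_LEVEL=warn\n' > "$cfg"   # simulate a config change
hash_after=$(config_hash "$cfg")

echo "before: $hash_before"
echo "after:  $hash_after"
# A different value in the checksum/config annotation changes the Pod template,
# which is what makes the Deployment roll.
```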
Exercises
- (Beginner) Which consumption method requires a Pod restart to apply a ConfigMap change?
- (Beginner) What does marking a ConfigMap immutable: true prevent, and what is one benefit?
- (Intermediate) Describe the "config hash annotation" pattern and what problem it solves.
- (Interview) Your team mounts config as a volume expecting "live reload," but changes don't take effect for over a minute and sometimes not at all. Diagnose the likely causes. (Hint: kubelet sync delay, subPath, app not reloading, immutable.)
Answers
- Environment-variable consumption (and subPath volume mounts): both are read-once and require a Pod restart.
- It prevents any updates to the ConfigMap's data (you must create a new ConfigMap instead). Benefits: protects against accidental changes and improves performance/scalability because the kubelet/API server no longer needs to watch it for updates.
- You compute a hash/checksum of the ConfigMap's contents and place it as an annotation in the Deployment's Pod template. When the ConfigMap changes, the hash changes, which changes the Pod template, which causes the Deployment to perform a rolling update — ensuring Pods (even those using env vars) pick up the new config safely and automatically. It solves the "env vars don't auto-update" problem deterministically.
- Likely causes: (a) the up-to-~1-minute kubelet sync/cache delay explains the lag for whole-volume mounts; (b) keys mounted via subPath never update and require a restart; (c) the application reads its config only at startup and never re-reads the file, so the file updates but behavior doesn't (needs a reload/watch); (d) the ConfigMap may be immutable or a different object than the Pod mounts; (e) env-var consumption was mistaken for volume consumption. Verify the mount type, check the file actually changes in the container, and confirm the app reloads.
5.2 Secrets
Secrets hold sensitive data — passwords, tokens, keys — with slightly different handling from ConfigMaps. This subchapter covers their types, creation, consumption, and the critical topic of protecting them at rest.
Secret types and use cases
Theory
A Secret is like a ConfigMap but intended for sensitive data: passwords, API tokens, TLS certificates, SSH keys. Functionally they are similar (key-value, 1 MiB limit, consumed as env vars or volumes), but Secrets get special handling: they are only distributed to nodes that need them, can be encrypted at rest, and Kubernetes avoids writing them to logs.
A common misconception: Secret values are stored base64-encoded, not encrypted, by default. Base64 is encoding, not security — anyone with read access can decode it trivially. Real protection requires enabling encryption at rest and tight RBAC (covered ahead).
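The point is easy to demonstrate with the standard base64 tool. This is a minimal sketch; the sample value is the same throwaway password used in this chapter's examples, and no key or passphrase is involved at any step.

```shell
# Encode the way a Secret's data field represents a value...
encoded=$(printf '%s' 'S3cr3t!' | base64)
# ...and reverse it with no key or password.
decoded=$(printf '%s' "$encoded" | base64 -d)

echo "stored:    $encoded"
echo "recovered: $decoded"
```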
Secrets have typed uses, signaled by the type field:
| Type | Purpose |
|---|---|
| Opaque | Arbitrary user-defined data (the default). |
| kubernetes.io/tls | TLS cert + key (tls.crt, tls.key), used by Ingress. |
| kubernetes.io/dockerconfigjson | Registry credentials for pulling private images. |
| kubernetes.io/basic-auth, kubernetes.io/ssh-auth | Basic-auth and SSH credentials. |
| kubernetes.io/service-account-token | ServiceAccount token (legacy/long-lived). |
Example
# Generic (Opaque) secret from literals (kubectl base64-encodes for you):
kubectl create secret generic db-creds \
  --from-literal=username=admin \
  --from-literal=password='S3cr3t!'
# TLS secret for an Ingress:
kubectl create secret tls web-tls --cert=tls.crt --key=tls.key
# Image pull secret for a private registry:
kubectl create secret docker-registry regcred \
  --docker-server=registry.example.com --docker-username=u --docker-password=p
Exercises
- (Beginner) How are Secret values stored by default — encrypted or encoded?
- (Beginner) Which Secret type is used to store a TLS certificate and key for an Ingress?
- (Intermediate) What is the purpose of a kubernetes.io/dockerconfigjson Secret?
- (Interview) Explain why "Secrets are base64-encoded" is a security pitfall and what must be done to actually protect Secret data. (Hint: encoding vs. encryption, encryption at rest, RBAC.)
Answers
- Encoded (base64) by default, not encrypted — unless you enable encryption at rest.
- kubernetes.io/tls (holding tls.crt and tls.key).
- It stores container registry credentials so the kubelet can authenticate and pull images from a private registry; it is referenced by imagePullSecrets in a Pod or ServiceAccount.
- Base64 is reversible encoding with no secrecy: anyone who can read the Secret object (or etcd) can decode the value instantly. So storing secrets as base64 alone provides no protection. To actually secure them you must: enable encryption at rest in the API server (so etcd holds ciphertext, ideally via a KMS), enforce strict RBAC so few identities can read Secrets, restrict etcd access, and consider external secret managers. Encoding is for safe transport of binary data, not confidentiality.
Creating and managing Secrets
Theory
Secrets can be created imperatively (kubectl create secret) or declaratively (YAML). A subtlety with declarative YAML: the data field requires base64-encoded values, while the convenience field stringData lets you write plaintext that Kubernetes encodes for you (and which takes precedence if both fields set the same key). Never commit unencrypted Secret manifests to Git, whether they use stringData or base64-encoded data.
Managing Secrets safely means addressing the GitOps problem: you want everything in Git, but you cannot store raw secrets there. Solutions include:
- Sealed Secrets (Bitnami): encrypt a Secret into a SealedSecret CRD that is safe to commit; a controller decrypts it in-cluster.
- External Secrets Operator / CSI Secrets Store: pull values from an external manager (Vault, AWS/GCP/Azure secret managers) at runtime so the real secret never lives in Git or etcd long-term.
- SOPS with age/KMS: encrypt secret files for Git, decrypt in the pipeline.
Example
apiVersion: v1
kind: Secret
metadata: { name: db-creds }
type: Opaque
stringData:              # plaintext here; Kubernetes base64-encodes it
  username: admin
  password: "S3cr3t!"
# data:                  # alternative: values must be pre-base64-encoded
#   username: YWRtaW4=
# Read a secret back (note: it returns base64; decode to view):
kubectl get secret db-creds -o jsonpath='{.data.password}' | base64 -d
Exercises
- (Beginner) What is the difference between the data and stringData fields in a Secret manifest?
- (Beginner) Why should you avoid committing a plain Secret YAML (with base64 values) to a public Git repo?
- (Intermediate) Name two tools/approaches that let you store secrets safely in a GitOps workflow.
- (Interview) Describe how Sealed Secrets makes it safe to store a secret in Git, and where the decryption actually happens. (Hint: asymmetric encryption, in-cluster controller.)
Answers
- data requires values already base64-encoded; stringData accepts plaintext that Kubernetes encodes into data on creation (and stringData takes precedence if both set a key). stringData is a write-only convenience.
- Base64 is trivially reversible, so committing such a manifest is equivalent to publishing the plaintext secret. Anyone reading the repo can decode it.
- Any two: Sealed Secrets (encrypt to a SealedSecret CRD that's safe to commit), External Secrets Operator or CSI Secrets Store (sync from Vault/cloud secret managers at runtime), SOPS with age/KMS (encrypt files for Git).
- With Sealed Secrets, you encrypt your Secret using the cluster controller's public key (via kubeseal), producing a SealedSecret resource that only the controller's private key can decrypt. Because only ciphertext is stored, the SealedSecret is safe to commit to Git. The in-cluster Sealed Secrets controller watches for SealedSecrets, decrypts them with its private key, and creates the corresponding normal Secret, so decryption happens only inside the cluster, never in the repo or in transit.
Consuming Secrets in pods
Theory
Secrets are consumed in Pods almost exactly like ConfigMaps — as environment variables (secretKeyRef / envFrom.secretRef) or as volume mounts (each key becomes a file). The same trade-offs apply: env vars are read-once; volume mounts update (with delay). For Secrets, volume mounts are generally preferred for sensitive data because:
- Secret files mounted from volumes are stored in a tmpfs (in-memory) filesystem on the node, never written to disk.
- Environment variables are more prone to accidental leakage: they can show up in crash dumps, logs, child-process environments, and kubectl exec output (for example, from printenv).
Additionally, registry credentials are consumed not by your container but by the kubelet, via imagePullSecrets on the Pod or ServiceAccount. Always restrict who can read Secrets with RBAC, regardless of consumption method.
Example
spec:
  imagePullSecrets:              # used by the kubelet to pull private images
  - name: regcred
  containers:
  - name: app
    image: registry.example.com/myapp:1.0
    env:
    - name: DB_PASSWORD          # as an env var (read-once)
      valueFrom:
        secretKeyRef:
          name: db-creds
          key: password
    volumeMounts:
    - name: creds                # as files (preferred for sensitive data)
      mountPath: /etc/creds
      readOnly: true
  volumes:
  - name: creds
    secret:
      secretName: db-creds
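On the application side, a mounted Secret is just a file to read. The following is a minimal sketch of a startup helper that reads one key from the mount; the helper name read_secret and the CREDS_DIR override are illustrative, and a temporary directory stands in for /etc/creds so the snippet runs outside a cluster.

```shell
# Hypothetical helper: print the value of one Secret key from the mounted directory.
read_secret() {
  file="${CREDS_DIR:-/etc/creds}/$1"
  if [ ! -r "$file" ]; then
    echo "secret file $file is missing or unreadable" >&2
    return 1
  fi
  cat "$file"
}

# Demo: a temp directory plays the role of the mounted volume.
CREDS_DIR=$(mktemp -d)
printf '%s' 'S3cr3t!' > "$CREDS_DIR/password"
first=$(read_secret password)

# Simulate a rotated Secret: the next read sees the new value without a restart.
printf '%s' 'N3wS3cr3t!' > "$CREDS_DIR/password"
second=$(read_secret password)

echo "read the password twice; the value changed: $([ "$first" != "$second" ] && echo yes || echo no)"
```

Reading the file each time the credential is needed, rather than caching it at startup, is what lets the application benefit from volume updates.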
Exercises
- (Beginner) What are the two main ways a Pod consumes a Secret?
- (Beginner) Which field tells the kubelet which Secret to use for pulling private images?
- (Intermediate) Why are volume-mounted Secrets often preferred over environment variables for sensitive values?
- (Interview) List three ways an environment-variable secret can accidentally leak that a tmpfs-mounted secret file is less exposed to. (Hint: logs, exec/printenv, child processes, crash dumps.)
Answers
- As environment variables (secretKeyRef or envFrom.secretRef) and as volume-mounted files (secret volume).
- imagePullSecrets (on the Pod spec or the ServiceAccount).
- Volume-mounted Secrets are stored in node memory (tmpfs), never persisted to disk, can be updated, and are less prone to incidental exposure. Env vars are read-once and tend to leak into logs, process listings, and diagnostic output.
- Examples: (a) logs: apps or frameworks often dump the full environment on startup or error; (b) kubectl exec and process inspection: env vars are visible to anyone who can exec into the Pod and run printenv or read /proc/<pid>/environ (kubectl describe pod shows only the secretKeyRef reference, but literal value entries appear in full); (c) child processes inherit the parent's environment, spreading the secret to subprocesses; (d) crash/core dumps capture the process environment. A tmpfs-mounted file is only readable at a known path with appropriate permissions and isn't broadcast through the environment, reducing these incidental-exposure vectors.
Encryption at rest for Secrets
Theory
By default, Secrets are stored in etcd unencrypted (the base64 you see through the API is an encoding, not protection). Anyone who can read etcd, whether through a backup file, a disk, or direct access, can read every Secret. Encryption at rest fixes this by having the API server encrypt Secret data before writing it to etcd, using an EncryptionConfiguration.
You configure one or more providers in priority order; the first listed is used for writing, and all are tried for reading (so you can rotate keys). Provider options include aescbc/aesgcm (encrypt with a local key) and, best practice, kms (a KMS plugin that uses an external Key Management Service like AWS KMS, GCP KMS, or HashiCorp Vault — so the data-encryption keys themselves are protected by a master key you control and can rotate/audit). Never put identity (no encryption) first if you want encryption active.
Example
# EncryptionConfiguration passed to the API server via --encryption-provider-config
apiVersion: apiserver.config.k8s.io/v1
kind: EncryptionConfiguration
resources:
- resources: ["secrets"]
  providers:
  - kms:                       # preferred: external KMS protects the keys
      name: myKmsPlugin
      endpoint: unix:///var/run/kms.sock
  - aescbc:                    # fallback local-key provider
      keys:
      - name: key1
        secret: <base64-32-byte-key>
  - identity: {}               # plaintext; only for reading old data during migration
# After enabling, re-encrypt existing Secrets by rewriting them:
kubectl get secrets --all-namespaces -o json | kubectl replace -f -
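The secret field of an aescbc key expects a base64-encoded 32-byte value. One common way to produce it is sketched below with standard tools; the variable names are illustrative.

```shell
# 32 random bytes from the kernel's random source, base64-encoded for the config file.
enc_key=$(head -c 32 /dev/urandom | base64)

# Sanity check: decoding must give back exactly 32 bytes.
key_len=$(printf '%s' "$enc_key" | base64 -d | wc -c | tr -d ' ')

echo "key length in bytes: $key_len"
```

Treat the generated value like any other credential: keep it out of Git and restrict access to the file that holds it.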
Exercises
- (Beginner) How are Secrets stored in etcd by default, and what risk does this create?
- (Beginner) What configuration object enables encryption at rest?
- (Intermediate) In an EncryptionConfiguration with multiple providers, which provider is used to encrypt new writes, and why list more than one?
- (Interview) Why is a KMS provider considered best practice over a static local aescbc key, and what does it protect against that the local key does not? (Hint: key custody, rotation, auditing, etcd backup theft.)
Answers
- Unencrypted: the values are merely base64-encoded in the API, which is effectively plaintext. Anyone with access to etcd, its disk, or its backups can read every Secret.
- An EncryptionConfiguration (provided to the API server via --encryption-provider-config).
- The first provider listed is used to encrypt new writes; all listed providers are tried in order when decrypting reads. Listing multiple providers enables key rotation and migration (e.g., read data written by an old key while writing with a new one, or include identity last to read still-unencrypted data during rollout).
- With a static local aescbc key, the encryption key sits in the API server's config on disk; if that host or config is compromised, so are all Secrets. A KMS provider keeps the key-encryption key in an external, hardened, auditable service: the data keys are wrapped by a master key you never expose, you can rotate and revoke it centrally, and access is logged. This protects against etcd/backup theft (the data is ciphertext whose key isn't co-located) and gives proper key custody, rotation, and audit, capabilities a static on-disk key lacks.
External secret management (Vault, AWS Secrets Manager)
Theory
Even with encryption at rest, native Secrets live in the cluster and are managed with Kubernetes tooling. Many organizations prefer a dedicated, centralized secret manager — HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault — as the single source of truth for secrets across all systems, with features Kubernetes lacks: dynamic/short-lived credentials, fine-grained access policies, automatic rotation, leasing, and rich audit logs.
Two integration patterns dominate:
- External Secrets Operator (ESO): a controller that reads from the external manager and syncs values into native Kubernetes Secrets, which Pods consume normally. Simple, but the secret is materialized in etcd.
- Secrets Store CSI Driver: mounts secrets directly from the external manager into the Pod as a volume at runtime, so the secret need not be stored as a Kubernetes Secret at all (optionally it can also sync one).
These also enable dynamic secrets (e.g., Vault issuing a database credential valid for 1 hour), drastically reducing the blast radius of a leak.
Example
# External Secrets Operator: declaratively sync a value from AWS Secrets Manager
apiVersion: external-secrets.io/v1beta1
kind: ExternalSecret
metadata: { name: db-creds }
spec:
  refreshInterval: 1h
  secretStoreRef:
    name: aws-secrets-manager
    kind: SecretStore
  target:
    name: db-creds             # the K8s Secret ESO will create/keep in sync
  data:
  - secretKey: password
    remoteRef:
      key: prod/db
      property: password
Exercises
- (Beginner) Name two external secret managers commonly integrated with Kubernetes.
- (Beginner) What does the External Secrets Operator do at a high level?
- (Intermediate) Contrast the External Secrets Operator with the Secrets Store CSI Driver in terms of where the secret ends up.
- (Interview) What is a "dynamic secret," and how does using one (e.g., a Vault-issued 1-hour DB credential) reduce risk compared to a long-lived static Secret? (Hint: short TTL, automatic rotation, blast radius.)
Answers
- Any two: HashiCorp Vault, AWS Secrets Manager, GCP Secret Manager, Azure Key Vault.
- It is a controller that fetches secret values from an external manager and synchronizes them into native Kubernetes Secrets (keeping them refreshed), so Pods consume them through the normal Secret mechanism.
- ESO materializes the secret as a Kubernetes Secret in etcd (Pods read that Secret). The CSI Secrets Store driver mounts the secret value directly into the Pod as a volume at runtime, so it need not be stored as a Kubernetes Secret at all (optionally it can also sync one). CSI keeps the secret out of etcd; ESO is simpler but stores it.
- A dynamic secret is a credential generated on demand with a short lease/TTL and automatically revoked when it expires (e.g., Vault creates a unique DB username/password valid for 1 hour). Because each credential is short-lived, unique per consumer, and auto-rotated, a leaked credential is useless after expiry and can be tied to a specific consumer for auditing/revocation — dramatically shrinking the blast radius compared to a static, long-lived Secret that, if leaked, grants indefinite access and must be manually rotated everywhere.
5.3 Environment Variables and Downward API
Beyond ConfigMaps and Secrets, Pods can receive configuration from static values and — uniquely — from information about themselves via the Downward API. This subchapter ties together how environment is assembled.
Setting static environment variables
Theory
The most basic configuration is a static environment variable defined inline in the container spec with env[].value. These are literal values baked into the Pod manifest — useful for simple, non-sensitive, fixed settings (a feature flag, a fixed port, an app mode).
A useful and sometimes surprising feature is variable expansion: within env values you can reference earlier-defined variables using $(VAR_NAME) syntax, letting you compose values. (To emit a literal $, write $$.) Order matters — only variables defined earlier in the list can be referenced. While handy, prefer ConfigMaps/Secrets for anything that varies by environment; reserve inline value for truly constant settings.
Example
spec:
  containers:
  - name: app
    image: myapp:1.0
    env:
    - name: APP_MODE
      value: "production"              # static literal
    - name: PORT
      value: "8080"
    - name: BASE_URL                   # composed from earlier vars via $(...)
      value: "http://localhost:$(PORT)"
    - name: PRICE_LABEL
      value: "Cost is $$5"             # $$ -> literal $ => "Cost is $5"
Exercises
- (Beginner) How do you define a static environment variable directly in a container spec?
- (Beginner) What is the syntax to reference another environment variable's value within an env entry?
- (Intermediate) Why does the order of env entries matter when using $(VAR) expansion?
- (Interview) When should you use an inline static env value versus pulling the value from a ConfigMap or Secret? (Hint: constancy, sensitivity, environment variance.)
Answers
- With an env entry that has a name and a literal value (e.g., - name: APP_MODE / value: "production").
- $(VAR_NAME); and $$ produces a literal $.
- Expansion can only reference variables that are already defined earlier in the same env list; a $(VAR) referencing a variable defined later (or not at all) is left unexpanded. So you must define a variable before composing other values from it.
- Use inline static value only for truly constant, non-sensitive settings that don't change between environments (e.g., a fixed mode or port). Use a ConfigMap for non-sensitive values that differ per environment (URLs, log levels), and a Secret for sensitive values (passwords, tokens). This keeps the image/manifest environment-agnostic and keeps secrets out of plain manifests.
Referencing ConfigMap and Secret keys
Theory
This topic consolidates the injection mechanisms covered earlier into one mental model. A container's environment can be assembled from four sources, all under env/envFrom:
- Static literals: env[].value.
- ConfigMap keys: env[].valueFrom.configMapKeyRef (one key) or envFrom.configMapRef (all keys).
- Secret keys: env[].valueFrom.secretKeyRef (one key) or envFrom.secretRef (all keys).
- Downward API: env[].valueFrom.fieldRef / resourceFieldRef (covered next).
Precedence and behavior to remember: envFrom imports bulk keys, then explicit env entries can override individual names; if a referenced ConfigMap/Secret or key is missing, the container fails to start unless you mark the reference optional: true. This unified valueFrom model is how Kubernetes lets one Pod template draw configuration from many typed sources cleanly.
Example
spec:
  containers:
  - name: app
    image: myapp:1.0
    envFrom:
    - configMapRef: { name: app-config }   # bulk import all ConfigMap keys
    - secretRef: { name: db-creds }        # bulk import all Secret keys
    env:
    - name: DB_PASSWORD                    # specific Secret key -> chosen name
      valueFrom:
        secretKeyRef: { name: db-creds, key: password }
    - name: FEATURE_X                      # specific ConfigMap key, optional
      valueFrom:
        configMapKeyRef: { name: app-config, key: feature_x, optional: true }
Exercises
- (Beginner) What are the four sources an environment variable's value can come from?
- (Beginner) What happens if a container references a ConfigMap key that does not exist?
- (Intermediate) How do you make a missing key reference non-fatal so the container still starts?
- (Interview) Explain how envFrom and explicit env entries interact when both set a variable of the same name. (Hint: explicit env overrides bulk import.)
Answers
- Static literal (value), ConfigMap key (configMapKeyRef / envFrom.configMapRef), Secret key (secretKeyRef / envFrom.secretRef), and the Downward API (fieldRef / resourceFieldRef).
- The container fails to start (it cannot be created) because the required reference can't be resolved, unless the reference is marked optional: true.
- Set optional: true on the configMapKeyRef/secretKeyRef (or on the configMapRef/secretRef in envFrom). The missing key/object is then skipped instead of blocking startup.
- envFrom performs a bulk import of all keys as env vars first; then explicit env entries are applied and override any same-named variable from the bulk import. So an explicit env entry takes precedence over a value brought in by envFrom with the same name, letting you import many values but selectively override or rename specific ones.
Downward API: exposing pod metadata
Theory
Sometimes an application needs to know things about itself and its environment that are only determined at runtime: its own Pod name, namespace, node, IP, labels, or its resource limits. Hard-coding these is impossible (the Pod name is generated; the node is chosen by the scheduler). The Downward API exposes this Pod/container metadata to the container, either as environment variables or as files in a volume, without the app needing to call the Kubernetes API.
It is called "downward" because information flows down from the platform into the workload. Common uses: tagging logs/metrics with the Pod name and node, constructing a unique instance ID, reporting which namespace the app runs in, or making an app aware of its own memory limit so it can size internal caches/heaps accordingly.
Example
spec:
  containers:
  - name: app
    image: myapp:1.0
    env:
    - name: POD_NAME
      valueFrom:
        fieldRef: { fieldPath: metadata.name }
    - name: POD_NAMESPACE
      valueFrom:
        fieldRef: { fieldPath: metadata.namespace }
    - name: NODE_NAME
      valueFrom:
        fieldRef: { fieldPath: spec.nodeName }
    volumeMounts:
    - name: podinfo
      mountPath: /etc/podinfo
  volumes:
  - name: podinfo
    downwardAPI:
      items:
      - path: "labels"   # writes the Pod's labels to /etc/podinfo/labels
        fieldRef: { fieldPath: metadata.labels }
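On the application side, consuming these values needs no Kubernetes client. The following is a minimal sketch, assuming the env var names and mount path from the manifest above and assuming the labels file holds one key="value" pair per line. The helper names parse_podinfo and instance_id are hypothetical, and escape sequences inside values are not handled.

```python
import os


def parse_podinfo(text):
    """Parse the key="value" lines of a downwardAPI labels/annotations file."""
    labels = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        key, _, raw = line.partition("=")
        # Values are written in double quotes; strip them for convenience.
        labels[key] = raw.strip().strip('"')
    return labels


def instance_id(environ=os.environ):
    """Build a log/metrics tag from the env vars defined in the manifest."""
    ns = environ.get("POD_NAMESPACE", "unknown")
    pod = environ.get("POD_NAME", "unknown")
    node = environ.get("NODE_NAME", "unknown")
    return f"{ns}/{pod}@{node}"


# Sample values; inside a real Pod these come from the kubelet.
sample_env = {"POD_NAME": "app-7d9f", "POD_NAMESPACE": "payments", "NODE_NAME": "node-a"}
sample_file = 'app="web"\ntier="frontend"\n'

print(instance_id(sample_env))
print(parse_podinfo(sample_file))
```

In a running Pod you would call instance_id() with no arguments and read the file from /etc/podinfo/labels instead of using the sample strings.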
Exercises
- (Beginner) What does the Downward API let a container learn about?
- (Beginner) In which two forms can Downward API data be exposed to a container?
- (Intermediate) Why can't you just hard-code the Pod name and node name in the manifest instead of using the Downward API?
- (Interview) Give a concrete scenario where exposing the Pod's memory limit to the app via the Downward API materially improves the application's behavior. (Hint: JVM heap / cache sizing inside a container.)
Answers
- Metadata about the Pod and container itself and its placement — e.g., name, namespace, UID, labels/annotations, node name, Pod IP, service account, and the container's resource requests/limits.
- As environment variables (fieldRef / resourceFieldRef) and as files in a downwardAPI volume.
- Many of these values are not known until runtime and are assigned by the platform: the Pod name may be generated (e.g., with a random suffix), the node is chosen by the scheduler, and the Pod IP is allocated at creation. They differ per replica and per scheduling, so they cannot be statically written in the manifest; the Downward API supplies them dynamically.
- A JVM (or any runtime with its own heap/cache) running in a container may, by default, size its heap based on the host's total memory rather than the container's cgroup limit, leading to OOMKills. By exposing the container's memory limit via resourceFieldRef (e.g., limits.memory) as an env var, the app/startup script can set the max heap or cache size proportionally to its actual limit, avoiding OOM kills and using its allotted memory efficiently. (Modern JVMs are container-aware, but explicit sizing via the Downward API remains a robust, general technique for any runtime.)
fieldRef and resourceFieldRef
Theory
The Downward API has two distinct selectors, and knowing which to use is the practical crux of this subchapter:
- fieldRef exposes Pod-level metadata, spec, and status fields: things like metadata.name, metadata.namespace, metadata.uid, metadata.labels, metadata.annotations, spec.nodeName, spec.serviceAccountName, and status.podIP / status.hostIP.
- resourceFieldRef exposes a container's resource requests and limits: requests.cpu, limits.cpu, requests.memory, limits.memory, and the ephemeral-storage equivalents. You can specify a divisor to scale the output (e.g., express memory in Mi).
A few constraints: not every field is available in both env-var and volume form. spec.nodeName, spec.serviceAccountName, status.hostIP, and status.podIP are available only as environment variables. The full metadata.labels and metadata.annotations maps are available only as volume files because they are maps, although a single entry such as metadata.labels['<KEY>'] works in either form. resourceFieldRef works in both forms. For resourceFieldRef, if a limit is not set, it defaults to the node's allocatable capacity.
Example
spec:
  containers:
  - name: app
    image: myapp:1.0
    resources:
      requests: { cpu: "250m", memory: "256Mi" }
      limits: { cpu: "1", memory: "512Mi" }
    env:
    - name: POD_IP
      valueFrom:
        fieldRef: { fieldPath: status.podIP }   # fieldRef: pod metadata
    - name: MEM_LIMIT_MI
      valueFrom:
        resourceFieldRef:                       # resourceFieldRef: resources
          containerName: app
          resource: limits.memory
          divisor: "1Mi"                        # output in MiB -> "512"
    - name: CPU_REQUEST
      valueFrom:
        resourceFieldRef:
          containerName: app
          resource: requests.cpu
          divisor: "1m"                         # output in millicores -> "250" (the default divisor of 1 would round 250m up to "1")
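The arithmetic behind divisor is easy to get wrong, so here is a small sketch of it. It assumes the exposed value is the quantity divided by the divisor, rounded up to a whole number, and it handles only the few suffixes used in this section. The function names parse_quantity and exposed_value are hypothetical, not part of any Kubernetes library.

```python
import math
from fractions import Fraction

# Multipliers for the quantity suffixes used in this section (not the full set).
SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "m": Fraction(1, 1000)}


def parse_quantity(q):
    """Convert a quantity such as "512Mi", "250m", or "1" to an exact number."""
    for suffix in ("Ki", "Mi", "Gi", "m"):
        if q.endswith(suffix):
            return Fraction(q[: -len(suffix)]) * SUFFIXES[suffix]
    return Fraction(q)


def exposed_value(quantity, divisor="1"):
    """Value the container sees: quantity / divisor, rounded up to an integer."""
    return str(math.ceil(parse_quantity(quantity) / parse_quantity(divisor)))


print(exposed_value("512Mi", "1Mi"))  # memory limit in MiB
print(exposed_value("250m", "1m"))    # CPU request in millicores
print(exposed_value("250m"))          # default divisor of 1 rounds a fractional CPU up
```

The last call shows why a fractional CPU value is usually read with a divisor of 1m: with the default divisor, a quarter of a CPU is reported as a whole one.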
Exercises
- (Beginner) What category of information does fieldRef expose versus resourceFieldRef?
- (Beginner) Which selector would you use to expose a container's CPU limit to itself?
- (Intermediate) What does the divisor field do in a resourceFieldRef? Give an example output.
- (Interview) If a container does not specify a memory limit but uses resourceFieldRef on limits.memory, what value does it receive, and why could that be surprising? (Hint: defaults to node allocatable.)
Answers
- fieldRef exposes Pod-level metadata/spec/status fields (name, namespace, uid, labels, annotations, nodeName, podIP, etc.). resourceFieldRef exposes a container's resource requests and limits (cpu/memory/ephemeral-storage).
- resourceFieldRef with resource: limits.cpu (and the appropriate containerName).
- divisor scales/normalizes the reported resource value into the desired unit. For example, with resource: limits.memory and divisor: "1Mi", a 512Mi limit is reported as 512; with resource: limits.cpu and divisor: "1m", a 1 CPU limit reports 1000.
- It receives the node's allocatable amount for that resource (the effective upper bound when no limit is set), not zero or an error. This is surprising because the app may believe it has, say, the whole node's memory as its "limit" and size buffers accordingly, risking OOM if the node is shared. That is exactly why setting explicit limits (and reading them via the Downward API) is recommended.
6. Networking
Networking is where Kubernetes feels most like magic — and where it most often breaks. Kubernetes defines a simple, opinionated networking model (every Pod gets an IP; all Pods can reach all Pods) and then delegates the implementation to plugins. On top of that base it layers Services for stable access, Ingress for HTTP routing, NetworkPolicies for segmentation, and DNS for discovery. This chapter builds that stack from the bottom up.
6.1 Kubernetes Networking Model
Before Services and Ingress make sense, you must understand the foundational model: a flat network where every Pod is directly addressable. This subchapter covers those first principles.
Pod-to-pod communication
Theory
In Kubernetes, every Pod gets its own unique IP address, and any Pod can communicate with any other Pod directly using that IP, across nodes, without NAT. This is the core promise of the Kubernetes networking model. From a Pod's perspective, the cluster looks like one big flat network where it can reach any peer as if they were on the same LAN.
This is deliberately different from the Docker default (where containers sit on a private bridge network behind the host and are reached by publishing ports on the host's IP). It dramatically simplifies application design: services don't need to discover which host a peer is on or deal with port-mapping; they just use Pod IPs (usually via a Service for stability). Containers within a Pod share that single Pod IP and communicate over localhost.
The actual wiring (routes, overlays, encapsulation) is the CNI plugin's job; the model only guarantees the behavior, not the implementation.
Example
Node A                      Node B
+---------------------+     +---------------------+
| Pod 1: 10.244.1.5   |     | Pod 3: 10.244.2.9   |
| Pod 2: 10.244.1.6   |     | Pod 4: 10.244.2.10  |
+---------------------+     +---------------------+
          \____________ direct, no NAT ____________/
10.244.1.5 can reach 10.244.2.9 by IP across nodes
# Each Pod has its own IP:
kubectl get pods -o wide
# NAME    READY   STATUS    IP           NODE
# web-1   1/1     Running   10.244.1.5   node-a
# web-2   1/1     Running   10.244.2.9   node-b
# From web-1 you can curl 10.244.2.9 directly (no port mapping needed).
Exercises
- (Beginner) How many IP addresses does each Pod get, and who shares it?
- (Beginner) Can a Pod on node A reach a Pod on node B directly by IP? Is NAT involved?
- (Intermediate) How does Kubernetes' networking model differ from Docker's default host-port-mapping approach, and why is that simpler for app developers?
- (Interview) The networking model specifies behavior but not implementation. What component provides the implementation, and why is separating the two valuable? (Hint: CNI, pluggability.)
Answers
- Each Pod gets exactly one unique IP address, shared by all containers in that Pod (they communicate among themselves over localhost).
- Yes — directly by Pod IP, across nodes, with no NAT. That is a fundamental guarantee of the model.
- Docker's default gives containers private IPs behind the host and requires publishing/mapping host ports to reach them, so callers must know host+port and avoid port conflicts. Kubernetes gives every Pod a routable IP on a flat network, so apps just talk to Pod/Service IPs without port mapping or host awareness — much simpler service-to-service communication.
- The CNI plugin (Calico, Cilium, Flannel, etc.) implements the actual routing/overlay. Separating the guaranteed behavior from the implementation lets operators choose a networking solution suited to their environment (overlay vs. BGP, eBPF, policy support, performance) without changing applications or Kubernetes itself — the same stable-interface/swappable-implementation philosophy as CRI and CSI.
Flat network model principles
Theory
The Kubernetes network model is governed by a few non-negotiable rules every conformant CNI must satisfy:
- Every Pod has a unique IP across the whole cluster.
- Pods on a node can communicate with all Pods on all nodes without NAT.
- Agents on a node (e.g., the kubelet) can communicate with all Pods on that node.
- (Pods that use the host network are the exception: they share the node's IP instead of getting their own.)
This is the flat network model: a single, non-overlapping IP space shared by all Pods, with no address translation between them. The principle deliberately pushes complexity (overlays, encapsulation, BGP) below the model so applications can assume a simple, uniform network. The trade-offs different CNIs make are about how to deliver this flat space — VXLAN overlays (encapsulate packets), native routing (advertise Pod CIDRs via BGP), or eBPF — each with different performance and feature characteristics.
Example
Cluster Pod CIDR: 10.244.0.0/16   (one flat space, carved per node)
  node-a -> 10.244.1.0/24   (Pods 10.244.1.x)
  node-b -> 10.244.2.0/24   (Pods 10.244.2.x)
  node-c -> 10.244.3.0/24   (Pods 10.244.3.x)
No overlap; any Pod IP is unique and routable to any other, no NAT.
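The carving shown above can be checked with Python's standard ipaddress module. This is an illustrative sketch only: the node-to-block assignment mirrors the example, and the helper names owner_of and any_overlap are hypothetical.

```python
import ipaddress

cluster_cidr = ipaddress.ip_network("10.244.0.0/16")

# Carve the cluster range into one /24 per node.
blocks = list(cluster_cidr.subnets(new_prefix=24))

# Hypothetical assignment matching the example (block index = third octet).
node_cidrs = {"node-a": blocks[1], "node-b": blocks[2], "node-c": blocks[3]}


def owner_of(pod_ip):
    """Return the node whose block contains pod_ip, or None."""
    addr = ipaddress.ip_address(pod_ip)
    for node, block in node_cidrs.items():
        if addr in block:
            return node
    return None


def any_overlap(networks):
    """True if any two networks share addresses."""
    nets = list(networks)
    return any(a.overlaps(b) for i, a in enumerate(nets) for b in nets[i + 1:])


print(len(blocks), "blocks available")
print("10.244.2.9 lives on", owner_of("10.244.2.9"))
print("overlap between node blocks:", any_overlap(node_cidrs.values()))
```

Because the blocks never overlap, a Pod IP alone is enough to tell which node it lives on, which is what lets routes or overlays forward traffic without translation.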
Exercises
- (Beginner) State the core rule of the flat network model regarding Pod-to-Pod traffic.
- (Beginner) Why must Pod IP ranges per node not overlap?
- (Intermediate) Name two different techniques CNIs use to implement the flat network, and one trade-off between them.
- (Interview) Why does Kubernetes deliberately keep NAT out of Pod-to-Pod communication, and what application-level problems does that avoid? (Hint: source IP preservation, peer addressing, simplicity.)
Answers
- Any Pod can reach any other Pod directly by IP, across nodes, without NAT (a single flat IP space).
- Because every Pod IP must be unique and routable cluster-wide; overlapping per-node ranges would create ambiguous/duplicate addresses, breaking direct routing and the uniqueness guarantee.
- Examples: overlay/encapsulation (e.g., VXLAN, as in Flannel) wraps Pod packets to traverse the underlay — easy to deploy on any network but adds encapsulation overhead; native/BGP routing (e.g., Calico) advertises Pod CIDRs so packets route natively — better performance but needs network/router cooperation. (eBPF-based, e.g., Cilium, is another.) Trade-off: overlay = portability/simplicity vs. overhead; native routing = performance vs. infrastructure requirements.
- NAT rewrites source/destination addresses, which would obscure the real client IP, complicate direct peer-to-peer addressing, and break protocols that embed addresses. By guaranteeing no NAT between Pods, Kubernetes preserves source IPs (important for logging, security policy, and some protocols), lets services address peers by their actual IPs, and keeps the mental model simple — every Pod is just directly reachable, like hosts on a flat LAN.
CNI plugin overview (Calico, Flannel, Cilium)
Theory
A CNI plugin implements the flat network model. The three most common — each making different trade-offs — are:
| Plugin | Approach | NetworkPolicy | Notable strengths |
|---|---|---|---|
| Flannel | Simple overlay (VXLAN by default) | No (by itself) | Easiest to set up; minimal; good for learning/small clusters. |
| Calico | Native routing (BGP) or overlay (IPIP/VXLAN) | Yes (rich) | High performance, mature NetworkPolicy, scalable. |
| Cilium | eBPF-based dataplane | Yes (L3–L7) | eBPF performance/observability, L7 policy, service mesh, Hubble. |
Flannel prioritizes simplicity at the cost of features (notably no policy enforcement). Calico is the go-to when you need solid NetworkPolicy and performance, and can run without an overlay for near-native speed. Cilium is the modern, feature-rich choice built on eBPF, offering identity-aware L3–L7 policy, deep observability (Hubble), and even service-mesh capabilities — at the cost of requiring a recent kernel and more conceptual surface area.
Example
# Installing a CNI is a one-time cluster step (examples):
kubectl apply -f https://.../flannel/kube-flannel.yml # Flannel
kubectl apply -f https://.../calico/calico.yaml # Calico
cilium install # Cilium (CLI)
# CNI components run as a DaemonSet (one agent per node):
kubectl -n kube-system get pods -l k8s-app=calico-node -o wide
Exercises
- (Beginner) Which of the three plugins does not enforce NetworkPolicy on its own?
- (Beginner) What underlying technology distinguishes Cilium's dataplane?
- (Intermediate) You need network segmentation (NetworkPolicy) and high throughput on a cluster where you control the network. Which CNI fits and why?
- (Interview) Cilium uses eBPF instead of iptables for its dataplane. What advantages does this bring as cluster and Service counts grow large? (Hint: iptables rule scaling, observability, L7 awareness.)
Answers
- Flannel (it provides networking only; you'd pair it with something else for policy).
- eBPF (it programs the Linux kernel directly via eBPF programs rather than relying on iptables).
- Calico (or Cilium). Calico provides mature, performant NetworkPolicy and can use native BGP routing without an overlay for near-native throughput when you control the underlying network. Cilium is also a strong fit (eBPF performance + rich policy).
- iptables-based dataplanes degrade as rules grow — large numbers of Services/endpoints create long, linearly-scanned rule chains that slow updates and packet processing. Cilium's eBPF dataplane uses efficient in-kernel maps with effectively O(1) lookups, scaling better to many Services and high churn. It also enables rich, identity-based and L7-aware policy and deep flow observability (Hubble) that iptables cannot provide — improving performance, scalability, and visibility simultaneously.
IP address management (IPAM)
Theory
IPAM (IP Address Management) is the part of networking responsible for allocating and tracking IP addresses — ensuring every Pod gets a unique, valid IP from the cluster's address space and that addresses are reclaimed when Pods die. The CNI plugin includes an IPAM component that does this.
The cluster is configured with a Pod CIDR (e.g., 10.244.0.0/16). Typically this large block is subdivided into per-node CIDRs (e.g., a /24 per node), and each node's IPAM hands out IPs from its slice to Pods scheduled there. This local allocation avoids a central bottleneck. Cloud-native CNIs may instead assign real VPC IPs to Pods (e.g., the AWS VPC CNI gives each Pod a secondary IP from the VPC on one of the node's ENIs), gaining native routability at the cost of consuming VPC addresses. Key operational concerns: avoiding CIDR overlap with other networks, and not exhausting the address space (which leaves Pods stuck in ContainerCreating with IPAM errors).
Example
Pod CIDR: 10.244.0.0/16 (65,536 addresses)
  split into /24 per node => 256 node blocks, up to 254 usable addresses each
  node-a: 10.244.1.0/24 (allocates 10.244.1.2 ... 10.244.1.254)
  node-b: 10.244.2.0/24
IPAM on each node assigns + frees IPs as Pods come and go.
# Symptom of IP exhaustion / IPAM failure:
kubectl describe pod stuck-pod
# Events: ... failed to allocate IP address: no available addresses in range
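To make the allocate/reclaim cycle and the exhaustion symptom concrete, here is a toy per-node allocator. It is a sketch under stated assumptions, not how any particular CNI implements IPAM: the class name NodeIPAM is hypothetical, and reserving the first host address for the node's gateway is an assumption chosen to match the .2 to .254 range in the example.

```python
import ipaddress


class NodeIPAM:
    """Toy per-node allocator: hands out and reclaims IPs from one CIDR block."""

    def __init__(self, cidr, reserved=1):
        hosts = list(ipaddress.ip_network(cidr).hosts())
        # Assumption: the first host address is kept for the node's bridge/gateway.
        self.free = hosts[reserved:]
        self.assigned = {}

    def allocate(self, pod):
        if not self.free:
            raise RuntimeError("no available addresses in range")
        ip = self.free.pop(0)
        self.assigned[pod] = ip
        return str(ip)

    def release(self, pod):
        # Return the address to the pool so a later Pod can reuse it.
        self.free.append(self.assigned.pop(pod))


ipam = NodeIPAM("10.244.1.0/29")  # tiny block so exhaustion is easy to reach
print(ipam.allocate("web-1"))
print(len(ipam.free), "addresses left")
```

Once free is empty, allocate raises an error with the same wording as the event above, and the Pod would stay in ContainerCreating until another Pod on the node is deleted and its address is released.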
Exercises
- (Beginner) What is IPAM responsible for?
- (Beginner) What is a "Pod CIDR," and how is it typically divided?
- (Intermediate) What symptom would you expect if a node's IP range is exhausted?
- (Interview) Cloud CNIs like the AWS VPC CNI assign real VPC IPs to Pods instead of overlay IPs. What does this gain, and what new constraint does it introduce? (Hint: native routability vs. VPC IP exhaustion.)
Answers
- Allocating unique IP addresses to Pods from the cluster's address space and tracking/reclaiming them as Pods are created and destroyed.
- The Pod CIDR is the IP range reserved for Pod addresses (e.g., 10.244.0.0/16). It is typically subdivided into smaller per-node blocks (e.g., a /24 per node) so each node allocates IPs locally from its slice.
- New Pods scheduled to that node fail to start, staying in ContainerCreating/Pending with IPAM errors like "no available addresses", because the node cannot allocate an IP for the Pod's network namespace.
- Assigning real VPC IPs makes Pods natively routable within the VPC (no overlay/encapsulation overhead, direct integration with cloud security groups and load balancers, preserved source IPs). The constraint is VPC IP exhaustion: Pods consume actual VPC subnet addresses, so subnet sizing and per-node IP/ENI limits cap Pod density and must be planned carefully to avoid running out of addresses.
6.2 Services
Pods are ephemeral and their IPs change; Services provide stable endpoints in front of them. This subchapter covers every Service type and how discovery and affinity work.
ClusterIP service
Theory
A Service is an abstraction that provides a stable virtual IP and DNS name in front of a dynamic set of Pods selected by labels. The ClusterIP type — the default — exposes the Service on an internal, cluster-only virtual IP. It is reachable from inside the cluster but not from outside. This is the workhorse for internal service-to-service communication (e.g., a frontend talking to a backend API).
The Service decouples clients from Pod churn: clients connect to the stable ClusterIP (or DNS name), and kube-proxy load-balances each connection to one of the healthy backing Pods (whose membership is tracked in EndpointSlices). When Pods are added, removed, or rescheduled with new IPs, the Service IP stays constant and the endpoint set updates automatically. The set of Pods behind a Service is defined by its selector matching Pod labels.
Example
apiVersion: v1
kind: Service
metadata: { name: backend }
spec:
  type: ClusterIP        # default; internal-only virtual IP
  selector:
    app: backend         # selects Pods labeled app=backend
  ports:
  - port: 80             # the Service port clients use
    targetPort: 8080     # the container port traffic is sent to
kubectl get svc backend
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# backend ClusterIP 10.96.12.34 <none> 80/TCP
# Reachable in-cluster as: backend.<namespace>.svc.cluster.local:80
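The selector logic that decides which Pods sit behind the Service can be sketched as follows. This is a simplified model: the Pod dictionaries and the function names matches and endpoints are hypothetical, and it returns only the addresses that would receive traffic (real EndpointSlices also record unready Pods, marked as not ready).

```python
def matches(selector, labels):
    """Equality-based selector: every selector pair must appear in the labels."""
    return all(labels.get(k) == v for k, v in selector.items())


def endpoints(selector, pods):
    """IPs of Pods that match the selector and are ready to receive traffic."""
    return [p["ip"] for p in pods if p["ready"] and matches(selector, p["labels"])]


# Hypothetical Pods; the field names are simplified for illustration.
pods = [
    {"ip": "10.244.1.5", "ready": True, "labels": {"app": "backend", "ver": "v1"}},
    {"ip": "10.244.2.9", "ready": False, "labels": {"app": "backend", "ver": "v2"}},
    {"ip": "10.244.3.4", "ready": True, "labels": {"app": "frontend"}},
]

print(endpoints({"app": "backend"}, pods))
```

When the second Pod becomes ready, the same call returns both backend addresses; the client keeps using the unchanged Service IP throughout.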
Exercises
- (Beginner) What does a ClusterIP Service provide, and is it reachable from outside the cluster?
- (Beginner) What is the difference between port and targetPort?
- (Intermediate) How does a Service know which Pods to send traffic to, and what object tracks the current set?
- (Interview) Explain how a Service provides stability to clients despite Pods being created and destroyed with changing IPs. (Hint: stable VIP/DNS, selector, EndpointSlices, kube-proxy.)
Answers
- A stable internal virtual IP (and DNS name) load-balancing to the selected Pods. ClusterIP is internal-only — not reachable from outside the cluster.
- port is the port the Service listens on (what clients connect to); targetPort is the port on the backing Pods/containers that traffic is forwarded to. They can differ.
- Via its label selector matching Pod labels. The current set of healthy backing Pod IPs is tracked in EndpointSlices (formerly Endpoints), which the control plane keeps updated as Pods come and go.
- Clients use the Service's stable ClusterIP or DNS name, which never changes for the life of the Service. Behind it, the selector continuously identifies matching, ready Pods and the EndpointSlice controller updates the endpoint list as Pods churn. kube-proxy programs each node so traffic to the ClusterIP is load-balanced to a current healthy Pod. Thus clients are insulated entirely from Pod creation/deletion and IP changes.
NodePort service
Theory
A NodePort Service exposes the Service on a static port on every node's IP, making it reachable from outside the cluster at <any-node-IP>:<nodePort>. It builds on ClusterIP: creating a NodePort also creates a ClusterIP, and the node port simply forwards into it. The port is allocated from a configurable range (default 30000–32767).
NodePort is the simplest way to get external traffic into a cluster without a cloud load balancer, but it has limitations: the ports are high-numbered and non-standard, you must track node IPs (which can change), and there's no built-in load balancing across nodes (clients hit a specific node, though kube-proxy then routes to any Pod). In practice NodePort is used for development, on-prem setups behind an external load balancer, or as the underlying mechanism that LoadBalancer Services build upon.
Example
apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  type: NodePort
  selector: { app: web }
  ports:
  - port: 80
    targetPort: 8080
    nodePort: 30080      # optional; otherwise auto-assigned in 30000-32767
kubectl get svc web
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
# web NodePort 10.96.5.6 <none> 80:30080/TCP 1m
# Reachable externally at: http://<node-IP>:30080
Exercises
- (Beginner) Where is a NodePort Service reachable from, and on what kind of port?
- (Beginner) What is the default node port range?
- (Intermediate) A NodePort also implicitly creates which other Service type, and why?
- (Interview) What are the practical drawbacks of exposing production traffic directly via NodePort, and what is typically placed in front of it? (Hint: non-standard ports, node IP tracking, no L7 features; external LB/Ingress.)
Answers
- From outside the cluster, at any node's IP on a static, high-numbered port (<node-IP>:<nodePort>).
- 30000–32767.
- It also creates a ClusterIP Service. The node port forwards incoming external traffic to the ClusterIP, which load-balances to the Pods, so NodePort is a layer on top of ClusterIP.
- Drawbacks: non-standard high ports (not 80/443), clients must know/track node IPs (which change as nodes are added/removed), no TLS termination, host-based routing, or other L7 features, and limited/awkward load balancing. In production you typically place an external load balancer in front (which a LoadBalancer Service automates) and/or an Ingress controller for HTTP routing and TLS, using NodePort only as the underlying plumbing.
LoadBalancer service
Theory
A LoadBalancer Service is the standard way to expose a service to the internet in a cloud environment. It builds on NodePort/ClusterIP and additionally asks the cloud provider (via the cloud-controller-manager) to provision an external load balancer (AWS ELB/NLB, GCP LB, Azure LB) that forwards traffic to the Service. The cloud LB gets a stable external IP/hostname, which Kubernetes reports in the Service's EXTERNAL-IP.
This gives you a single, stable, internet-facing endpoint with the cloud's load balancing, health checks, and (often) integration with TLS and DDoS protection. The catch: each LoadBalancer Service typically provisions its own cloud load balancer, which costs money and doesn't scale well if you have dozens of services — which is exactly the problem Ingress solves (one LB fronting many services). On bare metal there's no cloud LB, so tools like MetalLB implement the type.
Example
apiVersion: v1
kind: Service
metadata: { name: frontend }
spec:
  type: LoadBalancer
  selector: { app: frontend }
  ports:
  - port: 80
    targetPort: 8080
kubectl get svc frontend
# NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S)
# frontend LoadBalancer 10.96.0.40 a1b2.elb.aws.com 80:31234/TCP
# The cloud-controller-manager populated EXTERNAL-IP once the LB was ready.
Exercises
- (Beginner) What does a LoadBalancer Service ask the cloud provider to create?
- (Beginner) Which component populates the Service's EXTERNAL-IP?
- (Intermediate) Why can running many LoadBalancer Services become expensive and unwieldy?
- (Interview) On a bare-metal cluster, a LoadBalancer Service stays <pending>. Explain why and how to resolve it. (Hint: no cloud CCM; MetalLB.)
Answers
- An external (cloud) load balancer with a stable public IP/hostname that forwards traffic to the Service's nodes/Pods.
- The cloud-controller-manager (its Service controller), after the cloud LB is provisioned.
- Each LoadBalancer Service typically provisions its own cloud load balancer, each of which costs money and consumes cloud quota. With many services this means many LBs to pay for and manage — inefficient compared to a single LB fronting many services, which is what Ingress provides.
- There is no cloud provider/CCM to fulfill the LoadBalancer request, so no external LB is created and EXTERNAL-IP never populates. Resolve it by installing a bare-metal load-balancer implementation such as MetalLB (or kube-vip), which assigns IPs from a configured pool and announces them via ARP (L2) or BGP, fulfilling LoadBalancer Services on-prem.
ExternalName service
Theory
An ExternalName Service is unusual: it has no selector, no Pods, no ClusterIP. Instead it maps a Service name to an external DNS name via a CNAME record. When something in the cluster looks up the Service, CoreDNS returns a CNAME pointing to the external host. It is essentially an in-cluster DNS alias for an external endpoint.
Its purpose is to let in-cluster clients reference an external dependency (a managed database, a third-party API, a legacy system) through a stable internal name, decoupling the client from the actual external address. You could later migrate that dependency into the cluster by changing the Service to a normal ClusterIP type without touching any client. Note: because it's just DNS aliasing, no traffic flows through Kubernetes and no load balancing happens. Clients connect to the external host directly while still using the internal Service name, so anything that depends on that hostname (HTTP Host headers, TLS SNI and certificate validation) may not match what the external server expects.
Example
apiVersion: v1
kind: Service
metadata: { name: prod-db }
spec:
  type: ExternalName
  externalName: db.prod.example.com   # CNAME target; no selector/ClusterIP
# In-cluster, "prod-db" resolves to the external host via CNAME:
kubectl run t --rm -it --image=busybox -- nslookup prod-db.default.svc.cluster.local
# prod-db.default.svc.cluster.local canonical name = db.prod.example.com
Exercises
- (Beginner) What does an ExternalName Service map a Service name to, and via what DNS record type?
- (Beginner) Does an ExternalName Service have a ClusterIP or select Pods?
- (Intermediate) Give a scenario where ExternalName helps you migrate an external dependency into the cluster later without changing clients.
- (Interview) Since ExternalName works purely at the DNS layer, what are its limitations compared to a real Service? (Hint: no load balancing, no port remap, TLS/SNI considerations.)
Answers
- To an external DNS hostname, via a CNAME record.
- No — it has no ClusterIP and no selector/Endpoints; it is just a DNS alias.
- Suppose your app uses an external managed database at db.prod.example.com. Point clients at the internal name prod-db (an ExternalName Service). Later, when you run the database inside the cluster, change prod-db to a normal ClusterIP Service selecting the in-cluster Pods. Clients keep using prod-db and need no changes; the indirection makes the migration transparent.
- It only returns a DNS alias, so: there's no load balancing or health checking by Kubernetes (that's the external host's concern), no port remapping (it doesn't proxy traffic, so targetPort-style remap doesn't apply), kube-proxy isn't involved, and it can cause TLS/SNI mismatches (the client connects using the Service name, but the external server presents a certificate for its real external name). It's a convenience alias, not a traffic-handling Service.
Headless services
Theory
A headless Service (clusterIP: None) was introduced in Chapter 4 for StatefulSets, but it's a general Service feature. By setting clusterIP: None, you tell Kubernetes not to allocate a virtual IP and not to load-balance. Instead, a DNS lookup of the Service returns the A/AAAA records of all the backing Pods directly (their individual IPs).
This is useful whenever the client — not kube-proxy — should decide which Pod to talk to, or needs to see all of them: stateful systems addressing specific members, client-side load balancing, or service discovery where the app wants the full endpoint list (e.g., gRPC clients that maintain their own connection pool to each backend). With a selector, the headless Service produces per-Pod records; without a selector, you can manage the Endpoints manually.
Example
apiVersion: v1
kind: Service
metadata: { name: cassandra }
spec:
  clusterIP: None        # headless
  selector: { app: cassandra }
  ports:
  - port: 9042
# Returns ALL Pod IPs, not a single VIP:
kubectl run t --rm -it --image=busybox -- nslookup cassandra.default.svc.cluster.local
# Address: 10.244.1.5
# Address: 10.244.2.7
# Address: 10.244.3.9
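What a client does with that full address list is up to the client. The sketch below contrasts a single pinned connection with a simple client-side rotation; the function names pinned and round_robin are hypothetical, and real clients (gRPC included) use their own balancing policies rather than this exact logic.

```python
import itertools


def pinned(backends, n):
    """One long-lived connection: every request goes to the same backend."""
    chosen = backends[0]
    return [chosen for _ in range(n)]


def round_robin(backends, n):
    """Client-side balancing: rotate requests across all resolved Pod IPs."""
    cycle = itertools.cycle(backends)
    return [next(cycle) for _ in range(n)]


# Addresses as returned by the headless lookup in the example.
pod_ips = ["10.244.1.5", "10.244.2.7", "10.244.3.9"]

print(pinned(pod_ips, 4))
print(round_robin(pod_ips, 4))
```

The rotation is only possible because the headless Service exposes every Pod IP; behind a single virtual IP the client has nothing to rotate over.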
Exercises
- (Beginner) How do you make a Service headless?
- (Beginner) What does a DNS query for a headless Service return instead of a single virtual IP?
- (Intermediate) Give two situations where a headless Service is preferable to a normal ClusterIP Service.
- (Interview) Why might a gRPC client specifically benefit from a headless Service rather than a ClusterIP? (Hint: long-lived connections, client-side load balancing, single VIP pinning.)
Answers
- Set clusterIP: None in the Service spec.
- The individual IP addresses (A/AAAA records) of all the backing Pods.
- Any two: addressing specific members of a stateful system (StatefulSets), client-side load balancing where the client picks the backend, service discovery needing the full endpoint list, or peer-to-peer/clustered apps that must know all members.
- gRPC uses long-lived HTTP/2 connections; with a ClusterIP, kube-proxy load-balances at connection time, so a client tends to pin a single persistent connection to one backend Pod, defeating load distribution across many requests. A headless Service returns all Pod IPs, letting the gRPC client open and balance requests across connections to each backend itself (client-side load balancing), achieving even distribution and reacting to endpoint changes.
Service discovery via DNS
Theory
How does a Pod find a Service? Through DNS, provided by CoreDNS. Every Service automatically gets a DNS A/AAAA record at <service>.<namespace>.svc.cluster.local resolving to its ClusterIP (or, for headless, to the Pod IPs). This is the primary service-discovery mechanism in Kubernetes — apps connect by name, not by IP.
Crucially, DNS supports short names via search domains. The kubelet configures each Pod's /etc/resolv.conf with search suffixes so that within the same namespace you can use just backend, from another namespace backend.other-ns, and the fully qualified name always works. There's also a legacy mechanism — environment variables (BACKEND_SERVICE_HOST/_PORT) injected for Services that existed when the Pod started — but DNS is preferred because it's dynamic (env vars are stale if the Service is created after the Pod).
Example
DNS name structure:
<service>.<namespace>.svc.cluster.local
e.g. backend.payments.svc.cluster.local -> 10.96.12.34
From a Pod in namespace "payments": curl http://backend (short name works)
From a Pod in another namespace: curl http://backend.payments
Always works (FQDN): curl http://backend.payments.svc.cluster.local
cat /etc/resolv.conf
# search payments.svc.cluster.local svc.cluster.local cluster.local
# nameserver 10.96.0.10
# options ndots:5
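The order in which a resolver tries names can be sketched as below. It assumes the usual stub-resolver rule and the ndots:5 option that Kubernetes sets by default for Pods: a name with fewer dots than ndots is tried with each search suffix first, and a name ending in a dot is treated as fully qualified. The function name candidates is hypothetical.

```python
def candidates(name, search, ndots=5):
    """Names a stub resolver would try, in order, for the given query."""
    if name.endswith("."):
        return [name]  # already fully qualified: no suffixes are appended
    suffixed = [f"{name}.{suffix}" for suffix in search]
    if name.count(".") >= ndots:
        return [name] + suffixed  # enough dots: try the name as-is first
    return suffixed + [name]


search = ["payments.svc.cluster.local", "svc.cluster.local", "cluster.local"]
for query in ("backend", "backend.other-ns"):
    print(query, "->", candidates(query, search)[:2])
```

For backend the first candidate already resolves (same namespace). For backend.other-ns the first candidate fails and the second, backend.other-ns.svc.cluster.local, resolves, which is how cross-namespace short names work.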
Exercises
- (Beginner) What is the fully qualified DNS name pattern for a Service?
- (Beginner) From a Pod in the same namespace as a Service api, what short name can you use to reach it?
- (Intermediate) Why is DNS-based discovery preferred over the injected environment-variable mechanism?
- (Interview) Explain how the search domains in a Pod's /etc/resolv.conf enable both short and cross-namespace name resolution. (Hint: suffix search list ordering.)
Answers
- <service>.<namespace>.svc.cluster.local.
- Just api (the same-namespace search suffix resolves it).
- DNS is dynamic: a Pod can resolve any Service at any time, including Services created after the Pod started. Environment-variable discovery only injects variables for Services that existed when the Pod was created, so it misses later Services and goes stale, and it clutters the environment. DNS avoids these problems and supports short/cross-namespace names.
- The resolver appends each search suffix in order to an unqualified name until one resolves. With search payments.svc.cluster.local svc.cluster.local cluster.local, looking up api first tries api.payments.svc.cluster.local (same namespace), which resolves the short name. Looking up api.other first tries api.other.payments.svc.cluster.local, which fails, and then api.other.svc.cluster.local, which resolves the cross-namespace name. A fully qualified name still resolves because the resolver falls back to the name as written once the suffixes fail (or skips the search list entirely if the name ends with a trailing dot). The ordered suffix list is what makes both conveniences work.
Session affinity
Theory
By default, a Service load-balances each new connection independently, so successive requests from the same client may hit different Pods. For most stateless apps that's ideal. But some apps keep per-client in-memory state (a session, a cache) and work better if a given client consistently reaches the same Pod. Session affinity provides this "sticky" behavior.
Kubernetes Services support sessionAffinity: ClientIP, which routes requests from the same client source IP to the same backing Pod (within a configurable timeout, default 3 hours). This is L3/L4 affinity based on IP only — it cannot do cookie-based stickiness (that requires an L7 proxy like an Ingress controller). Note that session affinity is generally a workaround; the more robust pattern is to externalize session state (e.g., to Redis) so any Pod can serve any request, keeping the service truly stateless and horizontally scalable.
Example
apiVersion: v1
kind: Service
metadata: { name: web }
spec:
  selector: { app: web }
  sessionAffinity: ClientIP        # stick a client IP to one Pod
  sessionAffinityConfig:
    clientIP:
      timeoutSeconds: 10800        # affinity duration (default 3h)
  ports:
    - port: 80
      targetPort: 8080
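For contrast, cookie-based stickiness has to be configured at L7. The following is a minimal sketch, assuming the community ingress-nginx controller and its affinity annotations; the Ingress name, host, and cookie name are illustrative, and other controllers use different keys.

```yaml
# Sketch: cookie-based stickiness at the Ingress layer (assumes ingress-nginx).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-sticky                 # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/affinity: "cookie"
    nginx.ingress.kubernetes.io/session-cookie-name: "route"      # illustrative cookie name
    nginx.ingress.kubernetes.io/session-cookie-max-age: "10800"   # seconds, mirrors the 3h timeout
spec:
  ingressClassName: nginx
  rules:
    - host: web.example.com        # hypothetical host
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: web
                port: { number: 80 }
```

Because the controller identifies clients by cookie rather than source IP, this form keeps working when many clients share one NAT address.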
Exercises
- (Beginner) What does session affinity do?
- (Beginner) Which value enables IP-based stickiness on a Service, and what is its default timeout?
- (Intermediate) Why can a Service not do cookie-based session stickiness, and what can?
- (Interview) Session affinity is often called a workaround. What is the more scalable architectural alternative, and why is it preferable? (Hint: externalize state, statelessness.)
Answers
- It routes requests from the same client consistently to the same backing Pod ("sticky sessions"), rather than load-balancing each request independently.
- sessionAffinity: ClientIP; default timeout is 3 hours (10800 seconds).
- A Service operates at L3/L4 and only sees IP addresses, not HTTP cookies, so it can only stick by client IP. Cookie-based affinity requires an L7 (HTTP-aware) proxy such as an Ingress controller (e.g., NGINX with affinity annotations) that can read/set cookies.
- The better approach is to externalize session state (store sessions/caches in a shared store like Redis or a database) so the application is truly stateless and any Pod can serve any request. This avoids the fragility of affinity (uneven load, lost sessions when a Pod dies, broken stickiness behind NAT where many clients share an IP) and enables clean horizontal scaling and rolling updates without disrupting user sessions.
6.3 Ingress
Exposing many HTTP services each via its own LoadBalancer is wasteful. Ingress provides a single entry point with HTTP-aware routing, TLS, and host/path rules. This subchapter covers it.
Ingress resource and controllers
Theory
An Ingress is an API object that defines HTTP/HTTPS routing rules for exposing Services to the outside world — routing by hostname and URL path, terminating TLS, etc. — through a single external entry point. Critically, the Ingress object is just a set of rules; it does nothing on its own. You must run an Ingress controller (a Pod that watches Ingress objects and implements them), such as the NGINX Ingress Controller, Traefik, HAProxy, or a cloud's controller.
This is a deliberate separation: the Ingress resource is the portable, declarative spec; the Ingress controller is the pluggable implementation (the same pattern as CNI/CSI). One Ingress controller (fronted by a single LoadBalancer) can serve many Services across many hostnames and paths — solving the "one LB per service" cost problem of LoadBalancer Services. Note: Ingress is for HTTP(S); for other protocols you use Services or the newer Gateway API.
Example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: site
spec:
  ingressClassName: nginx          # which controller implements this
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: frontend
                port: { number: 80 }
# The controller runs as a workload and is fronted by one LB:
kubectl -n ingress-nginx get pods,svc
# service/ingress-nginx-controller LoadBalancer ... 80:.../443:...
Exercises
- (Beginner) What does an Ingress resource define, and what protocol(s) is it for?
- (Beginner) Why does creating an Ingress object alone not route any traffic?
- (Intermediate) How does Ingress solve the cost problem of using a LoadBalancer Service per application?
- (Interview) Describe the separation between the Ingress resource and the Ingress controller and why this mirrors Kubernetes' broader extensibility philosophy. (Hint: portable spec vs. swappable implementation.)
Answers
- It defines HTTP/HTTPS routing rules (host- and path-based) and TLS settings for exposing Services externally. It is for HTTP(S) traffic (L7).
- The Ingress object is only a declarative set of rules. Without a running Ingress controller to watch those objects and configure an actual proxy/load balancer, nothing implements the rules, so no traffic is routed.
- A single Ingress controller, fronted by one external load balancer, can route to many backend Services based on host/path rules. So instead of one cloud LB per service, you pay for one LB and define many routes — far cheaper and easier to manage at scale.
- The Ingress resource is a standard, portable, declarative API describing desired routing; the Ingress controller is a pluggable implementation (NGINX, Traefik, cloud-native, etc.) that realizes those rules. This is the same stable-interface/swappable-implementation pattern as CRI, CNI, and CSI: Kubernetes defines the contract while letting operators choose the implementation best suited to their needs, without changing the resource or applications.
NGINX Ingress Controller
Theory
The NGINX Ingress Controller is the most widely used Ingress controller. It runs NGINX (a battle-tested reverse proxy) inside the cluster, watches Ingress objects, and dynamically generates NGINX configuration to route traffic accordingly. It handles TLS termination, path/host routing, rewrites, rate limiting, and many features exposed through annotations.
A common point of confusion: there are two different projects — the community ingress-nginx (maintained by the Kubernetes project) and NGINX Inc.'s nginx-ingress (commercial/F5). They use different annotation prefixes and feature sets. The community ingress-nginx is by far the most common in tutorials and default installs. Operationally, it runs as a Deployment or DaemonSet fronted by a LoadBalancer/NodePort Service, and you select it via ingressClassName: nginx.
Example
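apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    # NGINX-specific behavior. As written, every matched request is rewritten to "/";
    # preserving the sub-path requires a regex path with a capture group.
    nginx.ingress.kubernetes.io/rewrite-target: /
    nginx.ingress.kubernetes.io/ssl-redirect: "true"
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1
            pathType: Prefix
            backend:
              service: { name: api-v1, port: { number: 80 } }
# Install the community controller (add the chart repository first):
helm repo add ingress-nginx https://kubernetes.github.io/ingress-nginx
helm install ingress-nginx ingress-nginx/ingress-nginx -n ingress-nginx --create-namespace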
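A plain rewrite-target: / sends every matched request to the backend's root. To strip only the prefix and keep the rest of the path, the community ingress-nginx controller expects a regex path with capture groups. The sketch below assumes that controller; the Ingress name is hypothetical, and the regex follows the pattern used in the controller's rewrite documentation.

```yaml
# Sketch: strip the /v1 prefix before forwarding (assumes community ingress-nginx).
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api-rewrite                # hypothetical name
  annotations:
    nginx.ingress.kubernetes.io/use-regex: "true"
    nginx.ingress.kubernetes.io/rewrite-target: /$2   # $2 = second capture group below
spec:
  ingressClassName: nginx
  rules:
    - host: api.example.com
      http:
        paths:
          - path: /v1(/|$)(.*)     # group 1: "/" or end of path; group 2: the remainder
            pathType: ImplementationSpecific
            backend:
              service: { name: api-v1, port: { number: 80 } }
```

With this rule a request for /v1/users reaches the backend as /users, and /v1 reaches it as /. The regex path needs pathType: ImplementationSpecific because Prefix and Exact treat the path as a literal string.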
Exercises
- (Beginner) What software does the NGINX Ingress Controller run under the hood?
- (Beginner) How are NGINX-specific behaviors (like URL rewrites) configured on an Ingress?
- (Intermediate) Why is it important to know whether you're using community ingress-nginx or NGINX Inc.'s controller?
- (Interview) The NGINX Ingress Controller regenerates NGINX config as Ingress/Service/endpoint objects change. Explain the watch-and-reconcile loop that makes this dynamic. (Hint: watches API server, templates config, reloads NGINX.)
Answers
- NGINX (a reverse proxy/load balancer) running inside the cluster.
- Through annotations on the Ingress object (e.g., nginx.ingress.kubernetes.io/rewrite-target), which the controller translates into NGINX configuration.
- The two projects (community ingress-nginx vs. F5/NGINX Inc.'s nginx-ingress) use different annotation prefixes, configuration, and feature sets, so annotations and docs for one don't apply to the other. Using the wrong annotations silently does nothing or behaves unexpectedly, so you must know which controller you run.
- The controller runs a control loop: it watches the API server for Ingress, Service, and EndpointSlice changes; on any change it renders an NGINX configuration from templates reflecting the current desired routing and the current set of healthy backend endpoints; then it applies/reloads that config in the NGINX process. Because it reacts to watch events (level-triggered reconciliation), routing stays continuously in sync with the cluster state as backends scale, fail, or are reconfigured.
Path-based and host-based routing
Theory
The core power of Ingress is routing a single external IP to many backends based on the request. Two dimensions:
- Host-based routing: route by the HTTP Host header (the domain). shop.example.com goes to the shop Service; api.example.com goes to the API Service, all on the same IP/controller. This is name-based virtual hosting.
- Path-based routing: route by URL path within a host. /web to the frontend, /api to the backend, /grafana to monitoring.
Paths have a pathType: Prefix (match by URL path prefix, the common choice), Exact (exact match), or ImplementationSpecific. Order and specificity matter, and some controllers need rewrite annotations when the backend doesn't expect the path prefix. Combining host and path rules lets one Ingress controller serve an entire portfolio of applications.
Example
spec:
  ingressClassName: nginx
  rules:
    - host: shop.example.com       # host-based: by domain
      http:
        paths:
          - path: /                # path-based: by URL path
            pathType: Prefix
            backend: { service: { name: frontend, port: { number: 80 } } }
          - path: /api
            pathType: Prefix
            backend: { service: { name: api, port: { number: 80 } } }
    - host: admin.example.com      # different host -> different backend
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: admin, port: { number: 80 } } }
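The difference between Exact and Prefix, and why specificity matters, is easier to see side by side. This is a sketch with illustrative host and Service names; the matching behavior in the comments follows the standard Ingress path-matching rules (Prefix compares whole path elements, and the longest matching path wins).

```yaml
# Sketch: how Exact and Prefix differ (host and Service names are illustrative).
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /healthz
            pathType: Exact        # matches /healthz only, not /healthz/ or /healthz/live
            backend: { service: { name: health, port: { number: 80 } } }
          - path: /api
            pathType: Prefix       # matches /api, /api/ and /api/users, but not /apiary
            backend: { service: { name: api, port: { number: 80 } } }
          - path: /
            pathType: Prefix       # catch-all; the longer /api prefix wins for /api/...
            backend: { service: { name: web, port: { number: 80 } } }
```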
Exercises
- (Beginner) What request attribute does host-based routing use, and what does path-based routing use?
- (Beginner) What is the most commonly used pathType?
- (Intermediate) Write (in words) the routing for: app.example.com/ to web, app.example.com/api to api.
- (Interview) When routing /api to a backend that serves its routes at /, what problem arises and how is it typically solved? (Hint: the backend receives /api/...; rewrite-target.)
Answers
- Host-based routing uses the HTTP Host header (the domain); path-based routing uses the URL path.
- Prefix.
- One Ingress rule with host: app.example.com and two paths: path: / (pathType: Prefix) -> Service web; path: /api (pathType: Prefix) -> Service api. Requests to the root go to web; requests under /api go to api.
- The backend receives the full path including the prefix (e.g., /api/users instead of /users), so its routes don't match and it returns 404s. This is solved with a URL rewrite (e.g., nginx.ingress.kubernetes.io/rewrite-target combined with a capture-group path), which strips/rewrites the prefix before forwarding so the backend sees the path it expects.
TLS termination at Ingress
Theory
TLS termination means the Ingress controller decrypts incoming HTTPS traffic at the edge, so your backend Services can receive plain HTTP internally. This centralizes certificate management at one place (the Ingress) instead of configuring TLS in every application. You provide the certificate and key in a kubernetes.io/tls Secret and reference it in the Ingress tls section, mapping it to the host(s) it covers.
In practice, certificate provisioning is automated with cert-manager, an add-on that obtains and renews certificates (e.g., free Let's Encrypt certs via ACME) and stores them in TLS Secrets that the Ingress references. The controller then serves HTTPS for the configured hosts, terminating TLS and forwarding plaintext to backends. (If you need encryption all the way to the Pod — "end-to-end" TLS or mTLS — that's a service-mesh concern, Chapter 13.)
Example
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: secure-site
  annotations:
    cert-manager.io/cluster-issuer: letsencrypt-prod   # cert-manager auto-issues
spec:
  ingressClassName: nginx
  tls:
    - hosts: [ shop.example.com ]
      secretName: shop-tls         # kubernetes.io/tls Secret with cert + key
  rules:
    - host: shop.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend: { service: { name: frontend, port: { number: 80 } } }
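The cluster-issuer annotation above only works if an issuer with that name exists. A minimal sketch of what such a ClusterIssuer could look like follows, assuming cert-manager is installed and HTTP-01 challenges are solved through the NGINX controller. The email address and account-key Secret name are placeholders, and older cert-manager releases use a class field instead of ingressClassName in the solver.

```yaml
# Sketch: a ClusterIssuer that the cluster-issuer annotation could refer to.
# Assumes cert-manager is installed; the email and Secret name are placeholders.
apiVersion: cert-manager.io/v1
kind: ClusterIssuer
metadata:
  name: letsencrypt-prod
spec:
  acme:
    server: https://acme-v02.api.letsencrypt.org/directory   # Let's Encrypt production endpoint
    email: ops@example.com                  # placeholder contact address
    privateKeySecretRef:
      name: letsencrypt-prod-account-key    # hypothetical Secret for the ACME account key
    solvers:
      - http01:
          ingress:
            ingressClassName: nginx         # solve HTTP-01 challenges through this controller
```

cert-manager then writes the issued certificate and key into the Secret named in the Ingress tls section (shop-tls in the example) and renews it before expiry.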
Exercises
- (Beginner) What does "TLS termination at the Ingress" mean for backend traffic?
- (Beginner) What type of Secret holds the certificate and key for an Ingress?
- (Intermediate) What does cert-manager automate, and why is that valuable?
- (Interview) TLS terminates at the Ingress, leaving traffic plaintext between the Ingress and the Pods. When is that acceptable, and what would you use if you need encryption end-to-end? (Hint: trusted cluster network vs. mTLS/service mesh.)
Answers
- The Ingress controller decrypts HTTPS at the edge, so backends receive plain HTTP; TLS is handled centrally at the Ingress rather than in each app.
- A kubernetes.io/tls Secret (containing tls.crt and tls.key), referenced via secretName in the Ingress tls section.
- cert-manager automatically obtains and renews TLS certificates (e.g., from Let's Encrypt via ACME) and stores them in TLS Secrets the Ingress uses. This eliminates manual certificate issuance, installation, and renewal, preventing outages from expired certs and enabling free, automated HTTPS.
- Terminating at the Ingress is acceptable when the in-cluster network between the Ingress and Pods is considered trusted and not subject to your threat model (common in many setups). If you require encryption all the way to the Pod (e.g., zero-trust, compliance, untrusted network), use a service mesh providing automatic mTLS between workloads (Chapter 13), or configure the backend and Ingress for end-to-end/re-encrypted TLS so traffic is never plaintext on the wire.
Ingress class and annotations
Theory
A cluster can run multiple Ingress controllers (e.g., an internal one and an external one, or NGINX plus a cloud controller). The IngressClass mechanism tells each Ingress object which controller should handle it. You reference it via spec.ingressClassName (the modern field; the older kubernetes.io/ingress.class annotation is deprecated). An IngressClass can be marked default so Ingresses without an explicit class use it.
Annotations are how you configure controller-specific behavior that the standard Ingress spec doesn't cover — rewrites, rate limits, timeouts, body size limits, auth, canary weights, and more. Because annotations are controller-specific, they are not portable between controllers. The portable, structured successor to all of this is the Gateway API, which moves many annotation-driven behaviors into typed resources (GatewayClass, Gateway, HTTPRoute) — worth knowing as the direction the ecosystem is heading.
Example
apiVersion: networking.k8s.io/v1
kind: IngressClass
metadata:
  name: nginx
  annotations:
    ingressclass.kubernetes.io/is-default-class: "true"   # default for unspecified
spec:
  controller: k8s.io/ingress-nginx
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: api
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"    # controller-specific
    nginx.ingress.kubernetes.io/limit-rps: "100"          # requests per second per client IP
spec:
  ingressClassName: nginx
  rules: [ ... ]
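To make the contrast with annotations concrete, here is a sketch of similar routing expressed with Gateway API resources. It assumes the Gateway API CRDs and a conforming controller are installed; the GatewayClass name example-class and the backend api-v1 are hypothetical, and support for the URLRewrite filter depends on the implementation.

```yaml
# Sketch: routing and a prefix rewrite as typed Gateway API fields.
# Assumes Gateway API CRDs plus a controller; "example-class" is hypothetical.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: public
spec:
  gatewayClassName: example-class
  listeners:
    - name: http
      protocol: HTTP
      port: 80
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: api
spec:
  parentRefs:
    - name: public                 # attach this route to the Gateway above
  hostnames: [ "api.example.com" ]
  rules:
    - matches:
        - path: { type: PathPrefix, value: /v1 }
      filters:
        - type: URLRewrite         # a typed field instead of a rewrite annotation
          urlRewrite:
            path:
              type: ReplacePrefixMatch
              replacePrefixMatch: /
      backendRefs:
        - name: api-v1
          port: 80
```

The rewrite is a schema-validated field on the route, so a typo is rejected by the API server instead of being silently ignored the way an unknown annotation would be.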
Exercises
- (Beginner) What does ingressClassName on an Ingress determine?
- (Beginner) How do you make an IngressClass the default for Ingresses that don't specify one?
- (Intermediate) Why are Ingress annotations not portable across different controllers?
- (Interview) The Gateway API is positioned as the successor to Ingress + annotations. What structural problems with the annotation approach does it aim to fix? (Hint: typed/role-oriented resources vs. unstructured controller-specific annotations.)
Answers
- Which Ingress controller (via its IngressClass) should implement that Ingress's rules — important when multiple controllers run in the cluster.
- Mark the IngressClass with the annotation ingressclass.kubernetes.io/is-default-class: "true"; Ingresses without an explicit ingressClassName then use it.
- Annotations encode controller-specific features and use controller-specific keys/semantics (e.g., nginx.ingress.kubernetes.io/...). A different controller doesn't understand those keys, so the configuration silently doesn't apply, which makes annotation-based config non-portable.
- Annotations are unstructured strings with no schema/validation, are controller-specific (non-portable), and overload a single resource with concerns belonging to different roles. The Gateway API replaces them with typed, validated, role-oriented resources (GatewayClass for infra providers, Gateway for cluster operators, HTTPRoute/etc. for app developers), enabling schema validation, clearer separation of responsibilities, portability across implementations, and richer routing expressed as first-class fields instead of opaque annotations.
6.4 Network Policies
By default, all Pods can talk to all Pods — which is convenient but insecure. NetworkPolicies let you segment the network and enforce least-privilege connectivity. This subchapter covers them.
Default allow vs deny behavior
Theory
A critical security default to internalize: by default, all Pods can communicate with all other Pods in the cluster, with no restrictions. The flat network model is "default allow." This is convenient but means a single compromised Pod can reach every other Pod (lateral movement).
A NetworkPolicy changes this for the Pods it selects. The key rule: as soon as any NetworkPolicy selects a Pod, that Pod becomes "default deny" for the direction(s) the policy covers — only traffic explicitly allowed by a policy is permitted; everything else is dropped. Pods not selected by any policy remain default-allow. A common pattern is to apply an explicit default-deny policy to a namespace (selecting all Pods with empty rules), then add specific allow policies on top — flipping the namespace to least-privilege.
Example
# Default-deny ALL ingress in a namespace: selects every Pod, allows nothing.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: secure
spec:
  podSelector: {}                  # {} = all Pods in the namespace
  policyTypes:
    - Ingress                      # no ingress rules listed => deny all inbound
Without any policy: anyone -> Pod (default allow)
With this policy: nothing -> selected Pods, unless another policy allows it
Exercises
- (Beginner) What is the default Pod-to-Pod communication behavior in a fresh cluster?
- (Beginner) What happens to a Pod's allowed traffic the moment a NetworkPolicy selects it?
- (Intermediate) Write (in words) what a policy with podSelector: {} and policyTypes: [Ingress] and no ingress rules achieves.
- (Interview) Why is "default allow" a security concern, and how does a default-deny-plus-explicit-allow strategy improve posture? (Hint: lateral movement, least privilege.)
Answers
- Default allow — every Pod can reach every other Pod with no restrictions.
- It switches from default-allow to default-deny for the covered direction(s): only traffic explicitly permitted by a NetworkPolicy is allowed; all other traffic in that direction is dropped.
- It selects all Pods in the namespace and, because it lists Ingress as a policy type but specifies no ingress rules, it denies all inbound traffic to every Pod in that namespace (a namespace-wide default-deny for ingress).
- With default allow, a single compromised Pod can freely reach every other Pod, enabling lateral movement and broad blast radius. Applying a namespace default-deny and then adding only the specific allow rules each workload needs enforces least privilege: Pods can talk only to their legitimate dependencies, so a compromise is contained to the explicitly permitted paths, dramatically limiting lateral movement.
Ingress and egress rules
Theory
NetworkPolicies control traffic in two directions, declared in policyTypes:
- Ingress rules: what is allowed to connect to the selected Pods (inbound).
- Egress rules: what the selected Pods are allowed to connect to (outbound).
Each rule specifies allowed peers (from for ingress, to for egress) and optionally ports. Peers can be Pods (by podSelector), namespaces (by namespaceSelector), or IP blocks (ipBlock, for external addresses). Rules are additive (a union — anything matching any rule is allowed) and whitelist-only (you can only allow, never explicitly deny; denial is implicit for anything not allowed once a policy applies).
A frequent gotcha: locking down egress can break DNS — Pods need to reach CoreDNS (UDP/TCP 53) to resolve names, so an egress policy must explicitly allow DNS or name resolution fails.
Example
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: api-policy, namespace: app }
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [ Ingress, Egress ]
  ingress:
    - from:
        - podSelector: { matchLabels: { app: frontend } }   # only frontend may call api
      ports:
        - { protocol: TCP, port: 8080 }
  egress:
    - to:
        - podSelector: { matchLabels: { app: db } }         # api may reach db
      ports:
        - { protocol: TCP, port: 5432 }
    - to:                                                   # allow DNS (critical!)
        - namespaceSelector: {}
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
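The DNS rule above allows port 53 to Pods in every namespace. A tighter variant, sketched below, allows DNS only to the cluster DNS Pods. It assumes CoreDNS runs in kube-system with the label k8s-app=kube-dns, which is common but should be verified on your cluster; the policy name is hypothetical.

```yaml
# Sketch: allow DNS egress only to the cluster DNS Pods.
# Assumes CoreDNS runs in kube-system with the label k8s-app=kube-dns (verify first).
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-dns, namespace: app }
spec:
  podSelector: {}                  # every Pod in the "app" namespace
  policyTypes: [ Egress ]
  egress:
    - to:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: kube-system
          podSelector:             # same list item as namespaceSelector => ANDed
            matchLabels: { k8s-app: kube-dns }
      ports:
        - { protocol: UDP, port: 53 }
        - { protocol: TCP, port: 53 }
```

Because this policy selects every Pod in the namespace for Egress, it also acts as a default-deny egress baseline that permits only DNS; other outbound traffic then needs its own allow policy.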
Exercises
- (Beginner) What is the difference between an ingress rule and an egress rule?
- (Beginner) Can a NetworkPolicy explicitly deny a specific source? Why or why not?
- (Intermediate) Why might adding an egress policy suddenly break your application's ability to resolve Service names?
- (Interview) NetworkPolicy rules are additive whitelists. Explain how multiple policies selecting the same Pod combine, and the implication for designing a policy set. (Hint: union of allows, no explicit deny.)
Answers
- Ingress rules govern inbound connections to the selected Pods (who may connect to them); egress rules govern outbound connections from the selected Pods (what they may connect to).
- No. NetworkPolicies are whitelist-only — you can only specify allowed traffic. Denial is implicit: once any policy selects a Pod for a direction, anything not explicitly allowed is denied. There is no "deny this specific peer" rule.
- An egress policy makes the Pod default-deny for outbound traffic, so unless you explicitly allow egress to CoreDNS on port 53 (UDP/TCP), DNS queries are dropped and name resolution fails — breaking Service discovery even though the app and Services are fine.
- When several policies select the same Pod, their allowed peers/ports are unioned — traffic is permitted if any applicable policy allows it. Because there's no explicit deny, you cannot subtract permissions with another policy; you can only widen what's allowed. Implication: design policies as the complete set of things each workload should be allowed to do, and rely on a default-deny baseline to forbid everything else — adding a permissive policy can only loosen, never tighten.
Pod selector and namespace selector
Theory
NetworkPolicy peers are chosen with selectors, and understanding how they combine is the crux of writing correct policies:
- podSelector (inside from/to) matches Pods by label within the policy's own namespace (unless combined with a namespaceSelector).
- namespaceSelector matches entire namespaces by their labels, allowing traffic from/to all Pods in those namespaces.
- When podSelector and namespaceSelector appear in the same from/to element (same list item), they are ANDed: "Pods matching X in namespaces matching Y." When they're in separate list items, they are ORed.
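This AND/OR distinction is the single most common source of NetworkPolicy bugs. Note also that namespace selection relies on namespace labels: there is no field for selecting a namespace by name. To target one namespace by name, match the kubernetes.io/metadata.name label, which the control plane sets automatically on every namespace in current releases; on older clusters without it, you must label the namespaces yourself.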
Example
ingress:
  # AND: Pods labeled app=frontend, only from namespaces labeled team=web
  - from:
      - namespaceSelector: { matchLabels: { team: web } }
        podSelector: { matchLabels: { app: frontend } }
  # vs. OR: any Pod in team=web namespaces, OR Pods labeled app=frontend in the policy's own namespace
  - from:
      - namespaceSelector: { matchLabels: { team: web } }
      - podSelector: { matchLabels: { app: frontend } }
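Selecting a single namespace by name uses the same AND form with the built-in name label. The sketch below is a complete policy under assumed names: the monitoring namespace, the app=prometheus label, and port 8080 are illustrative.

```yaml
# Sketch: allow ingress from Prometheus Pods in the namespace named "monitoring".
# The namespace, labels, and port are illustrative.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata: { name: allow-metrics-scrape, namespace: app }
spec:
  podSelector: { matchLabels: { app: api } }
  policyTypes: [ Ingress ]
  ingress:
    - from:
        - namespaceSelector:
            matchLabels:
              kubernetes.io/metadata.name: monitoring   # built-in namespace-name label
          podSelector:
            matchLabels: { app: prometheus }            # ANDed with the namespace match
      ports:
        - { protocol: TCP, port: 8080 }                 # assumed metrics port on the api Pods
```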
Exercises
- (Beginner) By default, podSelector inside a from block matches Pods in which namespace?
- (Beginner) What does namespaceSelector match on?
- (Intermediate) Explain the difference in meaning between combining podSelector and namespaceSelector in the same list item versus separate list items.
- (Interview) A policy intended to allow "frontend Pods in the web namespace" is accidentally allowing all Pods in the web namespace as well as all frontend Pods in the policy's own namespace. What is the likely YAML mistake? (Hint: AND vs. OR via list structure.)
Answers
- The policy's own namespace (the namespace where the NetworkPolicy is defined), unless a namespaceSelector is also specified.
- Namespace labels: it selects whole namespaces whose labels match (e.g., kubernetes.io/metadata.name or custom labels), allowing traffic from/to all Pods in those namespaces.
- In the same from/to list element, podSelector and namespaceSelector are ANDed ("Pods matching the podSelector that are in namespaces matching the namespaceSelector"). In separate list elements they are ORed ("any Pod in matching namespaces" OR "Pods matching the podSelector in the policy's namespace").
- The two selectors were placed as separate list items (two - entries) instead of as two keys under a single list item. That makes them ORed, allowing all Pods in web namespaces and, separately, all frontend Pods in the policy's own namespace. The fix is to put namespaceSelector and podSelector under the same list item so they're ANDed.
Network policy enforcement with Calico/Cilium
Theory
A subtle but vital point: the NetworkPolicy resource is part of the Kubernetes API, but enforcement is done by the CNI plugin — and not all CNIs enforce NetworkPolicy. If your CNI doesn't (e.g., plain Flannel), you can create NetworkPolicy objects and they will be silently ignored, giving a false sense of security. You must run a policy-capable CNI such as Calico or Cilium (or others like Antrea, Weave Net).
Beyond the standard API, these CNIs offer extended policy capabilities via their own CRDs:
- Calico: GlobalNetworkPolicy/NetworkPolicy with ordering, explicit deny/allow, richer selectors, and cluster-wide scope.
- Cilium: CiliumNetworkPolicy/CiliumClusterwideNetworkPolicy with identity-based and L7-aware rules (e.g., allow only GET /api HTTP calls), powered by eBPF.
The takeaway: choose a CNI that enforces policy, and reach for vendor CRDs when standard NetworkPolicy (L3/L4 only) is insufficient.
Example
# Cilium L7 policy: allow frontend to call api ONLY on GET /api/* (HTTP-aware)
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata: { name: l7-api }
spec:
  endpointSelector: { matchLabels: { app: api } }
  ingress:
    - fromEndpoints:
        - matchLabels: { app: frontend }
      toPorts:
        - ports: [ { port: "8080", protocol: TCP } ]
          rules:
            http:
              - { method: "GET", path: "/api/.*" }   # L7 rule, beyond standard NetworkPolicy
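For the Calico side, a comparable sketch shows the two features the standard API lacks: an explicit Deny action and an evaluation order. It assumes Calico is the CNI and that its v3 resources can be applied (through calicoctl or the Calico API server); the policy name, labels, and order value are illustrative.

```yaml
# Sketch: a Calico cluster-wide policy with an explicit Deny and an evaluation order.
# Assumes Calico is the CNI; labels and the order value are illustrative.
apiVersion: projectcalico.org/v3
kind: GlobalNetworkPolicy
metadata:
  name: deny-untrusted-to-db
spec:
  order: 100                       # lower order values are evaluated first
  selector: app == 'db'            # Calico selector expression, applies in every namespace
  types: [ Ingress ]
  ingress:
    - action: Deny                 # explicit deny, not expressible in standard NetworkPolicy
      source:
        selector: trust == 'untrusted'
    - action: Allow
      protocol: TCP
      source:
        selector: app == 'api'
      destination:
        ports: [ 5432 ]
```

Rules within a policy are evaluated in order, so the Deny takes effect before the Allow for any source that matches both.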
Exercises
- (Beginner) Which component actually enforces NetworkPolicy?
- (Beginner) What happens if you create NetworkPolicy objects on a CNI that doesn't support them?
- (Intermediate) Name a capability that Cilium's CiliumNetworkPolicy offers that the standard NetworkPolicy API does not.
- (Interview) A security audit finds NetworkPolicies defined but traffic flowing freely against them. What is the most likely root cause, and how would you confirm and fix it? (Hint: non-enforcing CNI.)
Answers
- The CNI plugin (the network dataplane), not the Kubernetes control plane — the API only stores the policy objects.
- They are accepted and stored by the API server but silently ignored (not enforced), so all traffic flows as if no policy existed — a dangerous false sense of security.
- L7/HTTP-aware rules (e.g., allow only specific HTTP methods/paths), identity-based policy, DNS-aware egress, and clusterwide policies — capabilities beyond the standard API's L3/L4 (IP/port) scope.
- The most likely cause is a CNI that does not enforce NetworkPolicy (e.g., plain Flannel). Confirm by checking which CNI is installed and whether it advertises policy support (and test connectivity that a policy should block). Fix by switching to or adding a policy-capable CNI such as Calico or Cilium (or a policy add-on), then verify the previously-defined policies now actually block traffic.
6.5 DNS in Kubernetes
DNS underpins service discovery; understanding CoreDNS configuration and the records it serves helps you debug and customize name resolution. This subchapter covers DNS specifics.
CoreDNS configuration
Theory
CoreDNS is the cluster DNS server (introduced in Chapter 2). Its behavior is defined by the Corefile, stored in the coredns ConfigMap in kube-system. The Corefile is a chain of plugins processed per query: the kubernetes plugin answers cluster records (Services/Pods) by watching the API server; forward sends non-cluster queries to upstream resolvers; cache caches responses; and others (errors, health, ready, loop, reload) provide operational features.
Customizing CoreDNS is done by editing this ConfigMap — for example, to add stub domains (forward a specific domain to a specific DNS server), rewrite queries, or tune caching. CoreDNS picks up changes automatically (the reload plugin). Because all in-cluster name resolution depends on it, CoreDNS configuration and health are central to cluster operability.
Example
# coredns ConfigMap (Corefile):
.:53 {
    errors
    health
    ready
    kubernetes cluster.local in-addr.arpa ip6.arpa {   # cluster records
        pods insecure
        fallthrough in-addr.arpa ip6.arpa
    }
    forward . /etc/resolv.conf     # upstream for external names
    cache 30
    loop
    reload                         # auto-reload on ConfigMap change
}
# Custom stub domain example: forward acme.internal to a specific server
acme.internal:53 {
    forward . 10.10.0.53
}
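The Corefile above is stored as a single key inside the ConfigMap. The sketch below shows that wrapper together with two of the customizations mentioned earlier: a query rewrite and a longer cache TTL. It is trimmed for illustration, so merge changes into your cluster's existing Corefile rather than replacing it; the rewritten names are hypothetical.

```yaml
# Sketch: where the Corefile lives, plus a rewrite and a longer cache TTL.
# Trimmed for illustration; the rewritten names are hypothetical.
apiVersion: v1
kind: ConfigMap
metadata:
  name: coredns
  namespace: kube-system
data:
  Corefile: |
    .:53 {
        errors
        health
        ready
        rewrite name legacy-db.example.com db.data.svc.cluster.local
        kubernetes cluster.local in-addr.arpa ip6.arpa {
            pods insecure
            fallthrough in-addr.arpa ip6.arpa
        }
        forward . /etc/resolv.conf
        cache 60
        loop
        reload
    }
```

With the reload plugin present, CoreDNS picks up the edited ConfigMap on its own after a short delay, so no restart is needed.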
Exercises
- (Beginner) Where is CoreDNS's configuration stored?
- (Beginner) What does the forward plugin do?
- (Intermediate) How would you make CoreDNS resolve a custom internal domain via a specific external DNS server?
- (Interview) The Corefile is a plugin chain processed per query. Explain how the kubernetes, forward, and cache plugins cooperate to resolve both cluster and external names efficiently. (Hint: cluster records vs. upstream, caching layer.)
Answers
- In the coredns ConfigMap (the Corefile) in the kube-system namespace.
- It forwards DNS queries that CoreDNS isn't authoritative for (e.g., external/internet names) to upstream resolvers (commonly the node's /etc/resolv.conf).
- Add a server block / stub domain to the Corefile (via the coredns ConfigMap), e.g. acme.internal:53 { forward . <dns-server-ip> }, so queries for that domain are sent to the specified server. CoreDNS reloads the change automatically.
- For each query, the chain is evaluated: the cache plugin first serves a cached answer if present (fast path). On a miss, the kubernetes plugin answers names in the cluster domain (*.svc.cluster.local, Pod records) from its API-server-watched data; anything outside the cluster domain falls through to forward, which queries upstream resolvers. The response is then cached for subsequent queries. This split lets cluster names resolve locally and instantly from live data while external names are delegated upstream, with caching reducing load and latency for both.
Service DNS records
Theory
CoreDNS publishes predictable DNS records for Services, which is the backbone of service discovery:
- Normal (ClusterIP) Service: an A/AAAA record at <service>.<namespace>.svc.cluster.local resolving to the Service's ClusterIP.
- Headless Service: A/AAAA records returning the individual Pod IPs (one per ready Pod), plus per-Pod records when backed by a StatefulSet.
- SRV records: for named ports, _<port-name>._<protocol>.<service>.<namespace>.svc.cluster.local lets clients discover port numbers, not just addresses.
- ExternalName Service: a CNAME to the external host.
These records update as Services and endpoints change. The consistent naming scheme means applications and operators can predict any Service's address from its name and namespace — no service registry to query manually.
Example
ClusterIP svc "api" in ns "shop":
  api.shop.svc.cluster.local   A   10.96.10.20
Headless svc "db" in ns "data":
  db.data.svc.cluster.local    A   10.244.1.5
                               A   10.244.2.7
Named port "grpc" (SRV):
  _grpc._tcp.api.shop.svc.cluster.local   SRV   0 100 50051 api.shop.svc.cluster.local
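Because the naming scheme is deterministic, the record names can be computed from a Service's name and namespace alone. The helper names below (service_fqdn, srv_name) are hypothetical, and the sketch assumes the default cluster domain cluster.local, which a cluster may override.

```python
# Build the DNS names CoreDNS is expected to publish for a Service.
# Assumes the default cluster domain; the helper names are illustrative only.
CLUSTER_DOMAIN = "cluster.local"


def service_fqdn(service, namespace, cluster_domain=CLUSTER_DOMAIN):
    """A/AAAA record name for a Service."""
    return f"{service}.{namespace}.svc.{cluster_domain}"


def srv_name(port_name, protocol, service, namespace, cluster_domain=CLUSTER_DOMAIN):
    """SRV record name for a named port on a Service."""
    return f"_{port_name}._{protocol}.{service_fqdn(service, namespace, cluster_domain)}"


api_name = service_fqdn("api", "shop")
grpc_srv = srv_name("grpc", "tcp", "api", "shop")
```

These two values correspond to the ClusterIP and SRV record names in the example above.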
Exercises
- (Beginner) What does the A record of a normal ClusterIP Service resolve to?
- (Beginner) How does the DNS result for a headless Service differ from a ClusterIP Service?
- (Intermediate) What are SRV records used for in Kubernetes DNS?
- (Interview) Explain how the predictable Service DNS naming scheme eliminates the need for a separate service-registry lookup in application code. (Hint: name+namespace deterministically maps to an address.)
Answers
- The Service's stable ClusterIP (virtual IP).
- A ClusterIP Service's A record returns a single virtual IP; a headless Service's A records return the individual Pod IPs of all ready backends (no VIP), enabling direct per-Pod addressing.
- SRV records expose named ports (and the associated host) so clients can discover the port number/protocol for a service, e.g., _grpc._tcp.<svc>.<ns>.svc.cluster.local, rather than hard-coding ports.
- Every Service has a deterministic name <service>.<namespace>.svc.cluster.local that CoreDNS resolves to its current address, kept up to date as endpoints change. Applications simply use that name; the platform's DNS is the registry. There's no need to call a discovery API or maintain client-side registries: knowing a service's name and namespace is enough to reach it, and resolution always reflects the live set of healthy backends.
Pod DNS records
Theory
Beyond Services, individual Pods can also have DNS records. By default, a Pod gets an A record of the form <pod-ip-with-dashes>.<namespace>.pod.cluster.local (e.g., 10-244-1-5.default.pod.cluster.local) — rarely used directly. More useful are the StatefulSet per-Pod records via a headless Service: <pod-name>.<service>.<namespace>.svc.cluster.local (e.g., db-0.db.data.svc.cluster.local), which give stable, predictable names to individual stateful Pods.
A Pod's own DNS identity can be customized with hostname and subdomain fields: setting subdomain to a headless Service's name yields a stable record <hostname>.<subdomain>.<namespace>.svc.cluster.local. This is mostly relevant for clustered/stateful applications where members must address each other by stable names. For ordinary stateless Pods, you use the fronting Service's DNS, not per-Pod records.
Example
apiVersion: v1
kind: Pod
metadata:
  name: worker
  labels: { app: worker }
spec:
  hostname: worker-1        # sets the Pod's hostname
  subdomain: workers        # must match a headless Service named "workers"
  containers:
    - name: app
      image: myapp:1.0
---
apiVersion: v1
kind: Service
metadata: { name: workers }
spec:
  clusterIP: None           # headless, enables per-Pod records
  selector: { app: worker }
  ports: [ { port: 80 } ]
# Resulting record: worker-1.workers.<namespace>.svc.cluster.local
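The two per-Pod name formats can be derived mechanically, as the following sketch shows. The function names are hypothetical, only IPv4 Pod addresses are handled, and the default cluster domain cluster.local is assumed.

```python
import ipaddress


def pod_ip_record(pod_ip, namespace, cluster_domain="cluster.local"):
    """IP-based Pod record: dots in the IPv4 address become dashes."""
    ip = ipaddress.ip_address(pod_ip)      # raises ValueError for malformed input
    if ip.version != 4:
        raise ValueError("this sketch only handles IPv4 Pod addresses")
    return f"{str(ip).replace('.', '-')}.{namespace}.pod.{cluster_domain}"


def stable_pod_record(hostname, subdomain, namespace, cluster_domain="cluster.local"):
    """Stable per-Pod record under a headless Service (the Pod's subdomain)."""
    return f"{hostname}.{subdomain}.{namespace}.svc.{cluster_domain}"


ip_record = pod_ip_record("10.244.1.5", "default")
worker_record = stable_pod_record("worker-1", "workers", "default")
```

For a StatefulSet, hostname is the Pod name (for example db-0) and subdomain is the governing headless Service, so the same function yields db-0.db.data.svc.cluster.local.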
Exercises
- (Beginner) What is the default DNS record format for an individual Pod?
- (Beginner) Which workload type makes per-Pod DNS names genuinely useful, and what is the name format?
- (Intermediate) How do the hostname and subdomain fields create a stable per-Pod DNS name?
- (Interview) Why do stateless application Pods rarely need per-Pod DNS records, while clustered databases rely on them? (Hint: interchangeable replicas vs. members addressing specific peers.)
Answers
- <pod-ip-with-dashes>.<namespace>.pod.cluster.local (e.g., 10-244-1-5.default.pod.cluster.local).
- StatefulSets (with a headless governing Service), giving stable names of the form <pod-name>.<service>.<namespace>.svc.cluster.local (e.g., db-0.db.data.svc.cluster.local).
- Setting a Pod's subdomain to the name of a headless Service (and hostname to a desired name) makes CoreDNS publish <hostname>.<subdomain>.<namespace>.svc.cluster.local for that Pod, giving it a stable, resolvable per-Pod DNS name tied to the headless Service.
- Stateless Pods are interchangeable; clients reach them collectively through a single Service name/VIP, so individual Pod identity is irrelevant and per-Pod DNS adds nothing. Clustered databases have members that must contact specific peers (for replication, quorum, membership) and must recognize a restarted member as the same node, which requires stable, individually addressable DNS names: exactly what per-Pod records (via StatefulSet + headless Service) provide.
Custom DNS policies
Theory
Each Pod has a dnsPolicy that controls how its DNS resolution is configured, plus an optional dnsConfig for fine-grained customization. The policies:
| dnsPolicy | Behavior |
|---|---|
| ClusterFirst (default) | Cluster DNS (CoreDNS) first; non-cluster names forwarded upstream. |
| Default | Inherit the node's /etc/resolv.conf (not cluster DNS). |
| ClusterFirstWithHostNet | Like ClusterFirst, but required when the Pod uses hostNetwork: true. |
| None | Ignore defaults entirely; use only what you provide in dnsConfig. |
A common gotcha: a Pod with hostNetwork: true and the default ClusterFirst will not use cluster DNS — you must set ClusterFirstWithHostNet for it to resolve Service names. dnsConfig lets you add custom nameservers, searches, and options (e.g., tuning ndots, which affects how short names are expanded and can have performance implications).
Example
spec:
  dnsPolicy: "None"                    # use ONLY the dnsConfig below
  dnsConfig:
    nameservers:
      - 10.96.0.10                     # custom resolver(s)
    searches:
      - svc.cluster.local
      - example.com
    options:
      - { name: ndots, value: "2" }    # tune short-name expansion
---
spec:
  hostNetwork: true
  dnsPolicy: ClusterFirstWithHostNet   # needed so hostNetwork Pods use cluster DNS
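The effect of ndots is easiest to see by listing the names a stub resolver would try. The sketch below follows the commonly documented glibc-style ordering; the function name and the search list are illustrative assumptions, and other resolver implementations (musl, for instance) can behave differently.

```python
def candidate_queries(name, search_domains, ndots):
    """Names a glibc-style stub resolver tries, in order, until one resolves."""
    if name.endswith("."):
        # A trailing dot marks the name as fully qualified: no search expansion.
        return [name.rstrip(".")]
    expanded = [f"{name}.{domain}" for domain in search_domains]
    if name.count(".") >= ndots:
        return [name] + expanded      # enough dots: try the name as-is first
    return expanded + [name]          # too few dots: search suffixes first


# Hypothetical search list for a Pod in namespace "shop".
search = ["shop.svc.cluster.local", "svc.cluster.local", "cluster.local"]
with_ndots_5 = candidate_queries("api.example.com", search, 5)
with_ndots_2 = candidate_queries("api.example.com", search, 2)
```

With ndots 5, the external name is only tried as-is after three cluster-suffixed lookups have failed; with ndots 2, or with a trailing dot, it is tried first.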
Exercises
- (Beginner) What is the default dnsPolicy for a Pod?
- (Beginner) Which dnsPolicy makes a Pod use the node's resolv.conf instead of cluster DNS?
- (Intermediate) A hostNetwork: true Pod can't resolve Service names. What is the fix?
- (Interview) What does the ndots option control, and why can a high ndots value cause DNS performance problems for short external names? (Hint: search-domain expansion attempts.)
Answers
- ClusterFirst.
- Default (it inherits the node's /etc/resolv.conf).
- Set dnsPolicy: ClusterFirstWithHostNet. With hostNetwork: true, the default ClusterFirst doesn't use cluster DNS, so Service names fail to resolve; the ...WithHostNet policy restores cluster DNS resolution for host-network Pods.
- ndots is the threshold of dots in a name below which the resolver tries the search domains before treating the name as absolute. With the typical ndots: 5, a name with fewer than 5 dots (e.g., an external api.example.com, 2 dots) is first tried against each cluster search suffix (...svc.cluster.local, etc.), generating several failing lookups before the absolute query succeeds. A high ndots thus multiplies DNS queries for short external names, adding latency and load; lowering ndots (or using FQDNs with a trailing dot) reduces these wasteful lookups.
7. Storage
Containers are ephemeral — when a container restarts, anything written to its writable layer is lost. Real applications need durable storage, shared scratch space, and a way to consume cloud disks without hard-coding vendor details. Kubernetes addresses this with a layered storage model: Volumes attach storage to Pods, PersistentVolumes/Claims decouple storage provisioning from consumption, StorageClasses automate provisioning, and the CSI standard makes any storage system pluggable. This chapter builds that model.
7.1 Volumes
A Volume is the most basic storage abstraction — a directory accessible to a Pod's containers, backed by some medium. This subchapter covers the common volume types and their lifecycles.
Volume types: emptyDir, hostPath, configMap, secret
Theory
A Volume is a directory, possibly with data, that is accessible to the containers in a Pod. Unlike a container's own ephemeral filesystem (lost on restart), a volume can outlive container restarts within the Pod and can be shared between containers in the Pod. The volume's backing medium is what its type defines:
- emptyDir: an empty directory created when the Pod is assigned to a node, used for scratch space or sharing files between containers in the Pod. It is deleted when the Pod is removed. Can be backed by disk or RAM (medium: Memory).
- hostPath: mounts a file or directory from the node's filesystem into the Pod. Powerful but dangerous (ties the Pod to a node, can access host files); mostly for node-level system Pods (DaemonSets).
- configMap / secret: project a ConfigMap's or Secret's keys as files (covered in Chapter 5).
- (Plus persistentVolumeClaim, projected, downwardAPI, and CSI volumes covered elsewhere.)
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - { name: scratch, mountPath: /cache }     # shared scratch
        - { name: cfg, mountPath: /etc/cfg }       # config files
        - { name: hostlogs, mountPath: /host/var/log, readOnly: true }  # node logs (example path)
  volumes:
    - name: scratch
      emptyDir: {}                                 # ephemeral, lives with the Pod
    - name: cfg
      configMap: { name: app-config }              # keys as files
    - name: hostlogs
      hostPath:                                    # node filesystem (use sparingly)
        path: /var/log
        type: Directory
Exercises
- (Beginner) What is the lifespan of an emptyDir volume?
- (Beginner) What does a hostPath volume mount into the Pod, and why is it risky?
- (Intermediate) Two containers in one Pod need to share files. Which volume type fits, and why does it work across container restarts?
- (Interview) Why is hostPath discouraged for general application workloads but acceptable for certain DaemonSets? (Hint: node coupling, security, portability vs. node-level agents.)
Answers
- It is created when the Pod is assigned to a node and exists as long as that Pod runs on the node; it is deleted permanently when the Pod is removed (it does survive container restarts within the Pod).
- It mounts a file/directory from the node's own filesystem into the Pod. It's risky because it couples the Pod to that specific node, can expose or modify sensitive host files, and breaks portability and isolation.
- An emptyDir volume mounted into both containers. It is tied to the Pod (not to a single container), so it persists across individual container restarts and both containers see the same directory, enabling file sharing.
- For general apps, hostPath ties the Pod to a particular node's filesystem (breaking rescheduling/portability) and is a security risk (host access), so it's discouraged. It's acceptable for node-level DaemonSets (log collectors, monitoring, CNI/CSI agents) whose entire purpose is to interact with the node's filesystem (/var/log, /proc, runtime sockets) and which are intentionally pinned one-per-node with elevated access.
Volume lifecycle and pod coupling
Theory
The defining characteristic of a (non-persistent) Volume is that its lifecycle is coupled to the Pod: it is not tied to individual containers, and it is not durable beyond the Pod. When the Pod is created, its volumes are set up; when containers restart, the volumes persist (so data survives a crash-restart); but when the Pod is deleted, ephemeral volumes (like emptyDir) and their data are gone.
This is the key distinction that motivates PersistentVolumes (next subchapter): if you need data to survive Pod deletion, rescheduling to another node, or to be independent of any Pod, a plain Volume is insufficient — you need a PersistentVolume whose lifecycle is independent of any Pod. Understanding "what survives what" — container restart vs. Pod deletion vs. node failure — is essential to choosing the right storage.
Example
Event                          emptyDir data?   PV (PVC) data?
---------------------------------------------------------------
Container crashes & restarts   Survives         Survives
Pod deleted / rescheduled      Lost             Survives (re-attaches)
Node fails                     Lost             Survives (on network storage)
# Demonstrate emptyDir survives container restart but not Pod deletion:
kubectl exec mypod -c app -- sh -c 'echo hi > /cache/x'
kubectl exec mypod -c app -- kill 1   # if PID 1 exits, the container restarts; /cache/x is still there
kubectl delete pod mypod              # once the Pod is recreated, /cache is empty again
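The "what survives what" table can also be written as a small lookup, which makes it easy to ask which kind of volume is sufficient for a given set of failure events. The event and volume labels below are illustrative names rather than Kubernetes API values, and "network_pv" assumes a PersistentVolume on storage that is reachable from other nodes.

```python
# Survival matrix from the table: True means the data is still there afterwards.
SURVIVES = {
    "emptyDir":   {"container_restart": True, "pod_deleted": False, "node_failure": False},
    "network_pv": {"container_restart": True, "pod_deleted": True,  "node_failure": True},
}


def data_survives(volume_kind, event):
    """Look up whether data on this kind of volume survives the event."""
    return SURVIVES[volume_kind][event]


def simplest_volume_for(events):
    """Return the simplest volume kind whose data survives every listed event."""
    for kind in ("emptyDir", "network_pv"):   # ordered from simplest to most durable
        if all(SURVIVES[kind][event] for event in events):
            return kind
    return None


scratch_choice = simplest_volume_for(["container_restart"])
database_choice = simplest_volume_for(["container_restart", "pod_deleted", "node_failure"])
```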
Exercises
- (Beginner) Does an emptyDir volume survive a container restart? Does it survive Pod deletion?
- (Beginner) To which object's lifecycle is a plain Volume coupled?
- (Intermediate) Construct a table of what data survives: container restart, Pod reschedule, node failure — for emptyDir vs. a network-backed PV.
- (Interview) Why does the Pod-coupled lifecycle of ordinary Volumes motivate the existence of PersistentVolumes? (Hint: independence from any single Pod.)
Answers
- It survives a container restart (it's tied to the Pod, not the container) but is lost on Pod deletion.
- The Pod's lifecycle (it is created with the Pod and destroyed when the Pod is removed).
- emptyDir: container restart = survives, Pod reschedule/deletion = lost, node failure = lost. Network-backed PV: container restart = survives, Pod reschedule = survives (re-attaches to the new Pod), node failure = survives (data lives on external storage, can be re-mounted elsewhere).
- Because ordinary Volumes vanish when their Pod does, they can't provide durable or relocatable storage. Stateful apps need data that persists across Pod deletion, rescheduling, and node failure, and that exists as a first-class object independent of any Pod. PersistentVolumes provide exactly that decoupling — storage with its own lifecycle, claimed by Pods via PVCs but outliving them.
Projected volumes
Theory
A projected volume combines several volume sources into a single directory. Instead of mounting a ConfigMap, a Secret, and Downward API info at three different paths, a projected volume lets you assemble keys from all of them into one unified mount, each at a path you choose. The four core sources are configMap, secret, downwardAPI, and serviceAccountToken (recent Kubernetes releases add further sources, such as clusterTrustBundle).
The most important modern use is the serviceAccountToken projected source: it requests a short-lived, audience-scoped, auto-rotated ServiceAccount token (a "bound" token) and projects it as a file. This is the secure replacement for the old long-lived ServiceAccount token Secrets — the token has an expiry, is automatically refreshed by the kubelet, and is bound to a specific audience and the Pod's lifetime. Projected volumes thus serve both convenience (unify sources) and security (modern token delivery).
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - { name: all-in-one, mountPath: /etc/config }
  volumes:
    - name: all-in-one
      projected:
        sources:
          - configMap: { name: app-config }
          - secret: { name: db-creds }
          - downwardAPI:
              items:
                - { path: "labels", fieldRef: { fieldPath: metadata.labels } }
          - serviceAccountToken:        # short-lived, audience-bound token
              path: token
              audience: vault
              expirationSeconds: 3600
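To make "auto-rotated" concrete, the sketch below encodes the rotation rule as the Kubernetes documentation describes it, to the best of my understanding: the kubelet refreshes a projected token once it is older than 80% of its lifetime or older than 24 hours. The function name and parameters are hypothetical, and the thresholds should be treated as an assumption to verify against your Kubernetes version.

```python
def should_rotate(age_seconds, ttl_seconds, max_age_seconds=24 * 3600, threshold=0.8):
    """True when a projected token is due for refresh under the assumed kubelet rule."""
    past_lifetime_share = age_seconds > threshold * ttl_seconds   # older than 80% of TTL
    past_max_age = age_seconds > max_age_seconds                  # older than 24 hours
    return past_lifetime_share or past_max_age


# With expirationSeconds: 3600 the refresh point is at 2880 seconds (48 minutes).
fresh = should_rotate(age_seconds=600, ttl_seconds=3600)
due = should_rotate(age_seconds=3000, ttl_seconds=3600)
```

An application reading the token file should therefore re-read it periodically instead of caching its contents for the life of the process.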
Exercises
- (Beginner) What does a projected volume let you do with multiple sources?
- (Beginner) Name the four sources a projected volume supports.
- (Intermediate) Why is a serviceAccountToken projected volume more secure than mounting a legacy ServiceAccount token Secret?
- (Interview) Describe two distinct benefits (one convenience, one security) that projected volumes provide. (Hint: unified mount; bound, rotated tokens.)
Answers
- Combine keys/data from several volume sources into a single mounted directory, each item at a chosen path.
- configMap, secret, downwardAPI, and serviceAccountToken.
- The projected token is short-lived (has an expirationSeconds), automatically rotated by the kubelet before expiry, and bound to a specific audience (and the Pod's lifetime). A legacy token Secret is long-lived, non-expiring, not audience-scoped, and stored in etcd, so if leaked it grants indefinite, broad access. The bound token drastically limits exposure.
- Convenience: it unifies multiple configuration/secret/metadata sources into one directory instead of several separate mounts, simplifying the container's view of its config. Security: the serviceAccountToken source delivers short-lived, audience-bound, auto-rotated tokens (instead of permanent token Secrets), shrinking the blast radius and lifetime of credentials.
CSI volume plugins
Theory
Originally, support for each storage system (AWS EBS, GCE PD, NFS, Ceph) was compiled into Kubernetes ("in-tree" volume plugins), which bloated the codebase and tied storage features to Kubernetes releases. The Container Storage Interface (CSI) is the standard (like CRI for runtimes and CNI for networking) that lets storage vendors write out-of-tree drivers implementing a common gRPC interface. Kubernetes then talks to any CSI driver generically.
A CSI volume is consumed in a Pod either indirectly (via a PVC bound to a StorageClass that uses a CSI provisioner; this is the common path, covered later) or directly via the csi volume source for special cases. The migration to CSI is essentially complete for the cloud-provider plugins: the in-tree implementations for volumes such as AWS EBS and GCE PD have been deprecated or removed in favor of CSI drivers. The practical takeaway: nearly all real storage in modern clusters flows through CSI drivers, even when you only ever interact with PVCs and StorageClasses.
Example
# A StorageClass referencing a CSI driver (the usual indirect path):
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: fast }
provisioner: ebs.csi.aws.com # a CSI driver, not an in-tree plugin
parameters: { type: gp3 }
# CSI drivers register themselves; list installed ones:
kubectl get csidrivers
# NAME ATTACHREQUIRED PODINFOONMOUNT AGE
# ebs.csi.aws.com true false 30d
Exercises
- (Beginner) What problem did "in-tree" volume plugins create, and what replaced them?
- (Beginner) How do you list the CSI drivers installed in a cluster?
- (Intermediate) In the common case, how does a Pod end up using a CSI driver without referencing it directly?
- (Interview) CSI follows the same philosophy as CRI and CNI. State that philosophy and explain why moving storage drivers out-of-tree benefits both Kubernetes and storage vendors. (Hint: independent release cadence, leaner core.)
Answers
- In-tree plugins compiled every storage integration into Kubernetes itself, bloating the codebase and coupling storage features/bug-fixes to Kubernetes release cycles. The CSI standard (out-of-tree drivers) replaced them.
- kubectl get csidrivers.
- The Pod uses a PVC; the PVC references a StorageClass whose provisioner is a CSI driver. When the PVC is created, the CSI driver dynamically provisions the volume and Kubernetes attaches/mounts it, so the Pod consumes CSI-backed storage by referencing only the PVC.
- The philosophy is a standard interface with swappable, out-of-tree implementations. Vendors implement the CSI gRPC contract and ship/version/patch their drivers independently of Kubernetes releases, adding features on their own schedule; Kubernetes keeps a lean, vendor-neutral core that talks to any compliant driver. Both win: vendors get release independence and faster iteration, and Kubernetes avoids carrying and maintaining storage-specific code.
7.2 Persistent Volumes
PersistentVolumes and PersistentVolumeClaims decouple how storage is provided from how it is consumed — the central abstraction for durable storage in Kubernetes. This subchapter covers the model in depth.
PersistentVolume (PV) resource
Theory
A PersistentVolume (PV) is a piece of storage in the cluster, provisioned either by an administrator (statically) or dynamically by a StorageClass. Crucially, a PV is a cluster-scoped resource with its own lifecycle, independent of any Pod. It represents the actual storage — a cloud disk, an NFS export, a Ceph volume — abstracted as a Kubernetes object with a capacity, access modes, and a reclaim policy.
The PV is the "supply" side of Kubernetes storage. It exists whether or not any Pod is using it, surviving Pod deletion and rescheduling. This independence is exactly what plain Volumes lack and what stateful workloads require. Administrators (or dynamic provisioners) create PVs; users don't reference PVs directly — they request storage via a PersistentVolumeClaim (next), which binds to a suitable PV. This separation is the producer/consumer split at the heart of the storage subsystem.
Example
apiVersion: v1
kind: PersistentVolume
metadata: { name: pv-nfs-100g }
spec:
  capacity: { storage: 100Gi }
  accessModes: [ ReadWriteMany ]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: nfs
  nfs:                      # the actual backing storage
    server: 10.0.0.5
    path: /exports/data
kubectl get pv
# NAME          CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS      CLAIM
# pv-nfs-100g   100Gi      RWX            Retain           Available   <none>
Exercises
- (Beginner) Is a PersistentVolume scoped to a namespace or to the cluster?
- (Beginner) Whose lifecycle is a PV independent of?
- (Intermediate) Who creates PVs, and how do end users typically obtain storage rather than referencing a PV directly?
- (Interview) Explain how the PV represents the "supply" side of a producer/consumer model for storage and why that separation is useful. (Hint: admins/provisioners supply, users claim.)
Answers
- Cluster-scoped (PVs are not namespaced).
- Independent of any Pod's lifecycle — it persists across Pod deletion and rescheduling.
- Administrators create PVs statically, or a StorageClass's provisioner creates them dynamically. End users don't reference PVs directly; they create a PersistentVolumeClaim requesting size/access mode/class, which binds to a matching PV.
- The PV is the supply: administrators or dynamic provisioners create PVs representing real storage, decoupled from any consumer. Users express demand via PVCs (the consumer side), and Kubernetes binds claims to suitable PVs. This separation lets storage be provisioned, sized, and managed independently of workloads; users request abstract storage without knowing backend details, and admins control the available storage pool — clean division of responsibility and portability across environments.
PersistentVolumeClaim (PVC) resource
Theory
A PersistentVolumeClaim (PVC) is a request for storage by a user — the "demand" side. It is a namespaced object specifying how much storage is needed, the required access mode(s), and optionally a StorageClass. Kubernetes matches the PVC to a suitable PV (static binding) or triggers dynamic provisioning of a new PV (via the StorageClass). Once bound, the PVC is referenced by a Pod's volume, and the Pod gets the underlying storage.
The PVC is the abstraction that lets application authors request durable storage without knowing the backend: they ask for "20Gi, ReadWriteOnce, fast class," and the platform fulfills it with whatever storage backs that class (EBS, PD, NFS...). This is the user-facing storage primitive — you almost always create PVCs (often via a StatefulSet's volumeClaimTemplates), and the PV is provisioned for you behind the scenes.
Example
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: data, namespace: app }
spec:
  accessModes: [ ReadWriteOnce ]
  storageClassName: fast          # triggers dynamic provisioning
  resources:
    requests: { storage: 20Gi }
---
# A Pod consumes the PVC by name:
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts: [ { name: vol, mountPath: /data } ]
  volumes:
    - name: vol
      persistentVolumeClaim: { claimName: data }
kubectl get pvc -n app
# NAME   STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS
# data   Bound    pvc-9f..   20Gi       RWO            fast
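For the static-binding case, the matching step can be approximated as "find an Available PV of the same class that offers the requested access modes and at least the requested capacity". The sketch below is a simplified model, not the controller's actual code: the dictionary layout, helper names, and sample PVs are hypothetical, label selectors and volume modes are ignored, and preferring the smallest fitting volume is my understanding of the binder's behavior.

```python
UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def to_bytes(quantity):
    """Parse a binary-suffixed quantity such as '20Gi' (a subset of the real syntax)."""
    for suffix, factor in UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)


def find_match(pvc, pvs):
    """Return the name of the smallest Available PV that satisfies the claim, or None."""
    candidates = [
        pv for pv in pvs
        if pv["status"] == "Available"
        and pv["storageClassName"] == pvc["storageClassName"]
        and set(pvc["accessModes"]) <= set(pv["accessModes"])
        and to_bytes(pv["capacity"]) >= to_bytes(pvc["request"])
    ]
    if not candidates:
        return None   # the claim stays Pending, or dynamic provisioning takes over
    return min(candidates, key=lambda pv: to_bytes(pv["capacity"]))["name"]


pvs = [
    {"name": "pv-small", "status": "Available", "storageClassName": "fast",
     "accessModes": ["ReadWriteOnce"], "capacity": "10Gi"},
    {"name": "pv-large", "status": "Available", "storageClassName": "fast",
     "accessModes": ["ReadWriteOnce"], "capacity": "100Gi"},
    {"name": "pv-fit", "status": "Available", "storageClassName": "fast",
     "accessModes": ["ReadWriteOnce", "ReadOnlyMany"], "capacity": "25Gi"},
]
claim = {"storageClassName": "fast", "accessModes": ["ReadWriteOnce"], "request": "20Gi"}
bound_to = find_match(claim, pvs)
```

Note that a claim bound this way gets the whole PV, so a 20Gi request matched to a 25Gi volume reports 25Gi of capacity.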
Exercises
- (Beginner) What is a PVC, and is it namespaced?
- (Beginner) How does a Pod use a PVC?
- (Intermediate) Describe what happens when you create a PVC referencing a StorageClass and no matching PV exists yet.
- (Interview) How does the PVC abstraction let application developers request storage without knowing the underlying storage backend? (Hint: declarative request fulfilled by the platform/StorageClass.)
Answers
- A PersistentVolumeClaim is a namespaced request for storage (size, access modes, optional StorageClass) made by a user/workload.
- The Pod references the PVC by name in a persistentVolumeClaim volume and mounts it into a container via volumeMounts.
- The StorageClass's provisioner dynamically creates a new PV that satisfies the PVC's requirements, then binds the PVC to it; the PVC goes from Pending to Bound and the storage becomes usable, with no manual PV creation.
- The developer declares an abstract need (capacity, access mode, and a StorageClass name that denotes a quality/tier) without specifying any backend details. The platform (via the StorageClass's provisioner/CSI driver) fulfills that request with whatever storage backs the class (EBS, PD, NFS, Ceph). The same PVC manifest is therefore portable across clusters/clouds: only the StorageClass implementation differs, not the application's request.
Access modes: ReadWriteOnce, ReadOnlyMany, ReadWriteMany
Theory
Access modes describe how a volume may be mounted, and they are a frequent source of confusion and failure. The modes:
| Mode | Abbrev | Meaning |
|---|---|---|
| ReadWriteOnce | RWO | Mounted read-write by a single node (multiple Pods on that node can share it). |
| ReadOnlyMany | ROX | Mounted read-only by many nodes. |
| ReadWriteMany | RWX | Mounted read-write by many nodes simultaneously. |
| ReadWriteOncePod | RWOP | Mounted read-write by a single Pod (strictest; newer). |
The critical nuance: RWO is per-node, not per-Pod — historically misread. Most block storage (cloud disks like EBS, GCE PD) supports only RWO because a block device can attach to one node at a time. RWX requires a shared filesystem (NFS, CephFS, EFS, Azure Files) — you cannot get RWX from a plain block disk. Choosing an access mode the backend doesn't support leaves the PVC unbound or the Pod unschedulable.
Example
spec:
  accessModes: [ ReadWriteOnce ]     # one node R/W; typical for EBS/PD block storage
  # accessModes: [ ReadWriteMany ]   # needs NFS/CephFS/EFS (shared filesystem)
  resources: { requests: { storage: 10Gi } }
Backend support (typical):
  AWS EBS / GCE PD / Azure Disk          -> RWO only (block device, single node)
  NFS / CephFS / AWS EFS / Azure Files   -> RWX (shared filesystem)
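The per-node versus per-Pod distinction is easy to get wrong, so here is a simplified reading of the access-mode table as a predicate. The function name and the tuple layout are hypothetical, and the model only captures the intent of each mode; it does not reproduce how the scheduler, attach controller, or a specific CSI driver enforces them.

```python
def mount_allowed(mode, mounts):
    """mounts: list of (pod, node, read_only) tuples using the volume at the same time."""
    pods = {pod for pod, _, _ in mounts}
    nodes = {node for _, node, _ in mounts}
    if mode == "ReadWriteOncePod":
        return len(pods) <= 1                 # a single Pod in the whole cluster
    if mode == "ReadWriteOnce":
        return len(nodes) <= 1                # a single node, any number of Pods on it
    if mode == "ReadOnlyMany":
        return all(read_only for _, _, read_only in mounts)
    if mode == "ReadWriteMany":
        return True                           # many nodes, read-write
    raise ValueError(f"unknown access mode: {mode}")


same_node = [("web-0", "node-a", False), ("web-1", "node-a", False)]
two_nodes = [("web-0", "node-a", False), ("web-1", "node-b", False)]
```

Two writers on one node pass under ReadWriteOnce but fail under ReadWriteOncePod, while writers on two nodes need ReadWriteMany.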
Exercises
- (Beginner) What do the abbreviations RWO, ROX, and RWX stand for?
- (Beginner) RWO restricts read-write access to a single what — Pod or node?
- (Intermediate) You need three Pods on three different nodes to write to the same volume. Which access mode and what kind of backend do you need?
- (Interview) Why can a typical cloud block disk (EBS/PD) not provide ReadWriteMany, and what storage type is required instead? (Hint: block device single-attach vs. shared filesystem.)
Answers
- ReadWriteOnce, ReadOnlyMany, ReadWriteMany (and the newer ReadWriteOncePod).
- A single node (multiple Pods on that same node can share it; the newer ReadWriteOncePod restricts to a single Pod).
- ReadWriteMany (RWX), backed by a shared filesystem such as NFS, CephFS, AWS EFS, or Azure Files — not a plain block disk.
- A cloud block disk is a block device that can be attached to only one node at a time (single-attach), so it inherently supports only single-node access (RWO). ReadWriteMany requires concurrent multi-node read-write, which needs a shared/networked filesystem (NFS, CephFS, EFS, Azure Files) that coordinates concurrent access — capabilities a raw block volume doesn't provide.
Reclaim policies: Retain, Delete, Recycle
Theory
The reclaim policy (persistentVolumeReclaimPolicy) decides what happens to a PV — and its underlying storage — when its PVC is deleted. This governs whether your data is preserved or destroyed, so it's a high-stakes setting:
- Retain: keep the PV and its data after the PVC is deleted. The PV moves to Released (not reusable until an admin manually cleans it up). Safest for important data: nothing is auto-deleted.
- Delete: delete the PV and the underlying storage asset (the actual cloud disk) when the PVC is deleted. Convenient for dynamically provisioned, disposable volumes, but the data is gone. This is the default for dynamically provisioned volumes.
- Recycle: deprecated; it used to scrub data (rm -rf) and make the PV available again. Replaced by dynamic provisioning.
The crucial operational lesson: dynamically provisioned PVs inherit their StorageClass's reclaimPolicy, which defaults to Delete, so deleting a PVC (or the namespace that contains it) can silently destroy production data unless you set Retain for anything important.
Example
# Set Retain so data survives PVC deletion (recommended for stateful data):
apiVersion: v1
kind: PersistentVolume
metadata: { name: pv-db }
spec:
  capacity: { storage: 50Gi }
  accessModes: [ ReadWriteOnce ]
  persistentVolumeReclaimPolicy: Retain    # Delete | Retain
  storageClassName: fast
  csi: { driver: ebs.csi.aws.com, volumeHandle: vol-0abc }
# After deleting the bound PVC, a Retain PV becomes Released (data kept):
kubectl get pv pv-db
# NAME    STATUS     RECLAIM POLICY   CLAIM
# pv-db   Released   Retain           app/data
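The outcome of each policy can be stated as a tiny function, which also makes it straightforward to flag volumes whose data would disappear along with their claim. The helper names and the inventory list are hypothetical; in practice the input would come from something like the output of kubectl get pv.

```python
def after_claim_deleted(reclaim_policy):
    """Describe the PV and its data once the bound PVC has been deleted."""
    if reclaim_policy == "Retain":
        return {"pv_phase": "Released", "data_kept": True}
    if reclaim_policy == "Delete":
        # The PV object and the backing storage asset are both removed.
        return {"pv_phase": None, "data_kept": False}
    raise ValueError(f"unsupported reclaim policy: {reclaim_policy}")


def at_risk(pvs):
    """Names of PVs whose data would be destroyed if their claim were deleted."""
    return [pv["name"] for pv in pvs
            if not after_claim_deleted(pv["reclaimPolicy"])["data_kept"]]


inventory = [
    {"name": "pv-db", "reclaimPolicy": "Retain"},
    {"name": "pvc-9f", "reclaimPolicy": "Delete"},
]
risky = at_risk(inventory)
```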
Exercises
- (Beginner) When does the reclaim policy take effect?
- (Beginner) What is the default reclaim policy for dynamically provisioned volumes?
- (Intermediate) Explain the difference in outcome between Retain and Delete when a PVC is deleted.
- (Interview) A team deleted a namespace and lost their database's data. How does the default reclaim policy explain this, and how should they have prevented it? (Hint: Delete default; set Retain; backups.)
Answers
- When the bound PVC is deleted — it determines the fate of the PV and its underlying storage at that point.
- Delete.
- With Retain, deleting the PVC leaves the PV (status Released) and the underlying storage intact, preserving the data until an admin manually handles it. With Delete, deleting the PVC also deletes the PV and the actual backend storage asset, destroying the data.
- Dynamically provisioned PVs default to Delete. Deleting the namespace deleted the PVCs within it, which triggered deletion of the bound PVs and the underlying cloud disks, permanently destroying the database data. Prevention: set the reclaim policy (or StorageClass reclaimPolicy) to Retain for important volumes, protect critical namespaces/PVCs, and maintain independent backups/snapshots (e.g., Velero, volume snapshots) so data loss isn't tied to object deletion.
Volume binding modes
Theory
The volume binding mode (a StorageClass field, volumeBindingMode) controls when a PVC is bound to a PV and the backing volume is provisioned:
- Immediate (default): the PV is provisioned and bound as soon as the PVC is created, before any Pod uses it. Problem: in multi-zone clusters, the volume might be created in a zone where the Pod can't be scheduled (a zonal disk in zone A, but the Pod lands in zone B) — causing the Pod to be unschedulable.
- WaitForFirstConsumer: delays binding/provisioning until a Pod that uses the PVC is scheduled. The scheduler then considers the Pod's constraints (node affinity, zone, resources) and the volume is provisioned in the right zone for the Pod. This is the recommended mode for zonal block storage.
WaitForFirstConsumer is the fix for the classic "volume in the wrong zone" problem and is essential to understand for multi-zone clusters.
Example
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: fast }
provisioner: ebs.csi.aws.com
volumeBindingMode: WaitForFirstConsumer # provision in the Pod's zone
parameters: { type: gp3 }
Immediate: PVC created -> volume provisioned in zone A immediately
Pod later scheduled in zone B -> cannot attach -> Pending
WaitForFirstConsumer: PVC created -> waits
Pod scheduled (zone B chosen) -> volume provisioned in zone B
Exercises
- (Beginner) What does the volume binding mode control?
- (Beginner) What is the default binding mode?
- (Intermediate) Describe the multi-zone problem that Immediate binding can cause.
- (Interview) Explain how WaitForFirstConsumer solves the zone-mismatch problem and why it requires deferring provisioning until scheduling. (Hint: scheduler knows the Pod's node/zone.)
Answers
- When a PVC is bound to a PV and the underlying volume is provisioned (immediately vs. when first consumed by a scheduled Pod).
- Immediate.
- With Immediate, the volume is provisioned (e.g., in some zone) before the Pod is scheduled. If the volume is zonal (like a cloud block disk) and the Pod is later scheduled to a node in a different zone, the volume can't attach there, leaving the Pod stuck Pending/unschedulable.
- WaitForFirstConsumer delays binding/provisioning until a Pod using the PVC is being scheduled. At that point the scheduler has chosen (or constrained) the Pod's node and zone, so the provisioner creates the volume in the same zone as the Pod, guaranteeing it can attach. It must defer until scheduling because only then are the Pod's placement constraints (node affinity, zone, resources) known; provisioning earlier would be a blind guess about location.
7.3 Storage Classes
StorageClasses turn manual PV creation into automated, on-demand provisioning, and let you offer differentiated storage tiers. This subchapter covers dynamic provisioning.
Dynamic provisioning with StorageClasses
Theory
Without automation, an admin must pre-create a PV for every storage need — tedious and slow. A StorageClass enables dynamic provisioning: when a PVC requests a StorageClass, Kubernetes automatically creates a matching PV (and the underlying storage) on demand, then binds it. No admin pre-provisioning required.
A StorageClass defines how to provision: a provisioner (which CSI driver to use), parameters (backend-specific options like disk type), a reclaimPolicy, and a volumeBindingMode. It effectively describes a "class" or tier of storage. The workflow: user creates a PVC referencing a StorageClass → the provisioner creates a new PV with the class's settings → PVC binds → Pod uses it. This is how storage works in virtually all real clusters today: users just request PVCs, and StorageClasses fulfill them automatically.
Example
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata: { name: fast }
provisioner: ebs.csi.aws.com # which CSI driver provisions the volume
parameters: # backend-specific
  type: gp3
  iops: "3000"
reclaimPolicy: Delete
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
# User just creates a PVC referencing the class; a PV appears automatically:
kubectl get pvc,pv
# pvc/data Bound pvc-7a.. 20Gi RWO fast
# pv/pvc-7a.. 20Gi RWO Delete Bound app/data fast <- auto-created
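The Example above shows the StorageClass and the resulting objects, but not the claim the user actually writes. A minimal sketch of the user's side of the workflow might look like the following. It assumes the fast class defined above, takes the claim name data and namespace app from the sample output, and uses an illustrative Pod name, image, and mount path.

```yaml
# PVC: the only storage object the user writes; the PV is created by the provisioner.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data
  namespace: app
spec:
  storageClassName: fast          # selects the StorageClass
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi
---
# Pod consuming the claim (name, image, and mountPath are illustrative).
apiVersion: v1
kind: Pod
metadata:
  name: app
  namespace: app
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts:
        - name: data
          mountPath: /var/lib/app
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: data           # must match the PVC name above
```

Because the fast class uses WaitForFirstConsumer, the claim would be expected to stay Pending until the Pod is scheduled, and to bind only after that.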
Exercises
- (Beginner) What does dynamic provisioning eliminate the need for?
- (Beginner) What does the provisioner field of a StorageClass specify?
- (Intermediate) Walk through the sequence of events from creating a PVC with a StorageClass to a Pod using the storage.
- (Interview) How do StorageClasses let a platform team offer differentiated storage tiers (e.g., "fast" vs. "cheap") to application teams? (Hint: classes as named tiers with distinct parameters.)
Answers
- Manually pre-creating PersistentVolumes — the StorageClass provisions them on demand automatically.
- Which provisioner/CSI driver to use to create the underlying storage (e.g., ebs.csi.aws.com).
- The user creates a PVC referencing a StorageClass → the StorageClass's provisioner dynamically creates a new PV (the actual disk) using the class's parameters/reclaim policy → the PVC binds to that PV → a Pod referencing the PVC is scheduled and the volume is attached/mounted into the container.
- The platform team defines multiple StorageClasses, each a named tier with its own provisioner/parameters, e.g., fast (SSD/high-IOPS, possibly Retain) and cheap (HDD/standard, Delete). App teams simply pick a class name in their PVCs to get the desired performance/cost tier without knowing backend details. The class name becomes the contract for a storage quality level, and the platform controls what each tier actually maps to.
Provisioner types
Theory
The provisioner is the engine that actually creates storage for a StorageClass. Provisioners are essentially CSI drivers (the in-tree kubernetes.io/* provisioners are deprecated/migrated). Each cloud and storage system provides one:
| Provisioner | Backend |
|---|---|
| ebs.csi.aws.com | AWS EBS block volumes |
| pd.csi.storage.gke.io | Google Persistent Disk |
| disk.csi.azure.com / file.csi.azure.com | Azure Disk / Azure Files |
| efs.csi.aws.com | AWS EFS (RWX shared) |
| rook-ceph.rbd.csi.ceph.com | Ceph (on-prem) |
The parameters map passes provisioner-specific options (disk type, IOPS, throughput, filesystem, encryption). There's also a special provisioner, kubernetes.io/no-provisioner, used with local PVs (node-attached disks) where there's nothing to dynamically provision — binding is static but the StorageClass coordinates WaitForFirstConsumer scheduling. Choosing the right provisioner and parameters is how you match storage capability (performance, access mode, durability) to workload needs.
Example
# AWS EBS gp3 with custom IOPS/throughput and encryption:
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "5000"
  throughput: "250"
  encrypted: "true"
---
# Local PV class: nothing to provision dynamically, but coordinates scheduling
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
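For the local-volume case, the StorageClass fragment above is only half of the picture: an admin still creates each PV by hand and pins it to a node. A minimal sketch follows, assuming a hypothetical class name local-storage, node name node-1, and disk path /mnt/disks/ssd1 (none of these come from the original text).

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: local-storage                      # hypothetical name
provisioner: kubernetes.io/no-provisioner  # nothing is provisioned dynamically
volumeBindingMode: WaitForFirstConsumer    # bind only once a consuming Pod is scheduled
---
apiVersion: v1
kind: PersistentVolume
metadata:
  name: local-pv-node-1                    # hypothetical name
spec:
  capacity:
    storage: 100Gi
  accessModes: [ReadWriteOnce]
  persistentVolumeReclaimPolicy: Retain
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1                  # assumed mount point of the disk on the node
  nodeAffinity:                            # required for local PVs: ties the PV to its node
    required:
      nodeSelectorTerms:
        - matchExpressions:
            - key: kubernetes.io/hostname
              operator: In
              values: [node-1]
```

The nodeAffinity block is what lets the scheduler place a consuming Pod on the node that physically holds the disk.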
Exercises
- (Beginner) What are modern provisioners essentially implemented as?
- (Beginner) What does the parameters field of a StorageClass carry?
- (Intermediate) When and why would you use the kubernetes.io/no-provisioner provisioner?
- (Interview) Given a workload needing shared read-write access across nodes, which AWS provisioner would you choose over EBS, and why? (Hint: RWX needs a shared filesystem like EFS.)
Answers
- CSI drivers (the in-tree provisioners have been deprecated and migrated to CSI).
- Provisioner/backend-specific options — e.g., disk type, IOPS, throughput, filesystem, encryption — passed to the driver when creating the volume.
- For local PersistentVolumes (storage physically attached to specific nodes). There's nothing to provision on demand, so no-provisioner is used; the StorageClass still coordinates WaitForFirstConsumer so Pods are scheduled to the node holding the local volume. Used for high-performance local disks where data locality matters and you accept node-bound storage.
- Use efs.csi.aws.com (AWS EFS) instead of ebs.csi.aws.com. EBS is a single-attach block device (RWO), so it can't provide cross-node read-write. EFS is a shared NFS-based filesystem supporting ReadWriteMany, allowing multiple Pods on different nodes to read and write the same volume concurrently.
Default StorageClass
Theory
A cluster can designate one StorageClass as the default, marked with the annotation storageclass.kubernetes.io/is-default-class: "true". When a PVC is created without an explicit storageClassName, it uses the default StorageClass — so users get sensible storage automatically without specifying a class.
Two important nuances: (1) If no default is set and a PVC omits the class, the PVC stays Pending (nothing provisions it) — a common "why is my PVC stuck?" cause. (2) Specifying storageClassName: "" (empty string) explicitly disables dynamic provisioning for that PVC, forcing it to bind only to a manually created PV — different from omitting the field entirely. Managed clusters usually ship with a default class (e.g., gp2/gp3 on EKS, standard on GKE). Having exactly one default avoids ambiguity.
Example
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: standard
  annotations:
    storageclass.kubernetes.io/is-default-class: "true" # the default class
provisioner: pd.csi.storage.gke.io
kubectl get storageclass
# NAME PROVISIONER ... AGE
# standard (default) pd.csi.storage.gke.io 40d
# fast pd.csi.storage.gke.io 40d
# A PVC with no storageClassName uses "standard".
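The difference between omitting storageClassName and setting it to an empty string is easiest to see side by side. The sketch below uses two hypothetical claims; the names and sizes are illustrative only.

```yaml
# Field omitted: the cluster's default StorageClass ("standard" above) is used.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: uses-default            # hypothetical name
spec:
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
---
# Empty string: no class and no dynamic provisioning; the claim can bind
# only to a pre-created PV that also has no class.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: static-only             # hypothetical name
spec:
  storageClassName: ""
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 10Gi
```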
Exercises
- (Beginner) How is a StorageClass marked as the default?
- (Beginner) What StorageClass does a PVC use if it omits storageClassName?
- (Intermediate) A PVC without a storageClassName is stuck Pending. What cluster condition would explain this?
- (Interview) Explain the difference between omitting storageClassName and setting it to "" (empty string). (Hint: default class vs. disabling dynamic provisioning.)
Answers
- With the annotation storageclass.kubernetes.io/is-default-class: "true" on the StorageClass.
- The cluster's default StorageClass (if one is configured).
- There is no default StorageClass configured in the cluster, so a PVC that omits the class has nothing to provision it and remains Pending. (Setting a default class, or adding an explicit class to the PVC, resolves it.)
- Omitting storageClassName makes the PVC use the cluster's default StorageClass (dynamic provisioning via that class). Setting storageClassName: "" explicitly disables dynamic provisioning for the PVC: it will only bind to a pre-existing, manually created PV with no class, and will not trigger any provisioner. They are semantically different: "use the default" vs. "use no class / static binding only."
Volume expansion
Theory
Storage needs grow. Volume expansion lets you increase a PVC's size after creation, without recreating the volume or losing data — provided the StorageClass has allowVolumeExpansion: true and the underlying CSI driver supports it. You expand by editing the PVC's resources.requests.storage to a larger value; the provisioner grows the backend disk, and (for filesystem volumes) the filesystem is resized to use the new space.
Important constraints: you can only grow, never shrink, a volume. Some drivers can expand online (while the Pod is running); others require the Pod to be restarted to complete the filesystem resize. And only dynamically provisioned volumes with an expansion-capable class/driver qualify. This avoids the painful old workflow of provisioning a new larger volume and copying data, making capacity growth a simple declarative edit.
Example
# 1) StorageClass must permit expansion:
allowVolumeExpansion: true
# 2) Edit the PVC to request more storage (grow only):
kubectl patch pvc data -p '{"spec":{"resources":{"requests":{"storage":"50Gi"}}}}'
kubectl get pvc data
# NAME STATUS CAPACITY ... CONDITIONS
# data Bound 20Gi FileSystemResizePending <- may need Pod restart
# (after resize) data Bound 50Gi
Exercises
- (Beginner) What StorageClass setting must be enabled to allow expanding a PVC?
- (Beginner) Can you shrink a volume with volume expansion?
- (Intermediate) How do you trigger an expansion on an existing PVC, and what might be required to complete the filesystem resize?
- (Interview) Why is declarative volume expansion a significant improvement over the pre-expansion workflow of provisioning a new volume and copying data? (Hint: downtime, risk, simplicity, no data migration.)
Answers
- allowVolumeExpansion: true (and the CSI driver must support expansion).
- No. Expansion only grows a volume; shrinking is not supported.
- Edit/patch the PVC's spec.resources.requests.storage to a larger value. The provisioner grows the backend disk; completing the filesystem resize may require the Pod to be restarted (offline resize) unless the driver supports online expansion (the PVC may show a FileSystemResizePending condition until done).
- The old approach meant creating a new larger volume, attaching it, copying all data over, switching the workload, and decommissioning the old volume: slow, error-prone, and usually involving downtime and risk of data loss. Declarative expansion is a single edit that grows the existing volume in place (often online), preserving data and identity with minimal or no disruption. It is far simpler, safer, and faster.
7.4 Container Storage Interface (CSI)
CSI is the standard that makes all of the above storage pluggable. This subchapter goes deeper into the driver architecture and advanced capabilities like snapshots.
CSI driver architecture
Theory
A CSI driver is the out-of-tree component implementing the CSI gRPC specification so Kubernetes can manage a specific storage backend. Architecturally, a CSI driver typically has two parts deployed in the cluster:
- Controller plugin (a Deployment/StatefulSet, usually one): handles cluster-wide operations — provisioning/deleting volumes, attaching/detaching them to nodes, creating snapshots. It runs alongside Kubernetes sidecar containers (external-provisioner, external-attacher, external-snapshotter, external-resizer) that watch Kubernetes objects (PVCs, VolumeAttachments) and translate them into CSI calls.
- Node plugin (a DaemonSet, one per node): handles node-local operations, namely mounting/unmounting the volume into the Pod's filesystem and formatting if needed. It runs the node-driver-registrar sidecar to register the driver with the kubelet.
This split mirrors the CSI spec's Controller and Node services. Understanding it helps debug storage issues: provisioning failures point at the controller plugin; mount failures point at the node plugin on the relevant node.
Example
Controller plugin (Deployment, cluster-wide):
[csi-provisioner][csi-attacher][csi-snapshotter][csi-resizer][CSI driver]
watch PVCs/VolumeAttachments -> CreateVolume/ControllerPublishVolume
Node plugin (DaemonSet, per node):
[node-driver-registrar][CSI driver] -> NodeStageVolume/NodePublishVolume (mount)
kubectl get pods -n kube-system -l app=ebs-csi-controller # controller plugin
kubectl get pods -n kube-system -l app=ebs-csi-node -o wide # node plugin (per node)
Exercises
- (Beginner) What are the two main components of a CSI driver deployment?
- (Beginner) Which component runs as a DaemonSet, and what does it do?
- (Intermediate) A volume provisions fine but fails to mount on one specific node. Which CSI component would you investigate, and where?
- (Interview) Why does CSI split functionality into Controller and Node services, and how do the Kubernetes sidecar containers bridge Kubernetes objects to CSI calls? (Hint: cluster-wide vs. node-local ops; watchers translating to gRPC.)
Answers
- The Controller plugin (cluster-wide provisioning/attach/snapshot) and the Node plugin (node-local mount/unmount).
- The Node plugin runs as a DaemonSet (one per node); it handles mounting/unmounting (and formatting) the volume into Pods on that node and registers the driver with the kubelet.
- The Node plugin (DaemonSet Pod) on that specific node — check its logs/events, since mounting is a node-local operation handled by the node plugin and kubelet. Provisioning working but mounting failing isolates the problem to node-side staging/publishing.
- CSI separates cluster-wide operations (create/delete/attach/snapshot a volume — done once, centrally) from node-local operations (stage/mount into a particular node's filesystem — done where the Pod runs), matching how storage actually works. Kubernetes provides sidecar containers (external-provisioner, -attacher, -snapshotter, -resizer, node-driver-registrar) that watch Kubernetes API objects (PVCs, VolumeAttachments, VolumeSnapshots) and translate those into the corresponding CSI gRPC calls on the driver. This keeps the driver focused on talking to its backend while the sidecars handle Kubernetes integration.
Popular CSI drivers (AWS EBS, GCE PD, NFS)
Theory
In practice you'll work with a handful of common CSI drivers, each with characteristic capabilities:
- AWS EBS CSI (ebs.csi.aws.com): provisions EBS block volumes; RWO, zonal, high-performance; the default block storage on EKS.
- GCE PD CSI (pd.csi.storage.gke.io): Google Persistent Disks; RWO (regional PD variants offer cross-zone replication); default on GKE.
- Azure Disk / Azure Files CSI: block (RWO) and SMB file shares (RWX) respectively.
- NFS CSI (nfs.csi.k8s.io): exposes an existing NFS server as RWX storage; a simple shared filesystem, common on-prem.
- Ceph/Rook, Portworx, OpenEBS: software-defined storage for on-prem/hybrid, offering replication, snapshots, and more.
The selection criteria are access mode (block RWO vs. shared RWX), performance, durability/replication, cost, and environment (cloud vs. on-prem). Knowing each driver's access-mode and topology characteristics prevents the common mistake of expecting RWX from a block driver.
Example
| Driver | Access modes | Topology | Typical use |
|---|---|---|---|
| ebs.csi.aws.com | RWO | Zonal | Databases, single-writer apps on EKS |
| efs.csi.aws.com | RWX | Regional | Shared files across Pods/nodes on EKS |
| pd.csi.storage.gke.io | RWO (regional opt.) | Zonal/Regional | General block storage on GKE |
| nfs.csi.k8s.io | RWX | Network | On-prem shared storage |
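As an illustration of matching access mode to driver, a shared RWX claim on the NFS CSI driver might be declared as follows. This is a sketch: the server address, export path, and object names are placeholders, and the server and share parameter names are the ones used by the csi-driver-nfs project, so check them against the version you deploy.

```yaml
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: nfs-shared                  # hypothetical name
provisioner: nfs.csi.k8s.io
parameters:
  server: nfs.example.internal      # placeholder NFS server address
  share: /exports/k8s               # placeholder export path
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: shared-files                # hypothetical name
spec:
  storageClassName: nfs-shared
  accessModes: [ReadWriteMany]      # RWX: mountable read-write by Pods on many nodes
  resources:
    requests:
      storage: 50Gi
```

The same ReadWriteMany request against a class backed by a block driver such as EBS would generally not be satisfiable for a filesystem volume, which is the mismatch the table is meant to help avoid.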
Exercises
- (Beginner) Which AWS CSI driver provides shared ReadWriteMany storage?
- (Beginner) What access mode do block-storage drivers like EBS and GCE PD typically provide?
- (Intermediate) You're on-prem and need RWX shared storage for several Pods. Name a suitable CSI driver.
- (Interview) When selecting a CSI driver for a stateful workload, what characteristics must you match to the workload's needs? (Hint: access mode, topology/zone, performance, durability, environment.)
Answers
- efs.csi.aws.com (AWS EFS): it provides RWX via a shared NFS-based filesystem.
- ReadWriteOnce (RWO): single-node block attachment.
- The NFS CSI driver (nfs.csi.k8s.io) backed by an NFS server, or a software-defined storage system like Ceph/Rook, Portworx, or OpenEBS that offers RWX.
- Match: access mode (does it need single-node RWO or multi-node RWX?), topology (zonal vs. regional/replicated, and zone-aware scheduling), performance (IOPS/throughput tier), durability/replication (snapshots, cross-zone replication, RPO needs), cost, and environment (cloud-native vs. on-prem software-defined). Choosing a driver whose capabilities don't fit (e.g., expecting RWX from EBS) leads to unbound PVCs or unschedulable Pods.
Volume snapshots and cloning
Theory
CSI standardizes volume snapshots — point-in-time copies of a volume — through dedicated API objects (provided by CRDs and the external-snapshotter):
- VolumeSnapshotClass: like a StorageClass but for snapshots (which driver, parameters).
- VolumeSnapshot: a request to snapshot a specific PVC.
- VolumeSnapshotContent: the actual snapshot resource (analogous to PV vs. PVC).
You can then restore a snapshot by creating a new PVC with a dataSource referencing the VolumeSnapshot — producing a new volume with the snapshot's data. Cloning is similar but copies directly from an existing PVC (dataSource of kind PersistentVolumeClaim) without an intermediate snapshot. These enable backups, test/dev data copies, and fast environment duplication — provided the CSI driver supports snapshots/cloning. (Snapshots are crash-consistent at the block level; application-consistent backups may need app coordination, e.g., quiescing a database.)
Example
# Snapshot an existing PVC:
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata: { name: db-snap }
spec:
  volumeSnapshotClassName: csi-snap
  source: { persistentVolumeClaimName: data }
---
# Restore it into a new PVC:
apiVersion: v1
kind: PersistentVolumeClaim
metadata: { name: data-restored }
spec:
  storageClassName: fast
  dataSource: # restore from snapshot
    name: db-snap
    kind: VolumeSnapshot
    apiGroup: snapshot.storage.k8s.io
  accessModes: [ ReadWriteOnce ]
  resources: { requests: { storage: 20Gi } }
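The Example above references a VolumeSnapshotClass named csi-snap and describes cloning only in prose. A sketch of both follows. The driver value and the clone's name are assumptions, and cloning works only if the CSI driver in use implements it, which varies by driver.

```yaml
# Snapshot class referenced by the VolumeSnapshot above.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-snap
driver: ebs.csi.aws.com          # assumed; must be the driver that provisioned the source PVC
deletionPolicy: Delete           # backing snapshot is removed when the VolumeSnapshot is deleted
---
# Clone: copy directly from an existing PVC, with no snapshot object involved.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-clone               # hypothetical name
spec:
  storageClassName: fast         # same class as the source PVC
  dataSource:
    kind: PersistentVolumeClaim
    name: data                   # source PVC; must be in the same namespace
  accessModes: [ReadWriteOnce]
  resources:
    requests:
      storage: 20Gi              # at least the size of the source
```

Setting deletionPolicy to Retain instead would keep the backing snapshot after the VolumeSnapshot object is deleted, which may be preferable for backups.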
Exercises
- (Beginner) What are the three main snapshot-related API objects, and which is analogous to a PV?
- (Beginner) How do you restore a snapshot into a usable volume?
- (Intermediate) What is the difference between cloning a PVC and snapshotting then restoring it?
- (Interview) A CSI snapshot is "crash-consistent." Why might that be insufficient for a database backup, and what extra step ensures application consistency? (Hint: in-flight writes/buffers; quiesce/flush.)
Answers
- VolumeSnapshotClass, VolumeSnapshot, and VolumeSnapshotContent. VolumeSnapshotContent is analogous to a PV (the actual backing snapshot), while VolumeSnapshot is analogous to a PVC (the request/claim).
- Create a new PVC with a dataSource referencing the VolumeSnapshot (kind VolumeSnapshot); the provisioner creates a new volume populated with the snapshot's data.
- Cloning copies data directly from an existing PVC (dataSource kind PersistentVolumeClaim) into a new PVC, with no separate snapshot object. Snapshot+restore first creates a point-in-time VolumeSnapshot, which can be retained/backed up and later restored into one or more new PVCs. Cloning is a one-step copy; snapshots provide a reusable, retainable restore point.
- Crash-consistent means the snapshot captures the disk exactly as it was at an instant (like the state after a power loss), which may include in-flight transactions and unflushed buffers, so a database might need recovery on restore or have inconsistent state. For application consistency you coordinate with the app: quiesce/flush it (e.g., put the DB in backup/freeze mode or flush and lock tables) so on-disk data is consistent at snapshot time, then take the snapshot, then resume. Tools/operators often automate this freeze-snapshot-thaw sequence.
CSI ephemeral volumes
Theory
Usually CSI volumes are durable and managed via PVCs. But sometimes you want a CSI-backed volume whose lifecycle is tied to the Pod — created when the Pod starts and destroyed when it stops — without the PV/PVC machinery. CSI supports two ephemeral patterns:
- Generic ephemeral volumes: defined inline in the Pod spec but using the full StorageClass/PVC machinery under the hood (a PVC is auto-created and deleted with the Pod). Good for scratch space that needs real provisioned storage (size, type) but no persistence.
- CSI inline ephemeral volumes: the CSI driver itself provides a volume inline in the Pod spec (the driver must support this mode). Common for drivers that inject data rather than persistent storage — e.g., the Secrets Store CSI driver mounting secrets, or a driver providing certificates.
The key idea: ephemeral CSI volumes give you driver-backed storage/data with Pod-coupled lifecycle, bridging the gap between trivial emptyDir and full persistent PVCs.
Example
# Generic ephemeral volume: real provisioned storage, deleted with the Pod
spec:
  containers:
    - name: app
      image: myapp:1.0
      volumeMounts: [ { name: scratch, mountPath: /scratch } ]
  volumes:
    - name: scratch
      ephemeral:
        volumeClaimTemplate:
          spec:
            accessModes: [ ReadWriteOnce ]
            storageClassName: fast
            resources: { requests: { storage: 5Gi } }
---
# CSI inline ephemeral (e.g., Secrets Store CSI driver injecting secrets):
volumes:
  - name: secrets
    csi:
      driver: secrets-store.csi.k8s.io
      readOnly: true
      volumeAttributes: { secretProviderClass: "vault-db" }
Exercises
- (Beginner) What lifecycle are CSI ephemeral volumes tied to?
- (Beginner) Name the two kinds of CSI ephemeral volumes.
- (Intermediate) When would you use a generic ephemeral volume instead of an emptyDir?
- (Interview) The Secrets Store CSI driver uses inline ephemeral volumes. Why is an ephemeral, Pod-coupled, driver-provided volume the right model for injecting secrets at runtime? (Hint: no persistence in etcd, lifecycle matches Pod, fetched on demand.)
Answers
- The Pod's lifecycle — created when the Pod starts and deleted when the Pod is removed.
- Generic ephemeral volumes (inline, but using StorageClass/PVC machinery auto-managed with the Pod) and CSI inline ephemeral volumes (provided directly by a CSI driver in the Pod spec).
- When you need real provisioned storage with specific properties (size, type, performance from a StorageClass) for scratch/temporary data, but don't need it to persist beyond the Pod. emptyDir is limited to node-local disk/RAM with no provisioning controls; a generic ephemeral volume gives you proper CSI-backed storage that's still automatically cleaned up with the Pod.
- Secrets should be fetched on demand and never linger. An inline ephemeral CSI volume is created when the Pod starts (the driver fetches secrets from the external store at mount time), exists only for the Pod's life, and is torn down with it — so the secret isn't stored as a Kubernetes Secret in etcd, isn't persisted beyond use, and its lifecycle exactly matches the consuming Pod. This minimizes exposure and keeps secret material out of the cluster datastore.
8. Scheduling and Resource Management
How does Kubernetes decide where each Pod runs, and how does it prevent one workload from starving another? This chapter covers the economics and placement of workloads: declaring resource needs, the scheduler's decision process, steering Pods toward or away from nodes, enforcing fairness with quotas, and scaling automatically. Getting this right is the difference between a stable, cost-efficient cluster and one that thrashes or falls over.
8.1 Resource Requests and Limits
Requests and limits tell Kubernetes how much CPU and memory a container needs and may use. They drive scheduling, fairness, and stability. This subchapter covers them and the QoS classes they produce.
CPU and memory requests
Theory
A request is the amount of CPU or memory a container is guaranteed. It is the primary input to scheduling: the scheduler sums the requests of all Pods on a node and only places a new Pod where the node's allocatable capacity can satisfy the new Pod's requests. Requests are a reservation — that capacity is set aside for the container whether or not it uses it all.
Units: CPU is measured in cores, often as millicores (m): 500m = half a core, 1 = one full core. Memory is in bytes, usually with suffixes: Mi (mebibytes), Gi (gibibytes). Setting requests well is crucial: too low and Pods get over-packed onto nodes and contend; too high and you waste capacity (low utilization, higher cost). Requests are about scheduling and guarantees, distinct from limits (which cap usage).
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m" # guaranteed 0.25 core; used for scheduling
          memory: "256Mi" # guaranteed 256 MiB
Node allocatable: 4 CPU, 8Gi
Pods' requests: A=1CPU/2Gi B=1CPU/2Gi C=1.5CPU/3Gi (sum 3.5CPU/7Gi)
A new Pod requesting 1 CPU does not fit: only 0.5 CPU (4 - 3.5) remains unreserved on this node
Exercises
- (Beginner) What does a resource request guarantee, and what scheduling decision does it drive?
- (Beginner) What does 500m CPU mean?
- (Intermediate) A node has 4 allocatable CPUs. Pods with requests totaling 3.5 CPU are running. Can a Pod requesting 1 CPU be scheduled there? Why?
- (Interview) What are the consequences of setting requests too low versus too high across a cluster? (Hint: contention/overpacking vs. wasted capacity/cost.)
Answers
- It guarantees (reserves) that amount of CPU/memory for the container. It drives scheduling: the scheduler only places a Pod on a node whose remaining allocatable capacity can satisfy the Pod's requests.
- Half of one CPU core (500 millicores = 0.5 core).
- No. Only 0.5 CPU of requests remains (4 − 3.5), which is less than the 1 CPU requested, so the scheduler won't place it there (it would go Pending or to another node). Scheduling is based on summed requests vs. allocatable, regardless of actual current usage.
- Too low: the scheduler over-packs nodes (it thinks Pods need little), leading to CPU contention/throttling and memory pressure/OOM as actual usage exceeds reservations. Too high: capacity is reserved but unused, so nodes appear full while idle — low utilization, fewer Pods per node, more nodes needed, and higher cost. Right-sizing requests balances density against stability.
CPU and memory limits
Theory
A limit is the maximum amount of CPU or memory a container may use. Limits cap consumption to protect other workloads and the node. Critically, CPU and memory limits behave very differently when exceeded:
- CPU is compressible: exceeding the CPU limit causes the container to be throttled (slowed down) — it's denied extra CPU cycles but keeps running. No crash.
- Memory is incompressible: exceeding the memory limit causes the container to be OOMKilled (terminated by the kernel's OOM killer) and restarted. You cannot "throttle" memory.
This asymmetry is one of the most important operational facts in Kubernetes. It means an under-provisioned memory limit leads to crash loops (OOMKilled), while an under-provisioned CPU limit leads to sluggishness. Many practitioners recommend setting memory requests = limits (to avoid surprise OOMs) and being cautious or even omitting CPU limits (to avoid unnecessary throttling), depending on workload.
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests: { cpu: "250m", memory: "256Mi" }
        limits: { cpu: "500m", memory: "512Mi" } # caps; memory overflow = OOMKill
kubectl get pod app -o jsonpath='{.status.containerStatuses[0].lastState}'
# {"terminated":{"reason":"OOMKilled","exitCode":137}} # exceeded memory limit
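The practice mentioned in the Theory above (memory request equal to limit, CPU limit omitted) could be expressed like this. The values are illustrative rather than recommendations, and whether omitting the CPU limit is appropriate depends on the workload and on any LimitRange defaults in the namespace.

```yaml
spec:
  containers:
    - name: app
      image: myapp:1.0
      resources:
        requests:
          cpu: "250m"        # still reserved for scheduling
          memory: "512Mi"
        limits:
          memory: "512Mi"    # same as the request
          # no cpu limit set here: the container may use idle CPU on the node
```

Because the container has no CPU limit, a Pod written this way is classified as Burstable rather than Guaranteed.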
Exercises
- (Beginner) What does a resource limit define?
- (Beginner) What happens when a container exceeds its CPU limit versus its memory limit?
- (Intermediate) A container repeatedly shows OOMKilled with exit code 137. What does this indicate and what's the likely fix?
- (Interview) Explain the "compressible vs. incompressible" distinction between CPU and memory and how it should influence how you set limits. (Hint: throttle vs. kill; memory requests=limits.)
Answers
- The maximum CPU/memory a container is allowed to consume (an upper cap).
- Exceeding the CPU limit throttles the container (it's slowed but keeps running); exceeding the memory limit causes it to be OOMKilled (terminated, then restarted).
- It indicates the container exceeded its memory limit and was killed by the OOM killer (137 = 128 + SIGKILL/9). The fix is to raise the memory limit/request to fit the app's actual usage, and/or reduce the app's memory consumption (fix leaks, or tune heap/cache sizes, for example by exposing the container's memory limit to the app through the Downward API so it can size itself accordingly).
- CPU is compressible: the kernel can simply give a container fewer cycles, so going over the limit throttles it without killing it. Memory is incompressible: you can't take memory back from a running process gracefully, so exceeding the limit triggers an OOM kill. This means memory limits are safety-critical — under-setting them causes crashes — so a common practice is to set memory requests equal to limits for predictable behavior, while treating CPU limits more loosely (or omitting them) to avoid needless throttling, since overshoot only slows the app.
Quality of Service (QoS) classes
Theory
Based on how a Pod sets requests and limits, Kubernetes assigns it a Quality of Service (QoS) class, which determines its priority when the node runs out of memory and must evict Pods. The three classes:
| QoS Class | Condition | Eviction priority |
|---|---|---|
| Guaranteed | Every container has requests == limits for both CPU and memory | Evicted last (highest protection) |
| Burstable | At least one container has a CPU or memory request or limit, but the Pod does not meet the Guaranteed criteria | Evicted after BestEffort |
| BestEffort | No requests or limits set on any container | Evicted first (least protection) |
When a node is under memory pressure, the kubelet evicts BestEffort Pods first, then Burstable Pods exceeding their requests, and protects Guaranteed Pods the longest. Thus setting requests/limits isn't just about scheduling — it directly affects which workloads survive resource pressure. Critical workloads should be Guaranteed; sacrificial/batch workloads can be BestEffort.
Example
# Guaranteed: requests == limits for cpu AND memory on every container
resources:
  requests: { cpu: "500m", memory: "512Mi" }
  limits:   { cpu: "500m", memory: "512Mi" }
---
# Burstable: has requests, but limits differ / partial
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits:   { cpu: "1", memory: "512Mi" }
---
# BestEffort: no requests or limits at all -> evicted first
resources: {}
# Check the class Kubernetes assigned to a running Pod:
kubectl get pod app -o jsonpath='{.status.qosClass}' # Guaranteed | Burstable | BestEffort
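The classification rule in the table can be sketched as a small Python function. This is a simplified model, not the kubelet's implementation: the name qos_class and the dict layout are illustrative assumptions, quantities are compared as literal strings (so "500m" and "0.5" would not count as equal), and init containers are ignored.

```python
def qos_class(containers: list) -> str:
    """Return the QoS class for a Pod given its containers (simplified sketch)."""
    any_set = False
    guaranteed = True
    for container in containers:
        resources = container.get("resources", {})
        limits = resources.get("limits", {})
        # A request that is omitted defaults to the limit, if a limit is set.
        requests = {**limits, **resources.get("requests", {})}
        if requests or limits:
            any_set = True
        for name in ("cpu", "memory"):
            # Guaranteed needs a limit for both resources, equal to the request.
            if name not in limits or requests[name] != limits[name]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"


guaranteed_pod = [{"resources": {"requests": {"cpu": "500m", "memory": "512Mi"},
                                 "limits": {"cpu": "500m", "memory": "512Mi"}}}]
burstable_pod = [{"resources": {"requests": {"cpu": "250m", "memory": "256Mi"},
                                "limits": {"cpu": "1", "memory": "512Mi"}}}]
best_effort_pod = [{"resources": {}}]

for pod in (guaranteed_pod, burstable_pod, best_effort_pod):
    print(qos_class(pod))  # Guaranteed, then Burstable, then BestEffort
```

Note that a single container without resources is enough to turn an otherwise Guaranteed Pod into a Burstable one, which is why the rule says "every container".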
Exercises
- (Beginner) What are the three QoS classes?
- (Beginner) What condition makes a Pod "Guaranteed"?
- (Intermediate) Under node memory pressure, in what order are Pods evicted by QoS class?
- (Interview) You have a critical payment service and a best-effort batch job on the same nodes. How should you set resources so the payment service survives memory pressure? (Hint: Guaranteed vs. BestEffort.)
Answers
- Guaranteed, Burstable, and BestEffort.
- Every container in the Pod sets requests equal to limits for both CPU and memory (and both are specified).
- BestEffort first, then Burstable (especially those exceeding their requests), and Guaranteed Pods last (most protected).
- Make the payment service Guaranteed by setting requests == limits for CPU and memory on all its containers, so it's evicted last and has reserved capacity. Leave the batch job BestEffort (no requests/limits) or low-priority Burstable, so it's sacrificed first under pressure. This ensures the node reclaims memory from the expendable batch job before touching the critical service.
LimitRange for namespace defaults
Theory
If developers forget to set requests/limits, Pods become BestEffort and risk over-packing and eviction. A LimitRange is a namespace-scoped policy that automatically applies default requests/limits to containers that don't specify them, and can enforce min/max bounds per container/Pod and a maximum limit-to-request ratio. It runs as an admission controller on Pod creation, so it affects only Pods created after the LimitRange exists; Pods that are already running are not modified.
This gives platform teams guardrails: every Pod in a namespace gets sane resource values even if the manifest omits them, and no one can request absurdly large or tiny amounts. Capabilities include default (limits applied if omitted), defaultRequest (requests applied if omitted), max/min (allowed range), and maxLimitRequestRatio (cap how much limits can exceed requests). LimitRange operates per-object; ResourceQuota (next) operates on namespace aggregates.
Example
apiVersion: v1
kind: LimitRange
metadata: { name: defaults, namespace: dev }
spec:
  limits:
  - type: Container
    default:                 # applied as limits if not set
      cpu: "500m"
      memory: "512Mi"
    defaultRequest:          # applied as requests if not set
      cpu: "100m"
      memory: "128Mi"
    max: { cpu: "2", memory: "2Gi" }      # nobody may exceed this
    min: { cpu: "50m", memory: "64Mi" }   # nobody may go below this
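To make the defaulting and bounds checks concrete, here is a minimal Python sketch that mimics what admission would do with the LimitRange above. It is an illustration under stated assumptions: apply_limit_range and parse_quantity are hypothetical helpers, only the m, Ki, Mi, and Gi suffixes are parsed, and ratio checks are left out.

```python
def parse_quantity(quantity: str) -> float:
    """Parse the small subset of quantity suffixes used here: m, Ki, Mi, Gi."""
    binary = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}
    for suffix, factor in binary.items():
        if quantity.endswith(suffix):
            return float(quantity[: -len(suffix)]) * factor
    if quantity.endswith("m"):
        return float(quantity[:-1]) / 1000
    return float(quantity)


# Mirrors the values of the LimitRange manifest in the example.
LIMIT_RANGE = {
    "default": {"cpu": "500m", "memory": "512Mi"},
    "defaultRequest": {"cpu": "100m", "memory": "128Mi"},
    "max": {"cpu": "2", "memory": "2Gi"},
    "min": {"cpu": "50m", "memory": "64Mi"},
}


def apply_limit_range(resources: dict, limit_range: dict = LIMIT_RANGE) -> dict:
    """Fill in missing requests/limits, then reject out-of-bounds values."""
    given_requests = resources.get("requests", {})
    given_limits = resources.get("limits", {})
    requests, limits = {}, {}
    for name in ("cpu", "memory"):
        limits[name] = given_limits.get(name, limit_range["default"][name])
        if name in given_requests:
            requests[name] = given_requests[name]
        elif name in given_limits:
            # An explicit limit without a request: the request follows the limit.
            requests[name] = given_limits[name]
        else:
            requests[name] = limit_range["defaultRequest"][name]
        if parse_quantity(limits[name]) > parse_quantity(limit_range["max"][name]):
            raise ValueError(f"{name} limit {limits[name]} is above max")
        if parse_quantity(requests[name]) < parse_quantity(limit_range["min"][name]):
            raise ValueError(f"{name} request {requests[name]} is below min")
    return {"requests": requests, "limits": limits}


print(apply_limit_range({}))                              # all four values defaulted
print(apply_limit_range({"limits": {"memory": "1Gi"}}))   # memory request follows the 1Gi limit
```

A container asking for a 4Gi memory limit would raise ValueError here, which corresponds to the API server rejecting the Pod at admission.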
Exercises
- (Beginner) What does a LimitRange do for containers that omit requests/limits?
- (Beginner) Is a LimitRange scoped to a namespace or the whole cluster?
- (Intermediate) How can a LimitRange prevent a single Pod from requesting an unreasonably large amount of memory?
- (Interview) How do LimitRange and ResourceQuota differ in scope, and why might you use both together? (Hint: per-object defaults/bounds vs. namespace aggregate totals.)
Answers
- It injects default requests and limits (from defaultRequest/default) so those containers aren't left without resource settings (avoiding BestEffort and over-packing).
- Namespace-scoped.
- By setting a max for the relevant resource (e.g., max.memory: 2Gi); the admission controller rejects any container/Pod whose request or limit exceeds the maximum.
- LimitRange governs per-object values: defaults and min/max bounds and ratios for individual containers/Pods. ResourceQuota governs namespace aggregates: the total requests/limits and object counts summed across the namespace. Used together, LimitRange ensures every Pod has sensible, bounded per-Pod resources (and defaults so quota math works), while ResourceQuota caps the namespace's total consumption: per-object hygiene plus overall fairness/capacity control.
8.2 Scheduling Concepts
The scheduler decides which node runs each Pod. This subchapter opens up its workflow and the affinity/anti-affinity/spread mechanisms you use to influence placement.
Scheduler workflow and phases
Theory
The kube-scheduler turns a Pod with no assigned node into a Pod bound to a specific node. (Introduced in Chapter 2; here we go deeper.) Its decision runs through a scheduling framework of phases, conceptually two stages:
- Filtering (predicates): eliminate nodes that cannot run the Pod — insufficient allocatable resources, unsatisfied nodeSelector/affinity, untolerated taints, volume/zone conflicts, port conflicts. The output is the set of feasible nodes.
- Scoring (priorities): rank the feasible nodes by a weighted set of plugins — resource balance (least/most allocated), affinity preferences, topology spread, image locality, inter-Pod affinity. The highest-scoring node wins (ties broken at random).
Then binding writes the chosen node to the Pod. The framework exposes extension points (PreFilter, Filter, Score, Reserve, Permit, Bind, etc.) where plugins — built-in or custom — hook in. If no node is feasible, the Pod stays Pending with an explanatory event.
Example
Pending Pod
|
v FILTER: remove infeasible nodes
[node1 ok][node2 no:insufficient mem][node3 ok][node4 no:taint]
|
v SCORE: rank feasible nodes
[node1: 72][node3: 88] -> pick node3 (highest)
|
v BIND: set spec.nodeName = node3
kubectl describe pod pending-pod | grep -A3 Events
# Warning FailedScheduling 0/4 nodes are available: 2 Insufficient memory,
# 1 node(s) had untolerated taint, 1 node(s) didn't match nodeSelector.
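The filter, score, bind flow can be sketched in a few lines of Python. This is a toy model and not the kube-scheduler's code: the node fields (free_memory_mi, taints), the single memory-based score, and the function name schedule are assumptions made for illustration, and ties are resolved by list order here rather than at random.

```python
def schedule(pod: dict, nodes: list) -> tuple:
    """Return (chosen node name or None, {node name: reason it was filtered out})."""
    feasible, reasons = [], {}
    # FILTER: drop nodes that cannot run the Pod at all.
    for node in nodes:
        if node["free_memory_mi"] < pod["memory_request_mi"]:
            reasons[node["name"]] = "Insufficient memory"
        elif any(t not in pod["tolerated_taints"] for t in node["taints"]):
            reasons[node["name"]] = "untolerated taint"
        else:
            feasible.append(node)
    if not feasible:
        return None, reasons  # the Pod stays Pending

    # SCORE: one hypothetical plugin that favors the most free memory left over.
    def score(node: dict) -> int:
        return node["free_memory_mi"] - pod["memory_request_mi"]

    best = max(feasible, key=score)
    # BIND: the real scheduler would now record best["name"] as spec.nodeName.
    return best["name"], reasons


nodes = [
    {"name": "node1", "free_memory_mi": 1024, "taints": []},
    {"name": "node2", "free_memory_mi": 128, "taints": []},
    {"name": "node3", "free_memory_mi": 4096, "taints": []},
    {"name": "node4", "free_memory_mi": 8192, "taints": ["gpu=true:NoSchedule"]},
]
pod = {"memory_request_mi": 512, "tolerated_taints": []}

print(schedule(pod, nodes))
# ('node3', {'node2': 'Insufficient memory', 'node4': 'untolerated taint'})
```

node4 has the most free memory but never reaches scoring, which shows why filtering is described as a hard gate and scoring as a ranking among survivors.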
Exercises
- (Beginner) What are the two main stages of the scheduler's decision?
- (Beginner) What does the scheduler write once it picks a node?
- (Intermediate) A Pod is Pending with "0/5 nodes available". How do you find the specific per-node reasons?
- (Interview) The scheduling framework exposes extension points for plugins. Why is a plugin-based scheduler architecture valuable, and what does it enable beyond the default behavior? (Hint: custom scheduling logic without forking.)
Answers
- Filtering (find feasible nodes) and Scoring (rank them); followed by binding.
- It writes the chosen node into the Pod's spec.nodeName (via a Binding), after which the kubelet on that node runs the Pod.
- Run kubectl describe pod <pod> and read the FailedScheduling event, which enumerates per-node reasons (e.g., insufficient memory, untolerated taint, selector mismatch, volume zone conflict).
- A plugin framework lets you customize or extend scheduling at well-defined extension points (Filter, Score, Reserve, Bind, etc.) without forking the scheduler. It enables custom placement logic (e.g., specialized bin-packing, hardware-aware or topology-aware scheduling, gang/batch scheduling, or third-party schedulers) while reusing the core machinery. Organizations can add their own constraints/priorities or run multiple schedulers, adapting placement to their needs cleanly.
NodeSelector
Theory
The simplest way to constrain where a Pod runs is nodeSelector: a map of key-value labels a node must have for the Pod to be scheduled there. The scheduler filters out any node whose labels don't include all the specified pairs. It's a hard requirement (the Pod won't schedule if no node matches) but expressively limited — only exact equality, ANDed together.
This is used for basic targeting: run on SSD nodes (disktype: ssd), on a particular instance type, on GPU nodes, or in a particular pool. Nodes carry both built-in labels (e.g., kubernetes.io/hostname, topology.kubernetes.io/zone, kubernetes.io/arch) and custom labels you add. For anything more nuanced (preferences, OR logic, "in"/"not in" sets), you graduate to node affinity (next).
Example
# Label a node so Pods can target it:
kubectl label node node-3 disktype=ssd
# Pod spec fragment that targets the labeled node:
spec:
  nodeSelector:
    disktype: ssd        # Pod only schedules on nodes labeled disktype=ssd
  containers:
  - name: app
    image: myapp:1.0
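The matching rule is simple enough to state as a one-line predicate. The Python sketch below is only a model of the semantics described above (exact equality, all pairs ANDed); matches_node_selector and the sample labels are illustrative names, not part of any Kubernetes client library.

```python
def matches_node_selector(node_labels: dict, node_selector: dict) -> bool:
    """A node qualifies only if it carries every key/value pair in the selector."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


node_3 = {"kubernetes.io/arch": "amd64", "disktype": "ssd"}
node_4 = {"kubernetes.io/arch": "amd64", "disktype": "hdd"}

print(matches_node_selector(node_3, {"disktype": "ssd"}))  # True
print(matches_node_selector(node_4, {"disktype": "ssd"}))  # False
# Two pairs are ANDed: both must be present on the node.
print(matches_node_selector(node_3, {"disktype": "ssd", "kubernetes.io/arch": "arm64"}))  # False
```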
Exercises
- (Beginner) What does nodeSelector match against on a node?
- (Beginner) Is nodeSelector a hard requirement or a preference?
- (Intermediate) How do multiple key-value pairs in a nodeSelector combine — AND or OR?
- (Interview) What are the expressiveness limitations of nodeSelector that motivate node affinity? (Hint: only equality/AND, no preferences or set operators.)
Answers
- Node labels — the node must have all the specified key-value label pairs.
- A hard requirement: if no node matches, the Pod stays Pending (it won't schedule).
- AND — all specified label pairs must be present on the node.
- nodeSelector supports only exact-equality matches ANDed together. It can't express preferences (soft constraints), set operators (In/NotIn/Exists/DoesNotExist), or OR logic across values. Node affinity adds required vs. preferred rules, weighted preferences, and richer match expressions, making it far more flexible for real placement needs.
Node affinity and anti-affinity
Theory
Node affinity is the more powerful successor to nodeSelector for attracting Pods to nodes based on node labels. It supports two flavors:
- requiredDuringSchedulingIgnoredDuringExecution: a hard rule (like nodeSelector but expressive) — the Pod only schedules on matching nodes.
- preferredDuringSchedulingIgnoredDuringExecution: a soft rule with a weight — the scheduler prefers matching nodes but will place the Pod elsewhere if needed.
The long names encode semantics: the rule applies during scheduling but is ignored during execution (a Pod already running isn't evicted if node labels later change). Node affinity uses matchExpressions with operators In, NotIn, Exists, DoesNotExist, Gt, Lt — enabling set membership and node anti-affinity (via NotIn/DoesNotExist, i.e., avoid certain nodes). This is how you express "must run on amd64 GPU nodes, preferably in zone us-east-1a."
Example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:    # hard
        nodeSelectorTerms:
        - matchExpressions:
          - { key: kubernetes.io/arch, operator: In, values: [ amd64 ] }
          - { key: gpu, operator: Exists }
      preferredDuringSchedulingIgnoredDuringExecution:   # soft
      - weight: 80
        preference:
          matchExpressions:
          - { key: topology.kubernetes.io/zone, operator: In, values: [ us-east-1a ] }
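How the operators and the required/preferred split combine can be sketched in Python. This is a simplified model of the semantics, not scheduler code: the function names are illustrative, Gt/Lt are treated as integer comparisons against a single value, and an absent label is assumed to satisfy NotIn, as in standard label-selector semantics.

```python
def matches_expression(labels: dict, expr: dict) -> bool:
    """Evaluate one matchExpressions entry against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    present = key in labels
    if operator == "In":
        return present and labels[key] in values
    if operator == "NotIn":
        return not present or labels[key] not in values
    if operator == "Exists":
        return present
    if operator == "DoesNotExist":
        return not present
    if operator in ("Gt", "Lt"):
        if not present:
            return False
        try:
            actual, bound = int(labels[key]), int(values[0])
        except (ValueError, IndexError):
            return False
        return actual > bound if operator == "Gt" else actual < bound
    raise ValueError(f"unknown operator: {operator}")


def matches_required(labels: dict, node_selector_terms: list) -> bool:
    """Terms are ORed; expressions inside one term are ANDed."""
    return any(
        all(matches_expression(labels, e) for e in term["matchExpressions"])
        for term in node_selector_terms
    )


def preferred_score(labels: dict, preferred: list) -> int:
    """Sum the weights of every preferred term the node satisfies."""
    return sum(
        p["weight"] for p in preferred
        if all(matches_expression(labels, e) for e in p["preference"]["matchExpressions"])
    )


required = [{"matchExpressions": [
    {"key": "kubernetes.io/arch", "operator": "In", "values": ["amd64"]},
    {"key": "gpu", "operator": "Exists"},
]}]
preferred = [{"weight": 80, "preference": {"matchExpressions": [
    {"key": "topology.kubernetes.io/zone", "operator": "In", "values": ["us-east-1a"]},
]}}]

node_a = {"kubernetes.io/arch": "amd64", "gpu": "true", "topology.kubernetes.io/zone": "us-east-1a"}
node_b = {"kubernetes.io/arch": "amd64", "gpu": "true", "topology.kubernetes.io/zone": "us-east-1b"}
node_c = {"kubernetes.io/arch": "arm64", "gpu": "true", "topology.kubernetes.io/zone": "us-east-1a"}

for name, labels in (("a", node_a), ("b", node_b), ("c", node_c)):
    print(name, matches_required(labels, required), preferred_score(labels, preferred))
# a True 80
# b True 0
# c False 80
```

Node c shows why the two flavors are evaluated separately: it earns the preference weight but is never a candidate, because the required rule filtered it out first.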
Exercises
- (Beginner) What is the difference between required... and preferred... node affinity?
- (Beginner) Name three operators node affinity supports that nodeSelector does not.
- (Intermediate) What does "IgnoredDuringExecution" mean for a Pod already running when node labels change?
- (Interview) How would you express "the Pod must run on GPU nodes and should preferably avoid spot instances"? Sketch the affinity structure. (Hint: required Exists on gpu; preferred NotIn on a spot label.)
Answers
- required... is a hard constraint: the Pod only schedules on nodes matching it (or stays Pending). preferred... is a soft, weighted preference: the scheduler favors matching nodes but will still schedule elsewhere if none match.
- Any three of: In, NotIn, Exists, DoesNotExist, Gt, Lt.
- The rule is only evaluated at scheduling time. If a running Pod's node later stops matching (labels change), the Pod is not evicted; it keeps running. Affinity influences placement, not ongoing residency.
- Use a required nodeAffinity with matchExpressions: [{ key: gpu, operator: Exists }] (must be a GPU node), plus a preferred rule with a weight whose preference.matchExpressions is [{ key: <spot-label, e.g. node.kubernetes.io/lifecycle>, operator: NotIn, values: [ spot ] }] (prefer non-spot). The required term enforces GPU; the weighted preferred term steers away from spot nodes without making it mandatory.
Pod affinity and anti-affinity
Theory
While node affinity relates Pods to node labels, inter-Pod affinity/anti-affinity relates Pods to other Pods based on their labels, within a topology domain (defined by topologyKey, e.g., hostname, zone). This lets you co-locate or separate Pods:
- podAffinity: schedule this Pod near Pods matching a label (same topology domain) — e.g., place a web server in the same zone as its cache to reduce latency.
- podAntiAffinity: schedule this Pod away from Pods matching a label — e.g., spread replicas of a service across different nodes/zones so a single failure doesn't take them all out.
Both have required (hard) and preferred (soft) variants. The topologyKey is essential: it defines what "near"/"away" means — kubernetes.io/hostname means per-node, topology.kubernetes.io/zone means per-zone. A classic use is anti-affinity on hostname to ensure no two replicas of a critical service land on the same node. (Note: pod affinity can be computationally expensive at large scale.)
Example
spec:
  affinity:
    podAntiAffinity:                          # spread replicas across nodes
      requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels: { app: web }
        topologyKey: kubernetes.io/hostname   # "away" = different node
    podAffinity:                              # co-locate with cache in same zone
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 50
        podAffinityTerm:
          labelSelector: { matchLabels: { app: cache } }
          topologyKey: topology.kubernetes.io/zone
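The role of topologyKey is easiest to see when the same check is run with two different keys. The Python sketch below models the rule "does the candidate node's topology domain already contain a matching Pod?". It assumes every node carries the topology label and uses hypothetical node and Pod names; it is not how the scheduler is implemented internally.

```python
HOSTNAME = "kubernetes.io/hostname"
ZONE = "topology.kubernetes.io/zone"


def same_domain_has_match(candidate: str, nodes: dict, pods: list,
                          match_labels: dict, topology_key: str) -> bool:
    """True if a Pod matching match_labels already runs in the candidate's domain."""
    domain = nodes[candidate][topology_key]
    for pod in pods:
        selected = all(pod["labels"].get(k) == v for k, v in match_labels.items())
        if selected and nodes[pod["node"]][topology_key] == domain:
            return True
    return False


nodes = {
    "node-1": {HOSTNAME: "node-1", ZONE: "zone-a"},
    "node-2": {HOSTNAME: "node-2", ZONE: "zone-a"},
    "node-3": {HOSTNAME: "node-3", ZONE: "zone-b"},
}
pods = [
    {"labels": {"app": "web"}, "node": "node-1"},
    {"labels": {"app": "cache"}, "node": "node-3"},
]

for name in nodes:
    # Required anti-affinity on hostname: the node is allowed only if no web Pod is on it.
    allowed = not same_domain_has_match(name, nodes, pods, {"app": "web"}, HOSTNAME)
    # Preferred affinity on zone: add the weight if a cache Pod is in the same zone.
    bonus = 50 if same_domain_has_match(name, nodes, pods, {"app": "cache"}, ZONE) else 0
    print(name, allowed, bonus)
# node-1 False 0
# node-2 True 0
# node-3 True 50
```

With ZONE as the anti-affinity key instead of HOSTNAME, node-2 would also be excluded, because it shares zone-a with the existing web Pod.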
Exercises
- (Beginner) What does pod affinity relate Pods to, versus node affinity?
- (Beginner) What is the role of topologyKey?
- (Intermediate) Write (in words) an anti-affinity rule ensuring no two web Pods share a node.
- (Interview) Why is podAntiAffinity on kubernetes.io/hostname a common high-availability pattern, and what is a performance caveat of heavy pod affinity use? (Hint: spread for fault tolerance; scheduling cost at scale.)
Answers
- Pod affinity relates a Pod to other Pods (by their labels, within a topology domain); node affinity relates a Pod to node labels.
- It defines the topology domain over which "near"/"away" is measured, e.g., kubernetes.io/hostname (per node), topology.kubernetes.io/zone (per zone). Affinity/anti-affinity is evaluated within these domains.
- A required podAntiAffinity with a labelSelector matching app: web and topologyKey: kubernetes.io/hostname, meaning a web Pod cannot be scheduled onto a node that already runs a web Pod, forcing one per node.
- Spreading replicas across nodes (hostname anti-affinity) ensures a single node failure can't take down all replicas of a service, improving availability. The caveat: inter-Pod affinity/anti-affinity requires the scheduler to evaluate relationships against many existing Pods across the cluster, which is computationally expensive and can significantly slow scheduling in large clusters, so it should be used judiciously (often preferred rules or topology spread constraints are lighter alternatives).
Topology spread constraints
Theory
Topology spread constraints are a more modern, declarative way to control how Pods are evenly distributed across topology domains (zones, nodes), addressing the verbosity and performance issues of podAntiAffinity for spreading. You declare how unevenly Pods may be distributed via maxSkew over a topologyKey, and what to do if the constraint can't be met (whenUnsatisfiable: DoNotSchedule (hard) or ScheduleAnyway (soft)).
maxSkew is the maximum allowed difference between the most and least populated domains. For example, maxSkew: 1 over zones keeps replicas balanced across zones to within one. This is the recommended tool for "spread my Deployment's replicas evenly across zones/nodes for resilience and balance" — clearer and more efficient than equivalent anti-affinity rules. labelSelector identifies the group of Pods being balanced.
Example
spec:
  topologySpreadConstraints:
  - maxSkew: 1                          # at most 1 difference between zones
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule    # hard constraint
    labelSelector:
      matchLabels: { app: web }
  containers:
  - name: web
    image: nginx
3 zones, 6 replicas, maxSkew 1 -> 2 + 2 + 2 (balanced)
If zone-c is full: -> only 1 + 1 + 0 can be placed; a second Pod in zone-a or zone-b would skew by 2 > 1 -> remaining replicas blocked/Pending
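The skew arithmetic can be replayed step by step in Python. The sketch below is a simplified model: it works on whole domains rather than individual nodes, the names allowed_domains and spread and the full parameter are illustrative assumptions, and it assumes a domain whose nodes are full still counts as a domain with zero matching Pods. Skew for a candidate domain is taken as its Pod count after placement minus the current minimum across domains.

```python
def allowed_domains(counts: dict, max_skew: int, full: frozenset = frozenset()) -> list:
    """Domains where one more Pod keeps the skew within max_skew (DoNotSchedule)."""
    lowest = min(counts.values())
    return [
        domain for domain, count in counts.items()
        if domain not in full and count + 1 - lowest <= max_skew
    ]


def spread(replicas: int, domains: list, max_skew: int = 1,
           full: frozenset = frozenset()) -> tuple:
    """Place replicas one at a time; return (Pods per domain, number left Pending)."""
    counts = {domain: 0 for domain in domains}
    pending = 0
    for _ in range(replicas):
        candidates = allowed_domains(counts, max_skew, full)
        if not candidates:
            pending += 1  # no domain satisfies the constraint
            continue
        target = min(candidates, key=lambda d: counts[d])
        counts[target] += 1
    return counts, pending


zones = ["zone-a", "zone-b", "zone-c"]
print(spread(6, zones))
# ({'zone-a': 2, 'zone-b': 2, 'zone-c': 2}, 0)
print(spread(6, zones, full=frozenset({"zone-c"})))
# ({'zone-a': 1, 'zone-b': 1, 'zone-c': 0}, 4)
```

The second call illustrates a practical consequence of DoNotSchedule: one unavailable zone can hold back most of a Deployment, which is the usual reason to consider ScheduleAnyway for workloads where availability matters more than perfect balance.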
Exercises
- (Beginner) What do topology spread constraints control?
- (Beginner) What does maxSkew: 1 mean?
- (Intermediate) What is the difference between whenUnsatisfiable: DoNotSchedule and ScheduleAnyway?
- (Interview) Why are topology spread constraints often preferred over podAntiAffinity for evenly distributing replicas across zones? (Hint: declarative evenness, performance, finer control.)
Answers
- How evenly Pods (matching a label selector) are distributed across topology domains (e.g., zones or nodes).
- The most populated and least populated topology domains may differ by at most one Pod — keeping the distribution balanced to within one.
- DoNotSchedule makes the constraint hard: if placing the Pod would violate maxSkew, it stays Pending. ScheduleAnyway makes it a soft preference: the scheduler tries to satisfy it but will still place the Pod even if it skews the distribution.
- Topology spread constraints express "even distribution" directly via maxSkew, are easier to reason about, and are more scheduler-efficient than emulating spreading with many podAntiAffinity rules (which are computationally expensive at scale). They also offer finer control (soft/hard, per-domain skew, minDomains) and natively target balanced placement across zones/nodes for resilience, rather than the coarse "never co-locate" semantics of anti-affinity.
8.3 Taints and Tolerations
Affinity attracts Pods to nodes; taints do the opposite — they repel Pods from nodes unless the Pod explicitly tolerates them. This subchapter covers this complementary mechanism.
Tainting nodes
Theory
A taint is applied to a node and repels Pods that do not tolerate it. Taints and tolerations work together as the inverse of affinity: affinity is a Pod saying "I want this kind of node," while a taint is a node saying "keep Pods off me unless they explicitly accept this condition." This lets you reserve nodes for specific purposes.
A taint has a key, optional value, and an effect. You apply it with kubectl taint. Common uses: control-plane nodes are tainted so general workloads stay off them; GPU/specialized nodes are tainted so only GPU workloads (which tolerate the taint) land there; nodes under maintenance get tainted to drain workloads. Kubernetes also applies taints automatically for node conditions (not-ready, unreachable, memory-pressure, disk-pressure). A Pod is only scheduled to a tainted node if it has a matching toleration.
Example
# Reserve nodes for GPU workloads by tainting them:
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule
# Remove a taint (note trailing minus):
kubectl taint nodes gpu-node-1 gpu=true:NoSchedule-
kubectl describe node gpu-node-1 | grep Taints
# Taints: gpu=true:NoSchedule # only Pods tolerating gpu=true may schedule here
Exercises
- (Beginner) Is a taint applied to a node or a Pod? What does it do?
- (Beginner) What three parts make up a taint?
- (Intermediate) Give two real-world reasons to taint a node.
- (Interview) Explain how taints (on nodes) and tolerations (on Pods) together enable "dedicated nodes," and contrast this mechanism with node affinity. (Hint: repel-by-default vs. attract; need both taint and toleration to reserve.)
Answers
- A taint is applied to a node; it repels Pods that don't have a matching toleration (keeping them from being scheduled there).
- A key, an optional value, and an effect (e.g., NoSchedule).
- Any two: reserve control-plane nodes (keep workloads off), dedicate GPU/specialized nodes to specific workloads, isolate nodes for a team/tenant, or mark nodes for maintenance/draining. Kubernetes also auto-taints nodes with conditions (not-ready, unreachable, pressure).
- Tainting a node repels all Pods by default; only Pods carrying a matching toleration may schedule there. To truly dedicate nodes you usually combine a taint (so unrelated Pods are kept off) with node affinity/nodeSelector on the intended Pods (so they're attracted to those nodes) — the toleration alone only permits scheduling, it doesn't force it. Contrast: node affinity is Pod-driven attraction ("I prefer/require these nodes"), while taints are node-driven repulsion ("stay away unless you tolerate me"). Affinity alone can't keep other Pods off a node; a taint can.
Toleration syntax and effects
Theory
A toleration on a Pod allows (but does not require) it to schedule onto nodes with a matching taint. A toleration specifies a key, an operator (Equal to match a specific value, or Exists to match any value of that key), an effect, and — for the NoExecute effect — an optional tolerationSeconds.
Matching rules: a toleration tolerates a taint if the keys, effects, and (for Equal) values match. operator: Exists with no key matches all taints (used to tolerate everything, e.g., for critical DaemonSets). Importantly, a toleration only permits scheduling onto a tainted node — it does not attract the Pod there; without a taint, tolerations have no effect. So to pin Pods to dedicated nodes you pair a toleration (to be allowed on the tainted node) with node affinity/selector (to be drawn to it).
Example
spec:
  tolerations:
  - key: "gpu"
    operator: "Equal"
    value: "true"
    effect: "NoSchedule"       # matches taint gpu=true:NoSchedule
  # tolerate ANY taint (e.g., for a cluster-critical agent):
  - operator: "Exists"
  # NoExecute with grace period: stay 300s after node is tainted, then evict
  - key: "node.kubernetes.io/unreachable"
    operator: "Exists"
    effect: "NoExecute"
    tolerationSeconds: 300
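The matching rules above translate into a short predicate. The Python sketch below is an illustrative model of those rules rather than Kubernetes source: tolerates is a hypothetical name, taints and tolerations are plain dicts, and an omitted effect on the toleration is treated as matching every effect.

```python
def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if this toleration matches this taint."""
    # An omitted effect on the toleration matches every effect.
    if toleration.get("effect") and toleration["effect"] != taint["effect"]:
        return False
    # An omitted key (valid only with Exists) matches every key.
    key = toleration.get("key")
    if key and key != taint["key"]:
        return False
    operator = toleration.get("operator", "Equal")
    if operator == "Exists":
        return True
    if not key:
        return False  # Equal requires a key
    return toleration.get("value", "") == taint.get("value", "")


gpu_taint = {"key": "gpu", "value": "true", "effect": "NoSchedule"}

print(tolerates({"key": "gpu", "operator": "Equal", "value": "true",
                 "effect": "NoSchedule"}, gpu_taint))   # True
print(tolerates({"operator": "Exists"}, gpu_taint))     # True: keyless Exists matches all
print(tolerates({"key": "gpu", "operator": "Equal", "value": "false",
                 "effect": "NoSchedule"}, gpu_taint))   # False: value differs
print(tolerates({"key": "gpu", "operator": "Exists",
                 "effect": "NoExecute"}, gpu_taint))    # False: effect differs
```

A Pod may schedule onto a node only if every NoSchedule or NoExecute taint on that node is matched by at least one of its tolerations.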
Exercises
- (Beginner) Does a toleration force a Pod onto a tainted node?
- (Beginner) What is the difference between the Equal and Exists operators in a toleration?
- (Intermediate) What does a toleration with only operator: Exists (no key) tolerate?
- (Interview) To dedicate GPU nodes exclusively to GPU workloads (and ensure those workloads actually land there), what combination of mechanisms do you need and why? (Hint: taint + toleration + affinity/selector.)
Answers
- No — it only permits scheduling onto a node with a matching taint; it does not attract or force the Pod there.
- Equal matches a taint with the same key and value; Exists matches a taint with the given key regardless of its value (and, with no key, matches any taint).
- All taints: a keyless operator: Exists toleration tolerates every taint on every node (commonly used for cluster-critical DaemonSets that must run everywhere).
- Three things: (1) taint the GPU nodes (e.g., gpu=true:NoSchedule) so non-GPU Pods are kept off; (2) add a matching toleration to GPU workloads so they're allowed on those nodes; and (3) add node affinity/nodeSelector (e.g., on a gpu label) to the GPU workloads so they're actually drawn to the GPU nodes rather than scheduled elsewhere. The taint keeps others out, the toleration lets the right Pods in, and the affinity ensures they go there; all three are needed for true dedication.
NoSchedule, PreferNoSchedule, NoExecute
Theory
A taint's effect determines what happens to Pods that don't tolerate it:
| Effect | Effect on non-tolerating Pods |
|---|---|
| NoSchedule | New Pods won't be scheduled onto the node; existing Pods stay. |
| PreferNoSchedule | The scheduler tries to avoid placing Pods here, but may if necessary (soft). |
| NoExecute | New Pods won't schedule and existing non-tolerating Pods are evicted. |
The crucial distinction is NoExecute: unlike the others, it affects already-running Pods, evicting those that don't tolerate the taint. This is how Kubernetes handles node problems: when a node becomes unreachable, the node controller adds a NoExecute taint, and Pods (subject to their tolerationSeconds) are evicted and rescheduled elsewhere. With NoExecute, a Pod's tolerationSeconds sets how long it may remain after the taint appears before eviction. Pods that don't declare their own toleration for the not-ready/unreachable taints automatically receive one with tolerationSeconds: 300, which is where the familiar five-minute delay comes from.
Example
kubectl taint nodes node-2 maintenance=true:NoExecute # evicts non-tolerating Pods now
kubectl taint nodes node-3 spot=true:PreferNoSchedule # soft avoidance
NoSchedule : keeps NEW pods off; running pods unaffected
PreferNoSchedule : best-effort avoidance for new pods
NoExecute : keeps new pods off AND evicts running non-tolerating pods
Exercises
- (Beginner) Which taint effect evicts already-running Pods that don't tolerate it?
- (Beginner) How does PreferNoSchedule differ from NoSchedule?
- (Intermediate) When a node becomes unreachable, which effect does Kubernetes apply automatically, and what role does tolerationSeconds play?
- (Interview) You want to drain a node of workloads immediately for emergency maintenance using taints. Which effect do you use and what happens to the Pods? (Hint: NoExecute eviction and rescheduling.)
Answers
- NoExecute.
- NoSchedule is a hard rule: non-tolerating new Pods are never placed on the node. PreferNoSchedule is soft: the scheduler tries to avoid the node but will place Pods there if no better option exists.
- It applies a NoExecute taint (e.g., node.kubernetes.io/unreachable). tolerationSeconds on the Pods' (auto-added) tolerations determines how long they may remain before eviction (300 seconds by default), giving a grace period in case the node recovers quickly before Pods are rescheduled.
- Apply a NoExecute taint to the node (e.g., kubectl taint nodes <node> maintenance=true:NoExecute). Non-tolerating running Pods are evicted from the node (per any tolerationSeconds), and their controllers (Deployment/ReplicaSet/StatefulSet) create replacements that are scheduled onto other nodes. (In practice kubectl drain is the standard tool: it cordons the node and evicts Pods gracefully through the Eviction API, which honors PodDisruptionBudgets. Taint-based eviction is a separate mechanism and does not consult PodDisruptionBudgets.)
Use cases: dedicated nodes, GPU nodes
Theory
Bringing taints, tolerations, and affinity together, here are the canonical patterns:
- Dedicated nodes (for a team/tenant/workload class): taint the nodes (dedicated=team-a:NoSchedule), add the matching toleration to that team's Pods, and use node affinity/labels so those Pods target the dedicated nodes. Result: only team-a's Pods run there, and they reliably do.
- GPU/specialized hardware nodes: GPU nodes are expensive; you don't want ordinary Pods consuming them. Taint them (nvidia.com/gpu=present:NoSchedule), and only GPU workloads (which tolerate it and request the GPU resource) land there. Often the device plugin and node labels are combined with the taint.
- Spot/preemptible nodes: use a dedicated NoSchedule taint so only fault-tolerant workloads (with tolerations) use the cheaper, interruptible capacity, or PreferNoSchedule if you only want to discourage other workloads rather than exclude them.
The recurring lesson: taint to repel, tolerate to permit, affinity to attract — use them in combination to achieve precise, reliable placement.
Example
# GPU workload: tolerate the GPU taint, request the GPU, and target GPU nodes
spec:
  tolerations:
  - { key: "nvidia.com/gpu", operator: "Exists", effect: "NoSchedule" }
  nodeSelector:
    accelerator: nvidia-tesla         # attract to GPU nodes
  containers:
  - name: trainer
    image: ml-trainer:1.0
    resources:
      limits: { nvidia.com/gpu: 1 }   # request 1 GPU (via device plugin)
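The "repel, permit, attract" rule can be checked with a small Python model that shows what goes wrong when one of the three pieces is missing. Everything here is a deliberately simplified assumption: taints and tolerations are compared as identical strings, the node names and the pool label are hypothetical, and only filtering is modeled.

```python
TAINT = "dedicated=team-a:NoSchedule"


def can_schedule(pod: dict, node: dict) -> bool:
    """A node is feasible if all its taints are tolerated and the selector matches."""
    tolerated = all(t in pod.get("tolerations", []) for t in node.get("taints", []))
    selected = all(
        node["labels"].get(k) == v for k, v in pod.get("nodeSelector", {}).items()
    )
    return tolerated and selected


def feasible(pod: dict, nodes: list) -> list:
    return [node["name"] for node in nodes if can_schedule(pod, node)]


nodes = [
    {"name": "dedicated-1", "labels": {"pool": "team-a"}, "taints": [TAINT]},
    {"name": "general-1", "labels": {}, "taints": []},
]
untainted = [dict(node, taints=[]) for node in nodes]  # same nodes, taint omitted

team_pod = {"tolerations": [TAINT], "nodeSelector": {"pool": "team-a"}}
no_selector = {"tolerations": [TAINT]}
no_toleration = {"nodeSelector": {"pool": "team-a"}}
other_pod = {}

print(feasible(team_pod, nodes))        # ['dedicated-1']: exclusive and reliable
print(feasible(no_selector, nodes))     # ['dedicated-1', 'general-1']: may land elsewhere
print(feasible(no_toleration, nodes))   # []: stays Pending
print(feasible(other_pod, nodes))       # ['general-1']: kept off the dedicated node
print(feasible(other_pod, untainted))   # ['dedicated-1', 'general-1']: no exclusivity
```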
Exercises
- (Beginner) Why are GPU nodes commonly tainted?
- (Beginner) What three mechanisms combine to reliably place a workload on dedicated nodes?
- (Intermediate) For spot/preemptible nodes, why might you only allow fault-tolerant workloads, and how do taints help?
- (Interview) Summarize the roles "taint, toleration, affinity" each play in achieving exclusive, reliable placement, and what breaks if you omit any one. (Hint: repel/permit/attract.)
Answers
- GPUs are scarce and expensive; tainting GPU nodes keeps ordinary Pods off them so the costly hardware is reserved for workloads that actually need (and tolerate + request) GPUs.
- A taint on the nodes (repel others), a matching toleration on the workload (permit it there), and node affinity/nodeSelector on the workload (attract it to those nodes).
- Spot/preemptible nodes can be reclaimed at any time, so only workloads that tolerate sudden termination (stateless, restartable, or checkpointed) should run there. Taints (a dedicated taint or PreferNoSchedule) keep non-tolerant workloads off, while fault-tolerant workloads carry the toleration to opt in — protecting critical services from running on interruptible capacity.
- Taint repels all non-tolerating Pods from the node (exclusivity). Toleration permits a specific Pod onto the tainted node (access). Affinity/selector attracts the Pod to those nodes (reliable placement). Omit the taint → other Pods can also use the nodes (no exclusivity). Omit the toleration → your intended Pod can't schedule there at all. Omit the affinity → your Pod is allowed on the nodes but may be scheduled elsewhere instead, so it isn't reliably placed. All three are needed for exclusive, dependable placement.
8.4 Resource Quotas
In multi-tenant clusters, you must stop one namespace from consuming all resources. ResourceQuotas and PriorityClasses provide fairness and prioritization. This subchapter covers them.
ResourceQuota for namespaces
Theory
A ResourceQuota caps the aggregate resource consumption within a namespace — total CPU/memory requests and limits, and counts of objects. It's the primary tool for multi-tenancy fairness: give each team a namespace with a quota so no single team can exhaust the cluster. It's enforced at admission time — if a new Pod would push the namespace over its quota, the request is rejected.
A key interaction: once a ResourceQuota constrains CPU/memory requests or limits in a namespace, every Pod created there must specify those requests/limits (otherwise admission can't account for them and rejects the Pod). This is why ResourceQuota is usually paired with a LimitRange that supplies defaults — so Pods without explicit resources still get values that satisfy the quota accounting. ResourceQuota operates on namespace totals; LimitRange operates per object.
Example
apiVersion: v1
kind: ResourceQuota
metadata: { name: team-quota, namespace: team-a }
spec:
  hard:
    requests.cpu: "10"       # total CPU requests across the namespace
    requests.memory: 20Gi
    limits.cpu: "20"
    limits.memory: 40Gi
    pods: "50"               # max number of Pods
# Inspect current usage against the quota:
kubectl get resourcequota team-quota -n team-a
# NAME AGE REQUEST LIMIT
# team-quota 1d requests.cpu: 6/10, requests.memory: 12Gi/20Gi ...
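The admission-time check, including the rule that Pods must declare every resource the quota tracks, can be sketched as follows. This is a simplified Python model: admit is a hypothetical function, CPU is expressed in cores and memory in Gi as plain numbers, and the used values are made-up sample data rather than output from a real cluster.

```python
def admit(hard: dict, used: dict, pod: dict) -> tuple:
    """Return (allowed, reason) for creating one Pod under a namespace quota."""
    for resource, cap in hard.items():
        if resource == "pods":
            demand = 1  # every Pod counts once against the object-count quota
        elif resource not in pod:
            # The quota tracks this resource, so the Pod must declare it.
            return False, f"must specify {resource}"
        else:
            demand = pod[resource]
        if used.get(resource, 0) + demand > cap:
            return False, f"exceeded quota: {resource}"
    return True, "ok"


# CPU in cores, memory in Gi (plain numbers to keep the sketch short).
hard = {"requests.cpu": 10, "requests.memory": 20,
        "limits.cpu": 20, "limits.memory": 40, "pods": 50}
used = {"requests.cpu": 6, "requests.memory": 12,
        "limits.cpu": 12, "limits.memory": 24, "pods": 20}

small = {"requests.cpu": 2, "requests.memory": 4, "limits.cpu": 4, "limits.memory": 8}
big = {"requests.cpu": 5, "requests.memory": 4, "limits.cpu": 6, "limits.memory": 8}
no_limits = {"requests.cpu": 1, "requests.memory": 1}

print(admit(hard, used, small))      # (True, 'ok')
print(admit(hard, used, big))        # (False, 'exceeded quota: requests.cpu')
print(admit(hard, used, no_limits))  # (False, 'must specify limits.cpu')
```

The third case is the one a LimitRange with defaults prevents: the defaults are injected before the quota check, so the Pod arrives with all four values set.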
Exercises
- (Beginner) What does a ResourceQuota limit, and at what scope?
- (Beginner) When is a ResourceQuota enforced?
- (Intermediate) Why must Pods specify requests/limits once a ResourceQuota on those resources exists, and what helps with that?
- (Interview) How do ResourceQuota and LimitRange complement each other in a multi-tenant cluster? (Hint: aggregate caps vs. per-object defaults/bounds.)
Answers
- The aggregate resource consumption of a namespace — total CPU/memory requests and limits and counts of objects (Pods, Services, PVCs, etc.).
- At admission time — when a new object is created; if it would exceed the namespace's hard quota, the request is rejected.
- Because the quota tracks total requests/limits, the admission controller must know each Pod's requests/limits to account for them; a Pod that omits them can't be counted against a CPU/memory quota and is rejected. A LimitRange with defaults solves this by injecting requests/limits into Pods that don't specify them, so they satisfy quota accounting.
- ResourceQuota enforces namespace-wide aggregate caps (fairness/capacity across tenants), while LimitRange enforces per-object defaults and min/max bounds. Together: LimitRange ensures every Pod has sensible bounded resources (and defaults so quota math works), and ResourceQuota ensures the namespace's total stays within its allotment — preventing any one tenant from monopolizing the cluster while keeping individual Pods well-formed.
Compute and object count quotas
Theory
ResourceQuota covers two broad categories:
- Compute resource quotas: limit requests.cpu, requests.memory, limits.cpu, limits.memory, and extended resources such as hugepages-<size> and even requests.nvidia.com/gpu. These bound how much compute the namespace can reserve/use.
- Object count quotas: limit the number of objects of a kind: pods, services, configmaps, secrets, persistentvolumeclaims, services.loadbalancers (cap costly cloud LBs!), services.nodeports, and even counts of arbitrary resources via the count/<resource>.<group> syntax.
Object counts matter for reasons beyond compute: too many Secrets/ConfigMaps bloat etcd; uncapped LoadBalancer Services rack up cloud costs; runaway object creation can degrade the control plane. Storage quotas also exist (requests.storage, and per-StorageClass via <class>.storageclass.storage.k8s.io/requests.storage). Together these let platform teams bound both the compute footprint and the object/API footprint of each tenant.
Example
spec:
  hard:
    # compute
    requests.cpu: "10"
    limits.memory: 40Gi
    # object counts
    pods: "50"
    services.loadbalancers: "2"   # cap expensive cloud LBs
    persistentvolumeclaims: "20"
    count/deployments.apps: "30"
    # storage (per class)
    gold.storageclass.storage.k8s.io/requests.storage: 100Gi
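The accounting behind a hard quota can be sketched in a few lines of Python. This is a simplified model for intuition, not the API server's implementation: parse_quantity and would_admit are hypothetical helpers, only a handful of quantity suffixes are handled, and the "used" figures are made up.

```python
# Simplified model of ResourceQuota admission accounting (illustration only).
# parse_quantity and would_admit are hypothetical helpers, not Kubernetes APIs.

COMPUTE = {"requests.cpu", "requests.memory", "limits.cpu", "limits.memory"}
SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "m": 0.001}


def parse_quantity(q):
    """Convert a small subset of quantities ("500m", "40Gi", "10") to a number."""
    q = str(q)
    for suffix, factor in SUFFIXES.items():
        if q.endswith(suffix):
            return float(q[: -len(suffix)]) * factor
    return float(q)


def would_admit(hard, used, new_object):
    """Return (admitted, reason) for one new object against the namespace quota."""
    for resource, cap in hard.items():
        if resource not in new_object:
            # A compute quota exists but the Pod does not declare that resource.
            if resource in COMPUTE:
                return False, "must specify " + resource
            continue
        total = parse_quantity(used.get(resource, "0")) + parse_quantity(new_object[resource])
        if total > parse_quantity(cap):
            return False, "exceeded quota: " + resource
    return True, "ok"


hard = {"requests.cpu": "10", "limits.memory": "40Gi", "pods": "50"}
used = {"requests.cpu": "9500m", "limits.memory": "30Gi", "pods": "12"}  # made-up usage

small_pod = {"requests.cpu": "250m", "limits.memory": "1Gi", "pods": "1"}
big_pod = {"requests.cpu": "1", "limits.memory": "1Gi", "pods": "1"}
bare_pod = {"pods": "1"}  # no requests/limits declared

for pod in (small_pod, big_pod, bare_pod):
    print(would_admit(hard, used, pod))
```

The third case shows why a LimitRange with defaults is usually paired with a compute quota: a Pod that declares nothing cannot be accounted for and is turned away.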
Exercises
- (Beginner) Name the two broad categories of quotas a ResourceQuota can enforce.
- (Beginner) Give one object-count quota that directly limits cloud cost.
- (Intermediate) Why might you cap the number of Secrets or ConfigMaps in a namespace?
- (Interview) Beyond compute, why are object-count and storage quotas important for cluster stability and cost? (Hint: etcd bloat, control-plane load, cloud LB/disk spend.)
Answers
- Compute resource quotas (CPU/memory/GPU requests and limits) and object count quotas (numbers of objects like Pods, Services, PVCs).
- services.loadbalancers (each LoadBalancer Service provisions a paid cloud load balancer).
- Every Secret/ConfigMap is stored in etcd and watched; large numbers bloat etcd, increase memory/IO and watch load, and can degrade API performance. Capping them prevents a tenant from inadvertently (or maliciously) overwhelming the datastore.
- Object counts and storage consume shared, finite resources beyond node CPU/memory: etcd storage and watch/API load (many objects slow the control plane), and real money (each LoadBalancer Service = a cloud LB; each PVC = a provisioned disk). Quotas on these protect control-plane stability (bounded etcd/API footprint) and budget (bounded LB/disk spend), complementing compute quotas to keep the whole platform healthy and cost-controlled.
Quota scopes
Theory
Sometimes you want a quota to apply only to a subset of objects in a namespace — for example, "BestEffort Pods may number at most 10" or "terminating (batch) Pods get a separate CPU budget." Quota scopes enable this by restricting which objects a ResourceQuota counts.
Built-in scopes include Terminating / NotTerminating (Pods with an active deadline vs. without), BestEffort / NotBestEffort (by QoS class), and PriorityClass (via scopeSelector, matching Pods of certain priority classes). The modern, expressive form is scopeSelector with operators, letting you, e.g., apply a quota only to Pods with priorityClassName in (high). Scopes let one namespace have differentiated quotas for different workload classes — separating, say, long-running services from short batch jobs, or capping low-QoS workloads separately.
Example
# Quota that ONLY applies to high-priority Pods in the namespace:
apiVersion: v1
kind: ResourceQuota
metadata: { name: high-prio-quota, namespace: team-a }
spec:
  hard:
    requests.cpu: "4"
    requests.memory: 8Gi
  scopeSelector:
    matchExpressions:
    - { scopeName: PriorityClass, operator: In, values: [ "high" ] }
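To make the selection rule concrete, here is a toy model of how a scopeSelector decides whether a Pod is counted. The Pod dictionaries and the functions qos_class, matches_scope, and quota_counts are illustrative assumptions; the real admission logic inspects full Pod specs, and for the non-PriorityClass scopes this sketch ignores the operator.

```python
# Toy model of quota scope matching; the names are illustrative, not a Kubernetes API.


def qos_class(pod):
    """Treat a Pod with no requests/limits at all as BestEffort."""
    return "BestEffort" if not pod.get("resources") else "NotBestEffort"


def matches_scope(pod, expr):
    """Evaluate one scopeSelector expression against a Pod (a plain dict here)."""
    scope, op, values = expr["scopeName"], expr["operator"], expr.get("values", [])
    if scope == "PriorityClass":
        name = pod.get("priorityClassName")
        if op == "In":
            return name in values
        if op == "NotIn":
            return name not in values
        if op == "Exists":
            return name is not None
        if op == "DoesNotExist":
            return name is None
    if scope in ("BestEffort", "NotBestEffort"):
        return qos_class(pod) == scope
    if scope in ("Terminating", "NotTerminating"):
        has_deadline = pod.get("activeDeadlineSeconds") is not None
        return has_deadline == (scope == "Terminating")
    raise ValueError("unsupported scope or operator")


def quota_counts(pod, match_expressions):
    """A scoped quota tracks a Pod only if every expression matches."""
    return all(matches_scope(pod, e) for e in match_expressions)


high_only = [{"scopeName": "PriorityClass", "operator": "In", "values": ["high"]}]
best_effort = [{"scopeName": "BestEffort", "operator": "Exists"}]

print(quota_counts({"priorityClassName": "high"}, high_only))
print(quota_counts({"priorityClassName": "low"}, high_only))
print(quota_counts({}, best_effort))
```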
Exercises
- (Beginner) What do quota scopes let you restrict?
- (Beginner) Name two built-in quota scopes.
- (Intermediate) How would you create a quota that only limits BestEffort Pods?
- (Interview) Why are quota scopes (e.g., by PriorityClass) useful in a namespace that runs both critical services and batch jobs? (Hint: differentiated budgets per workload class.)
Answers
- Which subset of objects a ResourceQuota applies to/counts (e.g., only Pods of a certain QoS, priority, or termination behavior).
- Any two: Terminating, NotTerminating, BestEffort, NotBestEffort, PriorityClass (via scopeSelector).
- Create a ResourceQuota with the BestEffort scope (or a scopeSelector matching it); it then counts only Pods in the BestEffort QoS class. With this scope the quota can track only the pods count (BestEffort Pods declare no requests or limits), so in practice you cap how many such Pods may exist.
- Scopes let one namespace enforce different budgets for different workload classes. For instance, you can give high-priority/critical services a guaranteed CPU/memory budget while separately capping batch or BestEffort jobs, so batch work can't crowd out critical services and each class is bounded appropriately. This is finer-grained fairness than a single namespace-wide quota.
Priority classes and PriorityLevelConfiguration
Theory
Two distinct "priority" mechanisms exist; don't conflate them:
- PriorityClass (scheduling priority): a cluster-scoped object mapping a name to an integer priority. A Pod's priorityClassName sets how important it is for scheduling and preemption. Higher-priority Pending Pods are scheduled before lower-priority ones, and, if preemptionPolicy allows, a high-priority Pod can preempt (evict) lower-priority Pods to make room when the cluster is full. System-critical Pods use built-in classes like system-cluster-critical.
- PriorityLevelConfiguration (API Priority and Fairness, APF): part of the API server's flow control, it classifies and rate-limits API requests so that a flood of low-importance requests can't starve critical ones (e.g., protecting leader election and node heartbeats). Paired with FlowSchema, it ensures fair, bounded concurrency for API traffic.
In short: PriorityClass prioritizes Pods for the scheduler; PriorityLevelConfiguration prioritizes API requests for the API server.
Example
# Scheduling priority (PriorityClass):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata: { name: high }
value: 1000000
globalDefault: false
description: "Critical user-facing services"
preemptionPolicy: PreemptLowerPriority
---
# Pod opting into it:
spec:
  priorityClassName: high
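The ordering and preemption behavior can be pictured with a small sketch. It is a deliberate simplification: the real kube-scheduler evaluates candidates node by node and takes PodDisruptionBudgets, affinity, and graceful termination into account. The Pod dictionaries, the millicore figures, and the function names are assumptions made for illustration.

```python
# Simplified picture of priority ordering and preemption (not the scheduler's algorithm).


def schedule_order(pending):
    """Higher priority first; Python's stable sort keeps arrival order for ties."""
    return sorted(pending, key=lambda p: -p["priority"])


def pick_victims(running, incoming, cpu_needed):
    """Pick lowest-priority Pods below the incoming priority until enough CPU is freed."""
    if incoming.get("preemptionPolicy", "PreemptLowerPriority") == "Never":
        return []
    candidates = sorted(
        (p for p in running if p["priority"] < incoming["priority"]),
        key=lambda p: p["priority"],
    )
    victims, freed = [], 0
    for pod in candidates:
        if freed >= cpu_needed:
            break
        victims.append(pod["name"])
        freed += pod["cpu"]
    # Preempt nothing if even all candidates together would not free enough.
    return victims if freed >= cpu_needed else []


running = [
    {"name": "batch-1", "priority": 100, "cpu": 500},
    {"name": "batch-2", "priority": 200, "cpu": 500},
    {"name": "web", "priority": 1000000, "cpu": 1000},
]
incoming = {"name": "api", "priority": 1000000, "cpu": 800}

print(pick_victims(running, incoming, cpu_needed=800))
```

Note that the existing "web" Pod is never a candidate: a Pod can only preempt Pods of strictly lower priority.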
Exercises
- (Beginner) What does a PriorityClass affect for a Pod?
- (Beginner) What is preemption in the context of PriorityClass?
- (Intermediate) Distinguish PriorityClass from PriorityLevelConfiguration in one sentence each.
- (Interview) Why is API Priority and Fairness (PriorityLevelConfiguration + FlowSchema) important for control-plane stability under load? (Hint: prevent low-priority request floods from starving critical API traffic.)
Answers
- Its scheduling importance — the order in which Pending Pods are scheduled and whether it can preempt lower-priority Pods to get resources.
- When the cluster lacks room for a high-priority Pending Pod, the scheduler can evict (preempt) lower-priority running Pods to free resources for it (subject to preemptionPolicy), then schedule the high-priority Pod.
- PriorityClass sets a Pod's scheduling/preemption priority for the kube-scheduler. PriorityLevelConfiguration (with FlowSchema) sets the priority/concurrency of API server requests for API Priority and Fairness flow control.
- The API server has finite concurrency; without flow control, a burst of low-value requests (e.g., a misbehaving controller listing everything repeatedly) could saturate it and starve critical traffic like leader election, node heartbeats, and scheduler/controller calls — risking cascading control-plane failure. APF classifies requests into priority levels with bounded, fair concurrency and queuing, so essential requests are protected and the control plane stays responsive and stable under heavy or abusive load.
8.5 Autoscaling
Static replica counts and node pools waste money and risk outages. Kubernetes autoscales at three levels — pods (count), pods (size), and nodes — plus event-driven scaling. This subchapter covers them.
Horizontal Pod Autoscaler (HPA)
Theory
The Horizontal Pod Autoscaler (HPA) automatically adjusts the number of replicas of a workload (Deployment/StatefulSet) based on observed metrics — classically CPU utilization, but also memory, custom, and external metrics. It scales out (more Pods) under load and in (fewer Pods) when idle, keeping the target metric near a desired value.
The HPA controller periodically (default every 15s) reads metrics (via the Metrics API / Metrics Server for resource metrics) and computes desired replicas with roughly: desired = ceil(current * (currentMetric / targetMetric)). For example, with a 50% CPU target, if current utilization is 100% across 2 Pods, it scales to ~4. It respects minReplicas/maxReplicas and uses stabilization windows to avoid flapping. The HPA requires resource requests to be set (to compute utilization percentages) and the Metrics Server installed. It is the most common autoscaler and the standard way to handle variable traffic.
Example
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web }
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web
  minReplicas: 2
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target: { type: Utilization, averageUtilization: 50 }   # keep CPU ~50%
kubectl get hpa web
# NAME REFERENCE TARGETS MINPODS MAXPODS REPLICAS
# web Deployment/web 72%/50% 2 10 4 # scaling up
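The scaling formula from the Theory section can be checked with a few lines of Python. This is a sketch of the arithmetic only: desired_replicas is a hypothetical function, the 10% tolerance is the controller's documented default, and stabilization windows and scaling policies are left out.

```python
import math


def desired_replicas(current_replicas, current_metric, target_metric,
                     min_replicas=1, max_replicas=10, tolerance=0.1):
    """desired = ceil(current * currentMetric / targetMetric), clamped to [min, max]."""
    ratio = current_metric / target_metric
    # Skip scaling when the metric is already close enough to the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    desired = math.ceil(current_replicas * ratio)
    return max(min_replicas, min(max_replicas, desired))


print(desired_replicas(2, 100, 50))                  # the worked example: 2 Pods at 100%
print(desired_replicas(4, 72, 50, min_replicas=2))   # the kubectl output above
print(desired_replicas(4, 52, 50))                   # within tolerance: no change
```

Applied to the kubectl output shown above (4 replicas at 72% against a 50% target), the arithmetic gives ceil(4 × 1.44) = 6, which is why the HPA is still scaling up.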
Exercises
- (Beginner) What does the HPA adjust automatically?
- (Beginner) What two prerequisites must be in place for CPU-based HPA to work?
- (Intermediate) With a 50% CPU target, 3 Pods at 90% average utilization, roughly how many replicas will the HPA target?
- (Interview) Why does the HPA require resource requests to be set, and what role do stabilization windows play? (Hint: utilization is relative to requests; avoid flapping.)
Answers
- The number of replicas (Pods) of the target workload, scaling out/in based on metrics.
- Resource requests must be set on the Pods (so utilization percentage can be computed), and the Metrics Server (Metrics API) must be installed to provide CPU/memory metrics.
- desired ≈ ceil(3 × (90/50)) = ceil(5.4) = 6 replicas (bounded by maxReplicas).
- CPU/memory "utilization" is expressed as a percentage of the Pod's request, so without a request there's no denominator to compute the percentage — the HPA can't determine utilization. Stabilization windows (and scaling policies) damp rapid oscillations: they make the controller wait/consider recent recommendations before scaling down (and optionally up), preventing "flapping" where transient metric spikes/dips cause constant scale up/down churn that disrupts the workload.
Vertical Pod Autoscaler (VPA)
Theory
Where the HPA changes the number of Pods, the Vertical Pod Autoscaler (VPA) changes the size of Pods — automatically recommending or setting CPU/memory requests and limits based on actual historical usage. It's for workloads that don't scale horizontally well (e.g., a single-writer database, or a JVM app) or to right-size requests you'd otherwise have to tune by hand.
The VPA has three modes: Off (only produces recommendations you can read), Initial (sets resources only at Pod creation), and Auto/Recreate (updates running Pods by evicting and recreating them with new resources — disruptive, since changing a running container's requests historically required a restart). A major caveat: VPA and HPA should not both act on the same resource metric (e.g., both on CPU), as they fight each other. VPA is best for vertical right-sizing; HPA for horizontal scaling on load. (In-place Pod resize is reducing VPA's disruptiveness in newer Kubernetes.)
Example
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata: { name: db-vpa }
spec:
  targetRef:
    apiVersion: apps/v1
    kind: StatefulSet
    name: db
  updatePolicy:
    updateMode: "Auto"   # Off | Initial | Auto (evicts to apply new sizes)
kubectl describe vpa db-vpa | grep -A6 "Recommendation"
# Target: cpu: 850m memory: 1200Mi (suggested requests based on usage)
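To give a feel for how a request can be derived from historical usage, here is a deliberately simple percentile-plus-margin sketch. It is not the VPA recommender's actual algorithm (which works on decaying histograms); recommend_request, the 90th percentile, and the 15% margin are assumptions chosen for illustration, and the usage samples are invented.

```python
import math


def recommend_request(samples, percentile_pct=90, margin_pct=15):
    """Nearest-rank percentile of observed usage plus a safety margin (same unit as input)."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = math.ceil(percentile_pct * len(ordered) / 100)  # 1-based nearest rank
    base = ordered[max(0, rank - 1)]
    return math.ceil(base * (100 + margin_pct) / 100)


# Invented CPU usage samples in millicores.
cpu_usage_m = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
print(recommend_request(cpu_usage_m))
```

Sizing to a high percentile rather than the peak keeps requests from being dominated by one-off spikes, while the margin leaves headroom above typical usage.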
Exercises
- (Beginner) What does the VPA adjust, in contrast to the HPA?
- (Beginner) What are the VPA's update modes?
- (Intermediate) Why has applying VPA recommendations to running Pods traditionally been disruptive?
- (Interview) Why should you avoid running HPA and VPA on the same metric for the same workload, and how can they be combined safely? (Hint: conflicting actions; HPA on custom metric while VPA on CPU/mem, or VPA Off mode.)
Answers
- The VPA adjusts the size of Pods — their CPU/memory requests (and limits) — whereas the HPA adjusts the number of Pod replicas.
- Off (recommendations only), Initial (set resources only at Pod creation), and Auto/Recreate (update running Pods by evicting and recreating them with new resources).
- Changing a running container's resource requests historically required recreating the Pod, so the VPA in Auto mode evicts and recreates Pods to apply new sizes, causing restarts/disruption. (In-place Pod vertical scaling in newer Kubernetes is making this less disruptive.)
- If both act on the same metric (e.g., CPU), they conflict: the HPA adds replicas to lower per-Pod CPU while the VPA raises per-Pod CPU requests based on usage, and they oscillate against each other. Combine them safely by having the HPA scale on a different signal (a custom/external metric like requests-per-second or queue depth) while the VPA right-sizes CPU/memory, or run the VPA in Off mode for recommendations only, so the two never contend over the same dimension.
Cluster Autoscaler
Theory
The HPA and VPA change Pods, but if there's no room on existing nodes, new Pods stay Pending. The Cluster Autoscaler (CA) adjusts the number of nodes: it watches for Pods that can't be scheduled due to insufficient resources and adds nodes (by scaling the cloud provider's node groups / autoscaling groups), and it removes underutilized nodes whose Pods can fit elsewhere, to save cost.
Key behaviors: CA scales up when there are unschedulable Pods that would fit on a new node of an existing node group; it scales down when a node has been underutilized for a period and its Pods can be safely rescheduled (respecting PodDisruptionBudgets and opt-out annotations such as cluster-autoscaler.kubernetes.io/safe-to-evict: "false"). It works per node group and relies on the cloud provider integration. CA complements the HPA: HPA adds Pods, and when those Pods don't fit, CA adds nodes, so together they scale the whole stack. (Newer alternatives like Karpenter provision right-sized nodes more flexibly.)
Example
Traffic up -> HPA adds Pods -> some Pods Pending (no node capacity)
                                      |
                                      v
                  Cluster Autoscaler sees unschedulable Pods
                                      |
                                      v
                  adds a node to the node group -> Pods schedule

Traffic down -> Pods removed -> node underutilized for N min
             -> CA drains and removes the node -> cost saved
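The two decisions in the diagram can be written as small predicates. This is a sketch under stated assumptions, not the Cluster Autoscaler's code: the dictionaries, the field names (node_cpu, safe_to_evict, and so on), and the functions are hypothetical, CPU is in millicores and memory in MiB, and the 0.5 utilization threshold and 10-minute wait are used here as assumed defaults.

```python
# Sketch of the Cluster Autoscaler's two decisions (illustrative names and fields).


def should_scale_up(pending_pods, group):
    """True if the group may grow and at least one Pending Pod fits a fresh node."""
    if group["size"] >= group["max_size"]:
        return False
    return any(
        p["cpu"] <= group["node_cpu"] and p["memory"] <= group["node_memory"]
        for p in pending_pods
    )


def scale_down_candidate(node, idle_minutes, threshold=0.5, unneeded_minutes=10):
    """True if requests stay below the threshold long enough and every Pod is movable."""
    requested = sum(p["cpu"] for p in node["pods"])
    underutilized = requested / node["cpu"] < threshold
    movable = all(p.get("safe_to_evict", True) for p in node["pods"])
    return underutilized and idle_minutes >= unneeded_minutes and movable


group = {"size": 3, "max_size": 5, "node_cpu": 4000, "node_memory": 16384}
node = {"cpu": 4000, "pods": [{"cpu": 500}, {"cpu": 700}]}

print(should_scale_up([{"cpu": 500, "memory": 1024}], group))
print(scale_down_candidate(node, idle_minutes=15))
```

A Pod that is too large for any node type in the group never triggers a scale-up, which is a common reason Pods stay Pending even with the autoscaler running.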
Exercises
- (Beginner) What does the Cluster Autoscaler adjust?
- (Beginner) What triggers the Cluster Autoscaler to add a node?
- (Intermediate) Under what conditions will the Cluster Autoscaler remove a node?
- (Interview) Explain how the HPA and Cluster Autoscaler work together to handle a traffic surge, and what each is responsible for. (Hint: HPA scales Pods, CA scales nodes; Pending Pods bridge them.)
Answers
- The number of nodes in the cluster (by scaling cloud node groups up or down).
- The presence of unschedulable (Pending) Pods that don't fit on current nodes but would fit if a new node of an existing node group were added.
- When a node is underutilized for a sustained period and all its Pods can be safely rescheduled onto other nodes. CA respects PodDisruptionBudgets and leaves a node alone if it hosts Pods that can't be moved (e.g., certain local-storage or non-replicated Pods, or those annotated cluster-autoscaler.kubernetes.io/safe-to-evict: "false"). Otherwise it drains and removes the node to save cost.
- On a surge, the HPA detects higher metrics and increases the Deployment's replica count. If existing nodes lack capacity, the new Pods become Pending. The Cluster Autoscaler observes these unschedulable Pods and adds node(s) to a node group; once the node joins, the Pending Pods are scheduled. The HPA owns how many Pods; the CA owns how many nodes — and Pending Pods are the signal that bridges the two so the whole stack scales to meet demand.
KEDA: event-driven autoscaling
Theory
The HPA scales on CPU/memory (and custom metrics), but many workloads should scale on event sources the HPA doesn't natively understand: queue length (Kafka, RabbitMQ, SQS), Prometheus query results, cron schedules, database row counts, cloud pub/sub backlog. KEDA (Kubernetes Event-Driven Autoscaling) fills this gap. It's an add-on that provides scalers for dozens of event sources and drives autoscaling — including the ability to scale to zero when there are no events (which the plain HPA cannot do).
Architecturally, KEDA acts as a metrics adapter and operator: you define a ScaledObject referencing your Deployment and a trigger (e.g., "scale based on Kafka topic lag"). KEDA translates the external event metric into something the HPA mechanism consumes (it actually creates/manages an HPA under the hood for the scaling math), and it independently handles activation from/to zero. This makes KEDA the standard tool for queue-driven workers and bursty event processing.
Example
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: consumer }
spec:
  scaleTargetRef: { name: consumer }   # the Deployment to scale
  minReplicaCount: 0                   # scale to zero when idle
  maxReplicaCount: 30
  triggers:
  - type: kafka
    metadata:
      bootstrapServers: kafka:9092
      consumerGroup: orders
      topic: orders
      lagThreshold: "100"              # add replicas per 100 messages of lag
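For the ScaledObject above, the resulting replica count can be approximated as follows. This is a sketch of the arithmetic only, with an assumed function name and an assumed activation_lag parameter; the real Kafka scaler has further rules (for example around partition counts) that are not modeled here.

```python
import math


def keda_replicas(lag, lag_threshold=100, min_replicas=0, max_replicas=30,
                  activation_lag=0):
    """Zero (or min) while inactive; otherwise about one replica per lag_threshold messages."""
    if lag <= activation_lag:
        # Activation phase: handled by KEDA itself, outside the HPA.
        return min_replicas
    # 1..N range: the HPA-style calculation on the external metric.
    desired = math.ceil(lag / lag_threshold)
    return max(1, min_replicas, min(max_replicas, desired))


for lag in (0, 1, 250, 5000):
    print(lag, "->", keda_replicas(lag))
```

The two branches mirror the split described in the Theory section: KEDA decides whether the workload is active at all, and the managed HPA does the proportional math once it is.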
Exercises
- (Beginner) What kind of signals does KEDA let you scale on that the plain HPA cannot?
- (Beginner) What HPA limitation does KEDA notably overcome regarding idle workloads?
- (Intermediate) In KEDA, what object do you define to configure event-driven scaling, and what does it reference?
- (Interview) KEDA creates an HPA under the hood yet adds capabilities. Explain its architecture and why scale-to-zero requires logic beyond the standard HPA. (Hint: external scalers/metrics adapter; HPA min is 1; activation phase.)
Answers
- External event-source metrics — e.g., message-queue lag/length (Kafka, RabbitMQ, SQS), Prometheus query results, cloud pub/sub backlog, cron schedules, database queries — dozens of scalers beyond CPU/memory.
- Scale to zero: KEDA can scale a workload down to 0 replicas when there are no events (and back up when events arrive), which the standard HPA cannot do (its minimum is 1).
- A ScaledObject (or ScaledJob), which references the target workload (scaleTargetRef) and defines one or more triggers (event sources) with thresholds.
- KEDA runs an operator plus a metrics adapter exposing external-event metrics through the Kubernetes external/custom metrics API. For the 1..N scaling range it creates and manages a standard HPA that consumes those metrics (reusing the HPA's scaling math). But the HPA's minimum is 1 replica, so KEDA itself handles the activation phase: it watches the event source and, when activity crosses a threshold, scales the workload from 0 to 1 (and back to 0 when idle) outside the HPA's purview. Thus scale-to-zero needs KEDA's extra activation logic because the HPA alone can neither read arbitrary event sources nor go to zero.
Custom and external metrics
Theory
The HPA's power comes from the metrics APIs it can consume, of which there are three:
- Resource Metrics API (metrics.k8s.io, via Metrics Server): CPU and memory of Pods, the classic HPA input.
- Custom Metrics API (custom.metrics.k8s.io): application/Kubernetes-object metrics exposed by an adapter (e.g., the Prometheus Adapter). Scale on things like requests per second, queue depth as a Pod metric, or error rate. These are metrics associated with Kubernetes objects.
- External Metrics API (external.metrics.k8s.io): metrics from outside the cluster not tied to a Kubernetes object, e.g., a cloud queue length, a SaaS metric, a global business KPI.
Scaling on the right metric matters: CPU is a poor proxy for many workloads (a queue worker may be I/O-bound; a web app's true pressure is RPS or p99 latency). The Custom/External metrics APIs let you autoscale on the signal that actually reflects load. An adapter (Prometheus Adapter, KEDA, cloud adapters) implements these APIs to feed the HPA.
Example
# HPA scaling on a custom per-Pod metric (requests-per-second) via an adapter:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: web }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: web }
  minReplicas: 2
  maxReplicas: 20
  metrics:
  - type: Pods
    pods:
      metric: { name: http_requests_per_second }
      target: { type: AverageValue, averageValue: "100" }   # 100 rps per Pod
  - type: External                                          # external (e.g., queue)
    external:
      metric: { name: queue_messages }
      target: { type: AverageValue, averageValue: "30" }
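When an HPA lists several metrics, as in the example above, it computes a proposal per metric and acts on the largest. The sketch below shows that arithmetic with invented numbers; the function names are hypothetical, and tolerance and stabilization are omitted.

```python
import math


def replicas_for_pods_metric(per_pod_values, target_average):
    """type: Pods with AverageValue: total across Pods divided by the per-Pod target."""
    return math.ceil(sum(per_pod_values) / target_average)


def replicas_for_external_metric(total_value, target_average):
    """type: External with AverageValue: metric value divided by the per-Pod target."""
    return math.ceil(total_value / target_average)


def combine(proposals, min_replicas, max_replicas):
    """The HPA takes the largest proposal, then clamps to [min, max]."""
    return max(min_replicas, min(max_replicas, max(proposals)))


rps = replicas_for_pods_metric([150, 170, 160], target_average=100)   # invented readings
queue = replicas_for_external_metric(400, target_average=30)          # invented backlog
print(rps, queue, combine([rps, queue], min_replicas=2, max_replicas=20))
```

Taking the maximum means the workload is sized for whichever signal is under the most pressure, here the queue backlog rather than request rate.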
Exercises
- (Beginner) Name the three metrics APIs the HPA can use.
- (Beginner) What component typically implements the Custom Metrics API from Prometheus data?
- (Intermediate) Give an example where CPU is a poor scaling metric and a custom/external metric is better.
- (Interview) What is the distinction between a "custom" metric and an "external" metric for HPA purposes, and why does it matter which API a metric comes through? (Hint: tied to a K8s object vs. outside the cluster.)
Answers
- The Resource Metrics API (metrics.k8s.io), the Custom Metrics API (custom.metrics.k8s.io), and the External Metrics API (external.metrics.k8s.io).
- The Prometheus Adapter (it exposes Prometheus query results through the custom/external metrics APIs for the HPA to consume).
- A queue-worker or I/O-bound service: it may sit at low CPU while a huge backlog accumulates, so CPU won't trigger scaling when it's actually overwhelmed. Scaling on queue length (external metric) or requests/second/latency (custom metric) reflects real load and scales correctly. (Likewise a latency-sensitive web app scaling on p99 latency or RPS rather than CPU.)
- A custom metric is associated with a Kubernetes object (e.g., a per-Pod or per-Service metric like RPS) and comes through custom.metrics.k8s.io. An external metric is not tied to any cluster object; it originates outside the cluster (e.g., a cloud queue's length, a third-party KPI) and comes through external.metrics.k8s.io. It matters because the API determines how the value is identified and scoped (object-associated vs. free-standing), which adapter serves it, and how the HPA references it (type: Pods/Object vs. type: External). Choosing the correct API ensures the HPA can locate and interpret the metric properly.
9. Security
Kubernetes security is layered, and a breach at any layer can compromise the cluster. This chapter walks the request as it enters the API server — authentication (who are you?), authorization (what may you do?), admission control (does this comply with policy?) — then covers what a Pod is allowed to do at runtime (Pod security) and the supply chain that produced its image. The throughline is least privilege and defense in depth.
9.1 Authentication
Authentication is the first gate: the API server must establish who (or what) is making each request. This subchapter covers the mechanisms Kubernetes supports.
X.509 client certificates
Theory
The most fundamental authentication method is X.509 client certificates. The API server is configured to trust a Certificate Authority (CA); any client presenting a certificate signed by that CA is authenticated. The certificate's Common Name (CN) becomes the username, and its Organization (O) fields become the user's groups. This is how the cluster-admin kubeconfig from kubeadm works — it contains an admin client certificate.
This mechanism is robust and simple but has a significant operational drawback: certificates cannot be easily revoked (Kubernetes doesn't check CRLs/OCSP), so a leaked cert is valid until it expires. This makes long-lived client certs risky for human users; they're better suited to components and short-lived issuance. Certificates can be requested through the CertificateSigningRequest (CSR) API, where the cluster's CA signs an approved request. For human users at scale, OIDC (covered later) is usually preferred over per-user certs.
Example
# A client cert's subject determines identity:
# CN=alice, O=developers -> user "alice" in group "developers"
openssl x509 -in alice.crt -noout -subject
# subject=CN = alice, O = developers
# Request a cert via the CSR API (approver signs it):
kubectl get csr
kubectl certificate approve alice-csr
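The subject-to-identity mapping is mechanical, as this small sketch shows. identity_from_subject is a hypothetical helper and its parsing is naive: it assumes a comma-separated subject string like the openssl output above and does not handle escaped commas or multi-valued RDNs.

```python
def identity_from_subject(subject):
    """Map a subject such as 'CN = alice, O = developers' to (username, groups)."""
    username, groups = None, []
    for part in subject.split(","):
        key, _, value = part.partition("=")
        key, value = key.strip().upper(), value.strip()
        if key == "CN":
            username = value          # Common Name -> username
        elif key == "O":
            groups.append(value)      # each Organization -> one group
    return username, groups


print(identity_from_subject("CN = alice, O = developers"))
print(identity_from_subject("CN=bob,O=team-x,O=oncall"))
```

Because a certificate may carry several O fields, one certificate can place its holder in several groups at once.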
Exercises
- (Beginner) In an X.509 client certificate, which field becomes the username and which becomes the groups?
- (Beginner) What must the API server trust for certificate authentication to work?
- (Intermediate) Why are leaked client certificates particularly dangerous in Kubernetes?
- (Interview) Why are long-lived X.509 certs discouraged for human users, and what is generally preferred instead? (Hint: no revocation; OIDC/central identity.)
Answers
- The Common Name (CN) becomes the username; the Organization (O) field(s) become the user's groups.
- A Certificate Authority (CA) whose signature on client certs it will accept (configured as the client CA bundle).
- Kubernetes does not check certificate revocation (no CRL/OCSP), so a leaked certificate remains valid until it expires — there's no easy way to revoke it short of rotating the CA. A long-lived leaked cert thus grants persistent access.
- Because they can't be revoked and are hard to rotate per user, long-lived certs are a standing risk for humans. OIDC integration with a central identity provider is preferred: it issues short-lived tokens, supports central revocation/MFA/group management, and ties cluster access to existing corporate identity — far more manageable and secure for human users at scale.
Bearer tokens and ServiceAccount tokens
Theory
Another authentication method is bearer tokens: the client sends Authorization: Bearer <token> and the API server validates it. The most important kind is the ServiceAccount (SA) token — the identity mechanism for in-cluster workloads (as opposed to human users). Every Pod is associated with a ServiceAccount (the default one if unspecified), and a token for that SA is mounted into the Pod so the application can authenticate to the API server.
Modern Kubernetes uses bound, projected SA tokens: short-lived, audience-scoped JWTs auto-mounted via a projected volume and auto-rotated by the kubelet — a major security improvement over the old long-lived token Secrets (which never expired). A SA's identity is system:serviceaccount:<namespace>:<name> and it belongs to groups like system:serviceaccounts. ServiceAccounts are the in-cluster counterpart to user accounts and are what you grant RBAC permissions to for controllers, operators, and apps that call the API.
Example
apiVersion: v1
kind: ServiceAccount
metadata: { name: app-sa, namespace: prod }
---
spec:
  serviceAccountName: app-sa   # Pod runs as this identity
  containers: [ ... ]
# The projected token is mounted here inside the Pod:
cat /var/run/secrets/kubernetes.io/serviceaccount/token # a short-lived JWT
# Identity becomes: system:serviceaccount:prod:app-sa
# Mint a short-lived token for an SA on demand:
kubectl create token app-sa -n prod --duration=1h
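Since a projected ServiceAccount token is a JWT, its claims can be inspected by decoding the middle segment. The sketch below builds a fabricated, unsigned sample token so that it runs on its own; the issuer and audience values are placeholders, decode_payload performs no signature verification (inspection only), and all function names are illustrative.

```python
import base64
import json
import time


def b64url(data):
    """Base64url-encode bytes without padding, as JWT segments are encoded."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode()


def decode_payload(jwt):
    """Decode the JWT payload WITHOUT verifying the signature (inspection only)."""
    payload = jwt.split(".")[1]
    payload += "=" * (-len(payload) % 4)  # restore stripped padding
    return json.loads(base64.urlsafe_b64decode(payload))


def parse_sa_username(sub):
    """Split 'system:serviceaccount:<namespace>:<name>' into (namespace, name)."""
    prefix, kind, namespace, name = sub.split(":", 3)
    if (prefix, kind) != ("system", "serviceaccount"):
        raise ValueError("not a ServiceAccount subject")
    return namespace, name


# Fabricated sample token; issuer and audience are placeholder values.
claims = {
    "iss": "https://issuer.example",
    "aud": ["https://audience.example"],
    "sub": "system:serviceaccount:prod:app-sa",
    "exp": int(time.time()) + 3600,
}
sample_token = ".".join([
    b64url(json.dumps({"alg": "RS256", "typ": "JWT"}).encode()),
    b64url(json.dumps(claims).encode()),
    "signature-placeholder",
])

decoded = decode_payload(sample_token)
print(parse_sa_username(decoded["sub"]), decoded["exp"] - int(time.time()) <= 3600)
```

The exp and aud claims are what distinguish bound tokens from the legacy Secrets-based ones, which carried neither an expiry nor an audience.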
Exercises
- (Beginner) What identity do in-cluster workloads use to authenticate to the API server?
- (Beginner) What is the format of a ServiceAccount's username?
- (Intermediate) How do modern bound/projected SA tokens improve on the legacy SA token Secrets?
- (Interview) Why should you avoid using the default ServiceAccount with broad permissions for application Pods? (Hint: least privilege, per-workload identity, blast radius.)
Answers
- A ServiceAccount (its bearer token), mounted into the Pod.
- system:serviceaccount:<namespace>:<name>.
- Bound/projected tokens are short-lived (have an expiry), audience-scoped, and automatically rotated by the kubelet, and they're bound to the Pod's lifetime. Legacy token Secrets were long-lived, non-expiring, stored in etcd, and not audience-bound, so a leak granted indefinite, broad access. The bound model drastically reduces credential lifetime and blast radius.
- Every Pod gets an identity; if you use the default SA and grant it permissions, all Pods in the namespace inherit them, violating least privilege and widening the blast radius if any Pod is compromised. Instead, create a dedicated ServiceAccount per workload with only the permissions it needs (and disable token automounting where the API isn't used), so each workload's access is minimal and isolated.
OIDC integration
Theory
For human users, the recommended approach is OIDC (OpenID Connect) integration. The API server is configured to trust an external identity provider (IdP) such as Google, Azure AD/Entra, Okta, Keycloak, or Dex. Users authenticate with the IdP (with SSO, MFA, etc.) and obtain an ID token (JWT); they present this token to the API server, which validates its signature against the IdP's published signing keys, checks the issuer, audience, and expiry, and extracts the username and groups from configured claims.
This is superior to per-user certificates because: tokens are short-lived (limiting leak impact), identity and group membership are managed centrally in the IdP (add/remove users in one place), and you get MFA and corporate SSO for free. RBAC rules then reference the usernames/groups from the IdP claims. OIDC is the standard for enterprise human access; it requires no in-cluster user objects — the IdP is the source of truth for identity, while Kubernetes handles authorization.
Example
# API server flags (conceptual):
--oidc-issuer-url=https://accounts.example.com
--oidc-client-id=kubernetes
--oidc-username-claim=email
--oidc-groups-claim=groups
# kubectl uses an OIDC exec/auth plugin to obtain & refresh the ID token,
# then sends it as a bearer token. RBAC then matches on the claims:
# user: alice@example.com groups: [ platform-admins ]
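Once the token's signature has been verified, turning claims into a Kubernetes identity is a handful of checks. The sketch below covers only that second step and assumes verification has already happened; user_from_claims and its parameters are hypothetical names that loosely mirror the flags above, and the claim values are invented.

```python
import time


def user_from_claims(claims, issuer, client_id,
                     username_claim="email", groups_claim="groups", now=None):
    """Check issuer, audience, and expiry, then map claims to username and groups."""
    now = time.time() if now is None else now
    if claims.get("iss") != issuer:
        raise ValueError("unexpected issuer")
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if client_id not in audiences:
        raise ValueError("token not issued for this client")
    if claims.get("exp", 0) <= now:
        raise ValueError("token expired")
    return {
        "username": claims[username_claim],
        "groups": list(claims.get(groups_claim, [])),
    }


# Invented claims for illustration.
id_token_claims = {
    "iss": "https://accounts.example.com",
    "aud": "kubernetes",
    "exp": 2000,
    "email": "alice@example.com",
    "groups": ["platform-admins"],
}
print(user_from_claims(id_token_claims, "https://accounts.example.com", "kubernetes", now=1000))
```

RBAC bindings then reference the resulting username or, more commonly, the group names, so access changes are made in the IdP rather than in the cluster.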
Exercises
- (Beginner) Who is OIDC integration primarily for — human users or in-cluster workloads?
- (Beginner) What does the API server extract from a validated OIDC token to make authorization decisions?
- (Intermediate) Name two advantages of OIDC over per-user X.509 certificates.
- (Interview) In an OIDC setup, where is the source of truth for user identity and group membership, and how does that simplify access management at scale? (Hint: central IdP, no per-user K8s objects.)
Answers
- Human users.
- The username and groups (from the configured token claims, e.g., groups), which RBAC then uses.
- Any two: tokens are short-lived (limiting leak impact, unlike non-revocable certs); identity/groups are managed centrally in the IdP (add/remove/disable users in one place); built-in SSO and MFA; easier auditing. All of this comes without minting and distributing per-user certs.
- The external identity provider (IdP) is the source of truth: users and groups live there, not as Kubernetes objects. To grant or revoke access at scale you manage users/groups in the IdP and write RBAC rules against group claims (e.g., bind a Role to group platform-admins). Onboarding/offboarding and MFA are handled centrally, so the cluster needs no per-user provisioning and access stays consistent with corporate identity.
Webhook authentication
Theory
When the built-in methods don't fit, webhook token authentication lets you delegate authentication to an external service. The API server is configured with a webhook endpoint; when it receives a bearer token it doesn't otherwise recognize, it sends a TokenReview request to the webhook, which validates the token however it likes and returns whether it's authenticated plus the username/groups.
This provides maximum flexibility — you can integrate a custom or proprietary identity system, a cloud provider's IAM (this is how managed clusters like EKS map AWS IAM identities to Kubernetes users), or any token format. The trade-off is that the webhook is on the critical path for authentication: if it's slow or down, requests using that method fail, so it must be highly available and fast. Webhook auth is a power-user/integration feature; most clusters use certs, SA tokens, and OIDC.
Example
Client --(Bearer <opaque-token>)--> API server
API server --(TokenReview)--> external webhook (validates token)
webhook --> { authenticated: true, user: "bob", groups: ["team-x"] }
API server proceeds to authorization as user "bob"
# TokenReview response shape the webhook returns:
apiVersion: authentication.k8s.io/v1
kind: TokenReview
status:
  authenticated: true
  user:
    username: "bob"
    groups: ["team-x"]
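The exchange above can be sketched from the webhook's side. This is a minimal illustration, not a production authenticator: the KNOWN_TOKENS table and the review_token function are hypothetical names, and a real webhook would sit behind an HTTPS endpoint and consult an actual identity system. The only parts taken from the Kubernetes API are the TokenReview shape, with the token arriving in spec.token and the verdict returned in status.

```python
# Hypothetical token table standing in for a real identity backend.
KNOWN_TOKENS = {
    "opaque-token-123": {"username": "bob", "groups": ["team-x"]},
}


def review_token(token_review: dict) -> dict:
    """Answer a TokenReview request by filling in its status."""
    token = token_review.get("spec", {}).get("token", "")
    identity = KNOWN_TOKENS.get(token)
    status = {"authenticated": identity is not None}
    if identity is not None:
        status["user"] = dict(identity)  # username and groups, later used by RBAC
    return {
        "apiVersion": "authentication.k8s.io/v1",
        "kind": "TokenReview",
        "status": status,
    }


request = {
    "apiVersion": "authentication.k8s.io/v1",
    "kind": "TokenReview",
    "spec": {"token": "opaque-token-123"},
}
response = review_token(request)
print(response["status"])
```

An unknown token yields a status with authenticated set to false and no user, which the API server treats as a failed authentication for this method.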
Exercises
- (Beginner) What does webhook authentication delegate, and to what?
- (Beginner) What API object does the API server send to the webhook to validate a token?
- (Intermediate) Give a real-world example of webhook authentication in managed Kubernetes.
- (Interview) What operational risk does webhook authentication introduce, and how do you mitigate it? (Hint: critical-path dependency; HA, low latency, caching.)
Answers
- It delegates token authentication to an external service (an HTTP webhook) that validates the token and returns the identity.
- A TokenReview (the webhook responds with authenticated plus username/groups).
- EKS maps AWS IAM identities to Kubernetes users via a webhook/authenticator (e.g., the AWS IAM Authenticator), so IAM users/roles authenticate to the cluster. (Other clouds similarly bridge their IAM to Kubernetes auth.)
- The webhook becomes a critical-path dependency: if it's slow or unavailable, authentications relying on it fail, potentially locking users out. Mitigate by running the webhook highly available and low-latency, setting sensible timeouts, enabling the API server's authentication token cache (TTL) to reduce calls, and ensuring an alternative admin access path (e.g., a cert-based break-glass credential) exists in case the webhook is down.
9.2 Authorization
Once identity is established, authorization decides what that identity may do. RBAC is the dominant mechanism. This subchapter covers it and the principle of least privilege.
RBAC overview
Theory
RBAC (Role-Based Access Control) is the standard authorization mode in Kubernetes. It answers: "may this identity perform this verb (get, list, create, update, delete, watch...) on this resource (pods, services, secrets...) in this scope (namespace or cluster)?" RBAC is additive and deny-by-default: with no rule granting access, access is denied; you can only grant, never explicitly deny (denial is the absence of a grant).
RBAC is built from four object types, in two pairs:
- Role / ClusterRole: define a set of permissions (rules). Role is namespaced; ClusterRole is cluster-wide.
- RoleBinding / ClusterRoleBinding: bind a Role/ClusterRole to subjects (users, groups, ServiceAccounts), granting them those permissions.
The mental model: a Role says what can be done; a Binding says who can do it. RBAC is evaluated on every request after authentication, and is the primary tool for enforcing least privilege in a cluster.
Example
Permissions model:
(subject: user/group/SA) --bound by--> RoleBinding --references--> Role
                                                                    |
                                                        rules: verbs on resources
kubectl auth can-i list pods -n dev # check your own access
kubectl auth can-i '*' '*' --all-namespaces # am I cluster-admin?
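The additive, deny-by-default behavior can be sketched as a set union. This is a toy model under stated assumptions: it ignores namespaces and API groups, and the names effective_permissions and can_i, along with the sample roles and subjects, are made up for illustration rather than taken from any Kubernetes library.

```python
def effective_permissions(roles, bindings, subject):
    """Union of (verb, resource) pairs granted to a subject across all its bindings."""
    granted = set()
    for bound_subject, role_name in bindings:
        if bound_subject == subject:
            granted |= roles[role_name]  # grants only ever add, never subtract
    return granted


def can_i(roles, bindings, subject, verb, resource):
    """Deny by default: allowed only if some binding granted the pair."""
    return (verb, resource) in effective_permissions(roles, bindings, subject)


# Roles say WHAT can be done; bindings say WHO may do it.
roles = {
    "pod-reader": {("get", "pods"), ("list", "pods")},
    "config-editor": {("get", "configmaps"), ("update", "configmaps")},
}
bindings = [
    ("alice", "pod-reader"),
    ("alice", "config-editor"),
    ("bob", "pod-reader"),
]

print(can_i(roles, bindings, "alice", "update", "configmaps"))
print(can_i(roles, bindings, "bob", "update", "configmaps"))
```

There is no code path that removes a permission: the only way to deny something in this model is to not grant it, which mirrors how RBAC behaves.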
Exercises
- (Beginner) What three things does an RBAC rule combine to define a permission?
- (Beginner) Is RBAC deny-by-default or allow-by-default? Can you write an explicit deny rule?
- (Intermediate) What is the difference in responsibility between a Role and a RoleBinding?
- (Interview) Explain how RBAC's additive, deny-by-default model supports least privilege and why the absence of explicit deny rules is acceptable. (Hint: grant only what's needed; everything else is implicitly denied.)
Answers
- Verbs (actions like get/list/create/delete), resources (pods, secrets, etc.), and the API groups they belong to — evaluated within a namespace or cluster scope.
- Deny-by-default. You cannot write explicit deny rules — RBAC only grants; anything not granted is denied.
- A Role (or ClusterRole) defines a set of permissions (what actions on what resources). A RoleBinding (or ClusterRoleBinding) assigns that Role to subjects (who gets those permissions). One is the capability definition; the other is the grant to identities.
- Because access is denied unless a rule grants it, you start from zero and add only the specific permissions each identity needs — the essence of least privilege. There's no need for explicit deny rules: anything you don't grant is already denied, so security is enforced by carefully limiting grants rather than enumerating prohibitions. This makes policies simpler to reason about (you audit what's allowed) and fails safe (a forgotten grant means no access, not accidental access).
Roles and ClusterRoles
Theory
A Role is a namespaced set of permission rules — it can only grant access to resources within its own namespace. A ClusterRole is the cluster-scoped equivalent, and it serves several purposes a Role cannot:
- Grant access to cluster-scoped resources (nodes, PersistentVolumes, namespaces themselves).
- Grant access to non-resource URLs (e.g., /healthz, /metrics).
- Be reused across namespaces: a single ClusterRole can be bound (via RoleBindings) in many namespaces, avoiding duplicating the same Role everywhere.
Each rule lists apiGroups, resources, and verbs (and optionally resourceNames to restrict to specific named objects). Kubernetes ships default ClusterRoles like view, edit, admin, and cluster-admin — useful building blocks. Choosing Role vs. ClusterRole comes down to scope: namespace-local permissions use Roles; anything cluster-wide or reused broadly uses ClusterRoles.
Example
# Namespaced Role: read pods in the "dev" namespace only
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { namespace: dev, name: pod-reader }
rules:
- apiGroups: [""]                 # "" = core API group
  resources: ["pods"]
  verbs: ["get", "list", "watch"]
---
# ClusterRole: read nodes (a cluster-scoped resource)
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata: { name: node-reader }
rules:
- apiGroups: [""]
  resources: ["nodes"]
  verbs: ["get", "list", "watch"]
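How a single rule is matched against a request can be sketched as follows. This is a simplified illustration, not the API server's implementation: rule_matches is a hypothetical helper, it handles only the "*" wildcard, and it ignores subresources and non-resource URLs.

```python
def rule_matches(rule: dict, api_group: str, resource: str, verb: str, name: str = "") -> bool:
    """Return True if one RBAC-style rule covers the request."""

    def covers(allowed, value):
        return "*" in allowed or value in allowed

    if not covers(rule.get("apiGroups", []), api_group):
        return False
    if not covers(rule.get("resources", []), resource):
        return False
    if not covers(rule.get("verbs", []), verb):
        return False
    names = rule.get("resourceNames", [])
    # An empty resourceNames list means "any object of this type".
    return not names or name in names


pod_reader = {"apiGroups": [""], "resources": ["pods"], "verbs": ["get", "list", "watch"]}
one_config = {
    "apiGroups": [""],
    "resources": ["configmaps"],
    "resourceNames": ["app-config"],
    "verbs": ["get"],
}

print(rule_matches(pod_reader, "", "pods", "get", "web-0"))
print(rule_matches(one_config, "", "configmaps", "get", "other-config"))
```

All three of apiGroups, resources, and verbs must match; resourceNames then narrows the rule further to specific named objects.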
Exercises
- (Beginner) What is the scope difference between a Role and a ClusterRole?
- (Beginner) Name two things a ClusterRole can grant access to that a Role cannot.
- (Intermediate) Why might you define a permission set as a ClusterRole even if you'll use it within namespaces?
- (Interview) Name the default ClusterRoles Kubernetes ships and describe when you'd bind view vs. edit vs. admin. (Hint: read-only, read-write within ns, manage ns incl. RBAC.)
Answers
- A Role is namespaced: it grants permissions only within its own namespace. A ClusterRole is cluster-scoped: it can cover cluster-scoped resources and non-resource URLs, and it can be bound cluster-wide or reused namespace by namespace.
- Any two: cluster-scoped resources (nodes, PersistentVolumes, namespaces), non-resource URLs (/healthz, /metrics), and reuse of one role definition across multiple namespaces.
- To define the permission set once and reuse it: a single ClusterRole can be bound via a RoleBinding in each namespace, avoiding copying an identical Role into every namespace and keeping the definition DRY and consistent.
- Defaults: view, edit, admin, cluster-admin. Bind view for read-only access to most namespaced resources (no secrets, no write). Bind edit for read-write on most namespaced resources (can modify workloads/config but not manage RBAC). Bind admin for full control within a namespace including managing Roles/RoleBindings (but not cluster-scoped resources). (cluster-admin is unrestricted superuser; use sparingly.)
RoleBindings and ClusterRoleBindings
Theory
A Role/ClusterRole alone grants nothing until bound to subjects. The binding objects:
- RoleBinding: grants the permissions of a Role or a ClusterRole to subjects within a specific namespace. (Yes — a RoleBinding can reference a ClusterRole, in which case the ClusterRole's permissions are limited to that namespace. This is the reuse pattern.)
- ClusterRoleBinding: grants a ClusterRole's permissions to subjects across the entire cluster.
Subjects are users, groups, or ServiceAccounts. The crucial scoping rule: the binding type determines the scope of the grant, not the role type. A ClusterRole bound via a RoleBinding only applies in that namespace; the same ClusterRole bound via a ClusterRoleBinding applies everywhere. A frequent over-privilege bug is using a ClusterRoleBinding where a namespaced RoleBinding was intended, which silently turns a one-namespace grant into a cluster-wide one.
Example
# Bind the ClusterRole "node-reader" cluster-wide to a group:
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata: { name: read-nodes }
subjects:
- { kind: Group, name: platform-team, apiGroup: rbac.authorization.k8s.io }
roleRef: { kind: ClusterRole, name: node-reader, apiGroup: rbac.authorization.k8s.io }
---
# Bind a ClusterRole but scope it to ONE namespace via a RoleBinding:
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata: { namespace: dev, name: dev-edit }
subjects:
- { kind: ServiceAccount, name: ci, namespace: dev }
roleRef: { kind: ClusterRole, name: edit, apiGroup: rbac.authorization.k8s.io }
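The scoping rule can be made concrete with a small sketch. The helpers grant_scope and applies_in are hypothetical names for illustration; note that the kind of the referenced role never appears as an input, because only the binding decides where the grant applies.

```python
def grant_scope(binding_kind: str, binding_namespace: str = "") -> str:
    """Return where a binding's permissions apply: a namespace name, or '*' for cluster-wide."""
    if binding_kind == "ClusterRoleBinding":
        return "*"
    if binding_kind == "RoleBinding":
        return binding_namespace
    raise ValueError(f"unknown binding kind: {binding_kind}")


def applies_in(binding_kind: str, binding_namespace: str, request_namespace: str) -> bool:
    """True if a grant made through this binding covers the request's namespace."""
    scope = grant_scope(binding_kind, binding_namespace)
    return scope == "*" or scope == request_namespace


# The "edit" ClusterRole bound through a RoleBinding in "dev":
print(applies_in("RoleBinding", "dev", "dev"))
print(applies_in("RoleBinding", "dev", "prod"))
# The same ClusterRole bound through a ClusterRoleBinding:
print(applies_in("ClusterRoleBinding", "", "prod"))
```

Swapping RoleBinding for ClusterRoleBinding is the only change needed to turn a single-namespace grant into a cluster-wide one, which is why that mistake is easy to make and worth reviewing for.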
Exercises
- (Beginner) What are the three kinds of subjects a binding can reference?
- (Beginner) Does a RoleBinding determine scope by the Role's type or the binding's type?
- (Intermediate) Explain what happens when a RoleBinding references a ClusterRole.
- (Interview) A team intended to give a ServiceAccount edit access in one namespace but accidentally granted it cluster-wide. What was the likely mistake, and how is it fixed? (Hint: ClusterRoleBinding vs. RoleBinding.)
Answers
- Users, groups, and ServiceAccounts.
- By the binding's type/scope: a RoleBinding grants within its namespace regardless of whether it references a Role or a ClusterRole.
- The ClusterRole's permission set is applied, but scoped to the RoleBinding's namespace only. This lets you reuse a single ClusterRole definition across namespaces by binding it with a RoleBinding in each, without granting cluster-wide access.
- They likely used a ClusterRoleBinding (which grants cluster-wide) instead of a RoleBinding in the target namespace. Fix it by deleting the ClusterRoleBinding and creating a RoleBinding in the intended namespace that references the same role (e.g., the edit ClusterRole) and the ServiceAccount subject, confining the grant to that one namespace.
ABAC and Webhook authorization
Theory
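RBAC is the default, but Kubernetes supports other authorization modes, and several can be enabled at once. Enabled authorizers are consulted in order: the first one to return an explicit allow or deny decides the request, an authorizer with no opinion passes it to the next, and a request that no authorizer allows is denied. The modes: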
- ABAC (Attribute-Based Access Control): an older mode where access rules are defined in a policy file of attributes (user, resource, namespace, verb) on the API server's disk. Changing policy requires editing the file and restarting the API server, which makes it inflexible and largely deprecated in favor of RBAC.
- Webhook authorization: delegates the authorization decision to an external service via a SubjectAccessReview, similar to webhook authentication but for "may they do this?" This enables centralized or custom policy engines and is how some clouds integrate their IAM for authorization.
- Node authorizer: a special-purpose mode that authorizes kubelets to access only the resources their own Pods need.
In practice, clusters run RBAC (plus the Node authorizer), and reach for webhook authorization when external policy integration is required. ABAC is essentially legacy.
Example
--authorization-mode=Node,RBAC[,Webhook]   # tried in order; first explicit allow or deny wins, no opinion falls through
# Webhook authorization: API server asks an external service via SubjectAccessReview
apiVersion: authorization.k8s.io/v1
kind: SubjectAccessReview
spec:
  user: "alice"
  resourceAttributes: { namespace: "prod", verb: "delete", resource: "pods" }
# external authorizer responds: { status: { allowed: false } }
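The way chained authorizers reach a final decision can be sketched as a loop over three possible answers: allow, deny, or no opinion. This is an illustration only: the Decision enum, the rbac and policy_webhook functions, and the sample grant are all hypothetical stand-ins, not Kubernetes code.

```python
from enum import Enum


class Decision(Enum):
    ALLOW = "allow"
    DENY = "deny"
    NO_OPINION = "no opinion"


def authorize(authorizers, request):
    """Consult authorizers in order; the first explicit allow or deny decides."""
    for authorizer in authorizers:
        decision = authorizer(request)
        if decision is not Decision.NO_OPINION:
            return decision
    return Decision.DENY  # nobody allowed the request, so it is denied


def rbac(request):
    # RBAC only grants; when no rule matches it has no opinion rather than denying.
    granted = {("alice", "get", "pods")}
    return Decision.ALLOW if request in granted else Decision.NO_OPINION


def policy_webhook(request):
    # Hypothetical external policy: nobody may delete pods.
    _user, verb, resource = request
    if verb == "delete" and resource == "pods":
        return Decision.DENY
    return Decision.NO_OPINION


chain = [rbac, policy_webhook]
print(authorize(chain, ("alice", "get", "pods")))
print(authorize(chain, ("alice", "delete", "pods")))
```

Because the first explicit answer wins, the order in --authorization-mode matters once an authorizer that can issue explicit denies is in the chain.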
Exercises
- (Beginner) Why is ABAC considered inflexible compared to RBAC?
- (Beginner) What review object does webhook authorization use?
- (Intermediate) When multiple authorization modes are enabled, how is the final allow/deny decided?
- (Interview) Why might an organization add webhook authorization alongside RBAC, and what is the cost? (Hint: centralized/custom policy; external critical-path dependency.)
Answers
- ABAC rules live in a policy file on the API server's disk; changing them requires editing the file and restarting the API server, with no dynamic, API-driven management. RBAC, by contrast, is managed as live API objects (kubectl apply) without restarts.
- A SubjectAccessReview (the external authorizer responds allowed/denied).
- The enabled authorizers are consulted in order. The first one to return an explicit allow or deny decides the request and the rest are skipped; an authorizer with no opinion passes the request to the next. If none of them allows it, the request is denied. Because RBAC never issues explicit denies, in a typical Node,RBAC setup the effective result is the union of the allows.
- To integrate centralized or custom authorization policy (e.g., a company-wide policy engine or cloud IAM) that RBAC alone can't express, enabling consistent cross-system access decisions. The cost: the webhook is on the authorization critical path — added latency per request and a dependency that, if slow or down, can disrupt access — so it must be highly available and fast, typically with caching.
Least privilege principle
Theory
The principle of least privilege (PoLP) is the guiding philosophy for all the above: every identity (user, ServiceAccount, controller) should have only the permissions it needs to do its job — no more. In Kubernetes this means: prefer narrowly-scoped Roles over broad ClusterRoles, prefer RoleBindings (namespaced) over ClusterRoleBindings, grant specific verbs/resources rather than wildcards, and give each workload its own ServiceAccount rather than sharing privileged ones.
Why it matters: Kubernetes is a high-value target, and over-privileged identities massively increase the blast radius of any compromise or mistake. A leaked token for a least-privilege SA can do little; a leaked cluster-admin token owns everything. Practical habits: avoid cluster-admin except for true administrators; never grant secrets read broadly; audit RBAC regularly (tools like kubectl auth can-i --list, rbac-tool, kubeaudit); and disable SA token automounting for Pods that don't call the API. Least privilege is the single highest-leverage security practice in a cluster.
Example
# Least privilege: a dedicated SA that may only read one ConfigMap, nothing else
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata: { namespace: app, name: read-app-config }
rules:
- apiGroups: [""]
  resources: ["configmaps"]
  resourceNames: ["app-config"]   # restrict to a SPECIFIC object
  verbs: ["get"]
# Don't auto-mount a token if the Pod never calls the API (Pod spec fragment):
spec:
  serviceAccountName: app-sa
  automountServiceAccountToken: false
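Part of a regular RBAC audit can be automated by flagging rules that are broader than least privilege suggests. The sketch below is a hypothetical lint helper, not one of the tools named above; overbroad_findings and its two checks (wildcards, and unrestricted read access to secrets) are illustrative choices and far from a complete audit.

```python
def overbroad_findings(rule: dict) -> list:
    """Return human-readable reasons a single RBAC rule looks over-privileged."""
    findings = []
    for field in ("apiGroups", "resources", "verbs"):
        if "*" in rule.get(field, []):
            findings.append(f"wildcard in {field}")
    read_verbs = {"get", "list", "watch"} & set(rule.get("verbs", []))
    if "secrets" in rule.get("resources", []) and read_verbs and not rule.get("resourceNames"):
        findings.append("reads all secrets in scope")
    return findings


narrow = {
    "apiGroups": [""],
    "resources": ["configmaps"],
    "resourceNames": ["app-config"],
    "verbs": ["get"],
}
superuser = {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]}
secret_reader = {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get", "list"]}

for rule in (narrow, superuser, secret_reader):
    print(overbroad_findings(rule))
```

The narrow rule from the example produces no findings, while the wildcard and secret-reading rules are flagged for review.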
Exercises
- (Beginner) State the principle of least privilege in one sentence.
- (Beginner) Name two concrete RBAC practices that follow least privilege.
- (Intermediate) How does resourceNames help enforce least privilege?
- (Interview) Explain "blast radius" in the context of RBAC and why per-workload ServiceAccounts with minimal permissions reduce it. (Hint: compromise of one identity should grant minimal access.)
Answers
- Grant every identity only the minimum permissions required to perform its function, and nothing more.
- Any two: use namespaced Roles + RoleBindings instead of cluster-wide grants; specify exact verbs/resources instead of wildcards; give each workload its own ServiceAccount; use resourceNames to restrict to specific objects; avoid cluster-admin; disable token automounting when unused; audit RBAC regularly.
- resourceNames restricts a rule to specific named object(s) rather than all objects of a type, e.g., allow get on only the app-config ConfigMap instead of all ConfigMaps. This narrows access to exactly what's needed.
- "Blast radius" is the extent of damage possible if an identity is compromised. A cluster-admin token, if leaked, lets an attacker do anything cluster-wide (read all secrets, run privileged Pods, take over nodes): a huge blast radius. A per-workload ServiceAccount scoped to only the few resources/verbs that workload needs grants an attacker almost nothing if its token leaks: a tiny blast radius. Minimizing each identity's permissions, and isolating them per workload, ensures any single compromise is contained, which is the core defensive value of least privilege.
9.3 Admission Control
After authn and authz, admission controllers enforce policy on what may be created or how. This subchapter covers built-in controllers, webhooks, and policy engines.
Admission controller types
Theory
Admission controllers intercept requests to the API server after authentication and authorization but before persistence (Chapter 2 introduced the pipeline). There are two functional types, run in two phases:
- Mutating admission controllers can modify the object (set defaults, inject sidecars, add labels). They run first.
- Validating admission controllers can only accept or reject (enforce policy), not modify. They run second, after mutation.
Kubernetes ships many built-in (compiled-in) admission controllers enabled by default — e.g., NamespaceLifecycle, LimitRanger, ResourceQuota, ServiceAccount, DefaultStorageClass, PodSecurity, MutatingAdmissionWebhook, ValidatingAdmissionWebhook. Beyond these, the two webhook controllers are the extension points that let you plug in custom mutating/validating logic (next topics). Admission control is where organizational policy ("no privileged Pods," "all images from our registry," "every Pod must have resource limits") is enforced at the cluster's front door.
Example
Request (after authz)
|
v MUTATING phase (can change the object)
[ MutatingAdmissionWebhook, DefaultStorageClass, ServiceAccount, ... ]
|
v VALIDATING phase (accept/reject only)
[ ValidatingAdmissionWebhook, ResourceQuota, PodSecurity, ... ]
|
v persist to etcd (only if not rejected)
# See which admission plugins are enabled (on the API server):
kube-apiserver -h | grep enable-admission-plugins
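The two-phase ordering can be sketched as a small pipeline. This is a toy model under stated assumptions: admit, add_default_sa, require_service_account, and forbid_privileged are hypothetical functions standing in for real admission plugins, and objects are plain dictionaries.

```python
import copy


def admit(obj, mutators, validators):
    """Run the mutating phase, then the validating phase; return (allowed, final_obj, reason)."""
    current = copy.deepcopy(obj)  # never modify the caller's object
    for mutate in mutators:
        current = mutate(current)
    for validate in validators:
        ok, reason = validate(current)  # validators see the post-mutation object
        if not ok:
            return False, None, reason
    return True, current, ""


def add_default_sa(pod):
    pod.setdefault("spec", {}).setdefault("serviceAccountName", "default")
    return pod


def require_service_account(pod):
    if not pod.get("spec", {}).get("serviceAccountName"):
        return False, "serviceAccountName is required"
    return True, ""


def forbid_privileged(pod):
    for container in pod.get("spec", {}).get("containers", []):
        if container.get("securityContext", {}).get("privileged"):
            return False, f"container {container['name']} is privileged"
    return True, ""


pod = {"spec": {"containers": [{"name": "app", "image": "nginx:1.27"}]}}
print(admit(pod, [add_default_sa], [require_service_account, forbid_privileged]))
```

The Pod passes require_service_account only because the mutating phase filled in the default first; with the mutator removed, the same Pod is rejected.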
Exercises
- (Beginner) What are the two functional types of admission controllers?
- (Beginner) Which phase runs first, and why?
- (Intermediate) Name three built-in admission controllers and what each enforces/does.
- (Interview) Admission control is described as enforcing organizational policy "at the front door." Give two example policies and say whether each needs a mutating or validating controller. (Hint: inject = mutate; forbid = validate.)
Answers
- Mutating (can modify the object) and validating (can only accept/reject).
- The mutating phase runs first, so that validating controllers evaluate the final, post-mutation object (e.g., after defaults/sidecars are injected).
- Examples: ResourceQuota (enforces namespace aggregate quotas), LimitRanger (applies default/min/max resource values), PodSecurity (enforces Pod Security Standards), ServiceAccount (assigns/automounts SA tokens), DefaultStorageClass (sets default class on PVCs), NamespaceLifecycle (prevents creating objects in terminating namespaces).
- Examples: "Inject a logging sidecar into every Pod" → mutating (modifies the object). "Reject any Pod that runs as privileged" → validating (rejects, doesn't modify). "Set a default resource request if missing" → mutating. "Require all images come from our registry" → validating (reject if not). Generally, adding/changing fields = mutating; allowing/denying based on rules = validating.
ValidatingAdmissionWebhook
Theory
A ValidatingAdmissionWebhook lets you enforce custom policy by having the API server call your external webhook to approve or reject a request. You register a ValidatingWebhookConfiguration specifying which operations/resources to intercept and the webhook's endpoint; the API server sends an AdmissionReview, and your service responds allowed: true/false (with an optional message). The webhook cannot modify the object — only accept or reject it.
Use cases: enforce naming conventions, require certain labels/annotations, forbid :latest image tags, block privileged Pods, require resource limits — any rule expressible as "is this object acceptable?" Critical operational concerns: failurePolicy (Fail = reject if the webhook is unreachable, Ignore = allow), scoping via rules/namespaceSelector/objectSelector to avoid intercepting everything (especially kube-system), and timeoutSeconds. A newer, in-process alternative — ValidatingAdmissionPolicy using CEL expressions — avoids running a webhook for many cases.
Example
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata: { name: no-latest-tag }
webhooks:
- name: images.example.com
  rules:
  - { operations: ["CREATE","UPDATE"], apiGroups: [""], apiVersions: ["v1"], resources: ["pods"] }
  clientConfig:
    service: { name: policy-svc, namespace: policy, path: /validate }
  failurePolicy: Fail            # reject if webhook is down (fail closed)
  namespaceSelector:
    matchExpressions:
    - { key: kubernetes.io/metadata.name, operator: NotIn, values: ["kube-system"] }
  admissionReviewVersions: ["v1"]
  sideEffects: None
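What the service behind /validate might do with each AdmissionReview can be sketched as a pure function. This is an assumption-laden illustration, not the implementation of any real policy service: review and uses_latest are hypothetical names, the image parsing is deliberately simple, and the HTTPS server that would wrap this function is omitted. The response fields (uid, allowed, status.message) follow the admission.k8s.io/v1 AdmissionReview shape.

```python
def uses_latest(image: str) -> bool:
    """True if the image has no tag or the tag 'latest'; digest-pinned images pass."""
    if "@" in image:
        return False
    last_segment = image.rsplit("/", 1)[-1]  # skip any registry host:port prefix
    if ":" not in last_segment:
        return True  # an untagged image resolves to :latest
    return last_segment.rsplit(":", 1)[1] == "latest"


def review(admission_review: dict) -> dict:
    """Build the AdmissionReview response for a Pod create/update request."""
    request = admission_review["request"]
    spec = request["object"].get("spec", {})
    containers = spec.get("containers", []) + spec.get("initContainers", [])
    bad = [c["image"] for c in containers if uses_latest(c["image"])]
    response = {"uid": request["uid"], "allowed": not bad}  # uid must echo the request
    if bad:
        response["status"] = {"message": "image tag must not be latest: " + ", ".join(bad)}
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


sample = {
    "request": {
        "uid": "hypothetical-uid-1",
        "object": {"spec": {"containers": [{"name": "app", "image": "nginx:latest"}]}},
    }
}
print(review(sample)["response"])
```

The function never changes the object it receives; it only reports whether the object is acceptable, which is exactly the limit of a validating webhook.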
Exercises
- (Beginner) Can a ValidatingAdmissionWebhook modify the object it reviews?
- (Beginner) What does the API server send to the webhook, and what does the webhook return?
- (Intermediate) What is the effect of failurePolicy: Fail versus Ignore when the webhook is unreachable?
- (Interview) What in-process alternative to validating webhooks exists, and what advantage does it offer? (Hint: ValidatingAdmissionPolicy with CEL; no external service.)
Answers
- No — it can only accept or reject; modification is the job of mutating webhooks.
- The API server sends an AdmissionReview (containing the object and request info); the webhook returns an AdmissionReview response with allowed: true/false and an optional status message.
- With Fail, if the webhook can't be reached, the request is rejected (fail closed), which is safer for policy enforcement but risks blocking operations if the webhook is down. With Ignore, the request is allowed (fail open), which avoids outages but lets requests bypass the policy during webhook downtime.
- ValidatingAdmissionPolicy (with CEL expressions) evaluates policy in-process in the API server, with no external webhook service to deploy, secure, scale, or keep available. It removes the critical-path network dependency and operational burden of a webhook for many validation use cases, while offering declarative CEL-based rules.
MutatingAdmissionWebhook
Theory
A MutatingAdmissionWebhook is the same mechanism but for modifying incoming objects before they're stored. Your webhook receives the AdmissionReview and returns a JSON Patch describing changes to apply — e.g., inject a sidecar container, add default labels/annotations, set a default securityContext, or mount a volume. This is the machinery behind service meshes (Istio/Linkerd inject their proxy sidecar via a mutating webhook) and many platform features.
Because it runs before validation, the mutated object is what validation (and the rest of admission) sees. Important nuances: mutating webhooks should be idempotent (Kubernetes may call them more than once, and re-applying must be safe), order among multiple mutating webhooks can matter, and the same failurePolicy/scoping/timeout concerns apply. Mutating webhooks are powerful but riskier than validating ones — a buggy mutation can corrupt every object it touches — so they're used deliberately and tested carefully.
Example
apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
metadata: { name: sidecar-injector }
webhooks:
- name: inject.example.com
  rules:
  - { operations: ["CREATE"], apiGroups: [""], apiVersions: ["v1"], resources: ["pods"] }
  clientConfig:
    service: { name: injector, namespace: mesh, path: /mutate }
  reinvocationPolicy: IfNeeded   # may be re-invoked; mutation must be idempotent
  failurePolicy: Ignore
  admissionReviewVersions: ["v1"]
  sideEffects: None
# The webhook returns a JSON patch adding the sidecar container to the Pod.
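The JSON patch response, and the idempotency it needs, can be sketched like this. It is a minimal illustration rather than any mesh's real injector: the SIDECAR definition and the mutate function are hypothetical, and the HTTPS server is omitted. The response fields used (patchType set to JSONPatch, and patch carrying the base64-encoded patch document) are those of the admission.k8s.io/v1 AdmissionReview response.

```python
import base64
import json

# Hypothetical sidecar definition used only for this sketch.
SIDECAR = {"name": "proxy", "image": "example.com/proxy:1.0"}


def mutate(admission_review: dict) -> dict:
    """Return an AdmissionReview response that appends the sidecar if it is missing."""
    request = admission_review["request"]
    containers = request["object"].get("spec", {}).get("containers", [])
    response = {"uid": request["uid"], "allowed": True}
    already_injected = any(c.get("name") == SIDECAR["name"] for c in containers)
    if not already_injected:  # idempotency guard: a second call emits no patch
        patch = [{"op": "add", "path": "/spec/containers/-", "value": SIDECAR}]
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(patch).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview", "response": response}


pod = {"spec": {"containers": [{"name": "app", "image": "nginx:1.27"}]}}
first = mutate({"request": {"uid": "hypothetical-uid-1", "object": pod}})
print(json.loads(base64.b64decode(first["response"]["patch"])))
```

If the API server calls the webhook again with the already-patched Pod, the guard finds the proxy container and returns allowed with no patch, so the sidecar is never injected twice.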
Exercises
- (Beginner) What does a mutating webhook do that a validating one cannot?
- (Beginner) Name a widely-used feature implemented via mutating webhooks.
- (Intermediate) Why must a mutating webhook be idempotent?
- (Interview) Why are mutating webhooks considered riskier than validating webhooks, and what precautions apply? (Hint: silently changes every matched object; idempotency, scoping, testing, ordering.)
Answers
- It can modify the incoming object (e.g., via a JSON patch) — injecting containers, setting defaults, adding labels — whereas a validating webhook can only accept/reject.
- Service-mesh sidecar injection (Istio/Linkerd inject their Envoy/proxy container via a mutating webhook). (Also: default securityContext injection, automatic label/annotation insertion.)
- The API server may invoke mutating webhooks more than once (e.g., via reinvocationPolicy after other mutations), so the webhook must produce the same correct result if applied repeatedly; otherwise it could double-inject a sidecar or repeatedly alter fields, corrupting the object.
- They silently alter every object they match, so a bug can corrupt all affected resources (e.g., break every new Pod) rather than just rejecting some. Precautions: make mutations idempotent, scope tightly with rules/namespaceSelector/objectSelector (exclude kube-system), set sensible failurePolicy/timeouts, consider ordering relative to other mutating webhooks, run the webhook HA, and test thoroughly (including reinvocation) before broad rollout.
OPA Gatekeeper
Theory
Writing and operating custom admission webhooks is a lot of work. OPA Gatekeeper is a popular policy engine that provides a ready-made validating (and limited mutating) admission webhook driven by declarative policies, so you don't write webhook code. It's built on the Open Policy Agent (OPA) and its Rego policy language.
Gatekeeper introduces two concepts: a ConstraintTemplate (defines a reusable policy and its Rego logic, exposing it as a new CRD) and Constraints (instances of that template applied with parameters and scope). For example, a template "required labels" plus a constraint "all namespaces must have an owner label." Gatekeeper also supports audit (continuously checking existing objects against policies, not just new ones) and a constraint library of common policies. It's a CNCF project widely used to enforce governance (image registries, resource limits, allowed types) without bespoke webhook development.
Example
# 1) ConstraintTemplate defines the policy logic (Rego), exposed as a CRD:
apiVersion: templates.gatekeeper.sh/v1
kind: ConstraintTemplate
metadata: { name: k8srequiredlabels }
spec:
  crd: { spec: { names: { kind: K8sRequiredLabels } } }
  targets: [ { target: admission.k8s.gatekeeper.sh, rego: "..." } ]
---
# 2) Constraint applies it: every Namespace must have an "owner" label
apiVersion: constraints.gatekeeper.sh/v1beta1
kind: K8sRequiredLabels
metadata: { name: ns-must-have-owner }
spec:
  match: { kinds: [ { apiGroups: [""], kinds: ["Namespace"] } ] }
  parameters: { labels: ["owner"] }
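The split between template and constraint, and what audit adds, can be modeled conceptually in a few lines. This is Python, not Rego, and it is not how Gatekeeper is implemented: required_labels_template and audit are hypothetical functions that only mirror the idea of reusable logic, parameters plus match scope, and a scan over existing objects. The Rego body elided in the example above is left as it is.

```python
def required_labels_template(parameters: dict):
    """Stand-in for a ConstraintTemplate: reusable logic configured by parameters."""
    required = set(parameters["labels"])

    def violations(obj: dict) -> list:
        present = set(obj.get("metadata", {}).get("labels", {}))
        return sorted(required - present)  # labels that are missing

    return violations


def audit(objects: list, kinds: set, check) -> dict:
    """Stand-in for audit: report violations among EXISTING objects of the matched kinds."""
    report = {}
    for obj in objects:
        if obj["kind"] in kinds:
            missing = check(obj)
            if missing:
                report[obj["metadata"]["name"]] = missing
    return report


# One template, instantiated as a constraint with parameters and a match scope.
ns_must_have_owner = required_labels_template({"labels": ["owner"]})
existing = [
    {"kind": "Namespace", "metadata": {"name": "team-a", "labels": {"owner": "alice"}}},
    {"kind": "Namespace", "metadata": {"name": "team-b"}},
    {"kind": "Pod", "metadata": {"name": "web"}},
]
print(audit(existing, {"Namespace"}, ns_must_have_owner))
```

The same template could be instantiated again with different parameters or kinds, which is the "one template, many constraints" relationship.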
Exercises
- (Beginner) What does OPA Gatekeeper save you from writing yourself?
- (Beginner) What policy language does Gatekeeper (via OPA) use?
- (Intermediate) Explain the relationship between a ConstraintTemplate and a Constraint.
- (Interview) What does Gatekeeper's "audit" feature add beyond admission-time enforcement, and why is it valuable? (Hint: finds existing violations created before the policy.)
Answers
- A custom admission webhook (server code) — Gatekeeper provides the webhook and you supply declarative policies instead of writing/operating your own webhook service.
- Rego (the Open Policy Agent policy language).
- A ConstraintTemplate defines reusable policy logic (in Rego) and exposes it as a new CRD kind; a Constraint is an instance of that template that activates the policy with specific parameters and a match scope (which resources it applies to). One template, many constraints.
- Audit periodically evaluates existing cluster objects against the active constraints, reporting violations even for resources created before the policy existed (or that bypassed admission). This is valuable for discovering and remediating non-compliant resources already in the cluster, giving continuous compliance visibility rather than only blocking new violations at admission time.
Kyverno policy engine
Theory
Kyverno is another CNCF policy engine and the main alternative to Gatekeeper. Its defining feature: policies are written as native Kubernetes YAML (no Rego/new language to learn) — making it especially approachable for Kubernetes practitioners. Kyverno runs as an admission webhook and can validate, mutate, and generate resources, plus verify image signatures and run background scans.
Capabilities that distinguish it:
- Validate: reject non-compliant resources (like Gatekeeper).
- Mutate: modify resources (e.g., add default labels/securityContext) — a richer mutation story than Gatekeeper.
- Generate: automatically create related resources (e.g., create a default NetworkPolicy or ResourceQuota whenever a Namespace is created).
- VerifyImages: enforce image signature verification (cosign).
The choice between Kyverno and Gatekeeper often comes down to language preference (YAML vs. Rego) and feature needs (Kyverno's generate/mutate vs. OPA's general-purpose Rego power). Both enforce policy-as-code for governance and security.
Example
# Kyverno policy in plain YAML: require resource limits on all Pods
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: require-limits }
spec:
  validationFailureAction: Enforce   # Enforce (block) or Audit (report)
  rules:
  - name: check-limits
    match: { any: [ { resources: { kinds: ["Pod"] } } ] }
    validate:
      message: "CPU and memory limits are required."
      pattern:
        spec:
          containers:
          - resources:
              limits: { memory: "?*", cpu: "?*" }   # must be set (non-empty)
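The effect of the "?*" pattern (at least one character, so the field must be present and non-empty) can be approximated in a few lines. This sketch is not Kyverno's matcher: missing_limits is a hypothetical helper, and Python's fnmatchcase is used only because its ? and * wildcards behave similarly for this simple case.

```python
from fnmatch import fnmatchcase


def missing_limits(pod: dict, required=("cpu", "memory")) -> list:
    """List containers whose limits do not satisfy the '?*' (non-empty) pattern."""
    problems = []
    for container in pod.get("spec", {}).get("containers", []):
        limits = container.get("resources", {}).get("limits", {})
        for key in required:
            value = str(limits.get(key, ""))  # an absent field becomes the empty string
            if not fnmatchcase(value, "?*"):
                problems.append(f"{container['name']}: {key} limit not set")
    return problems


pod = {
    "spec": {
        "containers": [
            {"name": "a", "resources": {"limits": {"cpu": "500m", "memory": "128Mi"}}},
            {"name": "b", "resources": {"limits": {"cpu": "1"}}},
        ]
    }
}
print(missing_limits(pod))
```

Every container is checked, so one container without a memory limit is enough for the Pod to fail the rule.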
Exercises
- (Beginner) What makes Kyverno's policy authoring approachable compared to Gatekeeper?
- (Beginner) Name the four kinds of actions Kyverno policies can perform.
- (Intermediate) Give an example of Kyverno's "generate" capability and why it's useful.
- (Interview) How would you decide between Kyverno and OPA Gatekeeper for a team? (Hint: YAML vs. Rego, generate/mutate features vs. general-purpose policy logic.)
Answers
- Kyverno policies are written as native Kubernetes YAML — no separate policy language (like Rego) is required — so they're familiar to anyone who already writes Kubernetes manifests.
- Validate, mutate, generate, and verifyImages (image signature verification).
- Example: automatically generate a default NetworkPolicy (or ResourceQuota/LimitRange/RoleBinding) whenever a new Namespace is created. This is useful for enforcing baseline governance/security on every namespace without manual steps, ensuring consistency and reducing the chance of forgetting required defaults.
- Consider the team's familiarity and needs: Kyverno if you prefer YAML-native policies, want built-in mutate/generate/image-verification, and value a gentle learning curve. Gatekeeper/OPA if you want the full expressive power of Rego (complex logic, reuse across systems beyond Kubernetes) and the OPA ecosystem/constraint library, and the team is comfortable learning Rego. Both provide validation and audit; the decision largely hinges on language preference and whether Kyverno's generation/mutation features matter.
9.4 Pod Security
Even with RBAC, a Pod can be dangerous if it runs as root, privileged, or with host access. Pod-level security controls constrain what containers can do. This subchapter covers them.
Pod Security Admission (PSA)
Theory
The old PodSecurityPolicy (PSP) was removed in v1.25. Its replacement is Pod Security Admission (PSA) — a built-in admission controller that enforces the Pod Security Standards (next topic) at the namespace level via labels. Instead of complex policy objects, you label a namespace with the standard ("level") to apply and the mode of enforcement.
Three modes, which can be combined:
- enforce: reject Pods that violate the level.
- audit: allow but record violations in the audit log.
- warn: allow but return a user-facing warning.
You also pin a version (e.g., v1.30) for consistent behavior. PSA is simpler than PSP but coarser (per-namespace, three predefined levels) — for finer or custom rules you combine it with a policy engine (Kyverno/Gatekeeper). It's the standard built-in mechanism for baseline Pod hardening.
Example
apiVersion: v1
kind: Namespace
metadata:
  name: prod
  labels:
    pod-security.kubernetes.io/enforce: restricted # block violators
    pod-security.kubernetes.io/enforce-version: v1.30
    pod-security.kubernetes.io/warn: restricted # also warn
    pod-security.kubernetes.io/audit: restricted # and audit
# A Pod violating "restricted" in this namespace is rejected at creation:
kubectl apply -f privileged-pod.yaml -n prod
# Error: violates PodSecurity "restricted:v1.30": privileged (...)
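Because the modes can be combined with different levels, a common migration pattern is to enforce a lenient level while surfacing violations of a stricter one. The sketch below assumes a hypothetical namespace named staging.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: staging                                        # hypothetical namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline       # reject only Baseline violations
    pod-security.kubernetes.io/enforce-version: v1.30
    pod-security.kubernetes.io/warn: restricted        # warn the user about Restricted violations
    pod-security.kubernetes.io/warn-version: v1.30
    pod-security.kubernetes.io/audit: restricted       # and record them in the audit log
    pod-security.kubernetes.io/audit-version: v1.30
```

Pods that meet Baseline but not Restricted are still admitted here; the warnings and audit entries show what would break before enforce is raised to restricted.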
Exercises
- (Beginner) What did Pod Security Admission replace?
- (Beginner) At what scope is PSA applied, and how is it configured?
- (Intermediate) What are the three PSA modes and how do they differ?
- (Interview) PSA is simpler but coarser than the old PSP. When would you supplement PSA with a policy engine like Kyverno? (Hint: custom/finer-grained rules beyond the three standard levels.)
Answers
- PodSecurityPolicy (PSP), which was removed in v1.25.
- At the namespace level, configured via labels on the Namespace (pod-security.kubernetes.io/<mode>: <level> and a pinned version).
- enforce (reject violating Pods), audit (allow but log violations to the audit log), and warn (allow but return a warning to the user). They can be set simultaneously with different levels.
- When you need rules beyond the three predefined Pod Security Standards or finer granularity than per-namespace — e.g., enforce specific allowed registries, require particular labels, mandate resource limits, allow narrow exceptions, or apply object-specific logic. A policy engine (Kyverno/Gatekeeper) provides custom, parameterized policies (validate/mutate/generate) that complement PSA's coarse baseline.
Pod Security Standards: Privileged, Baseline, Restricted
Theory
The Pod Security Standards (PSS) define three cumulative security levels that PSA enforces:
- Privileged: unrestricted — no constraints. For trusted, system-level workloads (e.g., CNI/CSI agents) that legitimately need host access.
- Baseline: prevents known privilege escalations while staying broadly compatible. Disallows the most dangerous settings (privileged: true, host namespaces such as hostNetwork/hostPID/hostIPC, hostPath volumes, adding dangerous capabilities) but is lenient enough for most apps.
- Restricted: the hardened, best-practice level. On top of Baseline, it requires running as non-root, dropping ALL capabilities (only NET_BIND_SERVICE may be added back), a seccompProfile of RuntimeDefault (or Localhost), allowPrivilegeEscalation: false, and a restricted set of volume types. A read-only root filesystem is not checked by the standard but is a recommended companion setting. This is the target for security-sensitive workloads.
They are a spectrum from "anything goes" to "locked down." The recommendation: default namespaces to at least Baseline, and aim for Restricted for production workloads, reserving Privileged only for specific trusted system components.
Example
# A Pod that satisfies the "restricted" standard:
spec:
  securityContext:
    runAsNonRoot: true
    seccompProfile: { type: RuntimeDefault }
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities: { drop: ["ALL"] }
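For contrast, here is a sketch of a Pod that only the Privileged level would admit. The Pod name and the busybox image are illustrative assumptions; each commented field is one that Baseline (and therefore Restricted) rejects.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: node-debug               # hypothetical name
spec:
  hostNetwork: true              # host namespace: rejected by Baseline and Restricted
  containers:
  - name: tool
    image: busybox:1.36          # illustrative image
    command: ["sleep", "3600"]
    securityContext:
      privileged: true           # privileged container: rejected by Baseline and Restricted
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:                    # hostPath volume: rejected by Baseline and Restricted
      path: /
```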
Exercises
- (Beginner) Name the three Pod Security Standards from least to most restrictive.
- (Beginner) Which standard would you use for a trusted CNI agent that needs host access?
- (Intermediate) List three requirements the Restricted level imposes.
- (Interview) Why is Baseline a pragmatic default for most workloads while Restricted is the goal for production? (Hint: compatibility vs. hardening trade-off.)
Answers
- Privileged (least restrictive), Baseline, Restricted (most restrictive).
- Privileged — system components like CNI/CSI agents legitimately require host namespaces/privileges that Baseline/Restricted forbid.
- Any three: run as non-root (runAsNonRoot: true), drop ALL capabilities, allowPrivilegeEscalation: false, seccompProfile: RuntimeDefault, no host namespaces/privileged, restricted volume types (and read-only root filesystem is encouraged).
- Baseline blocks the most dangerous, known-escalation settings while remaining compatible with the majority of existing applications, so it's a safe default that rarely breaks workloads. Restricted enforces full hardening best practices (non-root, dropped capabilities, seccomp, no privilege escalation) which materially reduces attack surface but may require app changes (e.g., not running as root). Production should aim for Restricted to minimize risk, using Baseline as the broadly-applicable minimum and Privileged only for vetted system components.
SecurityContext: runAsUser, fsGroup, capabilities
Theory
A securityContext is where you actually configure the per-Pod and per-container security settings that the Pod Security Standards check. It can be set at the Pod level (applies to all containers) and container level (overrides for that container). Key fields:
- runAsUser / runAsGroup / runAsNonRoot: the UID/GID the container process runs as; runAsNonRoot: true blocks running as root (UID 0).
- fsGroup (Pod level only): a supplemental group applied to mounted volumes so the container can read/write them (fixes permission issues on persistent volumes).
- capabilities (container level only): Linux capabilities to add or drop. Best practice is drop: ["ALL"] then add back only what's strictly needed (e.g., NET_BIND_SERVICE to bind ports < 1024).
- allowPrivilegeEscalation, readOnlyRootFilesystem, privileged (all container level only), seccompProfile: further hardening.
Setting a strong securityContext (non-root, dropped capabilities, no privilege escalation, read-only root FS) is the concrete implementation of Pod hardening and is required to meet the Restricted standard.
Example
spec:
  securityContext: # Pod-level (all containers)
    runAsNonRoot: true
    runAsUser: 1000
    fsGroup: 2000 # volume files owned by gid 2000 -> writable
  containers:
  - name: app
    image: myapp:1.0
    securityContext: # container-level overrides/additions
      allowPrivilegeEscalation: false
      readOnlyRootFilesystem: true
      capabilities:
        drop: ["ALL"]
        add: ["NET_BIND_SERVICE"] # only what's needed
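To make the override behavior concrete, the following sketch sets a default UID at the Pod level and overrides it for a single container. The Pod name, the sidecar, and its image are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: override-demo            # hypothetical name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 1000              # default UID for every container in the Pod
  containers:
  - name: app
    image: myapp:1.0             # no container-level value: runs as UID 1000
  - name: log-shipper            # hypothetical sidecar
    image: shipper:1.0           # hypothetical image
    securityContext:
      runAsUser: 2000            # container-level value wins for this container only
```

Fields that exist at both levels (such as runAsUser) follow this precedence; fields that exist only at the container level (such as capabilities) have to be repeated for each container that needs them.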
Exercises
- (Beginner) What does runAsNonRoot: true enforce?
- (Beginner) What problem does fsGroup solve?
- (Intermediate) What is the recommended approach to Linux capabilities in a securityContext?
- (Interview) Explain how Pod-level and container-level securityContext interact, and give an example where you'd override at the container level. (Hint: Pod sets defaults; container overrides for specific needs.)
Answers
- It prevents the container from running as the root user (UID 0); if the image would run as root, the container fails to start.
- Volume permission issues: fsGroup sets a supplemental group owner on mounted (especially persistent) volumes so the container's process can read/write them, instead of being denied access due to root-owned mount permissions.
- Drop all capabilities (drop: ["ALL"]) and then add back only the specific capabilities the workload truly needs (e.g., NET_BIND_SERVICE), minimizing privileges.
- The Pod-level securityContext sets defaults for all containers; a container-level securityContext overrides or augments those for that specific container. Example: a Pod sets runAsNonRoot: true and runAsUser: 1000 for every container, while one container that must bind to port 80 adds NET_BIND_SERVICE in its own securityContext (capabilities, like readOnlyRootFilesystem and allowPrivilegeEscalation, exist only at the container level), or a sidecar overrides runAsUser with a different UID. Where a field is set at both levels, the container-level value takes precedence for that container, allowing minimal, targeted exceptions to a hardened Pod default.
AppArmor and Seccomp profiles
Theory
Beyond capabilities and user IDs, two Linux kernel security mechanisms further confine what a container can do:
- Seccomp (secure computing mode): filters the system calls a process may make. Most applications use only a small subset of the ~300+ syscalls; blocking the rest shrinks the kernel attack surface dramatically. The RuntimeDefault seccomp profile (the container runtime's built-in default filter) is recommended for all workloads and satisfies the Restricted standard; you can also supply custom profiles.
- AppArmor: a Linux Security Module that confines a program to a set of allowed capabilities and file/resource accesses via a named profile loaded on the node. It restricts what files a container can read/write/execute and what operations it can perform.
Both are configured through the Pod/container spec (seccomp via securityContext.seccompProfile; AppArmor via securityContext.appArmorProfile in recent versions, or annotations in older ones). They embody defense in depth: even if an attacker gains code execution in a container, these confinements limit what they can actually do to the kernel/host.
Example
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault # block dangerous syscalls (recommended baseline)
  containers:
  - name: app
    image: myapp:1.0
    securityContext:
      appArmorProfile:
        type: Localhost # use a named AppArmor profile loaded on the node
        localhostProfile: k8s-myapp
      # custom seccomp alternative:
      # seccompProfile: { type: Localhost, localhostProfile: profiles/myapp.json }
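On clusters that predate the appArmorProfile field, the same AppArmor confinement is requested through an annotation. The sketch below reuses the container name app and the profile k8s-myapp from the example above; the Pod name is a hypothetical addition.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp                    # hypothetical name
  annotations:
    # The key suffix is the container name; the value is localhost/<profile loaded on the node>.
    container.apparmor.security.beta.kubernetes.io/app: localhost/k8s-myapp
spec:
  securityContext:
    seccompProfile:
      type: RuntimeDefault
  containers:
  - name: app
    image: myapp:1.0
```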
Exercises
- (Beginner) What does seccomp filter, and what does AppArmor confine?
- (Beginner) Which seccomp profile is the recommended baseline for workloads?
- (Intermediate) Why does restricting syscalls reduce the kernel attack surface?
- (Interview) How do seccomp and AppArmor exemplify defense in depth alongside RBAC and securityContext? (Hint: limit damage even after a container is compromised.)
Answers
- Seccomp filters the system calls a process may make to the kernel; AppArmor confines a program's allowed file/resource accesses and operations via a profile (a Linux Security Module).
- RuntimeDefault (the runtime-provided default seccomp profile).
- Most applications need only a small fraction of the kernel's hundreds of syscalls. Many kernel vulnerabilities are reached through specific, rarely-needed syscalls; blocking everything outside the app's required set removes those attack paths, so even exploited code can't invoke the dangerous syscalls, shrinking the exploitable surface.
- Defense in depth layers independent controls so a single failure isn't catastrophic. RBAC limits API access; securityContext limits privileges/identity (non-root, dropped capabilities); seccomp and AppArmor confine the container at the kernel level — restricting syscalls and file/resource access. So even if an attacker compromises the application and bypasses higher layers, seccomp/AppArmor constrain what they can do to the kernel and host (e.g., can't make the syscalls needed for a container escape), containing the breach. Each layer assumes the others might fail.
9.5 Secrets and Supply Chain Security
The images you run are only as trustworthy as their provenance. This subchapter covers verifying images, controlling registries, runtime threat detection, and software bills of materials.
Image signing and verification (cosign, Notary)
Theory
How do you know the image your cluster pulls is the one your team built, unmodified? By default, you don't — image tags are mutable and registries can be compromised. Image signing addresses this: a trusted party cryptographically signs an image, and the cluster verifies the signature before running it, guaranteeing authenticity (who built it) and integrity (it wasn't tampered with).
The dominant modern tool is cosign (part of the Sigstore project), which signs container images (and other artifacts) and stores signatures in the registry alongside the image. Sigstore enables keyless signing (using short-lived certificates tied to an OIDC identity, removing the burden of managing signing keys). The older Notary / Docker Content Trust (TUF-based) served a similar purpose. Verification is enforced in the cluster via admission policies (next topic) so unsigned or untrusted images are rejected — closing a major supply-chain attack vector.
Example
# Sign an image with cosign (keyless, using OIDC identity):
cosign sign --yes registry.example.com/myapp@sha256:abc123...
# Verify a signature before deploying:
cosign verify --certificate-identity=ci@example.com \
--certificate-oidc-issuer=https://accounts.google.com \
registry.example.com/myapp@sha256:abc123...
Exercises
- (Beginner) What two properties does image signing guarantee?
- (Beginner) What is the modern tool/project most associated with container image signing?
- (Intermediate) What problem does Sigstore's "keyless signing" solve?
- (Interview) Why is verifying image signatures (and pinning by digest) important against supply-chain attacks, given that tags are mutable? (Hint: tag re-pointing, registry compromise, integrity/authenticity.)
Answers
- Authenticity (the image came from the expected, trusted signer) and integrity (it has not been altered since signing).
- cosign (part of the Sigstore project). (Notary / Docker Content Trust is the older alternative.)
- The operational burden and risk of managing long-lived signing keys. Keyless signing uses short-lived certificates issued against an OIDC identity (recorded in a transparency log), so there are no private keys to store, rotate, or leak — signing is tied to a verifiable identity instead.
- Tags like :latest or :v1 are mutable pointers: an attacker who compromises the registry or pipeline can re-point a tag to a malicious image, and consumers would unknowingly pull it. Signature verification ensures only images signed by a trusted identity run (authenticity) and that the content matches what was signed (integrity), while pinning by immutable digest (@sha256:...) guarantees you run exactly the intended bytes. Together they defeat tag re-pointing and tampering, even if the registry is compromised.
Admission policies for image registries
Theory
Signing is only useful if the cluster enforces verification, and you often also want to restrict where images may come from. This is done with admission policies:
- Registry allowlisting: reject Pods whose images aren't from approved registries (e.g., only registry.example.com/*), preventing use of arbitrary public images that could be malicious or unvetted.
- Signature verification at admission: reject images lacking a valid signature from a trusted signer (Kyverno's verifyImages, Connaisseur, Sigstore policy-controller, or Gatekeeper with external data).
- Tag policies: forbid mutable tags like :latest, requiring immutable digests for reproducibility.
These are implemented with the policy engines from §9.3 (Kyverno, Gatekeeper) or purpose-built controllers. The result is a cluster that will only run images that are from trusted sources, properly signed, and pinned — a strong supply-chain control enforced automatically at deploy time.
Example
# Kyverno: require images from an approved registry AND a valid cosign signature
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata: { name: trusted-images }
spec:
  validationFailureAction: Enforce
  rules:
  - name: only-approved-registry
    match: { any: [ { resources: { kinds: ["Pod"] } } ] }
    validate:
      message: "Images must come from registry.example.com"
      pattern: { spec: { containers: [ { image: "registry.example.com/*" } ] } }
  - name: verify-signature
    match: { any: [ { resources: { kinds: ["Pod"] } } ] }
    verifyImages:
    - imageReferences: [ "registry.example.com/*" ]
      attestors: [ { entries: [ { keyless: { issuer: "https://accounts.google.com", subject: "ci@example.com" } } ] } ]
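The tag-policy layer described in the theory can be sketched in the same style. This is an illustrative policy with hypothetical policy and rule names: the first rule forbids the :latest tag, and the second, stricter rule requires every image to be referenced by digest. It checks only spec.containers; a fuller policy would cover initContainers as well.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: immutable-image-refs     # hypothetical policy name
spec:
  validationFailureAction: Enforce
  rules:
  - name: forbid-latest-tag
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "The :latest tag is mutable and not allowed."
      pattern:
        spec:
          containers:
          - image: "!*:latest"   # negated wildcard: must NOT end in :latest
  - name: require-digest
    match:
      any:
      - resources:
          kinds: ["Pod"]
    validate:
      message: "Images must be pinned by digest (@sha256:...)."
      pattern:
        spec:
          containers:
          - image: "*@sha256:*"  # every container image must carry a digest
```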
Exercises
- (Beginner) What does registry allowlisting prevent?
- (Beginner) What kind of tag do tag policies typically forbid, and what do they require instead?
- (Intermediate) Which tools enforce image signature verification at admission?
- (Interview) Describe a layered admission policy that ensures only trusted images run, and what each layer defends against. (Hint: registry source + signature + digest pinning.)
Answers
- Running images from unapproved/arbitrary sources (e.g., random public registries), which could be malicious, vulnerable, or unvetted — restricting Pods to images from trusted registries only.
- They forbid mutable tags like :latest and require immutable digests (@sha256:...) for reproducibility and integrity.
- Policy engines/controllers such as Kyverno (verifyImages), Sigstore's policy-controller, Connaisseur, or OPA Gatekeeper (with external data), invoked as admission webhooks.
- A layered policy: (1) registry allowlist: reject images not from approved registries (defends against pulling untrusted/unknown images); (2) signature verification: reject images without a valid signature from a trusted identity (defends against tampered or unauthorized images, even from the right registry); (3) digest pinning / no mutable tags: require @sha256: digests (defends against tag re-pointing and ensures reproducible, exact-content deployments). Together they ensure every running image is from a known source, provably authentic and untampered, and exactly the intended artifact, enforced automatically at admission.
Falco for runtime threat detection
Theory
All the prior controls are preventive (block bad things at deploy time). But you also need detective controls for what happens at runtime — because a container can be exploited, or a legitimate image can behave maliciously, after it's running. Falco (a CNCF project) is the leading runtime security/threat-detection tool for Kubernetes.
Falco taps into kernel events (via eBPF or a kernel module) to observe syscalls and behavior in real time, comparing them against a set of rules that describe suspicious activity: a shell spawned inside a container, an unexpected outbound connection, writes to sensitive paths, reads of secrets, privilege escalation, or a process not in the container's normal profile. When a rule matches, Falco emits an alert (to stdout, a webhook, Falcosidekick, SIEM, etc.). It's the runtime analog to an intrusion detection system, enabling you to detect and respond to attacks that slipped past preventive controls — embodying the "assume breach" mindset.
Example
# A Falco rule: alert when a shell is started inside a container
- rule: Terminal shell in container
  desc: A shell was spawned in a container (often a sign of intrusion)
  condition: >
    spawned_process and container
    and shell_procs and proc.tty != 0
  output: "Shell in container (user=%user.name container=%container.name cmd=%proc.cmdline)"
  priority: WARNING
# Falco runs as a DaemonSet; alerts stream from each node:
kubectl -n falco logs -l app.kubernetes.io/name=falco -f
# WARNING Shell in container (user=root container=web cmd=bash)
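Another of the behaviors listed in the theory, writes to sensitive paths, can be expressed in the same rule format. This is a sketch with a hypothetical rule name; it assumes the open_write and container macros from Falco's default ruleset are loaded.

```yaml
- rule: Write below etc in container     # hypothetical rule name
  desc: A file under /etc was opened for writing inside a container
  condition: >
    open_write and container
    and fd.name startswith /etc/
  output: "File below /etc opened for writing (user=%user.name container=%container.name file=%fd.name cmd=%proc.cmdline)"
  priority: ERROR
  tags: [filesystem]
```

In practice such a rule usually needs exceptions for processes that legitimately write configuration at startup, otherwise it generates noise.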
Exercises
- (Beginner) Is Falco a preventive or a detective control? What does it observe?
- (Beginner) How does Falco typically run in a cluster?
- (Intermediate) Give two examples of suspicious runtime behavior Falco can detect.
- (Interview) Why are runtime detection tools like Falco necessary even with strong admission and Pod-security controls? (Hint: assume breach; exploits/zero-days/insider behavior at runtime.)
Answers
- Detective — it detects suspicious activity at runtime (it doesn't block deployment). It observes kernel-level events/syscalls and container behavior in real time.
- As a DaemonSet (one agent per node), tapping kernel events via eBPF or a kernel module, emitting alerts.
- Any two: a shell/terminal spawned inside a container, unexpected outbound network connections, writes to sensitive directories (e.g., /etc), reading secret files, package-manager execution in a running container, privilege escalation attempts, or processes outside the container's expected set.
- Preventive controls reduce but can't eliminate risk: applications have vulnerabilities (including zero-days), legitimate-but-malicious or compromised images can pass admission, and insiders or supply-chain issues can introduce threats that only manifest at runtime. Under an "assume breach" model, you need to detect malicious behavior after a container is running — Falco provides that visibility (e.g., catching an attacker who exploited an app and spawned a shell), enabling alerting and response that preventive admission/Pod-security checks alone cannot offer.
SBOM and vulnerability scanning
Theory
You can't secure what you don't know you're shipping. A Software Bill of Materials (SBOM) is a complete inventory of all components, libraries, and dependencies (with versions) inside a container image — like an ingredients label. Generated at build time (tools: Syft, Trivy, in formats like SPDX or CycloneDX), an SBOM lets you instantly answer "are we affected by the new CVE in library X?" by querying which images contain it — crucial during incidents like Log4Shell.
Vulnerability scanning complements the SBOM by checking those components against known-vulnerability databases (CVEs). Scanners (Trivy, Grype, Clair, Snyk) run in CI (scan before push), in the registry (continuous re-scan as new CVEs are disclosed), and via admission (block images above a severity threshold). The combination — know your contents (SBOM) and check them for known flaws (scanning) — is foundational supply-chain hygiene, shifting vulnerability management "left" into the build pipeline and continuously across the image lifecycle.
Example
# Generate an SBOM for an image (Syft):
syft registry.example.com/myapp:1.0 -o spdx-json > sbom.json
# Scan an image for known CVEs (Trivy), failing on High/Critical:
trivy image --severity HIGH,CRITICAL --exit-code 1 registry.example.com/myapp:1.0
# myapp:1.0 (alpine 3.19)
# Total: 2 (HIGH: 1, CRITICAL: 1)
# CVE-2024-XXXX openssl 3.1.4 -> 3.1.6 (fixed)
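CI scanning covers only the moment of build, so one way to approximate continuous re-scanning inside the cluster is a scheduled Job that reruns the same Trivy command. The sketch below makes several assumptions: the CronJob name and namespace are hypothetical, the aquasec/trivy image and tag are illustrative and should be pinned to a version you have verified, and the image being scanned is pullable without extra registry credentials.

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: rescan-myapp             # hypothetical name
  namespace: security            # hypothetical namespace
spec:
  schedule: "0 3 * * *"          # every day at 03:00
  jobTemplate:
    spec:
      backoffLimit: 0            # a failed scan is a finding, not something to retry
      template:
        spec:
          restartPolicy: Never
          containers:
          - name: trivy
            image: aquasec/trivy:0.50.0   # assumption: pin to a verified version or digest
            args:                # same flags as the CLI example above
            - image
            - --severity=HIGH,CRITICAL
            - --exit-code=1
            - registry.example.com/myapp:1.0
```

A failed Job then signals that a HIGH or CRITICAL CVE now applies to an image that passed its build-time scan; registry-native scanning, where available, achieves the same result without running anything in the cluster.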
Exercises
- (Beginner) What is an SBOM, in one sentence?
- (Beginner) What does vulnerability scanning check images against?
- (Intermediate) During a newly-disclosed critical CVE, how does having SBOMs speed up your response?
- (Interview) Explain "shift left" for vulnerability management and why scanning in CI alone is insufficient — what additional scanning points are needed? (Hint: new CVEs appear after build; registry/admission re-scanning.)
Answers
- A Software Bill of Materials is a complete, machine-readable inventory of all components, libraries, and dependencies (and their versions) contained in a software artifact/image.
- Known-vulnerability databases — i.e., published CVEs (and advisories) for the components present in the image.
- With SBOMs you can immediately query which images/deployments contain the affected component and version, pinpointing exposure across your fleet in minutes instead of manually inspecting each image — enabling targeted, fast remediation (as was painfully needed during Log4Shell).
- "Shift left" means moving vulnerability detection earlier, into the build/CI pipeline, so issues are caught before deployment. But CI scanning only reflects vulnerabilities known at build time; new CVEs are disclosed continuously after an image is built and deployed. So you also need continuous re-scanning in the registry (re-evaluate stored images against updated CVE feeds) and admission-time scanning/policy (block deploying images exceeding a severity threshold), plus runtime awareness — covering the whole lifecycle, not just the moment of build.
10. Observability
You cannot operate what you cannot see. Observability is the practice of understanding a system's internal state from its external outputs, traditionally organized into three pillars — logs, metrics, and traces — plus health probes, events, and auditing that are specific to Kubernetes. This chapter covers how to collect, aggregate, and act on each, so you can answer not just "is it up?" but "why is it slow/failing, and where?"
10.1 Logging
Logs are the most immediate window into what an application is doing. This subchapter covers reading logs, how Kubernetes handles them at the node level, and how to aggregate them cluster-wide.
kubectl logs and log streaming
Theory
The simplest observability tool is kubectl logs, which retrieves the stdout/stderr of a container. The Kubernetes logging convention is that applications should write logs to standard output and standard error, not to files — the container runtime captures these streams, and the kubelet exposes them. This is the twelve-factor "logs as event streams" principle.
Practical usage: -f streams (follows) logs live; --previous shows logs from a crashed/restarted container's prior instance (essential for debugging CrashLoopBackOff); -c selects a specific container in a multi-container Pod; --since/--tail bound the output; and -l with --prefix can read logs across Pods matching a label. The key limitation: kubectl logs reads from the node's local log files, so logs are lost when the Pod is deleted or the node is gone — which is exactly why cluster-level aggregation (later) is needed.
Example
kubectl logs web-abc # current logs
kubectl logs -f web-abc -c app # stream container "app"
kubectl logs web-abc --previous # logs from the crashed prior container
kubectl logs web-abc --since=1h --tail=100 # last hour, last 100 lines
kubectl logs -l app=web --prefix --all-containers # across all matching Pods
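To try these commands against something concrete, a minimal Pod that follows the stdout convention could look like the sketch below. The Pod name, container name, and busybox image are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: counter                  # hypothetical name
spec:
  containers:
  - name: count
    image: busybox:1.36          # illustrative image
    # Write one line per second to stdout; the runtime captures it for kubectl logs.
    args: [/bin/sh, -c, 'i=0; while true; do echo "$i: $(date)"; i=$((i+1)); sleep 1; done']
```

After applying it, kubectl logs -f counter streams the numbered lines as they are written.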
Exercises
- (Beginner) Where should containerized applications write their logs, by convention?
- (Beginner) Which flag shows logs from a container that already crashed and restarted?
- (Intermediate) Why does kubectl logs become useless after a Pod is deleted, and what does that imply?
- (Interview) Explain the "logs as event streams" principle and why writing to stdout/stderr (rather than files) is the right model in Kubernetes. (Hint: runtime captures streams; decouples app from log routing.)
Answers
- To standard output (stdout) and standard error (stderr).
- --previous (kubectl logs <pod> --previous).
- kubectl logs reads node-local log files for the Pod's containers; when the Pod is deleted (or the node is gone), those files are removed, so the logs disappear. This implies you need a cluster-level log aggregation system to persist logs beyond a Pod's lifetime.
- The principle treats logs as a continuous stream of time-ordered events the app simply writes to stdout/stderr, leaving collection, routing, and storage to the platform. In Kubernetes the container runtime captures these streams and the kubelet exposes them, so apps don't manage log files, rotation, or shipping. This decouples the application from log infrastructure (a logging agent can collect and forward streams anywhere), keeps containers stateless, and makes log handling uniform across all workloads.
Node-level logging architecture
Theory
To understand where logs live, you need the node-level picture. When a container writes to stdout/stderr, the container runtime redirects those streams to log files on the node (typically under /var/log/pods/... and /var/log/containers/..., the latter being symlinks). The kubelet manages these and provides them to kubectl logs. The kubelet also handles log rotation (e.g., rotating at a size limit and keeping a few files) so logs don't fill the disk — which means old log lines are discarded over time.
This architecture has important consequences: logs are ephemeral and node-local — tied to the container's life on that node, rotated away, and lost on Pod/node deletion. There is no built-in cluster-wide log storage. This is by design: Kubernetes provides the raw streams and node-level handling, and leaves durable, searchable, aggregated logging to add-on systems (next topics). Knowing this explains both how kubectl logs works and why you must deploy a logging stack for production.
Example
Container stdout/stderr
        | (container runtime)
        v
/var/log/pods/<ns>_<pod>_<uid>/<container>/0.log (rotated by kubelet)
        ^
        | kubectl logs reads here | a logging agent (DaemonSet) also tails here
# On a node, the raw log files the runtime/kubelet manage:
ls /var/log/containers/ # symlinks: <pod>_<ns>_<container>-<id>.log
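The rotation behavior described above is controlled by two kubelet settings. The fragment below is a sketch of the relevant part of a kubelet configuration file; the values shown are the upstream defaults, and how the file is delivered to nodes depends on your distribution.

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
containerLogMaxSize: 10Mi   # rotate a container's log file once it reaches this size
containerLogMaxFiles: 5     # keep at most this many log files per container
```

With these values, roughly 50Mi of recent output per container is retained on the node; anything older is gone unless an agent has already shipped it.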
Exercises
- (Beginner) What component redirects container stdout/stderr to files on the node?
- (Beginner) Where on the node are container logs typically stored?
- (Intermediate) Why are node-level logs considered ephemeral, and name two events that cause log loss.
- (Interview) Given Kubernetes provides no built-in cluster log storage, what architectural pattern is used to achieve durable, searchable logs? (Hint: agent per node tails files and forwards to a backend.)
Answers
- The container runtime (with the kubelet managing/rotating the resulting files).
- Under /var/log/pods/... (with symlinks in /var/log/containers/...).
- They are stored as node-local files tied to the container instance and are rotated by the kubelet (old lines discarded), so they don't persist indefinitely. Log loss occurs on log rotation (old entries dropped), Pod deletion, container removal, or node failure/replacement.
- A node-level logging agent (typically a DaemonSet, e.g., Fluent Bit/Fluentd) runs on every node, tails the container log files (and adds Kubernetes metadata), and forwards them to a centralized backend (Elasticsearch/OpenSearch, Loki, a cloud log service). The backend provides durable storage, indexing, and search across the whole cluster — the standard cluster-level logging pattern that compensates for the absence of built-in central storage.
Cluster-level logging patterns
Theory
To make logs durable and queryable cluster-wide, you add a logging pipeline. Three common patterns for collecting application logs:
- Node-level agent (DaemonSet) — the most common: one logging agent per node tails all containers' log files and forwards them. Efficient (one agent per node, app-agnostic, no app changes) — the default choice.
- Sidecar container — a logging sidecar in the Pod reads logs (e.g., from a shared volume or app-specific files) and either streams them to stdout or forwards them directly. Used when the app writes to files instead of stdout, or needs per-app processing.
- Application pushes directly — the app itself sends logs to the backend via a logging library. Simplest infrastructure but couples the app to the backend and is generally discouraged.
A complete pipeline has: collection (agent) → transport/processing (parse, enrich with Kubernetes metadata, filter) → storage/indexing (Elasticsearch/OpenSearch, Loki, cloud) → visualization/query (Kibana, Grafana). The node-agent pattern with a processing layer is the standard production architecture.
Example
Pattern 1 (recommended): node agent
[app->stdout]->[node log files]->[Fluent Bit DaemonSet]->[Loki/ES]->[Grafana/Kibana]
Pattern 2: sidecar
[app->file]->[sidecar reads file]->stdout or ->backend
Pattern 3: app pushes
[app library]----------------------------->[backend] (tight coupling)
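Pattern 2 can be sketched as a two-container Pod that shares an emptyDir volume. The names, the busybox image, and the shell loop standing in for a file-writing application are illustrative assumptions.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: app-with-log-sidecar     # hypothetical name
spec:
  containers:
  - name: app
    image: busybox:1.36          # stand-in for an app that only writes to a file
    args: [/bin/sh, -c, 'while true; do echo "$(date) request handled" >> /var/log/app/app.log; sleep 5; done']
    volumeMounts:
    - name: applogs
      mountPath: /var/log/app
  - name: log-streamer           # the logging sidecar
    image: busybox:1.36
    # Re-emit the file on stdout so kubectl logs and the node agent can see it.
    args: [/bin/sh, -c, 'tail -n+1 -F /var/log/app/app.log']
    volumeMounts:
    - name: applogs
      mountPath: /var/log/app
  volumes:
  - name: applogs
    emptyDir: {}                 # shared between the two containers
```

Here kubectl logs app-with-log-sidecar -c log-streamer shows the file's contents, and a node-level agent picks them up like any other stdout stream. The cost is an extra container per Pod, which is why the pattern is reserved for apps that cannot log to stdout directly.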
Exercises
- (Beginner) What is the most common cluster-level logging collection pattern?
- (Beginner) Name the four stages of a complete logging pipeline.
- (Intermediate) When would you use a logging sidecar instead of a node-level agent?
- (Interview) Why is the node-level DaemonSet agent generally preferred over having each application push logs directly to the backend? (Hint: decoupling, no app changes, efficiency, consistency.)
Answers
- A node-level logging agent deployed as a DaemonSet (one per node) that tails container logs and forwards them.
- Collection (agent) → transport/processing (parse/enrich/filter) → storage/indexing (backend) → visualization/query (UI).
- When the application writes logs to files rather than stdout/stderr (so the node agent can't capture them from the standard streams), or when a specific app needs custom per-Pod log processing/parsing before forwarding. The sidecar reads the app's files (often via a shared volume) and emits them to stdout or directly to the backend.
- The DaemonSet agent collects logs from all containers without modifying applications, works uniformly across every workload, runs one efficient agent per node, and decouples apps from the logging backend (you can change backends without touching apps). Direct app-push couples each application to a specific backend/library, requires code changes in every app, scatters configuration, and makes backend changes painful — so the node agent is the more maintainable, consistent, and efficient choice.
Log aggregation with Fluentd, Fluent Bit, Loki
Theory
The concrete tools that implement the pipeline:
- Fluentd (CNCF): a mature, feature-rich log collector/processor with a huge plugin ecosystem for inputs, parsing, filtering, and outputs. Powerful but relatively heavier (Ruby-based) on resources.
- Fluent Bit (CNCF): a lightweight, high-performance collector (C-based) designed for low memory/CPU — ideal as the per-node DaemonSet agent. Often Fluent Bit collects/forwards and (optionally) Fluentd aggregates centrally; many setups now use Fluent Bit alone.
- Loki (Grafana): a log storage/query backend designed to be cost-efficient by indexing only labels (metadata) rather than full log text (unlike Elasticsearch, which does full-text indexing). It pairs naturally with Grafana for querying via LogQL and with Prometheus-style labels.
A very common modern stack is Fluent Bit (collect) → Loki (store) → Grafana (visualize) — lightweight and cost-effective. The classic alternative is the EFK stack (Elasticsearch, Fluentd/Fluent Bit, Kibana) with full-text search. Choice depends on cost, search needs, and existing tooling.
Example
| Tool | Role | Characteristic |
|---|---|---|
| Fluentd | Collector/aggregator | Feature-rich, many plugins, heavier |
| Fluent Bit | Collector (agent) | Lightweight, fast, low footprint |
| Loki | Storage/query | Indexes labels only — cheap; LogQL + Grafana |
| Elasticsearch | Storage/query | Full-text index — powerful search, heavier/costlier |
# Fluent Bit (DaemonSet) tailing container logs and shipping to Loki (conceptual):
[INPUT]
    Name    tail
    Path    /var/log/containers/*.log
    Tag     kube.*
# The kubernetes filter enriches records with pod/namespace/labels metadata
[FILTER]
    Name    kubernetes
    Match   kube.*
[OUTPUT]
    Name    loki
    Match   kube.*
    Host    loki.logging.svc
    Labels  job=fluentbit
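To make the "index labels only" idea concrete, here is a toy Python model. It is a sketch of the concept, not of Loki's actual storage engine: the class name LabelIndexedStore and its methods are hypothetical. Only the label set of each stream is indexed; the log text is kept unindexed and is scanned only for streams whose labels match.

```python
from collections import defaultdict


class LabelIndexedStore:
    """Toy store: label sets are indexed, log lines sit unindexed per stream."""

    def __init__(self):
        # frozenset of (label, value) pairs -> raw lines of that stream
        self.chunks = defaultdict(list)

    def push(self, labels: dict, line: str) -> None:
        self.chunks[frozenset(labels.items())].append(line)

    def query(self, selector: dict, contains: str = "") -> list:
        wanted = set(selector.items())
        hits = []
        for stream_labels, lines in self.chunks.items():
            if wanted <= stream_labels:  # cheap lookup on labels only
                # brute-force text scan, restricted to the matching streams
                hits.extend(line for line in lines if contains in line)
        return hits
```

Calling store.query({"namespace": "prod"}, "timeout") plays the role of a LogQL query such as {namespace="prod"} |= "timeout": the selector narrows the streams, and the text filter is a scan rather than an index lookup. That is why narrow label selectors matter for query speed in this model.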
Exercises
- (Beginner) Which is more lightweight: Fluentd or Fluent Bit?
- (Beginner) What does Loki index, and how does that make it cost-efficient?
- (Intermediate) Contrast Loki with Elasticsearch for log storage.
- (Interview) Why might a team choose a Fluent Bit + Loki + Grafana stack over the classic EFK stack? (Hint: resource/cost efficiency, label-based indexing, Grafana unification vs. full-text search needs.)
Answers
- Fluent Bit (it's a lightweight, C-based collector with low memory/CPU footprint, versus the heavier Ruby-based Fluentd).
- Loki indexes only labels/metadata (not the full log text). By avoiding expensive full-text indexing and storing compressed log chunks in cheap object storage, it greatly reduces storage and compute cost.
- Loki indexes labels only (cheap, lower resource use, queried via LogQL, integrates with Grafana and Prometheus-style labels) but has weaker arbitrary full-text search. Elasticsearch performs full-text indexing of log content (powerful, fast ad-hoc text search and analytics) at the cost of much higher storage, memory, and operational overhead.
- The Fluent Bit + Loki + Grafana stack is lighter and cheaper: Fluent Bit uses minimal node resources, Loki's label-only indexing slashes storage/compute costs, and Grafana unifies logs with metrics/traces in one UI (good for correlation). EFK offers superior full-text search and analytics but is heavier and costlier to run and scale. A team that values cost efficiency, a unified Grafana experience, and label-based querying — and doesn't need heavy full-text search — would prefer the Loki stack.
10.2 Metrics and Monitoring
Metrics are numeric time-series that reveal trends, capacity, and health. Prometheus and Grafana dominate this space. This subchapter covers the metrics ecosystem.
Metrics Server and resource metrics
Theory
The Metrics Server (introduced in Chapter 2) is the lightweight component that provides real-time resource metrics, meaning the current CPU and memory usage of Pods and nodes, via the Metrics API (metrics.k8s.io). It powers kubectl top and the Horizontal/Vertical Pod Autoscalers. It scrapes each kubelet's resource metrics endpoint (/metrics/resource in current releases; older releases used the Summary API) and keeps only the latest values in memory (no history).
The crucial framing (worth repeating because it's commonly misunderstood): Metrics Server is not a monitoring system. It exists solely to supply current resource metrics for autoscaling and kubectl top. It cannot show trends, retain history, alert, or store custom application metrics. For all of that — historical dashboards, alerting, custom/business metrics — you use Prometheus (next). So a typical cluster runs both: Metrics Server for autoscaling/top, and Prometheus for full monitoring.
Example
kubectl top nodes
# NAME CPU(cores) CPU% MEMORY(bytes) MEMORY%
# node-1 320m 16% 2100Mi 54%
kubectl top pods -n prod --containers
# POD NAME CPU(cores) MEMORY(bytes)
# web-1 app 45m 180Mi
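The values in the kubectl top output use Kubernetes quantity notation: an m suffix means millicores (thousandths of a CPU core) and Mi means mebibytes. The following sketch converts the common cases; the helper names cpu_cores and memory_bytes are hypothetical, and only the suffixes shown are handled.

```python
BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def cpu_cores(quantity: str) -> float:
    """Convert a CPU quantity such as '320m' or '2' to cores."""
    if quantity.endswith("m"):
        return int(quantity[:-1]) / 1000  # millicores -> cores
    return float(quantity)


def memory_bytes(quantity: str) -> int:
    """Convert a memory quantity such as '2100Mi' to bytes."""
    for suffix, factor in BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain number of bytes
```

With these helpers, the 320m reported for node-1 is 0.32 of a core, and 2100Mi is 2,202,009,600 bytes.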
Exercises
- (Beginner) What two things does the Metrics Server primarily power?
- (Beginner) Does the Metrics Server store historical metrics?
- (Intermediate) Why do clusters typically run both Metrics Server and Prometheus?
- (Interview) A teammate proposes using Metrics Server data to build CPU-usage dashboards and alerts. Why won't that work, and what's the right tool? (Hint: in-memory latest values only; Prometheus for history/alerting.)
Answers
- kubectl top (node/Pod resource usage) and the autoscalers (HPA/VPA) via the Metrics API.
- No. It keeps only the latest values in memory, with no historical retention.
- They serve different purposes: Metrics Server provides lightweight, real-time resource metrics for autoscaling and kubectl top, while Prometheus provides durable historical storage, custom/application metrics, querying (PromQL), dashboards, and alerting. Neither replaces the other, so both run together.
- Metrics Server only holds the most recent CPU/memory readings in memory with no history, no long-term storage, and no alerting capability, so you can't build time-series dashboards or alert rules from it. The right tool is Prometheus, which scrapes and stores metrics over time, supports PromQL queries, integrates with Grafana for dashboards, and with Alertmanager for alerting.
Prometheus operator and ServiceMonitors
Theory
Prometheus is the de facto standard monitoring system for Kubernetes (a CNCF graduated project). It works on a pull model: it periodically scrapes HTTP /metrics endpoints exposed by targets (apps, exporters, Kubernetes components) and stores the results as time series in its own TSDB, queryable with PromQL. It discovers targets dynamically via Kubernetes service discovery.
Managing Prometheus config by hand is tedious, so the Prometheus Operator (often deployed via the kube-prometheus-stack Helm chart, which bundles Prometheus, Grafana, Alertmanager, and exporters) manages it through CRDs. The key one is the ServiceMonitor (and PodMonitor): instead of editing Prometheus's scrape config, you declare a ServiceMonitor that says "scrape Services with these labels on this port/path," and the Operator generates the config automatically. This makes monitoring declarative and self-service — teams add a ServiceMonitor alongside their app to get it scraped.
Example
# Declaratively tell Prometheus (via the Operator) to scrape this app:
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: web
  labels: { release: kube-prometheus-stack } # matched by the Prometheus selector
spec:
  selector:
    matchLabels: { app: web } # which Services to scrape
  endpoints:
    - port: metrics # the Service port exposing /metrics
      path: /metrics
      interval: 30s
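On the application side, the pull model only requires an HTTP endpoint that returns metrics in the Prometheus text exposition format. The sketch below uses only the Python standard library to show what a scrape returns; a real service would normally use an official Prometheus client library instead. The REQUESTS counter dictionary, the function names, and port 8080 are assumptions made for illustration.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical in-process counters, keyed by HTTP status code.
REQUESTS = {"200": 0, "500": 0}


def render_metrics(counts: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests handled.",
        "# TYPE http_requests_total counter",
    ]
    for status, value in sorted(counts.items()):
        lines.append(f'http_requests_total{{status="{status}"}} {value}')
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode()
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


def serve(port: int = 8080) -> None:
    """Block forever, answering scrapes on /metrics."""
    HTTPServer(("", port), MetricsHandler).serve_forever()
```

Running serve() and exposing that port through a Service port named metrics would let a ServiceMonitor like the one above pick the app up, provided the Service carries the app: web label.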
Exercises
- (Beginner) Does Prometheus use a push or pull model for collecting metrics?
- (Beginner) What query language does Prometheus use?
- (Intermediate) What does a ServiceMonitor let you do without editing Prometheus's raw config?
- (Interview) Explain how the Prometheus Operator and ServiceMonitors make monitoring declarative and self-service in a multi-team cluster. (Hint: CRDs generate scrape config; teams ship monitors with their apps.)
Answers
- Pull: Prometheus scrapes targets' /metrics endpoints on an interval.
- PromQL.
- Declare which Services/endpoints Prometheus should scrape (by label selector, port, path, interval). The Operator translates ServiceMonitors into Prometheus scrape configuration automatically, so you never hand-edit the Prometheus config.
- The Prometheus Operator watches ServiceMonitor/PodMonitor CRDs and automatically generates and reloads Prometheus's scrape configuration. Each team can ship a ServiceMonitor alongside their application declaring how to scrape it; the Operator picks it up and Prometheus starts scraping — no central config edits or Prometheus restarts. This makes monitoring declarative (config-as-Kubernetes-objects, version-controllable) and self-service (teams onboard their own metrics) while keeping a single managed Prometheus, which scales cleanly across many teams.
kube-state-metrics
Theory
There's an important distinction between two kinds of cluster metrics:
- Resource usage metrics (from Metrics Server / cAdvisor / node-exporter): how much CPU/memory is being consumed.
- Object state metrics (from kube-state-metrics): the state of Kubernetes objects themselves — e.g., how many replicas a Deployment desires vs. has available, whether a Pod is in CrashLoopBackOff, how many Jobs failed, whether a node is unschedulable, PVC phase, etc.
kube-state-metrics (KSM) is a service that listens to the Kubernetes API and exposes the status of objects as Prometheus metrics. It does not measure resource consumption — it reflects the desired vs. actual state and conditions of API objects. This is what lets you alert on "a Deployment has fewer available replicas than desired for 10 minutes" or "Pods are restarting frequently." KSM is a standard component of the kube-prometheus stack, complementing node-exporter (node/hardware metrics) and the app metrics.
Example
# Object-state metrics exposed by kube-state-metrics, queried in PromQL:
kube_deployment_status_replicas_available{deployment="web"}
kube_deployment_spec_replicas{deployment="web"}
# Alert when available < desired:
kube_deployment_status_replicas_available < kube_deployment_spec_replicas
kube_pod_container_status_restarts_total{namespace="prod"} # restart counts
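Conceptually, kube-state-metrics translates fields of API objects into metric lines. The sketch below illustrates that translation for a single Deployment, given as a plain dictionary shaped like the API object. It is a simplified model rather than the project's actual code, and the function name deployment_metrics is hypothetical.

```python
def deployment_metrics(deployment: dict) -> list:
    """Turn a Deployment's desired/available replica counts into metric lines."""
    meta = deployment["metadata"]
    labels = f'namespace="{meta["namespace"]}",deployment="{meta["name"]}"'
    desired = deployment["spec"]["replicas"]
    # availableReplicas is absent from status when nothing is available yet
    available = deployment.get("status", {}).get("availableReplicas", 0)
    return [
        f"kube_deployment_spec_replicas{{{labels}}} {desired}",
        f"kube_deployment_status_replicas_available{{{labels}}} {available}",
    ]
```

Note that nothing here measures CPU or memory: both values are read straight from the object's spec and status, which is exactly the distinction drawn above.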
Exercises
- (Beginner) What does kube-state-metrics expose, as opposed to Metrics Server?
- (Beginner) Does kube-state-metrics measure CPU/memory consumption?
- (Intermediate) Give two examples of conditions you can alert on using kube-state-metrics.
- (Interview) Clarify the roles of node-exporter, Metrics Server, and kube-state-metrics — what distinct question does each answer? (Hint: node/hardware usage; live resource usage for autoscaling; object desired-vs-actual state.)
Answers
- The state of Kubernetes API objects (e.g., Deployment desired/available replicas, Pod phase/restarts, Job success/failure, node conditions) as Prometheus metrics — not resource usage.
- No — it reports object state/conditions, not resource consumption.
- Any two: a Deployment's available replicas below desired for some duration; Pods stuck in Pending or CrashLoopBackOff; high container restart counts; failed Jobs; PVCs stuck unbound; nodes marked unschedulable/NotReady.
- node-exporter: node/host and hardware-level resource metrics (CPU, memory, disk, network of the machine). Metrics Server: current Pod/node CPU & memory usage exposed via the Metrics API for kubectl top and autoscaling (real-time, no history). kube-state-metrics: the desired-vs-actual state and conditions of Kubernetes objects (replicas, phases, restarts). Roughly: node-exporter = "how is the machine doing?", Metrics Server = "how much is this Pod using right now (for scaling)?", kube-state-metrics = "what does Kubernetes think the state of my objects is?".
Grafana dashboards for Kubernetes
Theory
Grafana is the standard visualization layer. It connects to data sources — most commonly Prometheus (metrics), and also Loki (logs) and Tempo/Jaeger (traces) — and renders dashboards of panels (graphs, gauges, tables, heatmaps) driven by queries (PromQL/LogQL). It turns raw time-series into human-readable views of cluster and application health.
Grafana's strengths for Kubernetes: a huge library of prebuilt community dashboards (import by ID) for cluster overview, node/Pod resources, workloads, and many exporters; templating/variables (e.g., a namespace or Pod dropdown) for reusable dashboards; and the ability to unify all three observability pillars (metrics, logs, traces) in one place with cross-links (click a metric spike → jump to the relevant logs/traces). The kube-prometheus-stack ships Grafana pre-configured with Prometheus and a set of Kubernetes dashboards out of the box.
Example
Grafana data sources:
Prometheus -> metrics panels (PromQL)
Loki -> logs panels (LogQL)
Tempo/Jaeger-> trace views
Dashboard variables (templating):
$namespace = label_values(kube_pod_info, namespace)
Panel query: sum(rate(container_cpu_usage_seconds_total{namespace="$namespace"}[5m])) by (pod)
# Import a community Kubernetes dashboard by its ID in the Grafana UI
# (e.g., "Kubernetes / Compute Resources / Namespace (Pods)").
Exercises
- (Beginner) What is Grafana's primary role, and name two data sources it commonly uses.
- (Beginner) How can you quickly get a Kubernetes dashboard without building it from scratch?
- (Intermediate) What do Grafana template variables enable?
- (Interview) How does Grafana help correlate across the three observability pillars, and why is that valuable during an incident? (Hint: unified UI, cross-links from a metric spike to logs/traces.)
Answers
- Visualization of observability data via dashboards/panels. Common data sources: Prometheus (metrics), Loki (logs), Tempo/Jaeger (traces).
- Import a prebuilt community dashboard by its ID (Grafana has a large library), or use the dashboards bundled with the kube-prometheus-stack.
- Reusable, parameterized dashboards: variables (e.g., namespace, workload, Pod dropdowns) let one dashboard be filtered/scoped dynamically instead of duplicating dashboards per target, and queries reference the variables.
- Grafana can display metrics, logs, and traces from their respective backends in one UI and link between them — e.g., from a latency/error spike on a metrics panel you can jump to the corresponding logs (Loki) and distributed traces (Tempo/Jaeger) for the same time window/service. During an incident this drastically shortens diagnosis: you see that something is wrong (metrics), why via detailed errors (logs), and where in the request path (traces), all correlated, instead of manually stitching together separate tools.
Alertmanager and alerting rules
Theory
Dashboards are passive; alerting makes monitoring proactive. In the Prometheus ecosystem, alerting rules are PromQL expressions evaluated continuously by Prometheus; when an expression is true for a specified duration (for:), Prometheus fires an alert. These alerts are sent to Alertmanager, a separate component that handles routing and delivery.
Alertmanager's responsibilities are what make alerting usable rather than noisy:
- Grouping: combine related alerts into one notification (e.g., all alerts from one outage).
- Routing: send different alerts to different receivers (Slack, PagerDuty, email) based on labels/severity.
- Silencing: temporarily mute alerts (e.g., during maintenance).
- Inhibition: suppress lower-priority alerts when a higher-priority related alert is firing (e.g., don't page about individual Pods when the whole node is down).
- Deduplication: avoid repeated notifications for the same alert.
With the Prometheus Operator, alerting rules are declared via PrometheusRule CRDs. Good alerting targets symptoms users feel (high error rate, latency) over noisy causes, following SRE practice.
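Two of the responsibilities listed above, inhibition and grouping, can be modelled in a few lines. This is a toy sketch of the ideas, not Alertmanager's implementation: alerts are plain dictionaries of labels, and the alert names NodeDown and PodNotReady used in the usage note are hypothetical.

```python
from collections import defaultdict


def inhibit(alerts, source_name, target_name, equal):
    """Drop target alerts that share the `equal` label values with a source alert."""
    sources = {
        tuple(a.get(label) for label in equal)
        for a in alerts
        if a["alertname"] == source_name
    }
    return [
        a
        for a in alerts
        if not (
            a["alertname"] == target_name
            and tuple(a.get(label) for label in equal) in sources
        )
    ]


def group(alerts, group_by):
    """Bundle alerts that agree on the `group_by` labels into one notification."""
    groups = defaultdict(list)
    for a in alerts:
        groups[tuple(a.get(label) for label in group_by)].append(a)
    return dict(groups)
```

For example, inhibit(alerts, "NodeDown", "PodNotReady", ["node"]) removes the per-Pod alerts for any node that already has a NodeDown alert, and group(remaining, ["alertname"]) then produces one notification per alert name instead of one per alert.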
Example
# A PrometheusRule (Operator CRD) defining an alert:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata: { name: web-alerts, labels: { release: kube-prometheus-stack } }
spec:
  groups:
    - name: web.rules
      rules:
        - alert: HighErrorRate
          expr: |
            sum(rate(http_requests_total{status=~"5..",app="web"}[5m]))
              / sum(rate(http_requests_total{app="web"}[5m])) > 0.05
          for: 10m # must hold 10 min before firing
          labels: { severity: page }
          annotations: { summary: "web 5xx error rate > 5% for 10m" }
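The effect of the for: field is easiest to see as a small state machine: an alert is inactive while the expression is false, pending once it becomes true, and firing only after it has stayed true for the whole duration. The sketch below simulates that behaviour over a series of rule evaluations. It is a simplified model under the assumption of evenly spaced evaluations, and the function name alert_states is hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    samples: list of (timestamp_seconds, value), one per rule evaluation.
    """
    states = []
    active_since = None  # when the expression last became true
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp
            held_for = timestamp - active_since
            states.append("firing" if held_for >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the clock
            states.append("inactive")
    return states
```

With a 5% threshold and for_seconds=600, a brief spike that clears within ten minutes never leaves the pending state, so no notification is sent.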
Exercises
- (Beginner) What component evaluates alerting rules, and what component handles routing/delivery?
- (Beginner) What does the for: field in an alerting rule control?
- (Intermediate) Name three things Alertmanager does to reduce alert noise.
- (Interview) Why is "inhibition" valuable, and why do SRE practices favor alerting on symptoms over causes? (Hint: suppress redundant downstream alerts; reduce noise and keep alerts actionable.)
Answers
- Prometheus evaluates the alerting rules and fires alerts; Alertmanager handles grouping, routing, and delivery to receivers.
- How long the alert expression must be continuously true before the alert actually fires — preventing flapping/transient spikes from triggering notifications.
- Any three: grouping (combine related alerts), deduplication, silencing (mute during maintenance), inhibition (suppress lower-priority alerts when a higher-priority related one fires), and routing (send to the right receiver to avoid spamming everyone).
- Inhibition suppresses redundant downstream alerts when a root-cause alert is already firing — e.g., if a node is down, you don't also want dozens of pages for each Pod on it; you get the one meaningful alert. Symptom-based alerting (high user-facing error rate, latency, unavailability) is favored because symptoms are what actually impact users and are reliably actionable, whereas alerting on every internal cause produces noise and false alarms (many causes don't affect users). Alerting on symptoms keeps pages few, meaningful, and tied to real impact, while metrics/dashboards/traces help diagnose the cause after a symptom alert fires.
10.3 Tracing
When a single request flows through many microservices, logs and metrics alone can't show you where time went or which hop failed. Distributed tracing reconstructs the request's journey. This subchapter covers it.
Distributed tracing concepts
Theory
In a microservices architecture, one user request may traverse a dozen services. If it's slow or fails, which service is responsible? Distributed tracing answers this by following a single request across all the services it touches, producing a trace.
Core concepts:
- Trace: the end-to-end journey of one request, identified by a trace ID.
- Span: a single unit of work within the trace (e.g., one service handling the request, or one DB call). Each span has a start/end time, a name, and attributes.
- Parent/child relationships: spans nest to form a tree, showing causality and how time is spent across services.
- Context propagation: the trace ID and span context are passed between services (via HTTP headers like W3C traceparent) so spans from different services link into one trace.
A trace is typically visualized as a waterfall/Gantt of spans, instantly revealing which service or call consumed the most time or errored. Tracing is the pillar that provides causality and latency breakdown across service boundaries — something neither logs nor metrics give you.
Example
Trace ID: 4bf92... (one request)
├─ span: gateway [=================] 250ms
│ ├─ span: auth-service [===] 30ms
│ └─ span: order-service [============] 200ms
│ └─ span: db query [========] 150ms <- the bottleneck
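Context propagation can be sketched with the W3C traceparent header, whose version 00 layout is version, trace ID (32 hex characters), parent span ID (16 hex characters), and flags (2 hex characters), joined by hyphens. The code below is a minimal illustration using only the standard library; in practice an OpenTelemetry SDK does this for you. The function names and the span dictionary layout are hypothetical, and validation is simplified (for example, all-zero IDs are not rejected).

```python
import re
import secrets

TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def extract(headers: dict):
    """Return (trace_id, parent_span_id, flags) from incoming headers, or None."""
    match = TRACEPARENT.match(headers.get("traceparent", ""))
    return match.groups() if match else None


def inject(trace_id: str, span_id: str, flags: str = "01") -> dict:
    """Build the header to attach to outbound calls."""
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}


def handle(headers: dict):
    """Start a span for this hop and return it with the outbound headers."""
    context = extract(headers)
    if context:
        trace_id, parent_id, flags = context  # continue the caller's trace
    else:
        trace_id, parent_id, flags = secrets.token_hex(16), None, "01"  # new root
    span_id = secrets.token_hex(8)
    span = {"trace_id": trace_id, "span_id": span_id, "parent_id": parent_id}
    return span, inject(trace_id, span_id, flags)
```

Each service keeps the trace ID unchanged and substitutes its own span ID as the parent for the next hop, which is what lets the backend assemble the tree shown above.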
Exercises
- (Beginner) What is the difference between a trace and a span?
- (Beginner) What identifier ties all spans of one request together?
- (Intermediate) What is "context propagation" and why is it essential for distributed tracing?
- (Interview) A user request is intermittently slow across a microservices system. Why is tracing better suited than logs or metrics to find the cause? (Hint: per-request causality and latency breakdown across services.)
Answers
- A trace is the complete end-to-end record of one request across all services; a span is a single unit of work within that trace (one operation/service hop), with its own timing and attributes. A trace is composed of many nested spans.
- The trace ID (shared by all spans belonging to the same request).
- Context propagation is passing the trace/span context (e.g., trace ID via headers like W3C traceparent) from one service to the next as the request flows. It's essential because without it, each service's spans would be isolated; propagation links spans across service boundaries into a single coherent trace.
- Tracing reconstructs the exact path and timing of an individual request across every service and call, so you can see precisely which hop or dependency added the latency (or failed) for the slow requests. Metrics show aggregate trends (latency is up) but not which request/path; logs show per-service events but aren't inherently correlated end-to-end across services. Tracing provides the per-request causality and latency breakdown across boundaries needed to pinpoint an intermittent, cross-service slowdown.
OpenTelemetry in Kubernetes
Theory
Historically, each tracing/metrics vendor had its own SDKs and agents, causing lock-in and fragmentation. OpenTelemetry (OTel) — a CNCF project — is the vendor-neutral standard for generating, collecting, and exporting telemetry (traces, metrics, and logs). It provides consistent SDKs/APIs for instrumenting applications and the OpenTelemetry Collector for receiving, processing, and exporting telemetry to any backend.
In Kubernetes, OTel is typically deployed via the OpenTelemetry Collector (as a DaemonSet for node-local collection and/or a Deployment for central processing), often managed by the OpenTelemetry Operator (which can also auto-inject instrumentation into Pods). The Collector receives telemetry (e.g., OTLP), processes it (batching, filtering, enrichment with Kubernetes metadata), and exports it to backends like Jaeger/Tempo (traces), Prometheus (metrics), or Loki (logs). The big win: instrument once with OTel and switch backends freely — no vendor lock-in.
Example
App (OTel SDK) --OTLP--> OpenTelemetry Collector --> exporters:
|--> Tempo / Jaeger (traces)
|--> Prometheus (metrics)
|--> Loki (logs)
# OpenTelemetry Collector pipeline (conceptual):
receivers: [ otlp ] # accept telemetry from apps (gRPC/HTTP)
processors: [ batch, k8sattributes ] # batch + enrich with pod/namespace labels
exporters: [ otlp/tempo, prometheus ]
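The receive, process, export flow can be modelled as a chain of plain functions. This toy sketch only illustrates the shape of a Collector pipeline and is not how the Collector is built: the function names are hypothetical, spans are plain dictionaries, and the enrichment step assumes a lookup table keyed by the sender's IP address.

```python
def k8s_attributes(span: dict, pod_metadata: dict) -> dict:
    """Processor: add Pod metadata for the IP that sent the span, if known."""
    return {**span, **pod_metadata.get(span.get("source_ip"), {})}


def batch(items: list, size: int) -> list:
    """Processor: split items into chunks of at most `size`."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_pipeline(received, pod_metadata, exporters, batch_size=2):
    """Enrich, batch, then fan every batch out to every exporter."""
    enriched = [k8s_attributes(span, pod_metadata) for span in received]
    for chunk in batch(enriched, batch_size):
        for export in exporters:
            export(chunk)
```

Because exporters are just entries in a list, adding or replacing a backend changes the pipeline configuration and nothing in the application that emitted the spans.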
Exercises
- (Beginner) What problem does OpenTelemetry solve regarding vendors?
- (Beginner) What are the two common ways the OpenTelemetry Collector is deployed in Kubernetes?
- (Intermediate) What does the OpenTelemetry Collector do between applications and backends?
- (Interview) Why is "instrument once with OTel, export anywhere" valuable, and how does the Collector enable backend flexibility? (Hint: avoid SDK lock-in; Collector decouples instrumentation from backend via exporters.)
Answers
- Vendor lock-in and fragmentation — instead of vendor-specific SDKs/agents, OTel provides a single vendor-neutral standard and tooling for generating and exporting telemetry to any backend.
- As a DaemonSet (node-local collection) and/or as a Deployment (centralized processing/gateway), often managed by the OpenTelemetry Operator.
- It receives telemetry (e.g., via OTLP), processes it — batching, filtering, sampling, and enriching with Kubernetes metadata (k8sattributes) — and exports it to one or more backends through configurable exporters.
- Applications are instrumented once against the OTel API/SDK and emit standard telemetry; the Collector (via swappable exporters) decides where it goes. To change or add a backend (e.g., switch from Jaeger to Tempo, or also send metrics to Prometheus), you reconfigure the Collector's exporters — no application code changes. This decouples instrumentation from backends, eliminating per-vendor SDK rewrites and lock-in while letting you fan telemetry out to multiple systems.
Jaeger and Tempo integration
Theory
Once spans are generated and collected, you need a tracing backend to store, index, and visualize traces. The two leading open-source choices:
- Jaeger (CNCF): a mature distributed tracing system with its own collectors, storage (Elasticsearch/Cassandra), and a feature-rich UI for searching traces by service/operation/tags and viewing dependency graphs. It can index trace attributes for rich querying.
- Tempo (Grafana): a tracing backend optimized for cost and simplicity — it stores traces in cheap object storage and, by design, does not index trace attributes (you look traces up by trace ID, typically discovered via correlated logs/metrics in Grafana). This makes it very cheap to run at scale and integrates tightly with the Grafana + Loki + Prometheus stack for cross-pillar correlation.
The trade-off mirrors Loki vs. Elasticsearch: Jaeger offers rich trace search at higher storage/operational cost; Tempo offers cheap, scalable trace storage with lookup driven by correlation rather than full indexing. Both are commonly fed by the OpenTelemetry Collector.
Example
| Backend | Storage | Trace search | Best with |
|---|---|---|---|
| Jaeger | Elasticsearch/Cassandra | Rich (by service/op/tags) | Standalone, detailed trace queries |
| Tempo | Object storage (cheap) | By trace ID (correlate via logs/metrics) | Grafana + Loki + Prometheus stack |
OTel Collector --(OTLP)--> Tempo --(Grafana)--> view trace by ID
(jump from a Loki log line's traceID to its trace)
Exercises
- (Beginner) What is the role of a tracing backend like Jaeger or Tempo?
- (Beginner) How does Tempo keep costs low compared to Jaeger?
- (Intermediate) If Tempo doesn't index trace attributes, how do you typically find the trace you want?
- (Interview) Compare Jaeger and Tempo and explain when each is the better fit. (Hint: rich search vs. cheap scalable storage + Grafana correlation.)
Answers
- To store, index/organize, and visualize traces (spans) so engineers can search and inspect request flows and latency.
- Tempo stores traces in inexpensive object storage and avoids indexing trace attributes (only trace-ID lookup), eliminating the costly full indexing and heavy storage backends (Elasticsearch/Cassandra) that Jaeger typically uses.
- By trace ID — usually discovered through correlation: you find a relevant log line (Loki) or exemplar metric in Grafana that carries the trace ID, then jump to that trace in Tempo. (Tempo also offers TraceQL for some search in newer versions, but the core model is lookup/correlation by ID.)
- Jaeger is the better fit when you need rich, standalone trace search (by service, operation, tags) and dependency analysis, and can afford its heavier storage/operations. Tempo is better when you want low-cost, highly scalable trace storage and you operate within the Grafana ecosystem (Loki + Prometheus), relying on cross-pillar correlation (jump from logs/metrics to a trace by ID) rather than full attribute indexing. Choose Jaeger for query richness, Tempo for cost/scale and unified Grafana correlation.
Instrumenting applications for tracing
Theory
Traces don't appear by magic — applications must be instrumented to create spans and propagate context. There are two approaches:
- Automatic instrumentation: language-specific agents/libraries (OpenTelemetry auto-instrumentation, or the OTel Operator's auto-injection) hook into common frameworks (HTTP servers/clients, gRPC, DB drivers) to generate spans and propagate context without code changes. Quick to adopt; captures framework-level spans.
- Manual instrumentation: developers use the OTel SDK to create custom spans around important business logic, add attributes (user ID, order value), and record events/errors. More effort, but captures domain-specific detail auto-instrumentation can't.
The essential requirement either way is context propagation: incoming trace context must be extracted from request headers and injected into outbound calls, so spans link across services. Best practice is to combine both — auto-instrumentation for broad coverage of framework calls, plus targeted manual spans for critical custom operations. In Kubernetes, the OTel Operator can auto-inject instrumentation by annotating Pods.
Example
# Manual instrumentation with the OpenTelemetry SDK (Python):
from opentelemetry import trace

tracer = trace.get_tracer(__name__)

def process_order(order):
    with tracer.start_as_current_span("process_order") as span:  # custom span
        span.set_attribute("order.id", order.id)  # domain attribute
        span.set_attribute("order.value", order.total)
        charge_payment(order)  # downstream call inherits context automatically

# OTel Operator auto-injection: annotate the Pod, no code change
metadata:
  annotations:
    instrumentation.opentelemetry.io/inject-python: "true"
Exercises
- (Beginner) What are the two approaches to instrumenting an application for tracing?
- (Beginner) What must happen with trace context for spans to link across services?
- (Intermediate) What can manual instrumentation capture that automatic instrumentation typically cannot?
- (Interview) Why is combining automatic and manual instrumentation considered best practice? (Hint: broad framework coverage plus business-specific spans/attributes.)
Answers
- Automatic instrumentation (agents/libraries that instrument frameworks without code changes) and manual instrumentation (developers add spans/attributes via the SDK).
- Trace context must be propagated — extracted from incoming request headers and injected into outbound requests — so child spans in downstream services attach to the same trace.
- Domain/business-specific detail: custom spans around particular logic, meaningful attributes (e.g., order ID, customer tier, cache hit/miss), and explicit error/event recording — context that generic framework auto-instrumentation doesn't know about.
- Automatic instrumentation gives broad, low-effort coverage of common framework operations (HTTP, gRPC, DB calls) and handles context propagation, so you get end-to-end traces quickly. Manual instrumentation adds the business-critical spans and attributes that make traces actionable for your domain (where the time/errors matter to the application's logic). Combining them yields both comprehensive coverage and meaningful, domain-rich detail — more complete and useful traces than either alone.
10.4 Health Probes
Kubernetes needs to know whether a container is alive, ready for traffic, and finished starting. Probes provide these signals and drive self-healing and traffic routing. This subchapter covers them.
Liveness probes
Theory
A liveness probe answers: "is this container still healthy, or is it stuck/deadlocked?" The kubelet periodically runs the probe; if it fails repeatedly (beyond failureThreshold), the kubelet restarts the container. This provides automatic recovery from states where the process is running but no longer functioning — a deadlock, an infinite loop, an exhausted thread pool — situations a simple process-alive check wouldn't catch.
The critical caution: a misconfigured liveness probe is dangerous. If it's too aggressive (short timeout/threshold) or checks something that's slow under load, it can restart healthy-but-busy containers, turning a load spike into a restart storm and making outages worse. Liveness probes should check only that the process itself is fundamentally healthy — not its dependencies (don't fail liveness because a database is down; that just restarts a container that can't fix the DB). Tunable fields: initialDelaySeconds, periodSeconds, timeoutSeconds, failureThreshold.
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      livenessProbe:
        httpGet: { path: /healthz, port: 8080 } # "am I fundamentally alive?"
        initialDelaySeconds: 10
        periodSeconds: 10
        timeoutSeconds: 1
        failureThreshold: 3 # 3 consecutive failures -> restart container
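The interaction between probe results and failureThreshold can be simulated to see why isolated failures do not cause restarts. The sketch below is a simplified model of the counting logic only; it ignores initialDelaySeconds and timing, and the function name restarts is hypothetical.

```python
def restarts(probe_results, failure_threshold=3):
    """Count restarts for a sequence of probe outcomes (True = success).

    Each element represents one probe, taken once per periodSeconds.
    """
    consecutive_failures = 0
    restart_count = 0
    for succeeded in probe_results:
        if succeeded:
            consecutive_failures = 0  # any success resets the counter
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restart_count += 1
            consecutive_failures = 0  # the restarted container starts clean
    return restart_count
```

With the settings in the example above (periodSeconds: 10, failureThreshold: 3), a container has to fail for roughly 30 seconds in a row before it is restarted, while a pattern of two failures followed by a success never triggers a restart.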
Exercises
- (Beginner) What action does the kubelet take when a liveness probe fails repeatedly?
- (Beginner) What kind of failure does a liveness probe catch that a process-alive check does not?
- (Intermediate) Why should a liveness probe avoid checking external dependencies like a database?
- (Interview) How can a misconfigured liveness probe make an outage worse, and what guidelines prevent this? (Hint: restart storms under load; check process health only, sane thresholds.)
Answers
- It restarts the container.
- A process that is running but stuck/non-functional — e.g., deadlocked, in an infinite loop, or otherwise unresponsive — which still appears "alive" as a process but cannot do work. The liveness probe detects functional failure and triggers a restart.
- Because failing liveness restarts the container, but restarting won't fix an external dependency (e.g., a down database). It would needlessly restart an otherwise-healthy container, and if many Pods do this simultaneously it amplifies the problem. Dependency health belongs in readiness (stop sending traffic) not liveness (restart).
- If the liveness probe is too sensitive (short timeout/period/threshold) or checks something that slows under load, then during a traffic spike healthy-but-busy containers fail the probe and get restarted, dropping capacity exactly when it's needed and causing cascading restart storms that deepen the outage. Guidelines: probe only fundamental process health (a lightweight endpoint), never external dependencies; set generous timeouts and failure thresholds; use initialDelaySeconds/startup probes to avoid premature failures; and keep the check cheap and fast.
Readiness probes
Theory
A readiness probe answers a different question: "is this container ready to receive traffic right now?" If it fails, the kubelet does not restart the container — instead, the Pod is removed from the Service's endpoints, so no traffic is routed to it until it passes again. This is the mechanism that prevents requests from hitting a Pod that's still warming up, temporarily overloaded, or waiting on a dependency.
This is the crucial distinction from liveness: liveness failure = restart; readiness failure = stop sending traffic (no restart). Readiness is where checking dependencies is appropriate: if a Pod can't serve correctly because a needed backend is unavailable, it should report not-ready so it is pulled from rotation, and it returns to rotation automatically once the check passes again. Readiness probes also make rolling updates safe: new Pods receive traffic only once they are ready, which is what makes zero-downtime deployments possible. Readiness probes accept the same tunable fields as liveness probes.
Example
spec:
  containers:
    - name: app
      image: myapp:1.0
      readinessProbe:
        httpGet: { path: /ready, port: 8080 } # "can I serve traffic now?"
        periodSeconds: 5
        failureThreshold: 3 # 3 failures -> removed from Service endpoints
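On the application side, the liveness/readiness split usually shows up as two separate endpoints. The sketch below is one possible shape using only the Python standard library; the paths match the manifests above, while `database_reachable` and `serve` are hypothetical placeholders rather than part of any framework.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer


def liveness_status() -> int:
    """Liveness only says 'the process can still answer'; it ignores dependencies."""
    return 200


def readiness_status(dependencies_ok: bool) -> int:
    """Readiness reports 503 while a required dependency is unavailable."""
    return 200 if dependencies_ok else 503


def database_reachable() -> bool:
    """Hypothetical dependency check; replace with a real, cheap check."""
    return True


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            status = liveness_status()
        elif self.path == "/ready":
            status = readiness_status(database_reachable())
        else:
            status = 404
        self.send_response(status)
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the health server; blocks until the process is stopped."""
    HTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling `serve()` exposes both endpoints on port 8080. When the dependency check fails, /ready returns 503 (outside the 200-399 success range) so the Pod is taken out of rotation, while /healthz keeps returning 200 and no restart is triggered.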
Exercises
- (Beginner) What happens to a Pod when its readiness probe fails?
- (Beginner) Does a readiness probe restart the container? Contrast with liveness.
- (Intermediate) Why is it appropriate for a readiness probe (but not liveness) to check dependencies?
- (Interview) How do readiness probes enable zero-downtime rolling updates? (Hint: new Pods get traffic only when ready; old Pods stay until then.)
Answers
- It is removed from the Service's endpoints (EndpointSlices), so no traffic is routed to it until it becomes ready again; the container keeps running.
- No — readiness failure does not restart the container (it only gates traffic). Liveness failure does restart the container. Liveness = restart; readiness = remove from traffic.
- Because the response to "I can't serve correctly because a dependency is down" should be to stop receiving traffic (so requests go to healthy Pods or fail fast), not to restart the container — restarting wouldn't fix the dependency. Readiness gates traffic and automatically restores it when the dependency recovers, which is the correct behavior; liveness restarting on dependency failure would be harmful.
- During a rolling update, new Pods are started but only added to the Service's endpoints once their readiness probe passes, while old Pods continue serving until then (and are removed only as new ones become ready, subject to surge/unavailable settings). This guarantees traffic is always routed to Pods that are actually ready to handle it, so users experience no errors or downtime as the new version rolls out.
Startup probes
Theory
Some applications take a long and variable time to start — large JVM apps, services loading big caches, legacy apps. This creates a dilemma for liveness probes: set initialDelaySeconds long enough for the slow start, and you delay detecting real deadlocks for fast-starting cases; set it short, and the liveness probe kills the container before it finishes starting. The startup probe resolves this.
A startup probe runs first; until it succeeds, the kubelet disables the liveness and readiness probes so they can't act prematurely. Only once the startup probe succeeds do liveness and readiness take over. You give the startup probe a generous total budget (failureThreshold × periodSeconds) to allow even worst-case startup; if the probe still has not succeeded when that budget runs out, the kubelet kills the container and it is restarted according to the Pod's restartPolicy. This lets you keep liveness probes aggressive (fast deadlock detection) for the running state while safely accommodating slow startup.
Example
spec:
  containers:
    - name: legacy-app
      image: legacy:1.0
      startupProbe:
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 10
        failureThreshold: 30 # allow up to 30*10s = 5 minutes to start
      livenessProbe: # only active AFTER startup succeeds; can be tight
        httpGet: { path: /healthz, port: 8080 }
        periodSeconds: 5
        failureThreshold: 3
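The budget arithmetic and the hand-off between probes can be written down in a few lines. This is a sketch for checking your own numbers; the function names are hypothetical and the model deliberately ignores timeoutSeconds.

```python
from typing import List


def startup_budget_seconds(period_seconds: int, failure_threshold: int,
                           initial_delay_seconds: int = 0) -> int:
    """Total time a container gets to start before the startup probe gives up."""
    return initial_delay_seconds + period_seconds * failure_threshold


def active_probes(startup_succeeded: bool) -> List[str]:
    """Which probes the kubelet runs, given whether the startup probe has passed."""
    if startup_succeeded:
        return ["liveness", "readiness"]
    return ["startup"]
```

For the manifest above, `startup_budget_seconds(10, 30)` gives 300 seconds, while the liveness probe (5 s period, threshold 3) only starts counting once the startup probe has passed.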
Exercises
- (Beginner) What problem does a startup probe solve?
- (Beginner) What happens to the liveness and readiness probes while the startup probe is still failing?
- (Intermediate) How do you give a startup probe a 5-minute total grace period?
- (Interview) Why does using a startup probe let you keep liveness probes aggressive, and why is that desirable? (Hint: separate slow-start tolerance from steady-state deadlock detection.)
Answers
- It handles applications with long/variable startup times, preventing the liveness probe from killing a container before it has finished starting (without having to weaken liveness for the running state).
- They are disabled (held off) — the kubelet does not run liveness/readiness checks until the startup probe succeeds, so they can't prematurely restart or mark the Pod not-ready during startup.
- Set periodSeconds and failureThreshold so their product is 5 minutes, e.g., periodSeconds: 10 and failureThreshold: 30 (10s × 30 = 300s = 5 minutes).
- The startup probe absorbs all the slow-start tolerance, so the liveness probe doesn't need a long initialDelaySeconds or lenient thresholds to accommodate startup. Once startup completes, liveness can use short periods/timeouts and low failure thresholds to detect deadlocks quickly. This is desirable because fast deadlock detection improves recovery time in steady state, while slow startup is handled separately, so you no longer have to trade off one against the other.
Probe types: HTTP, TCP, exec, gRPC
Theory
All three probes (liveness, readiness, startup) can use any of four mechanisms to perform their check:
- httpGet: the kubelet makes an HTTP GET to a path/port; a response code 200–399 is success. The most common for web services (use a dedicated lightweight health endpoint).
- tcpSocket: success if the kubelet can open a TCP connection to the port. Useful for non-HTTP services (databases, message brokers) where "port accepts connections" is a reasonable health signal.
- exec: the kubelet runs a command inside the container; exit code 0 is success. Flexible (any custom check) but heavier (spawns a process each time) and requires the binary to exist in the image.
- grpc: the kubelet uses the standard gRPC health checking protocol (stable since v1.27); ideal for gRPC services, replacing the old workaround of shipping a grpc-health-probe binary with an exec probe.
Choose the lightest mechanism that meaningfully reflects health: prefer httpGet/grpc for those protocols, tcpSocket for simple connectivity, and exec only when nothing else fits.
Example
livenessProbe: # HTTP
  httpGet: { path: /healthz, port: 8080 }
---
readinessProbe: # TCP connectivity
  tcpSocket: { port: 5432 }
---
livenessProbe: # exec a command (exit 0 = healthy)
  exec: { command: ["sh", "-c", "pg_isready -U postgres"] }
---
livenessProbe: # native gRPC health check
  grpc: { port: 50051 }
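The success criteria for the first three mechanisms can be restated as small predicates. The following is an approximation written for illustration, not the kubelet's implementation; the function names are hypothetical, and the gRPC case is omitted because it would need a gRPC client library.

```python
import socket
import subprocess
from typing import Sequence


def http_probe_succeeded(status_code: int) -> bool:
    """httpGet: any status from 200 up to and including 399 counts as success."""
    return 200 <= status_code < 400


def tcp_probe_succeeded(host: str, port: int, timeout_seconds: float = 1.0) -> bool:
    """tcpSocket: success means a TCP connection could be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout_seconds):
            return True
    except OSError:
        return False


def exec_probe_succeeded(command: Sequence[str], timeout_seconds: float = 1.0) -> bool:
    """exec: success means the command exited with code 0 within the timeout."""
    try:
        completed = subprocess.run(
            list(command),
            timeout=timeout_seconds,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    except (subprocess.TimeoutExpired, OSError):
        return False
    return completed.returncode == 0
```

Note how the exec variant spawns a process on every check and fails outright if the binary is missing, which is the cost the theory section warns about.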
Exercises
- (Beginner) What HTTP status code range counts as success for an httpGet probe?
- (Beginner) What defines success for an exec probe?
- (Intermediate) When would you choose a tcpSocket probe over an httpGet probe?
- (Interview) Before native gRPC probes, how did teams health-check gRPC services, and why is the built-in grpc probe an improvement? (Hint: grpc-health-probe binary via exec; simpler, lighter, standard.)
Answers
- 200–399 (a 2xx/3xx response is considered success).
- The executed command returning exit code 0 (non-zero = failure).
- For non-HTTP services where you only need to verify the port is open/accepting connections — e.g., a database (PostgreSQL on 5432), a message broker, or any TCP service that doesn't expose an HTTP health endpoint. A successful TCP connection is the health signal.
- Teams used an exec probe running a separate grpc-health-probe binary that had to be baked into the image, which added image bloat, spawned a process per check, and required maintaining the extra tool. The native grpc probe lets the kubelet speak the standard gRPC health-checking protocol directly (no extra binary, no process spawn, simpler to configure, standardized), making gRPC health checks first-class.
10.5 Events and Auditing
Beyond logs/metrics/traces, Kubernetes emits events about what's happening to objects, and can audit every API request. This subchapter covers these Kubernetes-native signals.
Kubernetes events and their lifecycle
Theory
Events are Kubernetes objects that record what is happening to other objects — they're the cluster's narration of its own actions. When the scheduler can't place a Pod, when an image fails to pull, when a probe fails, when a Pod is killed (OOM), when a Deployment scales — an Event is created describing it (with a reason, message, involved object, type, and count). Events are the first place to look when debugging "why is my Pod not working?" via kubectl describe (which shows an object's recent events) or kubectl get events.
The critical caveat: events are ephemeral. By default they are stored in etcd with a TTL of ~1 hour and then deleted, to avoid bloating the datastore. This means events are great for immediate debugging but disappear quickly — you can't investigate yesterday's incident from events alone. For retention, you export events to a logging/monitoring system (e.g., via an event exporter to Loki/Elasticsearch). Events are also rate-limited/aggregated (repeated events increment a count rather than creating many objects).
Example
kubectl describe pod web-abc # shows the Pod's recent Events at the bottom
kubectl get events -n prod --sort-by=.lastTimestamp # events in namespace prod, newest last
kubectl get events --field-selector involvedObject.name=web-abc,type=Warning
# LAST SEEN   TYPE      REASON    OBJECT        MESSAGE
# 2m          Warning   BackOff   pod/web-abc   Back-off restarting failed container
# 5m          Warning   Failed    pod/web-abc   Failed to pull image "myapp:typo"
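When the table output is not enough, the same data is available as JSON. The sketch below filters Warning events from the structure that `kubectl get events -o json` returns; `warning_events` and the `SAMPLE` data are made up for illustration, and only fields of the core Event object (type, reason, message, count, lastTimestamp, involvedObject) are used.

```python
from typing import Any, Dict, List


def warning_events(event_list: Dict[str, Any]) -> List[str]:
    """Return one summary line per Warning event, oldest first."""
    warnings = [e for e in event_list.get("items", []) if e.get("type") == "Warning"]
    # RFC 3339 UTC timestamps sort correctly as plain strings; missing ones sort first.
    warnings.sort(key=lambda e: e.get("lastTimestamp") or "")
    lines = []
    for event in warnings:
        obj = event["involvedObject"]
        lines.append(
            f'{obj["kind"]}/{obj["name"]} {event["reason"]} '
            f'x{event.get("count", 1)}: {event["message"]}'
        )
    return lines


# Hypothetical sample shaped like `kubectl get events -o json` output.
SAMPLE = {
    "items": [
        {"type": "Normal", "reason": "Scheduled", "count": 1,
         "message": "Successfully assigned prod/web-abc to node-1",
         "lastTimestamp": "2024-01-01T10:00:00Z",
         "involvedObject": {"kind": "Pod", "name": "web-abc"}},
        {"type": "Warning", "reason": "BackOff", "count": 7,
         "message": "Back-off restarting failed container",
         "lastTimestamp": "2024-01-01T10:08:00Z",
         "involvedObject": {"kind": "Pod", "name": "web-abc"}},
        {"type": "Warning", "reason": "Failed", "count": 2,
         "message": 'Failed to pull image "myapp:typo"',
         "lastTimestamp": "2024-01-01T10:05:00Z",
         "involvedObject": {"kind": "Pod", "name": "web-abc"}},
    ]
}
```

The `count` field is where aggregation becomes visible: the BackOff entry is a single object that has been seen seven times, not seven separate events. In practice you would feed the function with `json.load(sys.stdin)` from a piped `kubectl get events -n prod -o json`.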
Exercises
- (Beginner) What do Kubernetes events record?
- (Beginner) Roughly how long are events retained by default, and why the limit?
- (Intermediate) Where do you most commonly view an object's recent events when debugging?
- (Interview) Given events' short TTL, how do you make them available for investigating past incidents, and why isn't this built in? (Hint: export to log/monitoring backend; etcd bloat avoidance.)
Answers
- What is happening to cluster objects — significant occurrences like scheduling decisions, image pull results, probe failures, restarts/OOM kills, scaling actions — with a reason, message, type (Normal/Warning), and the involved object.
- About one hour (default TTL), after which they're deleted. The limit exists to prevent events from accumulating and bloating etcd (and overloading the API/watchers).
- Via kubectl describe <kind> <name>, which lists the object's recent Events (also kubectl get events, optionally filtered with field selectors).
- Export events to a durable backend, e.g., run an event exporter/watcher that streams events to a logging/monitoring system (Loki, Elasticsearch, a SIEM) where they're retained and searchable. It isn't built in because keeping all events indefinitely in etcd would bloat the cluster datastore and degrade performance; events are designed to be short-lived, real-time signals, with long-term retention delegated to external systems suited for storage and querying.
Audit policy configuration
Theory
For security and compliance, you need a record of who did what in the cluster — every request to the API server. Auditing provides this: when enabled, the API server logs requests according to an audit policy that you define. Since logging everything in full detail would be overwhelming and storage-heavy, the policy lets you control the level of detail recorded per request type.
The audit levels (per matched rule) are:
- None: don't log these requests.
- Metadata: log request metadata (who, when, verb, resource) but not request/response bodies.
- Request: log metadata + the request body.
- RequestResponse: log metadata + request and response bodies (most verbose).
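A policy is an ordered list of rules matched by users, groups, resources, verbs, and namespaces. Rules are evaluated top to bottom and the first rule that matches a request decides its level, so order matters: put narrow rules (noise to drop, sensitive resources) before broad catch-alls. This lets you, for example, log Secret access at Metadata, mutating writes at RequestResponse, and ignore noisy read-only/system traffic. Each request also passes through stages (RequestReceived, ResponseStarted, ResponseComplete, Panic), and a policy can skip stages it does not need with omitStages. Tuning the policy balances forensic value against volume.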
Example
apiVersion: audit.k8s.io/v1
kind: Policy
rules:
  - level: None # ignore noisy health checks
    nonResourceURLs: ["/healthz*", "/readyz*"]
  - level: Metadata # secrets: log access, but not contents
    resources: [ { group: "", resources: ["secrets"] } ]
  - level: RequestResponse # log full detail for writes to deployments
    verbs: ["create","update","patch","delete"]
    resources: [ { group: "apps", resources: ["deployments"] } ]
  - level: Metadata # default: metadata for everything else
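Because rules are matched in order, it can help to trace a few requests through the policy by hand. The sketch below is a deliberately simplified model of that evaluation, not the API server's matcher: it handles only verbs, resources, and nonResourceURLs, uses `fnmatchcase` as a stand-in for trailing-wildcard URL matching, and assumes that a request matching no rule is not logged. `POLICY_RULES` mirrors the policy above; the request dictionaries are a hypothetical shape chosen for the example.

```python
from fnmatch import fnmatchcase
from typing import Any, Dict, List

POLICY_RULES: List[Dict[str, Any]] = [
    {"level": "None", "nonResourceURLs": ["/healthz*", "/readyz*"]},
    {"level": "Metadata", "resources": [{"group": "", "resources": ["secrets"]}]},
    {"level": "RequestResponse",
     "verbs": ["create", "update", "patch", "delete"],
     "resources": [{"group": "apps", "resources": ["deployments"]}]},
    {"level": "Metadata"},  # catch-all default
]


def rule_matches(rule: Dict[str, Any], request: Dict[str, str]) -> bool:
    """Simplified matcher: every constraint present on the rule must hold."""
    if "verbs" in rule and request.get("verb") not in rule["verbs"]:
        return False
    is_resource_request = "resource" in request
    if "nonResourceURLs" in rule:
        if is_resource_request:
            return False
        path = request.get("path", "")
        return any(fnmatchcase(path, pattern) for pattern in rule["nonResourceURLs"])
    if "resources" in rule:
        if not is_resource_request:
            return False
        return any(
            request.get("group", "") == entry["group"]
            and request["resource"] in entry["resources"]
            for entry in rule["resources"]
        )
    return True


def audit_level(rules: List[Dict[str, Any]], request: Dict[str, str]) -> str:
    """Return the level of the first matching rule (first match wins)."""
    for rule in rules:
        if rule_matches(rule, request):
            return rule["level"]
    return "None"  # assumption: unmatched requests are not logged
```

Tracing a read of a Deployment shows why the catch-all belongs last: it skips the health-check rule, the Secrets rule, and the write-only Deployments rule, then lands on the final Metadata default. Moving that default to the top would shadow every rule below it.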
Exercises
- (Beginner) What does Kubernetes auditing record, and which component produces it?
- (Beginner) Name the four audit levels from least to most detailed.
- (Intermediate) Why would you log Secret access at Metadata rather than RequestResponse?
- (Interview) How does an audit policy balance forensic value against log volume, and give an example rule ordering. (Hint: per-rule levels; ignore noise, detail sensitive writes.)
Answers
- It records API requests to the kube-apiserver — who made the request, when, what verb/resource, from where, and the outcome (and optionally bodies) — providing a "who did what" trail. The kube-apiserver produces the audit log according to the configured policy.
- None, Metadata, Request, RequestResponse.
- Logging Secret access at RequestResponse would write the secret values themselves into the audit log, a serious exposure. Metadata records that someone accessed a Secret (who/when/which) without capturing its contents, giving the security signal you want without leaking the sensitive data.
- Each rule sets a level for matched requests, so you record high detail only where it matters and little/nothing where it doesn't. A typical ordering: first None for noisy/irrelevant traffic (health/readiness endpoints, certain system reads); then Metadata for sensitive resources like Secrets (access without contents); then RequestResponse for mutating writes to important resources (full forensic detail on changes); and a catch-all Metadata default at the end. This captures actionable, security-relevant detail while keeping overall log volume manageable.
Audit log backends
Theory
Once the API server generates audit events per the policy, they must go somewhere. Kubernetes supports two audit backends:
- Log backend: writes audit events to a file on the control-plane node (JSON lines), with rotation options (max age/size/backups). Simple and self-contained; a logging agent then ships the file to central storage. This is the common approach for self-managed clusters.
- Webhook backend: sends audit events to an external HTTP endpoint (a SIEM, a log aggregator, a security platform) in near real time. Good for centralized, tamper-resistant collection and immediate analysis, at the cost of a network dependency.
In managed clusters, the provider integrates auditing with its cloud logging (e.g., EKS → CloudWatch, GKE → Cloud Logging, AKS → Azure Monitor), so you consume audit logs there. Regardless of backend, the goal is to land audit events in a durable, secure, searchable store — ideally isolated so that even a cluster compromise can't erase the evidence. Audit logs are key inputs for incident response, compliance, and detecting suspicious API activity.
Example
# API server flags (self-managed):
--audit-policy-file=/etc/kubernetes/audit-policy.yaml
--audit-log-path=/var/log/kubernetes/audit.log # log backend (file)
--audit-log-maxage=30 --audit-log-maxbackup=10 --audit-log-maxsize=100
# OR webhook backend:
--audit-webhook-config-file=/etc/kubernetes/audit-webhook.yaml
# Managed clusters surface audit logs in cloud logging, e.g. GKE:
gcloud logging read 'resource.type="k8s_cluster" protoPayload.methodName:"delete"'
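With the log backend, each line of the audit file is one JSON-encoded audit event, so simple questions can be answered with a short script before the data ever reaches a SIEM. The sketch below lists completed delete requests; `find_deletions` and `SAMPLE_LOG` are hypothetical, and the field names used (verb, stage, user.username, objectRef, stageTimestamp) are those of the audit.k8s.io/v1 Event type.

```python
import json
from typing import Iterable, List


def find_deletions(lines: Iterable[str]) -> List[str]:
    """Summarize completed delete requests found in JSON-lines audit output."""
    summaries = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        # Count each request once by looking only at its final stage.
        if event.get("verb") != "delete" or event.get("stage") != "ResponseComplete":
            continue
        ref = event.get("objectRef", {})
        user = event.get("user", {}).get("username", "unknown")
        target = f'{ref.get("resource", "?")} {ref.get("namespace", "-")}/{ref.get("name", "?")}'
        summaries.append(f'{event.get("stageTimestamp", "")} {user} deleted {target}')
    return summaries


# Hypothetical sample lines, built with json.dumps to keep them valid JSON.
SAMPLE_LOG = [json.dumps(event) for event in (
    {"kind": "Event", "level": "Metadata", "stage": "ResponseComplete", "verb": "get",
     "user": {"username": "bob"}, "stageTimestamp": "2024-01-01T10:00:00Z",
     "objectRef": {"resource": "pods", "namespace": "prod", "name": "web-abc"}},
    {"kind": "Event", "level": "Metadata", "stage": "RequestReceived", "verb": "delete",
     "user": {"username": "alice"}, "stageTimestamp": "2024-01-01T10:01:00Z",
     "objectRef": {"resource": "deployments", "namespace": "prod", "name": "web"}},
    {"kind": "Event", "level": "Metadata", "stage": "ResponseComplete", "verb": "delete",
     "user": {"username": "alice"}, "stageTimestamp": "2024-01-01T10:01:01Z",
     "objectRef": {"resource": "deployments", "namespace": "prod", "name": "web"}},
)]
```

Against a real file you would pass an open file handle, for example `find_deletions(open("/var/log/kubernetes/audit.log"))` with the path from the flags above.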
Exercises
- (Beginner) What are the two audit backends Kubernetes supports?
- (Beginner) Where do managed Kubernetes services typically deliver audit logs?
- (Intermediate) What is an advantage of the webhook backend over the log (file) backend?
- (Interview) Why should audit logs be stored in a durable, isolated location, and how does that aid incident response? (Hint: tamper-resistance even if the cluster is compromised; forensic integrity.)
Answers
- The log backend (writes to a file on the control-plane node) and the webhook backend (sends events to an external HTTP endpoint).
- In the cloud provider's logging service (e.g., AWS CloudWatch for EKS, Google Cloud Logging for GKE, Azure Monitor for AKS).
- The webhook backend ships audit events to an external system (SIEM/aggregator) in near real time, enabling centralized, immediate analysis/alerting and storage off the node (more tamper-resistant), rather than sitting in a local file that depends on a separate shipping step and is more exposed if the node is compromised.
- If audit logs are durable and isolated from the cluster (e.g., in a separate account/SIEM with restricted, append-only access), an attacker who compromises the cluster cannot delete or alter them to cover their tracks — preserving forensic integrity. For incident response this means you retain a trustworthy record of every API action to reconstruct what happened, when, and by whom, supporting investigation, scoping the breach, and meeting compliance/legal requirements.
Tools: kubeaudit, kube-bench
Theory
Beyond runtime auditing, you should proactively assess your cluster's security posture against best practices. Two widely used open-source scanners:
- kube-bench (Aqua Security): checks a cluster against the CIS Kubernetes Benchmark — a comprehensive set of security configuration recommendations for the control plane, etcd, kubelet, and policies. It reports which checks pass/fail/warn with remediation guidance, helping you harden node and component configuration (file permissions, API server flags, etc.). Often run as a Job/DaemonSet on the nodes.
- kubeaudit (Shopify): audits your workload manifests / running workloads for security issues — containers running as root, missing securityContext settings, privileged Pods, missing resource limits, allowPrivilegeEscalation, etc. It can run against manifests (in CI) or a live cluster and suggests fixes.
Roughly: kube-bench assesses cluster/infrastructure configuration against CIS, while kubeaudit assesses workload security settings. Both complement the runtime detection (Falco) and policy enforcement (Kyverno/Gatekeeper) from Chapter 9, forming a layered, continuous security-assessment practice.
Example
# kube-bench: check the cluster against the CIS Benchmark
kube-bench run --targets master,node
# [PASS] 1.2.1 Ensure --anonymous-auth is set to false
# [FAIL] 1.2.6 Ensure --kubelet-certificate-authority is set ... (remediation: ...)
# kubeaudit: audit workloads for insecure settings
kubeaudit all -f deployment.yaml
# ERROR RunAsNonRoot: runAsNonRoot is not set, root user allowed [container: app]
# ERROR Capabilities: capability not dropped: ALL [container: app]
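To make the workload-level checks concrete, here is a toy version of the kind of rule such a scanner applies to a container spec. It is not kubeaudit's implementation and covers far less: `audit_container` is a hypothetical helper that inspects only the container's own securityContext and resources, and it ignores Pod-level settings that a real tool would also consider.

```python
from typing import Any, Dict, List


def audit_container(container: Dict[str, Any]) -> List[str]:
    """Return human-readable findings for one container spec (toy checks only)."""
    findings = []
    security = container.get("securityContext", {})
    if security.get("runAsNonRoot") is not True:
        findings.append("runAsNonRoot is not set to true")
    if security.get("allowPrivilegeEscalation") is not False:
        findings.append("allowPrivilegeEscalation is not set to false")
    if security.get("privileged") is True:
        findings.append("container is privileged")
    if "ALL" not in security.get("capabilities", {}).get("drop", []):
        findings.append("capability ALL is not dropped")
    if not container.get("resources", {}).get("limits"):
        findings.append("resource limits are missing")
    return findings


# Hypothetical container specs, as they would appear after parsing a manifest.
INSECURE = {"name": "app", "image": "myapp:1.0"}
HARDENED = {
    "name": "app",
    "image": "myapp:1.0",
    "securityContext": {
        "runAsNonRoot": True,
        "allowPrivilegeEscalation": False,
        "capabilities": {"drop": ["ALL"]},
    },
    "resources": {"limits": {"cpu": "500m", "memory": "256Mi"}},
}
```

Running this kind of check in CI against rendered manifests gives fast feedback, while the real scanners remain the authority on what counts as a finding.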
Exercises
- (Beginner) What standard does kube-bench check a cluster against?
- (Beginner) What kind of issues does kubeaudit look for?
- (Intermediate) Summarize the difference in focus between kube-bench and kubeaudit.
- (Interview) How do kube-bench and kubeaudit fit alongside Falco and Kyverno/Gatekeeper to form a layered security practice? (Hint: posture assessment vs. runtime detection vs. admission enforcement.)
Answers
- The CIS Kubernetes Benchmark.
- Insecure workload/manifest security settings — e.g., containers running as root, missing/weak securityContext (runAsNonRoot, capabilities not dropped, allowPrivilegeEscalation), privileged Pods, missing resource limits.
- kube-bench assesses cluster/infrastructure configuration (control plane, etcd, kubelet, node settings) against the CIS Benchmark; kubeaudit assesses workload security configurations (Pod/container securityContext and related settings). One targets how the cluster is set up; the other targets how workloads are configured.
- They cover different layers and phases: kube-bench/kubeaudit provide proactive posture assessment (is the cluster/workload configured securely against best practices?), useful in CI and periodic scans. Kyverno/Gatekeeper enforce policy at admission (prevent insecure resources from being created). Falco provides runtime detection (catch malicious behavior after deployment). Together: assess and remediate configuration before/continuously (bench/audit), block non-compliant deployments (policy engines), and detect attacks in progress (Falco) — defense in depth across the build/deploy/run lifecycle.
11. Helm and Package Management
Raw YAML manifests don't scale: you end up copying near-identical files per environment, with no versioning, templating, or easy rollback. Helm is the de facto package manager for Kubernetes, bundling related manifests into versioned, parameterized charts. Kustomize offers a template-free alternative. This chapter covers both, from consuming charts to authoring and publishing your own.
11.1 Helm Fundamentals
This subchapter introduces Helm's architecture and core vocabulary — charts, releases, repositories — and how it compares to Kustomize.
Helm architecture and components
Theory
Helm is "the package manager for Kubernetes." Just as apt or npm install packaged software with dependencies, Helm installs charts — packages of templated Kubernetes manifests — into a cluster. It solves the problems of raw YAML: it templates manifests so one chart serves many configurations, versions deployments, tracks them as releases, and supports upgrade/rollback.
A vital historical note: Helm 3 (current) is client-only; there is no server-side component. Helm 2 had a privileged in-cluster server called Tiller, which was a security liability (it held broad cluster permissions); Helm 3 removed it entirely. Now the helm CLI talks directly to the Kubernetes API using your kubeconfig credentials (so it respects your RBAC), and stores release state in the cluster, by default as Secrets in the release's namespace. The architecture is therefore simple: a CLI + chart packages + release metadata stored in-cluster.
Example
Helm 2 (old): helm CLI --> Tiller (in-cluster, privileged) --> API server [removed]
Helm 3 (now): helm CLI ----------------------------------------> API server
(uses your kubeconfig + RBAC; release state stored as Secrets)
helm version # v3.x — client only
helm install web ./mychart # render chart + apply to cluster as release "web"
kubectl get secret -l owner=helm # release history stored as Secrets
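The release Secrets are opaque at first glance, but their payload can be unpacked. The sketch below assumes the layering used by Helm 3's default Secrets storage driver as commonly documented: the release is JSON, gzip-compressed, base64-encoded by Helm, and then base64-encoded once more by the Kubernetes Secret API. Treat that layering as an assumption to verify against your Helm version; `decode_helm_release` and `encode_for_demo` are hypothetical helpers.

```python
import base64
import gzip
import json
from typing import Any, Dict


def decode_helm_release(secret_data_release: str) -> Dict[str, Any]:
    """Decode the `.data.release` field of a Helm release Secret."""
    helm_encoded = base64.b64decode(secret_data_release)  # undo the Secret API's base64
    compressed = base64.b64decode(helm_encoded)           # undo Helm's own base64
    return json.loads(gzip.decompress(compressed))        # gunzip, then parse JSON


def encode_for_demo(release: Dict[str, Any]) -> str:
    """Inverse of decode_helm_release, used only to build a sample payload."""
    compressed = gzip.compress(json.dumps(release).encode("utf-8"))
    return base64.b64encode(base64.b64encode(compressed)).decode("ascii")


# Hypothetical, heavily trimmed release record for the round-trip demo.
SAMPLE_RELEASE = {"name": "web", "namespace": "prod", "version": 1,
                  "info": {"status": "deployed"}}
SAMPLE_PAYLOAD = encode_for_demo(SAMPLE_RELEASE)
```

Under the same assumption, the payload for revision 1 of release "web" would come from a Secret named like sh.helm.release.v1.web.v1, read with `kubectl get secret <name> -o jsonpath='{.data.release}'`. Because the full rendered manifest and values live in that payload, read access to these Secrets exposes everything the release contains.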
Exercises
- (Beginner) What is Helm, by analogy to other ecosystems?
- (Beginner) What major component did Helm 3 remove from Helm 2, and why?
- (Intermediate) In Helm 3, how does the CLI authenticate to the cluster, and where is release state stored?
- (Interview) Why was removing Tiller a significant security improvement, and how does Helm 3 respect existing RBAC? (Hint: no privileged in-cluster server; client uses user credentials.)
Answers
- Helm is the package manager for Kubernetes — analogous to apt/yum/npm — installing packaged, versioned, parameterized applications (charts) into a cluster.
- Tiller, the privileged in-cluster server component, because it was a security liability (it typically ran with broad cluster permissions, creating a powerful attack target and privilege-escalation path).
- The helm CLI uses your kubeconfig credentials to talk directly to the Kubernetes API (so actions are bound by your RBAC). Release state/history is stored in-cluster, by default as Secrets in the release's namespace.
- Tiller ran in-cluster with wide-ranging permissions, so anyone who could reach it could effectively perform privileged actions regardless of their own RBAC, a major risk. Helm 3 removes Tiller and makes the client act directly as the user: every operation goes through the API server with the user's own kubeconfig/RBAC, so Helm can do exactly what the user is permitted to do, no more and no less. This eliminates the privileged server and aligns Helm's access with standard Kubernetes authorization.
Charts, releases, and repositories
Theory
Three core Helm nouns:
- Chart: a package, i.e., a directory (or .tgz) of templated manifests plus metadata and default values. It's the definition of an application (like an installer or a recipe). Charts are versioned.
- Release: an instance of a chart deployed to a cluster, with a name and a specific configuration. Installing the same chart twice (different names) yields two independent releases, each with its own version history (revision 1, 2, 3...). This is what enables upgrades and rollbacks.
- Repository: a place where charts are stored and shared — an HTTP server hosting an index of charts (or, increasingly, an OCI registry). You add repos and pull charts from them (e.g., Bitnami's repo).
The relationship: a chart (from a repository) is installed to create a release (a running, versioned instance) in your cluster. This mirrors package managers: a package from a repository becomes an installed instance. Artifact Hub is the central index for discovering public charts.
Example
helm repo add bitnami https://charts.bitnami.com/bitnami # add a repository
helm repo update
helm search repo bitnami/postgresql # find a chart
helm install db1 bitnami/postgresql # release "db1" from the chart
helm install db2 bitnami/postgresql # release "db2" — independent instance
helm list # show releases and their revisions
helm history db1 # revision history of release db1
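The relationship between releases and revisions can be modeled in a few lines. This is a toy model for building intuition, not Helm's data structures; the class and method names are invented. One behavior worth noticing is that a rollback does not rewind history: it records a new revision that reuses an older revision's content.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Revision:
    number: int
    chart_version: str
    values: Dict[str, str]
    description: str


@dataclass
class Release:
    name: str
    history: List[Revision] = field(default_factory=list)

    def _record(self, chart_version: str, values: Dict[str, str], description: str) -> Revision:
        revision = Revision(len(self.history) + 1, chart_version, dict(values), description)
        self.history.append(revision)
        return revision

    def install(self, chart_version: str, values: Dict[str, str]) -> Revision:
        return self._record(chart_version, values, "Install complete")

    def upgrade(self, chart_version: str, values: Dict[str, str]) -> Revision:
        return self._record(chart_version, values, "Upgrade complete")

    def rollback(self, to_revision: int) -> Revision:
        if not 1 <= to_revision <= len(self.history):
            raise ValueError(f"release {self.name} has no revision {to_revision}")
        target = self.history[to_revision - 1]
        return self._record(target.chart_version, target.values, f"Rollback to {to_revision}")


# Two releases of the same chart keep fully independent histories.
db1 = Release("db1")
db1.install("1.0.0", {"replicas": "1"})
db1.upgrade("1.1.0", {"replicas": "3"})
restored = db1.rollback(1)

db2 = Release("db2")
db2.install("1.0.0", {"replicas": "1"})
```

After these calls, db1 has three revisions (install, upgrade, rollback) and db2 still has one, which is the picture `helm history` would show for each release.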
Exercises
- (Beginner) Define chart, release, and repository in one phrase each.
- (Beginner) Can you install the same chart multiple times in one cluster? What distinguishes the instances?
- (Intermediate) What enables Helm's upgrade and rollback capability at the release level?
- (Interview) Map Helm's chart/release/repository model onto a traditional package manager and explain why this mental model is useful. (Hint: package/installed-instance/repo.)
Answers
- Chart = a versioned package of templated Kubernetes manifests (the app definition); Release = a named, configured instance of a chart deployed in a cluster; Repository = a store/index where charts are published and fetched.
- Yes — install it under different release names; each release is an independent instance with its own configuration and revision history.
- Each release maintains a revision history (Helm stores the rendered manifests/values for each revision as Secrets). Upgrades create a new revision; rollback re-applies a prior stored revision — so versioned release state is what makes upgrade/rollback possible.
- It maps directly: a repository hosts charts (packages), and installing a chart produces a release (an installed instance), just like a repo hosts packages that you install as running software, e.g., apt's repo → package → installed program. This is useful because it brings familiar package-management concepts (discovery, versioning, install, upgrade, rollback, multiple instances) to Kubernetes deployments, so practitioners can reason about app lifecycle the same way they do with OS/language package managers.
Installing and managing Helm
Theory
Helm is a single CLI binary you install on your workstation/CI (like kubectl). Once installed, the core lifecycle commands are few and consistent:
- helm repo add/update: register and refresh chart repositories.
- helm search: find charts (repo or Artifact Hub).
- helm install <release> <chart>: deploy a chart as a named release.
- helm upgrade <release> <chart>: apply changes/new versions (with --install to create-or-upgrade idempotently).
- helm rollback <release> <revision>: revert to a previous revision.
- helm uninstall <release>: remove a release.
- helm list / helm status / helm history: inspect releases.
Two indispensable habits for safety: helm template renders a chart to plain YAML locally (to inspect what would be applied), and --dry-run --debug simulates an install/upgrade against the API server without changing anything. helm get values/manifest shows what's currently deployed. These let you validate changes before they hit a cluster.
Example
helm install web ./mychart -n prod --create-namespace \
-f values-prod.yaml --set image.tag=2.0 # install with value overrides
helm upgrade --install web ./mychart -n prod -f values-prod.yaml # create-or-upgrade
helm rollback web 2 -n prod # revert to revision 2
# Validate before applying:
helm template web ./mychart -f values-prod.yaml # render to YAML locally
helm upgrade web ./mychart --dry-run --debug -n prod # simulate, no changes
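In CI, the "validate first, then apply" habit is often wrapped in a small script. The sketch below shows one way to do that; `helm_upgrade_command` and `run_validated_upgrade` are hypothetical helpers, and the only Helm flags used are ones that appear in the commands above (`upgrade --install`, `-n`, `-f`, `--dry-run`, `--debug`).

```python
import subprocess
from typing import List, Sequence


def helm_upgrade_command(release: str, chart: str, namespace: str,
                         values_files: Sequence[str] = (),
                         dry_run: bool = True) -> List[str]:
    """Build the argument list for an idempotent `helm upgrade --install`."""
    command = ["helm", "upgrade", "--install", release, chart, "-n", namespace]
    for path in values_files:
        command += ["-f", path]
    if dry_run:
        command += ["--dry-run", "--debug"]
    return command


def run_validated_upgrade(release: str, chart: str, namespace: str,
                          values_files: Sequence[str] = ()) -> None:
    """Run a dry run first; apply for real only if the dry run succeeds."""
    # check=True raises CalledProcessError on a non-zero exit, stopping before the real run.
    subprocess.run(helm_upgrade_command(release, chart, namespace, values_files, dry_run=True),
                   check=True)
    subprocess.run(helm_upgrade_command(release, chart, namespace, values_files, dry_run=False),
                   check=True)
```

Building the command as a list (rather than a shell string) avoids quoting problems when release names or file paths come from pipeline variables.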
Exercises
- (Beginner) Which command deploys a chart as a named release?
- (Beginner) What does helm upgrade --install do that plain helm upgrade does not?
- (Intermediate) How can you see the YAML a chart will produce without applying it?
- (Interview) Why are helm template and --dry-run important habits in a production workflow? (Hint: validate/inspect rendered output before changing the cluster.)
Answers
- helm install <release-name> <chart>.
- --install makes it an idempotent create-or-upgrade: it installs the release if it doesn't exist yet, or upgrades it if it does, whereas plain helm upgrade fails if the release doesn't already exist. This is ideal for CI/CD.
- Run helm template <release> <chart> [-f values.yaml], which renders the chart's templates to plain Kubernetes YAML locally without contacting the cluster (or helm install/upgrade --dry-run to simulate the operation without applying anything).
- They let you inspect and validate exactly what will be applied before it touches a live cluster: helm template renders the final manifests locally (catch templating errors, review diffs, run through policy/linting), and --dry-run --debug simulates the operation and prints the rendered manifests without making changes. A Helm dry run does not submit the resources to the cluster, so admission webhooks and policy engines are not exercised; run the rendered output through your policy tooling for that. This prevents surprises (misrendered values, accidental deletions, invalid resources) from reaching production.
Helm vs Kustomize
Theory
Helm and Kustomize are the two dominant configuration-management tools, with fundamentally different philosophies:
- Helm uses templating: charts contain Go-template placeholders filled from values.yaml. It's a full package manager with versioning, releases, dependencies, rollback, and a sharing ecosystem. Powerful, but templated YAML can become complex and hard to read, and it's not valid YAML until rendered.
- Kustomize uses overlays/patching: you start with plain, valid base YAML and apply patches for each variant (environment), with no templating language. It's built into kubectl (kubectl apply -k), keeps manifests as real YAML, and is simpler for straightforward per-environment customization. But it lacks packaging, versioning, dependencies, and a distribution ecosystem.
Rule of thumb: Helm when you need to package and distribute applications (especially third-party software) with parameters and lifecycle management; Kustomize when you want to customize your own plain manifests across environments without a templating language. They're not mutually exclusive — you can render a Helm chart and post-process with Kustomize, and tools like Argo CD/Flux support both.
Example
| Aspect | Helm | Kustomize |
|---|---|---|
| Mechanism | Go templating + values | Overlays/patches on plain YAML |
| Packaging/versioning | Yes (charts, releases) | No |
| Dependencies | Yes (subcharts) | No |
| Rollback | Yes (release history) | No (rely on Git/kubectl) |
| Built into kubectl | No | Yes (apply -k) |
| Best for | Packaging/distributing apps | Customizing own manifests per env |
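The "render with Helm, then patch with Kustomize" combination can be sketched as a short shell snippet. The directory combined/, the release name web, the chart path ./mychart, the file values-prod.yaml, and the function name render_and_patch are all assumptions made for illustration; this is one possible arrangement, not the only way to combine the two tools.

```shell
# Sketch: post-process Helm output with Kustomize.
# Assumed (hypothetical) names: ./mychart, release "web", directory ./combined.
mkdir -p combined

# The overlay treats Helm's rendered output as a plain-YAML resource file.
cat > combined/kustomization.yaml <<'EOF'
resources:
  - rendered.yaml      # written by `helm template` below
namePrefix: prod-      # Kustomize-side customization applied on top
EOF

render_and_patch() {
  # 1) Helm renders the chart locally to plain Kubernetes YAML.
  helm template web ./mychart -f values-prod.yaml > combined/rendered.yaml || return 1
  # 2) Kustomize applies the overlay; swap in `kubectl apply -k combined` to deploy.
  kubectl kustomize combined
}
```

Calling render_and_patch prints the patched manifests without touching the cluster, so the combined result can be reviewed before anything is applied.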
Exercises
- (Beginner) What is the core mechanism Helm uses versus Kustomize?
- (Beginner) Which of the two is built into kubectl?
- (Intermediate) Give one capability Helm has that Kustomize lacks, and one advantage Kustomize has over Helm.
- (Interview) When would you choose Helm over Kustomize, and can they be used together? (Hint: packaging/distribution vs. plain-YAML overlays; render Helm then patch.)
Answers
- Helm uses Go-template-based templating driven by a values file; Kustomize uses overlays and patches applied to plain, valid base YAML (no templating language).
- Kustomize (kubectl apply -k / kubectl kustomize).
- Helm has packaging/versioning, releases, dependencies (subcharts), and built-in rollback, which Kustomize lacks. Kustomize keeps manifests as plain, valid YAML (easier to read/diff, no templating complexity) and is built into kubectl. (Either direction acceptable.)
- Choose Helm when you need to package, version, parameterize, and distribute an application, especially installing/maintaining third-party software with dependencies and lifecycle (upgrade/rollback). Choose Kustomize for customizing your own plain manifests across environments without templating. They can be combined: e.g., render a Helm chart (helm template) and then apply Kustomize patches on top, and GitOps tools (Argo CD, Flux) natively support both, even Helm charts post-rendered with Kustomize.
11.2 Working with Charts
This subchapter goes inside charts: their structure, values, the templating language, and lifecycle hooks.
Chart structure and anatomy
Theory
A Helm chart is a directory with a conventional structure. The key files and folders:
mychart/
  Chart.yaml        # chart metadata: name, version, appVersion, dependencies
  values.yaml       # default configuration values
  templates/        # templated Kubernetes manifests
    deployment.yaml
    service.yaml
    _helpers.tpl    # reusable named template snippets (helpers)
    NOTES.txt       # message shown to the user after install
  charts/           # subcharts (dependencies) live here
  crds/             # CRDs installed before other resources
  .helmignore       # files to exclude from packaging
The two most important are Chart.yaml (identifies the chart and its versions: version is the chart's own version, appVersion is the app it deploys) and values.yaml (the default parameters users override). Everything in templates/ is rendered through Go templating with the values, then applied. Understanding this layout is the foundation for both using and authoring charts.
Example
# Chart.yaml
apiVersion: v2
name: mychart
version: 1.2.0        # the CHART version (bump on chart changes)
appVersion: "2.4.1"   # the version of the APP being deployed
description: My web app
dependencies:
  - name: postgresql
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
Exercises
- (Beginner) What two files are the most important in a chart, and what does each hold?
- (Beginner) What goes in the templates/ directory?
- (Intermediate) What is the difference between version and appVersion in Chart.yaml?
- (Interview) What is the purpose of the crds/ directory and why does Helm treat CRDs specially? (Hint: CRDs must exist before CRs that use them; installed first, limited lifecycle handling.)
Answers
- Chart.yaml (chart metadata: name, versions, dependencies) and values.yaml (default configuration values users can override).
- The templated Kubernetes manifests (Deployments, Services, etc.), plus helpers (_helpers.tpl) and NOTES.txt, rendered through Go templating with the values and applied to the cluster.
- version is the chart's own package version (bump it whenever the chart changes); appVersion is the version of the application the chart deploys (e.g., the container image's app version). They're independent: you can release a new chart version without changing the app, and vice versa.
- The crds/ directory holds CustomResourceDefinitions that Helm installs before the rest of the chart's resources, so that custom resources relying on them can be created. Helm treats them specially because CRDs are cluster-scoped, foundational schema that must exist first; to avoid race conditions and accidental data loss, Helm installs CRDs from this directory up front and does not manage their upgrade/deletion through the normal templating lifecycle (CRD updates/removal must be handled deliberately).
values.yaml and value overrides
Theory
values.yaml holds a chart's default configuration as structured YAML; templates reference these values (e.g., {{ .Values.image.tag }}). The whole point is parameterization: the same chart produces different deployments by supplying different values — this is how one chart serves dev, staging, and prod.
Values can be overridden at install/upgrade time, and there's a clear precedence (later wins):
- The chart's own values.yaml (defaults).
- A parent chart's values (for subcharts).
- User-supplied -f myvalues.yaml files (multiple allowed; later files override earlier).
- --set key=value flags on the command line (highest precedence).
The common pattern is a base values.yaml plus per-environment files (values-prod.yaml) layered with -f, reserving --set for one-off or sensitive/CI-injected values. Understanding precedence prevents "why isn't my override taking effect?" confusion.
Example
# values.yaml (defaults)
replicaCount: 1
image:
  repository: myapp
  tag: "1.0"
resources:
  limits: { cpu: 500m, memory: 512Mi }
# Override: file then --set (--set wins)
# values-prod.yaml sets e.g. replicaCount: 5; --set has the highest precedence.
# (Comments cannot follow a trailing backslash, so they are kept on their own lines.)
helm upgrade --install web ./mychart \
  -f values-prod.yaml \
  --set image.tag=2.0
# Result: replicaCount from values-prod.yaml, image.tag=2.0 from --set
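The layering pattern described above (base values, one file per environment, --set last) can be sketched as a small helper. The file names values-dev.yaml and values-prod.yaml, the release name web, the chart path ./mychart, and the function name render_env are assumptions for illustration only.

```shell
# Sketch: layer per-environment values files, then confirm which value wins.
# Assumed (hypothetical) names: ./mychart, release "web", values-<env>.yaml files.
cat > values-dev.yaml <<'EOF'
replicaCount: 1
EOF

cat > values-prod.yaml <<'EOF'
replicaCount: 5
image:
  tag: "1.9"           # overridden if --set image.tag=... is also passed
EOF

render_env() {
  # Usage: render_env <env> [extra helm flags...]
  env_name="$1"; shift
  # Precedence, lowest to highest: chart values.yaml < -f file < --set flags in "$@".
  helm template web ./mychart -f "values-${env_name}.yaml" "$@"
}
```

For example, render_env prod --set image.tag=2.0 | grep 'image:' should show the tag from --set rather than the one in values-prod.yaml. For a release that is already installed, helm get values web lists the user-supplied values and helm get values web --all lists the fully merged set, which helps when an override seems to be ignored.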
Exercises
- (Beginner) What is the purpose of values.yaml?
- (Beginner) Between a -f values.yaml file and a --set flag, which takes precedence?
- (Intermediate) Describe the typical pattern for managing values across multiple environments.
- (Interview) Explain Helm's value precedence order and a scenario where misunderstanding it causes a bug. (Hint: defaults < parent < -f files < --set; an override silently ignored.)
Answers
- To hold the chart's default configuration values, which templates reference — enabling one chart to be parameterized into many different deployments by overriding values.
- --set takes precedence over -f value files.
- Keep a base values.yaml with sane defaults, plus per-environment files (e.g., values-dev.yaml, values-prod.yaml) layered via -f at install/upgrade. Use --set sparingly for one-off or CI-injected/sensitive values. Store these in version control for reproducibility.
- Precedence (lowest to highest): chart values.yaml defaults → parent chart values (for subcharts) → user -f files (later files override earlier) → --set flags. A common bug: you edit values-prod.yaml to change image.tag, but a --set image.tag=... (e.g., left in a CI script) overrides it, so your file change appears ignored. Or two -f files set the same key and the later one silently wins. Knowing the order tells you exactly which source is taking effect.
Installing, upgrading, and rolling back releases
Theory
Helm manages the full release lifecycle, and its killer feature over raw kubectl apply is revision history with rollback. Each helm install/upgrade creates a new revision, storing the rendered manifests and values. If an upgrade breaks, helm rollback <release> <revision> restores a prior known-good state precisely — no manual reconstruction.
Important behaviors and flags:
- helm upgrade computes a three-way merge (the previous release's manifest, the new manifest, and the live state) to apply changes.
- --atomic: if the upgrade fails, automatically roll back to the previous revision (prevents being stuck in a broken half-applied state). It also turns on --wait.
- --wait: wait until resources are ready before considering the release successful (often with --timeout).
- helm history lists revisions; helm rollback reverts; helm uninstall --keep-history can retain history.
This lifecycle management — atomic, waited, reversible — is what makes Helm suitable for safe, repeatable production deployments.
Example
helm history web -n prod
# REVISION STATUS CHART APP VERSION DESCRIPTION
# 1 superseded mychart-1.0.0 1.0 Install complete
# 2 deployed mychart-1.1.0 1.1 Upgrade complete
# Safe upgrade: wait for readiness, auto-rollback on failure
helm upgrade web ./mychart -n prod --atomic --wait --timeout 5m -f values-prod.yaml
helm rollback web 1 -n prod # revert to revision 1 if needed
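The flags above are often wrapped in a small deploy function for CI. The following is a sketch under stated assumptions: the function names safe_upgrade and show_revision, the 5m timeout, and the argument order are illustrative choices, not a Helm convention.

```shell
# Sketch: a deploy wrapper for CI. Names and the 5m timeout are assumptions.
safe_upgrade() {
  # Usage: safe_upgrade <release> <chart> <namespace> [extra helm flags...]
  release="$1"; chart="$2"; namespace="$3"; shift 3
  if helm upgrade --install "$release" "$chart" -n "$namespace" \
       --atomic --wait --timeout 5m "$@"; then
    echo "deployed: $release"
  else
    # --atomic has already rolled back; print recent history to see what happened.
    helm history "$release" -n "$namespace" --max 5
    return 1
  fi
}

show_revision() {
  # Usage: show_revision <release> <revision> <namespace>
  # Prints the values and rendered manifests stored for one revision.
  helm get values "$1" --revision "$2" -n "$3"
  helm get manifest "$1" --revision "$2" -n "$3"
}
```

A call such as safe_upgrade web ./mychart prod -f values-prod.yaml mirrors the upgrade shown above. Note that a rollback is itself recorded as a new revision: rolling back from revision 2 to revision 1 produces revision 3, so the history keeps growing rather than being rewound.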
Exercises
- (Beginner) What does each helm install/upgrade create that enables rollback?
- (Beginner) Which command reverts a release to a previous revision?
- (Intermediate) What do the --atomic and --wait flags do during an upgrade?
- (Interview) How is Helm's rollback fundamentally safer/easier than rolling back a raw kubectl apply deployment? (Hint: stored rendered revisions vs. manual reconstruction.)
Answers
- A new revision, with the rendered manifests and values stored as release history.
- helm rollback <release> <revision>.
- --wait makes Helm wait until the release's resources report ready (within --timeout) before declaring success. --atomic causes a failed upgrade to automatically roll back to the previous revision, so you're never left in a broken, half-applied state.
- Helm stores the complete rendered manifests and values for every revision, so a rollback deterministically re-applies an exact prior known-good state with one command. With raw kubectl apply, there's no built-in history of what was applied at each point: to roll back you must locate and re-apply the previous manifests yourself (from Git, hopefully), reconcile drift, and remove resources that were added, which is manual, error-prone, and incomplete. Helm's stored, versioned releases make rollback precise, fast, and reliable.
Helm templating with Go templates
Theory
Helm templates are Kubernetes YAML with Go template directives that get rendered using the chart's values and built-in objects. This is the engine that turns one chart into many configurations. Key building blocks:
- .Values: user/chart values; .Release: release info (name, namespace, revision); .Chart: Chart.yaml data; .Capabilities: cluster/API capabilities.
- Actions in {{ }}: substitution ({{ .Values.x }}), conditionals ({{ if }}...{{ end }}), loops ({{ range }}), variables ({{ $x := ... }}).
- Pipelines and functions: {{ .Values.name | upper | quote }}. Helm includes the Sprig function library (string/number/date helpers such as default, quote, and nindent), plus Helm-specific functions such as required, toYaml, include, and tpl.
- Whitespace control: {{- and -}} trim whitespace to keep rendered YAML valid.
The trade-off: templating makes charts flexible but the source is not valid YAML until rendered, and complex logic can hurt readability. Functions like required (fail if a value is missing) and toYaml/nindent (embed structured values with correct indentation) are workhorses.
Example
# templates/deployment.yaml (excerpt)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web # built-in object
spec:
  replicas: {{ .Values.replicaCount | default 1 }}
  template:
    spec:
      containers:
      - name: web
        image: "{{ .Values.image.repository }}:{{ required "image.tag is required" .Values.image.tag }}"
        {{- if .Values.resources }}
        resources:
          {{- toYaml .Values.resources | nindent 10 }} # embed structured YAML, indented
        {{- end }}
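Loops are listed among the building blocks but not shown in the excerpt, so here is a minimal sketch of range combined with quote and whitespace trimming. The chart directory tpl-demo, the config values key, and the function name render_tpl_demo are hypothetical names chosen for the illustration.

```shell
# Sketch: a tiny chart whose template uses range, quote, and whitespace trimming.
# Assumed (hypothetical) names: tpl-demo chart directory, the "config" values key.
mkdir -p tpl-demo/templates

cat > tpl-demo/Chart.yaml <<'EOF'
apiVersion: v2
name: tpl-demo
version: 0.1.0
EOF

cat > tpl-demo/values.yaml <<'EOF'
config:
  LOG_LEVEL: info
  FEATURE_X: "true"
EOF

# The quoted heredoc delimiter keeps the shell from expanding {{ }} and $key.
cat > tpl-demo/templates/configmap.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  name: {{ .Release.Name }}-config
data:
  {{- range $key, $value := .Values.config }}
  {{ $key }}: {{ $value | quote }}
  {{- end }}
EOF

render_tpl_demo() {
  # Renders locally; `helm template` does not need cluster access.
  helm template demo ./tpl-demo
}
```

Running render_tpl_demo should produce a ConfigMap whose data block contains one quoted entry per key under config. The {{- markers strip the line break before each action, which is what keeps blank lines out of the rendered data block.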
Exercises
- (Beginner) What built-in object gives you the release's name and namespace?
- (Beginner) What does the required function do?
- (Intermediate) Why are toYaml and nindent commonly used together?
- (Interview) What is the downside of Go templating in charts, and how do whitespace-control and helper functions mitigate template fragility? (Hint: invalid YAML until rendered; {{- -}}, nindent, default/required.)
Answers
- .Release (e.g., .Release.Name, .Release.Namespace).
- It returns a value but fails the render with an error message if that value is empty/missing, enforcing that mandatory values are supplied (e.g., required "image.tag is required" .Values.image.tag).
- toYaml serializes a structured value (like a map of resources) into YAML, and nindent indents that block by the correct number of spaces (with a leading newline) so it nests properly under its parent key. Together they let you inject arbitrary structured config from values into a template while keeping valid, correctly-indented YAML.
- The downside is that templated source is not valid YAML until rendered, so it's harder to read/lint, and indentation/whitespace mistakes easily produce invalid manifests; complex logic reduces clarity. Mitigations: whitespace-trim markers ({{-, -}}) keep rendered output clean and valid; nindent/indent ensure correct indentation for embedded blocks; default provides fallbacks to avoid empty values; required fails fast with a clear message instead of producing a broken manifest; and helm template/--dry-run let you verify the rendered YAML before applying.
Hooks and tests
Theory
Sometimes you need actions to run at specific points in a release's lifecycle — before an upgrade, after an install, before deletion. Helm hooks are ordinary Kubernetes resources (usually Jobs) annotated to run at a lifecycle phase. Common hook events: pre-install, post-install, pre-upgrade, post-upgrade, pre-delete, post-delete. Typical uses: run a database migration before an upgrade, seed data after install, or back up before deletion. Hook ordering is controlled with helm.sh/hook-weight, and cleanup with helm.sh/hook-delete-policy.
Helm tests are a special hook (helm.sh/hook: test) — Jobs/Pods that validate a release works after deployment (e.g., curl the service, check a DB connection). You run them with helm test <release>; success/failure indicates whether the deployed release is functioning. Together, hooks (lifecycle automation) and tests (post-deploy validation) make charts capable of orchestrating real operational workflows, not just applying static manifests.
Example
# A pre-upgrade migration hook (a Job that runs before the upgrade proceeds):
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-migrate
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "0"
    "helm.sh/hook-delete-policy": hook-succeeded
spec:
  template:
    spec:
      restartPolicy: Never
      containers:
        - { name: migrate, image: migrator:2.0, command: ["/migrate.sh"] }
---
# A test hook (run via `helm test <release>`):
apiVersion: v1
kind: Pod
metadata:
  name: "{{ .Release.Name }}-test"
  annotations: { "helm.sh/hook": test }
spec:
  restartPolicy: Never
  containers:
    - { name: curl, image: curlimages/curl, command: ["curl","-f","http://{{ .Release.Name }}-web/"] }
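Hook ordering becomes clearer with a second hook on the same event. The sketch below adds a backup Job that should run before the migration Job because its weight is lower. The directory hooks-demo, the image backup-tool:1.0, the script /backup.sh, and the function name run_release_tests are hypothetical.

```shell
# Sketch: a second pre-upgrade hook that runs before the migration Job above.
# Assumed (hypothetical) names: hooks-demo directory, backup image and script.
mkdir -p hooks-demo/templates

cat > hooks-demo/templates/pre-upgrade-backup.yaml <<'EOF'
apiVersion: batch/v1
kind: Job
metadata:
  name: {{ .Release.Name }}-backup
  annotations:
    "helm.sh/hook": pre-upgrade
    "helm.sh/hook-weight": "-5"          # lower weight runs first (migrate uses "0")
    "helm.sh/hook-delete-policy": before-hook-creation,hook-succeeded
spec:
  backoffLimit: 0                        # do not retry a failed backup
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: backup
          image: backup-tool:1.0         # hypothetical image
          command: ["/backup.sh"]
EOF

run_release_tests() {
  # Usage: run_release_tests <release> <namespace>
  # Runs the chart's test hooks and prints the logs of the test Pods.
  helm test "$1" -n "$2" --logs
}
```

Weights are strings in the annotation and may be negative. The before-hook-creation policy removes a leftover Job from an earlier run before creating the new one, which avoids name collisions when a previous hook failed and was kept for debugging.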
Exercises
- (Beginner) What are Helm hooks typically implemented as?
- (Beginner) Name two hook events and a use case for each.
- (Intermediate) How do you control the order in which multiple hooks run?
- (Interview) What is a Helm test, how do you run it, and how does it differ from a hook? (Hint: test hook validates a release post-deploy via helm test.)
Answers
- Ordinary Kubernetes resources, most commonly Jobs (or Pods), annotated with helm.sh/hook to run at a lifecycle phase.
- Examples: pre-upgrade to run a database migration before the new version rolls out; post-install to seed initial data; pre-delete to back up before removal. (Any two valid events + uses.)
- With the helm.sh/hook-weight annotation: hooks are sorted by ascending weight (ties broken by resource kind, then name), so lower weights run first. (And hook-delete-policy controls when hook resources are cleaned up.)
- A Helm test is a resource annotated with helm.sh/hook: test (a Job/Pod that verifies the release works, e.g., by hitting its Service). You run it on demand with helm test <release>; its success/failure reports whether the deployed release is functioning. It differs from lifecycle hooks in when and why it runs: lifecycle hooks fire automatically during install/upgrade/delete to perform operational steps, whereas a test is invoked explicitly (after deployment) purely to validate that the release is healthy.
11.3 Creating and Publishing Charts
This subchapter covers authoring your own charts: scaffolding, reusable helpers, dependencies, and distribution.
Scaffolding a new chart
Theory
To author a chart, you don't start from scratch — helm create <name> scaffolds a complete, working starter chart with best-practice structure and a functional example (a Deployment, Service, ServiceAccount, HPA, Ingress, and a sensible values.yaml with helpers). You then edit the templates and values to fit your application.
Two essential companion commands during development:
- helm lint: checks the chart for errors and best-practice violations (malformed templates, missing required fields), catching problems before you try to install.
- helm template / helm install --dry-run: render the chart to confirm the output YAML is what you expect.
The scaffolded chart also demonstrates the conventions you should follow: using _helpers.tpl for names/labels, parameterizing everything through values.yaml, and including NOTES.txt. Starting from helm create and iterating with lint + template is the standard authoring loop.
Example
helm create mychart # scaffold a working starter chart
tree mychart # Chart.yaml, values.yaml, templates/, _helpers.tpl, ...
# Edit templates/values, then validate:
helm lint mychart # check for errors / best-practice issues
helm template mychart # render to YAML to inspect
helm install demo mychart --dry-run --debug # simulate against the cluster
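The lint + template loop is easy to bundle into one function for local use or a CI step. This is a sketch only: the function name validate_chart and the throwaway release name check are assumptions.

```shell
# Sketch: one function that runs the lint + render checks for a chart.
# The function name and the release name "check" are assumptions.
validate_chart() {
  # Usage: validate_chart <chart-dir> [extra helm flags, e.g. -f values-prod.yaml]
  chart="$1"; shift
  # --strict makes lint warnings fail the check as well.
  helm lint --strict "$chart" "$@" || return 1
  # Rendering catches template errors that only appear with real values.
  helm template check "$chart" "$@" > /dev/null || return 1
  echo "ok: $chart"
}
```

Passing the same -f files used for a real environment (for example validate_chart mychart -f values-prod.yaml) checks the chart against the values it will actually be deployed with, not just the defaults.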
Exercises
- (Beginner) What does helm create give you?
- (Beginner) What does helm lint do?
- (Intermediate) Describe the typical develop-validate loop when authoring a chart.
- (Interview) Why is scaffolding with helm create and following its conventions (helpers, values parameterization) beneficial for maintainability? (Hint: consistent structure, best practices, reusable naming/labels.)
Answers
- A complete, working starter chart with best-practice structure and a functional example application (Deployment, Service, ServiceAccount, Ingress, HPA, helpers, values, NOTES) that you customize.
- It validates the chart for errors and best-practice violations (template syntax, missing/invalid fields, conventions), catching issues before installation.
- Edit templates and values.yaml, run helm lint to catch errors, render with helm template (and/or helm install --dry-run --debug) to inspect the produced YAML, fix issues, and repeat, iterating until the rendered manifests are correct before a real install.
- helm create provides a consistent, proven structure and demonstrates conventions like centralizing names/labels in _helpers.tpl and parameterizing everything via values.yaml. Following these makes charts predictable to read and maintain, ensures consistent labeling/naming (important for selectors and tooling), reduces duplication (reusable helpers), and incorporates community best practices from the start, so the chart is easier for others to understand, extend, and operate.
Named templates and helpers
Theory
Charts often repeat the same snippets — resource names, label sets, selector labels — across many templates. Named templates (a.k.a. helpers, conventionally defined in _helpers.tpl) let you define a reusable block once and call it everywhere, keeping charts DRY and consistent. Files starting with _ aren't rendered into manifests themselves; they only hold definitions.
You define a named template with {{ define "name" }}...{{ end }} and use it with {{ include "name" . }} (passing the current context .). include is preferred over the built-in template action because include's output can be piped (e.g., {{ include "mychart.labels" . | nindent 4 }}), which template cannot. The canonical use is standardized labels and names: define mychart.fullname, mychart.labels, and mychart.selectorLabels once and reference them in every resource, guaranteeing consistent metadata (critical for selectors matching across Deployments/Services).
Example
# templates/_helpers.tpl
{{- define "mychart.fullname" -}}
{{ .Release.Name }}-{{ .Chart.Name }}
{{- end }}
{{- define "mychart.labels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
app.kubernetes.io/managed-by: {{ .Release.Service }}
{{- end }}
# templates/service.yaml: reuse the helpers
metadata:
  name: {{ include "mychart.fullname" . }}
  labels:
    {{- include "mychart.labels" . | nindent 4 }} # include can be piped
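The selector use case mentioned above can be sketched as follows: one mychart.selectorLabels helper feeds the Deployment's selector, its Pod template labels, and the Service's selector, each at a different indentation. The directory helpers-demo and the image myapp:1.0 are hypothetical, and the files are meant to be dropped into an existing chart's templates/ directory rather than rendered on their own.

```shell
# Sketch: a selectorLabels helper shared by a Deployment and a Service.
# Assumed (hypothetical) names: helpers-demo directory, "mychart.*" helper names.
mkdir -p helpers-demo/templates

cat > helpers-demo/templates/_helpers.tpl <<'EOF'
{{- define "mychart.selectorLabels" -}}
app.kubernetes.io/name: {{ .Chart.Name }}
app.kubernetes.io/instance: {{ .Release.Name }}
{{- end }}
EOF

cat > helpers-demo/templates/deployment.yaml <<'EOF'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ .Release.Name }}-web
spec:
  selector:
    matchLabels:
      {{- include "mychart.selectorLabels" . | nindent 6 }}
  template:
    metadata:
      labels:
        {{- include "mychart.selectorLabels" . | nindent 8 }}
    spec:
      containers:
        - name: web
          image: "myapp:1.0"
EOF

cat > helpers-demo/templates/service.yaml <<'EOF'
apiVersion: v1
kind: Service
metadata:
  name: {{ .Release.Name }}-web
spec:
  selector:
    {{- include "mychart.selectorLabels" . | nindent 4 }}
  ports:
    - port: 80
EOF
```

The same helper is piped through nindent with three different widths (6, 8, and 4), which is exactly what the template action cannot do. Keeping selector labels in a separate, smaller helper than the full label set is a common choice because a Deployment's selector cannot be changed after creation, so it should contain only stable labels.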
Exercises
- (Beginner) Where are named template helpers conventionally defined, and are those files rendered into manifests?
- (Beginner) Which keyword defines a named template, and which uses it?
- (Intermediate) Why is include preferred over the template action?
- (Interview) Why is defining shared label/name helpers important for correctness, not just DRY-ness? (Hint: selectors must consistently match across resources.)
Answers
- In _helpers.tpl (files beginning with _), which are not rendered into output manifests themselves; they only hold reusable definitions.
- {{ define "name" }}...{{ end }} defines a named template; {{ include "name" . }} uses it (passing the context).
- Because include returns a string that can be piped through functions (e.g., | nindent 4, | quote), whereas the built-in template action writes directly to output and cannot be piped, making include far more flexible for embedding helper output with correct indentation/formatting.
- Labels and selectors must match exactly between related resources (e.g., a Deployment's pod template labels and its selector, and a Service's selector). Defining them once in a shared helper guarantees every resource uses identical label sets, so selectors reliably match. Hand-writing labels in each template risks subtle mismatches that break Service routing or controller ownership, so helpers protect correctness, not just reduce duplication.
Subcharts and chart dependencies
Theory
Real applications often depend on other components — a web app needs a database, a cache. Helm models this with dependencies (subcharts): a parent chart declares dependencies in Chart.yaml, and helm dependency update fetches them into the charts/ directory. Installing the parent then deploys the subcharts too. This lets you compose applications from reusable building blocks (e.g., depend on Bitnami's PostgreSQL chart instead of writing your own).
Key mechanics:
- Values flow down: a parent can override a subchart's values by nesting them under the subchart's name in its own values.yaml. There are also global values shared across all subcharts.
- Conditions/tags: dependencies can be enabled/disabled via condition/tags (e.g., only deploy the bundled database in dev, use an external one in prod).
- Aliases: include the same subchart multiple times under different names.
Dependencies enable modular, reusable chart composition — but deep dependency trees can get complex, so use them judiciously.
Example
# Chart.yaml: declare a dependency, conditionally enabled
dependencies:
  - name: postgresql
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
    condition: postgresql.enabled # toggle via values
# Parent values.yaml: override the subchart's values under its name
postgresql:
  enabled: true
  auth:
    database: myapp
    username: app
global:
  storageClass: fast # shared with all subcharts
helm dependency update ./mychart # fetch subcharts into charts/
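Aliases are mentioned in the mechanics but not shown, so here is a sketch that includes the same subchart twice. The chart name deps-demo, the aliases primarydb and reportingdb, and the function name fetch_deps are hypothetical; the repository URL and version range are reused from the example above.

```shell
# Sketch: the same subchart included twice via alias, each with its own toggle.
# Assumed (hypothetical) names: deps-demo chart, aliases "primarydb" and "reportingdb".
mkdir -p deps-demo

cat > deps-demo/Chart.yaml <<'EOF'
apiVersion: v2
name: deps-demo
version: 0.1.0
dependencies:
  - name: postgresql
    alias: primarydb
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
    condition: primarydb.enabled
  - name: postgresql
    alias: reportingdb
    version: "15.x.x"
    repository: https://charts.bitnami.com/bitnami
    condition: reportingdb.enabled
EOF

# Values for an aliased subchart are nested under the alias, not the chart name.
cat > deps-demo/values.yaml <<'EOF'
primarydb:
  enabled: true
  auth:
    database: myapp
reportingdb:
  enabled: false       # e.g. disabled in dev
EOF

fetch_deps() {
  helm dependency update ./deps-demo   # downloads subcharts into deps-demo/charts/
  helm dependency list ./deps-demo     # shows each dependency and its status
}
```

With this layout, switching reportingdb.enabled to true in a production values file adds the second database without touching Chart.yaml.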
Exercises
- (Beginner) Where do you declare a chart's dependencies, and which command fetches them?
- (Beginner) How does a parent chart override a subchart's values?
- (Intermediate) How can you conditionally enable or disable a subchart (e.g., a bundled database)?
- (Interview) What are the benefits and the risks of composing applications from many subcharts? (Hint: reuse/modularity vs. complexity, value-plumbing, version coupling.)
Answers
- In the parent's Chart.yaml under dependencies; helm dependency update (or helm dependency build) fetches them into the charts/ directory.
- By nesting overrides under the subchart's name in the parent's values.yaml (e.g., a top-level postgresql: block sets values passed to the postgresql subchart); global values can also be shared across all subcharts.
- Give the dependency a condition (e.g., postgresql.enabled) or tags in Chart.yaml, then toggle that value (e.g., postgresql.enabled: false in prod to use an external database) to include or exclude the subchart.
- Benefits: reuse of well-maintained components (don't reinvent a database chart), modular composition, and consistent deployment of an app plus its dependencies in one release. Risks: complexity grows with deep dependency trees (hard to reason about and debug), value-plumbing between parent and subcharts becomes intricate and error-prone, version coupling/compatibility must be managed, and an upstream subchart change can ripple unexpectedly. Use dependencies where they add clear value and keep trees shallow and well-documented.
Packaging and publishing to OCI registries
Theory
Once a chart is ready, you package it into a versioned .tgz archive with helm package (the filename encodes name + version). To share it, you publish to a chart repository. Modern Helm supports OCI registries (the same registries that store container images — Docker Hub, GHCR, ECR, Harbor) as first-class chart storage, which is now the recommended approach over the older HTTP index.yaml repositories.
The OCI workflow: helm package to create the archive, then helm push <chart>.tgz oci://<registry>/<repo> to upload it, and consumers use helm install <name> oci://<registry>/<repo>/<chart> --version <v> to pull it. Benefits of OCI: unified storage/auth/tooling with container images (one registry, one set of credentials and access controls), versioning, and content addressing. (The legacy alternative is hosting a static repo with an index.yaml via helm repo index.)
Example
helm package ./mychart # -> mychart-1.2.0.tgz
# Publish to an OCI registry (e.g., GitHub Container Registry):
helm push mychart-1.2.0.tgz oci://ghcr.io/myorg/charts
# Consume it:
helm install web oci://ghcr.io/myorg/charts/mychart --version 1.2.0
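In CI, the package-and-push steps usually sit behind a registry login. The sketch below shows one way to arrange that. The environment variables REGISTRY_USER and REGISTRY_TOKEN, the dist output directory, and the function names publish_chart and inspect_remote_chart are assumptions for illustration.

```shell
# Sketch: package and publish a chart to an OCI registry from CI.
# Assumed (hypothetical) names: REGISTRY_USER / REGISTRY_TOKEN env vars, ./dist output dir.
publish_chart() {
  # Usage: publish_chart <chart-dir> <registry-host> <oci-repo-path>
  chart_dir="$1"; host="$2"; repo="$3"
  # Log in once; --password-stdin keeps the token out of the process list.
  printf '%s' "$REGISTRY_TOKEN" |
    helm registry login "$host" --username "$REGISTRY_USER" --password-stdin || return 1
  # Write the versioned archive into ./dist, then push every archive found there.
  helm package "$chart_dir" --destination dist || return 1
  for archive in dist/*.tgz; do
    helm push "$archive" "oci://${host}/${repo}" || return 1
  done
}

inspect_remote_chart() {
  # Usage: inspect_remote_chart <oci-ref> <version>
  # Reads the chart's Chart.yaml metadata from the registry without installing it.
  helm show chart "$1" --version "$2"
}
```

For example, publish_chart ./mychart ghcr.io myorg/charts followed by inspect_remote_chart oci://ghcr.io/myorg/charts/mychart 1.2.0 publishes the chart and then confirms what consumers will see.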
Exercises
- (Beginner) What does helm package produce?
- (Beginner) What is the recommended modern way to store/distribute charts?
- (Intermediate) Write the commands to push a packaged chart to an OCI registry and install it.
- (Interview) What advantages do OCI registries offer over the legacy index.yaml HTTP chart repositories? (Hint: unified storage/auth/tooling with images, versioning, access control.)
Answers
- A versioned chart archive: a .tgz file named <chart>-<version>.tgz (e.g., mychart-1.2.0.tgz).
- Publishing to an OCI registry (the same registries used for container images), which is now the recommended approach.
- Push: helm push mychart-1.2.0.tgz oci://<registry>/<repo>. Install: helm install <name> oci://<registry>/<repo>/mychart --version 1.2.0.
- OCI registries unify chart and container-image storage in one system with shared authentication, access control, and tooling, so you manage charts with the same credentials, RBAC, replication, and scanning infrastructure as images. They provide robust versioning and content addressing, avoid maintaining a separate static index.yaml/web server, and integrate with existing registry ecosystems (Harbor, ECR, GHCR). The legacy HTTP repo required hosting and regenerating an index file and separate auth, which is more operational overhead and less integrated.
ChartMuseum and Artifact Hub
Theory
Two more pieces of the chart ecosystem:
- ChartMuseum: an open-source chart repository server you self-host. It serves charts over HTTP with an auto-generated index.yaml and supports various storage backends (local disk, S3, GCS). It's the classic way to run a private chart repository for your organization (predating widespread OCI support; OCI registries now often replace it).
- Artifact Hub: a CNCF-hosted central discovery website (artifacthub.io) for finding public Kubernetes packages: Helm charts, but also Operators, Kustomize bases, OPA/Kyverno policies, Falco rules, and more. It's the "search engine/catalog" for the cloud-native ecosystem: you discover a chart and its repository here, then add the repo and install. It does not host charts itself (it indexes repositories).
In short: ChartMuseum is for hosting your own charts (a private repo server); Artifact Hub is for discovering publicly available charts and other artifacts.
Example
# ChartMuseum: run a private repo server, then upload charts to it
helm repo add myrepo http://chartmuseum.internal:8080
curl --data-binary "@mychart-1.2.0.tgz" http://chartmuseum.internal:8080/api/charts
# Artifact Hub: discover public charts at https://artifacthub.io
# search "postgresql" -> find the Bitnami chart + its repo -> add and install
helm repo add bitnami https://charts.bitnami.com/bitnami
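Once a chart has been uploaded to a repository such as the ChartMuseum instance above, consuming it follows the classic repo workflow. A minimal sketch, assuming the myrepo alias and mychart 1.2.0 from the example; the function name install_from_repo is hypothetical.

```shell
# Sketch: consume a chart from an HTTP chart repository after uploading it.
# Assumed names: the "myrepo" alias and mychart 1.2.0 from the example above.
install_from_repo() {
  # Usage: install_from_repo <release> <repo/chart> <version>
  helm repo update || return 1                 # refresh the locally cached index.yaml
  helm search repo "$2" --versions             # list the versions the repo serves
  helm install "$1" "$2" --version "$3"
}
```

A call such as install_from_repo web myrepo/mychart 1.2.0 refreshes the index first, which matters because a freshly uploaded version is not visible to helm install until the local cache has been updated.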
Exercises
- (Beginner) What is ChartMuseum used for?
- (Beginner) What is Artifact Hub, and does it host charts itself?
- (Intermediate) Besides Helm charts, name two other artifact types you can discover on Artifact Hub.
- (Interview) Distinguish the roles of ChartMuseum and Artifact Hub in the chart ecosystem. (Hint: self-hosted private repo vs. central public discovery/index.)
Answers
- Running a self-hosted (often private) Helm chart repository server that stores and serves charts over HTTP with an auto-generated index, backed by various storage backends.
- Artifact Hub is a CNCF-hosted central website for discovering Kubernetes/cloud-native packages. It does not host the charts themselves — it indexes/links to the repositories that do.
- Any two: Operators (OLM), Kustomize bases, OPA/Gatekeeper or Kyverno policies, Falco rules, Tekton tasks, CoreDNS plugins, container images, etc.
- ChartMuseum is infrastructure you run to host and serve your own (typically private) charts — a repository server. Artifact Hub is a public discovery catalog/index where users find publicly available charts and other artifacts and learn which repository to add. One is about hosting/distribution within your control; the other is about searchability/discovery across the broader ecosystem. (OCI registries increasingly fulfill ChartMuseum's hosting role.)
11.4 Kustomize
Kustomize is the template-free alternative built into kubectl. This subchapter covers its overlay model, patches, generators, and usage.
Kustomize overlays and bases
Theory
Kustomize customizes plain Kubernetes YAML without templates, using a base + overlays model:
- A base is a directory of complete, valid manifests plus a kustomization.yaml listing them: the common configuration shared across environments.
- An overlay is a directory with its own kustomization.yaml that references the base (resources: [ ../../base ]) and applies environment-specific modifications (patches, name prefixes, replica counts, images, extra resources).
Because the base is real YAML, you can kubectl apply -f it directly, and overlays only express the differences per environment (dev/staging/prod). This keeps each environment DRY and the differences explicit and reviewable. You build the final manifests with kubectl kustomize <overlay> (render) or kubectl apply -k <overlay> (render + apply). The philosophy is "declarative, template-free customization through composition," contrasting sharply with Helm's templating.
Example
app/
  base/
    kustomization.yaml      # resources: deployment.yaml, service.yaml
    deployment.yaml         # plain, valid YAML (replicas: 1)
    service.yaml
  overlays/
    prod/
      kustomization.yaml    # references ../../base + prod patches
      replicas-patch.yaml   # replicas: 5
# overlays/prod/kustomization.yaml
resources:
  - ../../base
namePrefix: prod-
patches:
  - path: replicas-patch.yaml
images:
  - name: myapp
    newTag: "2.0"
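The tree above mentions two files it does not show. A minimal sketch of what they might contain follows; the Deployment name web is an assumption (the example does not name it), and the patch must use whatever name the Deployment has in the base.

```yaml
# base/kustomization.yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - deployment.yaml
  - service.yaml
---
# overlays/prod/replicas-patch.yaml (a strategic merge patch)
# "web" is an assumed name; match it to the Deployment in base/deployment.yaml.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 5
```

The patch carries only identity (apiVersion, kind, name) plus the one field that differs in prod, which is what keeps the overlay small and reviewable.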
Exercises
- (Beginner) What is the difference between a base and an overlay in Kustomize?
- (Beginner) Which command renders a Kustomize overlay, and which renders + applies it?
- (Intermediate) Why can a Kustomize base be applied directly with kubectl apply -f, unlike a Helm chart's templates?
- (Interview) How does the base+overlay model keep multi-environment configuration DRY while keeping differences explicit? (Hint: shared base, overlays express only deltas.)
Answers
- A base is a set of complete, valid manifests representing the shared/common configuration; an overlay references a base and applies environment-specific modifications (patches, prefixes, image tags, extra resources) — expressing only the differences.
- kubectl kustomize <dir> renders to YAML; kubectl apply -k <dir> renders and applies.
- Because Kustomize bases are plain, valid Kubernetes YAML (no templating placeholders), they can be applied as-is. Helm chart templates contain Go-template directives and are not valid YAML until rendered with values, so they can't be applied directly.
- The common configuration lives once in the base; each overlay references that base and contains only the deltas for its environment (e.g., prod sets replicas and image tag). This avoids duplicating full manifests per environment (DRY) while making each environment's differences small, explicit, and easy to review in isolation — you can see exactly how prod differs from the base without scanning entire duplicated files.
Patches: strategic merge and JSON6902
Theory
The heart of Kustomize customization is patching existing resources from the base. Two patch styles:
- Strategic Merge Patch: a partial YAML document that merges into the target resource. You specify only the fields you want to change/add, matched by the resource's identity. Intuitive for most edits (change replicas, add an env var, set resources). It understands Kubernetes' merge semantics (e.g., merging list items by key like container name).
- JSON Patch (RFC 6902): a list of explicit operations (op: replace/add/remove, with a path) applied to the resource. More precise and powerful for surgical changes, especially array element manipulation by index or removing a field, where strategic merge is awkward.
Use strategic merge for the common "override/add a few fields" case; reach for JSON6902 when you need exact operations (delete a specific item, edit by array index). Both are declared under patches: in kustomization.yaml, targeting resources by kind/name (and optionally label selectors).
Example
# Strategic merge patch: change replicas and add an env var
apiVersion: apps/v1
kind: Deployment
metadata: { name: web }
spec:
  replicas: 5
  template:
    spec:
      containers:
        - name: web   # matched by name; fields merged
          env:
            - { name: LOG_LEVEL, value: debug }
---
# JSON 6902 patch (in kustomization.yaml): surgical operations
patches:
  - target: { kind: Deployment, name: web }
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 5
      - op: remove
        path: /spec/template/spec/containers/0/livenessProbe
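The theory above notes that a patch target can also be chosen by label selector rather than by name. A small sketch of that form, assuming the base Deployments carry a tier=backend label (that label is invented for illustration):

```yaml
# kustomization.yaml: apply one JSON 6902 patch to every Deployment labeled tier=backend
# (the label tier=backend is an assumption, not part of the example above)
patches:
  - target:
      kind: Deployment
      labelSelector: "tier=backend"
    patch: |-
      - op: replace
        path: /spec/replicas
        value: 3
```

Selecting by label lets a single patch cover several resources, at the cost of being less explicit about exactly which objects change, so rendering with kubectl kustomize before applying is a sensible check.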
Exercises
- (Beginner) What does a strategic merge patch let you specify?
- (Beginner) What does a JSON 6902 patch consist of?
- (Intermediate) Give a case where JSON6902 is more suitable than a strategic merge patch.
- (Interview) How does a strategic merge patch handle list items (like containers) differently from a naive merge, and why does that matter? (Hint: merge-by-key, e.g., container name.)
Answers
- Only the fields you want to add or change, as a partial version of the target resource; Kustomize merges them into the base resource.
- An ordered list of explicit operations (e.g., op: add/replace/remove), each with a path (and value where applicable), applied to the resource.
- When you need surgical/precise changes that merge semantics handle poorly, e.g., removing a specific field, replacing or deleting an element of an array by index, or modifying a list where there's no merge key. JSON6902's explicit path-based operations express these precisely.
- Strategic merge uses Kubernetes' schema-aware merge semantics: lists with a merge key (e.g., containers keyed by name) are merged by matching that key, so a patch updates the matching container rather than replacing the entire list. This matters because a naive merge would overwrite or duplicate the whole list; merge-by-key lets you change one container's fields (or add a new one) while leaving the others intact, which is what you almost always want.
ConfigMap and Secret generators
Theory
A standout Kustomize feature is generators for ConfigMaps and Secrets. Instead of hand-writing a ConfigMap/Secret manifest, you declare a configMapGenerator/secretGenerator that builds it from literals, files, or env files. Crucially, generators append a content-based hash suffix to the generated object's name (e.g., app-config-7f8c9d2b1a) and automatically update references to it (in Deployments, etc.).
This elegantly solves the config-change rollout problem from Chapter 5: when the ConfigMap's content changes, its name changes (new hash), which changes the Pod template that references it, which triggers a rolling update — so config changes automatically and safely propagate to running Pods, with no manual restart and no separate checksum-annotation trick. You can disable hashing if you need stable names (generatorOptions: disableNameSuffixHash: true). Generators are a major ergonomic and correctness win unique to the Kustomize workflow.
Example
# kustomization.yaml
configMapGenerator:
  - name: app-config
    literals:
      - LOG_LEVEL=info
    files:
      - app.properties
secretGenerator:
  - name: db-creds
    literals:
      - password=s3cr3t
# Generated names get a hash suffix, e.g. app-config-7f8c9d2b1a,
# and references to "app-config" in Deployments are rewritten automatically.
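To make the reference rewriting concrete, here is a sketch of a Deployment that could sit in the same kustomization as the generators above. The Deployment name, labels, and image are assumptions for illustration; the point is that the manifest uses the generator's unsuffixed name and Kustomize substitutes the hashed one at build time.

```yaml
# deployment.yaml (listed under resources: in the same kustomization)
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web                      # assumed name
spec:
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: myapp           # assumed image
          env:
            - name: LOG_LEVEL
              valueFrom:
                configMapKeyRef:
                  name: app-config   # rendered as app-config-<hash> by Kustomize
                  key: LOG_LEVEL
---
# kustomization.yaml fragment: opt out of hash suffixes when stable names are required
generatorOptions:
  disableNameSuffixHash: true
```

With hashing disabled the name stays app-config, so a content change no longer alters the Pod template and no rollout is triggered automatically.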
Exercises
- (Beginner) What do Kustomize ConfigMap/Secret generators create, and from what sources?
- (Beginner) What does Kustomize append to a generated ConfigMap/Secret's name?
- (Intermediate) How does the name-hash mechanism cause config changes to roll out automatically?
- (Interview) Compare Kustomize's generator-hash approach to the Helm "checksum annotation" pattern for triggering rollouts on config change. (Hint: both change the Pod template; generators also rename + update references automatically.)
Answers
- They generate ConfigMaps/Secrets from literals, files, or env files (so you don't hand-write the manifest), and Kustomize manages them as part of the build.
- A content-based hash suffix on the name (e.g., app-config-7f8c9d2b1a), and it rewrites references to use the hashed name.
- When the generated object's content changes, its hash (and thus name) changes; Kustomize updates the references in workloads to the new name, which alters the Pod template spec. A changed Pod template triggers a Deployment rolling update, so Pods are recreated and pick up the new config automatically and safely.
- Both work by changing the Pod template so the Deployment performs a rolling update on config change. Helm's pattern requires you to manually add a checksum/config annotation computed from the ConfigMap and keep references in sync. Kustomize's generators do this automatically: they hash content, rename the object, and rewrite all references, with no manual annotation or reference plumbing. This makes safe config rollouts the default behavior with less boilerplate (at the cost of changing object names, which you can disable if needed).
Kustomize with kubectl
Theory
A major practical advantage of Kustomize is that it's built into kubectl, so there is no separate tool to install. A flag and a subcommand expose it:
- kubectl apply -k <dir>: build the kustomization in <dir> and apply the result to the cluster.
- kubectl kustomize <dir>: build and print the rendered YAML to stdout (for inspection, diffing, or piping).
A caveat: the Kustomize version embedded in kubectl can lag behind the standalone kustomize CLI, so for the newest features you may install the standalone binary. Kustomize also integrates deeply with GitOps tools — Argo CD and Flux natively render kustomizations — making it a first-class citizen of declarative, Git-driven workflows (Chapter 12). The combination of "no extra tooling, plain YAML, native GitOps support" is why many teams choose Kustomize for managing their own applications, reserving Helm for packaged/third-party software.
Example
kubectl kustomize overlays/prod # render the prod overlay to YAML
kubectl apply -k overlays/prod # render + apply to the cluster
kubectl kustomize overlays/prod | kubectl diff -f - # preview changes
# For the latest features, the standalone CLI:
kustomize build overlays/prod | kubectl apply -f -
Exercises
- (Beginner) Which kubectl command applies a kustomization, and which just renders it?
- (Beginner) Do you need to install a separate tool to use Kustomize with kubectl?
- (Intermediate) Why might you install the standalone kustomize CLI even though kubectl includes it?
- (Interview) Why is Kustomize's combination of "built into kubectl, plain YAML, native GitOps support" appealing for managing your own applications? (Hint: no extra tooling/templating; Argo CD/Flux render it natively.)
Answers
- kubectl apply -k <dir> applies it; kubectl kustomize <dir> renders it to stdout.
- No. Kustomize is built into kubectl (-k / kubectl kustomize).
- The Kustomize version bundled in kubectl often lags the standalone CLI, so to use the latest Kustomize features/fixes you install the separate kustomize binary (kustomize build).
- It requires no extra tooling (already in kubectl) and no templating language (manifests stay plain, valid, readable YAML that's easy to diff and review), lowering the learning curve and complexity. And because GitOps controllers (Argo CD, Flux) render kustomizations natively, your overlays plug directly into declarative, Git-driven deployment pipelines. For teams managing their own apps across environments, this yields a simple, transparent, Git-friendly workflow without Helm's templating overhead, while Helm remains better for packaging/distributing software.
12. CI/CD and GitOps
Building images and writing manifests is only half the story — you need a reliable, automated path from a code commit to a running workload. This chapter covers continuous integration/delivery pipelines for Kubernetes, then the GitOps paradigm that makes Git the single source of truth and uses in-cluster agents to continuously reconcile the cluster to it, with deep dives into the two leading GitOps tools, Argo CD and Flux.
12.1 CI/CD Pipelines for Kubernetes
This subchapter covers the mechanics of pipelines: building images, deploying from CI, and advanced rollout strategies.
Building and pushing container images in CI
Theory
The first job of a Kubernetes CI pipeline is to turn source code into a container image and push it to a registry the cluster can pull from. On a commit, CI checks out the code, builds the image, tags it, and pushes it. The tagging strategy is critical: avoid the mutable :latest tag for deployments; instead tag with an immutable, traceable identifier — the Git commit SHA (and/or a semantic version) — so each image maps to exact source and deployments are reproducible.
Building images inside a containerized/Kubernetes CI environment has a wrinkle: you often can't (or shouldn't) use a privileged Docker daemon ("Docker-in-Docker" needs privileged mode, a security risk). Daemonless builders solve this — Kaniko, BuildKit/buildx, or Buildah build images inside a container without a privileged Docker daemon. The pipeline then authenticates to the registry (via short-lived credentials/OIDC where possible) and pushes the tagged image, optionally signing it (cosign) and scanning it (Trivy) as covered in Chapter 9.
Example
# GitHub Actions: build and push tagged by commit SHA
- uses: docker/build-push-action@v5
  with:
    push: true
    tags: registry.example.com/myapp:${{ github.sha }}   # immutable, traceable
# Daemonless build with Kaniko (no privileged Docker needed), in-cluster:
/kaniko/executor --dockerfile=Dockerfile --context=. \
  --destination=registry.example.com/myapp:$GIT_SHA
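A pipeline can enforce the tagging rule described above instead of relying on convention. The helper below is a hypothetical sketch (image_ref is not part of any tool): it builds the image reference and refuses anything that is not a full 40-character hexadecimal commit SHA, which assumes a SHA-1 Git repository.

```shell
# image_ref REGISTRY NAME SHA
# Prints REGISTRY/NAME:SHA, or fails if SHA is not a full 40-character hex commit SHA.
image_ref() {
  registry="$1"; name="$2"; sha="$3"
  # Reject any character outside lowercase hex.
  case "$sha" in
    *[!0-9a-f]*) valid=no ;;
    *) valid=yes ;;
  esac
  if [ "$valid" = no ] || [ "${#sha}" -ne 40 ]; then
    echo "image_ref: '$sha' is not a full 40-character commit SHA" >&2
    return 1
  fi
  printf '%s/%s:%s\n' "$registry" "$name" "$sha"
}
```

A build step could then call image_ref registry.example.com myapp "$(git rev-parse HEAD)" and pass the result to the builder, so a mutable tag such as latest fails the job early.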
Exercises
- (Beginner) What is the first job of a Kubernetes CI pipeline?
- (Beginner) Why should you avoid tagging deployment images with :latest?
- (Intermediate) Why is "Docker-in-Docker" problematic in containerized CI, and what tools avoid it?
- (Interview) Why is tagging images by Git commit SHA important for reproducibility and rollback? (Hint: immutable mapping image↔source; deterministic deploys.)
Answers
- To build a container image from the source code and push it to a registry the cluster can pull from.
- :latest is mutable (it can point to different images over time), so deployments aren't reproducible, rollbacks are ambiguous, and caching/pull behavior is unreliable. Immutable, specific tags (commit SHA / version) are required for deterministic deploys.
- Docker-in-Docker requires a privileged container (access to the Docker daemon/host), which is a significant security risk and often disallowed in shared CI/Kubernetes runners. Daemonless builders (Kaniko, BuildKit/buildx, Buildah) build images inside an unprivileged container without a Docker daemon, avoiding the privilege requirement.
- A commit SHA uniquely and immutably identifies the exact source state; tagging the image with it creates a one-to-one, permanent mapping between a running image and the code that produced it. This makes deployments fully reproducible (you always know what's running), enables precise rollback (redeploy the image tagged with the previous good SHA), and supports auditing/debugging (trace a production image back to its commit). Mutable tags break all of these guarantees.
Deploying to Kubernetes from CI (GitHub Actions, GitLab CI)
Theory
After the image is pushed, the pipeline deploys it. The traditional push-based CD approach: the CI runner authenticates to the cluster (kubeconfig/credentials) and runs kubectl apply, helm upgrade, or kustomize build | kubectl apply to update the workload — typically by bumping the image tag to the just-built SHA. Platforms like GitHub Actions and GitLab CI provide jobs/runners and secret storage to do this.
The key consideration is credential security and blast radius: push-based CD means your CI system holds credentials that can modify the production cluster — a high-value target. Best practices: use short-lived, scoped credentials (OIDC federation between CI and the cloud/cluster rather than long-lived kubeconfig secrets), least-privilege RBAC for the deploy identity, and separate credentials per environment. This push model is simple and common, but its security drawbacks (external system with cluster write access) are a major motivation for GitOps (next subchapter), where an in-cluster agent pulls changes instead.
Example
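# GitHub Actions: push-based deploy via Helm, image tag = commit SHA
deploy:
  steps:
    - uses: azure/setup-helm@v4
    - run: |
        # KUBECONFIG must be a file path, so write the secret to a file first
        # (assumes the secret stores the kubeconfig file's contents)
        printf '%s' "$KUBECONFIG_DATA" > "$RUNNER_TEMP/kubeconfig"
        chmod 600 "$RUNNER_TEMP/kubeconfig"
        export KUBECONFIG="$RUNNER_TEMP/kubeconfig"
        helm upgrade --install web ./chart -n prod \
          --set image.tag=${{ github.sha }} --atomic --wait
      env:
        KUBECONFIG_DATA: ${{ secrets.PROD_KUBECONFIG }}   # CI holds cluster credentials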
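For contrast with the stored-kubeconfig example above, the OIDC approach described in the theory might look like the following on GitHub Actions with an Amazon EKS cluster. The choice of AWS, the role ARN, the region, and the cluster name are all assumptions for illustration; other clouds offer equivalent actions.

```yaml
# GitHub Actions: OIDC-federated deploy to an EKS cluster (no stored kubeconfig)
# The role ARN, region, and cluster name below are placeholders.
deploy:
  runs-on: ubuntu-latest
  permissions:
    id-token: write     # lets the job request an OIDC token from GitHub
    contents: read
  steps:
    - uses: actions/checkout@v4
    - uses: aws-actions/configure-aws-credentials@v4
      with:
        role-to-assume: arn:aws:iam::123456789012:role/ci-deploy-prod
        aws-region: us-east-1
    - uses: azure/setup-helm@v4
    - run: |
        # Writes a kubeconfig that fetches short-lived tokens via the assumed role
        aws eks update-kubeconfig --name prod-cluster --region us-east-1
        helm upgrade --install web ./chart -n prod \
          --set image.tag=${{ github.sha }} --atomic
```

No long-lived cluster secret is stored in CI here. The job trades its OIDC identity for temporary credentials, and the assumed role still has to be granted least-privilege access inside the cluster through RBAC.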
Exercises
- (Beginner) In push-based CD, what does the CI runner do to deploy?
- (Beginner) Name two platforms commonly used to run Kubernetes CI/CD pipelines.
- (Intermediate) What is the main security concern with push-based deployment from CI?
- (Interview) Why are OIDC-federated short-lived credentials preferred over a stored long-lived kubeconfig for CI deploys? (Hint: no standing secret to leak; scoped, expiring access.)
Answers
- It authenticates to the cluster (using stored credentials) and applies changes (e.g., kubectl apply, helm upgrade, or kustomize build | kubectl apply), usually setting the new image tag (commit SHA).
- GitHub Actions and GitLab CI (also Jenkins, CircleCI, Argo Workflows, Tekton; any two).
- The CI system holds credentials with write access to the (production) cluster. CI is an externally-exposed, high-value target; if its secrets leak or a pipeline is compromised, an attacker gains the ability to modify the cluster — a large blast radius.
- A stored long-lived kubeconfig is a standing secret: if it leaks it grants persistent cluster access until manually rotated. OIDC federation instead has the CI job exchange its identity for a short-lived, scoped token at runtime — there's no durable credential sitting in CI secrets to steal, access expires quickly, and it can be tightly scoped (specific cluster/role/environment) and audited. This dramatically reduces the risk and blast radius of credential compromise.
Rolling deployments in pipelines
Theory
When a pipeline updates a Deployment's image, Kubernetes performs a rolling update (Chapter 4) by default — but a robust pipeline must verify the rollout succeeded and react if it didn't, rather than fire-and-forget. The pipeline should wait for the rollout to complete and fail the job (and ideally roll back) if the new version isn't healthy.
The key tools:
- kubectl rollout status deployment/<name>: blocks until the rollout succeeds or fails (with a timeout); the pipeline gates on this.
- kubectl rollout undo or helm upgrade --atomic: revert on failure. The pipeline runs rollout undo explicitly when the gate fails, whereas --atomic makes Helm roll back automatically.
- Readiness probes (Chapter 10): essential. The rollout only progresses as new Pods become ready, so well-configured probes are what make the rollout (and the pipeline's success/failure signal) meaningful.
The pattern: apply the change → rollout status --timeout → on failure, roll back and fail the pipeline. This turns Kubernetes' built-in rolling update into a safe, gated CD step with automatic recovery, rather than blindly applying and hoping.
Example
# In the pipeline: apply, then gate on rollout health, roll back on failure
kubectl set image deployment/web web=registry.example.com/myapp:$GIT_SHA -n prod
if ! kubectl rollout status deployment/web -n prod --timeout=300s; then
  echo "Rollout failed, rolling back"
  kubectl rollout undo deployment/web -n prod
  exit 1
fi
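The gate above is only as good as the Deployment's own health configuration. As a sketch, a Deployment set up to make the rollout signal meaningful might look like this; the health endpoint, port, replica count, and timings are assumptions, not recommendations.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: prod
spec:
  replicas: 3
  progressDeadlineSeconds: 300     # rollout is marked failed if it stalls this long
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1                  # add at most one extra Pod during the update
      maxUnavailable: 0            # never drop below the desired ready count
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: registry.example.com/myapp:GIT_SHA   # placeholder; the pipeline sets the real tag
          readinessProbe:
            httpGet:
              path: /healthz       # assumed health endpoint
              port: 8080           # assumed container port
            periodSeconds: 5
            failureThreshold: 3
```

With maxUnavailable: 0, old Pods are removed only after a new Pod passes its readiness probe, so a broken image stalls the rollout and the rollout status gate fails instead of taking capacity away.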
Exercises
- (Beginner) What command makes a pipeline wait for a rollout to finish?
- (Beginner) Why are readiness probes essential to a meaningful rolling deployment?
- (Intermediate) Describe the apply → verify → recover pattern for a safe pipeline deploy.
- (Interview) Why is "fire-and-forget" kubectl apply insufficient for production CD, and what makes a rollout gate reliable? (Hint: no health verification; rollout status + probes + auto-rollback.)
Answers
- kubectl rollout status deployment/<name> (typically with --timeout).
- Because a rolling update only proceeds (and is judged successful) as new Pods become ready; readiness probes define "ready." Without them, Kubernetes considers a Pod available as soon as it starts, so the rollout/pipeline could report success even though the app isn't actually serving, which defeats the verification.
- Apply the change (new image), then verify with kubectl rollout status --timeout to wait for the new Pods to become ready; if it fails (times out / Pods unhealthy), recover by rolling back (kubectl rollout undo or helm --atomic) and fail the pipeline. This gates success on actual rollout health and auto-reverts bad deploys.
- Plain kubectl apply returns success once the object is accepted by the API server, not when the new version is actually running and healthy, so a broken deploy can silently "succeed." A reliable gate combines rollout status (waits for real readiness within a timeout), correctly-configured readiness probes (so "ready" means "serving"), and automatic rollback on failure. Together these make the pipeline's success signal reflect genuine application health and prevent bad versions from staying live.
Canary and blue-green deployments
Theory
Rolling updates replace all Pods over time, but two advanced strategies give finer control over risk during releases:
- Blue-green: run two complete environments — blue (current) and green (new). Deploy the new version to green while blue serves all traffic; once green is verified, switch all traffic (e.g., repoint the Service/Ingress) from blue to green instantly. Rollback is instant (switch back). Cost: you run double the capacity during the transition, and the cutover is all-at-once.
- Canary: route a small percentage of traffic (e.g., 5%) to the new version while the rest stays on the old, monitor metrics (errors, latency), and gradually increase the new version's share if healthy (or roll back if not). Limits blast radius and enables data-driven promotion, but requires traffic-splitting and good metrics.
Kubernetes Deployments don't do these natively (they only do rolling/recreate). You implement them with multiple Deployments + Service switching, an Ingress/service-mesh traffic split, or — most practically — progressive-delivery controllers like Argo Rollouts or Flagger, which automate canary/blue-green with metric analysis and automatic promotion/rollback.
Example
Blue-Green:
  [blue v1]  <- 100% traffic
  [green v2] (verify, 0%)
  switch -> [green v2] <- 100%
Canary:
  [v1] <- 95% ---\
  [v2] <-  5% ----> monitor metrics
  healthy? -> 25% -> 50% -> 100% (else rollback)
# Argo Rollouts: a canary strategy with automated steps + analysis
spec:
  strategy:
    canary:
      steps:
        - setWeight: 5
        - pause: { duration: 5m }   # observe metrics
        - setWeight: 25
        - pause: { duration: 5m }
        - setWeight: 50
        - pause: {}                 # manual or analysis-driven promotion
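The canary half has an example above; the blue-green half can be sketched with plain Kubernetes objects. This assumes two Deployments (not shown) whose Pods are labeled version: blue and version: green; the label key, names, and ports are illustrative.

```yaml
# One Service fronts both Deployments; the selector decides which one receives traffic.
apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
    version: blue      # change to "green" to cut over; back to "blue" to roll back
  ports:
    - port: 80
      targetPort: 8080   # assumed container port
```

The cutover is then a single selector change, for example kubectl patch service web -p '{"spec":{"selector":{"app":"web","version":"green"}}}', and rollback is the same command with blue. Both Deployments run at full size during the switch, which is the doubled-capacity cost noted above.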
Exercises
- (Beginner) In a blue-green deployment, how does traffic move to the new version?
- (Beginner) What does a canary deployment do with traffic initially?
- (Intermediate) What is a trade-off of blue-green versus canary?
- (Interview) Native Kubernetes Deployments don't support canary/blue-green directly. What tools provide progressive delivery, and what do they automate? (Hint: Argo Rollouts/Flagger; traffic shifting + metric analysis + auto promote/rollback.)
Answers
- The new version (green) is deployed alongside the old (blue) with no traffic; once verified, all traffic is switched at once from blue to green (e.g., by repointing the Service/Ingress). Rollback is an instant switch back.
- It sends only a small percentage of traffic to the new version (the canary) while the majority stays on the old version, then increases the share gradually based on observed health.
- Blue-green needs double the capacity during the transition and cuts over all-at-once (instant rollback but a big-bang switch and higher cost). Canary uses less extra capacity and limits blast radius via gradual exposure, but is slower to fully roll out and requires traffic-splitting and reliable metrics to drive promotion. (Either contrast acceptable.)
- Progressive-delivery controllers like Argo Rollouts and Flagger add canary/blue-green to Kubernetes. They automate traffic shifting (via Service/Ingress or a service mesh), run metric analysis (query Prometheus/etc. for error rate/latency at each step), and automatically promote the release when healthy or roll back when not — turning manual, scripted strategies into safe, automated, data-driven rollouts.
12.2 GitOps Principles
GitOps reframes deployment around Git as the source of truth and continuous reconciliation. This subchapter covers the core concepts.
GitOps core concepts
Theory
GitOps is an operational model where the desired state of your entire system (applications and infrastructure) is declared in Git, and an automated agent continuously makes the live system match what's in Git. It applies the principles that made application code management reliable — version control, review, history — to operations.
The four widely-cited GitOps principles (per OpenGitOps):
- Declarative: the whole system is described declaratively (Kubernetes manifests, Helm, Kustomize).
- Versioned and immutable: desired state is stored in Git — versioned, with full history and immutability of past states.
- Pulled automatically: software agents automatically pull the declared state from Git.
- Continuously reconciled: agents continuously observe actual state and reconcile it toward the desired state, correcting drift.
The payoff: Git becomes the single source of truth, every change is a reviewed/auditable commit (and revertable), deployments are reproducible, and the cluster self-heals toward the declared state. GitOps is essentially Kubernetes' reconciliation loop (Chapter 1) extended to your whole deployment, driven by Git.
Example
Developer --commit/PR--> Git repo (desired state, manifests)
                              |
               GitOps agent (in cluster) pulls + watches
                              |
               reconcile: make cluster == Git (continuously)
                              |
               Kubernetes cluster (actual state)
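As one concrete form of the agent in the diagram above, an Argo CD Application resource could declare what to pull from Git and how to reconcile it. This is a sketch; the repository URL, path, and names are placeholders.

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata:
  name: web
  namespace: argocd
spec:
  project: default
  source:
    repoURL: https://git.example.com/org/deploy-repo.git   # placeholder repository
    targetRevision: main
    path: overlays/prod          # e.g., a Kustomize overlay in that repository
  destination:
    server: https://kubernetes.default.svc   # the cluster Argo CD itself runs in
    namespace: prod
  syncPolicy:
    automated:
      prune: true      # delete resources that were removed from Git
      selfHeal: true   # revert changes made directly in the cluster
```

Each field maps onto a principle: source is the versioned desired state, automated means the agent applies changes without an external push, and selfHeal is continuous reconciliation against drift.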
Exercises
- (Beginner) In GitOps, where is the desired state stored?
- (Beginner) List the four core GitOps principles.
- (Intermediate) How does GitOps make deployments auditable and revertable?
- (Interview) How is GitOps essentially an extension of Kubernetes' own reconciliation model, and what does that buy operationally? (Hint: declared desired state + continuous reconcile; self-healing, single source of truth.)
Answers
- In Git (a Git repository holds the declarative manifests/config that define the system).
- Declarative; versioned and immutable (in Git); pulled automatically by agents; continuously reconciled toward desired state.
- Every change to the system is made as a Git commit (typically via pull request), so you get full version history, review/approval, and attribution for who changed what and when (auditability). Reverting is just reverting the commit — the agent then reconciles the cluster back to the prior declared state — making rollbacks simple and reliable.
- Kubernetes controllers already work by continuously reconciling actual state toward a declared desired state stored in the API. GitOps moves the authoritative desired state out to Git and adds an agent that reconciles the cluster to it. Operationally this buys a single source of truth (Git), self-healing against drift (manual or accidental changes are reverted toward Git), reproducible and auditable deployments (commit history), easy rollback (git revert), and the same robust level-triggered convergence model applied to your whole deployment, not just individual controllers.
Git as the single source of truth
Theory
The central tenet of GitOps is that Git is the single source of truth for the desired state of the system. The cluster should reflect exactly what's in Git — nothing more, nothing less. This has powerful consequences for how teams operate:
- All changes go through Git: you deploy by committing/merging, not by running ad-hoc kubectl apply. This gives every change a review (pull request), an approver, a timestamp, and a revert path.
- No manual drift: direct changes to the cluster (someone kubectl edit-ing a Deployment) are considered drift and will be reverted by the agent (or at least detected). The cluster is not authoritative; Git is.
- Reproducibility & DR: because the entire desired state lives in Git, you can recreate the cluster's workloads from scratch by pointing an agent at the repo, which is invaluable for disaster recovery and spinning up new clusters.
This requires discipline: emergency manual fixes must be backported to Git, or they'll be reverted. The benefit is a fully auditable, reproducible, review-gated operational model where Git history is your deployment history.
Example
WRONG (imperative, drift-prone):
$ kubectl edit deployment web # ad-hoc change, no record, gets reverted
RIGHT (GitOps):
edit manifest in repo -> open PR -> review/approve -> merge
-> agent reconciles cluster to match -> change is live + auditable
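In a pipeline, the "edit manifest in repo" step is often automated: CI bumps the image tag in the deployment repository and opens a pull request rather than touching the cluster. The helper below is a hypothetical sketch, assuming a Kustomize overlay with a single newTag: line and a tag made of plain hexadecimal characters.

```shell
# set_image_tag FILE TAG
# Rewrites every "newTag:" line in FILE to the given tag, preserving indentation.
# Assumes TAG contains no characters special to sed (e.g., a hex commit SHA).
set_image_tag() {
  file="$1"; tag="$2"
  tmp="$(mktemp)" || return 1
  if sed "s|^\([[:space:]]*newTag:\).*|\1 \"$tag\"|" "$file" > "$tmp"; then
    cat "$tmp" > "$file"     # overwrite in place, keeping the file's permissions
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

After calling set_image_tag overlays/prod/kustomization.yaml "$GIT_SHA", the pipeline would commit the change on a branch and open a pull request; once merged, the in-cluster agent performs the deploy. Where the standalone CLI is available, kustomize edit set image does the same job more robustly.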
Exercises
- (Beginner) In GitOps, how should you make a change to a deployment?
- (Beginner) What happens to a manual kubectl edit change under strict GitOps?
- (Intermediate) How does Git-as-source-of-truth help with disaster recovery?
- (Interview) Why must emergency manual fixes be backported to Git in a GitOps model, and what's the discipline cost/benefit? (Hint: agent reverts drift; auditable consistency vs. process overhead.)
Answers
- By committing the change to the Git repository (typically via a reviewed pull request that is merged) — not by running ad-hoc commands against the cluster.
- It's treated as drift: the GitOps agent detects it and (if auto-sync/self-heal is on) reverts the cluster back to match Git, undoing the manual change. Git, not the live cluster, is authoritative.
- The complete desired state (all manifests/config) lives in Git, so to recover you point a GitOps agent at the repo and it reconciles a fresh cluster to the declared state — recreating all workloads reproducibly without manual rebuilding. Git serves as a versioned, restorable definition of the whole system.
- Because the agent continuously reconciles the cluster to Git, any manual emergency fix that isn't reflected in Git will be reverted (and lost), reintroducing the incident. So fixes must be committed back to Git to persist. The cost is process discipline (even urgent changes need to land in Git), but the benefit is consistency and auditability: the cluster always matches a reviewed, version-controlled definition, there's no untracked drift, and history remains a complete, trustworthy record of every change.
Pull-based vs push-based deployments
Theory
This contrast is the crux of why GitOps differs from traditional CI/CD:
- Push-based (traditional CI/CD): an external system (the CI pipeline) has credentials to the cluster and pushes changes into it (kubectl apply, helm upgrade). The cluster is passive; the pipeline drives.
- Pull-based (GitOps): an agent running inside the cluster continuously pulls the desired state from Git and applies it. The cluster drives its own state; no external system needs cluster write credentials.
The pull model's advantages:
- Security: cluster credentials never leave the cluster — CI doesn't hold them, shrinking the attack surface (CI only needs write access to Git, not to the cluster). The agent's permissions stay internal.
- Drift correction: because the agent continuously reconciles (not just on pipeline runs), it detects and fixes drift automatically.
- Scalability: many clusters can each run an agent pulling from Git, rather than CI pushing to each (better for multi-cluster).
The trade-off: pull-based requires running and operating the agent, and feedback (did my change deploy?) is slightly more indirect. GitOps tools (Argo CD, Flux) implement the pull model.
Example
Push-based (CI/CD):
  CI (holds cluster creds) --kubectl/helm--> Cluster     [creds held outside the cluster; one-shot]
Pull-based (GitOps):
  CI --> Git (commit)
  Cluster's agent --pull--> Git --reconcile--> Cluster   [creds stay in cluster; continuous]
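The pull side of the diagram above can be made concrete with Flux, where the in-cluster agent is configured by two resources: one naming the Git source and one naming what to apply from it. The URL, names, and path are placeholders.

```yaml
# Where to pull from: polled by the agent running inside the cluster
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: deploy-repo
  namespace: flux-system
spec:
  interval: 1m                   # how often to check Git for new commits
  url: https://git.example.com/org/deploy-repo.git   # placeholder repository
  ref:
    branch: main
---
# What to apply: a path in that repository, reconciled continuously
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: web-prod
  namespace: flux-system
spec:
  interval: 10m                  # re-apply periodically, which also corrects drift
  sourceRef:
    kind: GitRepository
    name: deploy-repo
  path: ./overlays/prod
  prune: true                    # remove objects that were deleted from Git
```

No credential for this cluster appears outside it: CI only needs permission to push commits to the repository. Note that Flux's Kustomization kind is a separate API from the kustomization.yaml file used by Kustomize itself.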
Exercises
- (Beginner) In a pull-based model, what runs inside the cluster and what does it do?
- (Beginner) In push-based CD, who holds the cluster credentials?
- (Intermediate) Give two advantages of the pull model over the push model.
- (Interview) Why is the pull model more secure and better suited to multi-cluster than the push model? (Hint: credentials stay in-cluster; each cluster self-pulls vs. CI pushing to all.)
Answers
- A GitOps agent/controller runs in the cluster; it continuously pulls the desired state from Git and reconciles the cluster to match it.
- The external CI/CD system (the pipeline) holds credentials that allow it to write to the cluster.
- Any two: better security (cluster credentials stay inside the cluster; CI only needs Git access); automatic, continuous drift detection/correction (not just on pipeline runs); better multi-cluster scalability (each cluster runs its own agent pulling from Git); and natural disaster recovery (agent rebuilds state from Git).
- Security: in pull mode the agent's cluster credentials never leave the cluster — external CI doesn't hold cluster write access, so compromising CI doesn't directly grant cluster control (CI only writes to Git). Multi-cluster: instead of CI maintaining credentials for and pushing to every cluster (a growing, fragile, high-privilege fan-out), each cluster independently runs an agent that pulls from Git and self-reconciles. This scales cleanly to many clusters, keeps per-cluster credentials local, and means adding a cluster is just deploying another agent pointed at the repo.
Reconciliation and drift detection
Theory
The engine of GitOps is continuous reconciliation: the agent repeatedly compares the desired state (Git) with the actual state (cluster) and acts to eliminate any difference. A difference between the two is drift, which arises when someone changes the cluster directly, when a resource is deleted/modified out-of-band, or when Git is updated.
The agent's loop:
- Observe actual cluster state and fetch desired state from Git.
- Diff them to detect drift (per-resource: in-sync, out-of-sync, missing, extra).
- Reconcile — depending on policy, either report the drift or auto-correct it by applying Git's state (self-healing).
This is the same level-triggered reconciliation pattern as Kubernetes controllers (Chapter 1), now spanning Git→cluster. Drift detection gives visibility ("the cluster no longer matches Git"); self-healing (optional) automatically reverts unauthorized changes. Together they keep the live system continuously, verifiably aligned with the reviewed source of truth — catching both accidental changes and unauthorized tampering, and making the cluster's state predictable and auditable at all times.
Example
Reconciliation loop (continuous):
  desired (Git)   vs   actual (cluster)
        |                    |
        +------- diff -------+
                  |
     in-sync? -> do nothing
     drift?   -> report (OutOfSync) and/or auto-apply Git state (self-heal)
# Argo CD reflects this: a change made directly in the cluster shows as OutOfSync,
# and with self-heal enabled is reverted to match Git automatically.
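The observe, diff, reconcile loop described above can be sketched in a few lines of Python. This is a minimal illustration, not how Argo CD or Flux is implemented: the function names (`diff_states`, `reconcile`), the flat `name -> spec` dictionaries, and the sample resources are all assumptions made for the example.

```python
from typing import Dict, Tuple


def diff_states(desired: Dict[str, dict], actual: Dict[str, dict]) -> Dict[str, str]:
    """Classify every resource as in-sync, out-of-sync, missing, or extra."""
    status = {}
    for name in sorted(set(desired) | set(actual)):
        if name not in actual:
            status[name] = "missing"        # declared in Git, absent from the cluster
        elif name not in desired:
            status[name] = "extra"          # in the cluster, not declared in Git
        elif desired[name] == actual[name]:
            status[name] = "in-sync"
        else:
            status[name] = "out-of-sync"    # present in both, but the specs differ
    return status


def reconcile(desired: Dict[str, dict], actual: Dict[str, dict],
              self_heal: bool = False, prune: bool = False) -> Tuple[Dict[str, str], Dict[str, dict]]:
    """Run one pass: always report drift, and correct it only if policy allows."""
    report = diff_states(desired, actual)
    new_actual = dict(actual)
    if self_heal:
        for name, state in report.items():
            if state in ("missing", "out-of-sync"):
                new_actual[name] = desired[name]   # apply Git's version
            elif state == "extra" and prune:
                del new_actual[name]               # remove what Git no longer declares
    return report, new_actual


# Hypothetical states: someone scaled the Deployment by hand and left a debug ConfigMap.
desired = {"deploy/web": {"replicas": 3}, "svc/web": {"port": 80}}
actual = {"deploy/web": {"replicas": 5}, "cm/debug": {"level": "trace"}}

report, healed = reconcile(desired, actual, self_heal=True, prune=True)
detect_only_report, untouched = reconcile(desired, actual)
```

Because each pass compares full states rather than reacting to individual events, running `reconcile` again on `healed` reports everything as in-sync, which is the level-triggered property the text refers to. With `self_heal=False` the same pass is detection-only: the report is produced but the cluster state is returned unchanged.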
Exercises
- (Beginner) What is "drift" in GitOps?
- (Beginner) What are the two possible responses to detected drift?
- (Intermediate) How is GitOps reconciliation related to Kubernetes controllers' reconciliation?
- (Interview) What is the difference between drift detection and self-healing, and when might you want detection-only? (Hint: report vs. auto-revert; caution in some environments.)
Answers
- A difference between the desired state declared in Git and the actual state of the cluster — i.e., the cluster no longer matches what's defined in Git (from out-of-band changes, deletions, or new Git commits not yet applied).
- Report it (flag the resource as out-of-sync for humans to act on) or automatically correct it by applying the Git-declared state (self-heal).
- It's the same level-triggered reconciliation pattern: continuously observe actual state, compare to desired state, and act to close the gap. Kubernetes controllers reconcile actual cluster state toward desired state stored in the API; GitOps extends this so the authoritative desired state lives in Git and the agent reconciles the whole cluster toward it.
- Drift detection observes and reports differences (visibility) without changing anything; self-healing additionally auto-reverts the cluster to match Git. You might want detection-only when you need a human in the loop before changes (e.g., sensitive production where an automatic revert during an active incident or a legitimate emergency hotfix could be harmful), or while initially adopting GitOps and building trust — getting alerts on drift while controlling when reconciliation actually applies. Once confident, enabling self-heal gives automatic enforcement of the source of truth.
12.3 Argo CD
Argo CD is one of the two leading GitOps tools, known for its rich UI and application model. This subchapter covers its architecture and features.
Argo CD architecture
Theory
Argo CD is a declarative, pull-based GitOps continuous delivery tool for Kubernetes (a CNCF graduated project). It runs in the cluster, continuously monitors Git repositories of manifests, and reconciles the cluster to match — with a strong web UI (and CLI) for visualizing application state, sync status, and resource health.
Its main components:
- API Server: serves the UI/CLI/API, handling auth and operations.
- Repository Server: clones Git repos and renders manifests (running Helm/Kustomize/plain YAML) to produce the desired state.
- Application Controller: the reconciliation engine — compares desired (from repo server) vs. live cluster state, reports sync/health status, and applies changes.
- (Plus Redis for caching and Dex for optional SSO.)
Argo CD supports plain manifests, Helm, Kustomize, and jsonnet, and can manage many clusters from one Argo CD instance. Its hallmark is observability of deployments: a visual tree of every resource, its health, and whether it matches Git — making GitOps approachable and debuggable.
Example
        +-----------------------------------------------+
        |                    Argo CD                    |
UI/CLI->|  API Server  |  Repo Server (renders Git)     |
        |  Application Controller                       |--> target clusters
        +-----------------------------------------------+
              ^                          |
              | watch/clone              | reconcile
          Git repos              Kubernetes cluster(s)
argocd app list # list managed applications
argocd app get web # show sync + health status
argocd app sync web # trigger a sync to match Git
Exercises
- (Beginner) Where does Argo CD run, and what does it continuously do?
- (Beginner) Which Argo CD component renders manifests from Git (Helm/Kustomize/YAML)?
- (Intermediate) What does the Application Controller do?
- (Interview) What distinguishes Argo CD's user experience, and why is deployment observability valuable in GitOps? (Hint: visual resource tree, sync/health status, debuggability.)
Answers
- It runs inside the Kubernetes cluster and continuously monitors Git repositories, reconciling the cluster's state to match the declared desired state in Git.
- The Repository Server (it clones the repos and renders Helm/Kustomize/plain YAML/jsonnet into manifests).
- It's the reconciliation engine: it compares the desired state (rendered manifests from the repo server) against the live cluster state, determines sync and health status of each resource, and applies changes to converge the cluster toward Git (per the configured sync policy).
- Argo CD provides a rich web UI (and CLI/API) showing a visual tree of every application's resources with their health and sync status (in-sync/out-of-sync vs. Git). This observability makes GitOps approachable and debuggable — you can immediately see what's deployed, whether it matches Git, what drifted, and why something is unhealthy, and trigger/inspect syncs. In a model where the cluster self-reconciles, this visibility is crucial for understanding and trusting what the system is doing.
Application and AppProject resources
Theory
Argo CD is configured through its own CRDs, chiefly Application and AppProject:
- An Application is the core unit: it declares what to deploy and where. It points to a source (a Git repo, path, and revision — plus Helm/Kustomize config) and a destination (a target cluster and namespace), and it carries the sync policy. Argo CD then keeps that destination in sync with that source.
- An AppProject is a grouping and governance boundary for Applications. It restricts what Applications in the project may do: which source repos are allowed, which destination clusters/namespaces are permitted, and which resource kinds can be deployed — plus project-scoped RBAC roles. It's how you enforce multi-tenancy and guardrails (e.g., team A's apps can only deploy to team A's namespaces from team A's repos).
Together: Applications define individual deployments; AppProjects bound and govern groups of them. (The app-of-apps pattern and ApplicationSet, covered in Chapter 15, scale this to many apps/clusters.)
Example
apiVersion: argoproj.io/v1alpha1
kind: Application
metadata: { name: web, namespace: argocd }
spec:
  project: team-a
  source:
    repoURL: https://github.com/org/team-a-config
    path: apps/web/overlays/prod
    targetRevision: main
  destination:
    server: https://kubernetes.default.svc
    namespace: web
  syncPolicy:
    automated: { prune: true, selfHeal: true }
---
apiVersion: argoproj.io/v1alpha1
kind: AppProject
metadata: { name: team-a, namespace: argocd }
spec:
  sourceRepos: [ "https://github.com/org/team-a-config" ]   # allowed repos
  destinations:
    - { server: https://kubernetes.default.svc, namespace: "web*" }   # allowed targets
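To make the governance role of an AppProject concrete, here is a small Python sketch of the check it implies: an Application is admitted only if its source repo and its destination both match what the project allows. The function `project_allows` and the dictionary shapes are assumptions for illustration; Argo CD's real matching supports more (deny patterns, resource-kind allow/deny lists, cluster names), and this sketch uses plain shell-style globs as a stand-in.

```python
from fnmatch import fnmatchcase


def project_allows(project: dict, app: dict) -> bool:
    """Return True if the app's repo and destination fall inside the project's bounds."""
    repo_ok = any(fnmatchcase(app["repoURL"], pattern)
                  for pattern in project["sourceRepos"])
    dest_ok = any(
        fnmatchcase(app["server"], dest["server"])
        and fnmatchcase(app["namespace"], dest["namespace"])
        for dest in project["destinations"]
    )
    return repo_ok and dest_ok


# Mirrors the team-a AppProject from the example above.
team_a = {
    "sourceRepos": ["https://github.com/org/team-a-config"],
    "destinations": [{"server": "https://kubernetes.default.svc", "namespace": "web*"}],
}

web_app = {
    "repoURL": "https://github.com/org/team-a-config",
    "server": "https://kubernetes.default.svc",
    "namespace": "web",
}
# Hypothetical violations: another tenant's namespace, and an unapproved repo.
wrong_namespace = dict(web_app, namespace="payments")
wrong_repo = dict(web_app, repoURL="https://github.com/org/team-b-config")

results = [project_allows(team_a, a) for a in (web_app, wrong_namespace, wrong_repo)]
```

Here `results` is `[True, False, False]`: the first Application is within bounds, while the other two are rejected by the destination and source rules respectively.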
Exercises
- (Beginner) What does an Argo CD Application declare?
- (Beginner) What is the purpose of an AppProject?
- (Intermediate) Name three things an AppProject can restrict.
- (Interview) How do Application and AppProject together enable safe multi-tenancy in a shared Argo CD instance? (Hint: apps define deployments; projects bound allowed repos/destinations/kinds + RBAC.)
Answers
- What to deploy and where: a source (Git repo, path, revision, and Helm/Kustomize settings) and a destination (target cluster + namespace), plus the sync policy. Argo CD keeps that destination in sync with that source.
- To group Applications and act as a governance/security boundary — constraining and applying RBAC to the Applications within it (enabling multi-tenancy and guardrails).
- Any three: which source repositories are allowed, which destination clusters/namespaces are permitted, which resource kinds may be deployed (allow/deny lists), and project-scoped RBAC roles/permissions.
- In a shared Argo CD instance, each team's Applications belong to an AppProject that limits them to that team's allowed repos, destination clusters/namespaces, and resource kinds, with project-scoped RBAC controlling who can manage them. So even though many teams share one Argo CD, an Application can only deploy approved content from approved sources to approved targets — preventing one tenant from deploying to another's namespace, pulling from arbitrary repos, or creating disallowed resources. Applications provide the per-deployment definition; AppProjects provide the enforced boundaries around groups of them.
Sync policies and auto-sync
Theory
A sync policy governs how Argo CD applies Git changes to the cluster. The key choice is manual vs. automated sync:
- Manual sync: Argo CD detects drift/changes and marks the app OutOfSync, but waits for a human to trigger the sync (via UI/CLI). Good for gated production changes.
- Automated sync: Argo CD applies changes automatically when Git changes. Two important sub-options:
- selfHeal: true — also revert manual cluster changes back to Git (drift correction).
- prune: true — delete cluster resources that were removed from Git (without prune, deletions in Git aren't reflected — orphaned resources linger).
Additional features: sync waves (annotations to order resource application — e.g., CRDs/namespaces first), sync hooks (PreSync/PostSync jobs like migrations), and sync options (e.g., create namespaces, server-side apply). Choosing the policy is a balance: automated+selfHeal+prune gives full hands-off GitOps enforcement; manual gives control at the cost of automation. Many teams use automated sync for non-prod and gated/manual or PR-driven promotion for prod.
Example
spec:
  syncPolicy:
    automated:
      prune: true      # delete resources removed from Git
      selfHeal: true   # revert out-of-band cluster changes
    syncOptions:
      - CreateNamespace=true
      - ApplyOutOfSyncOnly=true
---
# Sync waves order resources (lower waves first):
metadata:
  annotations:
    argocd.argoproj.io/sync-wave: "-1"   # e.g., apply CRDs before workloads
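The effect of sync waves and `prune` can be sketched as a small planning function: order the Git-declared resources by their wave annotation (resources without one are treated as wave 0), and, if pruning is enabled, list live resources that Git no longer declares. The function names and sample resources are hypothetical; Argo CD's actual ordering also considers sync phases, resource kind, and name, which this sketch leaves out.

```python
WAVE_ANNOTATION = "argocd.argoproj.io/sync-wave"


def sync_wave(resource: dict) -> int:
    """Read the wave annotation; resources without one default to wave 0."""
    annotations = resource.get("metadata", {}).get("annotations", {})
    return int(annotations.get(WAVE_ANNOTATION, "0"))


def plan_sync(git_resources: list, live_names: list, prune: bool):
    """Return (apply order by ascending wave, names to delete when pruning)."""
    ordered = sorted(git_resources, key=sync_wave)   # stable: ties keep input order
    apply_order = [r["metadata"]["name"] for r in ordered]
    to_prune = sorted(set(live_names) - set(apply_order)) if prune else []
    return apply_order, to_prune


# Hypothetical resources declared in Git.
git_resources = [
    {"metadata": {"name": "web-deployment"}},
    {"metadata": {"name": "widgets-crd", "annotations": {WAVE_ANNOTATION: "-1"}}},
    {"metadata": {"name": "web-smoke-test", "annotations": {WAVE_ANNOTATION: "1"}}},
]
# Hypothetical live cluster: old-cronjob was removed from Git but still exists.
live_names = ["web-deployment", "old-cronjob"]

with_prune = plan_sync(git_resources, live_names, prune=True)
without_prune = plan_sync(git_resources, live_names, prune=False)
```

With pruning on, the plan applies `widgets-crd`, then `web-deployment`, then `web-smoke-test`, and deletes `old-cronjob`. With pruning off, the apply order is the same but `old-cronjob` lingers, which is the orphaned-resource case described in the theory.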
Exercises
- (Beginner) What is the difference between manual and automated sync?
- (Beginner) What does prune: true do?
- (Intermediate) What does selfHeal: true add, and how does it relate to drift?
- (Interview) Why might a team use automated sync in non-prod but manual/gated sync in prod, and what do sync waves solve? (Hint: control vs. automation; ordering dependent resources.)
Answers
- Manual sync detects changes/drift and marks the app OutOfSync but waits for a human to trigger applying them; automated sync applies Git changes to the cluster automatically without manual intervention.
- It deletes cluster resources that have been removed from Git, so the cluster doesn't accumulate orphaned resources that no longer exist in the source of truth.
- selfHeal: true makes Argo CD automatically revert out-of-band cluster changes (drift) back to the state declared in Git, so manual kubectl edits are undone and the cluster always matches Git, not just on Git changes but on cluster changes too.
- Non-prod benefits from full automation (fast, hands-off deploys; quick iteration), so automated sync is convenient and low-risk there. Production often needs a human gate/approval (or PR-based promotion) before changes go live, to control timing and reduce risk, so manual or gated sync is preferred. Sync waves solve ordering: they let you control the sequence in which resources are applied (e.g., CRDs, namespaces, and databases before the workloads that depend on them), preventing failures from applying dependent resources before their prerequisites exist.
Multi-cluster deployments with Argo CD
Theory
A single Argo CD instance can manage deployments to many clusters — a common "hub-and-spoke" pattern where one central Argo CD (the hub) deploys to multiple workload clusters (spokes). You register external clusters with Argo CD (storing their credentials as Secrets), and then an Application's destination.server targets a specific registered cluster.
This centralizes GitOps across a fleet: one place to see and control all deployments everywhere. Combined with ApplicationSet (Chapter 15), you can template one Application across many clusters automatically (e.g., "deploy this app to every cluster labeled env=prod"). Considerations: the hub Argo CD holds credentials to all managed clusters (a high-value target to secure), and you must decide between one central instance (simpler oversight, bigger blast radius) versus an Argo CD per cluster (more isolation, more to operate). Multi-cluster Argo CD is a foundational pattern for managing many environments/regions consistently from Git.
Example
# Register an external cluster with Argo CD (stores its creds):
argocd cluster add prod-eu-context
# An Application targeting that specific cluster:
# destination:
#   server: https://prod-eu.example.com   # the registered cluster's API
#   namespace: web
argocd app create web-eu --dest-server https://prod-eu.example.com \
  --repo https://github.com/org/config --path apps/web --dest-namespace web
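The idea of stamping one Application per matching cluster ("deploy this app to every cluster labeled env=prod") can be illustrated with a short Python sketch. It only generates Application-shaped dictionaries from a list of registered clusters; it is not the ApplicationSet controller, and the cluster names, labels, the `prod-us` URL, and the `default` project are assumptions made for the example.

```python
def applications_for(clusters: list, selector: dict, name: str,
                     repo_url: str, path: str, namespace: str) -> list:
    """Build one Application manifest per cluster whose labels match the selector."""
    apps = []
    for cluster in clusters:
        labels = cluster.get("labels", {})
        if all(labels.get(key) == value for key, value in selector.items()):
            apps.append({
                "apiVersion": "argoproj.io/v1alpha1",
                "kind": "Application",
                "metadata": {"name": f"{name}-{cluster['name']}", "namespace": "argocd"},
                "spec": {
                    "project": "default",  # assumed project for the sketch
                    "source": {"repoURL": repo_url, "path": path, "targetRevision": "main"},
                    "destination": {"server": cluster["server"], "namespace": namespace},
                },
            })
    return apps


# Hypothetical fleet registered with the hub Argo CD.
clusters = [
    {"name": "prod-eu", "server": "https://prod-eu.example.com", "labels": {"env": "prod"}},
    {"name": "prod-us", "server": "https://prod-us.example.com", "labels": {"env": "prod"}},
    {"name": "staging", "server": "https://staging.example.com", "labels": {"env": "staging"}},
]

apps = applications_for(clusters, {"env": "prod"}, "web",
                        "https://github.com/org/config", "apps/web", "web")
app_names = [a["metadata"]["name"] for a in apps]
```

Only the two clusters labeled `env=prod` receive an Application, each with its own `destination.server`. Registering a third prod cluster would add a third Application without anyone editing the template.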
Exercises
- (Beginner) Can one Argo CD instance deploy to multiple clusters?
- (Beginner) What does an Application's destination.server specify?
- (Intermediate) What is the "hub-and-spoke" Argo CD pattern?
- (Interview) What is the security/operational trade-off between one central Argo CD managing all clusters versus one Argo CD per cluster? (Hint: blast radius of stored creds vs. isolation and operational overhead.)
Answers
- Yes — a single Argo CD instance can manage and deploy to many registered clusters.
- The target cluster's API server (which registered cluster the Application deploys to), alongside the target namespace.
- A central Argo CD instance (the hub) manages deployments to multiple workload clusters (the spokes) from one place, providing unified GitOps control and visibility across the fleet.
- A central (hub) Argo CD must store credentials to every managed cluster, so compromising it endangers all clusters (large blast radius), though it offers single-pane oversight and less operational overhead. One Argo CD per cluster isolates credentials and failure/compromise to that cluster (smaller blast radius, stronger tenancy) but means operating and upgrading many Argo CD instances and losing the single unified view. The choice balances centralized convenience/visibility against isolation and the security risk of concentrating fleet-wide credentials.
Argo CD RBAC and SSO
Theory
Because Argo CD can deploy to production across clusters, controlling who can do what within Argo CD is critical. Argo CD has its own RBAC layer (separate from Kubernetes RBAC) governing actions on Argo CD objects: who can view, sync, create, or delete Applications, scoped by project. Policies are defined in a CSV-like format mapping subjects/roles to permissions (e.g., role team-a may sync apps in project team-a but only get others).
For identity, Argo CD integrates SSO via Dex (a bundled OIDC broker) or directly with an OIDC provider (Okta, Google, Azure AD). Users log in with corporate SSO, and their groups map to Argo CD RBAC roles — so you manage access centrally and grant least-privilege per team/project. This two-layer model matters: Argo CD RBAC controls operations in Argo CD (e.g., who may trigger a prod sync), while the underlying cluster RBAC still governs what Argo CD's service account can do in the cluster. Securing both is essential since Argo CD is effectively a high-privilege deployment gateway.
Example
# Argo CD RBAC policy (policy.csv): map SSO groups to scoped permissions
p, role:team-a, applications, sync, team-a/*, allow
p, role:team-a, applications, get, */*, allow
# SSO group -> Argo CD role (comment kept on its own line so it is not read as part of the role name)
g, team-a-oidc-group, role:team-a
# SSO via OIDC (argocd-cm): users authenticate with corporate IdP
oidc.config: |
  name: Okta
  issuer: https://example.okta.com
  clientID: argocd
  requestedScopes: ["openid","profile","email","groups"]
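How the `p` and `g` lines combine into a decision can be sketched with a simplified evaluator. This is an illustration under stated assumptions, not Argo CD's implementation: Argo CD delegates to the Casbin library and also has built-in roles and a configurable default policy, none of which is modeled here. The helper names (`parse_policy`, `is_allowed`) are hypothetical, and matching uses shell-style globs.

```python
from fnmatch import fnmatchcase

POLICY = """
p, role:team-a, applications, sync, team-a/*, allow
p, role:team-a, applications, get, */*, allow
g, team-a-oidc-group, role:team-a
"""


def parse_policy(text: str):
    """Split policy text into permission rules (p) and group-to-role bindings (g)."""
    permissions, bindings = [], []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        fields = [field.strip() for field in line.split(",")]
        if fields[0] == "p" and len(fields) == 6:
            permissions.append(tuple(fields[1:]))      # (role, resource, action, object, effect)
        elif fields[0] == "g" and len(fields) == 3:
            bindings.append((fields[1], fields[2]))    # (group, role)
    return permissions, bindings


def is_allowed(groups: set, resource: str, action: str, obj: str,
               permissions: list, bindings: list) -> bool:
    """Default-deny: allow only if a role held via the user's groups has a matching rule."""
    roles = {role for group, role in bindings if group in groups}
    allowed = False
    for subject, res, act, pattern, effect in permissions:
        if (subject in roles and res == resource
                and fnmatchcase(action, act) and fnmatchcase(obj, pattern)):
            if effect == "deny":
                return False
            allowed = True
    return allowed


permissions, bindings = parse_policy(POLICY)
user_groups = {"team-a-oidc-group"}   # groups claim delivered by the OIDC provider

can_sync_own = is_allowed(user_groups, "applications", "sync", "team-a/web", permissions, bindings)
can_sync_other = is_allowed(user_groups, "applications", "sync", "team-b/api", permissions, bindings)
can_view_other = is_allowed(user_groups, "applications", "get", "team-b/api", permissions, bindings)
```

A member of `team-a-oidc-group` can sync `team-a/web` and view `team-b/api`, but cannot sync `team-b/api`, matching the "sync own project, read-only elsewhere" intent of the policy in the example.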
Exercises
- (Beginner) Is Argo CD's RBAC the same as Kubernetes RBAC?
- (Beginner) What does Argo CD use to integrate corporate SSO?
- (Intermediate) How do SSO groups connect to Argo CD permissions?
- (Interview) Why must you secure both Argo CD's own RBAC and the cluster RBAC of Argo CD's service account? (Hint: Argo CD RBAC = who can operate Argo CD; cluster RBAC = what Argo CD can do in the cluster.)
Answers
- No — Argo CD has its own RBAC layer governing actions on Argo CD objects (Applications, projects, syncs), separate from (and in addition to) Kubernetes RBAC.
- Dex (a bundled OIDC broker) or a direct OIDC integration with an external identity provider (Okta, Google, Azure AD, etc.).
- Users authenticate via SSO/OIDC, which provides their group memberships; Argo CD RBAC policies map those groups (g, <group>, role:<role>) to roles with scoped permissions (e.g., sync apps within a specific project), so access is granted per team/project based on central group membership.
- They control different things. Argo CD RBAC governs who can perform which operations within Argo CD (e.g., who may sync or delete a production Application), protecting the deployment control plane. Cluster RBAC governs what Argo CD's service account is permitted to do in the target cluster (what resources it can create/modify). If Argo CD RBAC is weak, an unauthorized user could trigger powerful deployments; if Argo CD's cluster RBAC is over-broad, a compromise of Argo CD (or a malicious sync) could do anything in the cluster. Since Argo CD is effectively a high-privilege gateway to production, both layers must be locked down with least privilege.
12.4 Flux CD
Flux is the other major GitOps tool — a set of composable controllers. This subchapter covers Flux v2.
Flux v2 architecture
Theory
Flux (Flux v2, "Flux CD") is the other leading CNCF GitOps tool. Unlike Argo CD's more monolithic, UI-centric design, Flux v2 is built as a set of composable controllers called the GitOps Toolkit, each handling one concern and exposed via its own CRDs. This makes Flux modular, API-driven, and a natural building block for higher-level platforms. It is oriented toward the CLI and CRDs rather than a graphical console, and it ships without a built-in UI, though third-party UIs exist.
The GitOps Toolkit controllers:
- Source Controller: fetches artifacts from sources (Git repos, Helm repos, OCI, buckets) and exposes them internally.
- Kustomize Controller: builds and applies Kustomize/plain-YAML from a source, with reconciliation/health/pruning.
- Helm Controller: manages Helm releases declaratively (via HelmRelease).
- Notification Controller: handles inbound events (webhooks to trigger syncs) and outbound alerts (Slack, etc.).
- Image Automation Controllers: update Git with new image versions (covered later).
You bootstrap Flux into a cluster (flux bootstrap), which installs these controllers and configures them to manage themselves from Git. The philosophy: small, focused, interoperable controllers you compose, versus a single integrated application.
Example
GitOps Toolkit controllers (each a CRD-driven concern):
Source Controller -> GitRepository / HelmRepository / OCIRepository
Kustomize Ctrl -> Kustomization
Helm Controller -> HelmRelease
Notification Ctrl -> Receiver / Alert / Provider
Image Automation -> ImageRepository / ImagePolicy / ImageUpdateAutomation
flux bootstrap github --owner=org --repository=fleet --path=clusters/prod
flux get kustomizations # status of reconciliations
Exercises
- (Beginner) How is Flux v2's architecture structured, in contrast to Argo CD?
- (Beginner) Name three GitOps Toolkit controllers and their concerns.
- (Intermediate) What does flux bootstrap do?
- (Interview) What are the implications of Flux's composable, CRD-driven, UI-less design for platform builders versus Argo CD's integrated UI approach? (Hint: modularity/automation/building-block vs. out-of-box visibility.)
Answers
- Flux v2 is a set of small, composable, single-purpose controllers (the GitOps Toolkit), each with its own CRDs, rather than a more monolithic, UI-centric application like Argo CD.
- Any three: Source Controller (fetch artifacts from Git/Helm/OCI/bucket sources), Kustomize Controller (build/apply Kustomize/YAML), Helm Controller (manage Helm releases via HelmRelease), Notification Controller (inbound webhooks/outbound alerts), Image Automation controllers (update image versions in Git).
- It installs the Flux controllers into the cluster and configures them to manage the cluster (and themselves) from a Git repository/path — committing Flux's own manifests to Git so Flux is itself managed via GitOps from the start.
- Flux's modular, CRD/CLI-driven, UI-less design makes it highly composable and automatable — ideal as a building block embedded in larger platforms/products, scriptable and API-native, with each concern independently usable. The trade-off is no built-in graphical visibility out of the box (you rely on CLI/CRDs or third-party UIs). Argo CD's integrated UI gives immediate, approachable deployment observability and operations for end users, at the cost of being a more monolithic, opinionated application. Platform builders often favor Flux for embedding/automation; teams wanting turnkey visibility/self-service often favor Argo CD.
GitRepository, Kustomization, HelmRelease resources
Theory
Flux's declarative model centers on a few CRDs that compose source + apply:
- GitRepository (a source): defines where to fetch from: a Git repo URL, branch/tag/semver, and interval. The Source Controller keeps an internal artifact of that repo up to date. (Siblings: HelmRepository, OCIRepository, Bucket.)
- Kustomization (an apply): references a source and a path within it, and tells the Kustomize Controller to build and apply those manifests to the cluster, with interval, prune, health checks, and dependency ordering (dependsOn).
- HelmRelease (an apply for Helm): declaratively manages a Helm release, referencing a chart (from a HelmRepository/OCI/GitRepository source) plus values; it is handled by the Helm Controller, which performs installs/upgrades/rollbacks.
The pattern is separation of "where to get it" (source) from "what to do with it" (Kustomization/HelmRelease). One GitRepository source can feed many Kustomizations/HelmReleases. This composability — plus dependsOn for ordering and health gating — is how Flux expresses an entire fleet's desired state declaratively.
Example
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata: { name: fleet, namespace: flux-system }
spec:
  url: https://github.com/org/fleet
  ref: { branch: main }
  interval: 1m
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata: { name: web, namespace: flux-system }
spec:
  sourceRef: { kind: GitRepository, name: fleet }
  path: ./apps/web/overlays/prod
  prune: true                      # delete resources removed from Git
  interval: 5m
  dependsOn: [ { name: infra } ]   # order: apply infra first
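The ordering that dependsOn produces can be sketched as a topological sort over Kustomization names. This is only a model of the resulting order: Flux does not compute a global plan like this, and instead each Kustomization checks on its own reconcile whether its dependencies are ready. The `monitoring` Kustomization and the function name `reconcile_order` are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def reconcile_order(depends_on: dict) -> list:
    """Return an order in which every Kustomization comes after its dependencies.

    depends_on maps a Kustomization name to the names listed in its dependsOn.
    """
    graph = {name: set(deps) for name, deps in depends_on.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as err:
        # A dependency cycle can never become ready; surface it clearly.
        raise ValueError(f"dependency cycle detected: {err.args[1]}") from err


# web depends on infra (as in the example above); monitoring is a hypothetical sibling.
order = reconcile_order({
    "web": ["infra"],
    "monitoring": ["infra"],
    "infra": [],
})
```

In `order`, `infra` always precedes both `web` and `monitoring`; the relative order of the two dependents is not significant. A cycle such as `a -> b -> a` raises `ValueError`, mirroring the fact that mutually dependent Kustomizations would wait on each other indefinitely.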
Exercises
- (Beginner) What does a GitRepository resource define?
- (Beginner) What does a Kustomization resource do with a source?
- (Intermediate) How does Flux separate "where to get manifests" from "what to do with them"?
- (Interview) How do dependsOn and prune on a Kustomization contribute to safe, ordered reconciliation? (Hint: ordering dependencies; removing deleted resources.)
Answers
- A source: where to fetch manifests from — a Git repo URL, the ref (branch/tag/semver), and the poll interval; the Source Controller maintains an up-to-date artifact of it.
- It references a source and a path within it, and instructs the Kustomize Controller to build (kustomize) and apply those manifests to the cluster, reconciling on an interval with optional pruning, health checks, and ordering.
- Source CRDs (GitRepository/HelmRepository/OCIRepository/Bucket) define where to fetch artifacts; apply CRDs (Kustomization/HelmRelease) reference a source and define what to build/apply and how. One source can be consumed by many Kustomizations/HelmReleases, cleanly decoupling fetching from applying.
- dependsOn enforces ordering: a Kustomization waits for its dependencies to be successfully applied and healthy before reconciling, so prerequisites (CRDs, namespaces, infrastructure) come up before dependent workloads, preventing failures from missing dependencies. prune: true removes from the cluster any resources that were deleted from Git, keeping actual state aligned with the source of truth and avoiding orphaned leftovers. Together they make reconciliation both correctly ordered and accurately convergent (no stale resources).
Image automation with Flux
Theory
A distinctive Flux capability is image update automation: Flux can watch a container registry for new image versions and automatically commit the updated image tag back to Git — closing the loop so that new builds get deployed without manual manifest edits, while still keeping Git as the source of truth (the change is a commit, fully auditable). Argo CD has no built-in equivalent (it's typically paired with external tools like Argo CD Image Updater).
Three CRDs drive it:
- ImageRepository: scans a registry for available tags of an image.
- ImagePolicy: selects which tag is "latest desired" via a policy (semver range, numerical, regex/filter), e.g., "highest 1.x semver" or "newest by build timestamp".
- ImageUpdateAutomation: writes the selected tag back into the Git manifests (via markers in the YAML) and commits/pushes.
This enables a fully automated path: CI pushes a new image → Flux detects it via ImageRepository/ImagePolicy → ImageUpdateAutomation commits the new tag to Git → the Kustomize/Helm controller deploys it. It's GitOps-native CD that keeps every deployment recorded as a Git commit.
Example
apiVersion: image.toolkit.fluxcd.io/v1beta2
kind: ImagePolicy
metadata: { name: web, namespace: flux-system }
spec:
  imageRepositoryRef: { name: web }
  policy: { semver: { range: ">=1.0.0 <2.0.0" } }   # pick highest 1.x
# In the deployment manifest, a marker tells Flux what to update:
spec:
  template:
    spec:
      containers:
        - name: web
          image: registry.example.com/web:1.4.2 # {"$imagepolicy": "flux-system:web"}
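What the semver policy in the example selects can be shown with a short Python sketch: parse each tag, keep those inside `>=1.0.0 <2.0.0`, and take the highest. This is a simplified stand-in for Flux's policy evaluation; it accepts only plain `MAJOR.MINOR.PATCH` tags (optionally prefixed with `v`), ignores pre-release and build metadata, and the tag list is invented for the example.

```python
import re

SEMVER = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def parse(tag: str):
    """Return (major, minor, patch) for a plain semver tag, or None if it is not one."""
    match = SEMVER.match(tag)
    return tuple(int(part) for part in match.groups()) if match else None


def select_tag(tags: list, lower=(1, 0, 0), upper=(2, 0, 0)):
    """Pick the highest tag with lower <= version < upper, or None if nothing qualifies."""
    versions = [(parse(tag), tag) for tag in tags if parse(tag) is not None]
    in_range = [(version, tag) for version, tag in versions if lower <= version < upper]
    return max(in_range)[1] if in_range else None


# Hypothetical tags found by scanning the registry.
tags = ["1.4.2", "1.10.0", "1.9.3", "2.0.0", "latest", "0.9.9"]
selected = select_tag(tags)
```

The sketch selects `1.10.0`: versions are compared numerically, so `1.10.0` ranks above `1.9.3` even though it sorts lower as a string, while `2.0.0` falls outside the range and `latest` is not a version at all. This is why image automation works best with immutable, sortable tags rather than floating ones.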
Exercises
- (Beginner) What does Flux's image automation do when a new image version appears?
- (Beginner) Which CRD decides which image tag is the desired one?
- (Intermediate) Why does writing the new tag back to Git (rather than directly to the cluster) preserve GitOps principles?
- (Interview) Trace the fully automated flow from a new image build to it running in the cluster using Flux image automation. (Hint: CI push → ImageRepository → ImagePolicy → ImageUpdateAutomation commit → Kustomize/Helm deploy.)
Answers
- It automatically updates the image tag in the Git manifests (commits the change to Git), so the new version gets deployed without manual edits.
- ImagePolicy (it applies a policy like a semver range, numeric, or filter to the tags discovered by ImageRepository to select the desired tag).
- Because the change is made as a Git commit to the manifests, Git remains the single source of truth — the update is versioned, reviewable, auditable, and revertable, and the cluster is still reconciled from Git. Writing directly to the cluster would bypass Git, creating drift and losing the audit/rollback guarantees; committing to Git keeps the whole flow GitOps-compliant.
- CI builds and pushes a new image tag to the registry → Flux's ImageRepository scans the registry and discovers the new tag → ImagePolicy selects it as the desired version per its policy → ImageUpdateAutomation writes that tag into the Git manifests (at the marked field) and commits/pushes to Git → the Source Controller picks up the new commit and the Kustomize/Helm Controller applies it, deploying the new image to the cluster — all automatically, with the deployment recorded as a Git commit.
Multi-tenancy with Flux
Theory
Flux supports multi-tenancy so multiple teams can safely share a cluster (or fleet) under GitOps, each managing their own apps without interfering with others. The model leans on Kubernetes-native primitives plus Flux's structure:
- Namespace isolation + RBAC: each tenant's Flux resources (GitRepository, Kustomization, HelmRelease) live in their own namespace, and Flux applies their manifests using a tenant-scoped ServiceAccount (via spec.serviceAccountName / impersonation), so a tenant's reconciliation can only do what that ServiceAccount's RBAC allows, confining their blast radius.
- Repository structure: a common pattern is a platform/fleet repo (managed by the platform team) that defines each tenant and points to per-tenant repos the tenants control. The platform Kustomization sets up tenants; tenant Kustomizations reconcile their own apps.
- Source restrictions & policies: limit which sources tenants may use, and combine with admission policies (Kyverno/Gatekeeper) for guardrails.
The key mechanism is impersonation/least-privilege ServiceAccounts: Flux doesn't reconcile every tenant with cluster-admin; it reconciles each tenant as a constrained identity, enforcing isolation through Kubernetes RBAC. This makes Flux multi-tenancy as strong as the underlying RBAC.
Example
# A tenant's Kustomization reconciles with a least-privilege ServiceAccount:
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata: { name: team-a-apps, namespace: team-a }
spec:
  serviceAccountName: team-a-reconciler   # impersonate this SA (RBAC-limited)
  sourceRef: { kind: GitRepository, name: team-a-repo, namespace: team-a }
  path: ./apps
  prune: true
  interval: 5m                            # required field; value chosen to match the earlier example
  targetNamespace: team-a                 # confined to the tenant's namespace
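Why impersonation confines a tenant can be illustrated with a toy model in Python: every manifest is applied as the tenant's ServiceAccount, and anything outside that account's grants is refused. In a real cluster this enforcement is done by the Kubernetes API server evaluating RBAC, not by Flux or by code like this; the `rbac` mapping (ServiceAccount to writable namespaces), the function `apply_as`, and the sample manifests are all assumptions, and the sketch ignores the `targetNamespace` override.

```python
def apply_as(service_account: str, manifests: list, rbac: dict):
    """Split manifests into those the ServiceAccount may apply and those denied.

    rbac maps a ServiceAccount name to the set of namespaces it may write to.
    Cluster-scoped resources (no namespace) are denied in this simplified model.
    """
    allowed_namespaces = rbac.get(service_account, set())
    applied, denied = [], []
    for manifest in manifests:
        meta = manifest["metadata"]
        target = (meta.get("namespace"), meta["name"])
        if meta.get("namespace") in allowed_namespaces:
            applied.append(target)
        else:
            denied.append(target)
    return applied, denied


# The tenant's reconciler may only write to the tenant's own namespace.
rbac = {"team-a-reconciler": {"team-a"}}

# Hypothetical manifests from the tenant's repo, two of which overreach.
manifests = [
    {"kind": "Deployment", "metadata": {"name": "web", "namespace": "team-a"}},
    {"kind": "ConfigMap", "metadata": {"name": "sneaky", "namespace": "team-b"}},
    {"kind": "ClusterRole", "metadata": {"name": "escalate"}},
]

applied, denied = apply_as("team-a-reconciler", manifests, rbac)
```

Only the Deployment in `team-a` is applied; the ConfigMap aimed at `team-b` and the cluster-scoped ClusterRole are denied. Had the reconciliation run as a cluster-admin identity, all three would have succeeded, which is the failure mode that per-tenant ServiceAccounts prevent.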
Exercises
- (Beginner) What two Kubernetes-native primitives underpin Flux multi-tenancy?
- (Beginner) What does setting serviceAccountName on a Kustomization achieve?
- (Intermediate) Describe the platform-repo + per-tenant-repo structure.
- (Interview) Why is reconciling each tenant with a least-privilege ServiceAccount (impersonation) the cornerstone of Flux multi-tenancy security? (Hint: confines what each tenant's reconciliation can do via RBAC; no shared cluster-admin.)
Answers
- Namespaces (isolation per tenant) and RBAC (limiting each tenant's permissions), combined with tenant-scoped ServiceAccounts.
- It makes Flux apply that Kustomization's resources as that ServiceAccount (impersonation), so the reconciliation is constrained to what the SA's RBAC permits — preventing a tenant from creating resources or affecting namespaces beyond their grant.
- A central platform/fleet repository, controlled by the platform team, defines the tenants and bootstraps a Flux setup (Kustomizations/sources) for each, pointing at per-tenant repositories that the individual teams own and control. The platform repo governs structure and guardrails; tenants manage their own application manifests in their own repos, which Flux reconciles within their scoped boundaries.
- If Flux reconciled all tenants with a single highly-privileged (e.g., cluster-admin) identity, any tenant's manifests could affect the whole cluster, breaking isolation. By reconciling each tenant as a least-privilege ServiceAccount scoped to their namespace(s) and resources, Flux ensures a tenant's reconciliation can only do what Kubernetes RBAC explicitly allows that identity — so a malicious or mistaken manifest from one tenant cannot touch another's resources or escalate. The strength of the isolation reduces to the strength of that RBAC, making impersonation the security cornerstone of Flux multi-tenancy.
13. Service Mesh
As applications decompose into many microservices, the network between them becomes the hard part: secure connections, retries, timeouts, load balancing, traffic shaping, and observability — repeated in every service. A service mesh moves these cross-cutting concerns out of application code and into the infrastructure layer. This chapter explains the concept, then surveys the leading implementations: Istio, Linkerd, and Cilium's eBPF-based mesh.
13.1 Service Mesh Concepts
This subchapter builds the mental model: why meshes exist, how they're structured, and the patterns that define them.
Why service meshes exist
Theory
In a microservices system, every service-to-service call needs the same things: encryption (mTLS), retries and timeouts, load balancing, circuit breaking, and telemetry (metrics/traces/logs). Without a mesh, each team implements these in application code using language-specific libraries — inconsistently, repeatedly, and coupling business logic to networking concerns. Worse, you can't easily enforce a uniform policy (e.g., "all traffic must be encrypted") across services written in different languages.
A service mesh solves this by moving these concerns into a dedicated infrastructure layer that handles all service-to-service communication transparently. Applications just make normal network calls; the mesh intercepts them and applies mTLS, retries, traffic policy, and observability — without changing application code. The benefits: consistent, language-agnostic security and reliability; centralized policy and traffic control; and uniform, automatic observability of all inter-service traffic. The cost: added complexity and per-request latency/resource overhead from the proxies — so a mesh is justified when you have enough services that the consistency and control outweigh the overhead.
Example
Without a mesh: every service re-implements (per language):
[svc-A: mTLS+retry+metrics lib] -> [svc-B: mTLS+retry+metrics lib]
With a mesh: concerns move to the infra layer (proxies):
[svc-A]->[proxy] ===mTLS/retry/metrics===> [proxy]->[svc-B]
(app code is unchanged; the mesh handles cross-cutting concerns)
Exercises
- (Beginner) Name three cross-cutting concerns a service mesh handles.
- (Beginner) Does adopting a service mesh require changing application code?
- (Intermediate) Why is implementing these concerns in per-language libraries problematic at scale?
- (Interview) What is the central trade-off of adopting a service mesh, and when is it justified? (Hint: consistency/control/observability vs. complexity and overhead; enough services to warrant it.)
Answers
- Any three: mutual TLS encryption, retries, timeouts, load balancing, circuit breaking, traffic shaping/routing, and observability (metrics/traces/logs).
- No — a mesh handles these concerns transparently at the infrastructure layer; applications make normal network calls and need no (or minimal) code changes.
- Each service/language reimplements the same logic with different libraries, producing inconsistency (different behaviors/bugs), duplicated effort, coupling of networking to business code, and an inability to enforce uniform cross-language policy (e.g., guaranteeing mTLS everywhere). Upgrades/policy changes must be made in every codebase.
- The trade-off is gaining consistent, language-agnostic security (mTLS), reliability (retries/timeouts/circuit breaking), centralized traffic control, and uniform observability — at the cost of added architectural complexity and per-request latency/CPU/memory overhead from the proxies and control plane. It's justified when you have enough services that managing these concerns in-app becomes inconsistent and unscalable, and the value of uniform security/observability/traffic control outweighs the operational and performance overhead. For a handful of services, the overhead often isn't worth it.
Data plane vs control plane in a mesh
Theory
A service mesh has the same two-plane structure as Kubernetes itself:
- Data plane: the network of proxies that actually carry and act on the traffic. Each service instance has a proxy (typically a sidecar) that intercepts all inbound/outbound traffic and applies the mesh's policies — mTLS, routing, retries, load balancing — and emits telemetry. The data plane is in the request path for every call.
- Control plane: the brain that configures and manages the data plane. It doesn't touch request traffic; instead it distributes configuration (routing rules, security policies, service discovery, certificates) to all the proxies and collects their telemetry. Operators interact with the control plane (via CRDs/APIs), which translates intent into proxy configuration.
This separation mirrors Kubernetes' API-server-vs-kubelet split: you declare desired behavior to the control plane, and the distributed data plane enforces it on live traffic. Istio's Envoy proxies + istiod, and Linkerd's micro-proxies + control plane, are concrete instances. Understanding which plane a component belongs to clarifies where latency, failures, and configuration live.
Example
Control Plane (configures, no traffic)
  [ policies | routing | certs | discovery ]
        | distributes config      ^
        v                         | telemetry
Data Plane: [proxy]<->[proxy]<->[proxy]   (carries actual request traffic)
               |         |         |
            [svc-A]   [svc-B]   [svc-C]
Exercises
- (Beginner) Which plane carries the actual request traffic?
- (Beginner) Does the control plane sit in the request path?
- (Intermediate) What does the control plane distribute to the data plane?
- (Interview) How does the mesh's data/control plane split mirror Kubernetes' own architecture, and why is this separation valuable? (Hint: declare to control plane, distributed enforcement; like API server vs kubelet.)
Answers
- The data plane (the proxies).
- No — the control plane configures and manages the proxies and collects telemetry but does not handle the actual request traffic.
- Configuration and policy: routing rules, security/mTLS policies, certificates, service discovery information — which the proxies then enforce on live traffic (and it collects their telemetry).
- Like Kubernetes' API server (declarative control) vs. kubelets/components (distributed execution), a mesh separates a control plane where you declare desired behavior (routing, security, policy) from a data plane of distributed proxies that enforce it on real traffic. This separation is valuable because it decouples policy/intent from enforcement: you manage behavior centrally and declaratively while enforcement scales out across every proxy, the control plane can fail without immediately dropping traffic (proxies keep using last config), and concerns (config vs. data path) are cleanly isolated for reasoning about latency, failures, and upgrades.
Sidecar proxy pattern
Theory
The classic way a mesh injects its data plane is the sidecar proxy pattern: a proxy container (e.g., Envoy) is added to every application Pod, sharing the Pod's network namespace. Traffic to and from the app container is transparently redirected (via iptables rules or, increasingly, eBPF) through the sidecar, which applies all mesh functions. The app is unaware — it thinks it's talking directly to other services.
Injection is usually automatic via a mutating admission webhook (Chapter 9) triggered by a namespace/Pod label, which adds the sidecar to the Pod spec at creation. Benefits: per-Pod isolation, language-agnostic operation, and fine-grained control. Drawbacks, which have driven recent innovation, include resource overhead (a proxy per Pod adds CPU/memory across the whole fleet), latency (extra hops), and operational complexity (sidecar lifecycle, startup ordering, upgrades). These costs motivated sidecar-less approaches: Istio's Ambient mode (shared per-node proxies plus optional waypoint proxies, typically deployed per namespace or per service) and Cilium's eBPF mesh, which reduce or eliminate per-Pod sidecars.
Example
Each Pod gets a sidecar proxy sharing its network namespace:
+---------------- Pod ----------------+
| [app container] <-> [envoy proxy] | <- iptables/eBPF redirects traffic
+-------------------------------------+
outbound/inbound traffic always flows app <-> proxy <-> network
# Automatic sidecar injection via a namespace label (Istio example):
apiVersion: v1
kind: Namespace
metadata:
  name: app
  labels: { istio-injection: enabled } # webhook injects the sidecar into Pods
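Injection can also be controlled per workload. As a minimal sketch, assuming a hypothetical Deployment named batch-job with a placeholder image, the Pod-template label below opts a single workload out of injection even though its namespace is labeled for it (Istio documents sidecar.istio.io/inject as the per-Pod control):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-job                 # hypothetical workload name
  namespace: app                  # the namespace labeled for injection above
spec:
  replicas: 1
  selector:
    matchLabels: { app: batch-job }
  template:
    metadata:
      labels:
        app: batch-job
        sidecar.istio.io/inject: "false"   # per-Pod override: skip the sidecar
    spec:
      containers:
        - name: worker
          image: example.com/batch-job:1.0 # placeholder image
```

The value must be quoted so YAML treats it as a string rather than a boolean. A workload that opts out is outside the mesh, so it gets no mesh mTLS identity, policy enforcement, or proxy telemetry.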
Exercises
- (Beginner) In the sidecar pattern, where does the proxy run relative to the app?
- (Beginner) How is the sidecar typically injected into Pods automatically?
- (Intermediate) Name two drawbacks of the per-Pod sidecar model.
- (Interview) What sidecar-less approaches have emerged to address the sidecar model's costs, and what do they change? (Hint: Istio Ambient per-node proxies; Cilium eBPF.)
Answers
- As a separate proxy container inside the same Pod as the app, sharing the Pod's network namespace, with traffic transparently redirected through it.
- Via a mutating admission webhook (triggered by a namespace/Pod label like istio-injection: enabled) that adds the sidecar container to the Pod spec at creation time.
- Any two: resource overhead (a proxy per Pod consumes CPU/memory across the whole fleet), added latency (extra in-path hops), and operational complexity (sidecar lifecycle, startup ordering with the app, and coordinated upgrades).
- Sidecar-less approaches reduce or remove per-Pod proxies. Istio Ambient mode uses shared per-node proxies (ztunnel) for L4/mTLS plus optional waypoint proxies (typically deployed per namespace or per service) only where L7 features are needed, eliminating a sidecar in every Pod. Cilium's eBPF service mesh pushes much of the data-plane functionality into the kernel via eBPF (and uses shared per-node proxies for L7), avoiding per-Pod sidecars entirely. Both aim to cut the resource, latency, and operational overhead of injecting a proxy into every Pod while preserving mesh capabilities.
mTLS and service-to-service security
Theory
One of the most compelling reasons to adopt a mesh is automatic mutual TLS (mTLS) for all service-to-service traffic. In mTLS, both the client and server present certificates and verify each other's identity — providing encryption (traffic is confidential on the wire) and mutual authentication (each side cryptographically proves who it is). This is the foundation of zero-trust networking: services don't trust the network; they trust verified identities.
The mesh makes this transparent: the control plane acts as (or integrates with) a certificate authority, issuing short-lived identity certificates to each workload's proxy and rotating them automatically. The identity is often derived from the workload's ServiceAccount and encoded as a SPIFFE ID, following the SPIFFE standard. Proxies establish mTLS between each other without the application doing anything. On top of authenticated identity, the mesh enforces authorization policies ("service A may call service B's /api but not /admin"). The result is encryption in transit and strong identity for the whole fleet, configured centrally, with no app changes and no manual certificate management. This is extremely hard to achieve consistently otherwise.
Example
mTLS handshake between proxies (both verify identities):
[proxy A: cert=spiffe://cluster.local/ns/app/sa/web]
<===== mutual TLS (encrypt + authenticate both sides) =====>
[proxy B: cert=spiffe://cluster.local/ns/app/sa/api]
control plane issues + auto-rotates these short-lived certs
# Istio: require strict mTLS for all workloads in a namespace
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata: { name: default, namespace: app }
spec:
  mtls: { mode: STRICT } # only accept mTLS connections
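The authorization rule quoted in the theory ("service A may call service B's /api but not /admin") could be sketched as a DENY policy. This is an illustration under assumptions, not a definitive policy: the workload label app: api, the ServiceAccount web, and the default cluster.local trust domain are all hypothetical choices made for the example.

```yaml
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata:
  name: deny-web-admin            # hypothetical policy name
  namespace: app
spec:
  selector:
    matchLabels: { app: api }     # assumed label on the "service B" Pods
  action: DENY
  rules:
    - from:
        - source:
            # identity taken from the caller's mTLS certificate
            principals: ["cluster.local/ns/app/sa/web"]
      to:
        - operation:
            paths: ["/admin", "/admin/*"]   # exact match plus prefix match
```

With only a DENY policy selecting the workload, requests that match no rule stay allowed, so web can still reach /api. Matching on principals relies on mTLS, because the identity is read from the peer certificate.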
Exercises
- (Beginner) What two guarantees does mTLS provide?
- (Beginner) Who issues and rotates the identity certificates in a mesh?
- (Intermediate) How does mesh mTLS relate to a zero-trust security model?
- (Interview) Why is achieving consistent, rotating mTLS across all services far easier with a mesh than implementing it in each application? (Hint: control plane CA + auto-rotation + transparent proxies vs. per-app cert management.)
Answers
- Encryption of traffic in transit (confidentiality) and mutual authentication (both client and server cryptographically prove their identities).
- The mesh's control plane (acting as or integrating with a certificate authority) issues short-lived workload identity certificates to each proxy and automatically rotates them (often using SPIFFE identities tied to the workload's ServiceAccount).
- Zero-trust means not trusting the network and instead verifying identity for every connection. Mesh mTLS provides exactly that: each service proves its cryptographic identity on every call and traffic is encrypted, so trust is based on verified workload identity rather than network location — plus authorization policies restrict which identities may call which services.
- The mesh provides a control-plane CA that automatically issues and rotates short-lived certificates per workload, and the proxies transparently negotiate mTLS on every connection — with no application code or manual cert handling. Doing this in-app would require every service (in every language) to obtain, store, rotate, and validate certificates, implement mutual TLS correctly and consistently, and coordinate a CA — error-prone, inconsistent, and operationally heavy. The mesh centralizes and automates all of it, guaranteeing uniform encryption and identity fleet-wide effortlessly.
13.2 Istio
Istio is the most feature-rich service mesh. This subchapter covers its architecture, traffic management, observability, and security.
Istio architecture and components
Theory
Istio is widely known and feature-complete, and its architecture follows the data/control plane split described earlier:
- Data plane: Envoy proxies — a high-performance, programmable L7 proxy — deployed as sidecars (or, in Ambient mode, as shared node/waypoint proxies). Envoy does the actual mTLS, routing, retries, load balancing, and telemetry.
- Control plane: istiod, a single consolidated component (earlier versions had separate Pilot/Citadel/Galley) that handles configuration distribution (translating Istio CRDs into Envoy config), service discovery, and certificate issuance/rotation (acting as the CA).
You interact with Istio through its CRDs — VirtualService, DestinationRule, Gateway, PeerAuthentication, AuthorizationPolicy, etc. — which istiod compiles into Envoy configuration pushed to the proxies. Istio also provides ingress/egress Gateways (Envoy-based edge proxies) for traffic entering/leaving the mesh. Its strength is breadth and power (rich L7 traffic management, security, and policy); its historical reputation is complexity, which Ambient mode and consolidation into istiod aim to reduce.
Example
Control plane: [ istiod ]  (config + discovery + CA; compiles CRDs -> Envoy config)
                    |  pushes config / issues certs
Data plane:    [Envoy]<->[Envoy]<->[Envoy]  (sidecars carrying traffic)
Edge:          [Ingress Gateway (Envoy)] -> into the mesh
istioctl install --set profile=demo # install Istio
kubectl label namespace app istio-injection=enabled
istioctl proxy-status # check Envoy/config sync status
Exercises
- (Beginner) What proxy does Istio use for its data plane?
- (Beginner) What is istiod responsible for?
- (Intermediate) How do you configure Istio's behavior, and how does that reach the proxies?
- (Interview) Istio is powerful but historically complex. What architectural changes have aimed to reduce that complexity? (Hint: consolidation into istiod; Ambient mode.)
Answers
- Envoy.
- The control plane functions: distributing configuration (translating Istio CRDs into Envoy config), service discovery, and certificate issuance/rotation (acting as the mesh CA).
- Through Istio CRDs (VirtualService, DestinationRule, Gateway, PeerAuthentication, AuthorizationPolicy, etc.). istiod watches these, compiles them into Envoy configuration, and pushes that config to the Envoy proxies in the data plane, which enforce it on traffic.
- Earlier Istio had multiple separate control-plane components (Pilot, Citadel, Galley, Mixer), which were consolidated into a single istiod, simplifying deployment/operations (and Mixer was removed). More recently, Ambient mode removes per-Pod sidecars in favor of shared per-node L4 proxies (ztunnel) plus optional L7 waypoint proxies, reducing resource overhead and operational complexity. Together these reduce Istio's footprint and the burden of managing sidecars everywhere.
VirtualService and DestinationRule
Theory
Istio's two core traffic-management CRDs work together:
- VirtualService: defines routing rules — how requests for a service are matched and routed. It can route by URI/header/weight to different subsets or services, enabling canary splits (90/10), header-based routing (A/B testing), URL rewrites, redirects, retries, timeouts, and fault injection. It answers "where should this request go?"
- DestinationRule: defines policies applied after routing to a destination: load balancing algorithm, connection pool limits, outlier detection (circuit breaking), and TLS settings. Critically, it defines subsets (named groups of a service's Pods by labels, e.g., version: v1/v2) that VirtualServices route to. It answers "how should traffic to this destination behave?"
The pattern: a DestinationRule declares the subsets and traffic policy; a VirtualService routes traffic across those subsets with weights/match rules. Together they give precise, declarative control over service traffic — the foundation for canary releases, circuit breaking, and resilience, all without app changes.
Example
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata: { name: reviews }
spec:
  host: reviews
  subsets: # define named subsets by labels
    - { name: v1, labels: { version: v1 } }
    - { name: v2, labels: { version: v2 } }
  trafficPolicy:
    connectionPool: { tcp: { maxConnections: 100 } }
    outlierDetection: { consecutive5xxErrors: 5, interval: 30s } # circuit breaking
---
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata: { name: reviews }
spec:
  hosts: [ reviews ]
  http:
    - route: # canary: 90% v1, 10% v2
        - { destination: { host: reviews, subset: v1 }, weight: 90 }
        - { destination: { host: reviews, subset: v2 }, weight: 10 }
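Header-based routing (the A/B-testing case mentioned in the theory) uses the same subsets but matches on a request header instead of splitting by weight. A minimal sketch, assuming a hypothetical opt-in header named x-beta-user; applied under the same name, it would replace the weighted VirtualService above:

```yaml
apiVersion: networking.istio.io/v1
kind: VirtualService
metadata: { name: reviews }
spec:
  hosts: [ reviews ]
  http:
    - match:                      # rule 1: requests carrying the opt-in header
        - headers:
            x-beta-user:          # hypothetical header name
              exact: "true"
      route:
        - destination: { host: reviews, subset: v2 }
    - route:                      # rule 2: default route for everyone else
        - destination: { host: reviews, subset: v1 }
```

HTTP rules are evaluated in order and the first match wins, so the catch-all route goes last.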
Exercises
- (Beginner) What does a VirtualService control versus a DestinationRule?
- (Beginner) What are "subsets," and where are they defined?
- (Intermediate) How would you implement a 90/10 canary split between two versions using these resources?
- (Interview) Explain how VirtualService and DestinationRule together enable canary releases and circuit breaking without changing application code. (Hint: subsets + weighted routing; outlier detection policy.)
Answers
- A VirtualService controls routing — matching and directing requests (by weight/URI/header) to destinations/subsets (the "where"). A DestinationRule controls destination policies applied after routing — load balancing, connection pools, outlier detection/circuit breaking, TLS — and defines subsets (the "how").
- Subsets are named groups of a service's Pods selected by labels (e.g., version: v1 vs version: v2). They're defined in the DestinationRule and referenced by VirtualServices for routing.
- Define a DestinationRule with two subsets (v1, v2) by version label, then a VirtualService routing to those subsets with weight: 90 for v1 and weight: 10 for v2, sending 10% of traffic to the new version.
- The DestinationRule declares subsets (e.g., v1/v2 by label) and traffic policies including outlier detection (eject unhealthy instances = circuit breaking). The VirtualService routes traffic across those subsets with weights/match rules. So a canary is just shifting weights between subsets (90/10 → 50/50 → 100%), and circuit breaking is the DestinationRule's outlier-detection policy automatically removing failing endpoints. Both are expressed declaratively in mesh config and enforced by the Envoy proxies, with the application completely unaware and unchanged.
Traffic management and load balancing
Theory
Beyond canary routing, Istio provides rich traffic management capabilities, all enforced by Envoy without app changes:
- Load balancing algorithms: round-robin, least-request, random, consistent hashing (for session affinity by header/cookie) — configured in the DestinationRule.
- Resilience: timeouts, retries (with per-try timeouts and retry conditions), and circuit breaking/outlier detection (eject endpoints returning errors), making the system tolerant of slow/failing instances.
- Fault injection: deliberately inject delays or aborts (errors) into a percentage of requests to test resilience (chaos testing) — without touching code.
- Traffic mirroring (shadowing): send a copy of live traffic to a new version to test it with real load while the response is discarded — safe pre-production validation.
- Edge control via Gateways: manage ingress/egress traffic at the mesh boundary.
These turn the network into a programmable, resilient layer: you declaratively shape, protect, and test traffic flows. The combination of fine-grained routing + resilience + testing primitives is a major reason teams adopt Istio for complex traffic scenarios.
Example
# Retries, timeout, and fault injection on a route (VirtualService spec fragment):
# Note: per the Istio VirtualService reference, timeouts and retries are not enabled on a route while a client-side fault is active.
spec:
  hosts: [ ratings ]
  http:
    - route: [ { destination: { host: ratings } } ]
      timeout: 2s
      retries: { attempts: 3, perTryTimeout: 1s, retryOn: "5xx,connect-failure" } # quoted: a bare comma would split the flow mapping
      fault:
        delay: { percentage: { value: 10 }, fixedDelay: 5s } # inject latency in 10%
---
# Traffic mirroring (VirtualService spec fragment): serve from v1, copy 20% of requests to v2
http:
  - route: [ { destination: { host: app, subset: v1 } } ]
    mirror: { host: app, subset: v2 }
    mirrorPercentage: { value: 20 }
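The load-balancing algorithms listed in the theory are set in a DestinationRule's trafficPolicy. The two rules below are a sketch under assumptions: the hosts ratings and cart and the header x-user-id are hypothetical, chosen only to show one simple algorithm and one consistent-hash (session affinity) setup.

```yaml
# Least-request balancing for a hypothetical "ratings" service
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata: { name: ratings-lb }
spec:
  host: ratings
  trafficPolicy:
    loadBalancer:
      simple: LEAST_REQUEST       # alternatives include ROUND_ROBIN and RANDOM
---
# Session affinity for a hypothetical "cart" service: hash on a request header
apiVersion: networking.istio.io/v1
kind: DestinationRule
metadata: { name: cart-sticky }
spec:
  host: cart
  trafficPolicy:
    loadBalancer:
      consistentHash:
        httpHeaderName: x-user-id # hypothetical header identifying the user
```

Consistent hashing keeps requests with the same header value on the same endpoint while the endpoint set is stable. The affinity is best effort and can shift when Pods are added or removed.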
Exercises
- (Beginner) Name two resilience features Istio provides for traffic.
- (Beginner) What is traffic mirroring (shadowing) used for?
- (Intermediate) What is fault injection, and why would you use it?
- (Interview) How does Istio's traffic management turn the network into a "programmable, resilient layer," and what's the advantage of doing this in the mesh rather than in code? (Hint: declarative retries/timeouts/circuit-breaking/testing; consistent, language-agnostic, no redeploy.)
Answers
- Any two: timeouts, retries (with per-try timeout/conditions), and circuit breaking/outlier detection.
- Sending a copy of live production traffic to another version (while discarding its responses) to test the new version against real traffic safely, without affecting users.
- Deliberately injecting failures — delays (latency) or aborts (error responses) — into a configurable percentage of requests. It's used to test and validate system resilience (chaos/fault testing): verifying timeouts, retries, fallbacks, and circuit breakers behave correctly under failure, without modifying application code.
- Istio lets you declaratively configure routing, retries, timeouts, circuit breaking, load-balancing strategies, fault injection, and mirroring — all enforced by Envoy on live traffic. This makes the network programmable (shape/route traffic via config) and resilient (automatic handling of slow/failing instances). Doing it in the mesh rather than code means the behavior is consistent and language-agnostic across all services, centrally managed and auditable, changeable at runtime without redeploying applications, and decoupled from business logic — far more uniform and flexible than each team implementing resilience libraries differently.
Observability with Istio (Kiali, Jaeger)
Theory
Because every request passes through Envoy proxies, Istio gets observability for free — it can emit consistent metrics, distributed traces, and access logs for all service-to-service traffic without app instrumentation (beyond trace-header propagation). This produces the "golden signals" (latency, traffic, errors, saturation) uniformly across the mesh.
Istio integrates with a standard observability stack:
- Metrics: Envoy exports rich metrics (request rates, error rates, latencies per service) scraped by Prometheus, visualized in Grafana.
- Distributed tracing: Envoy generates spans and reports to Jaeger (or Tempo/Zipkin) — though the application must propagate the trace headers so spans link across hops (the one piece the mesh can't do alone).
- Service-graph visualization: Kiali is Istio's dedicated console that renders a live topology graph of services and their traffic (rates, health, mTLS status), and helps validate/edit Istio config.
So Istio + Prometheus/Grafana + Jaeger + Kiali gives metrics, traces, and a visual service map of the entire mesh largely automatically — a powerful, consistent observability layer that's hard to achieve with per-service instrumentation alone.
Example
Envoy proxies emit telemetry for every call:
metrics -> Prometheus -> Grafana (rates, errors, latency)
traces -> Jaeger (per-request spans; app must propagate headers)
topology -> Kiali (live service graph: traffic, health, mTLS)
istioctl dashboard kiali # open the service-graph console
istioctl dashboard jaeger # open distributed tracing
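What the proxies emit can itself be configured declaratively through Istio's Telemetry API. The fragment below is a hedged sketch: it assumes a recent Istio release that serves telemetry.istio.io/v1 (older releases serve the same resource as v1alpha1), that a tracing provider is already configured in the mesh settings, and that a 10% sampling rate suits the environment.

```yaml
apiVersion: telemetry.istio.io/v1
kind: Telemetry
metadata:
  name: mesh-default
  namespace: istio-system         # root namespace, so this applies mesh-wide
spec:
  accessLogging:
    - providers:
        - name: envoy             # Istio's built-in Envoy access-log provider
  tracing:
    - randomSamplingPercentage: 10.0   # assumed rate: sample about 10% of requests
```

Sampling only controls how many traces the proxies start. Applications still have to forward the trace headers for spans to join into one trace.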
Exercises
- (Beginner) Why does Istio provide observability with little app instrumentation?
- (Beginner) What does Kiali provide?
- (Intermediate) For distributed tracing to work across hops in Istio, what must the application still do?
- (Interview) How does proxy-based telemetry give consistent "golden signals" across a polyglot fleet, and what's its limitation regarding tracing? (Hint: uniform metrics from Envoy; apps must propagate trace context.)
Answers
- Because all traffic flows through the Envoy proxies, which automatically emit consistent metrics, traces, and access logs for every request — so the mesh produces telemetry without requiring per-application instrumentation.
- Kiali is Istio's observability console that visualizes the mesh as a live service topology/graph (traffic rates, health, mTLS status between services) and helps inspect, validate, and configure Istio resources.
- The application must propagate the trace context headers (e.g., W3C traceparent/B3 headers) from incoming requests to outgoing requests, so that spans generated by different proxies link into a single end-to-end trace. The mesh can generate spans but cannot stitch them across hops without the app forwarding the headers.
- Every service's traffic passes through identical Envoy proxies, which emit the same set of metrics (request rate, error rate, latency, etc.) regardless of the application's language or framework. This yields uniform golden signals across a polyglot fleet automatically, with no per-language instrumentation. The limitation is tracing: while proxies create spans, they can't correlate them across services unless the application forwards the trace headers between inbound and outbound calls. Distributed tracing therefore still requires that small piece of app cooperation (header propagation), even though metrics and topology come for free.
Istio security policies
Theory
On top of automatic mTLS (identity + encryption), Istio enforces security policy via two main CRDs:
- PeerAuthentication: controls mTLS mode for workloads, e.g., STRICT (only accept mTLS), PERMISSIVE (accept both mTLS and plaintext, useful during migration), or DISABLE. Scoped mesh-wide, per-namespace, or per-workload.
- AuthorizationPolicy: defines access control: which identities (principals/ServiceAccounts), from which namespaces/sources, may access which workloads, on which paths/methods. It supports ALLOW and DENY actions and rich conditions (source identity, request path, method, headers, JWT claims). This is how you implement least-privilege service-to-service authorization ("only the web ServiceAccount may call payments on POST /charge").
Istio also supports RequestAuthentication for validating end-user JWTs (e.g., from an OIDC provider) at the proxy, enabling end-user auth at the edge/service. Together these give a layered, identity-based security model — encryption (PeerAuthentication mTLS) plus fine-grained authorization (AuthorizationPolicy) plus request-level auth (RequestAuthentication) — all enforced by the proxies, declaratively, without app code.
Example
# Require strict mTLS, then allow only "web" SA to POST to payments
apiVersion: security.istio.io/v1
kind: PeerAuthentication
metadata: { name: default, namespace: payments }
spec: { mtls: { mode: STRICT } }
---
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata: { name: payments-allow, namespace: payments }
spec:
  selector: { matchLabels: { app: payments } }
  action: ALLOW
  rules:
    - from: [ { source: { principals: ["cluster.local/ns/app/sa/web"] } } ]
      to: [ { operation: { methods: ["POST"], paths: ["/charge"] } } ]
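RequestAuthentication, the third layer described in the theory, could be sketched as follows. The issuer and JWKS URLs are placeholders for whatever OIDC provider is in use, and the example assumes callers forward the end user's token to payments.

```yaml
apiVersion: security.istio.io/v1
kind: RequestAuthentication
metadata: { name: payments-jwt, namespace: payments }
spec:
  selector: { matchLabels: { app: payments } }
  jwtRules:
    - issuer: "https://issuer.example.com"                        # placeholder issuer
      jwksUri: "https://issuer.example.com/.well-known/jwks.json" # placeholder key set
---
# Reject requests that carry no valid JWT at all
apiVersion: security.istio.io/v1
kind: AuthorizationPolicy
metadata: { name: payments-require-jwt, namespace: payments }
spec:
  selector: { matchLabels: { app: payments } }
  action: DENY
  rules:
    - from:
        - source:
            notRequestPrincipals: ["*"]   # matches requests with no authenticated end user
```

The second resource matters because RequestAuthentication on its own rejects invalid tokens but lets requests without any token through. Pairing it with an AuthorizationPolicy is what makes the token mandatory.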
Exercises
- (Beginner) What does PeerAuthentication control?
- (Beginner) What does AuthorizationPolicy let you define?
- (Intermediate) Why is PERMISSIVE mTLS mode useful during a migration?
- (Interview) How do PeerAuthentication, AuthorizationPolicy, and RequestAuthentication layer to form an identity-based security model? (Hint: encryption/identity + service authz + end-user JWT, all at the proxy.)
Answers
- The mTLS mode for workloads — whether they require mTLS (STRICT), accept both mTLS and plaintext (PERMISSIVE), or disable it — scoped mesh-wide, per-namespace, or per-workload.
- Fine-grained access control: which source identities (ServiceAccounts/principals), namespaces, or other attributes may access which workloads, on which paths/methods, via ALLOW/DENY rules and conditions (including JWT claims).
- During migration to mTLS, not all clients may be sending mTLS yet. PERMISSIVE mode lets a workload accept both mTLS and plaintext simultaneously, so you can roll out sidecars/mTLS incrementally without breaking not-yet-migrated callers, then switch to STRICT once everything communicates over mTLS.
- They layer complementary controls all enforced by the proxies: PeerAuthentication establishes encrypted, mutually-authenticated connections (workload identity + confidentiality via mTLS); AuthorizationPolicy uses those verified identities to enforce which services may call which (service-to-service least privilege, by path/method); and RequestAuthentication validates end-user credentials (JWTs from an OIDC provider) at the proxy for request-level/end-user auth. Together they provide encryption + strong workload identity + fine-grained service authorization + end-user authentication — a complete, identity-based, zero-trust security model configured declaratively without modifying application code.
13.3 Linkerd
Linkerd is the lightweight, simplicity-focused mesh. This subchapter covers its design and features.
Linkerd architecture and design philosophy
Theory
Linkerd is a CNCF-graduated service mesh whose guiding philosophy is simplicity, performance, and operational ease — deliberately the opposite of "kitchen-sink" feature breadth. Where Istio aims for maximum capability (and accepts complexity), Linkerd aims to do the essential mesh functions (mTLS, reliability, observability) with minimal overhead and a gentle learning curve.
Its defining technical choice is the micro-proxy: instead of Envoy, Linkerd uses linkerd2-proxy, an ultralight proxy written in Rust, purpose-built only for mesh sidecar duties. It's far smaller and lower-latency/lower-memory than a general-purpose proxy, which is central to Linkerd's performance claims. The control plane is similarly lean (destination, identity, proxy-injector components). The trade-off versus Istio: Linkerd intentionally offers fewer advanced traffic-management knobs, focusing on what most teams actually need. Teams choosing Linkerd typically value getting a secure, observable, reliable mesh running quickly with low overhead over having every possible feature.
Example
| Aspect | Linkerd | Istio |
|---|---|---|
| Proxy | linkerd2-proxy (Rust micro-proxy) | Envoy (general-purpose L7) |
| Philosophy | Simplicity, low overhead | Maximum features/flexibility |
| Learning curve | Gentle | Steep |
| Advanced traffic mgmt | Limited (essentials) | Extensive |
| Resource footprint | Very low | Higher |
linkerd install --crds | kubectl apply -f - # install the Linkerd CRDs first (Linkerd 2.12+)
linkerd install | kubectl apply -f - # install control plane
linkerd check # validate the installation
kubectl annotate ns app linkerd.io/inject=enabled # enable sidecar injection
Exercises
- (Beginner) What is Linkerd's core design philosophy?
- (Beginner) What proxy does Linkerd use, and in what language is it written?
- (Intermediate) What is the main trade-off of choosing Linkerd over Istio?
- (Interview) Why does Linkerd's purpose-built Rust micro-proxy matter for its performance and operational story? (Hint: tiny, fast, low-memory vs. general-purpose Envoy.)
Answers
- Simplicity, performance, and operational ease — doing the essential mesh functions with minimal complexity and overhead, rather than maximizing features.
- linkerd2-proxy, a purpose-built ultralight micro-proxy written in Rust.
- Linkerd offers fewer advanced traffic-management/configuration features than Istio in exchange for far greater simplicity, lower resource overhead, and a gentler learning curve. You trade breadth/flexibility for ease of operation and performance.
- Because the proxy runs in every meshed Pod and is in the path of every request, its size and efficiency dominate the mesh's overhead. A general-purpose proxy (Envoy) is powerful but heavier in memory/CPU/latency. Linkerd's Rust micro-proxy is purpose-built for only sidecar duties, making it extremely small, fast, and memory-light, with Rust providing memory safety without a GC. This yields low per-Pod overhead and latency at scale and fewer moving parts to operate — directly enabling Linkerd's simplicity and performance claims.
Installing and configuring Linkerd
Theory
Linkerd emphasizes a fast, validated install experience. The typical flow uses the linkerd CLI: you run linkerd check --pre to validate that the cluster is ready, install the CRDs and then the control plane, and, crucially, run linkerd check after each step. This hallmark command runs extensive health and configuration checks and tells you precisely what is wrong, if anything. This "check-driven" workflow is part of Linkerd's operational-ease philosophy.
Adding workloads to the mesh is done via injection annotations (linkerd.io/inject: enabled) on namespaces or workloads, which trigger the proxy-injector webhook to add the micro-proxy sidecar — usually applied by re-rolling deployments. Configuration is intentionally minimal: sensible defaults mean most teams need little tuning. Optional extensions add capabilities on top (e.g., linkerd viz for the observability dashboard, linkerd multicluster, linkerd jaeger), keeping the core lean while letting you opt into more. The overall experience targets "secure, observable mesh in minutes, with the tool telling you if something's off."
Example
linkerd check --pre # validate cluster readiness
linkerd install --crds | kubectl apply -f -
linkerd install | kubectl apply -f - # install control plane
linkerd check # validate everything is healthy
linkerd viz install | kubectl apply -f - # optional observability extension
kubectl get deploy web -o yaml | linkerd inject - | kubectl apply -f - # adds the annotation to the Pod template, which triggers a rollout
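The injection annotation described above can also be declared directly in manifests. The following is a minimal sketch, assuming a namespace named app and a Deployment named web (names carried over from the commands above; the labels, port, and image are placeholders). The proxy-injector reads the annotation from the namespace or from the Pod itself, so on a Deployment it belongs under spec.template.metadata.annotations rather than on the Deployment's own metadata.

```yaml
# Option 1: mesh every Pod created in the namespace.
apiVersion: v1
kind: Namespace
metadata:
  name: app
  annotations:
    linkerd.io/inject: enabled
---
# Option 2: mesh a single workload via its Pod template.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
  namespace: app
spec:
  replicas: 1
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
      annotations:
        linkerd.io/inject: enabled   # read by the proxy-injector at Pod creation
    spec:
      containers:
      - name: web
        image: example.com/web:1.0   # placeholder image (assumption)
        ports:
        - containerPort: 8080        # placeholder port (assumption)
```

Pods that already exist are not modified in place; they pick up the sidecar only when they are recreated, for example by a rollout.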
Exercises
- (Beginner) What command validates a Linkerd installation's health?
- (Beginner) How do you add a workload to the Linkerd mesh?
- (Intermediate) What is the purpose of Linkerd extensions like linkerd viz?
- (Interview) How does Linkerd's "check-driven" install workflow reflect its operational-ease philosophy? (Hint: automated validation catches misconfig early, lowers operational risk.)
Answers
- linkerd check (and linkerd check --pre before installing).
- By adding the injection annotation linkerd.io/inject: enabled to the namespace or to the workload (on its Pod template), then rolling the workload so the proxy-injector webhook adds the micro-proxy sidecar.
- Extensions add optional capabilities on top of the lean core. For example, linkerd viz provides the observability dashboard/metrics, linkerd multicluster enables cross-cluster communication, and linkerd jaeger adds tracing integration, so you opt into extra features without bloating the base installation.
- The linkerd check command runs comprehensive validations of prerequisites, control-plane health, certificates, proxy versions, and configuration at every stage, clearly reporting any problems and how to fix them. This catches misconfigurations early and gives operators confidence the mesh is correctly set up, reducing the trial-and-error and operational risk common with complex meshes. It directly embodies Linkerd's goal of being simple and safe to operate.
Automatic mTLS
Theory
A flagship Linkerd feature is automatic mTLS that's on by default — the moment a workload is meshed, its traffic to other meshed workloads is mutually authenticated and encrypted, with zero configuration. This reflects Linkerd's "secure by default, simple to operate" stance: you don't write policies to enable encryption; it just happens.
Mechanically, Linkerd's identity control-plane component acts as a CA, issuing each proxy a short-lived certificate (rotated automatically, by default roughly every 24 hours) tied to the workload's ServiceAccount identity. The micro-proxies establish mTLS transparently for TCP traffic between meshed Pods. Because certificates are short-lived and auto-rotated, the security posture is strong without operator effort or the risk of long-lived credentials. (You can layer authorization policies on top via Linkerd's policy resources, but the baseline encryption + identity requires nothing.) This "mTLS for free, by default" experience is a major part of Linkerd's appeal.
Example
# mTLS is automatic for meshed workloads — verify it's happening:
linkerd viz edges deployment -n app
# SRC DST CLIENT_ID SECURED
# web payments web.app.serviceaccount... √ (mTLS in effect)
# Identity certs are short-lived and auto-rotated (no manual management).
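The policy layer mentioned above builds on the mTLS identity rather than replacing it. As a hedged sketch (the resource names, the app: payments label, and port 8080 are assumptions for illustration, and the API versions shown may differ between Linkerd releases), a Server describes a port on a set of Pods and an AuthorizationPolicy restricts it to clients that authenticate as a given ServiceAccount:

```yaml
# Describe the port to protect on the payments Pods.
apiVersion: policy.linkerd.io/v1beta1
kind: Server
metadata:
  name: payments-http
  namespace: app
spec:
  podSelector:
    matchLabels:
      app: payments        # assumed label
  port: 8080               # assumed port
  proxyProtocol: HTTP/1
---
# Allow only clients whose mTLS identity is the "web" ServiceAccount.
apiVersion: policy.linkerd.io/v1alpha1
kind: AuthorizationPolicy
metadata:
  name: payments-from-web
  namespace: app
spec:
  targetRef:
    group: policy.linkerd.io
    kind: Server
    name: payments-http
  requiredAuthenticationRefs:
  - kind: ServiceAccount
    name: web
```

Because the client identity comes from the certificate issued by the identity component, this kind of rule needs no IP addresses and keeps working as Pods are rescheduled.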
Exercises
- (Beginner) Does Linkerd require configuration to enable mTLS between meshed workloads?
- (Beginner) Which component acts as the CA issuing workload certificates?
- (Intermediate) What workload attribute is a Linkerd proxy's identity tied to, and why does short-lived rotation matter?
- (Interview) How does "mTLS on by default" exemplify Linkerd's design philosophy, and what security benefit comes from automatic short-lived cert rotation? (Hint: secure-by-default/zero-config; reduced credential exposure.)
Answers
- No — automatic mTLS is on by default for meshed workloads; no configuration is needed to enable encryption and mutual authentication between them.
- The Linkerd identity control-plane component (acting as the certificate authority).
- It's tied to the workload's ServiceAccount identity. Short-lived, auto-rotated certificates (e.g., ~24h) matter because they limit the window a leaked credential is useful, require no manual rotation, and avoid the risks of long-lived certs — strong security with no operator effort.
- Enabling encryption and identity automatically, with zero configuration, embodies Linkerd's "simple and secure by default" philosophy — operators get a strong baseline (encrypted, mutually authenticated, identity-based traffic) without writing policy or managing certificates. The automatic short-lived rotation means credentials are continuously refreshed and expire quickly, so a compromised certificate is useless almost immediately and there are no long-lived secrets to leak or manually rotate — robust security that's effortless to maintain.
Linkerd observability features
Theory
Like Istio, Linkerd gives observability for free because all meshed traffic flows through its proxies — but consistent with its philosophy, it focuses on the golden metrics that matter most: success rate, request rate (RPS), and latency (p50/p95/p99) per service and per route. These are exposed through the linkerd viz extension, which includes a dashboard and powerful CLI commands for live inspection.
Distinctive Linkerd observability tools:
- linkerd viz stat: live golden metrics for deployments/services.
- linkerd viz top: a live, top-like view of requests by route.
- linkerd viz tap: stream individual live requests flowing through a proxy in real time (great for debugging).
- A Grafana/Prometheus integration and a web dashboard showing service health and dependency graphs.
The emphasis is on immediate, low-friction insight into service health — golden metrics and live request inspection out of the box — rather than a vast configurable telemetry pipeline. This makes diagnosing "is service X healthy, and what are its requests doing right now?" extremely fast, fitting Linkerd's simplicity-first ethos.
Example
linkerd viz stat deploy -n app
# NAME SUCCESS RPS LATENCY_P95 LATENCY_P99
# web 99.8% 24.0 12ms 45ms
# payments 100.0% 5.0 8ms 20ms
linkerd viz top deploy/web -n app # live per-route request view
linkerd viz tap deploy/web -n app # stream individual live requests
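Per-route metrics depend on Linkerd knowing which routes a service exposes. One way to declare them is a ServiceProfile; the sketch below assumes a Service named web in namespace app and two illustrative routes (the paths are hypothetical, and the API version may vary by Linkerd release).

```yaml
apiVersion: linkerd.io/v1alpha2
kind: ServiceProfile
metadata:
  name: web.app.svc.cluster.local   # must be the Service's fully qualified name
  namespace: app
spec:
  routes:
  - name: GET /users                # label shown in per-route metrics
    condition:
      method: GET
      pathRegex: /users
  - name: POST /charge
    condition:
      method: POST
      pathRegex: /charge
```

With a profile like this in place, golden metrics can be broken down by the named routes instead of being reported only per workload.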
Exercises
- (Beginner) What three "golden metrics" does Linkerd focus on?
- (Beginner) Which extension provides Linkerd's dashboard and metrics CLI?
- (Intermediate) What does linkerd viz tap let you do that aggregate metrics cannot?
- (Interview) How does Linkerd's observability approach reflect its simplicity-first philosophy compared to Istio's? (Hint: focused golden metrics + live request inspection out of the box vs. extensive configurable telemetry.)
Answers
- Success rate, request rate (RPS), and latency (e.g., p50/p95/p99).
- The linkerd viz extension.
- tap streams individual live requests flowing through a proxy in real time (source, destination, path, response code, latency), letting you inspect actual ongoing traffic for debugging, which aggregate metrics (averages/percentiles) can't show at the per-request level.
- Linkerd provides the most useful signals (golden metrics and live request inspection via stat/top/tap) immediately and with minimal setup, focusing on quickly answering "is this service healthy and what are its requests doing?" rather than offering a vast, highly-configurable telemetry system. Istio (via Envoy) exposes a much broader, more configurable set of metrics and integrations, with corresponding complexity. Linkerd's curated, out-of-the-box, low-friction observability matches its overall simplicity-first, operational-ease philosophy.
13.4 Cilium Service Mesh
Cilium brings eBPF-based, sidecar-less mesh capabilities. This subchapter covers its distinctive approach.
eBPF-based networking
Theory
Cilium is a CNI (Chapter 6) and service mesh built on eBPF — a Linux kernel technology that lets you run sandboxed programs inside the kernel in response to events (network packets, syscalls) without modifying the kernel or loading modules. For networking, this means Cilium can implement routing, load balancing, network policy, and observability in the kernel data path, rather than via userspace proxies or long iptables rule chains.
Why this matters: traditional kube-proxy/iptables-based networking degrades as Services and rules grow (long, linearly-scanned chains), and userspace proxies add hops and overhead. eBPF programs execute efficiently in-kernel with hash-map lookups (effectively O(1)), enabling high-performance service load balancing, policy enforcement, and packet processing with minimal overhead. Cilium can even replace kube-proxy entirely. This eBPF foundation is what underpins Cilium's networking and its sidecar-light service-mesh approach — pushing mesh functionality into the kernel instead of a proxy in every Pod.
Example
Traditional: packet -> iptables (long rule chains) / userspace proxy -> dest
Cilium/eBPF: packet -> eBPF program in kernel (hash-map lookup, in-path) -> dest
(efficient L3/L4 routing, LB, policy, telemetry in the kernel)
cilium install # install Cilium as CNI
cilium status # check eBPF datapath / kube-proxy replacement
# Cilium can replace kube-proxy (kubeProxyReplacement) using eBPF for Services.
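When Cilium is installed through its Helm chart, kube-proxy replacement is typically switched on through chart values. The following is a sketch only: the host and port are placeholders, and the accepted value for kubeProxyReplacement has changed across Cilium releases, so check the chart version in use.

```yaml
# values.yaml for the Cilium Helm chart (sketch)
kubeProxyReplacement: true        # older chart versions used "strict" instead
k8sServiceHost: API_SERVER_HOST   # placeholder: address of the API server
k8sServicePort: 6443              # placeholder: API server port
```

The API server address is set explicitly because, without kube-proxy, the Cilium agent cannot rely on the in-cluster kubernetes Service to reach the API server during startup.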
Exercises
- (Beginner) What is eBPF, in one sentence?
- (Beginner) What traditional component can Cilium replace using eBPF?
- (Intermediate) Why does eBPF-based networking scale better than iptables as Services grow?
- (Interview) How does running networking logic in the kernel via eBPF benefit performance compared to userspace proxies and iptables? (Hint: no userspace hops, O(1) lookups, in-kernel data path.)
Answers
- eBPF lets you run sandboxed programs inside the Linux kernel, triggered by events (like network packets or syscalls), without changing kernel source or loading kernel modules.
- kube-proxy (Cilium can provide a kube-proxy replacement for Service load balancing using eBPF).
- iptables-based service routing creates long rule chains that are scanned largely linearly, so per-packet processing and rule-update cost grow with the number of Services/endpoints. eBPF uses efficient in-kernel hash-map lookups (≈O(1)) for service/endpoint resolution, so performance stays roughly constant as Services scale, and updates are far cheaper.
- eBPF programs execute in the kernel directly on the packet path, so traffic doesn't need to be copied to and from a userspace proxy (avoiding context switches and extra hops/latency), and service/endpoint resolution uses constant-time hash-map lookups instead of long iptables chains. The result is lower latency, higher throughput, and better scalability for routing, load balancing, and policy enforcement — achieved in the kernel data path rather than in userspace or via large rule sets.
Cilium as a CNI and service mesh
Theory
Cilium's distinctive proposition is unifying the CNI (networking) and service mesh layers using the same eBPF foundation — and doing much of the mesh without per-Pod sidecars. Because Cilium already operates in the kernel data path as the CNI, it can provide many mesh capabilities (L3/L4 load balancing, transparent mTLS/encryption, network policy, observability) directly via eBPF, with no sidecar proxy in every Pod. For L7 features (HTTP routing, L7 policy), it uses a shared per-node Envoy proxy rather than one per Pod.
The benefits: significantly lower overhead (no proxy injected into every Pod — saving CPU/memory across the fleet and removing sidecar lifecycle/latency issues) and a single integrated system for networking + security + mesh. This "sidecar-less mesh" is part of the same industry shift as Istio's Ambient mode. The trade-off: Cilium's mesh feature set differs from Istio's mature L7 traffic-management breadth, and it requires a recent kernel for full eBPF capabilities. For teams wanting performance and a unified networking+mesh stack, Cilium is compelling.
Example
Cilium (sidecar-less mesh):
L3/L4, mTLS, policy, telemetry -> eBPF in kernel (no per-Pod sidecar)
L7 (HTTP routing/policy) -> shared per-NODE Envoy proxy (not per-Pod)
vs. sidecar mesh: a proxy injected into EVERY Pod (more overhead)
Exercises
- (Beginner) What two layers does Cilium unify with one technology?
- (Beginner) Does Cilium's mesh require a sidecar proxy in every Pod?
- (Intermediate) How does Cilium handle L7 features if it avoids per-Pod sidecars?
- (Interview) What are the benefits and trade-offs of Cilium's sidecar-less, eBPF-based mesh versus a traditional sidecar mesh? (Hint: lower overhead/integration vs. feature breadth/kernel requirements.)
Answers
- The CNI (pod networking) and the service mesh — both built on eBPF in a single integrated system.
- No — much of the mesh (L3/L4 load balancing, mTLS/encryption, policy, observability) is done in the kernel via eBPF without per-Pod sidecars.
- It uses a shared per-node Envoy proxy for L7 functions (HTTP-aware routing, L7 policy), rather than injecting a proxy into every Pod — so L7 capability is available without per-Pod sidecar overhead.
- Benefits: much lower overhead (no proxy in every Pod, saving fleet-wide CPU/memory and eliminating sidecar latency/lifecycle/startup-ordering issues), a unified networking+security+mesh stack on one eBPF foundation, and high performance from in-kernel processing. Trade-offs: the mesh feature set (especially mature, fine-grained L7 traffic management) may be less extensive than Istio's; full eBPF capabilities require a sufficiently recent Linux kernel; and it's a newer/distinct operational model. It suits teams prioritizing performance and an integrated stack over maximal L7 traffic-shaping features.
Hubble observability
Theory
Hubble is Cilium's observability layer, built on eBPF. Because Cilium sees all traffic in the kernel data path, Hubble can provide deep network visibility — flow-level observability of every connection (who talked to whom, on what port, allowed or denied by policy, L7 details like HTTP method/path/status) — without sidecars or app instrumentation. It's the eBPF-native answer to "what is happening on my network, right now and historically."
Hubble offers:
- hubble observe (CLI): stream live network flows with rich filtering (by namespace, pod, verdict, protocol).
- Hubble UI: a visual service map and flow viewer showing live service dependencies and traffic, including which flows are dropped by NetworkPolicy (invaluable for debugging policy).
- Metrics: export flow-based metrics to Prometheus/Grafana.
The standout capability is L3–L7 flow visibility tied to identity and policy verdicts — you can see exactly which connections are happening and why they're allowed or blocked, straight from the kernel. This makes Hubble especially powerful for debugging connectivity and NetworkPolicy issues that are otherwise opaque.
Example
hubble observe --namespace app --verdict DROPPED
# app/web -> app/payments:8080 HTTP/POST /charge DROPPED (policy denied)
# Immediately shows which flows are blocked by NetworkPolicy and why.
hubble observe --namespace app --protocol http
# app/web -> app/api:8080 GET /users 200 (allowed, L7 visible)
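Hubble's components are usually enabled through the Cilium Helm chart. A minimal values sketch, assuming a Helm-based install (the list of metrics is an illustrative selection, not a recommendation):

```yaml
# values.yaml fragment for the Cilium Helm chart (sketch)
hubble:
  enabled: true
  relay:
    enabled: true      # cluster-wide flow API used by the hubble CLI and UI
  ui:
    enabled: true      # service map and flow viewer
  metrics:
    enabled:           # flow-based metrics exported for Prometheus
    - dns
    - drop
    - tcp
    - flow
    - http
```

Relay aggregates flows from the agents on every node, which is what lets a single hubble observe session show traffic for the whole cluster.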
Exercises
- (Beginner) What is Hubble, and what does it provide?
- (Beginner) Does Hubble require sidecars or app instrumentation for network visibility?
- (Intermediate) Why is Hubble especially useful for debugging NetworkPolicy issues?
- (Interview) How does Cilium's eBPF foundation enable Hubble to show identity- and policy-aware L3–L7 flow visibility that's hard to get otherwise? (Hint: kernel sees all traffic + identity + policy verdicts natively.)
Answers
- Hubble is Cilium's eBPF-based observability layer; it provides deep, flow-level network visibility (connections, identities, L7 details like HTTP method/path/status, and policy verdicts) plus a service map, CLI, and metrics.
- No — because Cilium observes traffic in the kernel data path, Hubble provides visibility without per-Pod sidecars or application instrumentation.
- Hubble shows each flow's verdict (allowed/dropped) and the reason, so you can immediately see which connections NetworkPolicy is blocking (and which are permitted), turning otherwise-opaque "why can't A reach B?" policy debugging into a direct observation of dropped flows with their source/destination/identity.
- Because Cilium operates in the kernel as the CNI, it inherently sees every packet/flow, knows the security identity of each endpoint (a numeric identity Cilium derives from the workload's labels), and is the component making policy decisions. It can therefore emit flow records annotated with source/destination identity, L3–L7 details, and the allow/deny verdict directly from the data path. Hubble surfaces this. Achieving the same with traditional tooling would require correlating data from proxies, iptables logs, and app instrumentation across services; Cilium produces it natively and uniformly from one kernel-level vantage point.
Cilium Network Policies
Theory
Standard Kubernetes NetworkPolicy is L3/L4 only (IP/port) and namespaced (Chapter 6). Cilium Network Policies (CRDs: CiliumNetworkPolicy and the cluster-wide CiliumClusterwideNetworkPolicy) extend this with eBPF-powered, identity-aware enforcement and richer capabilities:
- L7 policies: allow/deny based on application-layer attributes (e.g., permit only GET /api/* HTTP requests, or specific Kafka topics/gRPC methods), beyond just port access.
- Identity-based: policies are enforced on Cilium's security identity (a numeric identity derived from the workload's labels) rather than fragile IPs, which is more robust as Pods churn.
- DNS-aware (FQDN) egress: allow egress to specific domain names (e.g., api.stripe.com) rather than IP ranges; Cilium resolves and enforces by FQDN.
- Clusterwide scope: apply policies across all namespaces at once.
These run on the same high-performance eBPF data path. The result is far more expressive segmentation than standard NetworkPolicy — you can enforce least privilege at the application protocol level and by stable identity/FQDN, which is especially valuable for zero-trust and controlling egress to external services.
Example
# Cilium L7 + identity policy: allow web -> api ONLY for GET /api/*
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata: { name: api-l7, namespace: app }
spec:
  endpointSelector: { matchLabels: { app: api } }
  ingress:
  - fromEndpoints: [ { matchLabels: { app: web } } ] # by identity (labels)
    toPorts:
    - ports: [ { port: "8080", protocol: TCP } ]
      rules: { http: [ { method: "GET", path: "/api/.*" } ] } # L7 rule (path is a regex)
---
# FQDN egress: allow only to a specific external domain
# (fragment: belongs under the spec of a CiliumNetworkPolicy)
egress:
- toFQDNs: [ { matchName: "api.stripe.com" } ]
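FQDN rules work from the DNS answers Cilium observes, so a complete policy normally also allows the workload's DNS lookups and routes them through Cilium's DNS proxy. The following is a fuller sketch under stated assumptions: the app: payments label and the policy name are hypothetical, and the kube-dns selector assumes the cluster DNS runs in kube-system with the usual k8s-app: kube-dns label.

```yaml
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: payments-egress-stripe      # hypothetical name
  namespace: app
spec:
  endpointSelector:
    matchLabels:
      app: payments                 # assumed workload label
  egress:
  # 1) Allow DNS to the cluster DNS and let Cilium inspect the lookups.
  - toEndpoints:
    - matchLabels:
        "k8s:io.kubernetes.pod.namespace": kube-system
        "k8s:k8s-app": kube-dns
    toPorts:
    - ports:
      - port: "53"
        protocol: ANY
      rules:
        dns:
        - matchPattern: "*"
  # 2) Allow HTTPS only to the named external domain.
  - toFQDNs:
    - matchName: "api.stripe.com"
    toPorts:
    - ports:
      - port: "443"
        protocol: TCP
```

Once any egress rule selects the Pods, all other egress from them is denied by default, so the DNS rule is what keeps name resolution working.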
Exercises
- (Beginner) Name two capabilities Cilium Network Policies add over standard NetworkPolicy.
- (Beginner) What does FQDN-based egress policy allow you to specify?
- (Intermediate) Why is identity-based (label-derived) enforcement more robust than IP-based?
- (Interview) How do Cilium's L7 and FQDN-aware policies strengthen a zero-trust and egress-control posture beyond standard NetworkPolicy? (Hint: app-protocol least privilege; control external destinations by name.)
Answers
- Any two: L7/application-layer rules (e.g., HTTP method/path, Kafka, gRPC), identity-based enforcement (by workload labels/identity rather than IP), DNS/FQDN-aware egress, and cluster-wide policy scope — all beyond standard NetworkPolicy's L3/L4, IP-based, namespaced model.
- Egress allowed to specific domain names (FQDNs), e.g., api.stripe.com, rather than to raw IP ranges; Cilium resolves and enforces based on the domain.
- Pod IPs are ephemeral and reused as Pods churn, so IP-based rules are fragile and can become stale or ambiguous. Cilium derives a stable security identity from workload labels, so policies follow the workload regardless of its current IP. That is more robust, accurate, and meaningful (you allow "the web service," not "whatever IP it currently has").
- Standard NetworkPolicy can only allow/deny by IP/port, so it can't restrict what an allowed connection does or control external destinations by name. Cilium's L7 policies enforce least privilege at the application protocol level (e.g., only GET /api/*, specific methods/paths/topics), shrinking what a compromised or misbehaving client can do even on an allowed port. FQDN egress lets you permit traffic only to specific external services by domain (e.g., a payment API) instead of broad IP ranges, tightly controlling and auditing egress. Combined with identity-based enforcement, this delivers fine-grained, identity- and protocol-aware segmentation that substantially advances a zero-trust posture and egress control beyond what standard NetworkPolicy can express.
14. Extending Kubernetes
Kubernetes' greatest strength is that it is not a fixed product but an extensible platform. You can teach it about entirely new resource types, write controllers that automate operational knowledge, plug into the API machinery itself, and build all of this with mature client libraries. This chapter covers the extension points — CRDs, the Operator pattern, API extension mechanisms, and the client libraries that power them — that let you make Kubernetes manage anything, not just Pods.
14.1 Custom Resource Definitions (CRDs)
CRDs let you add your own API objects to Kubernetes. This subchapter covers defining, validating, versioning, and structuring them.
CRD schema and validation
Theory
A Custom Resource Definition (CRD) lets you extend the Kubernetes API with your own resource types, which then behave like built-in resources: you kubectl get/apply them, they're stored in etcd, served by the API server, watchable, and RBAC-controlled. For example, you could define a Database or Certificate kind. The CRD declares the new kind's group, version, names (plural/singular/kind), scope (Namespaced/Cluster), and — crucially — its schema.
The schema uses OpenAPI v3 (under spec.versions[].schema.openAPIV3Schema in apiextensions.k8s.io/v1; the older v1beta1 API used validation.openAPIV3Schema) to define the resource's fields and validation rules: types, required fields, enums, ranges, patterns, and defaults. The API server enforces this schema on every create/update and rejects invalid objects, so your custom resource gets the same server-side validation as native resources, with no code. You can also add CEL validation rules (x-kubernetes-validations) for cross-field/complex constraints. A CRD alone just adds the type and storage; to make it do something you pair it with a controller (Operators, next subchapter).
Example
apiVersion: apiextensions.k8s.io/v1
kind: CustomResourceDefinition
metadata: { name: backups.example.com }
spec:
  group: example.com
  scope: Namespaced
  names: { plural: backups, singular: backup, kind: Backup, shortNames: [bk] }
  versions:
  - name: v1
    served: true
    storage: true
    schema:
      openAPIV3Schema:
        type: object
        properties:
          spec:
            type: object
            required: [source, schedule]
            properties:
              source: { type: string }
              schedule: { type: string, pattern: '^(\S+\s+){4}\S+$' } # cron
              retain: { type: integer, minimum: 1, default: 7 }
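The CEL rules mentioned above cover constraints that span more than one field, which plain OpenAPI keywords cannot express. A small sketch, using hypothetical minReplicas and maxReplicas fields that are not part of the Backup example:

```yaml
# Fragment: goes under spec.versions[].schema in a CRD
openAPIV3Schema:
  type: object
  properties:
    spec:
      type: object
      required: [minReplicas, maxReplicas]
      properties:
        minReplicas: { type: integer, minimum: 0 }   # hypothetical field
        maxReplicas: { type: integer, minimum: 1 }   # hypothetical field
      x-kubernetes-validations:
      - rule: "self.minReplicas <= self.maxReplicas" # "self" is the spec object
        message: "minReplicas must not exceed maxReplicas"
```

The rule is attached to the object that contains both fields, so self refers to spec and the expression can compare them directly; the message is what a client sees when the write is rejected.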
Exercises
- (Beginner) What does a CRD let you do, and how do the resulting objects behave?
- (Beginner) What schema standard does a CRD use to validate its fields?
- (Intermediate) Does creating a CRD alone make the custom resource do anything? What else is needed?
- (Interview) How does CRD schema validation give your custom resources the same robustness as built-in types without writing code? (Hint: API server enforces OpenAPI/CEL on every write.)
Answers
- It adds a new custom resource type to the Kubernetes API. The resulting objects behave like native ones — stored in etcd, served by the API server, manageable with kubectl, watchable, and controlled by RBAC.
- OpenAPI v3 (under openAPIV3Schema), optionally augmented with CEL rules via x-kubernetes-validations.
- No. A CRD only defines the type and provides storage/validation/API serving. To make the resource take effect, you need a controller/operator that watches those custom resources and reconciles real-world state to match them.
- The CRD's OpenAPI v3 schema (plus optional CEL rules) is enforced by the API server on every create/update: it type-checks fields, applies required/enum/range/pattern constraints and defaults, and rejects invalid objects, exactly as it does for built-in resources. So your custom type gets server-side validation, defaulting, and consistent API behavior declaratively, without writing any validation code; you just describe the schema.
Versioning and conversion webhooks
Theory
APIs evolve, and CRDs support multiple versions (e.g., v1alpha1, v1beta1, v1) of the same resource, declared in spec.versions. Exactly one version is marked storage: true (the version actually persisted in etcd), and each version can be independently served. This lets you introduce new schema versions while still supporting clients using older ones — essential for backward compatibility as your API matures.
When clients use different versions than the stored one, the API server must convert between them. For trivial changes, the default None conversion strategy (no field changes between versions) suffices. For non-trivial schema changes, you provide a conversion webhook: the API server calls your webhook to convert objects between versions on the fly (e.g., reading a v1 object that's stored as v2, or vice versa). This is what makes safe CRD evolution possible — you can add/rename/restructure fields across versions and the webhook reconciles representations, so old and new clients both work and stored data stays consistent.
Example
spec:
  versions:
  - { name: v1alpha1, served: true, storage: false }
  - { name: v1, served: true, storage: true } # the version stored in etcd
  conversion:
    strategy: Webhook # call a webhook to convert between versions
    webhook:
      clientConfig:
        service: { name: conv, namespace: system, path: /convert }
      conversionReviewVersions: ["v1"]
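The exchange with the webhook is a ConversionReview object: the API server sends the objects plus the version it wants, and the webhook returns them converted. The sketch below shows both halves in YAML for readability (on the wire they are JSON). The uid, the object contents, and the src-to-source field rename are hypothetical, chosen only to illustrate a non-trivial conversion.

```yaml
# Request sent by the API server to the webhook
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
request:
  uid: 705ab4f5-6393-11e8-b7cc-42010a800002   # example value
  desiredAPIVersion: example.com/v1
  objects:
  - apiVersion: example.com/v1alpha1
    kind: Backup
    metadata: { name: nightly, namespace: app }
    spec: { src: db-main, schedule: "0 2 * * *" }      # hypothetical old field "src"
---
# Response returned by the webhook
apiVersion: apiextensions.k8s.io/v1
kind: ConversionReview
response:
  uid: 705ab4f5-6393-11e8-b7cc-42010a800002   # must echo request.uid
  result:
    status: Success
  convertedObjects:                           # same order as request.objects
  - apiVersion: example.com/v1
    kind: Backup
    metadata: { name: nightly, namespace: app }
    spec: { source: db-main, schedule: "0 2 * * *" }   # renamed to "source"
```

The webhook should change only the apiVersion and the fields that differ between versions; identifying metadata such as name and namespace is expected to come back unchanged.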
Exercises
- (Beginner) How many CRD versions can be marked as the storage version?
- (Beginner) What does the served flag control for a CRD version?
- (Intermediate) When is a conversion webhook necessary versus the None strategy?
- (Interview) How do multiple versions plus conversion webhooks enable safe, backward-compatible evolution of a custom API? (Hint: serve old+new, store one, convert on the fly.)
Answers
- Exactly one (only one version has storage: true, the version persisted in etcd).
- Whether that version is exposed/served by the API server for clients to read and write (a version can be served without being the storage version, or stop being served when deprecated).
- None works only when versions are structurally identical (no field changes requiring transformation). A conversion webhook is needed when versions differ non-trivially (added/renamed/restructured fields), so the API server can transform objects between the requested version and the stored version correctly.
- You can serve several versions simultaneously (so old and new clients each use their version) while persisting just one storage version in etcd. A conversion webhook transforms objects between any served version and the storage version on the fly, so reads/writes in any version are consistent regardless of how data is stored. This lets you introduce a new schema version, migrate clients gradually, and eventually deprecate old versions without breaking existing clients or corrupting stored data, which is what enables safe, backward-compatible API evolution.
Structural schemas and pruning
Theory
Modern CRDs (apiextensions.k8s.io/v1) require a structural schema — a complete, well-formed OpenAPI v3 schema that specifies the type of every field (no ambiguous/partial schemas). Structural schemas are the foundation that enables important safety features:
- Pruning: with a structural schema, the API server removes (prunes) any fields not declared in the schema before storing the object. This prevents clients from sneaking unknown/typo'd fields into etcd, which could otherwise cause confusion, bloat, or security issues. (Without pruning, unknown fields would be silently preserved.)
- Defaulting: structural schemas allow server-side default values for fields.
- They also enable proper validation, conversion, and kubectl explain documentation for your CRD.
The practical upshot: define your CRD's schema completely and correctly, and the API server will validate, default, prune, and document your resource just like a native type — and reject or strip anything that doesn't conform. Pruning in particular guards against silent acceptance of malformed/unexpected data, making your custom API strict and predictable.
Example
# A structural schema (every field typed) enables pruning + defaulting:
schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        properties:
          replicas: { type: integer, default: 1 } # defaulting
          image: { type: string }
          # NOT listed here => pruned (removed) on write

kubectl apply -f cr-with-typo.yaml --validate=ignore
# With field validation off, a field like "replcias: 3" (typo, not in schema) is PRUNED:
# silently dropped, so the object stores only declared fields.
# With strict validation (--validate=strict, the default in recent kubectl versions),
# the unknown field is reported as an error instead, which catches the typo earlier.
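Pruning can be switched off for a specific subtree when a field genuinely needs to hold free-form data. A minimal sketch, assuming a hypothetical extraConfig field added to the schema above:

```yaml
schema:
  openAPIV3Schema:
    type: object
    properties:
      spec:
        type: object
        properties:
          replicas: { type: integer, default: 1 }
          image: { type: string }
          extraConfig:                                  # hypothetical free-form field
            type: object
            x-kubernetes-preserve-unknown-fields: true  # contents are NOT pruned
```

Everything outside extraConfig is still pruned and validated as usual, so the opt-out stays limited to the one place where arbitrary keys are expected.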
Exercises
- (Beginner) What is required for a modern (v1) CRD's schema?
- (Beginner) What does "pruning" do to fields not declared in the schema?
- (Intermediate) Why is pruning a useful safety feature?
- (Interview) How do structural schemas underpin validation, defaulting, and pruning, and what risk do they mitigate? (Hint: complete typed schema enables strict, predictable handling; prevents unknown fields in etcd.)
Answers
- A complete structural schema — a well-formed OpenAPI v3 schema that fully specifies the type of every field (no partial/ambiguous schemas).
- They are removed (pruned) from the object before it's persisted, so only fields declared in the schema are stored.
- It prevents unknown, misspelled, or unexpected fields from being silently stored in etcd. This avoids confusion (a typo'd field that's silently ignored), data bloat, and potential security/consistency issues from clients injecting arbitrary fields — keeping stored objects strictly conformant to the declared schema.
- A structural schema fully types every field, which the API server needs to reliably validate values, apply defaults, perform version conversion, and document the type (kubectl explain). It also enables pruning: anything not in the schema is stripped on write. The risk it mitigates is unstructured/unknown data entering the API. Without it, typo'd or unexpected fields could be silently preserved, leading to confusing behavior, bloat, or security concerns. With it, the custom resource is handled as strictly and predictably as a native type.
CRD status subresource
Theory
Kubernetes objects conventionally separate spec (the user's desired state) from status (the controller's observed/actual state). A CRD can enable the /status subresource, which formalizes this split with important benefits:
- Separation of concerns: users/clients write spec; the controller writes status. With the status subresource enabled, updates to spec and status go through separate API endpoints, so a controller updating status doesn't accidentally clobber a user's spec change (and RBAC can grant status-update separately).
- No spec changes via status updates: a write to the /status endpoint ignores changes to anything but status, and a write to the main resource endpoint ignores changes to status, preventing conflicts.
- Companion /scale subresource: a CRD can also expose a scale subresource, letting kubectl scale and the HPA operate on your custom resource.
This is the standard pattern for any custom resource managed by a controller: the controller continuously reconciles toward spec and reports progress/conditions in status (e.g., conditions, observedGeneration, phase). Enabling the status subresource is best practice for well-behaved, controller-managed CRDs.
Example
spec:
  versions:
  - name: v1
    served: true
    storage: true
    subresources:
      status: {}                              # enable the /status subresource
      scale:                                  # optional: enable kubectl scale / HPA
        specReplicasPath: .spec.replicas
        statusReplicasPath: .status.replicas
    schema:
      openAPIV3Schema:
        type: object
        properties:                           # declare replicas so pruning keeps the scale paths
          spec: { type: object, properties: { replicas: { type: integer } } }
          status: { type: object, properties: { replicas: { type: integer } } }
// Controller updates status via the dedicated endpoint (won't touch spec):
if err := r.Status().Update(ctx, &backup); err != nil { // controller-runtime: status subresource
	return ctrl.Result{}, err
}
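To make the endpoint split concrete, here is a minimal in-memory model of how writes behave once the status subresource is enabled. The Backup and Store types are hypothetical, and the sketch deliberately leaves out resourceVersion conflict checks, which the real API server also applies.

```go
package main

import "fmt"

// Backup is a hypothetical custom resource reduced to the two halves that matter here.
type Backup struct {
	Spec   string // desired state, owned by the user
	Status string // observed state, owned by the controller
}

// Store mimics how writes are treated when the /status subresource is enabled.
type Store struct{ obj Backup }

// Update models a write to the main resource endpoint: status changes are ignored.
func (s *Store) Update(in Backup) { s.obj.Spec = in.Spec }

// UpdateStatus models a write to the /status endpoint: spec changes are ignored.
func (s *Store) UpdateStatus(in Backup) { s.obj.Status = in.Status }

func main() {
	s := &Store{obj: Backup{Spec: "schedule=hourly", Status: "Pending"}}

	// The controller read the object earlier, so the spec it holds is stale.
	stale := Backup{Spec: "schedule=hourly", Status: "Completed"}

	s.Update(Backup{Spec: "schedule=daily", Status: "Tampered"}) // user edit; status ignored
	s.UpdateStatus(stale)                                        // controller report; spec ignored

	fmt.Printf("%+v\n", s.obj) // {Spec:schedule=daily Status:Completed}
}
```

The user's newer spec survives the controller's write even though the controller submitted an older copy of the object, and the user cannot overwrite status through the main endpoint.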
Exercises
- (Beginner) What is the conventional difference between spec and status?
- (Beginner) Who writes spec and who writes status for a controller-managed CRD?
- (Intermediate) What problem does enabling the status subresource prevent?
- (Interview) Why is enabling the status (and optionally scale) subresource best practice for controller-managed CRDs? (Hint: separate endpoints avoid clobbering; enables scale/HPA; clean reconcile reporting.)
Answers
- spec is the user's declared desired state; status is the controller's reported observed/actual state.
- Users/clients write spec; the controller writes status.
- It separates updates to spec and status into distinct API endpoints, so a controller writing status can't accidentally overwrite a concurrent user change to spec (and vice versa), preventing conflicts/clobbering between the user's desired state and the controller's reported state.
- Enabling the status subresource enforces the spec/status separation at the API level: controllers update status without risk of clobbering user spec edits, RBAC can grant status updates independently, and reconcile progress (conditions, observedGeneration, phase) is reported cleanly. The scale subresource lets standard tooling (kubectl scale) and the HPA operate on the custom resource by exposing replicas paths. Together they make a custom resource behave like a proper, well-integrated Kubernetes object that controllers and ecosystem tools handle correctly, which is the expected pattern for any controller-managed CRD.
14.2 Operators
Operators combine CRDs with controllers to automate operational knowledge. This subchapter covers the pattern and how to build them.
Operator pattern and use cases
Theory
A CRD adds a new type; an Operator makes it do something. The Operator pattern combines a custom resource (CRD) with a custom controller that encodes operational knowledge — the human expertise of running a particular application — as software. The controller continuously watches its custom resources and reconciles the real world to match them, automating tasks a human operator would otherwise perform.
The canonical use case is stateful, complex applications that require domain-specific operational logic: a database operator (PostgreSQL, MySQL) that handles provisioning, configuration, backups, failover, scaling, and version upgrades automatically; a Kafka/Elasticsearch/Prometheus operator; etc. Instead of you running runbooks, you declare a PostgresCluster resource and the operator does the rest — and keeps doing it (self-healing, responding to failures). The pattern's power is encoding the "Day 2" operational know-how (not just install, but ongoing management) into a controller, turning manual expertise into automated, reliable, declarative operations. Many such operators are published on OperatorHub.
Example
Operator = CRD (new API) + Controller (operational logic)
You declare:                    The operator continuously does:
  kind: PostgresCluster           - provision Pods/PVCs/Services
  spec:                           - configure replication
    instances: 3                  - take scheduled backups
    version: "16"                 - detect primary failure -> failover
    backup: { schedule: ... }     - perform safe version upgrades
Exercises
- (Beginner) What two things does the Operator pattern combine?
- (Beginner) What kind of applications most benefit from Operators?
- (Intermediate) Give three "Day 2" operational tasks a database operator might automate.
- (Interview) Explain how the Operator pattern turns human operational knowledge into software, and why that's valuable. (Hint: encode runbooks as a reconciling controller; reliability, self-healing, declarative ops.)
Answers
- A custom resource (CRD) and a custom controller that reconciles those resources (encoding operational logic).
- Complex, stateful applications that require domain-specific operational knowledge to run — e.g., databases, message brokers, search/monitoring systems (PostgreSQL, Kafka, Elasticsearch, Prometheus).
- Any three: provisioning/configuration, automated backups and restores, failover/high-availability handling, scaling, rolling version upgrades, and monitoring/self-healing.
- An Operator encodes the expertise a skilled human operator uses to run an application — the runbooks for install, configuration, backup, failover, scaling, and upgrades — into a controller that continuously watches custom resources and reconciles reality to the declared spec. This is valuable because it makes that operational knowledge automated, consistent, and always-on: tasks that would otherwise require manual intervention (and be error-prone, slow, or done only when someone notices) happen reliably and immediately, the system self-heals from failures, and operations become declarative (you state desired outcomes, the operator achieves them). It scales human expertise across many instances and clusters without scaling the humans.
Controller reconcile loop
Theory
The heart of any controller/operator is the reconcile loop — the same level-triggered pattern from Chapter 1, applied to your resources. The controller watches its resource(s) and, on any change (create/update/delete) or periodic resync, invokes a Reconcile function for the affected object. Reconcile's job: read the current desired state (the object's spec), observe the actual world, and take actions to converge them — then update status.
Key properties of a well-written reconcile function:
- Idempotent: it may be called many times for the same object (events, resyncs, retries), so it must produce the same correct result each time — it computes desired vs. actual and acts only on the difference, never assuming it's the "first" call.
- Level-triggered: it acts on the current observed state, not on the specific event that triggered it — so missed events don't break convergence; the next reconcile fixes things.
- Returns a result: success, requeue-after-duration, or error (which triggers retry with backoff).
This loop is what gives operators their robustness and self-healing: regardless of what happened (or what was missed), each reconcile drives the system toward the declared state, retrying on failure. Writing correct, idempotent reconcile logic is the core skill of operator development.
Example
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
	var backup examplev1.Backup
	if err := r.Get(ctx, req.NamespacedName, &backup); err != nil {
		return ctrl.Result{}, client.IgnoreNotFound(err) // deleted: nothing to do
	}
	// Observe actual state, compare to backup.Spec, act on the DIFFERENCE (idempotent):
	if !backupJobExists(&backup) {
		// createBackupJob is assumed to return an error; returning it requeues with backoff.
		if err := r.createBackupJob(ctx, &backup); err != nil {
			return ctrl.Result{}, err
		}
		// Write status only when something changed, so the status update
		// does not itself trigger an endless stream of reconciles.
		backup.Status.LastRun = metav1.Now()
		if err := r.Status().Update(ctx, &backup); err != nil { // report observed state
			return ctrl.Result{}, err
		}
	}
	return ctrl.Result{RequeueAfter: time.Hour}, nil // periodic resync
}
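The idempotent, level-triggered behavior can be seen in isolation with a toy model that has no Kubernetes dependencies. The World type and this Reconcile function are hypothetical stand-ins for the cluster and a controller; the point is only that repeated calls converge on the same state.

```go
package main

import "fmt"

// World is a hypothetical stand-in for the cluster: the set of Job names that exist.
type World map[string]bool

// Reconcile creates the Job for the named backup only if it is missing.
// It returns true when it had to act, false when actual already matched desired.
func Reconcile(w World, backupName string) bool {
	job := backupName + "-job"
	if w[job] {
		return false // already converged: nothing to do
	}
	w[job] = true // act on the difference only
	return true
}

func main() {
	w := World{}
	fmt.Println(Reconcile(w, "nightly")) // true: Job created
	fmt.Println(Reconcile(w, "nightly")) // false: a repeat call is a no-op
	delete(w, "nightly-job")             // drift: something deletes the Job
	fmt.Println(Reconcile(w, "nightly")) // true: the next pass repairs it
	fmt.Println(len(w))                  // 1: never more than one Job
}
```

Nothing in Reconcile depends on which event triggered it. It inspects the current state each time, which is why a missed delete event is still repaired on the next pass.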
Exercises
- (Beginner) What does a controller's Reconcile function do at a high level?
- (Beginner) Why must Reconcile be idempotent?
- (Intermediate) What does it mean that controllers are "level-triggered," and why does it aid robustness?
- (Interview) How does the reconcile loop give operators self-healing behavior even when events are missed or actions fail? (Hint: act on current state each call; retries/requeue; convergence.)
Answers
- It reads the desired state (the resource's spec), observes the actual state of the world, takes actions to make actual match desired, and updates the resource's status.
- Because it can be invoked many times for the same object (from events, periodic resyncs, and retries), so it must not assume it's running for the first time. It should compute desired vs. actual and act only on the difference, producing the same correct outcome regardless of how often it's called — otherwise repeated calls could create duplicates or cause errors.
- Level-triggered means the controller responds to the current observed state rather than to the specific event that triggered the call. So even if an event is missed or arrives out of order, the next reconcile still observes reality and corrects it — convergence depends on current state, not on processing every event, which makes the controller robust to dropped events, restarts, and races.
- Each reconcile re-observes the current actual state and drives it toward the declared spec, so any drift (a deleted resource, a failed component, a missed event) is detected and corrected on the next pass. If an action fails or the object isn't yet converged, the controller returns an error or requeue, and the loop retries (with backoff) — repeatedly attempting until the world matches the spec. Because correctness depends only on current state and the loop keeps running/retrying, the operator continuously heals the system toward desired state regardless of intermediate failures or missed events.
Operator SDK and Kubebuilder
Theory
Writing an operator from scratch (API types, CRD manifests, controller scaffolding, manager setup) is a lot of boilerplate. Two frameworks generate and standardize this, both built on the controller-runtime library (the Go foundation for controllers):
- Kubebuilder: a CNCF/Kubernetes-SIG project that scaffolds an operator in Go. It generates the project layout, API type definitions, CRD YAML (from Go structs + markers), and controller stubs, and provides the manager/runtime wiring. You define your types and fill in Reconcile.
- Operator SDK (Red Hat, part of the Operator Framework): builds on Kubebuilder for Go operators and additionally supports Ansible and Helm-based operators (where you can turn an existing Helm chart or Ansible playbooks into an operator with little/no Go code), plus integration with OLM (next).
Both let you focus on the business logic (the reconcile function and API design) rather than plumbing. Kubebuilder is the lower-level Go-centric foundation; Operator SDK wraps it with more options (Go/Ansible/Helm) and Operator Framework tooling. They generate CRDs, RBAC, deployment manifests, and test scaffolding via code markers, dramatically accelerating operator development.
Example
# Kubebuilder: scaffold a project, an API/CRD, and a controller
kubebuilder init --domain example.com --repo example.com/backup-operator
kubebuilder create api --group apps --version v1 --kind Backup
# -> generates: API types (Go), CRD YAML, controller stub (Reconcile), RBAC markers
make manifests   # regenerate CRDs/RBAC from Go markers
make run         # run the controller locally against the cluster
// Markers in Go comments drive CRD/RBAC generation. The API group is <group>.<domain>,
// so with the flags above it is apps.example.com (not the built-in "apps" group):
//+kubebuilder:validation:Minimum=1
//+kubebuilder:rbac:groups=apps.example.com,resources=backups,verbs=get;list;watch;create;update
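As an illustration of where validation markers attach, the sketch below defines a spec struct with markers on its fields. BackupSpec, its fields, and EncodeSpec are assumptions made for this example, and a real scaffolded type would also embed the metav1 type and object metadata, omitted here to keep the block standard-library only.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// BackupSpec is a hypothetical spec for the Backup kind; the field names
// and constraints are illustrative assumptions, not part of the scaffold.
type BackupSpec struct {
	// Schedule is a cron expression; the marker below becomes minLength: 1
	// in the generated OpenAPI schema.
	// +kubebuilder:validation:MinLength=1
	Schedule string `json:"schedule"`

	// Retain is how many backups to keep; these markers become
	// minimum: 1 and default: 7 in the generated schema.
	// +kubebuilder:validation:Minimum=1
	// +kubebuilder:default=7
	Retain int32 `json:"retain,omitempty"`
}

// EncodeSpec shows the wire format implied by the json tags, which is also
// where the generated CRD takes its property names from.
func EncodeSpec(s BackupSpec) (string, error) {
	b, err := json.Marshal(s)
	return string(b), err
}

func main() {
	out, err := EncodeSpec(BackupSpec{Schedule: "0 2 * * *", Retain: 3})
	fmt.Println(out, err) // {"schedule":"0 2 * * *","retain":3} <nil>
}
```

The markers are ordinary comments to the Go compiler. Only the generator run by make manifests reads them, so the constraints are enforced by the API server through the generated schema rather than by this Go code.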
Exercises
- (Beginner) What underlying library do both Kubebuilder and Operator SDK build on?
- (Beginner) Besides Go, what two operator types does Operator SDK support?
- (Intermediate) What does scaffolding with these tools generate so you can focus on business logic?
- (Interview) How do Go markers/annotations in Kubebuilder reduce boilerplate, and what's the relationship between Kubebuilder and Operator SDK? (Hint: markers generate CRD/RBAC; SDK builds on Kubebuilder + adds Ansible/Helm/OLM.)
Answers
- controller-runtime (the Go library for building controllers/operators).
- Ansible-based and Helm-based operators (in addition to Go).
- Project scaffolding: the directory layout, API type definitions, generated CRD manifests, controller stubs (with a Reconcile method to implement), RBAC rules, manager/runtime wiring, and test scaffolding, so you implement the reconcile logic and API design rather than the plumbing.
- Kubebuilder uses special comment markers on Go types/methods (e.g., validation constraints, RBAC rules, subresources) that code generators read to produce CRD YAML, RBAC manifests, and deepcopy code automatically, so you declare intent inline and regenerate boilerplate with make, instead of hand-writing it. Operator SDK builds on top of Kubebuilder for Go operators (sharing controller-runtime and the same scaffolding), and extends it with additional operator types (Ansible, Helm) and Operator Framework integration (like OLM). So Kubebuilder is the Go-focused foundation; Operator SDK wraps it with more options and tooling.
Operator Lifecycle Manager (OLM)
Theory
Once operators exist, you need to manage the operators themselves — install them, resolve their dependencies, handle their RBAC/CRDs, and upgrade them safely over time. The Operator Lifecycle Manager (OLM) is a component (part of the Operator Framework) that does exactly this: it's like a package manager for operators, running in the cluster to install and manage operators declaratively.
OLM introduces concepts such as:
- ClusterServiceVersion (CSV): metadata describing an operator version — its CRDs, required permissions, dependencies, and install strategy.
- Subscription: declares that you want a particular operator (from a catalog) installed and kept updated on a chosen update channel (e.g., stable), enabling automatic or manual upgrades.
- CatalogSource / OperatorHub: catalogs of available operators OLM can install from.
OLM handles dependency resolution (an operator needing another), upgrade graphs (moving safely between versions via channels), and lifecycle/permissions — so cluster admins manage operators consistently rather than each operator being installed ad hoc. It's prominent in OpenShift (where it's built in) and available on upstream Kubernetes.
Example
# Subscribe to an operator from a catalog; OLM installs and keeps it updated
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata: { name: my-db-operator, namespace: operators }
spec:
  channel: stable                  # update channel (upgrade graph)
  name: my-db-operator
  source: operatorhubio-catalog    # a CatalogSource
  sourceNamespace: olm
  installPlanApproval: Automatic   # or Manual (gate upgrades)
Exercises
- (Beginner) What is OLM, by analogy?
- (Beginner) What does a Subscription declare?
- (Intermediate) What does a ClusterServiceVersion (CSV) describe?
- (Interview) Why is lifecycle management of operators (dependency resolution, channels, upgrades) valuable in a cluster running many operators? (Hint: consistent install/upgrade/permissions vs. ad-hoc operator sprawl.)
Answers
- A package manager for operators — it installs, manages, and upgrades operators in the cluster declaratively (analogous to how Helm/apt manage packages, but for operators).
- That you want a particular operator installed from a catalog and kept updated on a chosen update channel (with automatic or manual upgrade approval).
- An operator version's metadata: its CRDs, required RBAC permissions, dependencies, install strategy, and other descriptive information OLM uses to install and manage that operator.
- With many operators, installing/upgrading each ad hoc leads to inconsistency, version drift, permission sprawl, and dependency conflicts. OLM provides consistent, declarative lifecycle management: it resolves operator dependencies, uses update channels/upgrade graphs to move safely between versions, manages each operator's CRDs and RBAC, and lets admins approve or automate upgrades. This makes running and maintaining a fleet of operators reliable and auditable — operators are managed as first-class, versioned packages rather than scattered manual installs.
OperatorHub
Theory
OperatorHub.io is a central registry/catalog of community and vendor operators — the "app store" for Kubernetes operators. Rather than building an operator for a common application yourself, you can browse OperatorHub for an existing, often well-maintained operator (for databases, monitoring, messaging, storage, etc.), review its capabilities and maturity level, and install it.
OperatorHub categorizes operators by a capability level model, from Level 1 "Basic Install" up through Level 5 "Auto Pilot" (full automated operations: auto-scaling, auto-healing, auto-tuning), so you can judge how much operational automation an operator actually provides. Operators from OperatorHub integrate with OLM for installation and lifecycle management (especially on OpenShift, which surfaces OperatorHub in its console). The value: a large ecosystem of reusable operational automation you can adopt instead of reinventing, with a standard way to assess maturity and install/manage them. (Artifact Hub, the CNCF catalog best known for Helm charts, also lists operators; OperatorHub is the OLM-centric catalog.)
Example
OperatorHub capability levels (maturity of operational automation):
  1  Basic Install      -> install/configure the app
  2  Seamless Upgrades  -> handle version upgrades
  3  Full Lifecycle     -> backups, failover, scaling
  4  Deep Insights      -> metrics, alerts, analysis
  5  Auto Pilot         -> auto-scaling/healing/tuning
# Discover an operator on operatorhub.io, then install it (via OLM):
# create a Subscription (see OLM example) pointing at the catalog entry.
Exercises
- (Beginner) What is OperatorHub?
- (Beginner) What does the capability-level model communicate about an operator?
- (Intermediate) What does OperatorHub integrate with for installation and lifecycle?
- (Interview) Why might you adopt an operator from OperatorHub instead of building your own, and what should you evaluate before doing so? (Hint: reuse maintained automation; assess maturity/capability level, trust, maintenance.)
Answers
- A central registry/catalog ("app store") of community- and vendor-provided Kubernetes operators that you can browse and install.
- How much operational automation the operator provides — from Level 1 (basic install) up to Level 5 (auto-pilot: auto-scaling, healing, tuning) — helping you gauge its maturity and what "Day 2" operations it handles.
- OLM (the Operator Lifecycle Manager), which installs and manages the operators (with OperatorHub prominently integrated into OpenShift's console).
- Building and maintaining a robust operator for a complex stateful app is significant, ongoing work; OperatorHub offers existing, often vendor/community-maintained operators that encode that expertise, saving development and maintenance effort. Before adopting one you should evaluate its capability/maturity level (does it automate the operations you need — backups, upgrades, failover?), its trustworthiness and source (vendor vs. community, security posture, RBAC it requires), maintenance activity and support, license, and how well it fits your environment — since you're delegating critical operational logic to third-party software running with cluster permissions.
14.3 Kubernetes API Extension Points
Beyond CRDs, Kubernetes exposes deeper extension points in its API machinery. This subchapter covers them.
Aggregated API servers
Theory
CRDs are the easy way to add API types, but they have limits: they're stored in etcd as-is, can't do arbitrary custom storage/business logic on reads, and can't easily implement complex behaviors (custom subresources, specialized validation/processing, non-etcd backing). For those cases, Kubernetes offers API Aggregation: you run your own API server (an "extension API server") and register it with the main API server via an APIService object, so requests for your API group/version are transparently proxied to your server.
To clients, the aggregated API looks like a native part of the Kubernetes API (same endpoint, auth, discovery) — but your extension server fully controls how those resources are validated, stored, and served. This is how Metrics Server works (it serves metrics.k8s.io as an aggregated API, computing metrics on the fly rather than storing them in etcd), and how custom/external metrics APIs are implemented. Aggregation is more powerful but more complex than CRDs (you build and operate a real API server). Rule of thumb: use CRDs unless you need custom storage/logic that CRDs can't provide — then reach for an aggregated API server.
Example
# Register an extension API server so /apis/metrics.k8s.io is proxied to it:
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata: { name: v1beta1.metrics.k8s.io }
spec:
  group: metrics.k8s.io
  version: v1beta1
  service: { name: metrics-server, namespace: kube-system }   # your API server
  groupPriorityMinimum: 100
  versionPriority: 100
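The proxying decision that an APIService registration sets up can be sketched as a lookup on the request path. This is a simplified model under stated assumptions: the APIService struct and Route function here are hypothetical, only service-backed registrations are modeled, and authentication, discovery, and TLS are ignored.

```go
package main

import (
	"fmt"
	"strings"
)

// APIService is a simplified stand-in for the real object: it maps one
// group/version to the Service that serves it.
type APIService struct {
	Group, Version, Service string
}

// Route returns which backend should serve an API path: the registered
// extension server for a matching /apis/<group>/<version> prefix, otherwise
// the main API server itself.
func Route(registered []APIService, path string) string {
	parts := strings.Split(strings.Trim(path, "/"), "/")
	if len(parts) >= 3 && parts[0] == "apis" {
		for _, s := range registered {
			if s.Group == parts[1] && s.Version == parts[2] {
				return s.Service // proxied to the extension API server
			}
		}
	}
	return "kube-apiserver" // served locally (built-in types and CRDs)
}

func main() {
	registered := []APIService{
		{Group: "metrics.k8s.io", Version: "v1beta1", Service: "kube-system/metrics-server"},
	}
	fmt.Println(Route(registered, "/apis/metrics.k8s.io/v1beta1/nodes")) // kube-system/metrics-server
	fmt.Println(Route(registered, "/apis/apps/v1/deployments"))         // kube-apiserver
}
```

Clients always talk to the same endpoint; only the group/version in the path decides whether the main API server answers or forwards the request.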
Exercises
- (Beginner) What does API aggregation let you add to Kubernetes beyond a CRD?
- (Beginner) What object registers an aggregated API with the main API server?
- (Intermediate) Give a real example of an aggregated API server and why it needs aggregation rather than a CRD.
- (Interview) When should you choose an aggregated API server over a CRD, and what's the cost? (Hint: custom storage/logic/subresources vs. operating a real API server.)
Answers
- Your own extension API server serving custom API group(s)/version(s) with full control over validation, storage, and request handling — not just a new type stored in etcd like a CRD.
- An APIService object (apiregistration.k8s.io), which tells the main API server to proxy requests for that group/version to your extension server.
- The Metrics Server serves metrics.k8s.io as an aggregated API. It needs aggregation (not a CRD) because metrics are computed on the fly and held in memory rather than stored in etcd; a CRD would persist objects in etcd and can't implement that dynamic, non-etcd-backed, computed serving behavior. (Custom/external metrics adapters are similar.)
- Choose an aggregated API server when you need capabilities CRDs can't provide: custom/non-etcd storage, computed-on-the-fly responses, specialized validation/business logic on read/write, custom subresources, or protocol-level control. The cost is significant complexity: you must build, deploy, secure, scale, and maintain a real API server (handling auth delegation, TLS, availability), whereas a CRD is declarative and managed by the existing API server. Default to CRDs; use aggregation only when their limitations genuinely block your requirements.
Admission webhooks deep dive
Theory
Admission webhooks (introduced in Chapter 9) are a primary API extension point: they let you intercept and influence API requests with custom logic, without modifying Kubernetes. Recapping and going deeper:
- Mutating webhooks modify objects (inject sidecars, set defaults); validating webhooks accept/reject (enforce policy). Mutating runs first.
- Configuration: MutatingWebhookConfiguration / ValidatingWebhookConfiguration specify the rules (which operations/resources/groups to intercept), a clientConfig (the webhook endpoint + CA bundle for TLS), namespaceSelector / objectSelector (scoping), failurePolicy (Fail/Ignore), timeoutSeconds, sideEffects, and reinvocationPolicy (mutating webhooks only).
- Operational pitfalls (critical at scale): webhooks are synchronous and on the API critical path, so a slow/down webhook with failurePolicy: Fail can block API operations cluster-wide; intercepting too broadly (e.g., all Pods including kube-system) can cause outages or deadlocks (a webhook that gates its own dependencies); and they require valid TLS (the API server verifies the webhook's cert against the configured CA).
Webhooks are extremely powerful (they underpin service meshes, policy engines, defaulting), but must be scoped narrowly, kept fast and highly available, and given a deliberately chosen failurePolicy. A newer in-process alternative, ValidatingAdmissionPolicy (CEL), avoids the webhook network hop for many validation cases.
Example
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata: { name: pod-policy }
webhooks:
- name: validate.example.com
  rules: [ { operations: ["CREATE"], apiGroups: [""], apiVersions: ["v1"], resources: ["pods"] } ]
  clientConfig: { service: { name: policy, namespace: sys, path: /validate }, caBundle: <CA> }
  namespaceSelector: { matchExpressions: [ { key: kubernetes.io/metadata.name, operator: NotIn, values: [kube-system] } ] }
  failurePolicy: Fail        # consider blast radius if the webhook is down!
  timeoutSeconds: 5
  sideEffects: None
  admissionReviewVersions: ["v1"]
Exercises
- (Beginner) What two kinds of admission webhooks exist, and which runs first?
- (Beginner) Why do admission webhooks require TLS?
- (Intermediate) What can go wrong if a webhook with failurePolicy: Fail becomes unavailable?
- (Interview) What operational precautions make admission webhooks safe at scale, and what in-process alternative reduces the risk? (Hint: scope narrowly, HA, timeouts, exclude kube-system; ValidatingAdmissionPolicy/CEL.)
Answers
- Mutating webhooks (modify objects) and validating webhooks (accept/reject); the mutating phase runs first.
- The API server calls the webhook over HTTPS and verifies its certificate against the configured CA bundle (caBundle), so the connection must use valid TLS; this authenticates the webhook and protects the request/response in transit.
- Since webhooks are synchronous and on the API request critical path, a Fail policy means that if the webhook is slow or unreachable, the API server rejects the affected requests, potentially blocking creation/updates of those resources cluster-wide (e.g., no new Pods) and causing an outage. Overly broad scope can even deadlock (the webhook gating resources it itself depends on).
- Precautions: scope webhooks narrowly with precise rules and namespaceSelector / objectSelector (exclude kube-system and the webhook's own dependencies); run the webhook service highly available and fast; set sensible timeoutSeconds; choose failurePolicy deliberately (Fail = safe-but-can-block, Ignore = available-but-can-bypass); declare sideEffects: None; and test downtime scenarios. The in-process alternative ValidatingAdmissionPolicy (CEL expressions evaluated inside the API server) removes the external webhook network dependency for many validation use cases, eliminating that critical-path risk.
Scheduler extenders and plugins
Theory
The default scheduler (Chapter 8) covers most needs, but sometimes you need custom scheduling logic — special hardware awareness, gang/batch scheduling, custom bin-packing, or external constraints. Kubernetes offers several ways to extend or replace scheduling:
- Scheduler plugins (Scheduling Framework): the modern, preferred approach. The scheduler is built as a framework with extension points (PreFilter, Filter, Score, Reserve, Permit, Bind, etc.); you write plugins in Go that hook into these points and compile them into a scheduler binary. This is efficient (in-process) and powerful.
- Scheduler extenders: an older mechanism where the default scheduler calls out to an external HTTP service (an "extender") during filtering/scoring. Simpler to integrate (any language, no rebuild) but slower (network call per scheduling decision) and more limited.
- Multiple schedulers: you can run your own scheduler alongside the default and have Pods opt in via spec.schedulerName, which is useful for entirely custom schedulers (e.g., batch schedulers like Volcano/YuniKorn).
The trend favors the Scheduling Framework plugins for performance and flexibility; extenders remain for quick, language-agnostic add-ons; and separate schedulers serve fundamentally different scheduling paradigms.
Example
# Run a Pod with a custom scheduler instead of the default:
spec:
  schedulerName: my-custom-scheduler   # a separate scheduler you run
  containers: [ ... ]
# KubeSchedulerConfiguration enabling a custom framework plugin:
apiVersion: kubescheduler.config.k8s.io/v1
kind: KubeSchedulerConfiguration
profiles:
- schedulerName: default-scheduler
  plugins:
    score:
      enabled: [ { name: MyCustomScorePlugin } ]   # framework plugin at Score point
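The division of labor between the Filter and Score extension points can be illustrated with a small stand-alone model. This is not the Scheduling Framework's actual plugin interface: the Node and Pod types, the scoring rule, and the Schedule function are simplified assumptions chosen to show the filter-then-rank flow.

```go
package main

import "fmt"

// Node and Pod are simplified stand-ins; the real framework passes richer types.
type Node struct {
	Name    string
	FreeCPU int // millicores available
	HasGPU  bool
}

type Pod struct {
	CPU      int // millicores requested
	NeedsGPU bool
}

// Filter reports whether the node can run the pod at all (a hard constraint).
func Filter(p Pod, n Node) bool {
	return n.FreeCPU >= p.CPU && (!p.NeedsGPU || n.HasGPU)
}

// Score ranks a feasible node; here, more CPU left after placement scores
// higher (a preference chosen purely for illustration).
func Score(p Pod, n Node) int {
	return n.FreeCPU - p.CPU
}

// Schedule runs Filter over every node, then picks the highest-scoring survivor.
// It returns "" when no node is feasible (the pod would stay Pending).
func Schedule(p Pod, nodes []Node) string {
	best, bestScore := "", -1
	for _, n := range nodes {
		if !Filter(p, n) {
			continue // infeasible nodes are never scored
		}
		if s := Score(p, n); s > bestScore {
			best, bestScore = n.Name, s
		}
	}
	return best
}

func main() {
	nodes := []Node{
		{Name: "a", FreeCPU: 500},
		{Name: "b", FreeCPU: 2000, HasGPU: true},
		{Name: "c", FreeCPU: 4000},
	}
	fmt.Println(Schedule(Pod{CPU: 1000, NeedsGPU: true}, nodes)) // b
	fmt.Println(Schedule(Pod{CPU: 1000}, nodes))                 // c
}
```

Filtering removes nodes that cannot work; scoring only expresses a preference among the rest. A custom plugin usually targets one of these two concerns.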
Exercises
- (Beginner) What is the modern, preferred way to add custom scheduling logic?
- (Beginner) How does a Pod opt into a non-default scheduler?
- (Intermediate) What is the main drawback of scheduler extenders versus framework plugins?
- (Interview) Compare the three approaches (framework plugins, extenders, separate schedulers) and when each fits. (Hint: in-process performance vs. language-agnostic HTTP vs. whole alternative paradigm.)
Answers
- Writing Scheduling Framework plugins that hook into the scheduler's extension points (PreFilter/Filter/Score/etc.), compiled into the scheduler.
- By setting spec.schedulerName to the name of the custom scheduler you run; only that scheduler will schedule the Pod.
- Extenders call an external HTTP service during scheduling, adding network latency per decision (slower) and offering fewer/looser integration points than in-process plugins, so they don't scale as well and are more limited, though they're simpler and language-agnostic.
- Framework plugins: in-process Go plugins at defined extension points, with the highest performance and tightest integration; best when you need efficient custom filter/score/bind logic and can build/run a scheduler binary. Extenders: external HTTP callouts from the default scheduler; easiest to add in any language without rebuilding the scheduler, good for quick or third-party integrations, but slower and more limited (per-decision network hop). Separate schedulers: run an entirely different scheduler (opt-in via schedulerName); appropriate when you need a fundamentally different scheduling paradigm (e.g., gang/batch scheduling for ML/HPC via Volcano/YuniKorn) rather than tweaking the default. Choose plugins for performance-sensitive customization, extenders for simple language-agnostic add-ons, and separate schedulers for distinct scheduling models.
Custom metrics API
Theory
The HPA (Chapter 8) can scale on custom and external metrics, but Kubernetes itself doesn't produce those — they're served through metrics APIs implemented as aggregated API servers (tying together two extension points). There are three:
- metrics.k8s.io (resource metrics): served by Metrics Server.
- custom.metrics.k8s.io (custom metrics tied to Kubernetes objects, e.g., requests-per-second on a Pod/Service).
- external.metrics.k8s.io (external metrics not tied to a cluster object, e.g., a cloud queue length).
To make the HPA scale on, say, RPS or queue depth, you deploy an adapter that implements the custom/external metrics API (as an aggregated API server registered via APIService) and translates from a real metrics backend. The most common is the Prometheus Adapter, which exposes PromQL query results through custom.metrics.k8s.io and external.metrics.k8s.io; KEDA and cloud-specific adapters play a similar role. So "scale on a custom metric" means: run an adapter that serves that metrics API from your monitoring data, then reference the metric in the HPA. This is a concrete, common application of API aggregation in the wild.
Example
# Prometheus Adapter rule: expose a PromQL result as a custom metric
rules:
- seriesQuery: 'http_requests_total{namespace!="",pod!=""}'
  resources: { overrides: { namespace: {resource: "namespace"}, pod: {resource: "pod"} } }
  name: { matches: "http_requests_total", as: "http_requests_per_second" }
  metricsQuery: 'sum(rate(<<.Series>>{<<.LabelMatchers>>}[2m])) by (<<.GroupBy>>)'
---
# HPA then scales on that custom metric (see Chapter 8 example)
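Once the adapter serves the metric, the HPA turns the observed value into a replica count. The sketch below follows the rule given in the Kubernetes HPA documentation (desired = ceil(current replicas × current value / target value), with no change inside a tolerance band). The function name, the sample numbers, and the 0.1 tolerance passed as a default are illustrative assumptions; the real controller also applies min/max replica bounds and stabilization, which are omitted here.

```python
import math


def desired_replicas(current_replicas, current_value, target_value, tolerance=0.1):
    """Approximate the HPA scaling rule for an averaged custom metric."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left alone.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    # Otherwise scale proportionally and round up.
    return math.ceil(current_replicas * ratio)
```

For example, 4 replicas averaging 150 requests per second against a target of 100 would be scaled to 6, while an average of 105 stays within tolerance and leaves the count at 4.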
Exercises
- (Beginner) Are custom/external metrics produced by Kubernetes itself?
- (Beginner) What component commonly serves custom metrics from Prometheus data?
- (Intermediate) How is the custom metrics API related to the aggregated API server extension point?
- (Interview) Trace what you must deploy to make the HPA scale on a Prometheus-based RPS metric, and why this combines two extension points. (Hint: adapter as aggregated API server exposing custom.metrics.k8s.io; HPA references it.)
Answers
- No — Kubernetes provides the metrics APIs but not the metric data; you must run adapters that supply custom/external metrics from a real backend.
- The Prometheus Adapter.
- The custom (and external) metrics APIs are themselves implemented as aggregated API servers: an adapter registers via an APIService so requests for custom.metrics.k8s.io/external.metrics.k8s.io are proxied to it, and it serves metric values computed from a backend. So consuming custom metrics relies directly on the API aggregation extension point.
- Deploy a metrics backend (Prometheus) scraping your app's metrics, then deploy the Prometheus Adapter configured to translate a PromQL query (e.g., rate of http_requests_total) into a custom metric (e.g., http_requests_per_second); the adapter registers as an aggregated API server serving custom.metrics.k8s.io. Then define an HPA referencing that custom metric/target. This combines two extension points: API aggregation (the adapter is an aggregated API server exposing the custom metrics API) and the HPA's pluggable metrics consumption (it queries that API). Together they let the HPA scale on an application-meaningful signal sourced from your monitoring system rather than just CPU/memory.
14.4 Kubernetes Client Libraries
Building controllers, operators, and tools requires programmatic API access. This subchapter covers the client libraries that make it possible.
client-go overview
Theory
client-go is the official Go client library for the Kubernetes API — the foundational SDK that virtually all Go-based Kubernetes tooling (including the components of Kubernetes itself, kubectl, controllers, and operators) is built on. It provides everything needed to talk to the API programmatically: authenticated clients, typed and dynamic access to resources, watching, caching, and leader election.
Key pieces of client-go:
- Clientsets: typed Go clients for built-in resources (e.g., clientset.AppsV1().Deployments(ns).Get(...)); compile-time-checked, ergonomic.
- Dynamic client / discovery: work with arbitrary resources (including CRDs) without compile-time types, using unstructured objects; essential for generic tooling.
- Informers and listers (next topic): efficient cached watching of resources.
- RESTConfig / kubeconfig loading: build a client from in-cluster config or a kubeconfig file.
- workqueue, leaderelection utilities for building controllers.
client-go is the bedrock of controller-runtime (which wraps it for operators). Understanding it — typed vs. dynamic clients, and the informer machinery — is foundational for writing any serious Kubernetes automation in Go.
Example
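import ("context"; "fmt"; metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"; "k8s.io/client-go/kubernetes"; "k8s.io/client-go/tools/clientcmd")
// Errors are ignored here for brevity; check every one in real code.
cfg, _ := clientcmd.BuildConfigFromFlags("", kubeconfigPath) // or rest.InClusterConfig()
cs, _ := kubernetes.NewForConfig(cfg) // typed clientset
// Typed access to a built-in resource:
ctx := context.Background()
deploy, _ := cs.AppsV1().Deployments("default").Get(ctx, "web", metav1.GetOptions{})
fmt.Println(*deploy.Spec.Replicas) // Replicas is a *int32; nil-check it in real code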
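The typed/dynamic distinction is easier to see once you notice that both ultimately address a resource by group, version, and resource name in the REST path. The following is a language-neutral sketch in Python (not client-go itself) of how such a path is formed; resource_path is a hypothetical helper, and the example.com group in the usage note is an assumed CRD group.

```python
def resource_path(group, version, resource, namespace=None, name=None):
    """Build the API server path for a resource identified by group/version/resource."""
    # The core group (empty string) lives under /api; every other group under /apis.
    prefix = f"/api/{version}" if group == "" else f"/apis/{group}/{version}"
    parts = [prefix]
    if namespace:
        parts.append(f"namespaces/{namespace}")
    parts.append(resource)
    if name:
        parts.append(name)
    return "/".join(parts)
```

A typed clientset fixes these three values at compile time (apps, v1, deployments), whereas a dynamic client receives them at run time, which is why it can reach a custom resource such as resource_path("example.com", "v1", "backups", "prod") without importing Go types for it.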
Exercises
- (Beginner) What is client-go?
- (Beginner) What is the difference between a typed clientset and the dynamic client?
- (Intermediate) When would you need the dynamic client instead of a typed clientset?
- (Interview) Why is client-go considered foundational for the Kubernetes Go ecosystem, and what core capabilities does it provide for building controllers? (Hint: used by K8s/kubectl/operators; clients, informers, workqueues, leader election.)
Answers
- The official Go client library for the Kubernetes API — the SDK used to programmatically authenticate to and interact with the cluster, underpinning most Go Kubernetes tooling.
- A typed clientset provides compile-time-checked, strongly-typed access to known (built-in) resources (e.g., Deployments); the dynamic client works with arbitrary resources (including CRDs) generically via unstructured objects without compile-time types.
- When you need to handle resources whose types aren't known at compile time, e.g., arbitrary CRDs, generic tooling that operates across many resource kinds, or code that must work with custom resources it doesn't import Go types for. The dynamic client handles them via unstructured data and discovery.
- Nearly all Go-based Kubernetes software — Kubernetes components, kubectl, controllers, and operators (via controller-runtime, which wraps client-go) — is built on it, making it the common foundation of the ecosystem. For controllers it provides authenticated typed/dynamic clients, informers/listers for efficient cached watching, workqueues for rate-limited reconcile processing, leader election for HA controllers, and config loading (in-cluster or kubeconfig). These are exactly the building blocks needed to implement the watch-cache-reconcile pattern reliably.
Informers and work queues
Theory
Naively, a controller could poll the API server or open raw watches — but that's inefficient and fragile at scale. Informers are client-go's solution: an informer maintains a local in-memory cache of a resource type, populated by an initial LIST and kept current by a WATCH (the LIST-WATCH pattern from Chapter 2). Controllers read from this cache (via a lister) instead of hitting the API server repeatedly, and register event handlers (OnAdd/OnUpdate/OnDelete) to react to changes. SharedInformers let many controllers share one cache/watch per resource, avoiding duplication.
Informers pair with work queues: instead of doing work directly in the event handler, the handler simply enqueues the object's key onto a rate-limited work queue; worker goroutines dequeue keys and run reconcile. This decoupling is crucial because it provides:
- De-duplication: multiple rapid events for the same object collapse into one queue entry.
- Rate limiting and retries with backoff: failed items are requeued with exponential backoff.
- Level-triggered processing: workers re-fetch current state from the cache by key (not stale event data), reconciling against reality.
This informer + workqueue architecture is the standard controller pattern — efficient, resilient, and the basis of controller-runtime. It's why controllers scale and self-heal.
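"Retries with backoff" can be made concrete with a small calculation. The sketch below models a per-item exponential delay that doubles after every consecutive failure and is capped at a maximum. The function name is hypothetical, and the defaults (5 ms base, 1000 s cap) are assumptions chosen to resemble values commonly cited for client-go's default controller rate limiter; verify them against the version you use.

```python
def backoff_delay(previous_failures, base=0.005, cap=1000.0):
    """Delay in seconds before retrying an item that has already failed N times."""
    if previous_failures < 0:
        raise ValueError("previous_failures must be non-negative")
    delay = base
    for _ in range(previous_failures):
        delay *= 2              # double after each consecutive failure
        if delay >= cap:
            return cap          # never wait longer than the cap
    return min(delay, cap)
```

A successful reconcile normally resets the failure count for that key, so a healthy object goes back to the base delay the next time it fails.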
Example
SharedInformer (cache) <--LIST/WATCH-- API server
| OnAdd/OnUpdate/OnDelete: enqueue object KEY
v
WorkQueue (dedup + rate-limit + retry-with-backoff)
|
v worker goroutine
Reconcile(key): lister.Get(key) from cache -> compare desired/actual -> act
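informer.AddEventHandler(cache.ResourceEventHandlerFuncs{
    // keyOf is a placeholder helper that returns the "namespace/name" key (for example, by wrapping cache.MetaNamespaceKeyFunc)
    AddFunc:    func(o interface{}) { queue.Add(keyOf(o)) }, // enqueue key, don't do work here
    UpdateFunc: func(_, n interface{}) { queue.Add(keyOf(n)) },
})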
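The de-duplication and level-triggered behavior described above can be modeled in a few lines. This is a toy, single-threaded sketch in Python rather than client-go's workqueue: DedupQueue and drain are hypothetical names, the cache is a plain dict standing in for an informer's store, and rate limiting is left out.

```python
from collections import deque


class DedupQueue:
    """Toy work queue: a key that is already waiting is not queued twice."""

    def __init__(self):
        self._queue = deque()
        self._dirty = set()       # keys that still need processing
        self._processing = set()  # keys a worker currently holds

    def add(self, key):
        if key in self._dirty:
            return                # already pending: collapse the duplicate
        self._dirty.add(key)
        if key not in self._processing:
            self._queue.append(key)

    def get(self):
        key = self._queue.popleft()
        self._processing.add(key)
        self._dirty.discard(key)
        return key

    def done(self, key):
        self._processing.discard(key)
        if key in self._dirty:    # it changed again while being processed
            self._queue.append(key)

    def __len__(self):
        return len(self._queue)


def drain(queue, cache, reconcile):
    """Run reconcile for every queued key, reading current state from the cache."""
    while len(queue):
        key = queue.get()
        try:
            reconcile(key, cache.get(key))  # level-triggered: look up state by key
        finally:
            queue.done(key)
```

Three rapid updates to the same object leave a single queue entry, and the worker sees only the latest cached state, which is the point of passing keys rather than event payloads.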
Exercises
- (Beginner) What does an informer maintain, and how is it kept current?
- (Beginner) Why do controllers read from a lister/cache instead of the API server directly?
- (Intermediate) Why do event handlers enqueue a key onto a work queue rather than doing the work inline?
- (Interview) How do informers + work queues together produce an efficient, resilient, level-triggered controller? (Hint: shared cache reduces API load; queue dedups/retries; workers reconcile current state by key.)
Answers
- A local in-memory cache of a resource type, populated by an initial LIST and kept up to date by a WATCH (LIST-WATCH), with event handlers firing on changes.
- To avoid hammering the API server: reads served from the local cache are fast and cheap and reduce load on the API server/etcd, while the cache stays current via the watch. (SharedInformers let multiple controllers share one cache/watch.)
- To decouple event reception from work execution, gaining de-duplication (collapse rapid repeated events for the same object into one key), rate limiting, and retries with exponential backoff on failure. It also keeps processing level-triggered: the worker later fetches the object's current state from the cache by key rather than acting on possibly-stale event payloads.
- Informers provide a shared, always-current local cache (efficient: one LIST-WATCH feeds many readers, minimizing API load). Event handlers enqueue only the object's key into a work queue that de-duplicates and rate-limits, and retries failures with backoff (resilient). Worker goroutines dequeue keys and reconcile by re-reading current state from the cache and acting on the desired-vs-actual difference (level-triggered) — so missed/duplicate events don't matter and transient failures are retried. The combination yields controllers that are low-overhead, robust to bursts and failures, and continuously convergent — the standard, scalable controller architecture.
Python, Java, and other language clients
Theory
While Go (client-go) is the native and dominant language for Kubernetes development, Kubernetes provides official client libraries for many languages — Python, Java, JavaScript/TypeScript, C#/.NET, and more — plus community clients. These let you automate Kubernetes, build tools, or even write controllers/operators in your team's preferred language.
Key points:
- The Python client (kubernetes) is popular for scripting, automation, data/ML workflows (e.g., Kubeflow tooling), and quick operational tools.
- The Java and other clients suit teams with existing stacks building Kubernetes integrations.
- Many official clients are partly auto-generated from the Kubernetes OpenAPI spec, so they stay consistent across versions and cover the full API.
- Trade-off: non-Go clients are great for scripting and applications, but the richest controller/operator tooling (informers, workqueues, controller-runtime, Kubebuilder) lives in the Go ecosystem. You can build controllers in Python/Java (some clients offer watch/informer-like helpers), but Go remains the path of least resistance for serious controllers.
So: pick the client that fits your language/use case for automation and apps; gravitate to Go for production controllers/operators where the ecosystem is deepest.
Example
# Python client: list Pods in a namespace
from kubernetes import client, config
config.load_kube_config() # or load_incluster_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod("default").items:
    print(pod.metadata.name, pod.status.phase)
Exercises
- (Beginner) Name three languages with official Kubernetes client libraries.
- (Beginner) Which language is most common for serious controller/operator development, and why?
- (Intermediate) Why are many official clients auto-generated, and what benefit does that bring?
- (Interview) When would you choose the Python client over Go, and what limitations should you keep in mind for controllers? (Hint: scripting/automation/ML fit; controller tooling richest in Go.)
Answers
- Any three: Python, Java, JavaScript/TypeScript, C#/.NET (also Go natively).
- Go (client-go), because the deepest, most mature controller/operator tooling — informers, work queues, controller-runtime, Kubebuilder, Operator SDK — is in the Go ecosystem, and Kubernetes itself is written in Go.
- They're generated from the Kubernetes OpenAPI specification, so they comprehensively and consistently cover the full API and stay in sync across Kubernetes versions with less manual maintenance — giving reliable, complete coverage across languages.
- Choose Python (or another non-Go client) for scripting, operational automation, glue code, and application/ML workflows (e.g., Kubeflow) where it fits your team's stack and the task is interacting with the API rather than building a heavy controller. Keep in mind that controller-specific machinery (informers/work queues/controller-runtime, scaffolding) is far richer and more battle-tested in Go; while you can write controllers in Python/Java (some clients offer watch/informer helpers), you'll have less tooling and may reimplement patterns that Go gives you for free — so for production-grade operators, Go is usually preferable.
controller-runtime library
Theory
controller-runtime is the higher-level Go library (a Kubernetes SIG project) that sits on top of client-go and provides the batteries-included framework for building controllers and operators. It's what Kubebuilder and Operator SDK scaffold against. Where client-go gives you the raw building blocks (clients, informers, workqueues), controller-runtime assembles them into an ergonomic, opinionated framework so you write far less plumbing.
What it provides:
- Manager: wires up shared caches/informers, clients, leader election, metrics, and health checks, and runs your controllers.
- Controller/Reconciler abstraction: you implement a simple Reconcile(ctx, req) method; the framework handles watches, the work queue, requeuing, and calling you with object keys.
- Client with caching: a unified client that reads from the shared cache and writes to the API, plus Status().Update() for status subresources.
- Builder API: declaratively set up which resources a controller watches and owns (so changes to owned resources, like Pods created by your CR, trigger reconciliation).
The result: you focus almost entirely on reconcile logic and API types, while controller-runtime handles the informer/workqueue/leader-election/caching machinery correctly. It's the de facto standard for operator development in Go.
Example
// Set up a controller with controller-runtime's builder:
ctrl.NewControllerManagedBy(mgr).
    For(&examplev1.Backup{}). // watch Backup CRs (primary)
    Owns(&batchv1.Job{}).     // also reconcile when owned Jobs change
    Complete(&BackupReconciler{Client: mgr.GetClient()})
// You implement just the reconcile logic:
func (r *BackupReconciler) Reconcile(ctx context.Context, req ctrl.Request) (ctrl.Result, error) {
    // get the Backup, observe Jobs, act on the difference, update status
    return ctrl.Result{}, nil
}
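What Owns does can be pictured as a mapping from a child object's event to its parent's reconcile request, driven by the child's ownerReferences. The sketch below is a simplified Python model of that idea, not controller-runtime code: owner_request is a hypothetical helper, objects are plain dicts, the example.com group in the usage note is an assumption, and only the controlling owner is considered.

```python
def owner_request(child, owner_kind, owner_group):
    """Return the (namespace, name) key of the controlling owner, or None."""
    meta = child.get("metadata", {})
    for ref in meta.get("ownerReferences", []):
        # "example.com/v1" -> group "example.com"; core "v1" -> group ""
        group = ref.get("apiVersion", "").rpartition("/")[0]
        if ref.get("controller") and ref.get("kind") == owner_kind and group == owner_group:
            # Owner references are namespace-local, so the owner shares the child's namespace.
            return (meta.get("namespace", ""), ref["name"])
    return None
```

When a Job owned by a Backup changes, owner_request(job, "Backup", "example.com") yields the Backup's key, and that key is what would be enqueued for Reconcile; a Job with no matching owner yields None and is ignored.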
Exercises
- (Beginner) What does controller-runtime sit on top of, and what do Kubebuilder/Operator SDK use?
- (Beginner) What does the Manager component do?
- (Intermediate) What does .Owns(...) accomplish in the controller builder?
- (Interview) How does controller-runtime let developers focus on reconcile logic, and what plumbing does it handle for them? (Hint: manager wires caches/informers/workqueue/leader election; you implement Reconcile.)
Answers
- It sits on top of client-go; Kubebuilder and Operator SDK scaffold projects that build against controller-runtime.
- The Manager wires together and runs the shared caches/informers, clients, leader election, metrics, and health endpoints, and starts the registered controllers — providing the runtime that hosts your reconcilers.
- .Owns(<kind>) tells the controller it owns resources of that kind (created/managed by its primary resource via ownerReferences), so changes to those owned objects (e.g., a Job the CR created) trigger reconciliation of the owning custom resource, keeping the parent's reconcile in sync with its children.
- controller-runtime provides a Manager that sets up shared informers/caches (efficient watching), the work queue and requeue/backoff handling, a cached read + write client, leader election for HA, and metrics/health, plus a Builder to declare what to watch/own. You only implement the Reconcile(ctx, req) method (and define your API types). All the watch-cache-enqueue-retry-leaderelection machinery that you'd otherwise assemble from client-go is handled by the framework, so development focuses on the business/reconcile logic and API design rather than controller plumbing, which is exactly why it's the standard for Go operators.
15. Multi-Cluster and Federation
A single cluster eventually hits limits — of scale, blast radius, geography, or compliance. Running multiple clusters solves these but introduces new problems: how do services find each other across clusters, how do you deploy consistently everywhere, and how do you manage the fleet's lifecycle? This chapter covers the patterns and tools for multi-cluster architectures, federation, and multi-cluster GitOps.
15.1 Multi-Cluster Patterns
This subchapter covers why and how organizations run multiple clusters, and the architectural patterns involved.
Reasons for multi-cluster architectures
Theory
A single large cluster is simpler to operate, so multi-cluster should be a deliberate choice driven by real needs. The common drivers:
- Blast radius / isolation: a cluster is a failure and security domain. Multiple clusters contain the impact of an outage, a bad upgrade, a misconfiguration, or a compromise to one cluster instead of everything.
- Geography / latency: placing clusters near users in different regions reduces latency and improves availability; clusters per region are natural.
- Compliance / data residency: regulations may require data to stay in a specific country/region, mandating separate clusters per jurisdiction.
- Scale limits: very large workloads can approach practical limits (etcd size, node count, API server load); splitting across clusters relieves this.
- Environment / tenant separation: separate clusters for prod vs. non-prod, or per business unit/tenant, for stronger isolation than namespaces provide.
- Cloud/hybrid: spanning multiple clouds or on-prem+cloud (avoiding lock-in, leveraging best-of-breed) inherently means multiple clusters.
The trade-off is significant added operational complexity (managing many clusters, cross-cluster networking, consistent config). So adopt multi-cluster when the isolation, locality, compliance, or scale benefits clearly justify that complexity — not by default.
Example
Single cluster: simpler, but one failure/security/compliance domain for everything
Multi-cluster drivers:
isolation -> contain outages/compromise per cluster
geography -> clusters near users (low latency)
compliance -> data stays in-region (EU cluster, US cluster)
scale -> avoid single-cluster limits (etcd/nodes/API)
hybrid/multicloud -> span clouds/on-prem
Exercises
- (Beginner) Name three reasons organizations adopt multi-cluster architectures.
- (Beginner) What is the main downside of running many clusters?
- (Intermediate) How does multi-cluster limit "blast radius" compared to one large cluster?
- (Interview) Why should multi-cluster be a deliberate decision rather than a default, and what factors tip the balance? (Hint: complexity cost vs. isolation/latency/compliance/scale benefits.)
Answers
- Any three: blast-radius/failure-and-security isolation, geographic proximity/latency, compliance/data residency, scale limits of a single cluster, environment/tenant separation, and multi-cloud/hybrid spanning.
- Significantly increased operational complexity — managing many clusters, cross-cluster networking/discovery, and keeping configuration and policy consistent everywhere.
- Each cluster is a separate failure and security domain. An outage, faulty upgrade, misconfiguration, or breach affects only its cluster rather than the entire estate, so problems are contained and other clusters keep running — unlike one big cluster where a single incident can take everything down.
- A single cluster is simpler to operate, and each additional cluster multiplies operational overhead (networking, config drift, tooling, upgrades). So you should add clusters only when specific benefits clearly outweigh that cost. Factors that tip the balance: needing strong isolation/blast-radius containment, users spread across regions (latency/availability), legal data-residency/compliance requirements, approaching single-cluster scale limits, or a genuine multi-cloud/hybrid strategy. Absent such drivers, one well-run cluster (with namespaces/RBAC for tenancy) is usually preferable.
Active-active vs active-passive
Theory
For high availability across clusters, two fundamental topologies exist:
- Active-active: multiple clusters all serve live traffic simultaneously (e.g., a cluster per region, each handling its region's users, with global load balancing/DNS distributing traffic). Benefits: full utilization of all clusters, low latency (users hit the nearest), and seamless failover (if one cluster dies, others already serve traffic). Challenges: data must be replicated/synchronized across clusters (hard for stateful/consistency-sensitive systems), and you need global traffic management.
- Active-passive (active-standby): one cluster serves traffic; a second standby cluster is kept ready (data replicated to it) and only takes over on failover. Simpler consistency (one active writer) and a clear DR story, but the passive cluster's capacity is idle most of the time (cost), and failover has a recovery time (detecting failure + promoting standby).
The choice hinges on data consistency needs, cost tolerance, and RTO/RPO (recovery time/point objectives). Stateless or eventually-consistent workloads favor active-active for utilization and instant failover; strongly-consistent stateful systems often use active-passive to avoid multi-writer conflicts. Many real architectures mix both (active-active stateless front ends, active-passive databases).
Example
Active-Active:
            Global LB/DNS
             /         \
      [Cluster A]   [Cluster B]   both serve
      data replicated both ways
      full utilization, instant failover
Active-Passive:
      Traffic -> [Primary] (active)
                     | replicate data
                 [Standby] (idle, ready)
      failover: promote standby (RTO delay)
      simpler consistency, idle capacity
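The RTO/RPO trade-off for an active-passive pair reduces to simple arithmetic: recovery time is roughly failure detection plus standby promotion, and the recovery point is bounded by replication lag. The sketch below is a back-of-the-envelope model; the function names and every number in the usage note are illustrative assumptions, not measurements.

```python
def active_passive_objectives(detect_s, promote_s, replication_lag_s):
    """Rough worst-case RTO and RPO (in seconds) for an active-passive pair."""
    return {
        "rto_s": detect_s + promote_s,  # time to notice the failure, then promote the standby
        "rpo_s": replication_lag_s,     # writes not yet replicated may be lost
    }


def meets_targets(objectives, rto_target_s, rpo_target_s):
    """True when both estimated objectives fit inside the stated targets."""
    return objectives["rto_s"] <= rto_target_s and objectives["rpo_s"] <= rpo_target_s
```

With 30 s to detect, 90 s to promote, and 5 s of lag, the estimate is an RTO of 120 s and an RPO of 5 s, which satisfies a 300 s / 10 s target but not a 60 s RTO target; a tier with the tighter target would push toward active-active.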
Exercises
- (Beginner) In active-active, how many clusters serve live traffic?
- (Beginner) What is the main downside of active-passive regarding the standby cluster?
- (Intermediate) Why is data consistency harder in active-active than active-passive?
- (Interview) How do RTO/RPO and data-consistency requirements drive the choice between active-active and active-passive? (Hint: multi-writer sync vs. single-writer + failover time.)
Answers
- All of them — multiple clusters serve live traffic simultaneously.
- The standby cluster sits mostly idle (reserved capacity that isn't serving traffic), which wastes resources/cost; there's also a failover delay before it takes over.
- Active-active has multiple clusters accepting writes concurrently, so data must be replicated and kept consistent across them — risking multi-writer conflicts and requiring conflict resolution or complex distributed consistency. Active-passive has a single active writer with data replicated one-way to the standby, avoiding concurrent-write conflicts and making consistency much simpler.
- Active-active gives near-zero RTO (other clusters already serve traffic, so failover is seamless) and full utilization, but demands multi-cluster data synchronization — feasible for stateless or eventually-consistent data, hard for strongly-consistent stateful systems. Active-passive gives simpler, strong consistency (one writer) and a clean DR model, but incurs a nonzero RTO (detect failure + promote standby) and possible RPO gap (replication lag), plus idle standby cost. So workloads needing instant failover and horizontal locality with tolerant data models choose active-active; strongly-consistent stateful systems that can accept some failover time choose active-passive — and many systems combine both per tier.
Cluster segmentation strategies
Theory
If you're going to run multiple clusters, how do you divide workloads among them? Common segmentation strategies (often combined):
- By environment: separate clusters for prod, staging, dev — strong isolation of production from experimentation, different security/RBAC and stability requirements.
- By region/zone: clusters per geographic region for latency, availability, and data residency.
- By tenant/business unit: a cluster per team, customer, or business unit for isolation, cost attribution, and independent lifecycle (stronger than namespace tenancy).
- By workload type/criticality: separate clusters for, say, batch/ML vs. latency-sensitive services, or high-security vs. general workloads (so noisy or risky workloads don't affect critical ones).
- By cloud/infra: clusters per cloud provider or on-prem for hybrid/multi-cloud strategies.
The overarching trade-off is again isolation vs. overhead: finer segmentation gives stronger isolation and clearer boundaries but more clusters to manage and more cross-cluster complexity. The goal is to segment along the axes where isolation matters most for your risk, compliance, and organizational structure, while keeping the number of clusters manageable (which is where fleet-management and multi-cluster GitOps tooling, covered later, becomes essential).
Example
Segmentation axes (often combined):
Environment: [prod] [staging] [dev]
Region: [us-east] [eu-west] [ap-south]
Tenant/BU: [team-a] [team-b] [customer-x]
Workload: [latency-sensitive] [batch/ML] [high-security]
Infra: [aws] [gcp] [on-prem]
more segmentation = stronger isolation but more clusters to manage
Exercises
- (Beginner) Name three axes along which you can segment clusters.
- (Beginner) What is the recurring trade-off in choosing how finely to segment?
- (Intermediate) Why might you put batch/ML workloads in a separate cluster from latency-sensitive services?
- (Interview) How should an organization decide its cluster segmentation strategy? (Hint: segment where isolation matters most for risk/compliance/org, balance against management overhead.)
Answers
- Any three: by environment (prod/staging/dev), by region/zone, by tenant/business unit, by workload type/criticality, and by cloud/infrastructure.
- Isolation vs. operational overhead — finer segmentation gives stronger isolation and clearer boundaries but creates more clusters to manage and more cross-cluster complexity.
- To isolate resource contention and risk: batch/ML jobs are bursty and resource-hungry (and often lower priority), so isolating them prevents them from starving or destabilizing latency-sensitive services; the two also have different scaling, node types (e.g., GPUs), and reliability needs, which are easier to manage in separate clusters.
- Segment along the axes where isolation delivers the most value for the organization's specific risk (blast radius), compliance (data residency, security tiers), latency/geography, and organizational structure (teams/tenants/cost attribution) — prioritizing separations that matter (e.g., prod vs. non-prod, per-region for residency). Then balance that against management overhead: don't create more clusters than you can operate consistently, and invest in fleet-management/GitOps tooling so the chosen segmentation stays maintainable. The right strategy maximizes needed isolation while keeping the cluster count and cross-cluster complexity manageable.
Cross-cluster service discovery
Theory
Once services live in different clusters, they need to find and reach each other — but each cluster has its own DNS (CoreDNS) and its own Service ClusterIPs, isolated from the others. Cross-cluster service discovery bridges this so a service in cluster A can resolve and call a service in cluster B.
Approaches:
- Multi-cluster Services (MCS API): a Kubernetes SIG standard (ServiceExport/ServiceImport) where you "export" a Service in one cluster and it becomes discoverable/reachable (via a clusterset.local domain) in others that import it; implemented by tools like Submariner or cloud/mesh integrations.
- Service mesh multi-cluster: meshes (Istio, Linkerd, Cilium Cluster Mesh) can span clusters, providing unified service discovery, mTLS, and traffic routing across them; often the richest option.
- Global DNS / load balancers: expose services externally and use global DNS or a global LB to route across clusters (coarser, typically for edge/ingress traffic).
- Flat networking: tools that connect cluster Pod networks (Submariner, Cilium Cluster Mesh) so Pods/Services are directly routable across clusters.
The key challenge is that cross-cluster calls cross network and trust boundaries, so discovery is usually paired with secure connectivity (mTLS/encrypted tunnels). Choosing an approach depends on how tightly you need clusters integrated (occasional external calls vs. a seamless multi-cluster mesh).
Example
Multi-cluster Services (MCS API):
Cluster A: ServiceExport(payments) ---> payments.ns.svc.clusterset.local
Cluster B: ServiceImport(payments) ---> resolves + routes to A's payments
Service mesh (e.g., Istio/Cilium Cluster Mesh):
unified discovery + mTLS + routing across clusters (Pods reachable cross-cluster)
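The naming difference between a cluster-local Service and an exported one can be shown directly. The helper below is a hypothetical sketch that builds both DNS names; the clusterset.local suffix comes from the MCS convention described above, while cluster.local is assumed as the per-cluster domain (the usual default, but configurable) and the shop namespace is made up for the example.

```python
def service_dns_names(service, namespace, cluster_domain="cluster.local"):
    """Return the in-cluster and clusterset-wide DNS names for a Service."""
    return {
        # Resolves only inside the cluster that hosts the Service.
        "local": f"{service}.{namespace}.svc.{cluster_domain}",
        # Resolves in every cluster of the set once the Service is exported.
        "clusterset": f"{service}.{namespace}.svc.clusterset.local",
    }
```

A client in cluster B that wants cluster A's exported payments Service would call the clusterset name, for example service_dns_names("payments", "shop")["clusterset"], and leave the local name for same-cluster traffic.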
Exercises
- (Beginner) Why can't a service in cluster A resolve a service in cluster B by default?
- (Beginner) What Kubernetes standard defines exporting/importing Services across clusters?
- (Intermediate) Name two approaches to cross-cluster service discovery and how they differ.
- (Interview) Why is cross-cluster discovery usually paired with secure connectivity, and what makes a service mesh an attractive option for it? (Hint: traffic crosses trust boundaries; mesh adds mTLS + unified routing/discovery.)
Answers
- Each cluster has its own isolated DNS (CoreDNS) and its own Service ClusterIPs/network; a cluster's DNS only knows its own Services, and Pod/Service networks are separate — so B's names/IPs aren't resolvable or routable from A without additional integration.
- The Multi-Cluster Services (MCS) API, using ServiceExport and ServiceImport (with the clusterset.local domain).
- Any two: the MCS API (export/import Services via a SIG standard, implemented by tools like Submariner) provides Kubernetes-native cross-cluster Services; a service mesh spanning clusters (Istio/Linkerd/Cilium Cluster Mesh) adds unified discovery plus mTLS and traffic management; global DNS/load balancers route to externally-exposed services (coarser, edge-level); flat cross-cluster networking (Submariner, Cilium Cluster Mesh) makes Pods/Services directly routable. They differ in integration depth (native Services vs. mesh features vs. external routing) and whether they also provide security/routing.
- Cross-cluster traffic leaves one cluster's network and trust domain and traverses (often untrusted) networks to another, so it needs encryption and authenticated identity to be safe — hence pairing discovery with mTLS/encrypted tunnels. A service mesh is attractive because it provides both in one system: unified cross-cluster service discovery and automatic mTLS, plus consistent traffic management/observability across clusters — so services get secure, identity-verified, routable connectivity spanning clusters without each app implementing it, making the multi-cluster estate behave more like one integrated mesh.
15.2 Federation and Multi-Cluster Tools
This subchapter surveys the tools that manage workloads and connectivity across clusters.
Kubefed (Kubernetes Federation v2)
Theory
Kubefed (KubeFed, Kubernetes Cluster Federation v2) was the official attempt to manage multiple clusters as one: from a central host cluster, you define federated versions of resources (FederatedDeployment, FederatedService, etc.) with a template, placement (which clusters), and overrides (per-cluster differences), and Kubefed propagates them to the member clusters. It aimed to give a single control point for deploying and configuring workloads across a fleet.
However, Kubefed struggled with adoption and is now effectively deprecated/archived. Its approach — special federated wrapper types and a central push controller — proved complex, added a new API surface, and didn't align well with how teams actually operate (many preferred GitOps-based propagation). It's important primarily as historical/conceptual context: it illustrates the federation idea (template + placement + overrides propagated to clusters) that later tools (and multi-cluster GitOps like Argo CD ApplicationSet, and projects like Karmada) refined. The lesson: naive "one API to rule all clusters" federation is hard; the ecosystem largely moved to GitOps and purpose-built multi-cluster schedulers instead.
Example
# Kubefed-style federated resource (conceptual/historical):
apiVersion: types.kubefed.io/v1beta1
kind: FederatedDeployment
metadata: { name: web }
spec:
  template: { ... }                                            # base Deployment
  placement:
    clusters: [ { name: cluster-a }, { name: cluster-b } ]     # which clusters
  overrides:
    - clusterName: cluster-b
      clusterOverrides: [ { path: "/spec/replicas", value: 5 } ]   # per-cluster diff
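The template + placement + overrides model can also be traced in a few lines of Python. This is only an illustrative sketch of the propagation idea, not Kubefed's controller code: the helper names (set_path, render) and the sample replica counts are assumptions made for the example.

```python
import copy

def set_path(obj, path, value):
    """Set a value at a slash-separated path such as "/spec/replicas"."""
    keys = [k for k in path.split("/") if k]
    for key in keys[:-1]:
        obj = obj.setdefault(key, {})
    obj[keys[-1]] = value

def render(template, placement, overrides):
    """Return {cluster_name: manifest} for every cluster named in placement."""
    rendered = {}
    for cluster in placement:
        manifest = copy.deepcopy(template)          # never mutate the shared template
        for patch in overrides.get(cluster, []):    # apply per-cluster differences
            set_path(manifest, patch["path"], patch["value"])
        rendered[cluster] = manifest
    return rendered

# Hypothetical inputs mirroring the FederatedDeployment above.
template = {"kind": "Deployment", "metadata": {"name": "web"}, "spec": {"replicas": 2}}
placement = ["cluster-a", "cluster-b"]
overrides = {"cluster-b": [{"path": "/spec/replicas", "value": 5}]}

manifests = render(template, placement, overrides)
for name, manifest in manifests.items():
    print(name, manifest["spec"]["replicas"])
```

Each member cluster receives its own copy of the base resource, and only cluster-b's copy carries the overridden replica count.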
Exercises
- (Beginner) What was Kubefed's goal?
- (Beginner) What three parts define a federated resource in Kubefed (template + …)?
- (Intermediate) Why did Kubefed struggle with adoption?
- (Interview) What conceptual model did Kubefed establish, and what approaches largely replaced it? (Hint: template/placement/overrides; GitOps propagation and tools like Karmada.)
Answers
- To manage multiple clusters centrally as a federation — defining resources once on a host cluster and propagating them (with per-cluster placement and overrides) to member clusters.
- Template (the base resource), placement (which clusters to deploy to), and overrides (per-cluster customizations).
- Its model introduced complex, special "federated" wrapper types and a central push-controller API surface that added significant complexity and didn't match how teams increasingly preferred to operate (GitOps). It saw limited adoption and is now effectively deprecated/archived.
- Kubefed established the conceptual federation model of template + placement + overrides propagated to member clusters. That idea persists, but the mechanism was largely replaced by GitOps-based multi-cluster propagation (e.g., Argo CD ApplicationSet, Flux across clusters) and purpose-built multi-cluster orchestration/scheduling projects (e.g., Karmada, Admiralty), which achieve fleet-wide deployment with less bespoke API surface and better alignment with declarative, Git-driven workflows.
Liqo for transparent multi-cluster
Theory
Liqo takes a strikingly different, "transparent" approach to multi-cluster: instead of federating resource definitions, it makes remote clusters appear as virtual nodes in your local cluster. Using the Virtual Kubelet concept, Liqo represents a peered remote cluster as a big node in the local cluster; when you schedule a Pod "onto" that virtual node, Liqo actually runs it in the remote cluster — transparently, with cross-cluster networking and service reachability handled automatically ("resource borrowing"/offloading).
The appeal is seamlessness: from the local cluster's perspective, you just deploy Pods as usual and they may transparently run elsewhere, effectively pooling capacity across clusters as if they were one. Clusters establish peering relationships (which can be dynamic/on-demand), and Liqo extends networking and service discovery across the peered clusters so offloaded workloads still communicate. It's well-suited to bursting/overflow (borrow capacity from another cluster when yours is full) and treating multiple clusters as a unified resource pool — a fundamentally different model from explicit federation or GitOps propagation, prioritizing transparency of the multi-cluster boundary.
Example
Liqo peering makes a remote cluster look like a local virtual node:
Local cluster
  [real node][real node][ liqo virtual-node (== remote cluster) ]
                                    |
                Pod scheduled here transparently runs in the REMOTE cluster
                (Liqo handles cross-cluster networking + service reachability)
Exercises
- (Beginner) How does Liqo represent a remote cluster locally?
- (Beginner) What underlying concept does Liqo build on to do this?
- (Intermediate) What multi-cluster use case is Liqo especially well-suited to?
- (Interview) How does Liqo's "transparent" model differ fundamentally from federation (Kubefed) or GitOps propagation? (Hint: virtual node/offloading vs. explicit per-cluster resource definitions.)
Answers
- As a virtual node in the local cluster; scheduling a Pod onto that node transparently runs it in the peered remote cluster.
- The Virtual Kubelet concept (a virtual node backed by another system — here, a remote cluster).
- Capacity bursting/overflow and pooling resources across clusters — e.g., borrowing capacity from a peered cluster when the local one is full, treating multiple clusters as a unified resource pool.
- Federation (Kubefed) and GitOps propagation require you to explicitly define/propagate resources per cluster (templates, placement, overrides, or per-cluster manifests) — the multi-cluster boundary is visible and managed. Liqo instead hides the boundary: remote clusters appear as local virtual nodes, so you deploy Pods normally and they may transparently execute in another cluster, with networking/service discovery bridged automatically. It's resource-pooling/offloading (capacity federation) rather than declarative resource federation — you don't manage where things go per cluster; you let the local cluster transparently extend into peers.
Admiralty for multi-cluster scheduling
Theory
Admiralty focuses specifically on multi-cluster scheduling — intelligently placing workloads across a set of clusters. Like Liqo, it uses virtual-node (Virtual Kubelet) techniques so that a management/source cluster can offload Pods to target clusters, but its emphasis is on the scheduling decision: given multiple candidate clusters, decide which cluster each Pod should run in based on capacity, constraints, cost, or locality, then delegate execution there.
The model: you submit Pods (or workloads) to a source cluster; Admiralty's cross-cluster scheduler evaluates member clusters and creates "delegate" Pods in the chosen target cluster(s), while the source cluster tracks them via proxy/virtual-node Pods. This enables cluster-spanning scheduling — spreading a Deployment's replicas across clusters for HA, bin-packing across a fleet, or routing workloads to clusters with spare/appropriate capacity — using familiar Kubernetes scheduling semantics (affinities, topology spread) extended to the multi-cluster level. It's a good fit when you want a scheduler-driven way to distribute workloads across clusters rather than manually assigning them.
Example
Admiralty: submit to source cluster -> cross-cluster scheduler picks target(s)
Source cluster: Pod (pending) --scheduler evaluates member clusters-->
creates delegate Pods in chosen target(s):
[Cluster A: 2 replicas] [Cluster B: 1 replica] (spread for HA/capacity)
source tracks them via virtual-node proxy Pods
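To make the "scheduler picks target cluster(s)" step concrete, here is a deliberately simple greedy placement sketch. It is an assumption-laden illustration, not Admiralty's actual scheduling algorithm: the function name, the "free slots" capacity model, and the tie-breaking rule are all hypothetical.

```python
def spread_replicas(replicas, free_slots):
    """Greedily place each replica on the cluster with the most free slots."""
    remaining = dict(free_slots)
    placement = {name: 0 for name in free_slots}
    for _ in range(replicas):
        # Iterate names in sorted order so ties resolve alphabetically.
        target = max(sorted(remaining), key=lambda name: remaining[name])
        if remaining[target] <= 0:
            raise RuntimeError("no cluster has spare capacity")
        placement[target] += 1
        remaining[target] -= 1
    return placement

# Hypothetical fleet: cluster-a can take 3 more Pods, cluster-b can take 2.
print(spread_replicas(3, {"cluster-a": 3, "cluster-b": 2}))
```

With these inputs the sketch lands two replicas in cluster-a and one in cluster-b, matching the spread shown in the diagram.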
Exercises
- (Beginner) What is Admiralty's primary focus?
- (Beginner) What technique (shared with Liqo) does Admiralty use to offload Pods?
- (Intermediate) Give an example of a scheduling goal Admiralty enables across clusters.
- (Interview) How does Admiralty extend familiar Kubernetes scheduling semantics to the multi-cluster level, and when would you choose it? (Hint: cross-cluster scheduler + delegate Pods; scheduler-driven distribution vs. manual placement.)
Answers
- Multi-cluster scheduling — intelligently deciding which cluster each workload should run in and placing it there.
- Virtual-node / Virtual Kubelet techniques (representing target clusters so Pods can be offloaded to them).
- Any example such as: spreading a Deployment's replicas across multiple clusters for high availability, bin-packing/distributing workloads across a fleet by available capacity, or routing Pods to clusters with appropriate/spare resources or better locality/cost.
- You submit workloads to a source cluster as usual, and Admiralty's cross-cluster scheduler evaluates member clusters — honoring Kubernetes scheduling constructs (affinities, topology spread, resource requests) at the fleet level — then creates delegate Pods in the chosen target cluster(s) while the source tracks them via virtual-node proxies. So multi-cluster placement is expressed with the same familiar semantics as single-cluster scheduling. Choose Admiralty when you want a scheduler-driven way to automatically distribute workloads across clusters (for HA, capacity, or locality) rather than manually deciding and deploying to each cluster.
Submariner for cross-cluster networking
Theory
Most multi-cluster patterns assume workloads in different clusters can actually reach each other over the network — but by default cluster Pod/Service networks are isolated (and may even have overlapping CIDRs). Submariner solves the networking layer: it connects the Pod and Service networks of multiple clusters, enabling direct, encrypted cross-cluster connectivity and cross-cluster Service discovery.
Its components:
- Gateway engines in each cluster establish secure (IPsec/WireGuard) tunnels between clusters, so Pods in one cluster can route to Pods/Services in another.
- Lighthouse provides cross-cluster Service discovery (implementing the MCS API's *.clusterset.local DNS), so services are resolvable across the connected clusters.
- It handles overlapping CIDRs (via Globalnet) when clusters weren't planned with distinct ranges.
Submariner is essentially the connectivity fabric underneath higher-level multi-cluster patterns: things like cross-cluster service calls, multi-cluster meshes, or federated apps need the networks joined, and Submariner provides that flat, secure inter-cluster network and discovery. It's the answer to "my clusters can't talk to each other's Pods/Services — connect them."
Example
Submariner joins cluster networks with secure tunnels + cross-cluster DNS:
[Cluster A]==(IPsec/WireGuard tunnel via Gateways)==[Cluster B]
Pods/Services in A <---- directly reachable ----> Pods/Services in B
Lighthouse DNS: payments.ns.svc.clusterset.local resolves across clusters
Globalnet handles overlapping Pod CIDRs
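Whether Globalnet is needed at all depends on whether the clusters' address ranges collide. A quick way to check, sketched here with Python's standard ipaddress module, is to compare the CIDRs pairwise; the cluster names and ranges below are made-up examples, and the helper name is an assumption.

```python
import ipaddress
from itertools import combinations

def overlapping_pairs(cluster_cidrs):
    """Return sorted pairs of cluster names whose CIDRs overlap."""
    nets = {name: ipaddress.ip_network(cidr) for name, cidr in cluster_cidrs.items()}
    return sorted(
        (a, b) for a, b in combinations(sorted(nets), 2) if nets[a].overlaps(nets[b])
    )

# Hypothetical Pod CIDRs: two clusters were installed with the same default range.
pod_cidrs = {
    "cluster-a": "10.244.0.0/16",
    "cluster-b": "10.244.0.0/16",
    "cluster-c": "10.245.0.0/16",
}
conflicts = overlapping_pairs(pod_cidrs)
print(conflicts)  # [('cluster-a', 'cluster-b')]
```

An empty result suggests the clusters can be joined with their native addresses; any reported pair is a case where an overlay such as Globalnet (or re-addressing) would be required.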
Exercises
- (Beginner) What layer of the multi-cluster problem does Submariner address?
- (Beginner) How does Submariner secure traffic between clusters?
- (Intermediate) What does Submariner's Lighthouse component provide, and what problem does Globalnet solve?
- (Interview) Why is a connectivity fabric like Submariner a prerequisite for many higher-level multi-cluster patterns? (Hint: they assume cross-cluster Pod/Service reachability, which is isolated by default.)
Answers
- The networking layer — connecting the Pod/Service networks of multiple clusters to enable direct cross-cluster connectivity and discovery.
- Via encrypted tunnels between per-cluster gateway engines, using IPsec or WireGuard.
- Lighthouse provides cross-cluster Service discovery (implementing the MCS API *.clusterset.local DNS so Services resolve across connected clusters). Globalnet handles overlapping Pod/Service CIDRs between clusters that weren't planned with distinct ranges, allowing them to interconnect despite address conflicts.
- Higher-level patterns (cross-cluster service calls, multi-cluster service meshes, federated/offloaded workloads) all assume that a Pod/Service in one cluster can actually reach one in another. By default cluster networks are isolated (and possibly overlapping), so that assumption doesn't hold. Submariner establishes the flat, secure inter-cluster network and cross-cluster DNS that makes cross-cluster reachability real, so the higher-level patterns have the connectivity foundation they depend on. Without such a fabric, multi-cluster discovery/routing has nothing to route over.
15.3 Multi-Cluster GitOps
This subchapter covers managing many clusters declaratively with GitOps and fleet tooling.
Argo CD ApplicationSet for multi-cluster
Theory
Managing dozens of clusters by hand-writing an Argo CD Application per app-per-cluster doesn't scale. ApplicationSet is an Argo CD controller that templates and generates Applications automatically from a generator, so you define the pattern once and it produces (and keeps in sync) many Applications across clusters/environments.
Generators supply the inputs to fill the Application template:
- Cluster generator: create the app on every registered cluster (or those matching a label), e.g., "deploy monitoring to all clusters labeled env=prod."
- Git generator: generate Applications from directories/files in a Git repo (e.g., one app per subfolder).
- List / Matrix / Merge generators: explicit lists or combinations (cluster × app) for fine control.
When you add a new cluster (and it matches the generator), ApplicationSet automatically creates the corresponding Applications — no manual per-cluster work. This makes fleet-wide, consistent GitOps deployment practical: declare "these apps on these clusters" as a templated ApplicationSet, and Argo CD keeps every cluster reconciled. It's the standard way to do multi-cluster GitOps with Argo CD.
Example
apiVersion: argoproj.io/v1alpha1
kind: ApplicationSet
metadata: { name: monitoring, namespace: argocd }
spec:
  generators:
    - clusters: { selector: { matchLabels: { env: prod } } }   # all prod clusters
  template:
    metadata: { name: 'monitoring-{{name}}' }                  # per-cluster App
    spec:
      project: default
      source: { repoURL: https://github.com/org/config, path: monitoring, targetRevision: main }
      destination: { server: '{{server}}', namespace: monitoring }
      syncPolicy: { automated: { prune: true, selfHeal: true } }
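What the cluster generator does with that template can be approximated in plain Python: filter the registered clusters by label, then substitute each cluster's values into the template. This is a simplified model for intuition only, not Argo CD's implementation; the function name, the cluster records, and the example.com URLs are assumptions.

```python
def generate_applications(clusters, selector, name_template):
    """Render one Application stub per cluster whose labels match the selector."""
    apps = []
    for cluster in clusters:
        labels = cluster.get("labels", {})
        if all(labels.get(key) == value for key, value in selector.items()):
            apps.append({
                "name": name_template.replace("{{name}}", cluster["name"]),
                "destination": cluster["server"],
            })
    return apps

# Hypothetical registered clusters; only the env=prod ones should match.
clusters = [
    {"name": "eu-1", "server": "https://eu-1.example.com", "labels": {"env": "prod"}},
    {"name": "us-1", "server": "https://us-1.example.com", "labels": {"env": "prod"}},
    {"name": "dev-1", "server": "https://dev-1.example.com", "labels": {"env": "dev"}},
]
apps = generate_applications(clusters, {"env": "prod"}, "monitoring-{{name}}")
print([app["name"] for app in apps])  # ['monitoring-eu-1', 'monitoring-us-1']
```

Registering another cluster with env=prod and re-running the function yields one more Application, which is the auto-onboarding behaviour described above.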
Exercises
- (Beginner) What problem does ApplicationSet solve for multi-cluster Argo CD?
- (Beginner) What does the cluster generator do?
- (Intermediate) What happens when you register a new cluster matching an ApplicationSet's generator?
- (Interview) How do ApplicationSet generators enable scalable, consistent fleet-wide GitOps? (Hint: template once, generate many Applications from cluster/Git/list inputs; auto-onboard clusters.)
Answers
- It removes the need to hand-write a separate Application per app-per-cluster by templating and automatically generating those Applications, so managing many clusters scales.
- It generates an Application for each registered cluster (optionally filtered by label selector), so an app is deployed across all matching clusters.
- ApplicationSet automatically generates the corresponding Application(s) for the new cluster and Argo CD deploys/reconciles them — the cluster is onboarded to the relevant apps with no manual per-cluster configuration.
- Generators feed inputs (clusters, Git directories, explicit lists, or combinations via matrix/merge) into a single Application template, so you declare the deployment pattern once and ApplicationSet produces and continuously maintains all the concrete Applications across the fleet. Adding/removing clusters or apps that match the generators automatically adds/removes Applications, keeping every cluster consistently reconciled to Git without manual effort. This templated, input-driven generation is what makes large-scale, uniform multi-cluster GitOps practical.
Cluster API (CAPI) for cluster lifecycle
Theory
GitOps manages what runs on clusters — but who manages the clusters themselves (creating, scaling, upgrading, deleting them)? Cluster API (CAPI) is a Kubernetes SIG project that brings declarative, Kubernetes-style lifecycle management to clusters: you describe your clusters as Kubernetes resources (Cluster, MachineDeployment, MachineSet, Machine, and infrastructure/control-plane provider resources), and CAPI controllers provision and manage the actual clusters and their nodes to match.
The key idea: use Kubernetes to manage Kubernetes. A management cluster runs CAPI, which creates and operates workload clusters on various infrastructures via providers (AWS, Azure, GCP, vSphere, bare metal, etc.). Because clusters are now declarative CRs, you can apply the same practices to fleet lifecycle as to apps: version them in Git, GitOps-reconcile them, scale a node pool by editing a MachineDeployment's replicas, and perform rolling upgrades by changing the Kubernetes version in the spec. CAPI is the foundation for treating a fleet of clusters as cattle — reproducible, declarative, automatable — rather than hand-built pets, and it underpins many higher-level fleet/cluster-management platforms.
Example
# CAPI: declare a cluster and a node pool as Kubernetes resources
apiVersion: cluster.x-k8s.io/v1beta1
kind: Cluster
metadata: { name: prod-eu }
spec:
  controlPlaneRef: { kind: KubeadmControlPlane, name: prod-eu-cp }
  infrastructureRef: { kind: AWSCluster, name: prod-eu-infra }
---
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata: { name: prod-eu-workers }
spec:
  replicas: 5   # scale nodes by editing this
  # ... references worker infra + kubeadm bootstrap + version
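The "edit replicas, controllers converge" behaviour is an ordinary reconcile step. The toy function below sketches that idea under stated assumptions: it is not CAPI's controller logic, the machine names are invented, and the scale-down choice (drop the machines at the end of the list) is arbitrary.

```python
def reconcile(desired_replicas, machines):
    """Return (number_to_create, machines_to_delete) to converge on desired_replicas."""
    diff = desired_replicas - len(machines)
    if diff > 0:
        return diff, []                       # scale up: create the missing machines
    # Scale down (or no-op): everything beyond the desired count is surplus.
    return 0, machines[desired_replicas:]

# Hypothetical node pool with three machines.
current = ["worker-1", "worker-2", "worker-3"]
print(reconcile(5, current))  # (2, [])
print(reconcile(2, current))  # (0, ['worker-3'])
```

A real controller runs this comparison continuously, so the pool drifts back to the declared count even if a machine disappears outside of Git.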
Exercises
- (Beginner) What does Cluster API manage, in contrast to GitOps tools like Argo CD?
- (Beginner) What is the "management cluster" vs. a "workload cluster"?
- (Intermediate) How would you scale a workload cluster's node pool using CAPI?
- (Interview) What does "use Kubernetes to manage Kubernetes" mean with CAPI, and why is declarative cluster lifecycle valuable for a fleet? (Hint: clusters as CRs; reproducible, GitOps-able, rolling upgrades.)
Answers
- Cluster API manages the clusters themselves — their creation, scaling, upgrading, and deletion (and their nodes/machines) — whereas Argo CD/GitOps manage the workloads/config running inside clusters.
- The management cluster runs CAPI and its controllers, which provision and operate workload clusters (the actual clusters that run your applications) on various infrastructures via providers.
- Edit the MachineDeployment's replicas field (declaratively); CAPI's controllers reconcile by adding/removing machines (nodes) in that pool to match the new count.
- CAPI models clusters and their nodes as Kubernetes custom resources reconciled by controllers running in a management cluster, so you manage clusters with the same declarative, API-driven, controller-reconciled approach Kubernetes uses for Pods (hence "Kubernetes managing Kubernetes"). This is valuable because cluster lifecycle becomes reproducible and automatable: clusters can be version-controlled in Git and GitOps-reconciled, provisioned consistently across providers, scaled by editing specs, and upgraded via rolling changes (bump the version in the spec, controllers roll nodes). It turns a fleet of clusters into declarative "cattle" (consistent, auditable, and easy to recreate) rather than fragile hand-built environments.
Fleet management with Rancher Fleet
Theory
Rancher Fleet is a GitOps tool built specifically for managing large fleets of clusters: it is designed from the ground up to deploy and manage applications across many clusters (the project's documentation describes it as designed to manage up to a million clusters). It follows a hub-and-spoke model: a central Fleet controller (in a management cluster) drives agents in downstream clusters, deploying bundles (Git-sourced sets of manifests/Helm/Kustomize) to clusters selected by labels.
Its fleet-oriented design centers on:
- GitRepo resources pointing at Git repositories of deployments.
- Bundles: the unit of deployment, distributed to targeted clusters via cluster/label selectors and ClusterGroups.
- Per-cluster customization: values/overrides applied per target so the same bundle adapts to each cluster.
Fleet is integrated with Rancher (the broader cluster-management platform) but usable standalone. Its niche versus Argo CD/Flux is scale of clusters and fleet-first ergonomics — targeting bundles across thousands of clusters by label with per-target overrides, rather than being primarily a single-cluster/app-centric GitOps tool. It's a strong choice when your primary challenge is consistently managing app delivery across a very large or edge cluster fleet.
Example
# Fleet GitRepo: deploy a bundle to clusters selected by label
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata: { name: apps, namespace: fleet-default }
spec:
  repo: https://github.com/org/fleet-apps
  paths: [ "monitoring", "logging" ]
  targets:
    - clusterSelector: { matchLabels: { env: prod } }   # target all prod clusters
Exercises
- (Beginner) What is Rancher Fleet designed for?
- (Beginner) What is the unit of deployment in Fleet called, and how are targets chosen?
- (Intermediate) What is Fleet's niche compared to Argo CD/Flux?
- (Interview) Why does managing thousands of clusters call for a fleet-first GitOps tool, and what capabilities matter most at that scale? (Hint: label-based targeting, per-cluster overrides, hub-and-spoke agents.)
Answers
- GitOps-based management and application delivery across large fleets of clusters at scale (including edge/many-cluster scenarios).
- The bundle (a Git-sourced set of manifests/Helm/Kustomize); targets are chosen via cluster/label selectors and ClusterGroups (deploying the bundle to matching clusters, with per-target customization).
- Fleet is built fleet-first — optimized for deploying and managing apps across a very large number of clusters with label-based targeting and per-cluster overrides (and integrates with Rancher) — whereas Argo CD/Flux are more app/cluster-centric general GitOps tools (though they also support multi-cluster). Fleet's niche is scale of clusters and fleet ergonomics.
- At thousands of clusters, per-cluster manual configuration is impossible and you need to express deployments as patterns applied across many clusters. The capabilities that matter: label/selector-based targeting (deploy to groups of clusters by attributes, auto-including new matching clusters), per-cluster overrides (adapt one bundle to each cluster's specifics), a scalable hub-and-spoke agent architecture (central control, downstream agents pulling/applying, resilient to scale), and efficient reconciliation. A fleet-first tool like Fleet is engineered for this scale and these ergonomics, whereas tools centered on individual apps/clusters become unwieldy across a massive fleet.
Policy propagation across clusters
Theory
In a fleet, you must enforce consistent policy and configuration everywhere — security baselines, resource quotas, RBAC, network policies, admission rules — not cluster-by-cluster (which drifts). Policy propagation is the practice of defining governance centrally and distributing/enforcing it across all clusters uniformly.
Approaches:
- GitOps as the propagation mechanism: store policies (Kyverno/Gatekeeper policies, RBAC, quotas, NetworkPolicies) in Git and have every cluster's GitOps agent apply them — so policy is version-controlled, reviewed, and reconciled onto all clusters (ApplicationSet/Fleet target them fleet-wide).
- Policy engines with multi-cluster distribution: Kyverno and OPA Gatekeeper policies propagated to all clusters (Kyverno even supports policy-report aggregation; Gatekeeper constraints can be templated across clusters).
- Multi-cluster orchestration (e.g., Karmada): propagate resources including policies with placement rules and per-cluster overrides.
- Managed policy services: cloud/fleet platforms (e.g., Google Config Sync/Policy Controller, Azure Policy for AKS, Rancher) offer built-in fleet-wide policy enforcement and drift/compliance reporting.
The goals are consistency (the same guardrails everywhere), compliance visibility (report which clusters comply), and drift prevention (reconcile policy back if changed). The dominant modern pattern is policy-as-code in Git, propagated via GitOps and enforced by policy engines — turning fleet governance into a reviewable, automated, continuously-reconciled system rather than manual per-cluster configuration.
Example
Central policy-as-code in Git:
  [Kyverno/Gatekeeper policies | RBAC | quotas | NetworkPolicies]
                |  propagate via GitOps (ApplicationSet / Fleet / Karmada)
  +-------------+-------------+
  v             v             v
Cluster A   Cluster B     Cluster C    (same guardrails enforced + reconciled)
  |-> compliance/drift reports aggregated back
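The "compliance/drift reports" arrow boils down to comparing what Git declares with what each cluster actually has. One minimal way to sketch that comparison is to hash both sides; the policy fields and cluster names here are hypothetical, and real tooling (GitOps agents, policy engines) reports drift in its own formats.

```python
import hashlib
import json

def digest(policy):
    """Stable hash of a policy document (key order does not matter)."""
    return hashlib.sha256(json.dumps(policy, sort_keys=True).encode()).hexdigest()

def compliance_report(desired, live_by_cluster):
    """Map each cluster to 'compliant' or 'drifted' against the Git-declared policy."""
    want = digest(desired)
    return {
        cluster: "compliant" if digest(live) == want else "drifted"
        for cluster, live in live_by_cluster.items()
    }

# Hypothetical policy as declared in Git, and as found on two clusters.
desired = {"name": "require-labels", "action": "Enforce"}
live = {
    "cluster-a": {"action": "Enforce", "name": "require-labels"},
    "cluster-b": {"name": "require-labels", "action": "Audit"},  # someone weakened it
}
report = compliance_report(desired, live)
print(report)  # {'cluster-a': 'compliant', 'cluster-b': 'drifted'}
```

A reconciling agent would go one step further than this report and re-apply the Git version on any cluster marked as drifted.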
Exercises
- (Beginner) What is policy propagation across clusters?
- (Beginner) Name two kinds of policy you'd want consistent across a fleet.
- (Intermediate) How does GitOps serve as a policy-propagation mechanism?
- (Interview) What are the goals of fleet-wide policy propagation, and why is "policy-as-code in Git + policy engines" the dominant pattern? (Hint: consistency, compliance visibility, drift prevention; reviewable, automated, reconciled.)
Answers
- Defining governance/config centrally and distributing and enforcing it uniformly across all clusters in a fleet (rather than configuring each cluster separately).
- Any two: security baselines/admission policies (Kyverno/Gatekeeper), RBAC, resource quotas, NetworkPolicies, Pod Security standards, or image/registry policies.
- Policies are stored as code in Git and each cluster's GitOps agent (Argo CD/ApplicationSet, Flux, Fleet) applies and continuously reconciles them — so the same policies are deployed to every targeted cluster, version-controlled and reviewed, and any drift is corrected. Fleet-targeting mechanisms push them to all (matching) clusters automatically.
- Goals: consistency (identical guardrails on every cluster), compliance visibility (know which clusters conform), and drift prevention (automatically revert unauthorized changes). "Policy-as-code in Git + policy engines" dominates because Git gives version control, review/approval, and auditability of policy changes; GitOps continuously reconciles the policies onto all clusters (self-healing against drift); and policy engines (Kyverno/Gatekeeper) actually enforce them at admission and report compliance. Together this turns fleet governance into a reviewable, automated, continuously-enforced system that scales across many clusters — far more reliable and auditable than manual, per-cluster policy configuration.
16. Cluster Operations and Maintenance
Standing up a cluster is the beginning; keeping it healthy, current, recoverable, and cost-efficient over years is the real work — "Day 2 operations." This chapter covers the operational disciplines: upgrading clusters safely, backing up and recovering from disaster, controlling cost, and troubleshooting the failures you'll inevitably face. These are the skills that separate running a demo cluster from operating production.
16.1 Cluster Upgrades
Kubernetes releases frequently, and clusters must be upgraded to stay supported and secure. This subchapter covers doing it safely.
kubeadm upgrade workflow
Theory
Kubernetes has a fast release cadence (roughly three minor versions per year, each supported ~1 year), so upgrading is a recurring, mandatory operation. For self-managed clusters, kubeadm provides a structured upgrade workflow. The cardinal rule: upgrade the control plane before the nodes, and go one minor version at a time (you can't skip minors, e.g., 1.29 → 1.31 directly).
The kubeadm upgrade sequence:
- Upgrade the first control-plane node: upgrade the kubeadm binary, run kubeadm upgrade plan (shows available versions/what will change) and kubeadm upgrade apply <version> (upgrades the control-plane components/etcd), then upgrade that node's kubelet/kubectl and restart the kubelet.
- Upgrade other control-plane nodes: kubeadm upgrade node on each, then their kubelets.
- Upgrade worker nodes one at a time: upgrade the node's kubeadm binary, drain the node (evict Pods safely), run kubeadm upgrade node, upgrade its kubelet, restart it, then uncordon the node.
Throughout, the version skew policy (next) bounds allowed differences. Doing this carefully — control plane first, one minor at a time, nodes drained during their turn — is what makes upgrades safe and non-disruptive.
Example
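# On the first control-plane node:
apt-get install -y kubeadm=1.30.x                  # upgrade kubeadm binary (1.30.x = exact package version)
kubeadm upgrade plan                               # preview available upgrades
kubeadm upgrade apply v1.30.2                      # upgrade control plane + etcd
apt-get install -y kubelet=1.30.x kubectl=1.30.x
systemctl daemon-reload && systemctl restart kubelet
# For each worker (one at a time), from a machine with kubectl access:
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data
# Then, on node-2 itself:
apt-get install -y kubeadm=1.30.x                  # the worker's kubeadm is upgraded first
kubeadm upgrade node
apt-get install -y kubelet=1.30.x
systemctl daemon-reload && systemctl restart kubelet
# Back on the machine with kubectl access:
kubectl uncordon node-2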
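Because minor versions cannot be skipped, a jump such as 1.29 to 1.31 is really a sequence of upgrades. The small helper below, a sketch with a hypothetical function name, lists the intermediate targets you would run the workflow against, one after another.

```python
def upgrade_path(current, target):
    """List the minor versions to step through, one at a time."""
    cur_major, cur_minor = (int(part) for part in current.split(".")[:2])
    tgt_major, tgt_minor = (int(part) for part in target.split(".")[:2])
    if cur_major != tgt_major or tgt_minor < cur_minor:
        raise ValueError("only forward upgrades within the same major version are handled")
    return [f"{cur_major}.{minor}" for minor in range(cur_minor + 1, tgt_minor + 1)]

print(upgrade_path("1.29", "1.31"))  # ['1.30', '1.31']
```

Each entry in the returned list is a full pass of the sequence above: control plane first, then workers, before moving on to the next minor.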
Exercises
- (Beginner) Do you upgrade the control plane or the worker nodes first?
- (Beginner) Can you skip minor versions (e.g., go from 1.29 straight to 1.31)?
- (Intermediate) What does kubeadm upgrade plan do versus kubeadm upgrade apply?
- (Interview) Why must worker nodes be drained before upgrading, and what role does one-node-at-a-time play? (Hint: evict/reschedule workloads; maintain availability.)
Answers
- The control plane first, then the worker nodes.
- No — you must upgrade one minor version at a time (e.g., 1.29 → 1.30 → 1.31); skipping minors isn't supported.
- kubeadm upgrade plan inspects the cluster and shows available target versions and what the upgrade would change (a preview/validation step); kubeadm upgrade apply <version> actually performs the control-plane upgrade to the chosen version.
- Draining cordons the node and evicts its Pods (gracefully, respecting PodDisruptionBudgets) so they're rescheduled onto other nodes before you disrupt/restart the node's kubelet, which avoids running workloads being killed uncleanly. Doing it one node at a time keeps the rest of the cluster serving traffic and preserves capacity/availability throughout the upgrade, so only a small slice is unavailable at any moment rather than the whole fleet at once.
Managed cluster upgrade strategies
Theory
On managed Kubernetes (EKS, GKE, AKS), the provider handles the control-plane upgrade for you (often a one-click or one-API-call operation, with the provider managing etcd, API server, etc.). Your responsibility shifts largely to node (data-plane) upgrades and validating that your workloads tolerate the new version.
Managed node-upgrade strategies:
- Surge/rolling node upgrades: the provider (or a managed node group) creates new nodes on the new version, cordons/drains old nodes, and moves workloads over — minimizing disruption (respecting PDBs).
- Blue-green node pools: create a new node pool on the new version, migrate workloads, then delete the old pool — safe and easily reversible.
- Auto-upgrade / release channels: GKE (and others) offer channels (rapid/regular/stable) and can auto-upgrade the control plane and nodes on a maintenance schedule, trading control for less manual effort.
Best practices regardless of provider: upgrade non-prod first, respect PodDisruptionBudgets, use maintenance windows, and test workloads against the new version (check for removed/deprecated APIs). Managed upgrades remove most control-plane toil, but you still own workload compatibility and node rollout safety.
Example
# GKE: upgrade control plane, then node pool (or use release channels for auto)
gcloud container clusters upgrade prod --master --cluster-version 1.30.2
gcloud container clusters upgrade prod --node-pool default-pool --cluster-version 1.30.2
# EKS: control plane, then managed node group (rolling, respects PDBs)
aws eks update-cluster-version --name prod --kubernetes-version 1.30
aws eks update-nodegroup-version --cluster-name prod --nodegroup-name ng-1
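Checking manifests for removed APIs before an upgrade can be automated. The sketch below hard-codes a handful of well-known removals (per the Kubernetes deprecated API migration guide) and flags any manifest that would stop working at the target version. The table is intentionally incomplete, and the function names and sample manifests are assumptions for illustration; dedicated tools cover far more cases.

```python
# A few well-known removals; this table is illustrative, not exhaustive.
REMOVED_IN = {
    ("extensions/v1beta1", "Ingress"): "1.22",
    ("batch/v1beta1", "CronJob"): "1.25",
    ("policy/v1beta1", "PodDisruptionBudget"): "1.25",
    ("autoscaling/v2beta2", "HorizontalPodAutoscaler"): "1.26",
}

def minor(version):
    """Turn "1.30" or "1.30.2" into a comparable (major, minor) tuple."""
    return tuple(int(part) for part in version.split(".")[:2])

def removed_apis(manifests, target_version):
    """Return (name, apiVersion, kind) for objects whose API is gone at target_version."""
    hits = []
    for manifest in manifests:
        removed = REMOVED_IN.get((manifest["apiVersion"], manifest["kind"]))
        if removed and minor(removed) <= minor(target_version):
            hits.append((manifest["metadata"]["name"], manifest["apiVersion"], manifest["kind"]))
    return hits

# Hypothetical manifests already parsed into dicts.
manifests = [
    {"apiVersion": "apps/v1", "kind": "Deployment", "metadata": {"name": "web"}},
    {"apiVersion": "policy/v1beta1", "kind": "PodDisruptionBudget", "metadata": {"name": "web-pdb"}},
]
print(removed_apis(manifests, "1.30"))
```

Running a check like this against non-production manifests first gives you a concrete list of objects to migrate before the control plane is upgraded.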
Exercises
- (Beginner) In managed Kubernetes, who handles the control-plane upgrade?
- (Beginner) What upgrade responsibility remains largely yours on a managed cluster?
- (Intermediate) Describe the blue-green node pool upgrade strategy.
- (Interview) What should you do before upgrading production, regardless of managed vs. self-managed, to avoid surprises? (Hint: test non-prod, check deprecated/removed APIs, PDBs, maintenance windows.)
Answers
- The cloud provider handles the control-plane upgrade (API server, etcd, scheduler, controller-manager).
- Node/data-plane upgrades and ensuring your workloads are compatible with the new version (node pool rollout, PDBs, testing) — even if the provider offers node auto-upgrade, workload compatibility is yours.
- Create a new node pool running the new Kubernetes version alongside the old pool, migrate workloads onto the new pool (cordon/drain the old nodes so Pods reschedule to the new ones), verify everything is healthy, then delete the old node pool. It's safe and easily reversible (keep the old pool until confident).
- Upgrade and test in non-production first; check for deprecated/removed APIs your manifests use (a common breakage across minor versions) and update them; ensure PodDisruptionBudgets are set so drains don't take down too many replicas; schedule during a maintenance window; and verify application health/compatibility on the new version before rolling production. This catches API removals, workload incompatibilities, and disruption issues before they hit prod.
Node draining and cordoning
Theory
Before you take a node out of service (for upgrade, maintenance, or decommissioning), you must move its workloads off safely. Two related operations:
- Cordoning (kubectl cordon <node>): marks the node unschedulable, so no new Pods will be placed on it, but existing Pods keep running. It's a "stop sending work here" flag.
- Draining (kubectl drain <node>): cordons the node and evicts its existing Pods (gracefully), so they're rescheduled elsewhere, leaving the node empty and ready for maintenance.
Draining uses the Eviction API, which respects PodDisruptionBudgets (PDBs) — a PDB declares the minimum available replicas (or max unavailable) for an application, so drain will block/slow rather than violate it (preventing an eviction from taking a service below its safe replica count). Common flags: --ignore-daemonsets (DaemonSet Pods aren't evicted — they're node-bound), --delete-emptydir-data (acknowledge losing emptyDir data). After maintenance, kubectl uncordon <node> makes it schedulable again. Cordon/drain + PDBs are the mechanism for zero/minimal-disruption node maintenance.
Example
kubectl cordon node-2 # stop scheduling new Pods here
kubectl drain node-2 --ignore-daemonsets --delete-emptydir-data # evict Pods safely
# ... perform maintenance / upgrade ...
kubectl uncordon node-2 # allow scheduling again
# A PodDisruptionBudget makes drain respect availability:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata: { name: web-pdb }
spec:
  minAvailable: 2 # drain won't evict below 2 available web Pods
  selector: { matchLabels: { app: web } }
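The arithmetic behind minAvailable can be sketched as follows. This is a simplified illustration that assumes an integer minAvailable; disruptions_allowed is a hypothetical helper, and on a real cluster the computed value is reported in the PDB's status.disruptionsAllowed field.

```shell
# disruptions_allowed CURRENT_HEALTHY MIN_AVAILABLE  (hypothetical helper)
# Prints how many Pods could be evicted right now without dropping
# below minAvailable; never negative.
disruptions_allowed() {
  allowed=$(( $1 - $2 ))
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

disruptions_allowed 3 2   # prints 1: one web Pod may be evicted
disruptions_allowed 2 2   # prints 0: drain has to wait
```

With three healthy web Pods and minAvailable: 2, a drain can evict one Pod and must then wait for its replacement to become healthy before evicting the next.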
Exercises
- (Beginner) What is the difference between cordoning and draining a node?
- (Beginner) What does kubectl uncordon do?
- (Intermediate) How does a PodDisruptionBudget interact with a drain?
- (Interview) Why does kubectl drain skip DaemonSet Pods, and how do PDBs enable minimal-disruption maintenance? (Hint: node-bound Pods; drain blocks rather than violating availability.)
Answers
- Cordoning marks the node unschedulable (no new Pods) but leaves existing Pods running; draining cordons and evicts the existing Pods (gracefully) so they reschedule elsewhere, emptying the node.
- Marks the node schedulable again, so the scheduler can place new Pods on it after maintenance.
- Drain uses the Eviction API, which honors PDBs: it won't evict Pods if doing so would drop an application below its PDB's minAvailable (or exceed maxUnavailable). Instead the drain waits/blocks until it's safe to evict more, ensuring the service stays above its declared availability threshold.
- DaemonSet Pods are inherently tied to their node (one per node, providing node-level function) and would just be recreated on the same node, so evicting them is pointless; drain skips them (hence --ignore-daemonsets). PDBs enable minimal-disruption maintenance by declaring how many replicas must stay available; drain respects them, so evictions proceed only as fast as the app can tolerate (e.g., one at a time while keeping minAvailable up), preventing a maintenance operation from taking a service down. Together, cordon/drain move work off a node while PDBs guarantee availability isn't violated during the process.
Version skew policy
Theory
Kubernetes components run at potentially different versions during and after upgrades, so Kubernetes defines a version skew policy specifying the maximum allowed version differences between components. Respecting it is what makes staged upgrades safe. The core rules:
- kube-apiserver is the reference. In HA, the API servers may differ by at most 1 minor from each other.
- kubelet may be up to 3 minor versions older than the API server (but never newer). This is why you upgrade the control plane first — the API server must be at least as new as the kubelets.
- kube-controller-manager, kube-scheduler, cloud-controller-manager must be within 1 minor of (and not newer than) the API server.
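- kube-proxy follows the same bounds as the kubelet relative to the API server (never newer, up to 3 minors older); in practice it is usually kept at the same version as its node's kubelet.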
- kubectl is supported within ±1 minor of the API server.
The practical implications: upgrade order matters (control plane before nodes, because the API server must lead), you can't skip minors (or you'd violate skew mid-upgrade), and you shouldn't let nodes lag more than 3 minors behind. The skew policy is the formal guarantee that a partially-upgraded cluster still functions correctly, enabling rolling, non-disruptive upgrades.
Example
API server = v1.30 (the reference)
kubelet: v1.27 - v1.30 allowed (up to 3 older, never newer)
controller-mgr/sched: v1.29 - v1.30 (within 1, not newer)
kube-proxy: matches node kubelet
kubectl: v1.29 - v1.31 (±1)
=> upgrade order: API server FIRST, then the rest, one minor at a time.
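The kubelet rule above is easy to turn into a pre-flight check. The sketch below is illustrative only: minor_of and kubelet_skew_ok are hypothetical helper names, and the check assumes both versions share major version 1 and uses the 3-minor kubelet allowance stated in this section.

```shell
# minor_of VERSION  (hypothetical helper)
# Extracts the minor number from a version such as "v1.30.2" or "1.30".
minor_of() {
  v=${1#v}          # drop an optional leading "v"
  v=${v#*.}         # drop the major part ("1.")
  echo "${v%%.*}"   # keep only the minor part
}

# kubelet_skew_ok APISERVER_VERSION KUBELET_VERSION  (hypothetical helper)
# Succeeds when the kubelet is not newer than the API server
# and is at most 3 minor versions older.
kubelet_skew_ok() {
  api=$(minor_of "$1")
  kubelet=$(minor_of "$2")
  diff=$((api - kubelet))
  [ "$diff" -ge 0 ] && [ "$diff" -le 3 ]
}

kubelet_skew_ok v1.30.0 v1.27.4 && echo "v1.27 kubelet: supported"
kubelet_skew_ok v1.30.0 v1.26.9 || echo "v1.26 kubelet: too old"
kubelet_skew_ok v1.30.0 v1.31.0 || echo "v1.31 kubelet: newer than the API server"
```

Running such a check against every node's kubelet version before upgrading the control plane shows whether lagging nodes must be upgraded first.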
Exercises
- (Beginner) Which component is the reference point for the version skew policy?
- (Beginner) How many minor versions older than the API server may a kubelet be? Can it be newer?
- (Intermediate) How does the skew policy explain why you upgrade the control plane before nodes?
- (Interview) Why does the version skew policy make it impossible to skip minor versions during an upgrade? (Hint: skipping would exceed allowed component differences mid-upgrade.)
Answers
- The kube-apiserver.
- Up to 3 minor versions older, and it may never be newer than the API server.
- Because the kubelet must not be newer than the API server (and components must be within their allowed skew of it), the API server has to be upgraded first so it's at least as new as everything else. Upgrading a node's kubelet ahead of the control plane would make the kubelet newer than the API server, violating the policy.
- Component versions must stay within the allowed skew of the API server at all times. If you jumped the control plane two minors at once (e.g., 1.29 → 1.31), the components with the tightest limits would fall outside it mid-upgrade: a still-1.29 kube-controller-manager or kube-scheduler, or a second API server in an HA control plane, would be 2 minors behind a 1.31 API server when only 1 is allowed, producing an unsupported, potentially broken state. Upgrading one minor at a time keeps every component within the permitted difference throughout, so the cluster remains functional at each step, which is precisely why skipping minors isn't allowed.
16.2 Backup and Disaster Recovery
Things fail; the question is whether you can recover. This subchapter covers backing up cluster state and applications and planning for disaster.
etcd backup and restore
Theory
Since etcd holds the entire cluster state (Chapter 2), backing it up is the single most critical disaster-recovery task for a self-managed cluster: lose etcd irrecoverably and you lose the whole cluster's state. etcd provides a snapshot mechanism — etcdctl snapshot save captures a consistent point-in-time copy of the datastore, which you store safely (encrypted, off-cluster, ideally off-site).
Key practices:
- Regular, automated snapshots (frequency driven by your RPO — how much state you can afford to lose).
- Store snapshots securely and off the cluster (they contain everything, including Secrets — treat them as sensitive), and test restores (an untested backup isn't a backup).
- Restore with etcdctl snapshot restore, which recreates the etcd data directory from a snapshot; you then point etcd at it and bring the control plane back. In HA etcd, restore involves reinitializing the cluster from the snapshot.
On managed Kubernetes, the provider manages etcd (and its backups), so this is largely their responsibility — but on self-managed clusters, etcd backup/restore is your job and the foundation of cluster DR. (Note: etcd backup restores cluster objects; it doesn't back up PersistentVolume data — that needs separate volume snapshots/Velero.)
Example
# Take a snapshot of etcd (self-managed control plane):
ETCDCTL_API=3 etcdctl snapshot save /backup/etcd-$(date +%F).db \
  --endpoints=https://127.0.0.1:2379 \
  --cacert=/etc/kubernetes/pki/etcd/ca.crt \
  --cert=/etc/kubernetes/pki/etcd/server.crt \
  --key=/etc/kubernetes/pki/etcd/server.key
# Restore from a snapshot (recreates the data dir).
# On etcd 3.5+, etcdutl snapshot restore is the preferred form; etcdctl's restore is deprecated there.
ETCDCTL_API=3 etcdctl snapshot restore /backup/etcd-2026-07-01.db \
  --data-dir /var/lib/etcd-restored
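Automated snapshots accumulate, so a scheduled job usually also rotates them. The sketch below is one possible approach, not part of etcd itself: prune_snapshots is a hypothetical helper, and it assumes the date-stamped etcd-YYYY-MM-DD.db naming used above, which sorts chronologically.

```shell
# prune_snapshots DIR KEEP  (hypothetical helper)
# Deletes all but the KEEP newest etcd-YYYY-MM-DD.db files in DIR.
# Date-stamped names sort chronologically, so a reverse sort lists the newest first
# and tail skips the first KEEP entries.
prune_snapshots() {
  ls -1 "$1"/etcd-*.db 2>/dev/null | sort -r | tail -n +"$(( $2 + 1 ))" |
  while IFS= read -r f; do
    rm -- "$f"
  done
}
```

A daily job could call prune_snapshots /backup 7 right after taking the snapshot. The count to keep is an assumption to adjust to your own retention needs, and the off-cluster copies need their own retention policy.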
Exercises
- (Beginner) Why is backing up etcd the most critical DR task on a self-managed cluster?
- (Beginner) What command takes an etcd snapshot?
- (Intermediate) Why must etcd snapshots be treated as sensitive and stored off-cluster, and why test restores?
- (Interview) Does an etcd backup protect your PersistentVolume data? What else do you need for full application DR? (Hint: etcd = objects, not PV contents; need volume snapshots/Velero.)
Answers
- Because etcd stores the entire cluster state (all API objects); if it's lost and unrecoverable, the whole cluster's state is gone. A snapshot is the only way to restore that state.
- etcdctl snapshot save (with the appropriate endpoints/certs).
- Snapshots contain everything in the cluster, including Secrets, so they're highly sensitive and must be encrypted and stored securely off the cluster (so a cluster/node loss doesn't also lose the backup). Restores must be tested because a backup that has never been verified may be corrupt, incomplete, or unrestorable; you only truly have a backup if you've confirmed you can restore from it.
- No — an etcd backup captures Kubernetes objects (Deployments, Services, PVCs/PV definitions, Secrets, etc.), but not the actual data inside PersistentVolumes (the disk contents). For full application disaster recovery you also need volume-level backups — CSI volume snapshots and/or a tool like Velero (which backs up cluster objects and orchestrates PV snapshots) — so both the cluster state and the persistent application data can be recovered.
Velero for cluster and application backup
Theory
Where etcd snapshots protect the whole cluster's raw state (self-managed only, all-or-nothing), Velero provides application-level, selective, portable backup and restore for Kubernetes. It's the standard tool for backing up namespaces/applications and their data, and works on managed clusters too. Velero backs up two things: Kubernetes API objects (filtered by namespace, label, or resource type) and, via integration, the persistent volume data backing them (using CSI volume snapshots or its file-level backup, Restic/Kopia).
Velero's capabilities make it the go-to for practical DR and migration:
- Selective backup/restore: back up a single namespace or app, not the whole cluster.
- Scheduled backups with retention (TTL).
- PV data: snapshot volumes alongside object backups so restored apps have their data.
- Migration / cloning: restore a backup into a different cluster (e.g., move an app between clusters or clouds, or rebuild after disaster) — because backups are stored in object storage (S3/GCS/Azure Blob) and are portable.
So Velero complements etcd backup: etcd is coarse full-cluster state (self-managed), while Velero gives granular, portable, data-inclusive app backups that work anywhere. Many teams rely on Velero as their primary Kubernetes backup/DR tool.
Example
# Scheduled daily backup of one namespace, including volume data, kept 30 days:
velero schedule create web-daily \
  --schedule="0 2 * * *" --include-namespaces web --ttl 720h --snapshot-volumes
velero backup create web-now --include-namespaces web # on-demand backup
velero restore create --from-backup web-now # restore (same or new cluster)
Exercises
- (Beginner) What two categories of things does Velero back up?
- (Beginner) Where does Velero store its backups?
- (Intermediate) How does Velero differ from an etcd snapshot in granularity and portability?
- (Interview) How does Velero enable cross-cluster migration and application-level DR that etcd backups cannot? (Hint: selective, portable, object-storage-based, includes PV data, restore to any cluster.)
Answers
- Kubernetes API objects (selectable by namespace/label/resource) and the persistent volume data backing them (via CSI snapshots or file-level backup like Restic/Kopia).
- In object storage (e.g., AWS S3, Google Cloud Storage, Azure Blob), which makes backups durable and portable.
- An etcd snapshot is coarse and all-or-nothing — the entire cluster's state, restorable only to a matching self-managed control plane. Velero is granular (back up/restore individual namespaces/apps or filtered resources), includes PV data, works on managed clusters, and is portable (stored in object storage, restorable to a different cluster).
- Velero stores portable backups (objects + PV data) in cloud object storage independent of any single cluster, so you can restore into a different cluster — enabling migration between clusters/clouds and rebuilding after a cluster is lost. It's selective (per-namespace/app) and includes the persistent data, so you can recover specific applications with their state. An etcd snapshot only reconstructs the same self-managed cluster's object state (not PV contents) and can't selectively restore apps or move them to another cluster — so Velero covers the application-level, data-inclusive, portable DR/migration scenarios etcd backups cannot.
PV snapshot backup strategies
Theory
For stateful applications, backing up the persistent data is as important as the cluster objects. PersistentVolume snapshot strategies protect that data, using the CSI VolumeSnapshot mechanism (Chapter 7) — point-in-time copies of a volume, often orchestrated by Velero or the storage system.
Key strategy considerations:
- Application consistency: a raw volume snapshot is only crash-consistent (like a power-loss image). For databases, you should quiesce/flush the app (or use its backup mode / a pre-snapshot hook) so the snapshot is application-consistent; otherwise restore may require recovery or yield inconsistent data. Velero supports backup hooks (run a command in the Pod, e.g., fsfreeze or DB flush, before/after snapshot).
- Frequency & retention: driven by RPO; keep a rotation of snapshots (and expire old ones).
- Off-site/durability: snapshots stored with the storage provider should be replicated/exported so a regional failure doesn't lose them.
- Test restores: verify you can actually restore data and that the app starts cleanly from it.
The overall approach is to combine volume snapshots (for data) with object backups (Velero, for the Kubernetes resources) — and to ensure snapshots are application-consistent for stateful systems. This is what makes stateful-workload DR trustworthy rather than "we have snapshots but can't reliably restore."
Example
# Velero backup hook: flush/quiesce a database before the volume snapshot
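metadata:
  annotations:
    # Illustrative only: a lock taken through mysql -e is released as soon as that client session exits,
    # so a real hook needs fsfreeze or the database's own backup mode to stay consistent during the snapshot.
    pre.hook.backup.velero.io/command: '["/bin/sh","-c","mysql -e \"FLUSH TABLES WITH READ LOCK\""]'
    post.hook.backup.velero.io/command: '["/bin/sh","-c","mysql -e \"UNLOCK TABLES\""]'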
# Or a direct CSI VolumeSnapshot (point-in-time copy of a PVC):
# kind: VolumeSnapshot -> source: { persistentVolumeClaimName: data } (see Ch.7)
Exercises
- (Beginner) What mechanism is used to snapshot PersistentVolumes?
- (Beginner) What is the difference between crash-consistent and application-consistent snapshots?
- (Intermediate) How does Velero help achieve application-consistent snapshots for a database?
- (Interview) What combination of backups gives trustworthy DR for a stateful application, and what step is essential to trust it? (Hint: volume snapshots + object backup, app-consistent; test restores.)
Answers
- CSI VolumeSnapshots (point-in-time copies of a volume/PVC), often orchestrated by Velero or the storage system.
- A crash-consistent snapshot captures the disk exactly as-is at an instant (like a power-loss image), possibly with in-flight writes/unflushed buffers, so restore may need recovery or be inconsistent. An application-consistent snapshot is taken with the app quiesced/flushed so on-disk data is in a consistent, clean state at snapshot time.
- Velero supports backup hooks that run commands inside the Pod before and after the snapshot (e.g., flush and lock the database, or fsfreeze the filesystem, pre-snapshot and unlock/thaw post-snapshot), so the volume is captured in an application-consistent state.
- Combine volume snapshots for the persistent data with object-level backups (e.g., Velero for the Kubernetes resources like PVCs, StatefulSets, Secrets), ensuring the volume snapshots are application-consistent (via quiesce/flush hooks), and store them durably/off-site with sensible frequency/retention (RPO). The essential step to trust it is to test restores: regularly verify you can actually restore the data and bring the application back up cleanly, since an untested backup may be unrestorable or inconsistent.
Recovery time and recovery point objectives
Theory
Disaster recovery planning is quantified by two objectives that shape every backup/HA decision:
- RTO (Recovery Time Objective): the maximum acceptable time to restore service after an incident — "how long can we be down?" It drives choices about failover automation, standby infrastructure, and restore speed.
- RPO (Recovery Point Objective): the maximum acceptable amount of data loss measured in time — "how much recent data can we lose?" It drives backup/replication frequency (an RPO of 1 hour means backing up/replicating at least hourly).
These are business decisions with cost trade-offs: tighter objectives (near-zero RTO/RPO) require more expensive solutions (hot standbys, active-active, continuous replication), while looser objectives tolerate cheaper approaches (periodic backups, slower rebuilds). For Kubernetes: RPO informs how often you snapshot etcd/volumes and replicate; RTO informs whether you keep a warm/standby cluster (fast failover) vs. rebuilding from backups (slower). You design the backup cadence, HA topology (active-active vs. active-passive, Chapter 15), and runbooks to meet the RTO/RPO your business requires — and test that you actually can. Stating RTO/RPO turns vague "we have backups" into a measurable, verifiable DR posture.
Example
RTO = how long to recover (downtime tolerance)
RPO = how much data loss is acceptable (time)
Example targets & implications:
RPO 5 min -> replicate/snapshot at least every 5 min (continuous replication)
RTO 15 min -> need warm standby + automated failover (not "rebuild from scratch")
RPO 24h / RTO 8h -> nightly backups + documented manual rebuild is acceptable
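The targets above can be checked mechanically against a plan. This is a minimal sketch under stated assumptions: meets_rpo and meets_rto are hypothetical helpers, all durations are in minutes, and the recovery-step estimates are invented for illustration.

```shell
# meets_rpo BACKUP_INTERVAL RPO  (hypothetical helper, minutes)
# With periodic backups, the worst case loses one full interval of data,
# so the interval itself must not exceed the RPO.
meets_rpo() {
  [ "$1" -le "$2" ]
}

# meets_rto RTO STEP...  (hypothetical helper, minutes)
# Adds up the estimated duration of each recovery step and compares the total with the RTO.
meets_rto() {
  rto=$1
  shift
  total=0
  for step in "$@"; do
    total=$((total + step))
  done
  [ "$total" -le "$rto" ]
}

# Nightly backups against a 24h RPO, and a manual rebuild with assumed steps
# (detect 15 + provision 60 + restore 120 + validate 30 = 225 min) against an 8h RTO:
meets_rpo 1440 1440 && meets_rto 480 15 60 120 30 && echo "nightly backup + manual rebuild fits"
# The same nightly cadence against a 5 min RPO:
meets_rpo 1440 5 || echo "5 min RPO needs continuous replication"
```

The step estimates only become trustworthy once they come from a timed restore drill rather than from guesses.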
Exercises
- (Beginner) What does RTO measure, and what does RPO measure?
- (Beginner) Which objective drives how frequently you back up or replicate?
- (Intermediate) If your RPO is 5 minutes, what does that imply about backup/replication frequency?
- (Interview) How do RTO/RPO targets translate into Kubernetes DR architecture choices, and why are they business (not just technical) decisions? (Hint: standby cluster vs. rebuild; replication cadence; cost trade-offs.)
Answers
- RTO measures the maximum acceptable time to restore service after an incident (downtime tolerance); RPO measures the maximum acceptable amount of data loss, expressed as a time window.
- RPO (recovery point objective) — it dictates how frequently you must back up or replicate data.
- You must capture/replicate state at least every 5 minutes (e.g., continuous or near-continuous replication / very frequent snapshots), because anything older than 5 minutes could be lost in a disaster and would exceed the RPO.
- Tighter targets demand more capable (and costly) architecture: a low RTO pushes toward warm/hot standby clusters and automated failover (active-active or active-passive with fast promotion) rather than rebuilding from backups; a low RPO pushes toward frequent/continuous replication and snapshots rather than nightly backups. Looser targets allow cheaper periodic backups and manual rebuild runbooks. They're business decisions because they trade cost against acceptable downtime and data loss — the organization must decide how much it's worth to reduce potential downtime/loss, and those decisions then dictate the technical DR design (topology, backup cadence, automation) and the runbooks/tests needed to actually meet them.
16.3 Cluster Cost Optimization
Kubernetes makes it easy to over-provision and overspend. This subchapter covers keeping cluster costs under control.
Right-sizing workloads
Theory
The most common source of Kubernetes waste is over-requested resources: teams set CPU/memory requests far higher than actual usage "to be safe," so the scheduler reserves capacity that sits idle — you pay for nodes to hold reservations nothing uses. Right-sizing means aligning requests (and limits) with actual observed usage (plus a sensible safety headroom), reclaiming the wasted capacity.
The approach: measure real consumption (Metrics Server for current, Prometheus for historical percentiles like p95/p99), then set requests near typical usage and limits to accommodate peaks (with the CPU/memory caveats from Chapter 8). Tools help: the Vertical Pod Autoscaler (in recommendation mode) suggests right-sized requests from history; cost tools (Kubecost, Goldilocks) surface over/under-provisioned workloads. Right-sizing directly increases bin-packing density (more Pods per node → fewer nodes → lower cost) while avoiding the opposite failure (under-requesting → contention, throttling, OOMs). It's usually the highest-ROI cost optimization because over-provisioning is so pervasive, and it compounds with autoscaling (right-sized requests make Cluster Autoscaler and HPA far more efficient).
Example
Before (over-requested):
  requests: cpu 1000m, mem 2Gi
  actual usage: ~200m / ~400Mi
  => nodes mostly idle, overpaying
After right-sizing (requests ~ real usage):
  requests: cpu 250m, mem 512Mi
  -> 3-4x more Pods fit per node
  => fewer nodes needed, lower cost
kubectl top pods -A # observe actual usage vs. requests
# Use VPA recommendations / Goldilocks to derive right-sized requests.
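Deriving a request from observed usage can be sketched as a percentile plus headroom. The helper below is hypothetical and deliberately simple: it uses the nearest-rank p95 of the samples it is given, and both the sample values and the 25% headroom are assumptions chosen to mirror the numbers above.

```shell
# recommend_request HEADROOM_PCT SAMPLE...  (hypothetical helper)
# Prints the nearest-rank 95th percentile of the samples plus HEADROOM_PCT percent,
# in the same unit as the samples (e.g. millicores).
recommend_request() {
  headroom=$1
  shift
  printf '%s\n' "$@" | sort -n | awk -v h="$headroom" '
    { v[NR] = $1 }
    END {
      rank = int((NR * 95 + 99) / 100)   # ceil(0.95 * NR), the nearest-rank index
      printf "%d\n", v[rank] * (100 + h) / 100
    }'
}

# 20 CPU samples in millicores: mostly 150-200m with one 900m spike.
recommend_request 25 150 150 150 150 150 170 170 170 170 170 \
                     190 190 190 190 190 200 200 200 200 900
# prints 250
```

The single 900m spike sits above the 95th percentile and is ignored for the request; it is the kind of peak a limit is meant to absorb.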
Exercises
- (Beginner) What is the most common source of Kubernetes resource waste?
- (Beginner) What does "right-sizing" a workload mean?
- (Intermediate) What tools help you determine appropriate resource requests?
- (Interview) Why is right-sizing usually the highest-ROI cost optimization, and what's the risk of over-correcting? (Hint: over-provisioning is pervasive; under-requesting causes contention/OOM.)
Answers
- Over-requesting resources — setting CPU/memory requests much higher than actual usage, so reserved-but-idle capacity forces you to run (and pay for) more nodes.
- Aligning a workload's resource requests (and limits) with its actual observed usage plus reasonable headroom, instead of over-reserving.
- Metrics Server (current usage via kubectl top), Prometheus/Grafana (historical percentiles), the Vertical Pod Autoscaler in recommendation mode, and cost/analysis tools like Kubecost and Goldilocks.
- Over-provisioning is extremely common (teams pad requests "to be safe"), so reclaiming that idle reserved capacity across many workloads yields large savings with relatively little effort, and it improves bin-packing so fewer nodes are needed, compounding with autoscaling. The risk of over-correcting (setting requests too low) is resource contention, CPU throttling, and OOMKills under load, which degrades reliability. So right-sizing must target real usage plus adequate headroom (e.g., based on p95/p99), not the bare minimum.
Spot and preemptible nodes
Theory
Cloud providers sell spare capacity at steep discounts (up to ~70–90% off) as spot instances (AWS/Azure) or preemptible VMs (GCP) — with the catch that the provider can reclaim them at short notice (interruption). Running suitable workloads on spot/preemptible nodes is one of the biggest infrastructure cost levers in Kubernetes.
The key is matching workloads to interruption tolerance:
- Good fits: stateless, replicated, fault-tolerant, or restartable workloads — batch/ML jobs, CI runners, stateless services with enough replicas and PDBs to survive node loss.
- Poor fits: singleton stateful workloads, workloads that can't tolerate abrupt termination, or anything requiring guaranteed availability on a single node.
Practical patterns: use separate node pools (spot + on-demand), taint spot nodes so only opted-in (toleration-bearing) workloads run there, spread replicas across nodes/zones (topology spread) so a spot reclamation doesn't take out a whole service, handle termination notices to drain gracefully, and keep a baseline of on-demand nodes for critical/stateful components. Tools like Karpenter and Cluster Autoscaler can mix spot/on-demand intelligently. Done right, spot/preemptible nodes cut compute cost dramatically for the large share of workloads that tolerate interruption.
Example
Cost lever: run interruptible workloads on cheap spot/preemptible nodes
[on-demand pool] <- critical/stateful (guaranteed)
[spot pool (tainted)] <- batch/ML, CI, replicated stateless (tolerate interruption)
spread replicas across nodes/zones; handle termination notice -> drain
Savings: up to ~70-90% on the spot portion of compute
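Because the discount applies only to the spot portion, the overall saving depends on how much compute can actually move there. A rough sketch of that arithmetic follows; blended_savings_pct is a hypothetical helper and the shares and discounts are assumed figures.

```shell
# blended_savings_pct SPOT_SHARE_PCT SPOT_DISCOUNT_PCT  (hypothetical helper)
# Overall compute savings, in whole percent (rounded down), when SPOT_SHARE_PCT
# of on-demand-priced compute moves to spot at SPOT_DISCOUNT_PCT off.
blended_savings_pct() {
  echo $(( $1 * $2 / 100 ))
}

blended_savings_pct 60 70    # prints 42: 60% of compute on spot at 70% off
blended_savings_pct 30 90    # prints 27: a deeper discount on a smaller share
```

The on-demand baseline kept for critical and stateful components is what caps the achievable total.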
Exercises
- (Beginner) Why are spot/preemptible nodes cheaper, and what's the catch?
- (Beginner) Name a workload type well-suited to spot nodes and one that isn't.
- (Intermediate) How do taints/tolerations help you use spot nodes safely?
- (Interview) What practices make running on spot/preemptible nodes safe despite interruptions? (Hint: fault-tolerant workloads, replicas + PDB + topology spread, termination handling, on-demand baseline.)
Answers
- They're spare cloud capacity sold at a large discount (up to ~70–90% off), but the provider can reclaim/terminate them at short notice (interruption), so they're not guaranteed to keep running.
- Well-suited: stateless/replicated/fault-tolerant or restartable workloads (batch/ML jobs, CI runners, stateless services with enough replicas). Not suited: singleton stateful workloads or anything needing guaranteed uninterrupted availability on a single node.
- Taint the spot nodes so ordinary workloads won't schedule there, and add matching tolerations only to workloads that can tolerate interruption — ensuring critical/stateful workloads stay on on-demand nodes while only opted-in, interruption-tolerant workloads run on spot capacity.
- Run only interruption-tolerant workloads on spot; ensure enough replicas with PodDisruptionBudgets and topology spread across nodes/zones so a reclamation doesn't take down a service; handle termination notices (the short warning) to drain/reschedule gracefully; use taints/tolerations and separate node pools to keep critical/stateful workloads on on-demand nodes; maintain an on-demand baseline for guaranteed capacity; and let tools (Karpenter/Cluster Autoscaler) mix spot/on-demand and diversify instance types to reduce simultaneous-reclamation risk. These practices absorb interruptions so the cost savings don't compromise availability.
Cluster Autoscaler tuning
Theory
The Cluster Autoscaler (Chapter 8) saves money by removing underused nodes and adding nodes only when needed — but its default behavior may be too conservative or too aggressive for optimal cost, so tuning matters. The goal is to keep utilization high (few idle nodes) without harming availability or causing thrash.
Key tuning levers:
- Scale-down parameters: scale-down-utilization-threshold (the utilization below which a node becomes a removal candidate) and scale-down-unneeded-time (how long it must stay underused). A higher threshold or a shorter unneeded-time reclaims nodes faster (more savings) but risks removing capacity that's soon needed (thrash) and disrupting workloads.
- Node group configuration: right-size and diversify node groups; mix instance types/spot.
- Respecting workloads: PDBs, safe-to-evict annotations, and avoiding nodes with un-movable Pods (local storage) affect what can be scaled down; misconfiguration leaves nodes stuck.
- Expander strategy: how CA chooses which node group to scale up (e.g., least-waste, priority) to pick cost-effective nodes.
There's also the newer Karpenter, which replaces node-group-centric autoscaling with just-in-time, right-sized node provisioning (picking optimal instance types/sizes per pending Pods), often achieving better bin-packing and cost than a tuned Cluster Autoscaler. Tuning autoscaling is about balancing cost (aggressive scale-down, tight packing) against stability and responsiveness (avoiding thrash and capacity shortfalls).
Example
Cluster Autoscaler tuning trade-off:
aggressive: high scale-down-utilization-threshold, short scale-down-unneeded-time
-> more idle nodes removed (cheaper) BUT risk of thrash / capacity gaps
conservative: opposite -> stable but more idle capacity (costlier)
expander: least-waste | priority | price -> pick cheapest suitable node group
Karpenter: provisions right-sized nodes just-in-time per pending Pods (better packing)
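How the two scale-down parameters interact can be shown with a simplified model. The sketch below is an assumption-laden illustration, not the autoscaler's real logic: scale_down_candidate is a hypothetical helper that looks only at CPU requests, whereas the real Cluster Autoscaler also weighs memory, whether the node's Pods can be rescheduled, PDBs, and annotations.

```shell
# scale_down_candidate REQUESTED_MILLICPU ALLOCATABLE_MILLICPU THRESHOLD_PCT UNNEEDED_MIN UNNEEDED_TIME_MIN
# (hypothetical helper, CPU-only simplification)
# Succeeds when the node's requested share of CPU is below the utilization threshold
# and it has stayed that way for at least the unneeded time.
scale_down_candidate() {
  util_pct=$(( $1 * 100 / $2 ))
  [ "$util_pct" -lt "$3" ] && [ "$4" -ge "$5" ]
}

# Node with 1500m requested of 4000m allocatable (37%), underused for 12 minutes:
scale_down_candidate 1500 4000 50 12 10 && echo "threshold 50%: removal candidate"
scale_down_candidate 1500 4000 30 12 10 || echo "threshold 30%: node is kept"
```

The same node is removable at a 50% threshold but kept at 30%, so raising the threshold makes scale-down more aggressive while lowering it makes it more conservative.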
Exercises
- (Beginner) What two actions does the Cluster Autoscaler take to control cost?
- (Beginner) Name a parameter that controls how aggressively nodes are scaled down.
- (Intermediate) What risk comes from making scale-down too aggressive?
- (Interview) How does Karpenter differ from the traditional Cluster Autoscaler, and why can it improve cost/packing? (Hint: node-group scaling vs. just-in-time right-sized node provisioning.)
Answers
- It adds nodes when there are unschedulable Pods (scale up) and removes underused nodes whose Pods can be rescheduled (scale down), keeping capacity matched to demand.
- scale-down-utilization-threshold (how empty a node must be to be considered for removal) or scale-down-unneeded-time (how long it must remain underused before removal). (Either is acceptable.)
- Thrash and capacity shortfalls: removing nodes too eagerly can delete capacity that's needed again moments later (causing repeated add/remove churn and scheduling delays) and can disrupt workloads (evictions), hurting stability and responsiveness even though it maximizes short-term savings.
- The traditional Cluster Autoscaler scales predefined node groups (fixed instance types) up/down by count. Karpenter instead provisions nodes just-in-time, choosing the optimal instance type and size for the current set of pending Pods (and consolidating/replacing nodes for better packing), rather than being constrained to preconfigured groups. This lets it pick the most cost-effective, best-fitting capacity and bin-pack more tightly, often yielding better utilization and lower cost than a node-group-based autoscaler, with faster and more flexible provisioning.
Kubecost and cost visibility tools
Theory
You can't optimize what you can't measure, and Kubernetes makes cost opaque: a shared cluster's bill doesn't tell you which team, namespace, or workload drove the spend. Cost visibility tools — chiefly Kubecost (and OpenCost, its CNCF open-source core, plus cloud-native cost tools) — solve this by allocating cluster cost down to namespaces, workloads, labels, and teams.
What they provide:
- Cost allocation/showback/chargeback: break down spend by namespace/label/team (based on each workload's resource requests/usage and the underlying node/cloud prices), enabling accountability.
- Efficiency insights: identify over-provisioned/idle resources and right-sizing opportunities (unused requests, idle nodes), quantifying waste.
- Recommendations: suggest request right-sizing, spot usage, and autoscaling improvements.
- Budgets/alerts: notify when spend or a namespace exceeds thresholds, and forecast costs.
The value is turning an opaque aggregate bill into actionable, attributable data: teams see their own cost, waste is quantified, and optimization efforts (right-sizing, spot, autoscaler tuning) can be prioritized by impact and verified. Cost visibility is the feedback loop that makes all the other cost optimizations measurable and sustained rather than one-off.
Example
# Kubecost/OpenCost surface cost per namespace/workload/label:
# namespace cpu-cost mem-cost total/mo efficiency
# team-a $420 $180 $600 38% <- over-provisioned
# team-b $150 $90 $240 81%
# plus: right-sizing recommendations, idle-cost, budgets/alerts
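The allocation idea can be sketched in a few lines. This is a minimal illustration, not the actual Kubecost or OpenCost algorithm: it assumes one node whose price is split evenly between CPU and memory, and attributes that price to namespaces in proportion to their resource requests. The function name, the 50/50 split, and the sample workloads are all hypothetical.

```python
from collections import defaultdict

def allocate_node_cost(node_cost, node_cpu, node_mem_gib, workloads, cpu_share=0.5):
    """Split one node's cost across namespaces in proportion to requests.

    workloads: list of (namespace, cpu_request_cores, mem_request_gib).
    cpu_share: assumed fraction of the node price attributed to CPU
               (the rest goes to memory). Unrequested capacity is "__idle__".
    """
    cpu_price = node_cost * cpu_share / node_cpu            # cost per core
    mem_price = node_cost * (1 - cpu_share) / node_mem_gib  # cost per GiB
    costs = defaultdict(float)
    used_cpu = used_mem = 0.0
    for namespace, cpu, mem in workloads:
        costs[namespace] += cpu * cpu_price + mem * mem_price
        used_cpu += cpu
        used_mem += mem
    # Capacity nobody requested still costs money: report it as idle cost.
    costs["__idle__"] = (node_cpu - used_cpu) * cpu_price + (node_mem_gib - used_mem) * mem_price
    return dict(costs)

costs = allocate_node_cost(
    node_cost=100.0, node_cpu=4, node_mem_gib=16,
    workloads=[("team-a", 2, 4), ("team-b", 1, 4)],
)
for namespace, cost in sorted(costs.items()):
    print(f"{namespace:10s} ${cost:6.2f}")
```

The per-namespace figures plus the idle figure add back up to the node price, which is the property that makes showback numbers reconcilable with the cloud bill.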
Exercises
- (Beginner) What problem do Kubernetes cost visibility tools solve?
- (Beginner) What is the open-source CNCF core underlying Kubecost?
- (Intermediate) What does "cost allocation" (showback/chargeback) enable?
- (Interview) Why is cost visibility the feedback loop that makes other optimizations (right-sizing, spot, autoscaler tuning) effective and sustained? (Hint: attribute/quantify waste, prioritize by impact, verify results, accountability.)
Answers
- Cost opacity — a shared cluster's cloud bill doesn't reveal which namespace/workload/team drove the spend; these tools allocate and attribute cost so it becomes visible.
- OpenCost (the CNCF open-source cost-allocation core that Kubecost builds on).
- Breaking down and attributing cluster cost to specific namespaces, labels, teams, or workloads — enabling showback (showing each team its cost) and chargeback (billing teams for their usage), which drives accountability and informed decisions.
- Optimizations only stick if you can see their target and their effect. Cost visibility quantifies and attributes waste (which workloads over-request, which nodes sit idle, which namespaces cost most), so teams can prioritize the highest-impact fixes (right-sizing the worst offenders, moving suitable workloads to spot, tuning the autoscaler), and then verify that changes actually reduced cost. It also creates accountability (teams own their spend) and can alert on regressions. Without this measurement/feedback loop, optimizations are guesswork and savings erode over time; with it, cost management becomes data-driven, targeted, and continuously maintained.
16.4 Troubleshooting
Things break; effective troubleshooting is a core operational skill. This subchapter covers diagnosing the most common Kubernetes failures.
Debugging pod failures and CrashLoopBackOff
Theory
The most common Pod problem is a container that keeps crashing, shown as CrashLoopBackOff — the container starts, exits/crashes, Kubernetes restarts it, it crashes again, and the kubelet applies an exponential backoff (increasing delay between restarts) to avoid hammering. CrashLoopBackOff is a symptom (the container won't stay up), not a root cause — your job is to find why it's crashing.
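To make the backoff concrete, here is a small sketch of the delay schedule. It assumes the kubelet defaults described in the upstream documentation (an initial 10-second delay that doubles after each crash and is capped at 5 minutes); the function name is illustrative, and the values may differ if your cluster overrides them.

```python
def crashloop_delays(restarts, base=10, cap=300):
    """Return the assumed wait (seconds) before each of the first `restarts` restarts."""
    return [min(base * 2 ** i, cap) for i in range(restarts)]

print(crashloop_delays(7))  # [10, 20, 40, 80, 160, 300, 300]
```

This is why a Pod that has been crashing for a while can sit in CrashLoopBackOff for several minutes between attempts, even after you have fixed the underlying cause.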
The systematic debugging workflow:
- kubectl describe pod <name>: read the container state/last state (exit code, reason like Error/OOMKilled) and the Events (image pull errors, probe failures, scheduling issues).
- kubectl logs <name> --previous: the crashed container's own logs are usually the most direct clue (stack trace, config error, missing dependency).
- Check the exit code: e.g., 137 = SIGKILL (most often OOMKilled), 1/2 = application error, 126 = command found but not executable, 127 = command not found.
- Common causes: application bug/misconfiguration (bad env/ConfigMap/Secret), missing dependency (a DB not reachable at startup), failing liveness probe (restarting a healthy-but-slow app), insufficient memory (OOM), or a wrong command/entrypoint.
- For images without a shell, use kubectl debug (ephemeral container) to inspect.
The discipline is describe → logs (--previous) → exit code → hypothesis → fix, treating CrashLoopBackOff as a pointer to dig deeper rather than the answer itself.
Example
kubectl get pod web # STATUS: CrashLoopBackOff, RESTARTS climbing
kubectl describe pod web # Events + Last State: Terminated, Reason: Error, Exit: 1
kubectl logs web --previous # the crashed container's logs -> the real error
# e.g., "FATAL: config key DB_HOST not set" -> fix the ConfigMap/env, redeploy
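The exit-code step of the workflow can be sketched as a small lookup. This is an illustrative helper, not part of kubectl: it assumes the usual shell convention that codes above 128 mean "killed by signal (code - 128)", and the signal table is deliberately partial.

```python
# Partial, hand-picked signal table (an assumption for illustration).
SIGNALS = {2: "SIGINT", 6: "SIGABRT", 9: "SIGKILL", 11: "SIGSEGV", 15: "SIGTERM"}

def explain_exit_code(code):
    """Map a container exit code to a likely meaning (shell conventions)."""
    if code == 0:
        return "exited cleanly"
    if code == 126:
        return "command found but not executable"
    if code == 127:
        return "command not found"
    if 128 < code < 256:
        signum = code - 128
        name = SIGNALS.get(signum, f"signal {signum}")
        hint = " (check for OOMKilled)" if signum == 9 else ""
        return f"killed by {name}{hint}"
    return "application error (read the previous logs)"

for code in (1, 127, 137, 143):
    print(code, "->", explain_exit_code(code))
```

The result is only a hint about where to look next; the Reason field in kubectl describe pod and the previous container's logs remain the authoritative evidence.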
Exercises
- (Beginner) What does CrashLoopBackOff mean, and is it a root cause?
- (Beginner) Which command shows the logs of the crashed (previous) container?
- (Intermediate) What does exit code 137 usually indicate?
- (Interview) Walk through a systematic approach to diagnosing a CrashLoopBackOff. (Hint: describe → previous logs → exit code → common causes → ephemeral debug.)
Answers
- It means the container repeatedly starts, crashes, and is restarted with increasing (exponential) backoff delays. It's a symptom, not a root cause — you still have to find why the container is exiting.
- kubectl logs <pod> --previous.
- The container was killed by SIGKILL (128 + 9), most commonly OOMKilled (it exceeded its memory limit), though it can be any external SIGKILL.
- Start with kubectl describe pod to read the container's last state/exit code and the Events (pull errors, probe failures, scheduling). Then kubectl logs --previous to see the crashed container's own output (usually the clearest clue). Interpret the exit code (137 = OOM, 1/2 = app error, 126/127 = bad command). Form a hypothesis from common causes: app misconfig (bad env/ConfigMap/Secret), missing/unreachable dependency at startup, an aggressive liveness probe restarting a slow app, insufficient memory, or a wrong entrypoint. If the image lacks a shell, use kubectl debug (ephemeral container) to inspect the environment/filesystem. Fix the identified cause and redeploy, then confirm the restarts stop.
Node Not Ready troubleshooting
Theory
A node in NotReady state can't run new workloads, and after a timeout its Pods get evicted/rescheduled — so diagnosing it quickly matters. NotReady means the node's kubelet has stopped reporting healthy status to the control plane (missed heartbeats) or is reporting a problem condition. The cause is on the node or its connectivity, so you investigate at the node level.
The diagnostic path:
- kubectl describe node <name>: check the node Conditions (Ready, and pressure conditions like MemoryPressure, DiskPressure, PIDPressure) and Events. These often name the problem (e.g., disk full).
- Common causes: the kubelet is down/crashed or misconfigured; network issues preventing the kubelet from reaching the API server; resource exhaustion on the node (disk full → DiskPressure, out of memory); container runtime (containerd/CRI-O) down; CNI problems; certificate/expiry issues; or the node/VM itself failed.
- On the node (SSH): check systemctl status kubelet and its logs (journalctl -u kubelet), the runtime status, disk (df -h) and memory, and connectivity to the API server.
- Remediate: restart the kubelet/runtime, free disk space, fix networking, or (in cloud/autoscaled setups) let the node be replaced.
The mental model: NotReady = "the control plane can't confirm this node is healthy," so trace kubelet → runtime → node resources → connectivity to find and fix the break.
Example
kubectl get nodes # node-3 NotReady
kubectl describe node node-3 # Conditions: DiskPressure=True, or Ready=Unknown
# On the node:
systemctl status kubelet # is the kubelet running?
journalctl -u kubelet -n 100 # kubelet errors (runtime? certs? network?)
df -h # disk full -> DiskPressure
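When many nodes are involved, the Conditions check can be scripted. The sketch below assumes JSON in the shape returned by kubectl get node <name> -o json (status.conditions with type/status/reason fields); the embedded sample is trimmed and hypothetical, and the helper name is illustrative.

```python
import json

def unhealthy_conditions(node):
    """Return (type, status, reason) for every condition not in its healthy state.

    Ready is healthy when "True"; every other condition (MemoryPressure,
    DiskPressure, PIDPressure, NetworkUnavailable) is healthy when "False".
    """
    problems = []
    for cond in node.get("status", {}).get("conditions", []):
        healthy = "True" if cond["type"] == "Ready" else "False"
        if cond["status"] != healthy:
            problems.append((cond["type"], cond["status"], cond.get("reason", "")))
    return problems

# Trimmed, hypothetical sample in the shape of `kubectl get node node-3 -o json`.
sample = json.loads("""
{"status": {"conditions": [
  {"type": "MemoryPressure", "status": "False", "reason": "KubeletHasSufficientMemory"},
  {"type": "DiskPressure",   "status": "True",  "reason": "KubeletHasDiskPressure"},
  {"type": "PIDPressure",    "status": "False", "reason": "KubeletHasSufficientPID"},
  {"type": "Ready",          "status": "False", "reason": "KubeletNotReady"}
]}}
""")

for ctype, status, reason in unhealthy_conditions(sample):
    print(f"{ctype}={status} ({reason})")
```

Note that a Ready status of "Unknown" (missed heartbeats) is also reported as unhealthy, which matches the second case shown in the example above.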
Exercises
- (Beginner) What does a NotReady node status fundamentally indicate?
- (Beginner) Where do you look first to see node conditions and events?
- (Intermediate) List three common causes of a node going NotReady.
- (Interview) Describe the chain you'd investigate (from control plane to node internals) to diagnose a NotReady node. (Hint: describe node conditions → kubelet → runtime → resources → connectivity.)
Answers
- That the node's kubelet is no longer reporting a healthy Ready status to the control plane (missed heartbeats or a failing condition): the control plane can't confirm the node is healthy, so it won't schedule new Pods there.
- kubectl describe node <name>: its Conditions (Ready, MemoryPressure, DiskPressure, PIDPressure) and Events.
- Any three: kubelet down/crashed or misconfigured; network/connectivity failure to the API server; node resource exhaustion (disk full → DiskPressure, out of memory); container runtime (containerd/CRI-O) down; CNI/networking problems; certificate expiry; or the underlying node/VM failing.
- Start at the control plane: kubectl describe node to read Conditions/Events (they often point at the cause, e.g., DiskPressure). Then go to the node itself: check the kubelet (systemctl status kubelet, journalctl -u kubelet) since NotReady usually means the kubelet isn't reporting healthy; check the container runtime status; check node resources (df -h for disk, memory) for exhaustion; and verify connectivity to the API server (network/DNS/certs). Follow the chain kubelet → runtime → resources → connectivity to isolate the break, then remediate (restart kubelet/runtime, free resources, fix networking) or replace the node.
Network connectivity debugging
Theory
Network problems ("Service X can't reach Service Y") are among the trickiest to debug because they span DNS, Services, endpoints, kube-proxy, CNI, and NetworkPolicies. A layered, systematic approach isolates the failing layer:
- DNS: can the client resolve the target's name? Test from a debug Pod (nslookup <service>.<ns>). DNS failures point at CoreDNS, the Pod's resolv.conf, or a NetworkPolicy blocking port 53.
- Service & endpoints: does the Service exist and have endpoints? Check with kubectl get svc,endpointslices; empty endpoints means the selector matches no ready Pods (label mismatch or Pods not ready).
- Direct Pod connectivity: can you reach the target Pod IP directly (bypassing the Service)? If Pod-to-Pod works but Service doesn't, suspect kube-proxy/Service config; if Pod-to-Pod fails, suspect CNI/routing.
- NetworkPolicy: is a policy blocking the traffic? Check for policies selecting either side (remember default-deny once selected, and egress policies breaking DNS). Cilium/Hubble can show dropped flows.
- Ports/app: is the app actually listening on the expected targetPort?
Tools: a debug Pod (kubectl run net -it --rm --image=nicolaka/netshoot, or kubectl debug) to run curl, nslookup, nc, dig from inside the cluster network. The discipline is to test each layer in order (DNS → Service/endpoints → Pod IP → policy → app) to localize the fault rather than guessing.
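The layered order can be written down as a tiny decision procedure. This is only a sketch of the reasoning in this section: the four inputs are the pass/fail results you collect by hand, and the function name and returned labels are made up for illustration.

```python
def localize_fault(dns_ok, has_endpoints, pod_ip_ok, service_ok):
    """Return the layer to investigate first, given the layered check results."""
    if not dns_ok:
        return "dns"                           # CoreDNS, resolv.conf, policy on port 53
    if not has_endpoints:
        return "selector-or-readiness"         # label mismatch or Pods not Ready
    if not pod_ip_ok:
        return "cni-policy-or-app-port"        # routing, NetworkPolicy, or app not listening
    if not service_ok:
        return "kube-proxy-or-service-config"  # Pod is reachable, so the Service mapping is off
    return "no-network-fault-found"

# Name resolves, endpoints exist, Pod IP answers, but the Service address does not:
print(localize_fault(dns_ok=True, has_endpoints=True, pod_ip_ok=True, service_ok=False))
```

The checks are ordered so that each answer is meaningful only once the earlier layers have passed, which is why the function returns at the first failure.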
Example
# From a debug Pod inside the cluster network (netshoot):
kubectl run net --rm -it --image=nicolaka/netshoot -- bash
nslookup payments.default # 1) DNS resolves?
curl -v http://payments.default:80 # 2/3) Service reachable?
curl -v http://10.244.2.7:8080 # 3) direct Pod IP reachable?
# Back on your workstation (outside the debug Pod):
kubectl get svc,endpointslices -n default # 2) does the Service have endpoints?
kubectl get networkpolicy -A # 4) any policy blocking traffic?
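Why a Service ends up with no endpoints is easier to see in code. The sketch below models equality-based label selection plus the readiness requirement; it is a simplification (real selection is done by the endpoint controllers, and the Pod data here is hypothetical), and it assumes a non-empty selector.

```python
def ready_endpoints(selector, pods):
    """Return names of Pods a Service with this selector would route to.

    A Pod qualifies only if every selector key/value appears in its labels
    (equality-based matching) and the Pod is Ready.
    """
    return [
        pod["name"]
        for pod in pods
        if pod["ready"] and all(pod["labels"].get(k) == v for k, v in selector.items())
    ]

pods = [
    {"name": "payments-1", "labels": {"app": "payments", "tier": "api"}, "ready": True},
    {"name": "payments-2", "labels": {"app": "payments", "tier": "api"}, "ready": False},
    {"name": "billing-1",  "labels": {"app": "billing"},                 "ready": True},
]

print(ready_endpoints({"app": "payments"}, pods))  # ['payments-1']
print(ready_endpoints({"app": "payment"}, pods))   # [] (typo in the selector)
```

Both failure modes from the Theory section show up here: payments-2 is excluded because it is not Ready, and a one-character typo in the selector leaves the Service with nothing to route to.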
Exercises
- (Beginner) What's the first layer to test when a service can't reach another (name-based)?
- (Beginner) What does an empty EndpointSlice for a Service usually indicate?
- (Intermediate) If Pod-to-Pod (direct IP) works but Service access fails, where do you look?
- (Interview) Describe the layered order you'd use to isolate a cross-service connectivity failure, and why order matters. (Hint: DNS → Service/endpoints → Pod IP → NetworkPolicy → app port.)
Answers
- DNS resolution: verify the client can resolve the target's Service name (e.g., nslookup <service>.<namespace> from a debug Pod).
- That the Service's selector matches no ready Pods: typically a label mismatch between the Service selector and the Pods, or the backing Pods aren't Ready (failing readiness), so there are no endpoints to route to.
- At the Service layer/kube-proxy and Service configuration: since the Pods are reachable directly, the problem is in how the Service maps to them — check the Service's endpoints/selector, port/targetPort, and kube-proxy health/rules (iptables/IPVS) on the relevant nodes.
- Test in layers: (1) DNS (can the name resolve?), (2) Service & endpoints (does the Service exist with ready endpoints?), (3) direct Pod IP (is the target reachable bypassing the Service? — isolates CNI/routing vs. Service issues), (4) NetworkPolicy (is traffic being dropped by policy, including egress/DNS blocks?), and (5) app/port (is the app listening on the expected port?). Order matters because each layer builds on the previous: name resolution must work before Service routing, Service routing depends on endpoints, and confirming direct Pod connectivity distinguishes a network/CNI fault from a Service/kube-proxy fault. Testing sequentially localizes exactly where the path breaks instead of guessing across many possible causes.
OOMKilled and resource exhaustion
Theory
OOMKilled (exit code 137) means a container exceeded its memory limit and the kernel's OOM killer terminated it (memory is incompressible, Chapter 8 — you can't throttle it, so the process is killed). It manifests as restarts/CrashLoopBackOff with the last state showing OOMKilled. This is a resource-exhaustion failure, and there are two levels to consider:
- Container-level OOM: the container hit its own memory limit. Fix by raising the limit to fit actual usage (observe with kubectl top/Prometheus), reducing the app's memory use (fix leaks, tune heap/cache, using the Downward API to size to the limit), or right-sizing.
- Node-level memory pressure: the node runs low on memory, triggering MemoryPressure and kubelet eviction of Pods (in practice roughly in QoS order: BestEffort first, then Burstable exceeding requests, Guaranteed last, Chapter 8). Fix by right-sizing requests/limits (so scheduling reflects real usage), spreading load, adding capacity, and giving critical Pods Guaranteed QoS.
Diagnosis: kubectl describe pod (last state OOMKilled, exit 137), check the container's limit vs. its actual usage trend, and look for node MemoryPressure and eviction events. The distinction matters: container OOM = that container's limit is too low (or it leaks); node pressure = the node is oversubscribed (usually from under-set requests). Both point back to correct resource requests/limits and right-sizing as the durable fix.
Example
kubectl describe pod web
# Last State: Terminated Reason: OOMKilled Exit Code: 137
# Limits: memory: 256Mi <- container hit its own limit
kubectl top pod web # actual memory usage vs. the 256Mi limit
kubectl get events --field-selector reason=Evicted # node-level memory eviction?
kubectl describe node node-3 | grep MemoryPressure # node oversubscribed?
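Since eviction order depends on QoS class, it helps to see how the class follows from requests and limits. The sketch below is a simplified model: it ignores init containers and compares quantities as plain strings (so "1" and "1000m" would wrongly count as different); the function name is illustrative.

```python
def qos_class(containers):
    """Classify a Pod as Guaranteed, Burstable, or BestEffort.

    containers: list of {"requests": {...}, "limits": {...}} with optional
    "cpu" and "memory" keys (values compared as plain strings here).
    """
    any_set = False
    guaranteed = True
    for c in containers:
        limits = c.get("limits", {})
        # A missing request defaults to the limit for that resource.
        requests = {**limits, **c.get("requests", {})}
        if requests or limits:
            any_set = True
        for resource in ("cpu", "memory"):
            if resource not in limits or requests.get(resource) != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"

web = [{"requests": {"cpu": "500m", "memory": "256Mi"},
        "limits":   {"cpu": "500m", "memory": "256Mi"}}]
print(qos_class(web))                                     # Guaranteed
print(qos_class([{"requests": {"memory": "128Mi"},
                  "limits":   {"memory": "256Mi"}}]))     # Burstable
print(qos_class([{}]))                                    # BestEffort
```

A Pod becomes Guaranteed only when every container sets both CPU and memory limits with requests equal to them, which is why a single container with no resources set drops an otherwise careful Pod to Burstable.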
Exercises
- (Beginner) What does OOMKilled mean, and what exit code accompanies it?
- (Beginner) Why can't Kubernetes just "throttle" memory like it does CPU?
- (Intermediate) Distinguish a container-level OOM kill from node-level memory-pressure eviction.
- (Interview) What is the durable fix for recurring OOMKills and memory-pressure evictions, and how does QoS factor in? (Hint: right-size requests/limits; Guaranteed QoS for critical workloads.)
Answers
- The container exceeded its memory limit and was killed by the kernel's OOM killer; the exit code is 137 (128 + SIGKILL 9).
- Memory is incompressible — you can't reclaim in-use memory from a running process gracefully the way you can withhold CPU cycles. So when a container exceeds its memory limit, the only enforcement is to kill it (OOM), not slow it down.
- Container-level OOM: the container hit its own memory limit and was killed (its limit is too low or the app leaks/uses too much). Node-level memory-pressure eviction: the node ran low on memory, so the kubelet evicted Pods (in QoS order) to reclaim memory — a node-oversubscription problem, typically from requests set below real usage.
- Right-size requests and limits to reflect actual memory usage (observed via metrics), and fix the app's consumption (leaks, oversized heaps/caches — e.g., size to the limit via the Downward API). Correct requests make the scheduler stop oversubscribing nodes (preventing node pressure), and adequate limits stop premature container OOMs. QoS matters because under node memory pressure the kubelet evicts BestEffort first, then Burstable exceeding requests, and protects Guaranteed last — so setting critical workloads to Guaranteed (requests == limits) ensures they're the least likely to be evicted, while proper requests/limits across the board reduce the pressure in the first place.
etcd health and performance
Theory
Because etcd is the cluster's brain (Chapter 2), etcd problems degrade or break the entire cluster — a slow or unhealthy etcd makes the API server slow or unavailable, so everything (scheduling, controllers, kubectl) suffers. etcd health/performance is therefore a critical operational concern, especially at scale.
What to watch and why:
- etcd is extremely disk-latency-sensitive: it fsyncs writes to disk for consistency, so slow disks are the #1 cause of etcd problems. Fast SSDs/NVMe and low WAL fsync and backend commit latency are essential; high disk latency shows up as etcd_disk_wal_fsync_duration_seconds spikes and leader elections/warnings.
- Quorum/health: use etcdctl endpoint health/status to check members, the leader, and DB size; a member down or repeated leader elections signal trouble (network or disk).
- Database size: etcd has a space quota (2 GB by default, configurable, with roughly 8 GB as the suggested maximum); exceeding it raises a NOSPACE alarm and etcd stops accepting writes. Mitigate with compaction (of old revisions) and defragmentation to reclaim space, and keep object counts/churn reasonable (e.g., don't store huge/many objects, high event churn).
- Latency between members: etcd needs low-latency networking among members (avoid spreading across high-latency links).
Diagnosis combines etcd metrics (via Prometheus), etcdctl health/status, and control-plane symptoms (slow API responses). The durable fixes: fast dedicated disks, adequate resources, keeping the DB compacted/defragmented and under quota, low-latency member networking, and reliable backups (16.2). Treat etcd as the highest-sensitivity component: keep it fast, healthy, and backed up.
Example
ETCDCTL_API=3 etcdctl endpoint status --write-out=table --endpoints=...
# shows each member's DB SIZE, LEADER, RAFT TERM (watch DB size vs. quota)
ETCDCTL_API=3 etcdctl endpoint health --endpoints=... # member health
# Key Prometheus metric (disk sensitivity):
# etcd_disk_wal_fsync_duration_seconds -> high = slow disk = etcd problems
# Reclaim space: compact old revisions, defragment, then clear any NOSPACE alarm
etcdctl compact <rev>; etcdctl defrag --endpoints=...; etcdctl alarm disarm
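The quorum point is simple arithmetic, sketched below. etcd needs a majority of members to commit a write, so the helper names here are illustrative but the formula is the standard majority rule.

```python
def quorum(members):
    """Members that must agree for etcd (Raft) to commit a write."""
    return members // 2 + 1

def failures_tolerated(members):
    """Members that can be lost while the cluster keeps quorum."""
    return members - quorum(members)

for n in (1, 2, 3, 4, 5):
    print(f"{n} members: quorum={quorum(n)}, tolerates {failures_tolerated(n)} failure(s)")
```

Going from 3 to 4 members raises the quorum without tolerating any extra failure, which is why odd-sized clusters (3 or 5) are the usual choice.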
Exercises
- (Beginner) Why does an unhealthy etcd affect the whole cluster?
- (Beginner) What hardware characteristic is etcd most sensitive to?
- (Intermediate) What happens when etcd exceeds its storage quota, and how do you reclaim space?
- (Interview) What are the key measures to keep etcd healthy and performant at scale? (Hint: fast dedicated disks, low fsync latency, compaction/defrag under quota, low-latency members, backups.)
Answers
- Because etcd is the sole datastore behind the API server; if etcd is slow or unavailable, the API server becomes slow/unavailable, and since every component works through the API server, scheduling, controllers, and kubectl all degrade or fail — the whole cluster is affected.
- Disk latency — etcd fsyncs writes to disk for consistency, so slow disks are the primary cause of etcd performance problems; it needs fast, low-latency storage (SSD/NVMe).
- Exceeding the storage quota triggers a NOSPACE alarm that puts etcd into a read-only/maintenance state (rejecting writes), which effectively breaks cluster changes. You reclaim space by compacting old key revisions and then defragmenting the database to release freed space (and then clear the alarm), plus reduce what's stored/churned.
- Give etcd fast, dedicated, low-latency disks (SSD/NVMe) and monitor fsync/WAL and backend-commit latency; provision adequate CPU/memory; keep the database under its quota via regular compaction and defragmentation and by limiting object count/size and event churn; ensure low-latency networking between etcd members (co-located/close, odd-numbered for quorum); watch health/leader stability (avoid frequent leader elections); and maintain reliable, tested backups (snapshots) for recovery. Together these keep etcd fast, stable, and recoverable, which keeps the whole control plane healthy.
17. Advanced Topics
This final chapter surveys the frontier of the Kubernetes ecosystem: kernel-level programmability with eBPF, running specialized hardware like GPUs, serverless and event-driven paradigms, AI/ML platforms, and stretching Kubernetes to the edge. These topics push beyond running ordinary services, showing how Kubernetes has become a universal control plane for increasingly diverse workloads and environments.
17.1 eBPF in Kubernetes
eBPF is reshaping Kubernetes networking, security, and observability. This subchapter covers the technology and its applications.
eBPF fundamentals
Theory
eBPF (extended Berkeley Packet Filter) lets you run small, sandboxed programs inside the Linux kernel in response to events — network packets, syscalls, function entry/exit, tracepoints — without modifying kernel source or loading kernel modules. The kernel verifies each program before loading it (proving it can't crash or hang the kernel — bounded loops, safe memory access), then JIT-compiles it to native code, so it runs safely and efficiently in kernel space.
Why this is transformative for Kubernetes: traditionally, adding kernel-level networking/security/observability meant kernel modules (risky, hard to maintain) or userspace tools (slow, requiring context switches and data copies). eBPF gives the safety of userspace with the performance and vantage point of the kernel — programmable, dynamic behavior at the exact place packets and syscalls are handled. Programs communicate with userspace via eBPF maps (efficient key-value structures). This foundation underpins Cilium (networking/mesh, Chapter 13), Falco (runtime security, Chapter 9), and a wave of observability tools — making eBPF one of the most important recent developments in the cloud-native stack.
Example
eBPF program lifecycle:
write program -> kernel VERIFIER checks safety -> JIT to native -> attach to a hook
hooks: network (XDP/tc), syscalls, kprobes/tracepoints, cgroups, ...
eBPF MAPS <-> userspace (share data efficiently)
Result: kernel-speed, safe, dynamic programmability (no module, no kernel rebuild)
Exercises
- (Beginner) What does eBPF let you do, and where do the programs run?
- (Beginner) What does the kernel do before loading an eBPF program, and why?
- (Intermediate) How do eBPF programs share data with userspace?
- (Interview) Why does eBPF give "the safety of userspace with the performance of the kernel," and why does that matter for cloud-native tooling? (Hint: verified sandboxed in-kernel execution vs. modules/userspace; used by Cilium/Falco.)
Answers
- Run small sandboxed programs in response to kernel events (packets, syscalls, function/tracepoints) — the programs run inside the Linux kernel, without modifying kernel source or loading kernel modules.
- It runs the program through a verifier that proves it's safe (bounded/terminating, no invalid memory access) before loading, so a buggy or malicious eBPF program can't crash or hang the kernel; verified programs are then JIT-compiled to native code.
- Via eBPF maps — efficient in-kernel key-value data structures that both the eBPF program and userspace can read/write, enabling data exchange and configuration.
- eBPF programs are verified and sandboxed like userspace code (they can't crash the kernel), yet they execute in the kernel at the exact hook where events occur, JIT-compiled to native speed — avoiding the context switches, data copies, and latency of userspace processing and the risk/maintenance burden of kernel modules. For cloud-native tooling this means high-performance, dynamic, safe programmability of networking, security, and observability right in the data path — which is why projects like Cilium (CNI/mesh), Falco (runtime security), and many observability tools build on it to do things (efficient service load balancing, syscall-level detection, deep flow visibility) that were previously slow or unsafe.
eBPF-based observability
Theory
eBPF is a game-changer for observability because it can instrument the kernel to see everything — network flows, syscalls, function calls, latency, file/process activity — without modifying or instrumenting applications. This "zero-instrumentation" visibility is its key advantage: you attach eBPF programs to kernel hooks and get rich, low-overhead telemetry across all workloads uniformly, regardless of language or whether the app was built with observability in mind.
Applications in Kubernetes:
- Network observability: tools like Hubble (Cilium) and Pixie capture and expose service-to-service flows, HTTP/gRPC/DNS requests, latencies, and drops directly from the kernel.
- Auto-instrumented metrics/tracing: eBPF can generate golden-signal metrics (RPS, latency, errors) and even traces for services without code changes (e.g., Pixie, Grafana Beyla, Cilium/Tetragon), lowering the barrier to observability.
- Continuous profiling: eBPF-based profilers (Parca, Pyroscope) sample stack traces cluster-wide with low overhead to find CPU/memory hotspots.
The overarching benefit: deep, low-overhead, application-transparent observability. Where traditional observability requires instrumenting each app (SDKs, agents, code changes), eBPF observes from the kernel's universal vantage point — a major reason eBPF is central to the modern observability story.
Example
eBPF observability (no app changes):
kernel hooks (sockets, syscalls, uprobes) -> eBPF programs -> maps -> userspace
Hubble/Pixie: L3-L7 flows, HTTP/gRPC/DNS, latency, drops (auto, cluster-wide)
Beyla/Pixie: auto RED metrics + traces without instrumenting the app
Parca/Pyroscope: continuous CPU/mem profiling, low overhead
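The step from observed traffic to golden-signal metrics is plain aggregation. The sketch below models only that userspace step, not the eBPF capture itself: it assumes the tool has already produced a list of (HTTP status, latency) pairs for a time window, and the function name and sample data are hypothetical.

```python
import math

def red_metrics(events, window_seconds):
    """Compute Rate, Errors, Duration from observed requests.

    events: list of (http_status, latency_ms) seen during the window.
    Returns requests/second, error ratio (5xx), and p95 latency (nearest-rank).
    """
    if not events:
        return {"rate": 0.0, "error_ratio": 0.0, "p95_ms": None}
    latencies = sorted(latency for _, latency in events)
    errors = sum(1 for status, _ in events if status >= 500)
    rank = math.ceil(0.95 * len(latencies))  # nearest-rank percentile
    return {
        "rate": len(events) / window_seconds,
        "error_ratio": errors / len(events),
        "p95_ms": latencies[rank - 1],
    }

observed = [(200, 12), (200, 15), (500, 240), (200, 11), (200, 14),
            (200, 13), (503, 300), (200, 16), (200, 12), (200, 18)]
metrics = red_metrics(observed, window_seconds=5)
print(metrics)  # {'rate': 2.0, 'error_ratio': 0.2, 'p95_ms': 300}
```

Because the inputs come from what the kernel saw on the wire, the same aggregation works for any service regardless of its language or whether it exports metrics itself.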
Exercises
- (Beginner) What is the key advantage of eBPF-based observability regarding applications?
- (Beginner) Name an eBPF-based network observability tool.
- (Intermediate) How can eBPF produce service metrics/traces without code changes?
- (Interview) Why is kernel-level, zero-instrumentation observability powerful compared to traditional per-app instrumentation? (Hint: universal vantage point, language-agnostic, low overhead, no code changes.)
Answers
- It provides deep visibility without modifying or instrumenting the applications ("zero-instrumentation") — you observe from the kernel regardless of the app's language or design.
- Any of: Hubble (Cilium), Pixie, Cilium Tetragon (also Beyla for metrics/traces).
- eBPF attaches to kernel hooks (sockets/syscalls/uprobes) to observe requests and responses in the data path, so tools can compute golden-signal (RED) metrics — request rate, errors, latency — and reconstruct traces from the observed traffic automatically, without the application emitting them or being recompiled.
- Traditional observability requires adding SDKs/agents/code to each application (per language, per app, needing changes and maintenance), and misses anything not instrumented. eBPF observes from the kernel's universal vantage point where all network and syscall activity flows, so it captures rich telemetry across every workload uniformly, language-agnostically, with no code changes, and at low overhead (in-kernel, JIT-compiled, efficient maps). This dramatically lowers the effort to get comprehensive visibility and covers workloads you couldn't or didn't instrument — which is why eBPF is central to modern observability.
eBPF-based networking and security
Theory
Beyond observability, eBPF powers high-performance networking and runtime security in Kubernetes — often replacing older mechanisms:
- Networking: eBPF implements pod networking, service load balancing (a kube-proxy replacement with O(1) map lookups instead of long iptables chains), and network policy enforcement directly in the kernel data path — the basis of Cilium (Chapter 6/13). This yields lower latency, better scalability, and identity-aware, L7-capable policy.
- Security (runtime enforcement + detection): eBPF can observe syscalls and kernel events to detect malicious behavior (Falco, Chapter 9) and, crucially, to enforce security in-kernel — e.g., Cilium Tetragon can observe and block process executions, file access, or network activity based on policy, at the kernel level, in real time. This enables enforcement that's both fast and hard to evade (it's below the application).
The theme is moving networking and security into the kernel data path for performance and a strong vantage point: policy and enforcement happen where traffic and syscalls actually occur, with the safety guarantees of the eBPF verifier. This is why eBPF-based tools are increasingly the foundation for cloud-native networking (Cilium replacing kube-proxy/iptables) and runtime security (kernel-level detection and enforcement) rather than userspace or module-based approaches.
Example
Networking (Cilium/eBPF):
service LB + policy in kernel (O(1) maps) -> replaces kube-proxy/iptables
identity-aware, L3-L7 network policy at kernel speed
Security (Tetragon/eBPF):
observe syscalls/exec/file/net events -> detect AND enforce (block) in-kernel
e.g., block an unexpected process exec or file write in real time
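The scaling argument (map lookup versus rule chain) can be illustrated with a toy comparison. This is a deliberately simplified model: real kube-proxy iptables rules are organized into per-Service chains and real eBPF maps live in the kernel, so treat the code as an analogy for the data-structure difference only. All addresses are made up.

```python
def lookup_chain(rules, vip):
    """iptables-style: scan rules in order until one matches.

    Returns (backends, rules_checked).
    """
    for checked, (rule_vip, backends) in enumerate(rules, start=1):
        if rule_vip == vip:
            return backends, checked
    return None, len(rules)

def lookup_map(service_map, vip):
    """eBPF-map-style: one hash lookup regardless of how many Services exist."""
    return service_map.get(vip)

# 10,000 hypothetical Services, each with one backend.
rules = [((f"10.96.{i // 256}.{i % 256}", 80), [f"10.244.0.{i % 250}:8080"])
         for i in range(10_000)]
service_map = dict(rules)

target = rules[-1][0]  # worst case for the chain: the last rule
backends, checked = lookup_chain(rules, target)
print(checked)                                      # 10000
print(lookup_map(service_map, target) == backends)  # True
```

The chain's cost grows with the number of Services while the map's stays flat, which is the intuition behind the latency and scalability gains claimed for the eBPF data path.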
Exercises
- (Beginner) What older mechanism can eBPF-based service load balancing replace?
- (Beginner) Besides detecting threats, what can eBPF-based security tools like Tetragon also do?
- (Intermediate) Why is in-kernel enforcement harder to evade than application-level controls?
- (Interview) What is the common theme behind eBPF-based networking and security, and what advantages does it bring? (Hint: move into kernel data path — performance + vantage point + verifier safety.)
Answers
- kube-proxy's iptables-based service routing (eBPF provides a kube-proxy replacement using efficient in-kernel maps).
- Enforce security in-kernel — not just observe/detect, but actively block actions (e.g., prevent a process execution, file access, or network connection) based on policy, in real time.
- Because enforcement happens in the kernel, below the application: a compromised or malicious process runs in userspace and can't bypass a control that intercepts its syscalls/kernel operations at the kernel level. It can't simply avoid or disable an in-kernel eBPF hook the way it might tamper with a userspace agent or app-level check, and the eBPF program sees the actual kernel-level activity.
- The common theme is pushing networking and security into the kernel data path (where packets and syscalls actually occur), using eBPF's verified, sandboxed programs. Advantages: high performance (kernel-speed, O(1) lookups, no userspace hops), a superior vantage point (sees and can act on all traffic/syscalls, enabling identity/L7-aware policy and hard-to-evade enforcement), and safety (the verifier prevents kernel crashes) plus dynamism (no kernel rebuild/module). This makes eBPF the foundation for modern CNIs/meshes (Cilium replacing kube-proxy/iptables) and runtime security (kernel-level detection and enforcement).
bpftrace and BCC tools
Theory
To use eBPF directly for ad-hoc tracing and debugging (rather than via a platform like Cilium), two toolkits are standard:
- BCC (BPF Compiler Collection): a toolkit and library for building eBPF programs, with a large collection of ready-made performance/tracing tools (e.g., execsnoop to trace new process executions, opensnoop to trace file opens, tcpconnect, biolatency for block I/O latency, profile). Programs are typically written with a C (kernel) + Python/Lua (userspace) structure. BCC is powerful for building custom tools.
- bpftrace: a high-level tracing language (inspired by awk/DTrace) for writing one-liner and short eBPF tracing scripts quickly. It's the go-to for quick, exploratory kernel/application tracing (e.g., counting syscalls, histogramming latencies) without writing a full BCC program.
In Kubernetes, these are invaluable for deep, node-level performance debugging and understanding what a workload is actually doing at the kernel level (syscalls, I/O, latencies) — the low-level complement to the platform-level eBPF observability tools. You typically run them on a node (or via a privileged debug Pod) to diagnose tricky performance or behavior issues. Rule of thumb: bpftrace for quick one-liners/exploration, BCC for richer, reusable tools.
Example
# bpftrace one-liners (quick ad-hoc kernel tracing):
bpftrace -e 'tracepoint:syscalls:sys_enter_openat { @[comm] = count(); }' # opens by process
bpftrace -e 'kprobe:vfs_read { @bytes = hist(arg2); }' # read-size histogram
# BCC ready-made tools:
execsnoop # trace every new process exec (great for spotting unexpected execs)
biolatency # histogram of block-device I/O latency
tcpconnect # trace TCP connections as they happen
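To make the read-size histogram above easier to interpret, here is a minimal sketch of power-of-two bucketing, which is how hist() groups values. The function names (hist_bucket, histogram) and the sample read sizes are hypothetical; this is plain Python for illustration, not bpftrace or BCC code.

```python
def hist_bucket(value: int) -> tuple[int, int]:
    """Return the half-open power-of-two bucket [lo, hi) that contains value."""
    if value < 0:
        raise ValueError("this sketch only handles non-negative values")
    if value == 0:
        return (0, 1)
    lo = 1 << (value.bit_length() - 1)  # largest power of two <= value
    return (lo, lo * 2)


def histogram(values):
    """Count how many values fall into each power-of-two bucket."""
    counts = {}
    for v in values:
        bucket = hist_bucket(v)
        counts[bucket] = counts.get(bucket, 0) + 1
    return dict(sorted(counts.items()))


# Hypothetical read sizes in bytes, as a kprobe on vfs_read might observe them.
read_sizes = [1, 3, 512, 700, 4096, 4096]
for (lo, hi), n in histogram(read_sizes).items():
    print(f"[{lo}, {hi}): {n}")
```

Each output row is one bucket; a tall row at [4096, 8192) would suggest the workload mostly issues page-sized reads.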
Exercises
- (Beginner) What is bpftrace best suited for?
- (Beginner) What does BCC provide beyond a single tool?
- (Intermediate) When would you choose bpftrace over BCC, and vice versa?
- (Interview) How do bpftrace/BCC complement platform-level eBPF observability (like Hubble) in a Kubernetes context? (Hint: ad-hoc low-level node/kernel debugging vs. continuous service-level visibility.)
Answers
- Quick, ad-hoc, exploratory kernel/application tracing via short one-liners or small scripts in a high-level language (e.g., counting syscalls, histogramming latencies).
- A toolkit/library for building eBPF programs plus a large collection of ready-made performance and tracing tools (execsnoop, opensnoop, biolatency, tcpconnect, profile, etc.) for deeper, reusable instrumentation.
- Choose bpftrace for fast, exploratory investigations and one-liners where you want an answer quickly without writing a full program. Choose BCC when you need richer, more complex, or reusable tools (custom logic, packaged utilities), or want to build a maintained tool rather than an ad-hoc script.
- Platform tools like Hubble provide continuous, service-level observability (flows, HTTP/gRPC, policy verdicts) across the cluster. bpftrace/BCC are for ad-hoc, low-level, node/kernel-depth debugging — drilling into syscalls, I/O latency, process execs, and kernel behavior on a specific node when you need to diagnose a tricky performance or behavior issue that service-level telemetry can't explain. They complement each other: use platform observability to spot where/what is wrong at the service level, then bpftrace/BCC to investigate the underlying kernel-level why on the affected node.
17.2 GPU and Specialized Hardware
Kubernetes can schedule and manage specialized hardware for compute-intensive workloads. This subchapter covers GPUs and accelerators.
GPU operator and device plugins
Theory
Out of the box, Kubernetes schedules only a handful of built-in resources (CPU, memory, ephemeral storage, huge pages). To use specialized hardware (GPUs, FPGAs, high-performance NICs), it relies on the Device Plugin framework: a device plugin (typically a DaemonSet from the hardware vendor) runs on each node, discovers the devices, advertises them to the kubelet as extended resources (e.g., nvidia.com/gpu), and handles allocating a device to a container. Pods then request these resources like any other, and the scheduler places them on nodes with available devices.
Setting up GPUs manually involves many moving parts (drivers, container runtime hooks, device plugin, monitoring). The NVIDIA GPU Operator automates all of it: it deploys and manages the NVIDIA driver, container toolkit (runtime hooks so containers can access GPUs), the device plugin, node labeling (via node feature discovery), and DCGM monitoring — as a single operator, so GPU nodes become ready-to-use without manual per-node driver/toolkit installation. This makes GPUs first-class schedulable resources. The pattern generalizes: device plugins (often bundled in vendor operators) are how any specialized hardware is exposed to and scheduled by Kubernetes.
Example
# A Pod requests a GPU as an extended resource (advertised by the device plugin):
spec:
  containers:
  - name: trainer
    image: pytorch/pytorch:latest
    resources:
      limits:
        nvidia.com/gpu: 1   # request 1 GPU; scheduler finds a node with one free
# Check what the device plugin advertises on a node:
kubectl describe node gpu-node-1 | grep nvidia.com/gpu
#   nvidia.com/gpu: 4   # device plugin advertises 4 GPUs on this node
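The placement logic described above can be sketched as a small simulation: nodes advertise a count per extended resource, and a Pod fits only where enough units are still free. This is a simplified illustration, not the real scheduler; the Node class, pick_node, bind, and the node names are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    allocatable: dict                      # e.g. {"nvidia.com/gpu": 4}, as advertised by a device plugin
    allocated: dict = field(default_factory=dict)

    def free(self, resource: str) -> int:
        """Units of the resource not yet handed out to Pods."""
        return self.allocatable.get(resource, 0) - self.allocated.get(resource, 0)


def pick_node(nodes, requests):
    """Return the first node with enough free units of every requested resource, else None."""
    for node in nodes:
        if all(node.free(res) >= qty for res, qty in requests.items()):
            return node
    return None


def bind(node, requests):
    """Record the allocation so later Pods see fewer free units."""
    for res, qty in requests.items():
        node.allocated[res] = node.allocated.get(res, 0) + qty


nodes = [Node("cpu-node-1", {}), Node("gpu-node-1", {"nvidia.com/gpu": 4})]
request = {"nvidia.com/gpu": 1}
chosen = pick_node(nodes, request)
bind(chosen, request)
print(chosen.name, chosen.free("nvidia.com/gpu"))
```

After one Pod is bound, a second Pod asking for all four GPUs no longer fits on gpu-node-1 and would stay Pending in a real cluster.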
Exercises
- (Beginner) What framework does Kubernetes use to support GPUs and other special hardware?
- (Beginner) How does a Pod request a GPU?
- (Intermediate) What does the NVIDIA GPU Operator automate?
- (Interview) How does the device plugin framework make specialized hardware a first-class schedulable resource, and how does the pattern generalize beyond GPUs? (Hint: discover/advertise as extended resources; scheduler places by availability; FPGAs/NICs similarly.)
Answers
- The Device Plugin framework (device plugins run per node to expose hardware to the kubelet).
- By requesting the vendor's extended resource in its container resources.limits, e.g., nvidia.com/gpu: 1.
- The full GPU enablement stack on nodes: installing/managing the NVIDIA driver, the container toolkit/runtime hooks, the device plugin, node feature discovery/labeling, and monitoring (DCGM), all as one operator, so GPU nodes are ready to schedule GPU workloads without manual per-node setup.
- A device plugin (per-node DaemonSet) discovers the hardware and advertises it to the kubelet as an extended resource (e.g., nvidia.com/gpu) with a count; the scheduler then treats it like CPU/memory, placing Pods that request that resource onto nodes with available units, and the plugin allocates a device to the container. This makes special hardware schedulable using the same request/limit model as normal resources. The pattern generalizes: the same framework exposes FPGAs, high-performance NICs/RDMA, TPUs, and other accelerators as their own extended resources (often via vendor operators bundling the driver + plugin), so Kubernetes can schedule any specialized hardware uniformly.
NVIDIA GPU scheduling
Theory
Once GPUs are advertised as resources, scheduling GPU workloads has particular characteristics and challenges:
- Whole-GPU allocation (default): a GPU is allocated exclusively to one container. Unlike CPU, GPUs aren't divisible by default, so requesting nvidia.com/gpu: 1 gives a container a whole GPU. This can be wasteful if a workload doesn't fully use the GPU.
- Placement: GPU nodes are expensive and typically tainted (Chapter 8) so only GPU workloads (with matching tolerations + node affinity/labels) land there, keeping the costly hardware for jobs that need it.
- Topology awareness: for multi-GPU jobs, performance depends on GPU interconnect (NVLink) and NUMA locality; schedulers/operators can be topology-aware to co-locate GPUs that communicate.
- Sharing options (to improve utilization): time-slicing (multiple Pods share a GPU by time-multiplexing — no memory isolation), MPS (Multi-Process Service), and MIG partitioning (next topic) let multiple workloads share a GPU when whole-GPU exclusivity wastes capacity.
The core tension is utilization vs. isolation: exclusive whole-GPU allocation is simple and isolated but often underutilizes expensive hardware, driving the sharing/partitioning techniques. Effective GPU scheduling combines tainting/affinity (right placement), topology awareness (performance), and sharing strategies (utilization/cost).
Example
# GPU workload: tolerate the GPU taint, target GPU nodes, request GPUs
spec:
  tolerations: [ { key: nvidia.com/gpu, operator: Exists, effect: NoSchedule } ]
  nodeSelector: { accelerator: nvidia }
  containers:
  - name: train
    image: ml:latest
    resources: { limits: { nvidia.com/gpu: 2 } }   # 2 whole GPUs (exclusive by default)
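As a rough illustration of why the manifest above lands only on GPU nodes (and why ordinary Pods stay off them), the following sketch checks the three conditions the text names: tolerated taints, matching node labels, and enough free GPUs. It is a simplification of the scheduler's filtering; the dict layout, the taint value "present", and the names tolerates and feasible are assumptions for the example.

```python
HARD_EFFECTS = {"NoSchedule", "NoExecute"}  # PreferNoSchedule is only a soft preference


def tolerates(toleration: dict, taint: dict) -> bool:
    """Return True if a single toleration matches a single taint."""
    effect = toleration.get("effect")
    if effect and effect != taint["effect"]:
        return False
    if toleration.get("operator", "Equal") == "Exists":
        # With Exists, an omitted key tolerates every taint.
        return "key" not in toleration or toleration["key"] == taint["key"]
    return toleration.get("key") == taint["key"] and toleration.get("value") == taint.get("value")


def feasible(node: dict, pod: dict) -> bool:
    """Check taints, nodeSelector labels, and free GPU count for one node."""
    hard_taints = [t for t in node["taints"] if t["effect"] in HARD_EFFECTS]
    taints_ok = all(any(tolerates(tol, taint) for tol in pod["tolerations"]) for taint in hard_taints)
    labels_ok = all(node["labels"].get(k) == v for k, v in pod["nodeSelector"].items())
    gpus_ok = node["free_gpus"] >= pod["gpus"]
    return taints_ok and labels_ok and gpus_ok


gpu_node = {
    "labels": {"accelerator": "nvidia"},
    "taints": [{"key": "nvidia.com/gpu", "value": "present", "effect": "NoSchedule"}],
    "free_gpus": 4,
}
cpu_node = {"labels": {}, "taints": [], "free_gpus": 0}

train_pod = {
    "tolerations": [{"key": "nvidia.com/gpu", "operator": "Exists", "effect": "NoSchedule"}],
    "nodeSelector": {"accelerator": "nvidia"},
    "gpus": 2,
}
web_pod = {"tolerations": [], "nodeSelector": {}, "gpus": 0}

print(feasible(gpu_node, train_pod), feasible(gpu_node, web_pod))
```

The taint keeps web_pod off the GPU node, while the nodeSelector keeps train_pod off the CPU node; both mechanisms are needed for the placement the text describes.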
Exercises
- (Beginner) By default, is a GPU shared among containers or allocated exclusively?
- (Beginner) Why are GPU nodes typically tainted?
- (Intermediate) Name two techniques for sharing a single GPU across workloads.
- (Interview) What is the core tension in GPU scheduling, and how do topology awareness and sharing address the goals of performance and utilization? (Hint: exclusive allocation wastes costly hardware; time-slicing/MIG for utilization, NVLink/NUMA locality for performance.)
Answers
- Exclusively — by default a whole GPU is allocated to a single container (GPUs aren't divided like CPU by default).
- Because GPU nodes are expensive; tainting them keeps non-GPU workloads off so the costly GPUs are reserved for workloads that actually need (tolerate + request) them.
- Any two: time-slicing (time-multiplexing a GPU among Pods, no memory isolation), NVIDIA MPS (Multi-Process Service), and MIG partitioning (hardware partitions of a GPU).
- The core tension is utilization vs. isolation/cost: exclusive whole-GPU allocation gives clean isolation but frequently underuses very expensive hardware. Sharing techniques (time-slicing, MPS, MIG) raise utilization by letting multiple workloads use one GPU (trading off isolation strength — MIG gives hardware isolation, time-slicing gives none). Topology awareness addresses performance: multi-GPU jobs run faster when the scheduler places them on GPUs with fast interconnect (NVLink) and matching NUMA/PCIe locality, minimizing communication overhead. Together, tainting/affinity ensure the right workloads land on GPU nodes, topology-aware placement maximizes performance for multi-GPU jobs, and sharing/partitioning maximizes utilization (and thus cost-efficiency) of the scarce hardware.
FPGA and custom accelerators
Theory
GPUs are the most common accelerator, but Kubernetes supports other specialized hardware through the same device plugin framework — the abstraction generalizes to any device:
- FPGAs (Field-Programmable Gate Arrays): reconfigurable chips programmed for specific tasks (e.g., custom signal processing, inference, financial computation, compression). Vendors (Intel, Xilinx/AMD) provide device plugins so FPGAs are advertised and schedulable, sometimes with support for reprogramming the FPGA "bitstream" per workload.
- TPUs (Tensor Processing Units) and other AI accelerators (AWS Inferentia/Trainium, etc.): exposed as extended resources via their plugins/operators for ML workloads.
- SmartNICs / DPUs, RDMA/InfiniBand NICs: high-performance networking hardware also surfaced via device plugins for HPC/low-latency workloads.
The unifying idea: Kubernetes' device plugin model makes it a general control plane for heterogeneous, accelerated compute — any vendor can expose its hardware as a schedulable extended resource, and workloads request it uniformly (vendor.com/device: N). This is why Kubernetes has become the platform of choice for AI/ML and HPC: it schedules not just CPUs but the whole zoo of accelerators through one consistent mechanism, with vendor operators handling the drivers/toolkits/plugins.
Example
# Requesting an FPGA (advertised by a vendor device plugin) — same model as GPUs:
spec:
  containers:
  - name: accel
    image: fpga-workload:latest
    resources:
      limits:
        fpga.intel.com/arria10: 1   # vendor-specific extended resource
        # or xilinx.com/fpga-...:1, aws.amazon.com/neuron: 1 (Inferentia), etc.
Exercises
- (Beginner) Does Kubernetes support hardware other than GPUs? Via what mechanism?
- (Beginner) What is an FPGA, briefly?
- (Intermediate) How does a workload request a non-GPU accelerator like an FPGA or Inferentia chip?
- (Interview) Why has the device plugin model made Kubernetes a general control plane for heterogeneous/accelerated compute? (Hint: any vendor exposes hardware as a schedulable extended resource, requested uniformly.)
Answers
- Yes — via the same device plugin framework used for GPUs, any specialized hardware can be exposed and scheduled.
- A Field-Programmable Gate Array — a reconfigurable chip that can be programmed (its logic/bitstream) to accelerate specific tasks (e.g., custom inference, signal processing, compression), offering hardware-level acceleration tailored to a workload.
- The same way as a GPU: request the vendor's extended resource in the container's resources.limits (e.g., fpga.intel.com/arria10: 1, aws.amazon.com/neuron: 1), and the scheduler places the Pod on a node where that device is available (the vendor's device plugin advertises and allocates it).
- Because the device plugin framework lets any vendor expose any hardware as a schedulable extended resource (advertised per node, allocated by the kubelet), and workloads request all of them with the same uniform model (vendor.com/device: N) as CPU/memory. So GPUs, FPGAs, TPUs, Inferentia/Trainium, SmartNICs/DPUs, and RDMA NICs are all scheduled through one consistent mechanism, with vendor operators handling drivers/toolkits. This uniform abstraction over diverse accelerators is what makes Kubernetes a general-purpose control plane for heterogeneous, accelerated compute, and it is a major reason Kubernetes dominates AI/ML and HPC platforms.
Resource slicing and MIG partitioning
Theory
Because whole-GPU allocation often wastes expensive hardware (many workloads — small inference services, notebooks, dev — need only a fraction of a GPU), Kubernetes and NVIDIA support GPU sharing/partitioning to improve utilization:
- Time-slicing: the GPU is time-multiplexed among multiple Pods — they take turns on the full GPU. Simple and increases utilization, but there's no memory or fault isolation (a misbehaving Pod can affect others; they share the whole GPU's memory), and no guaranteed QoS.
- MPS (Multi-Process Service): allows concurrent kernels from multiple processes to share a GPU more efficiently than time-slicing, with limited isolation.
- MIG (Multi-Instance GPU): a hardware feature (on A100/H100-class GPUs) that partitions one physical GPU into multiple fully isolated instances, each with its own dedicated memory, compute units, and cache. MIG instances are exposed as separate schedulable resources (e.g., nvidia.com/mig-1g.5gb), giving strong, hardware-level isolation and predictable performance per slice.
The trade-off: time-slicing maximizes flexibility/utilization with no isolation (fine for trusted, bursty, or dev workloads); MIG provides true isolation and QoS at the cost of fixed partition sizes and requiring capable hardware (good for multi-tenant or production inference where isolation matters). Both let you pack multiple workloads onto costly GPUs instead of dedicating a whole GPU to an underutilizing job — directly addressing GPU cost-efficiency.
Example
Whole GPU (default): [============ 1 GPU ============] -> 1 Pod (may underutilize)
Time-slicing: [ GPU ] shared by turns among Pod A, B, C (no isolation)
MIG (hardware partition of one A100/H100-class GPU; profile names shown are those of an A100 40GB):
[ 1g.5gb ][ 1g.5gb ][ 2g.10gb ][ 3g.20gb ] -> each an isolated schedulable slice
requested as e.g. nvidia.com/mig-1g.5gb: 1
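The MIG layout in the diagram can be sanity-checked with simple arithmetic: a profile name such as 2g.10gb encodes compute slices and memory, and a layout must not exceed the GPU's totals. The sketch below assumes an A100 40GB with 7 compute slices and 40 GB of memory; the names parse_profile and fits are hypothetical, and real hardware also enforces placement rules that this totals-only check ignores.

```python
import re

# Assumed capacity of one A100 40GB: 7 compute slices, 40 GB of memory.
A100_40GB = {"compute_slices": 7, "memory_gb": 40}


def parse_profile(profile: str) -> tuple[int, int]:
    """Split a MIG profile name like '2g.10gb' into (compute slices, memory in GB)."""
    match = re.fullmatch(r"(\d+)g\.(\d+)gb", profile)
    if match is None:
        raise ValueError(f"not a MIG profile name: {profile!r}")
    return int(match.group(1)), int(match.group(2))


def fits(layout, gpu=A100_40GB) -> bool:
    """Check that the layout's total compute and memory stay within the GPU's capacity."""
    compute = sum(parse_profile(p)[0] for p in layout)
    memory = sum(parse_profile(p)[1] for p in layout)
    return compute <= gpu["compute_slices"] and memory <= gpu["memory_gb"]


diagram_layout = ["1g.5gb", "1g.5gb", "2g.10gb", "3g.20gb"]
print(fits(diagram_layout))       # 7 slices and 40 GB in total
print(fits(["3g.20gb"] * 3))      # 9 slices would be needed
```

The layout from the diagram uses exactly 7 compute slices and 40 GB, which is why no further slice can be carved out of that GPU.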
Exercises
- (Beginner) Why is whole-GPU allocation often wasteful?
- (Beginner) What is the key isolation difference between time-slicing and MIG?
- (Intermediate) How are MIG instances exposed to Kubernetes scheduling?
- (Interview) When would you choose time-slicing versus MIG for GPU sharing, and what's the trade-off? (Hint: flexibility/utilization no isolation vs. hardware isolation/QoS with fixed partitions + capable hardware.)
Answers
- Because many workloads (small inference, notebooks, dev/test) use only a fraction of a GPU's compute/memory, so dedicating a whole expensive GPU to them leaves most of it idle.
- Time-slicing provides no memory/fault isolation — Pods time-share the full GPU and its memory, so they can interfere with each other. MIG partitions the GPU into hardware-isolated instances, each with dedicated memory and compute, so slices are isolated and have predictable performance.
- Each MIG partition is advertised as its own schedulable extended resource (e.g., nvidia.com/mig-1g.5gb), so Pods request a specific MIG profile and the scheduler places them onto a node with that MIG instance available, just like requesting a whole GPU but at partition granularity.
- Choose time-slicing when you want maximum flexibility and utilization for trusted, bursty, or non-critical workloads (dev, notebooks) and can tolerate no isolation or QoS guarantees; it's simple and works on any GPU. Choose MIG when you need strong isolation and predictable performance/QoS per workload (e.g., multi-tenant environments or production inference where one workload must not affect another), accepting that MIG requires capable hardware (A100/H100-class) and uses fixed partition sizes (less flexible granularity). The trade-off is flexibility/utilization-without-isolation (time-slicing) versus true hardware isolation/QoS with rigid partitions and hardware requirements (MIG).
17.3 Serverless on Kubernetes
Serverless brings scale-to-zero and event-driven execution to Kubernetes. This subchapter covers the main frameworks.
Knative Serving
Theory
Serverless on Kubernetes means running code without managing servers/scaling explicitly — the platform scales your app up on demand and, crucially, down to zero when idle (so idle workloads cost nothing). Knative Serving is the leading framework that brings this request-driven autoscaling and scale-to-zero to Kubernetes.
Knative Serving abstracts away much of the Deployment/Service/Ingress/HPA boilerplate: you deploy a Knative Service (a single CRD) and it manages Revisions (immutable snapshots of your code+config for each deploy), Routes (traffic routing, including splitting between revisions for canary), and Configuration. Its defining features:
- Scale-to-zero: when there are no requests, Knative removes all Pods; when a request arrives, it rapidly cold-starts a Pod (the request is buffered by the activator until ready). A plain HPA can't do this by default: its minReplicas must be at least 1 unless the alpha HPAScaleToZero feature gate is enabled.
- Request-based autoscaling (KPA — Knative Pod Autoscaler): scales on concurrency/RPS, well-suited to spiky, request-driven workloads.
- Traffic splitting across revisions for easy canary/blue-green and instant rollback.
So Knative turns Kubernetes into a serverless platform for HTTP/event-driven services — you get FaaS/serverless ergonomics (deploy code, auto-scale from zero, pay for what you use) on top of standard Kubernetes.
Example
apiVersion: serving.knative.dev/v1
kind: Service
metadata: { name: hello }
spec:
  template:
    spec:
      containers:
      - image: gcr.io/knative-samples/helloworld-go
        env: [ { name: TARGET, value: "world" } ]
  traffic:   # split traffic across revisions (canary)
  - { revisionName: hello-00002, percent: 10 }
  - { revisionName: hello-00001, percent: 90 }
# Idle -> scales to 0 Pods; a request -> cold-starts a Pod (via the activator).
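The request-based scaling idea can be sketched as a single formula: divide the observed in-flight requests by the per-Pod concurrency target and round up, which naturally yields zero Pods when nothing is in flight. This is a simplified model, not the actual KPA algorithm (which averages metrics over time windows and applies a target utilization); desired_pods and its parameters are hypothetical names.

```python
import math


def desired_pods(observed_concurrency: float, target_per_pod: float,
                 min_scale: int = 0, max_scale: int = 0) -> int:
    """Simplified request-based scaling: Pods needed so each handles about target_per_pod requests.

    max_scale == 0 means "no upper bound" in this sketch.
    """
    if target_per_pod <= 0:
        raise ValueError("target_per_pod must be positive")
    wanted = math.ceil(observed_concurrency / target_per_pod)
    wanted = max(wanted, min_scale)
    if max_scale > 0:
        wanted = min(wanted, max_scale)
    return wanted


print(desired_pods(0, 10))                 # idle: scale to zero
print(desired_pods(25, 10))                # 25 in-flight requests, 10 per Pod
print(desired_pods(0, 10, min_scale=1))    # keep one warm Pod to avoid cold starts
```

Setting min_scale to 1 models the trade-off discussed in the exercises: it removes cold-start latency at the cost of giving up scale-to-zero savings.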
Exercises
- (Beginner) What key capability does Knative Serving add that a plain HPA cannot?
- (Beginner) What is a Knative "Revision"?
- (Intermediate) How does Knative serve a request that arrives when the app is scaled to zero?
- (Interview) Why is scale-to-zero + request-based autoscaling well-suited to spiky/event-driven workloads, and what's the main trade-off? (Hint: cost when idle, elastic to bursts; cold-start latency.)
Answers
- Scale-to-zero — removing all Pods when idle (and scaling back up on demand), which the HPA can't do (its minimum is 1).
- An immutable snapshot of a particular deployment (the code/image + configuration) at a point in time; each change creates a new Revision, and traffic can be routed/split across Revisions (enabling canary and instant rollback).
- The request is received by the Knative activator, which buffers it while Knative rapidly cold-starts a Pod (scales from 0 to 1); once the Pod is ready, the buffered request is forwarded — so the request is served despite there being no running Pods when it arrived.
- Spiky/event-driven workloads are idle much of the time and then burst. Scale-to-zero means you pay nothing (no running Pods) during idle periods, and request-based autoscaling elastically adds Pods to handle bursts based on actual concurrency/RPS — matching resources to demand precisely. The main trade-off is cold-start latency: the first request after scaling from zero must wait for a Pod to start (and the app to initialize), adding latency to that request — a concern for latency-sensitive paths (mitigated by minimum-scale settings, keeping a warm Pod, at the cost of losing pure scale-to-zero savings).
Knative Eventing
Theory
Serverless isn't only request/response — much of it is event-driven: run code in reaction to events (a message on a queue, a file uploaded, a cron tick, a cloud event). Knative Eventing provides the infrastructure to build such systems on Kubernetes, based on loosely-coupled, standards-based event delivery (it uses the CloudEvents specification for a common event format).
Its building blocks:
- Sources: produce events from something (a Kafka topic, a database, a timer/PingSource, cloud services) and send them into the mesh.
- Brokers and Triggers: a Broker is an event hub/bus that receives events; Triggers subscribe to it with filters (by event type/attributes) and route matching events to a sink (e.g., a Knative Service). This decouples producers from consumers — producers emit to the broker without knowing who consumes.
- Sinks: anything that receives events (typically a Knative Service that then scales from zero to handle them).
The result is a declarative, decoupled, event-driven architecture: producers, brokers/triggers, and consumers are all Kubernetes resources, events are standardized (CloudEvents), and consumers can scale to zero and spin up on events. Knative Eventing complements Knative Serving (which runs the consuming services), turning Kubernetes into a platform for event-driven serverless applications.
Example
# A Trigger routes filtered events from a Broker to a sink (a Knative Service):
apiVersion: eventing.knative.dev/v1
kind: Trigger
metadata: { name: on-order }
spec:
  broker: default
  filter:
    attributes: { type: com.example.order.created }   # only these events
  subscriber:
    ref: { apiVersion: serving.knative.dev/v1, kind: Service, name: order-handler }
# Producers send CloudEvents to the Broker; order-handler scales from 0 to process them.
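The Broker/Trigger routing above can be modelled in a few lines: a Trigger's attribute filter is treated as an exact match on every listed CloudEvents attribute, and the Broker fans an event out to every Trigger that matches. This is an illustrative model only; the dict shapes, the audit-log subscriber, and the names matches and route are assumptions, and events are plain dicts rather than real CloudEvents objects.

```python
def matches(filter_attributes: dict, event: dict) -> bool:
    """True if every attribute in the filter equals the same attribute on the event."""
    return all(event.get(key) == value for key, value in filter_attributes.items())


def route(triggers, event):
    """Return the subscribers of all triggers whose filter matches the event."""
    return [t["subscriber"] for t in triggers if matches(t["filter"], event)]


triggers = [
    {"name": "on-order", "filter": {"type": "com.example.order.created"}, "subscriber": "order-handler"},
    {"name": "audit-all", "filter": {}, "subscriber": "audit-log"},  # empty filter: receives everything
]

order_event = {"type": "com.example.order.created", "source": "/shop", "id": "1"}
other_event = {"type": "com.example.user.signup", "source": "/auth", "id": "2"}

print(route(triggers, order_event))
print(route(triggers, other_event))
```

The producer never names a consumer; adding or removing a Trigger changes who receives an event without touching the producer, which is the decoupling the text describes.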
Exercises
- (Beginner) What does Knative Eventing enable, in contrast to Knative Serving?
- (Beginner) What event format standard does Knative Eventing use?
- (Intermediate) What roles do Brokers and Triggers play, and how do they decouple producers from consumers?
- (Interview) How do Knative Eventing and Serving combine to form an event-driven serverless architecture? (Hint: sources→broker/triggers→sink service that scales from zero.)
Answers
- Building event-driven systems — running code in reaction to events (from queues, timers, cloud services, etc.) — whereas Knative Serving handles request/response HTTP services and their autoscaling.
- CloudEvents (a common, standardized event format specification).
- A Broker is a central event hub that receives events; Triggers subscribe to the broker with filters (by event type/attributes) and forward matching events to a subscriber/sink. This decouples producers from consumers: producers just emit events to the broker without knowing who (if anyone) consumes them, and consumers declare via triggers which events they want — so either side can change independently and multiple consumers can subscribe to the same events.
- Sources produce CloudEvents (from Kafka, timers, cloud services, etc.) into a Broker; Triggers filter and route those events to sinks, which are typically Knative Serving Services. Those consuming services can scale from zero, spinning up only when events arrive and back down when idle. So Eventing provides the decoupled, standardized event pipeline (sources → broker/triggers) and Serving provides the elastic, scale-to-zero compute that processes the events — together forming a fully serverless, event-driven architecture on Kubernetes where you pay only for actual event processing.
OpenFaaS on Kubernetes
Theory
OpenFaaS ("Functions as a Service") is a popular framework for running functions and microservices on Kubernetes with a strong focus on developer simplicity. Where Knative is a lower-level, feature-rich serverless platform, OpenFaaS emphasizes an easy, opinionated developer experience for packaging code as functions and deploying them.
Its model and features:
- Functions as containers: you write a function (in many languages via templates), and OpenFaaS packages it into a container behind a standard watchdog process that handles the HTTP/invocation interface — so any code/binary can become a function.
- Simple CLI/workflow: faas-cli scaffolds (new), builds, pushes, and deploys functions; a UI and API gateway manage and invoke them.
- Autoscaling (including scale-to-zero via add-ons) driven by request load, and integration with the Kubernetes ecosystem (it runs on top of Kubernetes, using its scheduling/scaling).
- Async invocations via a queue (NATS) for event-driven/long-running work.
OpenFaaS's niche is making serverless functions approachable: a gentle, batteries-included workflow (write function → faas-cli up → invoke via gateway) that abstracts Kubernetes details. It's a good fit for teams wanting FaaS ergonomics and quick function deployment without the deeper complexity/flexibility of Knative.
Example
faas-cli new hello --lang python3 # scaffold a function from a template
# edit handler.py ...
faas-cli up -f hello.yml # build + push + deploy to the cluster
curl http://gateway/function/hello -d "hi" # invoke via the API gateway
OpenFaaS model: your code -> template + watchdog -> container -> deployed function
gateway routes/invokes; autoscaling (incl. scale-to-zero add-on)
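For the "edit handler.py" step, a function body might look like the sketch below. It assumes the classic python3 template convention, in which the watchdog passes the request body as a string to handle(req) and sends the return value back as the response; other templates use different signatures, so treat this as an illustration rather than the definitive interface.

```python
def handle(req):
    """Build a greeting from the request body.

    Assumption: req is the raw request body as a string, as in the classic
    python3 template; the returned string becomes the HTTP response body.
    """
    name = req.strip() or "world"
    return f"Hello, {name}!"


# Local check without a cluster: call the handler directly.
print(handle("hi"))
```

Because the handler is an ordinary function, it can be unit-tested locally before faas-cli up builds and deploys it.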
Exercises
- (Beginner) What is OpenFaaS's main focus/niche?
- (Beginner) What component wraps your code so any language/binary can be a function?
- (Intermediate) Outline the typical OpenFaaS developer workflow.
- (Interview) How does OpenFaaS differ in emphasis from Knative, and when might a team prefer it? (Hint: developer-simplicity/opinionated FaaS ergonomics vs. Knative's lower-level richness/flexibility.)
Answers
- Making serverless functions/microservices simple to build and deploy on Kubernetes — a developer-experience-focused FaaS framework.
- The watchdog process (a small server that fronts your function and handles the HTTP/invocation interface), so any code or binary packaged with it becomes an invocable function.
- Scaffold a function from a language template (faas-cli new), implement the handler, then build/push/deploy in one step (faas-cli up), and invoke it through the API gateway (e.g., curl .../function/<name>), managing/monitoring via the CLI/UI.
- OpenFaaS emphasizes an opinionated, batteries-included, easy developer workflow (templates, CLI, gateway, UI) that abstracts Kubernetes details for quick function deployment. Knative is a lower-level, more feature-rich and flexible serverless platform (fine-grained revisions/traffic splitting, request-based autoscaling, eventing) with more capability but more complexity. A team might prefer OpenFaaS when they want straightforward FaaS ergonomics and fast, simple function deployment without needing Knative's depth/flexibility or wanting to manage its complexity.
KEDA for scale-to-zero
Theory
KEDA (Kubernetes Event-Driven Autoscaling) was introduced in Chapter 8 as the event-driven autoscaler; in the serverless context, its standout role is enabling scale-to-zero for ordinary Kubernetes Deployments driven by event sources — bringing serverless elasticity to regular workloads without adopting a full serverless framework like Knative.
Why it fits serverless:
- Scale-to-zero on events: KEDA can scale a Deployment down to 0 replicas when there's no work (e.g., an empty queue) and scale it up from 0 when events arrive (queue depth > threshold) — the core serverless behavior, applied to any Deployment.
- Rich event triggers: it scales on real workload signals — Kafka lag, RabbitMQ/SQS queue length, Prometheus queries, cron, cloud pub/sub, etc. — via dozens of scalers, which is exactly what event-driven/serverless workers need (scale by backlog, not CPU).
- Lightweight & Kubernetes-native: it works with your existing Deployments (via a ScaledObject), reusing the HPA under the hood for the 1..N range and handling activation from/to zero itself, so there is no need to rewrite apps as "functions."
So KEDA is often the pragmatic path to serverless-style, event-driven, scale-to-zero workloads on standard Kubernetes — especially queue-driven workers — complementing (or as a lighter alternative to) Knative when you don't need a full FaaS/serving platform.
Example
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata: { name: worker }
spec:
  scaleTargetRef: { name: worker }   # an ordinary Deployment
  minReplicaCount: 0                 # scale-to-zero when the queue is empty
  maxReplicaCount: 50
  triggers:
  - type: aws-sqs-queue
    metadata:
      queueURL: https://sqs.../jobs
      queueLength: "5"               # scale by backlog: target messages per replica
      awsRegion: eu-west-1           # illustrative value; the scaler needs the queue's region
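The backlog-driven behavior of the ScaledObject above can be approximated with a short calculation: zero replicas while the queue is empty, otherwise roughly one replica per queueLength messages, clamped to the configured bounds. This is a simplified sketch, not KEDA's implementation (it ignores polling intervals, cooldown periods, and HPA stabilization); desired_replicas and its parameter names are hypothetical.

```python
import math


def desired_replicas(queue_length: int, target_per_replica: int,
                     min_replicas: int = 0, max_replicas: int = 50) -> int:
    """Simplified backlog-based scaling with scale-to-zero."""
    if target_per_replica <= 0:
        raise ValueError("target_per_replica must be positive")
    if queue_length <= 0:
        return min_replicas                      # no work: fall back to the minimum (0 here)
    wanted = math.ceil(queue_length / target_per_replica)
    # Once there is work, run at least one replica, and never exceed the maximum.
    return max(min(wanted, max_replicas), max(min_replicas, 1))


for backlog in (0, 1, 12, 1000):
    print(backlog, desired_replicas(backlog, target_per_replica=5))
```

The jump from 0 to 1 replica corresponds to the activation step KEDA performs itself, while the scaling between 1 and 50 mirrors the range it delegates to the HPA.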
Exercises
- (Beginner) What serverless capability does KEDA bring to ordinary Deployments?
- (Beginner) What kinds of signals does KEDA scale on?
- (Intermediate) How does KEDA relate to the HPA when scaling in the 1..N range versus to/from zero?
- (Interview) When might KEDA be preferable to a full serverless framework like Knative for event-driven workloads? (Hint: scale-to-zero on real Deployments by backlog, no app rewrite, lighter footprint.)
Answers
- Scale-to-zero (and scale-up-from-zero) driven by events — serverless elasticity applied to standard Kubernetes Deployments.
- External event-source metrics — e.g., message-queue depth/lag (Kafka, SQS, RabbitMQ), Prometheus query results, cron schedules, cloud pub/sub — via its many scalers (i.e., real workload signals like backlog, not just CPU).
- For the 1..N range, KEDA creates and uses a standard HPA under the hood to do the scaling math on the event metric. For the 0↔1 activation (scale-to-zero and back), KEDA handles it itself (since the HPA's minimum is 1), watching the event source and activating the workload when work appears and deactivating it when idle.
- KEDA is preferable when you want serverless-style, event-driven, scale-to-zero behavior for existing Deployments (especially queue/backlog-driven workers) without rewriting apps as functions or adopting a full serving/FaaS platform. It's lightweight, Kubernetes-native, integrates with your normal Deployments via a ScaledObject, and scales on the real workload signal (queue depth) rather than CPU. Choose it over Knative when you don't need Knative's request-serving features (revisions, traffic splitting, HTTP activation) and just want event-driven autoscaling (including to zero) on standard workloads with minimal footprint and change.
17.4 AI/ML Workloads on Kubernetes
Kubernetes has become the standard platform for machine learning. This subchapter covers the ML-specific ecosystem.
Kubeflow overview and components
Theory
Machine learning involves a whole lifecycle — data prep, experimentation, training, tuning, serving, monitoring — not just running a container. Kubeflow is the leading open-source project for making Kubernetes a complete ML platform, providing an integrated suite of components covering that lifecycle so teams can do ML at scale on Kubernetes with reproducibility and portability.
Key Kubeflow components:
- Notebooks: managed Jupyter notebook servers in the cluster for interactive development.
- Kubeflow Pipelines: build, run, and track ML workflows/pipelines (DAGs of steps like preprocess → train → evaluate → deploy) with experiment tracking and artifact lineage.
- Katib: automated hyperparameter tuning and neural architecture search.
- Training Operators: distributed training for frameworks (next topic).
- KServe: model serving (later topic).
- Central Dashboard & multi-tenancy (profiles) to organize teams/projects.
The value proposition: instead of stitching together bespoke ML infrastructure, Kubeflow offers a cohesive, Kubernetes-native ML platform — reproducible pipelines, scalable distributed training, tuning, and serving — leveraging Kubernetes' scheduling (including GPUs) and portability across clouds/on-prem. It turns Kubernetes from "runs containers" into "runs the ML lifecycle."
Example
Kubeflow = an ML platform on Kubernetes (integrated components):
Notebooks -> interactive dev (Jupyter)
Pipelines -> reproducible ML workflows (preprocess->train->eval->deploy)
Katib -> hyperparameter tuning / NAS
Training Operators -> distributed training (PyTorch/TF/...)
KServe -> model serving (scalable inference)
Dashboard + profiles -> UI + multi-tenant organization
Exercises
- (Beginner) What is Kubeflow's purpose?
- (Beginner) Name three Kubeflow components and what each does.
- (Intermediate) What does Kubeflow Pipelines provide for the ML workflow?
- (Interview) Why is running the ML lifecycle on Kubernetes/Kubeflow attractive compared to bespoke ML infrastructure? (Hint: cohesive platform, reproducibility, GPU scheduling, portability, scale.)
Answers
- To make Kubernetes a complete, integrated machine-learning platform covering the full ML lifecycle (development, pipelines, tuning, distributed training, and serving).
- Any three: Notebooks (managed Jupyter for interactive dev), Pipelines (build/run/track ML workflows), Katib (hyperparameter tuning/NAS), Training Operators (distributed training for PyTorch/TF/etc.), KServe (model serving), Central Dashboard/profiles (UI and multi-tenancy).
- A way to define, run, and track ML workflows as DAGs of steps (e.g., data prep → train → evaluate → deploy), with experiment tracking, artifact/lineage recording, and reproducibility — turning ad-hoc scripts into repeatable, versioned pipelines.
- Kubeflow provides a cohesive, Kubernetes-native platform for the entire ML lifecycle instead of teams stitching together separate bespoke tools. It gives reproducibility (versioned pipelines/experiments), leverages Kubernetes scheduling including GPUs/accelerators and elasticity for scalable distributed training and serving, offers portability across clouds/on-prem (standard Kubernetes), and integrates development, tuning, training, and serving in one place with multi-tenancy. This reduces the effort and fragmentation of building ML infrastructure and lets ML teams operate at scale on the same robust platform used for other workloads.
Training operators (PyTorch, TensorFlow)
Theory
Large models are trained across many machines/GPUs in parallel (distributed training), which requires coordinating multiple worker processes, assigning roles (e.g., parameter servers, workers, masters), wiring up their network addresses, and handling failures. Doing this by hand on Kubernetes is complex. Training operators (part of the Kubeflow Training Operator, formerly per-framework operators like tf-operator/pytorch-operator) automate distributed training via CRDs for each framework — e.g., PyTorchJob, TFJob, MPIJob, XGBoostJob.
You declare a training job as a CRD specifying the replica roles and counts (e.g., 1 master + N workers for PyTorch, or workers + parameter servers for TF) and the container/image, and the operator:
- Creates the Pods with the correct roles and injects the environment each framework needs for distributed coordination (e.g., MASTER_ADDR, rank/world-size, TF_CONFIG).
- Manages the job lifecycle: startup ordering, restarts on failure, and completion.
- Integrates with GPU scheduling, gang-scheduling, and topology.
So training operators turn "run a distributed PyTorch/TensorFlow job across a GPU cluster" into a declarative Kubernetes resource — you describe the distributed job's shape and the operator handles the orchestration, making large-scale, multi-node/multi-GPU training practical and reproducible on Kubernetes.
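To make the "injects the environment" step concrete, here is a minimal Python sketch of the kind of values each replica would receive. It is not the operator's code. The helper names, the Pod naming pattern (`<job>-master-0`, `<job>-worker-N`, `<job>-ps-N`), and the ports are assumptions for illustration; only the variable names (MASTER_ADDR, MASTER_PORT, WORLD_SIZE, RANK) and the general shape of TF_CONFIG (a `cluster` map plus this replica's `task`) follow the frameworks' conventions.

```python
import json


def pytorch_env(job: str, workers: int, port: int = 23456) -> dict[str, dict[str, str]]:
    """Per-Pod environment for 1 master + N workers (master takes rank 0).

    Hypothetical helper; Pod names and port are illustrative assumptions.
    """
    pods = [f"{job}-master-0"] + [f"{job}-worker-{i}" for i in range(workers)]
    return {
        pod: {
            "MASTER_ADDR": f"{job}-master-0",  # every replica dials the master
            "MASTER_PORT": str(port),
            "WORLD_SIZE": str(len(pods)),      # total number of processes
            "RANK": str(rank),                 # unique index per replica
        }
        for rank, pod in enumerate(pods)
    }


def tf_config(job: str, workers: int, ps: int,
              task_type: str, index: int, port: int = 2222) -> str:
    """TF_CONFIG JSON for one replica of a worker + parameter-server job."""
    cluster = {
        "worker": [f"{job}-worker-{i}:{port}" for i in range(workers)],
        "ps": [f"{job}-ps-{i}:{port}" for i in range(ps)],
    }
    if task_type not in cluster or not 0 <= index < len(cluster[task_type]):
        raise ValueError("task must refer to a replica in the cluster")
    return json.dumps({"cluster": cluster, "task": {"type": task_type, "index": index}})


if __name__ == "__main__":
    for pod, env in pytorch_env("train", workers=3).items():
        print(pod, env)
    print(tf_config("train", workers=2, ps=1, task_type="worker", index=1))
```

Every replica gets the same cluster description but a different identity (rank or task index), which is what lets the processes find each other without hand-written configuration.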
Example
apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata: { name: train }
spec:
  pytorchReplicaSpecs:
    Master: { replicas: 1, template: { spec: { containers: [ { name: pytorch, image: my-train:latest,
      resources: { limits: { nvidia.com/gpu: 1 } } } ] } } }
    Worker: { replicas: 3, template: { spec: { containers: [ { name: pytorch, image: my-train:latest,
      resources: { limits: { nvidia.com/gpu: 1 } } } ] } } }
# The operator creates 1 master + 3 workers, wiring distributed-training env vars.
Exercises
- (Beginner) What problem do training operators solve?
- (Beginner) Name two CRDs provided for distributed training.
- (Intermediate) What does a training operator do beyond just creating Pods?
- (Interview) How do training operators make large-scale distributed training declarative and reproducible on Kubernetes? (Hint: CRD describes replica roles; operator wires coordination env, lifecycle, GPU/gang scheduling.)
Answers
- Automating the complex orchestration of distributed training across multiple machines/GPUs — coordinating multiple worker processes, assigning roles, wiring their networking, and handling lifecycle/failures — which is tedious and error-prone to do manually.
- Any two: PyTorchJob, TFJob, MPIJob, XGBoostJob.
- It assigns the correct replica roles (e.g., master/workers, or workers/parameter servers), injects the framework-specific coordination environment (e.g., MASTER_ADDR, rank/world-size, TF_CONFIG) so the processes can find each other, manages startup ordering and restarts on failure, tracks job completion, and integrates with GPU/gang scheduling and topology.
- You describe the distributed job's shape as a CRD (the replica roles, counts, image, and resources such as GPUs) and the operator handles all the orchestration: creating the right Pods, wiring the distributed-training coordination env, managing lifecycle/failures, and scheduling onto GPUs. Because the entire job is a declarative Kubernetes resource (versionable, repeatable) rather than manual, imperative setup, large-scale multi-node/multi-GPU training becomes reproducible and portable (run the same manifest to get the same distributed job) while leveraging Kubernetes scheduling for the accelerators.
KServe for model serving
Theory
Training produces a model; serving it (exposing it for low-latency, scalable inference) is a distinct challenge with its own needs: autoscaling with demand, GPU use, canary rollouts of model versions, standardized prediction APIs, and scale-to-zero for cost. KServe (formerly KFServing) is a widely used Kubernetes framework for model inference serving. In its serverless deployment mode it builds on Knative (for request-driven autoscaling and scale-to-zero) and is often paired with Istio; it can also run on plain Deployments without Knative, in which case the Knative-specific features such as scale-to-zero do not apply.
What KServe provides:
- A simple InferenceService CRD: you point it at a trained model (in storage) and a model framework runtime (TensorFlow, PyTorch/TorchServe, scikit-learn, XGBoost, ONNX, or custom), and it deploys a scalable serving endpoint, with no need to build serving containers yourself.
- Serverless inference: leveraging Knative for autoscaling including scale-to-zero (idle models cost nothing; spin up on request), and GPU support for accelerated inference.
- Advanced serving features: canary rollouts and traffic splitting between model versions, explainability and payload logging, transformers (pre/post-processing), and support for model ensembles/pipelines.
- Standardized inference protocol (the V2 inference protocol, also referred to as the Open Inference Protocol) for consistent client APIs across frameworks.
So KServe turns "deploy a model for production inference" into a declarative resource with production-grade serving features and serverless economics — the serving counterpart to the training operators, completing the ML lifecycle on Kubernetes.
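As a rough illustration of what the standardized protocol buys a client, the sketch below builds a V2-style inference request in Python. The helper name `build_v2_infer_request` and the tensor name `input-0` are assumptions for illustration; the path layout (`/v2/models/<name>/infer`) and the `inputs` entries with `name`, `shape`, `datatype`, and `data` follow the V2 protocol's general shape. The same request structure would apply whether the model behind the endpoint is scikit-learn, XGBoost, or something else.

```python
import json


def build_v2_infer_request(model_name: str, input_name: str,
                           rows: list[list[float]]) -> tuple[str, str]:
    """Return (URL path, JSON body) for a V2-style inference call.

    Hypothetical helper for illustration; it only builds the payload and
    does not contact any server.
    """
    if not rows or any(len(r) != len(rows[0]) for r in rows):
        raise ValueError("rows must be a non-empty rectangular batch")
    body = {
        "inputs": [{
            "name": input_name,                     # tensor name (assumed here)
            "shape": [len(rows), len(rows[0])],     # batch size x features
            "datatype": "FP32",
            "data": [float(v) for row in rows for v in row],  # row-major, flattened
        }]
    }
    return f"/v2/models/{model_name}/infer", json.dumps(body)


if __name__ == "__main__":
    path, body = build_v2_infer_request(
        "sklearn-iris", "input-0",
        [[6.8, 2.8, 4.8, 1.4], [6.0, 3.4, 4.5, 1.6]],
    )
    print("POST", path)
    print(body)
```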
Example
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata: { name: sklearn-iris }
spec:
  predictor:
    minReplicas: 0 # scale-to-zero when idle (via Knative)
    model:
      modelFormat: { name: sklearn }
      storageUri: "gs://models/iris/v1" # point at the trained model
# KServe deploys a scalable inference endpoint with a standardized predict API.
Exercises
- (Beginner) What problem does KServe address in the ML lifecycle?
- (Beginner) What CRD do you use to deploy a model with KServe?
- (Intermediate) What does KServe gain by building on Knative?
- (Interview) What production serving features does KServe provide beyond just running a model, and why do they matter? (Hint: canary/traffic-split, scale-to-zero, standardized protocol, explainability/transformers.)
Answers
- Model serving/inference — exposing trained models as scalable, production-grade, low-latency prediction endpoints (the deployment/serving stage, as opposed to training).
- An InferenceService.
- Serverless capabilities: request-driven autoscaling and scale-to-zero (idle models consume no resources and spin up on demand), provided by Knative Serving, so inference endpoints are elastic and cost-efficient.
- Beyond running the model, KServe offers: canary rollouts / traffic splitting between model versions (safely test new models with a fraction of traffic, easy rollback); scale-to-zero and autoscaling for cost-efficiency and elasticity; a standardized inference protocol/API across frameworks (consistent clients regardless of model type); transformers for pre/post-processing; explainability and payload logging for trust/monitoring; and support for ensembles/pipelines and GPUs. These matter because production ML serving needs safe model updates, cost control, consistent interfaces, observability, and the ability to scale with real demand — KServe provides them declaratively so teams don't build bespoke serving infrastructure per model.
Data pipelines with Argo Workflows
Theory
ML and data work is full of multi-step batch workflows — ETL/data processing, feature engineering, training pipelines, batch scoring — that need orchestration: run steps in a defined order (a DAG), pass artifacts between them, run steps in parallel, and handle retries/failures. Argo Workflows is the leading Kubernetes-native workflow engine for exactly this: it runs each step as a container (Pod) and orchestrates them as a DAG (or step sequence), all defined declaratively as a CRD.
Key features:
- Container-native steps: every step is a container, so any tool/language works, leveraging Kubernetes scheduling (including GPUs) and scaling.
- DAG and step templates: express dependencies and parallelism (fan-out/fan-in), loops, conditionals.
- Artifact passing: outputs of one step (files, data) passed to the next, with artifact storage (S3/GCS) integration.
- Retries, timeouts, and resource control per step, plus scheduling (CronWorkflow) for recurring pipelines.
Argo Workflows is the engine underneath many pipeline systems (Kubeflow Pipelines runs on Argo Workflows). Its niche: general-purpose, Kubernetes-native pipeline/batch orchestration — well-suited to data pipelines, CI/CD jobs, and ML workflows where you need to chain containerized steps reliably at scale. It complements the ML-specific components (training operators, KServe) by handling the surrounding data-processing and orchestration glue.
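The DAG semantics described above (a task runs once all of its dependencies have finished, and independent tasks may run in parallel) can be sketched with Python's standard-library `graphlib`. This is a simplified model, not Argo's scheduler: the function `execution_waves` is a hypothetical name, and grouping tasks into discrete "waves" is a simplification, since a real engine can start each task as soon as its own dependencies complete.

```python
from graphlib import TopologicalSorter


def execution_waves(dependencies: dict[str, list[str]]) -> list[list[str]]:
    """Group DAG tasks into waves; tasks in the same wave could run in parallel.

    `dependencies` maps each task name to the tasks it depends on.
    Hypothetical helper for illustration only.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises graphlib.CycleError if the graph has a cycle
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # every task whose dependencies are done
        waves.append(ready)
        sorter.done(*ready)                 # mark them finished, unblocking successors
    return waves


if __name__ == "__main__":
    linear = {"preprocess": [], "train": ["preprocess"], "evaluate": ["train"]}
    print(execution_waves(linear))
    fan_out_in = {"split": [], "shard-a": ["split"], "shard-b": ["split"],
                  "merge": ["shard-a", "shard-b"]}
    print(execution_waves(fan_out_in))
```

The second call shows fan-out/fan-in: both shard tasks become ready together after `split`, and `merge` waits for both.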
Example
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata: { generateName: ml-pipeline- }
spec:
  entrypoint: dag
  templates:
    - name: dag
      dag:
        tasks:
          - { name: preprocess, template: run, arguments: {...} }
          - { name: train, template: run, dependencies: [preprocess] } # DAG ordering
          - { name: evaluate, template: run, dependencies: [train] }
    - name: run
      container: { image: my-ml-step:latest, command: ["python","step.py"] }
Exercises
- (Beginner) What is Argo Workflows used for?
- (Beginner) How is each step in an Argo Workflow executed?
- (Intermediate) What features make Argo Workflows suitable for ML/data pipelines (name two)?
- (Interview) What is Argo Workflows' niche relative to ML-specific tools like training operators and KServe, and how does it relate to Kubeflow Pipelines? (Hint: general container-native DAG/batch orchestration; Kubeflow Pipelines runs on Argo.)
Answers
- Orchestrating multi-step, container-based batch workflows/pipelines on Kubernetes — e.g., data processing/ETL, ML training pipelines, batch jobs — defined as DAGs or step sequences.
- As a container (a Kubernetes Pod) — each step runs in its own container, orchestrated by the workflow engine.
- Any two: DAG/step definition with dependencies and parallelism (fan-out/fan-in, loops, conditionals); artifact passing between steps (with S3/GCS storage); per-step retries/timeouts/resource control; container-native steps leveraging Kubernetes scheduling (incl. GPUs); and scheduled/recurring runs (CronWorkflow).
- Argo Workflows is a general-purpose, Kubernetes-native workflow/DAG engine for orchestrating any containerized multi-step pipeline (data, ML, CI/CD), whereas training operators and KServe are ML-specific (distributed training and model serving respectively). Argo handles the surrounding orchestration and data-processing glue that connects such steps. Notably, Kubeflow Pipelines is built on top of Argo Workflows — it uses Argo as its execution engine — so Argo underpins ML pipeline orchestration while remaining a broadly applicable batch/workflow tool beyond ML.
17.5 Edge and IoT Kubernetes
Kubernetes is stretching beyond the data center to edge and IoT environments. This subchapter covers that frontier.
k3s for edge deployments
Theory
Running Kubernetes at the edge (retail stores, factories, telco sites, vehicles, remote locations) means constrained hardware, limited or intermittent networking, and often many small sites with little on-site expertise. A standard Kubernetes installation is often too heavy for such environments. k3s (Chapter 3) is the lightweight, CNCF-certified distribution purpose-built for exactly this: a single small binary (<100 MB), low memory footprint, bundled components, and SQLite as the default datastore, able to run on modest hardware like a Raspberry Pi or small industrial gateway.
Why k3s fits the edge:
- Small footprint & simplicity: minimal resources and a one-command install make it deployable on constrained devices and manageable at many sites.
- Still real Kubernetes: CNCF-conformant, so the same manifests/tooling/skills apply — you develop against standard Kubernetes and deploy to the edge unchanged.
- Edge-appropriate defaults: bundled essentials (a lightweight ingress, local storage, etc.) reduce the setup burden; can run as a single node or small HA cluster (with embedded etcd).
So k3s brings the Kubernetes API and ecosystem to resource-constrained, distributed edge locations without the overhead of full Kubernetes — the foundation for edge/IoT Kubernetes deployments (and often combined with fleet-management tooling from Chapter 15 to manage many edge clusters).
Example
# One-command install on a small edge device (Raspberry Pi, gateway):
curl -sfL https://get.k3s.io | sh -
# Single binary (under 100 MB), SQLite datastore by default, low memory use.
# Manage many edge k3s clusters centrally with fleet tooling (Rancher Fleet, Ch.15).
Exercises
- (Beginner) Why is standard Kubernetes often unsuitable for edge devices?
- (Beginner) What characteristics make k3s a fit for the edge?
- (Intermediate) Why does k3s being CNCF-conformant matter for edge development?
- (Interview) What challenges of edge environments does k3s address, and how are many edge clusters typically managed? (Hint: constrained hardware/network/expertise; small footprint + fleet management/GitOps.)
Answers
- Because edge devices have constrained CPU/memory and standard Kubernetes has a relatively heavy resource footprint and complex setup, which is impractical on modest hardware and hard to operate across many small remote sites.
- A single small binary (<100 MB), low memory footprint, bundled components, SQLite default datastore, and simple one-command install — able to run on modest hardware (e.g., Raspberry Pi) while remaining full Kubernetes.
- Because conformance guarantees k3s implements the standard Kubernetes APIs/behaviors, so the same manifests, tooling, and skills work — teams develop against standard Kubernetes and deploy to constrained edge clusters unchanged, without maintaining edge-specific variants.
- k3s addresses edge constraints: limited hardware (small footprint, low memory, single binary), operational simplicity (easy install, bundled essentials, less expertise needed per site), and running standard Kubernetes on modest/remote devices (with single-node or small HA options). Because edge deployments involve many clusters/sites, they're typically managed centrally with fleet-management and GitOps tooling (e.g., Rancher Fleet, Argo CD ApplicationSet — Chapter 15), which push and reconcile configuration/apps and policy across all the edge clusters consistently despite intermittent connectivity.
KubeEdge architecture
Theory
k3s runs a full (small) cluster at each edge site, but some IoT/edge scenarios need something different: extending a central cluster's control out to edge nodes/devices that may be numerous, resource-tiny, and only intermittently connected. KubeEdge (a CNCF project) is built for this — it extends Kubernetes orchestration to the edge while keeping the control plane in the cloud, and is designed to tolerate unreliable networks and to interface with IoT devices.
KubeEdge's architecture splits into:
- CloudCore (runs in the cloud/central cluster): syncs with the Kubernetes API server and manages edge nodes/metadata, communicating with the edge over a reliable messaging channel.
- EdgeCore (runs on edge nodes): includes EdgeD (a lightweight edge agent that manages Pods locally, like a kubelet) and MetaManager (local metadata store) — crucially enabling offline autonomy: edge nodes keep running their workloads and cache state even when disconnected from the cloud, then resync when connectivity returns.
- Device management: a DeviceTwin/device model and MQTT-based mapper support integrating actual IoT devices (sensors/actuators) into Kubernetes as manageable resources.
So where k3s = "a small standalone cluster at the edge," KubeEdge = "one cloud control plane orchestrating many edge nodes/devices, resilient to disconnection, with IoT device integration." It targets large-scale IoT/edge fleets with central management and offline-capable edge execution.
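The offline-autonomy behaviour can be pictured with a deliberately tiny Python model. This is not KubeEdge code: the `EdgeNode` class and its methods are hypothetical, and the "local cache" is a plain dictionary standing in for the edge-side metadata store. The point it illustrates is only that the edge keeps acting on its last synced state while disconnected and picks up changes on resync.

```python
class EdgeNode:
    """Toy model of an edge node that caches desired state locally.

    Hypothetical class for illustration; real edge agents persist the
    cache to disk and reconcile containers, which is omitted here.
    """

    def __init__(self) -> None:
        self.cached_desired: dict[str, str] = {}  # pod name -> image
        self.connected = False

    def sync(self, cloud_desired: dict[str, str]) -> None:
        """Copy the cloud's desired state into the local cache."""
        self.cached_desired = dict(cloud_desired)
        self.connected = True

    def disconnect(self) -> None:
        self.connected = False

    def running_pods(self) -> dict[str, str]:
        """Workloads are driven from the cache, connected or not."""
        return dict(self.cached_desired)


if __name__ == "__main__":
    cloud = {"sensor-reader": "reader:v1"}
    node = EdgeNode()
    node.sync(cloud)
    node.disconnect()
    cloud["sensor-reader"] = "reader:v2"  # change made while the edge is offline
    print("offline:", node.running_pods())
    node.sync(cloud)                       # connectivity returns
    print("resynced:", node.running_pods())
```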
Example
KubeEdge (cloud control plane -> many edge nodes/devices):
Cloud: [ K8s API server ] <-> [ CloudCore ] ==reliable channel==>
Edge: [ EdgeCore: EdgeD (mini-kubelet) + MetaManager (local cache) ]
|-> runs Pods locally; AUTONOMOUS when offline; resyncs on reconnect
|-> DeviceTwin + MQTT mapper -> real IoT sensors/actuators
Exercises
- (Beginner) How does KubeEdge's model differ from running k3s at each site?
- (Beginner) What are the two main components (cloud side and edge side) of KubeEdge?
- (Intermediate) What does "offline autonomy" mean for a KubeEdge edge node?
- (Interview) What edge/IoT challenges is KubeEdge specifically designed to handle that standard Kubernetes/k3s doesn't emphasize? (Hint: central control of many edge nodes, unreliable networks/offline operation, IoT device integration.)
Answers
- k3s runs a full standalone (small) Kubernetes cluster at each edge site; KubeEdge instead keeps the control plane centrally (in the cloud) and extends it out to many lightweight edge nodes/devices, orchestrating them from the center rather than each site being its own cluster.
- CloudCore (cloud side — syncs with the API server, manages edge nodes over a reliable channel) and EdgeCore (edge side — EdgeD mini-kubelet + MetaManager local cache running workloads on the edge node).
- That an edge node continues running its assigned workloads and serving using locally cached state/metadata even when disconnected from the cloud control plane, and then reconciles/resyncs with the cloud once connectivity is restored — so intermittent network loss doesn't stop the edge from operating.
- KubeEdge targets: central management of large numbers of edge nodes/devices from a cloud control plane (rather than many independent clusters); resilience to unreliable/intermittent networks via offline autonomy (edge keeps running and caches state when disconnected, resyncing later); running on very resource-constrained edge hardware; and IoT device integration (DeviceTwin/device models with MQTT mappers to manage sensors/actuators as Kubernetes resources). Standard Kubernetes (and k3s) assume relatively reliable connectivity and don't focus on cloud-orchestrated fleets of tiny, sometimes-offline edge nodes or native IoT device management — which is exactly KubeEdge's design focus.
Akri for IoT device discovery
Theory
A hard problem at the edge is leaf devices — cameras, sensors, USB peripherals, IP devices — that aren't nodes and can't run an agent, yet you want to discover them and use them from Kubernetes workloads. Akri (a CNCF project, "A Kubernetes Resource Interface" for the edge) solves exactly this: it lets you dynamically discover such devices and expose them as Kubernetes resources that Pods can request, even as devices come and go.
How Akri works:
- You define a Configuration describing what to look for using a discovery protocol/handler (ONVIF for IP cameras, udev for USB, OPC UA for industrial devices, or custom).
- Akri's discovery agents (a DaemonSet) find matching devices on/near each node and, for each discovered device, Akri creates an Instance (a custom resource representing that device) and advertises it via the device plugin framework as a schedulable resource.
- Workloads request the device like any extended resource; Akri can also automatically deploy a "broker" Pod per device to expose it (e.g., a camera's frames) to the rest of the cluster. When a device disappears, Akri cleans up; when new ones appear, they're discovered automatically.
So Akri extends the device plugin concept from node-attached hardware (GPUs) to networked/leaf IoT devices, making dynamic, heterogeneous edge devices first-class, discoverable, schedulable Kubernetes resources — filling a key gap for IoT/edge use cases.
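The "devices come and go" behaviour amounts to a reconcile step between what discovery currently sees and which Instances exist. The Python sketch below models only that comparison; it is not Akri's implementation, and the function name `reconcile_instances` and the device identifiers are made up for illustration.

```python
def reconcile_instances(discovered: set[str],
                        instances: set[str]) -> tuple[set[str], set[str]]:
    """Compare discovered devices with existing Instances.

    Returns (to_create, to_delete). Hypothetical helper for illustration.
    """
    to_create = discovered - instances   # new devices need an Instance
    to_delete = instances - discovered   # vanished devices get cleaned up
    return to_create, to_delete


if __name__ == "__main__":
    existing = {"camera-lobby", "camera-dock"}
    seen_now = {"camera-lobby", "camera-roof"}   # dock camera unplugged, roof camera added
    create, delete = reconcile_instances(seen_now, existing)
    print("create:", sorted(create))
    print("delete:", sorted(delete))
```

Running this comparison on every discovery pass is what keeps the set of schedulable device resources in step with the physical fleet.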
Example
# Akri Configuration: discover ONVIF IP cameras and expose them as resources
apiVersion: akri.sh/v0
kind: Configuration
metadata: { name: onvif-cameras }
spec:
  discoveryHandler: { name: onvif } # protocol to find devices
  brokerSpec: { ... } # optional per-device broker Pod
# Discovered cameras -> Akri Instances -> advertised via device plugin ->
# Pods request them like: resources: { limits: { akri.sh/onvif-cameras: "1" } }
Exercises
- (Beginner) What problem does Akri solve at the edge?
- (Beginner) Give two examples of discovery protocols Akri supports.
- (Intermediate) How does Akri make a discovered device usable by Pods?
- (Interview) How does Akri extend the device plugin concept, and why is that valuable for IoT/edge? (Hint: from node-attached hardware to dynamic networked/leaf devices as schedulable resources.)
Answers
- Discovering leaf/IoT devices (cameras, sensors, USB/IP devices) that can't run a Kubernetes agent themselves, and exposing them as Kubernetes resources that workloads can find and use — handling devices dynamically appearing and disappearing.
- Any two: ONVIF (IP cameras), udev (USB devices), OPC UA (industrial devices), or custom discovery handlers.
- Akri's discovery agents find matching devices and create an Instance custom resource per device, advertising each via the device plugin framework as a schedulable extended resource; Pods then request that resource (e.g., akri.sh/<config>: "1"), and Akri can deploy a broker Pod to expose the device's data to the cluster. The device becomes a normal, requestable Kubernetes resource.
- The device plugin framework was designed for node-attached hardware (like GPUs) that's statically present on a node. Akri extends this to dynamic, networked/leaf devices that aren't nodes and appear/disappear over time, discovering them via pluggable protocols and advertising each as a schedulable resource (with optional broker Pods). This is valuable for IoT/edge because such environments are full of heterogeneous, transient devices (cameras, sensors) that you want to orchestrate from Kubernetes; Akri makes them first-class, discoverable, schedulable resources, so workloads can consume edge devices with the same request model as any other resource, dynamically as the fleet changes.
Offline and low-bandwidth cluster operation
Theory
Edge and IoT clusters frequently operate with intermittent, low-bandwidth, or no connectivity to central systems (a ship at sea, a remote site, a factory with locked-down networks). Designing for this — offline/disconnected operation — is a distinct discipline that cuts across the edge stack. The core principle: edge clusters must keep functioning autonomously when disconnected and degrade gracefully, syncing when connectivity returns.
Key techniques and considerations:
- Local autonomy: workloads and the local control/agent (e.g., k3s's local datastore, KubeEdge's EdgeCore offline mode) must run and self-heal without the cloud; cached state and local decision-making are essential.
- Air-gapped images/artifacts: with no registry access, you need local image registries/mirrors and pre-loaded images, plus offline installation bundles (no pulling from the internet at deploy time).
- Efficient/eventual sync: minimize data over constrained links — batch/compress telemetry, sync deltas, tolerate lag; GitOps pull models suit this (edge pulls config when it can, reconciles locally).
- Store-and-forward: buffer data/events locally when offline and forward on reconnect; avoid designs that assume constant cloud availability.
- Resilient management: fleet tooling and GitOps must tolerate clusters being unreachable for periods and reconcile opportunistically.
So operating at the edge means engineering for autonomy, local artifacts, bandwidth efficiency, and eventual synchronization rather than assuming a well-connected data center. Tools like k3s (local datastore), KubeEdge (offline autonomy), local registries, and pull-based GitOps together make disconnected/low-bandwidth Kubernetes operation viable.
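Store-and-forward is the easiest of these techniques to show in code. The sketch below is a minimal in-memory version in Python, under stated assumptions: the `StoreAndForward` class is hypothetical, the uplink is any callable that raises `ConnectionError` when the link is down, and the buffer is bounded so that the oldest events are dropped first when it fills. A production version would typically persist the buffer to disk so that it survives restarts.

```python
from collections import deque
from typing import Any, Callable


class StoreAndForward:
    """Buffer events locally and forward them when the uplink works.

    Hypothetical class for illustration; in-memory only.
    """

    def __init__(self, send: Callable[[Any], None], max_buffered: int = 1000) -> None:
        self._send = send
        self._buffer: deque = deque(maxlen=max_buffered)  # oldest dropped when full

    @property
    def pending(self) -> int:
        return len(self._buffer)

    def publish(self, event: Any) -> None:
        """Queue an event, then try to drain the buffer."""
        self._buffer.append(event)
        self.flush()

    def flush(self) -> int:
        """Send buffered events in order; stop at the first failure."""
        sent = 0
        while self._buffer:
            try:
                self._send(self._buffer[0])
            except ConnectionError:
                break                  # still offline: keep the event for later
            self._buffer.popleft()     # remove only after a successful send
            sent += 1
        return sent
```

Calling `flush()` from a periodic timer or a connectivity-change hook lets the backlog drain once the link returns, while `publish()` stays cheap and never blocks on the network being available.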
Example
Designing for disconnected/low-bandwidth edge:
autonomy: edge runs + self-heals offline (k3s local DB / KubeEdge EdgeCore)
air-gapped: local registry/mirror + pre-loaded images + offline install bundles
efficient sync: batch/compress/delta telemetry; tolerate lag
store-forward: buffer events locally -> forward on reconnect
management: pull-based GitOps reconciles when connectivity allows
Exercises
- (Beginner) What is the core requirement for an edge cluster during a network outage?
- (Beginner) Why are local image registries/mirrors important for edge/air-gapped clusters?
- (Intermediate) Why does a pull-based GitOps model suit intermittently-connected edge clusters?
- (Interview) What design principles make Kubernetes viable in offline/low-bandwidth edge environments? (Hint: local autonomy, air-gapped artifacts, efficient/eventual sync, store-and-forward, resilient management.)
Answers
- That it keeps functioning autonomously — continuing to run workloads and self-heal using local state/decision-making — despite being disconnected, then resyncing when connectivity returns.
- Because air-gapped/edge clusters may have no (or poor) access to public registries at deploy time, so images must be available locally — a local registry/mirror and pre-loaded images let Pods start without pulling over the internet, and enable offline installs/upgrades.
- In a pull model, the edge cluster's agent pulls desired state from Git (and reconciles locally) whenever it has connectivity, rather than depending on a central system to push to it in real time. This tolerates intermittent links — the edge reconciles opportunistically when connected and keeps running from its last-known state when offline — whereas a push model assumes the central system can reach the cluster on demand.
- Principles: local autonomy (edge runs and self-heals offline via local datastore/agent, e.g., k3s/KubeEdge); air-gapped artifacts (local registries/mirrors, pre-loaded images, offline install bundles so no internet is needed at runtime); bandwidth efficiency and eventual sync (batch/compress/delta data, tolerate lag rather than assuming constant high-bandwidth links); store-and-forward (buffer telemetry/events locally and forward on reconnect); and resilient, pull-based management (GitOps/fleet tooling that reconciles opportunistically and tolerates clusters being unreachable). Together these replace the data-center assumption of constant connectivity with designs that keep the edge operational and eventually consistent under disconnection and constrained networks.