
1. Introduction
I have been wondering about the relationships between Kubernetes components and container runtimes, how they work together to create and manage containers, and what actually happens when Kubernetes creates a pod. I decided to write this post to put everything together. It covers:
- Container standards: OCI specifications, runc, and CRI
- The execution flow: How kubelet, container runtimes, and runc work together
- Filesystem layering: OverlayFS for efficient container images
- Hands-on implementation: Building a Kubernetes pod from scratch using Linux primitives
You'll understand both what happens and how it works at the kernel level. Below are the prerequisites that I assume you already know:
- Familiar with Linux virtual filesystems (eg. procfs)
- Familiar with Linux namespaces
- How Kubernetes components work, especially kubelet
2. Container Runtime Fundamentals
2.1. What is OCI?
The Open Container Initiative (OCI) is a Linux Foundation project established in 2015.
- Primary purpose: to create open industry standards for container formats and runtimes
- Before OCI, there was a risk of fragmentation in the container world, with different companies creating their own incompatible container technologies
- OCI was formed to prevent this and ensure interoperability and portability across different container tools and platforms
OCI focuses on 3 main specifications:
- OCI Image Specification (image-spec):
  - What it defines: The format for a container image. This includes how an image is structured on disk, its layers, manifest (metadata), and configuration
  - Why it matters: It ensures that an image built by one tool (eg. Docker) can be pulled, stored, and run by any other OCI-compliant container runtime. This is why a Docker image can be run by containerd or CRI-O, even after Docker's shim was removed from Kubernetes
  - Analogy: It's like the JPEG standard for images or the PDF standard for documents. Any software that understands the standard can create or read it
- OCI Runtime Specification (runtime-spec):
  - What it defines: How a container runtime should execute a filesystem bundle (an unpacked container image) and manage its lifecycle (create, start, stop, delete, etc.). It specifies the config.json file, which describes how the container process should be run (eg. entrypoint, environment variables, resource limits, security settings)
  - Why it matters: It ensures that different container runtimes can produce consistent execution environments for containers
  - Analogy: It's like a detailed instruction manual for how to turn on a specific type of machine and what controls it should have
- OCI Distribution Specification (distribution-spec):
  - What it defines: An API protocol for distributing container images. This standardizes how container registries (eg. Docker Hub, GCR, ECR, Harbor) store, pull, and push container images
  - Why it matters: It allows various container tools to interact with different registries, promoting a unified ecosystem for image distribution (a quick curl sketch of this API follows this list)
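As a quick illustration of the distribution-spec API, here is a minimal sketch that fetches an image manifest straight from Docker Hub with curl. It assumes anonymous pulls are allowed and that the public token and registry endpoints below are unchanged; library/alpine is just an example image.
# Request an anonymous pull token for the library/alpine repository
TOKEN=$(curl -s "https://auth.docker.io/token?service=registry.docker.io&scope=repository:library/alpine:pull" | grep -o '"token":"[^"]*"' | cut -d'"' -f4)
# Fetch the manifest through the distribution-spec endpoint /v2/<name>/manifests/<reference>
curl -s -H "Authorization: Bearer $TOKEN" \
  -H "Accept: application/vnd.oci.image.index.v1+json" \
  -H "Accept: application/vnd.docker.distribution.manifest.list.v2+json" \
  https://registry-1.docker.io/v2/library/alpine/manifests/latest
The response is the image index/manifest list that an OCI-compliant client would use to pick the right architecture, after which it downloads the layer blobs from /v2/<name>/blobs/<digest>.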
2.2. What is runc?
runc is a lightweight, portable, low-level container runtime that serves as the reference implementation of the OCI Runtime Specification.
When a higher-level container runtime (eg. containerd or CRI-O) decides to actually start a container, it hands off the specific task of creating and executing the container process to runc.
runc interacts directly with the Linux kernel's low-level features, specifically:
- Namespaces: Provide process isolation (PID, network, mount, IPC, UTS, user namespaces)
- Cgroups: Enforce resource limits (CPU, memory, I/O) on the container process
- pivot_root/chroot: Change the root filesystem of the process to the container's rootfs bundle
- Seccomp, AppArmor, SELinux: Apply security profiles for granular control over system calls and permissions
runc doesn't have its own daemon or long-running process that orchestrates many containers. Instead, it's a simple command-line tool that performs its job (spawning a container process) and then exits. It delegates the ongoing management of the running process to the kernel.
runc itself is largely stateless. It receives all necessary configuration (from the config.json defined by OCI runtime-spec) at runtime.
Docker initially open-sourced runc and contributed it to the OCI project. It's often used as the default or underlying runtime for many higher-level container engines.
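To make the runtime-spec and runc concrete, here is a minimal sketch of building an OCI bundle by hand and running it with runc. It assumes runc is installed on the host; the bundle path and container name are arbitrary, and the rootfs is the same Alpine minirootfs used in the hands-on sections below.
# Build an OCI bundle: a rootfs directory plus a config.json next to it
mkdir -p /root/bundle/rootfs
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C /root/bundle/rootfs
cd /root/bundle
# Generate a default config.json, the runtime-spec document describing the container process
runc spec
# Inspect it: process args, environment, namespaces, capabilities, mounts, ...
cat config.json
# Create and start the container described by the bundle (attaches your terminal; type exit to stop)
runc run demo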
2.3. What is CRI?
The Container Runtime Interface (CRI) is a Kubernetes API that standardizes how the kubelet communicates with container runtimes. It was introduced in Kubernetes v1.5 (December 2016) to solve the problem of tight coupling between Kubernetes and specific container runtimes.
Before CRI, Kubernetes had hardcoded integrations with specific container runtimes:
- The kubelet contained direct, runtime-specific code to interact with Docker's REST API
- Adding support for new runtimes (like rkt) required modifying kubelet's core code
- This created vendor lock-in and made it difficult to innovate in the container runtime space
- Each runtime integration had to be maintained within the Kubernetes codebase
CRI defines a gRPC API with two main services:
- ImageService: Manages container images
  - Pull, list, remove, and inspect images
  - Image filesystem usage statistics
- RuntimeService: Manages pods and containers
  - Create, start, stop, remove, and inspect containers
  - Create, stop, remove, and inspect pods (sandbox containers)
  - Execute commands in containers
  - Attach to containers
  - Port forwarding
Popular CRI implementations include:
- containerd: Docker's donated runtime, now a CNCF graduated project
- CRI-O: Red Hat's minimalist CRI implementation, designed specifically for Kubernetes
- cri-dockerd: Adapter that allows Docker Engine to work with CRI (needed after the built-in dockershim was removed in Kubernetes v1.24)
The CRI acts as a translation layer, converting kubelet's high-level requests (eg. create a pod) into the appropriate low-level operations that the container runtime can understand and execute.
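If a CRI runtime is running on a node, you can speak CRI to it directly with crictl, the CRI debugging CLI, and walk the RunPodSandbox, PullImage, CreateContainer, StartContainer sequence by hand. The sketch below assumes containerd's default socket path; the minimal pod and container configs follow the examples in the crictl documentation and are only a starting point.
# Point crictl at containerd's CRI socket (path may differ on your node)
export CONTAINER_RUNTIME_ENDPOINT=unix:///run/containerd/containerd.sock
cat > pod-config.json <<'EOF'
{
  "metadata": { "name": "demo-pod", "namespace": "default", "uid": "demo-uid-1", "attempt": 1 },
  "log_directory": "/tmp",
  "linux": {}
}
EOF
cat > container-config.json <<'EOF'
{
  "metadata": { "name": "demo-container" },
  "image": { "image": "docker.io/library/busybox:latest" },
  "command": ["sleep", "3600"],
  "log_path": "demo.log",
  "linux": {}
}
EOF
# ImageService: pull the image
crictl pull docker.io/library/busybox:latest
# RuntimeService: create the pod sandbox, then a container inside it, then start it
POD_ID=$(crictl runp pod-config.json)
CTR_ID=$(crictl create "$POD_ID" container-config.json pod-config.json)
crictl start "$CTR_ID"
# Inspect the result
crictl pods
crictl ps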
Why does containerd implement CRI even though containerd predates CRI?
- Docker & containerd's origins (2013-2016):
  - Docker (the company) launched its container technology in 2013
  - Initially, the Docker daemon (dockerd) was a monolithic piece of software that handled everything from image management to container execution
  - Over time, Docker recognized the need for modularity. They extracted the core container execution logic into a separate, lower-level component called containerd, which was designed to be a robust, industry-standard container runtime with an emphasis on simplicity and portability. This happened around 2015-2016
  - containerd was built to be an OCI-compliant runtime, meaning it could understand and execute OCI runtime bundles (which runc would then perform at the kernel level)
- Kubernetes' initial days (2014-2016):
  - Kubernetes was open-sourced by Google in 2014
  - In its early versions (up to v1.4), Kubernetes had direct, hardcoded integrations with specific container runtimes, primarily Docker, and later rkt. The kubelet's code directly knew how to talk to Docker's REST API. This made it difficult to swap out runtimes or introduce new ones
- The birth of CRI (Kubernetes v1.5, December 2016):
  - As Kubernetes gained adoption and the container ecosystem diversified, the Kubernetes community realized the need for a standardized, pluggable interface for container runtimes. This was to avoid vendor lock-in and simplify the integration of new runtimes
  - This led to the creation and introduction of the Container Runtime Interface (CRI) in Kubernetes v1.5 (released in December 2016). The CRI defined a gRPC API that the kubelet would speak
- containerd implements CRI (2017 onwards):
  - After the CRI was defined by Kubernetes, containerd (which was already a robust, standalone runtime) saw the opportunity to become the de-facto CRI-compliant runtime for Kubernetes
  - In March 2017, Docker (the company) famously donated containerd to the Cloud Native Computing Foundation (CNCF), the same foundation that hosts Kubernetes. This was a significant move to ensure containerd's neutrality and wide adoption
  - Immediately following its donation, containerd began implementing the CRI specification as a plugin (cri-containerd). This allowed containerd to directly receive and process gRPC calls from the kubelet
2.4. What is CNI?
The Container Network Interface (CNI) is a Cloud Native Computing Foundation project that consists of a specification and libraries for writing plugins to configure network interfaces in Linux containers. CNI defines how container runtimes should set up networking for containers, providing a standard to manage network connectivity in containerized environments.
CNI was originally created by CoreOS (now part of Red Hat) and adopted by Kubernetes to solve the problem of container networking in a pluggable, vendor-neutral way. Before CNI, different container orchestration platforms had their own networking implementations, making it difficult to share networking solutions across platforms.
CNI plugins are executable programs that the container runtime calls to configure networking for containers. The process works as follows:
- Container Creation: When a container runtime (like containerd or CRI-O) creates a container, it also needs to set up networking for that container
- CNI Plugin Execution: The runtime executes CNI plugins in a specific order to configure the container's network interface
- Network Configuration: CNI plugins perform various networking tasks:
- Create network interfaces (veth pairs, bridges, etc.)
- Assign IP addresses from predefined pools
- Set up routing rules
- Configure firewall rules
- Connect containers to networks
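Because CNI plugins are just executables that read a JSON network config on stdin plus a few CNI_* environment variables, you can invoke one by hand. The sketch below assumes the reference plugins (bridge, host-local) are installed under /opt/cni/bin; the namespace name, container ID, bridge name, and subnet are arbitrary examples.
# Create a network namespace to stand in for a container's netns
ip netns add demo
# Ask the bridge plugin to wire eth0 into that namespace, with host-local IPAM
cat <<'EOF' | CNI_COMMAND=ADD CNI_CONTAINERID=demo123 CNI_NETNS=/var/run/netns/demo \
  CNI_IFNAME=eth0 CNI_PATH=/opt/cni/bin /opt/cni/bin/bridge
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni-demo",
  "isGateway": true,
  "ipMasq": true,
  "ipam": { "type": "host-local", "subnet": "10.244.0.0/24" }
}
EOF
# The plugin prints the created interfaces and assigned IP as JSON; verify inside the namespace
ip netns exec demo ip addr show eth0
# Running the same config with CNI_COMMAND=DEL tears the network back down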
2.5. The Complete Flow
Now that we understand the individual components, let's see how OCI, runc, CRI, CNI, container runtimes, and kubelet work together to create and manage containers in Kubernetes.
Here's what happens when Kubernetes creates a pod with containers:
- Pod Specification:
  - A pod specification is sent to the Kubernetes API server
  - The kubelet on the target node receives the pod spec through its watch on the API server
  - In the Kubernetes code: in pkg/kubelet/kuberuntime/kuberuntime_manager.go, SyncPod() is called
- CRI Communication: kubelet calls the CRI runtime (like containerd or CRI-O) via gRPC
  - RunPodSandbox(): Creates the pod's shared namespaces (pause container)
    - In pkg/kubelet/kuberuntime/kuberuntime_manager.go, SyncPod() calls m.createPodSandbox()
    - In pkg/kubelet/kuberuntime/kuberuntime_sandbox.go, createPodSandbox() calls m.runtimeService.RunPodSandbox()
  - PullImage(): Downloads required container images
    - In pkg/kubelet/kuberuntime/kuberuntime_manager.go, SyncPod() calls m.startContainer()
    - In pkg/kubelet/kuberuntime/kuberuntime_container.go, startContainer() calls m.imagePuller.EnsureImageExists()
    - In pkg/kubelet/images/image_manager.go, EnsureImageExists() calls m.pullImage(), which then calls m.puller.pullImage()
    - In pkg/kubelet/images/puller.go, pullImage() calls ip.imageService.PullImage()
  - CreateContainer(): Prepares container configuration
    - In pkg/kubelet/kuberuntime/kuberuntime_container.go, startContainer() calls m.runtimeService.CreateContainer()
  - StartContainer(): Starts the container process
    - In pkg/kubelet/kuberuntime/kuberuntime_container.go, startContainer() calls m.runtimeService.StartContainer()
- OCI Runtime API Call:
  - Pull container images from registries
  - Extract, store, and manage image layers and filesystem layers
  - Call CNI plugins to configure networking for the pod
  - Call runc (or another OCI-compliant runtime) to create the container process
  - In the containerd code:
    - In internal/cri/server/sandbox_run.go, RunPodSandbox() calls c.setupPodNetwork(), which then calls netPlugin.Setup(). netPlugin.Setup() will then call the CNI plugins to configure networking for the pod
    - In internal/cri/server/container_create.go, CreateContainer() calls c.createContainer(), which then calls c.buildContainerSpec(), which then calls c.buildLinuxSpec() and c.runtimeSpec()
    - Inside c.createContainer():
      - Prepare the container root filesystem
      - Call c.buildContainerSpec() to build the complete container specification and integrate the image config with runtime requirements
      - Call c.buildLinuxSpec() to add Linux-specific configurations (cgroups, namespaces, security) and set up resource constraints and capabilities
    - Inside c.createContainer(), it also calls c.client.NewContainer(), which delegates to containerd's core services:
      - In client/client.go, NewContainer() calls c.ContainerService().Create(), which only stores container metadata in containerd's database. It does not call runc or start any runtime processes
    - The runc calls that actually create containers are triggered in internal/cri/server/container_start.go, in StartContainer(), where it calls container.NewTask():
      - In client/container.go, NewTask() calls c.client.TaskService().Create(). This Create() is implemented in plugins/services/tasks/local.go
      - In plugins/services/tasks/local.go, Create() calls rtime.Create(), which calls the Runtime V2 implementation. This creates the OCI bundle and runtime shim. The shim then executes runc commands. The Runtime V2 shim is implemented in cmd/containerd-shim-runc-v2/process/init.go
      - In cmd/containerd-shim-runc-v2/process/init.go, Create() calls p.runtime.Create(), which runs the real runc command to create a new container
- Linux Kernel Interaction:
  - runc creates the actual container process using Linux kernel features:
    - Namespaces: Process, network, mount, IPC, UTS, user isolation
    - Cgroups: Resource limits (CPU, memory, I/O); a minimal cgroup sketch follows this walkthrough
    - Filesystem: pivot_root to the container's rootfs
    - Security: Seccomp, AppArmor, SELinux profiles
  - runc exits after creating the process (it's not a daemon)
  - In the runc code:
    - create.go creates the container process. It then calls startContainer()
    - In utils_linux.go, startContainer() calls createContainer(), which calls libcontainer.Create(), which is like a constructor that sets up the container metadata and state directory. The actual creation happens when startContainer() calls r.run()
    - In utils_linux.go, run() calls r.container.Start()
    - In libcontainer/container_linux.go, Start() calls:
      - c.start(), which calls newParentProcess(), which calls c.newInitProcess()
      - parent.start(), which is implemented in libcontainer/process_linux.go
    - In libcontainer/process_linux.go, start() calls:
      - p.cmd.Start(), which uses the exec system calls to create the new process
      - p.manager.Apply(), which applies cgroup configuration by writing to cgroup filesystem paths like /sys/fs/cgroup/...
      - p.goCreateMountSources(), which uses the setns() system call to join the container's mount namespace
      - p.createNetworkInterfaces(), which creates network interfaces using network-related system calls
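The hands-on sections below exercise namespaces, chroot, and OverlayFS directly, but not cgroups. As a small companion to the Cgroups bullet above, here is a minimal sketch of the kind of cgroup filesystem writes runc's cgroup manager performs, assuming a cgroup v2 hierarchy mounted at /sys/fs/cgroup (the path and the 100M limit are just examples).
# Make sure the memory controller is delegated to child cgroups (often already set)
echo +memory > /sys/fs/cgroup/cgroup.subtree_control
# Create a cgroup and set a memory limit, roughly what runc does for a container
mkdir /sys/fs/cgroup/demo
echo 100M > /sys/fs/cgroup/demo/memory.max
# Move the current shell into the cgroup; all of its children inherit the limit
echo $$ > /sys/fs/cgroup/demo/cgroup.procs
# Verify which cgroup we are in
cat /proc/self/cgroup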
This layered architecture allows for flexibility and innovation at each level while maintaining compatibility through standardized interfaces (OCI, CRI, and CNI).
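If you have access to a Kubernetes node, you can observe this whole chain on a live system. The commands below are a sketch that assumes containerd is the CRI runtime and uses its conventional runc state directory; paths can differ between distributions.
# Pods and containers as the kubelet sees them through CRI
crictl pods
crictl ps
# One shim process per pod, launched by containerd
ps -ef | grep containerd-shim-runc-v2
# Containers as runc sees them (containerd's default runc root for the k8s.io namespace)
runc --root /run/containerd/runc/k8s.io list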
3. The Union Filesystem OverlayFS
OverlayFS is a union filesystem that allows us to create a layered filesystem structure. It is commonly used in container runtimes to provide a read-only base layer and a writable layer for each container. When using containerd, CRI-O, or dockerd with Kubernetes, OverlayFS is often the preferred storage driver for managing container image layers and writable container filesystems. It provides efficient disk usage by layering changes on top of read-only image layers. In this section, we will explore how OverlayFS works and how it can be used to create a layered filesystem structure.
OverlayFS combines two directory trees:
- Lower layer: The read-only base layer, which contains the common files and directories that all containers share
- Upper layer: The writable layer, which contains the changes made by the container. Each container has its own upper layer, allowing them to modify files without affecting the lower layer
The result is a merged view of the two layers, where files from the upper layer take precedence over files from the lower layer. This allows containers to have their own writable filesystem while still sharing a common base layer. When a file from the lower layer is modified through the merged view, OverlayFS first copies it up into the upper layer and applies the change there, leaving the lower layer intact. When a file that comes from the lower layer is deleted or renamed, OverlayFS records a special whiteout file in the upper layer to mark it as removed, so the file disappears from the merged view while the lower layer remains untouched. This allows OverlayFS to maintain the integrity of the lower layer while still allowing each container to modify its own files.
Let's demonstrate how to set up OverlayFS for two isolated environments. We will create a minimal Alpine Linux root filesystem and use it as the lower layer. Each environment will have its own upper layer, while sharing the same lower layer. Note that we are not using the term "container" here, as we will still be in the host's namespaces.
mkdir -p /root/tung/{lower,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change to your OS architecture, eg. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged

Explain the mount command:
- mount: Command to mount a filesystem
- -t overlay: Specifies the type of filesystem to mount, which is an overlay filesystem
- overlay: The source of the mount. For a regular filesystem, this would be a physical device, like /dev/sda1. However, since overlay is a virtual filesystem, there is no physical device to specify
- -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work: Options for the overlay filesystem:
  - lowerdir: The lower layer of the overlay filesystem, which is the base filesystem that all containers share
  - upperdir: The upper layer of the overlay filesystem, which contains changes made by the container
  - workdir: The work directory used by the overlay filesystem to manage the upper layer
- /root/tung/a0-merged: The mount point for the overlay filesystem
Now, let's verify that the overlay filesystems are mounted correctly. Firstly, change the root directory to the overlay filesystem for environment A0 and start a shell. Note that we are still in the host's namespaces.
# in host, change root to the overlay filesystem for environment A0
chroot /root/tung/a0-merged /bin/sh
# in the new shell
touch a0
# in a new terminal in host, change root to the overlay filesystem for environment A1
chroot /root/tung/a1-merged /bin/sh
# in the new shell
touch a1

Explain the chroot command:
- chroot: Command to change the root directory for the current running process. This creates an isolated environment, often called a chroot jail, where the process can only access files and commands within that new directory tree
- /root/tung/a0-merged: The new root directory for the current process
- /bin/sh: The command to run in the new root directory, which is a shell in this case
Next, let's verify that the files are created in the upper layer of each environment's overlay filesystem. We can do this by checking the contents of the upper layer directories.
# in host, check the files in the lower layer
ls /root/tung/lower
# should not see a0 or a1 files, only the files from the Alpine Linux root filesystem
# in host, check the files in the upper layer for A0
ls /root/tung/a0-upper
# should see a0 file, but not a1 file
# in host, check the files in the merged view for A0
ls /root/tung/a0-merged
# should see a0 file, but not a1 file

Similarly, you can verify that the a1 file is created in the upper layer and the merged view of environment A1.
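You can also watch the whiteout mechanism described earlier. The sketch below deletes a file that comes from the lower layer from inside the A0 environment and then inspects the upper layer on the host; on a typical kernel the whiteout appears as a character device with device number 0,0.
# in the A0 shell, delete a file that exists only in the lower layer
rm /etc/os-release
# in host, the lower layer still contains the file
ls /root/tung/lower/etc/os-release
# in host, the upper layer now holds a whiteout entry that hides it from the merged view
ls -l /root/tung/a0-upper/etc/
# expect something like: c--------- 1 root root 0, 0 ... os-release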
In conclusion, we have set up OverlayFS for two environments, A0 and A1, each with its own upper layer while sharing the same lower layer. This is how container runtimes like Docker, containerd, CRI-O, and others use OverlayFS to provide a layered filesystem structure for containers. In Kubernetes, this is used to provide a common base layer for all containers in a pod while allowing each container to have its own writable layer.
4. Create Kubernetes Pod from Scratch
4.1. Create Pause Container
In Kubernetes, the Pause container (also called the sandbox or infra container) is a special container that maintains the cgroups and namespaces for the pod. It is responsible for providing a shared network namespace for all other containers in the pod. Kubernetes uses Pause containers so that the application containers in a pod can crash or restart without losing any of the networking configuration.
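On a real node you can spot these sandbox containers directly. A quick sketch, assuming a containerd-based node where the pause binary is visible in the host's process list:
# in host on a Kubernetes node: one pause process per running pod
ps -ef | grep -w /pause
# the corresponding pod sandboxes, as reported by the CRI runtime
crictl pods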
Let's set up the necessary overlay filesystems for the pod's namespaces. We will create a directory structure to hold the lower and upper layers of the overlay filesystem, and then extract a minimal Alpine Linux root filesystem into it. Next, we will create overlay mounts for each container in the pod. Each container will have its own upper layer, while sharing the same lower layer.
mkdir -p /root/tung/{lower,pause-upper,pause-work,pause-merged,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change to your OS architecture, eg. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/pause-upper,workdir=/root/tung/pause-work /root/tung/pause-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged

We will start the Pause container using the unshare command to create a new set of namespaces.
unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh

Explain the command:
- unshare: Creates a new set of namespaces for the command that follows it
- -C: Create a new cgroup namespace (for resource limits)
- -u: Create a new UTS namespace (for hostname)
- -n: Create a new network namespace
- -i: Create a new IPC namespace
- -m: Create a new mount namespace
- -p: Create a new PID namespace
- -f: Fork the command in a new process
- chroot /root/tung/pause-merged /bin/sh: Change the root directory to the overlay filesystem for the Pause container and start a shell
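From another terminal on the host, you can confirm that the forked shell really received its own namespaces. A minimal sketch using lsns from util-linux (replace <pid> with the shell's PID, found as shown in the next section):
# in host, list namespaces whose first member is the new /bin/sh
lsns | grep /bin/sh
# or inspect a specific process's namespace links directly
ls -l /proc/<pid>/ns/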
4.2. Create Application Containers
To interact with the Pause container, we first need to find its PID. This is the process that was forked by the unshare command. Next, we will use nsenter to enter the Pause container's namespaces and start the application containers A0 and A1, which will then share the same network namespace as the Pause container.
# find the PAUSE process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
# for example, let's say the output is:
root 535 0.0 0.0 5260 820 pts/0 S 01:37 0:00 unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
root 536 0.0 0.0 1816 1016 pts/0 S+ 01:37 0:00 /bin/sh
# we can see that the PID of the Pause container is 536
PAUSE_PID=<pause-pid>
# in host, create application containers A0 by entering the Pause container's namespaces
# and changing the root directory to the overlay filesystem for A0
nsenter -t $PAUSE_PID -a chroot /root/tung/a0-merged /bin/sh
# in host, in a new terminal, similarly, create application containers A1
nsenter -t $PAUSE_PID -a chroot /root/tung/a1-merged /bin/sh

Explain the command:
- nsenter: Command to enter the namespaces of another process
- -t $PAUSE_PID: Target the PID of the Pause container process, allowing us to enter its namespaces
- -a: Enter all namespaces (PID, network, mount, IPC, UTS, user)
- chroot /root/tung/a0-merged /bin/sh: Change the root directory to the overlay filesystem for the application container A0 and start a shell
For more details on how nsenter could be used, refer to my previous post on nsenter experiments.
Inside each application container, we can verify that they share the same network namespace with the Pause container by checking the /proc/self/ns/net symlink. This symlink should point to the same network namespace ID for all containers in the pod.
# in each container, mount the proc filesystem to access process information
mount -t proc proc /proc
# in each container, check the network namespace
ls -l /proc/self/ns/net
# All should point to the same net:[ID], eg. net:[4026532321]

That's it! We have just created a Pause container and two application containers, A0 and A1, all sharing the same network namespace. This is essentially how Kubernetes creates pods and manages containers within them. The Pause container acts as the network namespace anchor, while the application containers run their own processes and communicate over the shared network.
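To make the shared network namespace tangible, you can talk between A0 and A1 over loopback. A small sketch, assuming the BusyBox ifconfig and nc applets are present in the Alpine minirootfs:
# in any one of the containers, bring up the shared loopback interface
ifconfig lo up
# in A0, listen on a port
nc -l -p 8080
# in A1, connect over the shared namespace's loopback; the message shows up in A0's terminal
echo "hello from A1" | nc 127.0.0.1 8080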
5. Conclusion
This post covered the fundamentals of container runtimes, OCI specifications, and how Kubernetes uses these components to create and manage containers. We explored the role of runc, CRI, and OverlayFS in providing a layered filesystem structure for containers. Finally, we demonstrated how to create a Kubernetes pod from scratch, including the Pause container and application containers.
This post didn't cover NRI (Node Resource Interface) or CSI (Container Storage Interface), which are also important components in the Kubernetes ecosystem.