Kubernetes Container From Scratch
Last updated on Aug 08, 2025


1. Introduction

I have long wondered about the relationships between Kubernetes components and container runtimes, how they work together to create and manage containers, and what actually happens when Kubernetes creates a pod. I decided to write this post to put everything together:

  • Container standards: OCI specifications, runc, and CRI
  • The execution flow: How kubelet, container runtimes, and runc work together
  • Filesystem layering: OverlayFS for efficient container images
  • Hands-on implementation: Building a Kubernetes pod from scratch using Linux primitives

By the end, you'll understand both what happens and how it works at the kernel level.

2. Container Runtime Fundamentals

2.1. What is OCI?

The Open Container Initiative (OCI) is a Linux Foundation project established in 2015.

  • Primary purpose: to create open industry standards for container formats and runtimes
  • Before OCI, there was a risk of fragmentation in the container world, with different companies creating their own incompatible container technologies
  • OCI was formed to prevent this and ensure interoperability and portability across different container tools and platforms

OCI focuses on 3 main specifications:

  1. OCI Image Specification (image-spec):

    • What it defines: The format for a container image. This includes how an image is structured on disk, its layers, manifest (metadata), and configuration (a quick way to inspect this layout is shown after this list)
    • Why it matters: It ensures that an image built by one tool (e.g. Docker) can be pulled, stored, and run by any other OCI-compliant container runtime. This is why a Docker image can be run by containerd or CRI-O, even after dockershim was removed from Kubernetes
    • Analogy: It's like the JPEG standard for images or the PDF standard for documents. Any software that understands the standard can create or read it
  2. OCI Runtime Specification (runtime-spec):

    • What it defines: How a container runtime should execute a filesystem bundle (an unpacked container image) and manage its lifecycle (create, start, stop, delete, etc.). It specifies the config.json file, which describes how the container process should be run (e.g. entrypoint, environment variables, resource limits, security settings)
    • Why it matters: It ensures that different container runtimes can produce consistent execution environments for containers
    • Analogy: It's like a detailed instruction manual for how to turn on a specific type of machine and what controls it should have
  3. OCI Distribution Specification (distribution-spec):

    • What it defines: An API protocol for distributing container images. This standardizes how container registries (e.g. Docker Hub, GCR, ECR, Harbor) store, pull, and push container images
    • Why it matters: It allows various container tools to interact with different registries, promoting a unified ecosystem for image distribution
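
To make the image-spec concrete, here's one way to look at an OCI image layout on disk. This is a minimal sketch assuming skopeo is installed; the directory name alpine-oci is arbitrary:

# copy an image from a registry into an OCI image layout on disk
skopeo copy docker://alpine:3.20 oci:alpine-oci:3.20
 
# the on-disk layout follows the OCI image-spec
ls alpine-oci
# blobs  index.json  oci-layout
 
# index.json points to the manifest, which in turn lists the config and layer blobs
cat alpine-oci/index.json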

2.2. What is runc?

runc is a lightweight, portable, low-level container runtime that serves as the reference implementation of the OCI Runtime Specification.

When a higher-level container runtime (e.g. containerd or CRI-O) decides to actually start a container, it hands off the specific task of creating and executing the container process to runc.

runc interacts directly with the Linux kernel's low-level features, specifically:

  • Namespaces: Provide process isolation (PID, network, mount, IPC, UTS, user namespaces)
  • Cgroups: Enforce resource limits (CPU, memory, I/O) on the container process
  • pivot_root/chroot: Change the root filesystem of the process to the container's rootfs bundle
  • Seccomp, AppArmor, SELinux: Apply security profiles for granular control over system calls and permissions

runc doesn't have its own daemon or long-running process that orchestrates many containers. Instead, it's a simple command-line tool that performs its job (spawning a container process) and then exits. It delegates the ongoing management of the running process to the kernel.

runc itself is largely stateless. It receives all necessary configuration (from the config.json defined by OCI runtime-spec) at runtime.

Docker initially open-sourced runc and contributed it to the OCI project. It's often used as the default or underlying runtime for many higher-level container engines.
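
To see this in practice, you can drive runc by hand without any higher-level engine. A minimal sketch, assuming runc is installed and Docker is available purely to produce a root filesystem:

mkdir -p mycontainer/rootfs
cd mycontainer
 
# use Docker only to obtain a rootfs for the OCI bundle
docker export $(docker create alpine) | tar -C rootfs -xf -
 
# generate a default config.json as defined by the OCI runtime-spec
runc spec
 
# spawn the container process; runc exits when the process exits
runc run demo

Note how there is no daemon involved: once the shell inside the container exits, runc is gone, which matches its single-shot, stateless design.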

2.3. What is CRI?

The Container Runtime Interface (CRI) is a Kubernetes API that standardizes how the kubelet communicates with container runtimes. It was introduced in Kubernetes v1.5 (December 2016) to solve the problem of tight coupling between Kubernetes and specific container runtimes.

Before CRI, Kubernetes had hardcoded integrations with specific container runtimes:

  • The kubelet contained direct, runtime-specific code to interact with Docker's REST API
  • Adding support for new runtimes (like rkt) required modifying kubelet's core code
  • This created vendor lock-in and made it difficult to innovate in the container runtime space
  • Each runtime integration had to be maintained within the Kubernetes codebase

CRI defines a gRPC API with two main services:

  1. ImageService: Manages container images

    • Pull, list, remove, and inspect images
    • Image filesystem usage statistics
  2. RuntimeService: Manages pods and containers

    • Create, start, stop, remove, and inspect containers
    • Create, stop, remove, and inspect pods (sandbox containers)
    • Execute commands in containers
    • Attach to containers
    • Port forwarding

Popular CRI implementations include:

  • containerd: originally extracted from Docker and later donated to the CNCF, where it is now a graduated project
  • CRI-O: Red Hat's minimalist CRI implementation, designed specifically for Kubernetes
  • cri-dockerd: Adapter that allows Docker Engine to work with CRI (after dockershim was removed in Kubernetes v1.24)

The CRI acts as a translation layer, converting kubelet's high-level requests (e.g. create a pod) into the appropriate low-level operations that the container runtime can understand and execute.
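
You can exercise both CRI services directly with crictl, the CRI command-line client. A hedged sketch — pod.json and container.json here are hypothetical config files you'd write yourself, following the CRI sandbox/container config format:

# ImageService: pull and list images
crictl pull nginx:alpine
crictl images
 
# RuntimeService: create a pod sandbox, then a container inside it
POD_ID=$(crictl runp pod.json)
CTR_ID=$(crictl create $POD_ID container.json pod.json)
crictl start $CTR_ID
 
# inspect the results
crictl pods
crictl ps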

Why does containerd implement CRI despite predating the interface itself?
  1. Docker & containerd's origins (2013-2016):

    • Docker (the company) launched its container technology in 2013
    • Initially, the Docker daemon (dockerd) was a monolithic piece of software that handled everything from image management to container execution
    • Over time, Docker recognized the need for modularity. They extracted the core container execution logic into a separate, lower-level component called containerd, which was designed to be a robust, industry-standard container runtime with an emphasis on simplicity and portability. This happened around 2015-2016
    • containerd was built to be an OCI-compliant runtime, meaning it could understand and execute OCI runtime bundles (which runc would then perform at the kernel level)
  2. Kubernetes' initial days (2014-2016):

    • Kubernetes was open-sourced by Google in 2014
    • In its early versions (up to v1.4), Kubernetes had direct, hardcoded integrations with specific container runtimes, primarily Docker, and later rkt. The Kubelet's code directly knew how to talk to Docker's REST API. This made it difficult to swap out runtimes or introduce new ones
  3. The birth of CRI (Kubernetes v1.5, December 2016):

    • As Kubernetes gained adoption and the container ecosystem diversified, the Kubernetes community realized the need for a standardized pluggable interface for container runtimes. This was to avoid vendor lock-in and simplify the integration of new runtimes
    • This led to the creation and introduction of the Container Runtime Interface (CRI) in Kubernetes v1.5 (released in December 2016). The CRI defined a gRPC API that the Kubelet would speak
  4. containerd implements CRI (2017 onwards):

    • After the CRI was defined by Kubernetes, containerd (which was already a robust, standalone runtime) saw the opportunity to become the de-facto CRI-compliant runtime for Kubernetes
    • In March 2017, Docker (the company) famously donated containerd to the Cloud Native Computing Foundation (CNCF), the same foundation that hosts Kubernetes. This was a significant move to ensure containerd's neutrality and wide adoption
    • Immediately following its donation, containerd began implementing the CRI specification as a plugin (cri-containerd). This allowed containerd to directly receive and process gRPC calls from the Kubelet

2.4. What is CNI?

The Container Network Interface (CNI) is a Cloud Native Computing Foundation project that consists of a specification and libraries for writing plugins to configure network interfaces in Linux containers. CNI defines how container runtimes should set up networking for containers, providing a standard to manage network connectivity in containerized environments.

CNI was originally created by CoreOS (now part of Red Hat) and adopted by Kubernetes to solve the problem of container networking in a pluggable, vendor-neutral way. Before CNI, different container orchestration platforms had their own networking implementations, making it difficult to share networking solutions across platforms.

CNI plugins are executable programs that the container runtime calls to configure networking for containers. The process works as follows:

  1. Container Creation: When a container runtime (like containerd or CRI-O) creates a container, it also needs to set up networking for that container
  2. CNI Plugin Execution: The runtime executes CNI plugins in a specific order to configure the container's network interface
  3. Network Configuration: CNI plugins perform various networking tasks:
    • Create network interfaces (veth pairs, bridges, etc.)
    • Assign IP addresses from predefined pools
    • Set up routing rules
    • Configure firewall rules
    • Connect containers to networks
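
Concretely, a CNI network is described by a JSON config file that the runtime feeds to the plugin binaries. Below is a minimal sketch using the standard bridge and host-local plugins; the network name and subnet are made up for illustration:

# hypothetical example: define a bridge network in /etc/cni/net.d/
cat > /etc/cni/net.d/10-demo.conf <<'EOF'
{
  "cniVersion": "1.0.0",
  "name": "demo-net",
  "type": "bridge",
  "bridge": "cni0",
  "isGateway": true,
  "ipMasq": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.22.0.0/16",
    "routes": [{ "dst": "0.0.0.0/0" }]
  }
}
EOF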

2.5. The Complete Flow

Now that we understand the individual components, let's see how OCI, runc, CRI, CNI, container runtimes, and kubelet work together to create and manage containers in Kubernetes.

Here's what happens when Kubernetes creates a pod with containers:

  1. Pod Specification:

    • A pod specification is sent to the Kubernetes API server
    • The kubelet on the target node receives the pod spec through its watch on the API server
    • In pkg/kubelet/kuberuntime/kuberuntime_manager.go, the kubelet calls SyncPod()
  2. CRI Communication:

    • The kubelet translates the pod spec into CRI gRPC calls: the RuntimeService creates the pod sandbox and then creates and starts each container, while the ImageService pulls any missing images
  3. OCI Runtime API Call:

    • The CRI runtime (e.g. containerd or CRI-O) unpacks the image layers into an OCI filesystem bundle, writes the config.json, and invokes runc; as part of sandbox creation it also executes the configured CNI plugins to set up the pod's network
  4. Linux Kernel Interaction:

    • runc sets up namespaces, cgroups, and the container's root filesystem (pivot_root), applies security profiles, execs the container's entrypoint, and exits
This layered architecture allows for flexibility and innovation at each level while maintaining compatibility through standardized interfaces (OCI, CRI, and CNI).

3. The Union Filesystem OverlayFS

OverlayFS is the union filesystem that allows us to create a layered filesystem structure. It is commonly used in container runtimes to provide a read-only base layer and a writable layer for each container. When using containerd, CRI-O, or dockerd with Kubernetes, OverlayFS is often the preferred storage driver for managing container image layers and writable container filesystems. It provides efficient disk usage by layering changes on top of read-only image layers. In this section, we will explore how OverlayFS works and how it can be used to create a layered filesystem structure.

OverlayFS combines two directory trees:

  • Lower layer: The read-only base layer, which contains the common files and directories that all containers share
  • Upper layer: The writable layer, which contains the changes made by the container. Each container has its own upper layer, allowing them to modify files without affecting the lower layer

The result is a merged view of the two layers, where files from the upper layer take precedence over files from the lower layer. This allows containers to have their own writable filesystem while still sharing a common base layer. Writes through the merged view never touch the lower layer: modifying a file that comes from the lower layer triggers a copy-up of that file into the upper layer, and the copy then shadows the original. Deleting or renaming a file that comes from the lower layer is recorded using special whiteout files in the upper layer, which mark the file as removed in the merged view while leaving the lower layer intact.


Let's demonstrate how to set up OverlayFS for two isolated environments. We will create a minimal Alpine Linux root filesystem and use it as the lower layer. Each environment will have its own upper layer, while sharing the same lower layer. Note that we are not using the term "container" here, as we will still be in the host's namespaces.

mkdir -p /root/tung/{lower,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change this to your OS architecture, e.g. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
 
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged

Explain the mount command:

  • mount: Command to mount a filesystem (ref)
  • -t overlay: Specifies the type of filesystem to mount, which is an overlay filesystem
  • overlay: The source of the mount. For a regular filesystem, this would be a physical device, like /dev/sda1. However, since overlay is a virtual filesystem, there is no physical device to specify
  • -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work: Options for the overlay filesystem:
    • lowerdir: The lower layer of the overlay filesystem, which is the base filesystem that all containers share
    • upperdir: The upper layer of the overlay filesystem, which contains changes made by the container
    • workdir: The work directory used by the overlay filesystem to prepare files before moving them into the upper layer. It must be an empty directory on the same filesystem as upperdir
  • /root/tung/a0-merged: The mount point where the merged view is presented
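
You can confirm that both overlay mounts are active before going further:

# in host, list active overlay mounts
findmnt -t overlay
# TARGET                SOURCE   FSTYPE   OPTIONS
# /root/tung/a0-merged  overlay  overlay  rw,relatime,lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work
# /root/tung/a1-merged  overlay  overlay  rw,relatime,lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work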

Now, let's verify that the overlay filesystems are mounted correctly. First, change the root directory to the overlay filesystem for environment A0 and start a shell. Note that we are still in the host's namespaces.

# in host, change root to the overlay filesystem for environment A0
chroot /root/tung/a0-merged /bin/sh
# in the new shell
touch a0
 
# in a new terminal in host, change root to the overlay filesystem for environment A1
chroot /root/tung/a1-merged /bin/sh
# in the new shell
touch a1

Explain the chroot command (ref):

  • chroot: Command to change the root directory for the current running process. This creates an isolated environment, often called a chroot jail, where the process can only access files and commands within that new directory tree
  • /root/tung/a0-merged: The new root directory for the current process
  • /bin/sh: The command to run in the new root directory, which is a shell in this case

Next, let's verify that the files are created in the upper layer of each environment's overlay filesystem. We can do this by checking the contents of the upper layer directories.

# in host, check the files in the lower layer
ls /root/tung/lower
# should not see a0 or a1 files, only the files from the Alpine Linux root filesystem
 
# in host, check the files in the upper layer for A0
ls /root/tung/a0-upper
# should see a0 file, but not a1 file
 
# in host, check the files in the merged view for A0
ls /root/tung/a0-merged
# should see a0 file, but not a1 file

Similarly, you can verify the a1 file is created in the upper layer and the merged layer of environment A1.
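
To see the whiteout mechanism described at the start of this section, delete a file that comes from the lower layer and inspect the upper layer:

# in host, remove a lower-layer file through A0's merged view
rm /root/tung/a0-merged/etc/os-release
 
# the lower layer is untouched
ls /root/tung/lower/etc/os-release
 
# the upper layer records the deletion as a whiteout:
# a character device with device number 0,0
ls -l /root/tung/a0-upper/etc/os-release
# c--------- 1 root root 0, 0 ... /root/tung/a0-upper/etc/os-release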

In conclusion, we have set up OverlayFS for two environments, A0 and A1, each with its own upper layer while sharing the same lower layer. This is how container runtimes like Docker, containerd, CRI-O, and others use OverlayFS to provide a layered filesystem structure for containers. In Kubernetes, image layers are shared read-only across all containers on a node that use the same image, while each container gets its own writable layer.

4. Create Kubernetes Pod from Scratch

4.1. Create Pause Container

In Kubernetes, the Pause container (also called the sandbox or infra container) is a special container that holds the cgroups and namespaces for the pod. It provides a shared network namespace for all other containers in the pod, so that application containers can crash or restart without losing any of the pod's networking configuration.

Let's set up the necessary overlay filesystems for the pod's namespaces. We will create a directory structure to hold the lower and upper layers of the overlay filesystem, and then extract a minimal Alpine Linux root filesystem into it. Next, we will create overlay mounts for each container in the pod. Each container will have its own upper layer, while sharing the same lower layer.

mkdir -p /root/tung/{lower,pause-upper,pause-work,pause-merged,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change this to your OS architecture, e.g. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
 
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/pause-upper,workdir=/root/tung/pause-work /root/tung/pause-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged

We will start the Pause container using the unshare command to create a new set of namespaces.

unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh

Explain the command:

  • unshare is used to create a new set of namespaces for the command that follows it (ref)
  • -C: Create a new cgroup namespace (for resource limits)
  • -u: Create a new UTS namespace (for hostname)
  • -n: Create a new network namespace
  • -i: Create a new IPC namespace
  • -m: Create a new mount namespace
  • -p: Create a new PID namespace
  • -f: Fork the command as a child process; required with -p so that the child runs inside the new PID namespace
  • chroot /root/tung/pause-merged /bin/sh: Change the root directory to the overlay filesystem for the Pause container and start a shell
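
From another terminal on the host, you can verify that the shell is now running in its own set of namespaces (finding the forked /bin/sh PID is shown in the next section):

# in host, list the namespaces of the forked /bin/sh process
lsns -p <pause-pid>
# shows new cgroup, uts, ipc, mnt, net, and pid namespaces owned by this process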

4.2. Create Application Containers

To interact with the Pause container, we first need to find its PID. This is the process that was forked by the unshare command. Next, we will use nsenter to enter the Pause container's namespaces and start the application containers A0 and A1, which will then share the same network namespace as the Pause container.

# find the PAUSE process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
 
# for example, let's say the output is:
root         535  0.0  0.0   5260   820 pts/0    S    01:37   0:00 unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
root         536  0.0  0.0   1816  1016 pts/0    S+   01:37   0:00 /bin/sh
 
# we can see that the PID of the Pause container is 536
PAUSE_PID=<pause-pid>
 
# in host, create application container A0 by entering the Pause container's namespaces
# and changing the root directory to the overlay filesystem for A0
nsenter -t $PAUSE_PID -a chroot /root/tung/a0-merged /bin/sh
 
# in host, in a new terminal, similarly create application container A1
nsenter -t $PAUSE_PID -a chroot /root/tung/a1-merged /bin/sh

Explain the command:

  • nsenter: Command to enter the namespaces of another process (ref)
  • -t $PAUSE_PID: Target the PID of the Pause container process, allowing us to enter its namespaces
  • -a: Enter all namespaces of the target process (PID, network, mount, IPC, UTS, cgroup, etc.)
  • chroot /root/tung/a0-merged /bin/sh: Change the root directory to the overlay filesystem for the application container A0 and start a shell

For more details on how nsenter could be used, refer to my previous post on nsenter experiments.

Inside each application container, we can verify that they share the same network namespace with the Pause container by checking the /proc/self/ns/net symlink. This symlink should point to the same network namespace ID for all containers in the pod.

# in each container, mount the proc filesystem to access process information
mount -t proc proc /proc
 
# in each container, check the network namespace
ls -l /proc/self/ns/net
# All should point to the same net:[ID], e.g. net:[4026532321]
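
The UTS namespace is shared as well, so a hostname set in the Pause container is immediately visible in both application containers:

# in the Pause container's shell, set the pod-level hostname
hostname pod-demo
 
# in container A0 or A1
hostname
# pod-demo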

That's it! We have just created a Pause container and two application containers A0 and A1, all sharing the same network namespace. This is essentially how Kubernetes creates pods and manages containers within them. The Pause container acts as the network namespace anchor, while the application containers can run their own processes and communicate over the shared network.

5. Conclusion

This post covered the fundamentals of container runtimes, OCI specifications, and how Kubernetes uses these components to create and manage containers. We explored the role of runc, CRI, and OverlayFS in providing a layered filesystem structure for containers. Finally, we demonstrated how to create a Kubernetes pod from scratch, including the Pause container and application containers.

This post didn't cover NRI (Node Resource Interface) or CSI (Container Storage Interface), which are also important components in the Kubernetes ecosystem.