Kubernetes Pod Networking From Scratch
Last updated on Aug 10, 2025


1. Introduction

I'm quite overwhelmed by the complexity of Kubernetes networking. There are many concepts and technologies involved, such as network namespaces, veth pairs, bridges, iptables, load balancing, CNI plugins, and more. I've searched the internet, but I haven't found a comprehensive article that demonstrates how to set up Kubernetes pod networking from scratch using Linux commands, building up from the basic concepts to the advanced features. I think this is a great opportunity to write a blog post that fills that gap.

In this post, I will show you how to implement Kubernetes pod networking using Linux commands with minimal dependencies. I will also provide the necessary background knowledge and concepts to help you understand the topic better. However, there are some prerequisites that I assume you already know.

In the following sections, we will use Linux commands to implement:

  • Container-to-container communication in the same pod via the loopback interface
  • Pod-to-pod communication on the same node using either veth pairs or bridge-based solutions
  • Pod-to-pod communication across nodes using static routing, IP-in-IP tunneling, or VXLAN tunneling
  • Pod-to-service communication using Network Address Translation (NAT) technology

Let's start with how containers in the same pod could communicate with each other.

2. Container-to-Container Communication in the Same Pod

This section describes how containers within the same pod communicate with each other. We will configure the containers in the same pod to share the same network namespace, which allows them to communicate over localhost (the loopback interface).


2.1. Create Pause Container

In Kubernetes, the Pause container is a special container that maintains the cgroups and namespaces for the pod. It is responsible for providing a shared network namespace for all the other containers in the pod. This solution assumes that we have one pod with one Pause container and two application containers, A0 and A1.

Let's set up overlay filesystems for the containers. We will create a directory structure to hold the lower and upper layers of the overlay filesystem, and then extract a minimal Alpine Linux root filesystem into it. Next, we will create overlay mounts for each container in the pod. Each container will have its own upper layer, while sharing the same lower layer.

mkdir -p /root/tung/{lower,pause-upper,pause-work,pause-merged,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change to your OS architecture, eg. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
 
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/pause-upper,workdir=/root/tung/pause-work /root/tung/pause-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged

We will start the Pause container using the unshare command to create a new set of namespaces.

# create new cgroup (-C), UTS (-u), network (-n), IPC (-i), mount (-m), and PID (-p) namespaces,
# fork a child process (-f), and chroot into the Pause container's root filesystem
unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
 
# in pause container, mount the proc filesystem
mount -t proc proc /proc

2.2. Create Application Containers

To interact with the Pause container, we first need to find its PID. This is the process that was forked from the unshare command. Next, we will use nsenter to enter the Pause container's namespaces and start the application containers A0 and A1, which will then share the same network namespace as the Pause container.

# in host, find the PAUSE process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
 
# for example, let's say the output is:
root         535  0.0  0.0   5260   820 pts/0    S    01:37   0:00 unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
root         536  0.0  0.0   1816  1016 pts/0    S+   01:37   0:00 /bin/sh
 
# we can see that the PID of the Pause container is 536
PAUSE_PID=<pause-pid>
 
# in host, create application containers A0 by entering the Pause container's namespaces
# and changing the root directory to the overlay filesystem for A0
nsenter -t $PAUSE_PID -a chroot /root/tung/a0-merged /bin/sh
 
# in host, in a new terminal, similarly, create application containers A1
nsenter -t $PAUSE_PID -a chroot /root/tung/a1-merged /bin/sh

Inside each application container, we can verify that they share the same network namespace as the Pause container by checking the /proc/self/ns/net symlink. This symlink should point to the same network namespace ID for all containers in the pod.

# in each container, mount the proc filesystem to access process information
mount -t proc proc /proc
 
# in each container, check the network namespace
ls -l /proc/self/ns/net
# All should point to the same net:[ID], eg. net:[4026532321]

2.3. Enable Loopback Interface

By default, when we use unshare or create a new network namespace, the loopback interface is down. We can verify this by checking the network interfaces inside the Pause container.

# in PAUSE, list all network interface
ip link
# should see
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 
# check network interface config
ifconfig
# should see empty

Let's bring the loopback interface up so that containers A0 and A1 can communicate with each other using localhost.

# in pause container
ip link set lo up

Explain the command:

  • ip link set: change the attributes of a network interface (for example, bringing it up or down)
  • lo: the loopback interface
  • up: bring the interface up
Show me the containerd code

In internal/cri/server/sandbox_run.go, RunPodSandbox() calls c.setupPodNetwork(), which then calls c.bringUpLoopback().


In internal/cri/server/sandbox_run_linux.go, c.bringUpLoopback() calls netlink.LinkSetUp().


Note: runc also implements the loopback interface setup, which may be used by containerd or other container runtimes.

Let's verify that the loopback interface is now up.

# in pause container, verify
ip link
# should see
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
    link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
 
# check network interface config
ifconfig
# should see
lo        Link encap:Local Loopback
          inet addr:127.0.0.1  Mask:255.0.0.0
          inet6 addr: ::1/128 Scope:Host
          UP LOOPBACK RUNNING  MTU:65536  Metric:1
          RX packets:0 errors:0 dropped:0 overruns:0 frame:0
          TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
          collisions:0 txqueuelen:1000
          RX bytes:0 (0.0 B)  TX bytes:0 (0.0 B)

2.4. Test Communication

In container A0, start a netcat server to listen on port 8080 and execute a shell when a connection is made. In container A1, connect to container A0 to verify that they can communicate over localhost.

# in container A0
nc -lk -p 8080 -e /bin/sh
 
# in container A1
nc -v localhost 8080
# should see a shell prompt in container A1

Explain the command nc -lk -p 8080 -e /bin/sh:

  • nc: netcat command, a networking utility
  • -l: Listen mode, for incoming connections
  • -k: Keep the server running after a connection is closed. In my nc version, the -k flag only works if we specify a program to run with the -e flag
  • -p 8080: Specify the port to listen on
  • -e /bin/sh: Execute a shell when a connection is made

Explain the command nc -v localhost 8080:

  • -v: Verbose mode, to show connection details
  • localhost: Connect to the local loopback interface
  • 8080: The port to connect to

In conclusion, we have set up a pod with two containers that can communicate with each other over localhost. This is achieved by sharing the same network namespace through the Pause container, which allows both containers to access the loopback interface and communicate using standard networking tools like netcat. This is similar to how containers in a Kubernetes pod communicate with each other in a real-world scenario.

3. Pod-to-Pod Communication on the Same Node

In this section, we will explore how pods communicate with each other on the same node. The technology used for this is typically a virtual Ethernet (veth) pair or a bridge-based networking solution. From this section onward, we won't have multiple containers in the same pod. Hence, we will use the term pod to refer to a group of containers that share the same network namespace, similar to how Kubernetes pods work.

3.1. Solution 1: Direct veth Pair

veth stands for virtual Ethernet and is a pair of virtual network interfaces that are connected to each other. Think of a veth pair as a virtual Ethernet cable directly connecting two network namespaces. Each end of the veth pair is in a different network namespace, allowing them to communicate with each other as if they were connected by a physical Ethernet cable.

When one end of a veth pair sends a packet, it appears on the other end as if it were received from a physical network interface. veth operates at the Data Link layer (Layer 2) of the OSI model, which means it can carry Ethernet frames between network namespaces. While veth interfaces operate at Layer 2, they can be used in conjunction with Layer 3 (IP addresses, routing) to establish more complex network topologies. For example, we can assign IP addresses to veth interfaces and configure routes to enable communication between different network namespaces or containers, even if they are in different subnets.

The idea of solution 1 is to create a virtual Ethernet (veth) pair for each pair of pods that need to communicate with each other. Each pod will hold one end of the veth pair, allowing them to communicate directly.

In this solution, we assume:

  • Cluster CIDR is 10.200.0.0/16
  • node-0, where pod A0 and pod A1 are running, has pod subnet 10.200.0.0/24
  • Pod A0 has IP 10.200.0.2/24
  • Pod A1 has IP 10.200.0.3/24

The communication between pods A0 and A1 is described in the diagram below.

[Diagram: pod-to-pod communication on the same node through a direct veth pair]
Who assigns pod subnets to nodes and IP addresses to pods?

In Kubernetes, the Kubernetes controller manager is responsible for IP address management (IPAM) at the cluster level. For each new node joining the cluster, it chooses an unused subnet from the Cluster CIDR and assigns this unique subnet to the new node. The controller manager then records this assignment in the etcd database, making it available to all other cluster components.


Based on the pod subnet allocated to the node, the CNI plugin will assign IP addresses to pods. The CNI plugin knows which IPs are available by maintaining an IPAM system, which contains a local IPAM database (eg. a file, directory, or in-memory store).
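
For reference, on a real cluster you can usually see the pod subnet that the controller manager assigned to each node and, when the host-local IPAM plugin is used, the per-node IP allocations on disk. The following is a sketch; it assumes kubectl access and the default host-local state directory, and the network name cbr0 is just an example that depends on your CNI configuration.

# show the pod subnet (podCIDR) assigned to each node
kubectl get nodes -o custom-columns=NAME:.metadata.name,POD_CIDR:.spec.podCIDR

# on a node that uses the host-local IPAM plugin, each allocated pod IP is recorded as a file
ls /var/lib/cni/networks/cbr0/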

3.1.1. Create Pods and a veth Pair

First, let's create two pods A0 and A1 in two different network namespaces. We will use the same overlay filesystem structure as before.

# in a new terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
 
# in another terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh

Next, let's create a veth pair with two ends named veth-a0 and veth-a1.

# in host
ip link add veth-a0 type veth peer name veth-a1

Explain the ip link command:

  • ip link add: Command to create a new network interface
  • veth-a0: Name of the first end of the veth pair
  • type veth: Specifies that the interface is a virtual Ethernet interface
  • peer name veth-a1: Specifies the name of the second end of the veth pair

Let's verify that the veth pair has been created.

# in host
ip link
# should see
3: veth-a1@veth-a0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether 3a:24:90:cb:cd:83 brd ff:ff:ff:ff:ff:ff
4: veth-a0@veth-a1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
    link/ether de:68:63:b1:1b:4c brd ff:ff:ff:ff:ff:ff

Now, we need to move the veth interfaces to the corresponding pods' network namespaces. We will use the ip link set command to do this. First, we need to find the PIDs of the A0 and A1 processes that were forked from the unshare command.

# in host, find the A0 process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
 
# for example, let's say the output is:
root        4304  0.0  0.0   5260   816 pts/1    S    09:48   0:00 unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
root        4305  0.0  0.0   1828  1212 pts/1    S+   09:48   0:00 /bin/sh
root        4328  0.0  0.0   5260   804 pts/2    S    09:49   0:00 unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
root        4329  0.0  0.0   1820  1160 pts/2    S+   09:49   0:00 /bin/sh
root        4331  0.0  0.0   6088  1948 pts/0    S+   09:49   0:00 grep /bin/sh
 
# we can see that the PID of the pod A0 is 4305 and the pod A1 is 4329
A0_PID=<a0-pid>
A1_PID=<a1-pid>
 
# move the veth interfaces to the corresponding pods' network namespaces
ip link set veth-a0 netns $A0_PID
ip link set veth-a1 netns $A1_PID

In host, if we run ip link, we will see that the veth interfaces now disappear from the host's network namespace because they are moved to the corresponding pods' network namespaces.
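
We can confirm this from the host; the exact wording of the error may vary between iproute2 versions.

# in host, the veth ends are no longer visible
ip link show veth-a0
# should report something like: Device "veth-a0" does not exist.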

3.1.2. Assign IP Addresses for veth Interfaces

In order for the pods to communicate with each other via IP addresses, we need to assign an IP address to the veth interface in each pod. We will assign IP addresses in the same subnet, for example, assign IP 10.200.0.2/24 for A0 and IP 10.200.0.3/24 for A1.

# in A0
ip addr add 10.200.0.2/24 dev veth-a0
ip link set veth-a0 up
 
# in A1
ip addr add 10.200.0.3/24 dev veth-a1
ip link set veth-a1 up

Explain the command ip addr add 10.200.0.2/24 dev veth-a0:

  • ip addr add: Command to add an IP address to a network interface
  • dev veth-a0: Specifies the network interface to which the IP address should be assigned

We can verify that the IP addresses are assigned correctly by checking the network interfaces in each pod.

# in A0, verify the IP address
ip addr
# should see
4: veth-a0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
    link/ether de:68:63:b1:1b:4c brd ff:ff:ff:ff:ff:ff
    inet 10.200.0.2/24 scope global veth-a0
       valid_lft forever preferred_lft forever
    inet6 fe80::dc68:63ff:feb1:1b4c/64 scope link
       valid_lft forever preferred_lft forever

3.1.3. Test Communication

In pod A0, start a netcat server to listen on port 8080. In pod A1, connect to pod A0 to verify that they can communicate over the IP addresses assigned to the veth interfaces.

# in pod A0, try to ping pod A1
ping 10.200.0.3
 
# in pod A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
 
# in pod A1, try to ping pod A0
ping 10.200.0.2
 
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1

We have just set up pod-to-pod communication on the same node using a veth pair. Each pod has one end of the veth pair, allowing them to communicate directly with each other using IP addresses. However, this solution has some limitations. For example, it requires a veth pair for each pair of pods that need to communicate with each other, which can lead to a large number of veth pairs if there are many pods. This can also lead to performance issues due to the overhead of managing many veth pairs. In practice, we can use a more scalable solution based on bridges. The next section will discuss the bridge-based networking solution.

3.2. Solution 2: Bridge-Based Networking

In this solution, we will use a bridge to connect multiple pods on the same node. A bridge is a virtual network switch that allows multiple network interfaces to communicate with each other as if they were connected by a physical switch. This solution is more scalable than solution 1, as it allows multiple pods to communicate with each other without the need for a separate veth pair for each pair of pods.

Bridges operate at the Data Link layer (Layer 2) of the OSI model, allowing them to forward Ethernet frames between network interfaces. They can also be used in conjunction with Layer 3 (IP addresses, routing) to establish more complex network topologies. For example, we can assign IP addresses to the bridge interface and configure routes to enable communication between different network namespaces. This is also how bridge-based CNI plugins set up Kubernetes networking: they create a bridge on each node for the pod network. We will mimic this behavior by creating a bridge and connecting the pods to it.

In this solution, we assume:

  • Cluster CIDR is 10.200.0.0/16
  • node-0, where pod A0 and pod A1 are running, has pod subnet 10.200.0.0/24
  • Pod A0 has IP 10.200.0.2/24
  • Pod A1 has IP 10.200.0.3/24

The communication between pods A0 and A1 is described in the diagram below.

[Diagram: pod-to-pod communication on the same node through a bridge]

3.2.1. Create Pods, Bridge Interface and veth Pairs

First, let's create two pods A0 and A1 in two different network namespaces. We will use the same overlay filesystem structure as before.

# in a new terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
 
# in another terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh

Next, let's create a bridge interface in host network namespace.

# in host
ip link add name br0 type bridge
# bring it up
ip link set br0 up

Explain the ip link command:

  • ip link add: Command to create a new network interface
  • name br0: Name of the bridge interface
  • type bridge: Specifies that the interface is a bridge
  • This command will also assign a unique MAC address to the bridge interface, which will be used for communication between pods at Layer 2

Let's verify that the bridge interface has been created.

# in host
ip link
# should see
3: br0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
    link/ether 3a:b0:8e:d6:55:38 brd ff:ff:ff:ff:ff:ff

Now, we need to create a veth pair for each pod and connect one end of the veth pair to the bridge interface. The other end of the veth pair will be moved to the corresponding pod's network namespace.

# in host, create veth pairs for A0 and A1
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID
 
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br0
ip link set veth-a1 up
# find A1_PID yourself
ip link set veth-a1-c netns $A1_PID

Explain the ip link set command:

  • ip link set veth-a0: specify the network interface (veth-a0) we want to modify
  • master br0: This is the action that sets br0 device as the master for veth-a0
    • This makes veth-a0 a port on the br0 bridge. Any network traffic that comes into veth-a0 will now be handled by br0 bridge, allowing the traffic to be forwarded to other interfaces connected to the same bridge
    • The Linux bridge forwards traffic by using a forwarding database (FDB), also known as a MAC address table. The bridge operates at Layer 2 of the OSI model and makes forwarding decisions based on MAC addresses, not IP addresses
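
To see the bridge topology and the learned MAC addresses for ourselves, we can query the bridge from the host. This is a quick sketch using iproute2; the MAC addresses will of course differ on your machine.

# in host, list the interfaces enslaved to br0
ip link show master br0

# in host, show the MAC addresses the bridge has learned on its ports
bridge fdb show br br0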

Next, let's assign IP addresses to the veth interfaces in each pod. We will assign IP addresses in the same subnet, for example, assign IP 10.200.0.2/24 for A0 and IP 10.200.0.3/24 for A1. This allows them to communicate with each other using IP addresses.

# in A0
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
 
# in A1
ip addr add 10.200.0.3/24 dev veth-a1-c
ip link set veth-a1-c up
Show me the containerd and CNI plugin code

In the containerd repo, in internal/cri/server/sandbox_run.go, RunPodSandbox() calls c.setupPodNetwork(), which then calls netPlugin.Setup().


In internal/cri/server/service.go, NewCRIService() calls c.initPlatform() to initialize c.netPlugin. c.initPlatform() is implemented in internal/cri/server/service_linux.go.


One implementation of netPlugin.Setup() is in vendor/github.com/containerd/go-cni/cni.go. This is a Go library that provides the necessary functions and data structures for a container runtime to:


  • Find the CNI plugin executables on the host system

  • Run the CNI plugins to set up, tear down, or check the status of a pod's network

  • Handle the CNI plugin's configuration and results

In Setup(), it calls c.attachNetworks(), which then calls asynchAttach(), which then calls n.Attach().


In vendor/github.com/containerd/go-cni/namespace.go, in Attach(), it calls n.cni.AddNetworkList(), which is implemented in vendor/github.com/containernetworking/cni/libcni/api.go. In AddNetworkList(), it calls c.addNetwork(), which then calls invoke.ExecPluginWithResult(), which executes the ADD command of the CNI plugin.


Let's check one of the CNI plugin implementations in the plugins repo. This repo contains a collection of reference and example networking plugins maintained by the CNI team.


In plugins/main/bridge/bridge.go, in cmdAdd(), it calls setupBridge() and setupVeth().


setupBridge() then calls ensureBridge(), which creates the bridge interface and brings it up.


In setupVeth(), it creates the veth pair in the host, moves one end into the container's network namespace, and connects the host veth end to the bridge.

3.2.2. Test Communication

In pod A0, start a netcat server to listen on port 8080. In pod A1, connect to pod A0 to verify that they can communicate over the IP addresses assigned to the veth interfaces.

# in pod A0, try to ping pod A1
ping 10.200.0.3
 
# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
 
# in pod A1, try to ping pod A0
ping 10.200.0.2
 
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1

In conclusion, we have set up pod-to-pod communication on the same node using a bridge. The bridge allows multiple pods to communicate with each other without the need for a separate veth pair for each pair of pods. This is a more scalable solution that is used in Kubernetes networking.

In reality, the Flannel CNI plugin does implement a bridge-based networking solution to connect pods on the same node, similar to the one we just built. However, other CNI plugins like Calico and Cilium use different approaches. Covering all CNI plugins is out of the scope of this post, but you can refer to their documentation for more details.

In the next section, we will explore how pods communicate with each other across different nodes in a Kubernetes cluster.

4. Pod-to-Pod Communication Across Nodes

In this section, we will explore how pods communicate with each other across different nodes in a Kubernetes cluster. We will discuss solutions based on static routing and IP tunneling (IP-in-IP and VXLAN).

4.1. Solution 1: Static routing

This solution is based on the idea of routing traffic between bridges on different nodes. Each node has its own bridge interface, and pods on different nodes can communicate with each other by routing traffic through their respective bridges. From the bridge, traffic is forwarded to the appropriate network interface based on the destination IP address and the node's routing table.

In Kubernetes, each node is assigned one IP address range, known as the pod subnet, which is used to assign IP addresses to pods running on that node. In this solution, we assume:

  • Cluster CIDR is 10.200.0.0/16
  • node-0:
    • It has IP 192.168.64.4
    • It has pod subnet 10.200.0.0/24
    • It has a bridge br0 with IP 10.200.0.1/24
    • Pod A0 is running on node-0 with IP 10.200.0.2/24
  • node-1
    • It has IP 192.168.64.5
    • It has pod subnet 10.200.1.0/24
    • It has a bridge br1 with IP 10.200.1.1/24
    • Pod A1 is running on node-1 with IP 10.200.1.2/24

The communication between pods A0 and A1 across nodes is described in the diagram below.

[Diagram: pod-to-pod communication across nodes using static routes between node bridges]

4.1.1. Create Pods, Bridges, and veth Pairs

First, let's create pod A0 in node-0 and pod A1 in node-1. We will use the same overlay filesystem structure as before.

# in node-0
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
 
# in node-1
# remember to download the minimal Alpine Linux and extract it into the lower directory
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh

In node-0, let's create a bridge interface and a veth pair for pod A0. We will also assign IP addresses to the bridge and the veth interface.

# in node-0, create bridge, assign IP
ip link add br0 type bridge
ip addr add 10.200.0.1/24 dev br0
ip link set br0 up
 
# create veth pair and connect it to the bridge
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
 
# move veth-a0-c into A0's namespace
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID

In node-1, we also create a bridge interface and a veth pair for pod A1. We will assign IP addresses to the bridge and the veth interface.

# in node-1, create bridge, assign IP
ip link add br1 type bridge
ip addr add 10.200.1.1/24 dev br1
ip link set br1 up
 
# create veth pair and connect it to the bridge
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br1
ip link set veth-a1 up
 
# move veth-a1-c into A1's namespace and assign IP
ip link set veth-a1-c netns $A1_PID

As you may notice, in Solution 2: Bridge-Based Networking we didn't assign an IP address to the bridge interface, because there the bridge was only used for communication between pods on the same node. In this solution, we assign an IP address to the bridge so that it can participate in routing between the pods and the host network namespace. More specifically, later we will configure each pod to route traffic to the bridge by default, allowing the bridge to act as a gateway for traffic destined for pods on other nodes.

Next, we assign IP addresses to the veth interfaces in each pod. For example, assign IP 10.200.0.2/24 for A0 and IP 10.200.1.2/24 for A1.

# in A0, assign IP to veth-a0-c
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
 
# in A0, verify that it can reach the bridge br0
ping 10.200.0.1
 
 
# in A1, assign IP to veth-a1-c
ip addr add 10.200.1.2/24 dev veth-a1-c
ip link set veth-a1-c up
 
# in A1, verify that it can reach the bridge br1
ping 10.200.1.1

We don't assign an IP to the host's end of the veth pair (veth-a0 and veth-a1) because its primary purpose is to act as a bridge port connecting a container's network namespace to the host's network. It functions as a virtual cable whose role is to forward traffic, not to act as an endpoint with its own IP address.

In pod A0 in node-0, if we run ip route, we should see that pod A0 knows how to reach the pod subnet 10.200.0.0/24 through its veth-a0-c interface. However, it does not know how to route traffic to pod A1 with IP 10.200.1.2/24 in node-1, which is in a different subnet 10.200.1.0/24.

# in A0, verify the route
ip route
# should see
10.200.0.0/24 dev veth-a0-c scope link  src 10.200.0.2

Therefore, we need to configure the default route in each pod to route traffic to the bridge. Otherwise, the pods will not be able to identify the bridge as the next hop for traffic destined for other pods.

# in A0, config to route traffic to the bridge by default
ip route add default via 10.200.0.1
 
# in A1, config to route traffic to the bridge by default
ip route add default via 10.200.1.1

Now, if we run ip route in pod A0, we should see that it has a default route to the bridge br0. When a request destined for pod A1 at 10.200.1.2/24 in node-1 is sent, it will be routed to the bridge br0 at 10.200.0.1/24.

# in A0, verify the route
ip route
# should see
default via 10.200.0.1 dev veth-a0-c
10.200.0.0/24 dev veth-a0-c scope link  src 10.200.0.2

Alright, any request destined for pod A1 at 10.200.1.2/24 in node-1 will now be routed to the bridge br0, but it will not be able to reach pod A1 yet. This is because node-0's host network namespace does not know how to route traffic to the pod subnet 10.200.1.0/24. To solve this, we need to tell the host network namespace to route traffic for the pod subnet 10.200.1.0/24 via node-1's IP address.

# in node-0, tell the kernel to send packets destined for node-1's subnet to node-1's IP
ip route add 10.200.1.0/24 via 192.168.64.5
 
# in node-1, tell the kernel to send packets destined for node-0's subnet to node-0's IP
ip route add 10.200.0.0/24 via 192.168.64.4

We can verify that the node-0's network namespace now has a route to the pod subnet 10.200.1.0/24 via node-1's IP address 192.168.64.5.

# in node-0, verify the route
ip route
# should see
default via 192.168.64.1 dev enp0s1
10.200.0.0/24 dev br0 proto kernel scope link src 10.200.0.1
10.200.1.0/24 via 192.168.64.5 dev enp0s1
192.168.64.0/24 dev enp0s1 proto kernel scope link src 192.168.64.4

On each node, we need to enable IPv4 forwarding.

# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1

IPv4 forwarding allows the host network namespace to forward packets between different network interfaces. For example:

  • Let's send a packet from pod A0 in node-0 to pod A1 in node-1
  • The packet first arrives at the bridge br0 in node-0
  • node-0 looks up the routing table entry that matches the destination subnet of pod A1. That route is 10.200.1.0/24 via 192.168.64.5 dev enp0s1
  • node-0 then forwards the packet out of the enp0s1 interface towards node-1's IP address 192.168.64.5
  • node-1 receives the packet and forwards it to pod A1
  • Without IPv4 forwarding enabled, node-0 will not forward packets between the bridge br0 and the enp0s1 interface. The packet would be dropped because it's not destined for node-0 itself
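
If you want to watch the forwarding happen, a packet capture on node-0 should show the same ICMP packets arriving on br0 and leaving on enp0s1. This is a sketch; it assumes tcpdump is installed on the node and that the uplink interface is named enp0s1 as above.

# in node-0, capture ICMP traffic on the bridge and on the uplink (run each in its own terminal)
tcpdump -ni br0 icmp
tcpdump -ni enp0s1 icmp

# in pod A0, generate traffic towards pod A1
ping 10.200.1.2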

4.1.2. Test Communication

In pod A0 in node-0, start a netcat server to listen on port 8080. In pod A1 in node-1, connect to pod A0 to verify that they can communicate over the network.

# in pod A0, try to ping pod A1
ping 10.200.1.2
 
# create a netcat server in pod A0
nc -lk -p 8080 -e /bin/sh
 
# in pod A1, try to ping pod A0
ping 10.200.0.2
 
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1

In conclusion, we have set up pod-to-pod communication across nodes using routing between bridges. Each node has its own bridge interface, and the host network namespace is configured to route traffic to the appropriate pod subnet via the corresponding node's IP address. This allows pods on different nodes to communicate with each other.

While this solution works for basic direct pod-to-pod communication across nodes, it is a simplified example and not how production Kubernetes clusters handle networking. As mentioned in the previous section, this approach is closest to the Flannel CNI plugin's host-gw mode, but it lacks automatic route management and does not scale well. For example, in a cluster with 1000 nodes, each with its own pod subnet, every single node would need 999 routing table entries just to handle inter-node pod communication. As the cluster grows, the routing tables would become unmanageable and slow. In the next section, we will explore another solution using IP tunneling.

4.2. Solution 2.1: IP-in-IP Tunneling

IP tunneling is a more elegant and scalable solution for pod-to-pod communication across nodes. IP tunneling works by encapsulating packets from one network inside packets that can be routed through another network. In other words, traffic from a pod on Node A destined for a pod on Node B is wrapped in a packet with the destination IP of Node B.

IP-in-IP tunneling is one of the simplest forms of IP tunneling, where an entire IP packet (including headers) is encapsulated as the payload of another IP packet. This creates a tunnel between two endpoints, allowing packets to be routed through networks that might not otherwise be able to route them.

In this IP-in-IP tunneling solution, we assume the same network topology as the previous section:

  • Cluster CIDR is 10.200.0.0/16
  • node-0:
    • It has IP 192.168.64.4
    • It has pod subnet 10.200.0.0/24
    • It has a bridge br0 with IP 10.200.0.1/24
    • Pod A0 is running on node-0 with IP 10.200.0.2/24
  • node-1:
    • It has IP 192.168.64.5
    • It has pod subnet 10.200.1.0/24
    • It has a bridge br1 with IP 10.200.1.1/24
    • Pod A1 is running on node-1 with IP 10.200.1.2/24

The packet flow for IP-in-IP tunneling works as follows:

[Diagram: packet flow for IP-in-IP tunneling between node-0 and node-1]

4.2.1. Create Pods, Bridges, and veth Pairs

First, let's create the same pod and bridge setup as in the previous section:

# in node-0
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
 
# create bridge, assign IP
ip link add br0 type bridge
ip addr add 10.200.0.1/24 dev br0
ip link set br0 up
 
# create veth pair and connect it to the bridge
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
 
# move veth-a0-c into A0's namespace
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID
# in node-1
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
 
# create bridge, assign IP
ip link add br1 type bridge
ip addr add 10.200.1.1/24 dev br1
ip link set br1 up
 
# create veth pair and connect it to the bridge
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br1
ip link set veth-a1 up
 
# move veth-a1-c into A1's namespace
# find A1_PID yourself
ip link set veth-a1-c netns $A1_PID

Assign IP addresses to the veth interfaces in each pod:

# in A0, assign IP to veth-a0-c
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
ip route add default via 10.200.0.1
 
# in A1, assign IP to veth-a1-c
ip addr add 10.200.1.2/24 dev veth-a1-c
ip link set veth-a1-c up
ip route add default via 10.200.1.1

4.2.2. Create IP-in-IP Tunnel Interfaces

Instead of adding static routes, we'll create IP-in-IP tunnel interfaces. IP-in-IP tunneling uses the IPIP protocol (IP protocol number 4).

# in node-0, create an IP-in-IP tunnel interface
ip tunnel add ipip0 mode ipip remote 192.168.64.5 local 192.168.64.4
ip addr add 10.200.0.1/32 dev ipip0
ip link set ipip0 up
# in node-1, create an IP-in-IP tunnel interface
ip tunnel add ipip0 mode ipip remote 192.168.64.4 local 192.168.64.5
ip addr add 10.200.1.1/32 dev ipip0
ip link set ipip0 up

Explain the IP-in-IP tunnel creation command:

  • ip tunnel add ipip0 mode ipip: Create a new IP-in-IP tunnel interface named ipip0
  • remote 192.168.64.5: The remote endpoint IP address for this tunnel
  • local 192.168.64.4: The local IP address used as the source for tunneled packets

Explain the IP address assignment command:

  • ip addr add 10.200.0.1/32 dev ipip0: Assign a point-to-point IP address to the tunnel interface so that each end of the tunnel can communicate with each other
  • The /32 subnet mask indicates a single host address, which is typical for point-to-point links

4.2.3. Configure Routing for IP-in-IP Tunneling

IP-in-IP tunneling relies on IP routing to direct traffic through the tunnel:

# in node-0, add route to reach node-1's pod subnet via the tunnel
ip route add 10.200.1.0/24 dev ipip0
 
# in node-1, add route to reach node-0's pod subnet via the tunnel
ip route add 10.200.0.0/24 dev ipip0

These routes tell the kernel:

  • Any traffic destined for the remote pod subnet should be sent through the IP-in-IP tunnel interface
  • The tunnel interface will automatically encapsulate the packets with the outer IP headers
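
We can sanity-check the tunnel configuration and the new routes from each node. This is a sketch of what to look for; the details of the output vary by kernel and iproute2 version.

# in node-0, inspect the tunnel interface; the details should show the ipip mode with local/remote endpoints
ip -d link show ipip0

# in node-0, verify that traffic for node-1's pod subnet is routed through the tunnel
ip route | grep ipip0
# should show something like: 10.200.1.0/24 dev ipip0 scope link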

Also ensure IPv4 forwarding is enabled on both nodes:

# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1

When pod A0 sends a packet to pod A1:

  1. Pod A0 sends a packet destined for 10.200.1.2 (pod A1)
  2. The packet travels through veth-a0-c → veth-a0 → br0
  3. Bridge br0 forwards the packet to ipip0 based on the routing table entry
  4. IP-in-IP interface ipip0 encapsulates the packet:
    • Adds outer IP header with source 192.168.64.4 and destination 192.168.64.5
    • Uses IP protocol 4 (IPPROTO_IPIP) to indicate IP-in-IP encapsulation
    • Adds Ethernet header for the physical network
  5. The encapsulated packet travels over the physical network from node-0 to node-1
  6. Node-1 receives the packet on its enp0s1 interface and recognizes protocol 4
  7. IP-in-IP interface ipip0 on node-1 decapsulates the packet:
    • Removes the outer IP and Ethernet headers
    • Forwards the original inner packet to br1
  8. Bridge br1 forwards the packet to veth-a1 → veth-a1-c → pod A1
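
To confirm the encapsulation on the wire, we can capture IP protocol 4 traffic on the physical interface while pinging from pod A0. This is a sketch assuming tcpdump is available and the uplink is enp0s1.

# in node-0, capture IP-in-IP packets on the uplink
tcpdump -ni enp0s1 'ip proto 4'

# in pod A0
ping 10.200.1.2
# the capture should show packets between 192.168.64.4 and 192.168.64.5 carrying the inner 10.200.x.x addresses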

4.2.4. Test Communication

Now let's test if pods can communicate across nodes using IP-in-IP tunneling:

# in pod A0, try to ping pod A1
ping 10.200.1.2
 
# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
 
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1

In conclusion, we have set up pod-to-pod communication across nodes using IP-in-IP tunneling. Because the tunnel encapsulates pod packets inside packets addressed between the nodes themselves, it avoids problems such as:

  • When configuring route add 10.200.0.0/24 via 192.168.64.4, our network admin says "NO! We don't allow custom pod subnets in our corporate network!"
    • With IP-in-IP tunneling, the packets are encapsulated with outer IP headers, making them look like regular node-to-node traffic to the network
  • When our Kubernetes cluster spans different cloud regions with NAT gateways between them, running ping 10.200.1.2 fails because the target pod is in a different region and the NAT gateway doesn't know how to route the pod subnet 10.200.1.0/24
    • With IP-in-IP tunneling, the command works because the NAT gateway only sees normal traffic between known node IPs

However, IP-in-IP tunneling does not solve the scalability problem of routing table entries as mentioned in the previous section. We still need to maintain a route for every pod subnet in the cluster, which can become unmanageable as the cluster grows. In the next section, we will explore a more scalable solution using VXLAN tunneling.

4.3. Solution 2.2: VXLAN Tunneling

VXLAN (Virtual Extensible LAN) is one of the most popular tunneling protocols used by Kubernetes CNI plugins like Calico and Flannel. VXLAN creates a Layer 2 overlay network over a Layer 3 infrastructure.

How VXLAN Solves the Scalability Problem

Unlike static routing or IP-in-IP tunneling, VXLAN provides true scalability through:

  • Single Virtual Network: All nodes participate in one large Layer 2 network segment identified by a VXLAN Network Identifier (VNI)
  • Dynamic MAC Learning: VXLAN can automatically learn MAC-to-IP mappings without manual configuration
  • Single Route Entry: Only one route entry is needed regardless of cluster size
  • Multicast-based Discovery: Nodes can discover each other automatically using multicast

In this solution, we assume the same network topology as the previous section:

  • Cluster CIDR is 10.200.0.0/16
  • node-0:
    • It has IP 192.168.64.4
    • It has pod subnet 10.200.0.0/24
    • It has a bridge br0 with IP 10.200.0.1/24
    • Pod A0 is running on node-0 with IP 10.200.0.2/24
  • node-1:
    • It has IP 192.168.64.5
    • It has pod subnet 10.200.1.0/24
    • It has a bridge br1 with IP 10.200.1.1/24
    • Pod A1 is running on node-1 with IP 10.200.1.2/24

4.3.1. Create Pods, Bridges, veth Pairs, and VXLAN Interfaces

First, let's create the same pod and bridge setup as in the previous section. Then, instead of adding static routes, we'll create VXLAN tunnel interfaces with multicast support for automatic discovery.

# in node-0, create a VXLAN interface with multicast group
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.4 dstport 4789 dev enp0s1
ip link set vxlan0 master br0
ip link set vxlan0 up
 
# in node-1, create a VXLAN interface with the same multicast group
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.5 dstport 4789 dev enp0s1
ip link set vxlan0 master br1
ip link set vxlan0 up

Explain the scalable VXLAN creation command:

  • ip link add vxlan0 type vxlan: Create a new VXLAN interface named vxlan0
  • id 100: VXLAN Network Identifier (VNI) - a unique identifier for this VXLAN segment
  • group 239.1.1.1: Multicast group for automatic neighbor discovery (replaces manual FDB entries)
  • local 192.168.64.4: The local IP address used as the source for VXLAN tunneled packets
  • dstport 4789: The destination UDP port for VXLAN traffic (4789 is the standard VXLAN port)
  • dev enp0s1: The physical interface to send the encapsulated packets through
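
After creating the interface, we can confirm its VXLAN parameters with the -d (details) flag. A sketch of what to expect; the exact output format varies by iproute2 version.

# in node-0, show VXLAN details
ip -d link show vxlan0
# the details should include something like: vxlan id 100 group 239.1.1.1 local 192.168.64.4 dev enp0s1 ... dstport 4789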

How the Multicast Discovery Works:

When pod A0 needs to communicate with pod A1:

  1. Pod A0 sends an ARP request for pod A1's IP (10.200.1.2)
  2. Since pod A1's MAC address is unknown, the VXLAN interface sends the ARP request to the multicast group (239.1.1.1)
  3. All nodes in the multicast group receive this ARP request
  4. node-1 (which hosts pod A1) responds with pod A1's MAC address
  5. node-0's VXLAN interface automatically learns: pod A1's MAC → node-1's IP (192.168.64.5)
  6. This mapping is stored in the FDB, enabling direct communication for future packets
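
If you're curious, you can also watch the VXLAN-encapsulated traffic (including the initial ARP exchange) on the physical interface while pod A0 pings pod A1, as in the verification below. A sketch, assuming tcpdump is installed:

# in node-0, capture VXLAN traffic on the uplink
tcpdump -ni enp0s1 udp port 4789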

Let's verify that no manual FDB entries are needed:

# in node-0, check FDB entries before any communication
bridge fdb show dev vxlan0
# should only show the multicast entry:
3a:08:d1:b9:30:f0 vlan 1 master br0 permanent
3a:08:d1:b9:30:f0 master br0 permanent
00:00:00:00:00:00 dst 239.1.1.1 via enp0s1 self permanent
 
# in A0, ping A1
ping 10.200.1.2
 
# after pod A0 pings pod A1, the FDB will automatically learn the mapping:
# in node-0, check FDB entries again
bridge fdb show dev vxlan0
# should now show learned entry for pod A1:
f6:2e:13:01:62:96 master br0
f6:89:f1:bf:25:77 master br0
3a:08:d1:b9:30:f0 vlan 1 master br0 permanent
3a:08:d1:b9:30:f0 master br0 permanent
00:00:00:00:00:00 dst 239.1.1.1 via enp0s1 self permanent
f6:2e:13:01:62:96 dst 192.168.64.5 self
f6:89:f1:bf:25:77 dst 192.168.64.5 self

Explain MAC addresses in the FDB entries:

  • f6:2e:13:01:62:96: the MAC address of br1 on node-1
  • f6:89:f1:bf:25:77: the MAC address of veth-a1-c in pod A1
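
You can cross-check these MAC addresses against the interfaces they belong to. A quick sketch (your MAC addresses will differ):

# in node-1, show br1's MAC address
ip link show br1

# in pod A1, show veth-a1-c's MAC address
ip link show veth-a1-c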

Why Do We See Duplicate MAC Entries?

Notice that each MAC address appears twice with different suffixes:

f6:2e:13:01:62:96 master br0
f6:2e:13:01:62:96 dst 192.168.64.5 self

This happens because two separate forwarding databases are maintained for the vxlan0 interface:

  1. Bridge FDB (master br0):

    • Maintained by the bridge br0
    • Records: "MAC f6:2e:13:01:62:96 is reachable through the vxlan0 port"
    • Standard Layer 2 bridge learning
  2. VXLAN FDB (dst 192.168.64.5 self):

    • Maintained by the VXLAN interface vxlan0
    • Records: "To reach MAC f6:2e:13:01:62:96, tunnel to node IP 192.168.64.5"
    • VXLAN-specific tunnel endpoint mapping

Why the Packet Flow Requires Both Entries:

  1. Pod A0 → Bridge: Packet destined for f6:2e:13:01:62:96 arrives at bridge br0
  2. Bridge Lookup: Bridge checks master br0 entries and finds MAC is reachable via vxlan0 port
  3. VXLAN Lookup: VXLAN interface checks self entries to find tunnel destination 192.168.64.5
  4. Encapsulation: VXLAN encapsulates packet and sends to node-1 at 192.168.64.5

4.3.2. Configure Scalable Routing

The key to VXLAN's scalability is that we only need one route entry regardless of cluster size. This single route entry covers ALL pod subnets in the cluster. This automatic learning is what makes VXLAN truly scalable.

# in node-0, add single route for entire cluster CIDR
ip route add 10.200.0.0/16 dev br0
 
# in node-1, add single route for entire cluster CIDR
ip route add 10.200.0.0/16 dev br1

Also ensure IPv4 forwarding is enabled on both nodes:

# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1

4.3.3. Test Communication

Now let's test if pods can communicate across nodes using VXLAN tunneling:

# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
 
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1

4.3.4. Test Adding a New Node

In static routing and IP-in-IP tunneling, adding node-2 requires updating every existing node:

# must run on ALL existing nodes when adding node-2
ip route add 10.200.2.0/24 via 192.168.64.6   # Static routing
ip route add 10.200.2.0/24 dev ipip0          # IP-in-IP tunneling

In VXLAN, adding node-2 requires zero configuration on existing nodes:

# only run on the NEW node-2
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.6 dstport 4789 dev enp0s1
ip link set vxlan0 master br2
ip link set vxlan0 up
ip route add 10.200.0.0/16 dev br2
 
# Existing nodes automatically discover node-2 through multicast

4.4. Routing Solutions Comparison

| Aspect | Static Routing | IP-in-IP | VXLAN |
| --- | --- | --- | --- |
| Encapsulation Protocol | None | IP (Protocol 4) | UDP (Port 4789) |
| Overhead | None | ~20 bytes (IP header only) | ~50 bytes (UDP + VXLAN headers) |
| OSI Layer | Layer 3 (Network) | Layer 3 (Network) | Layer 2 (Data Link) |
| Routes per Node | O(N) | O(N) | O(1) |
| Adding New Node | Update all existing nodes | Update all existing nodes | Zero config on existing nodes |
| FDB Entries | N/A | N/A | Dynamic learning |
| Complexity | Simple | Medium | More complex |
| Scalability | Poor | Poor | Excellent |
| CNI Examples | Flannel host-gateway | Calico IPIP mode | Flannel VXLAN, Calico VXLAN |

In conclusion, VXLAN provides a truly scalable solution for pod-to-pod communication across nodes in Kubernetes. While static routing and IP-in-IP tunneling both require O(N) configuration entries that grow linearly with cluster size, VXLAN achieves O(1) scalability through:

  1. Single route entry regardless of cluster size
  2. Automatic MAC learning through multicast discovery
  3. Dynamic FDB population without manual configuration
  4. Zero-touch node addition - new nodes are automatically discovered

This makes VXLAN the preferred choice for large-scale production Kubernetes clusters, despite its slightly higher network overhead compared to IP-in-IP tunneling.

5. Pod-to-Service Communication

The previous sections focused on pod-to-pod communication. In this section, we will explore how pods communicate with services in Kubernetes, specifically how kube-proxy uses iptables to implement load balancing for services. Before we dive into the solution, let's review some Linux networking concepts and the role of iptables in load balancing.

5.1. The main routing stack in Linux

The main routing stack in Linux is made up of several interconnected components:

  • Routing Table: the core of the routing decision-making. The kernel uses the routing table to determine the outgoing interface and gateway for a packet based on its destination IP address. We can view this with the ip route command
  • Netfilter Framework: a framework within the Linux kernel that provides a flexible and powerful way to handle network packets. It consists of multiple tables (filter, nat, mangle, raw, security) and built-in chains (PREROUTING, INPUT, OUTPUT, FORWARD, POSTROUTING) where rules are placed. iptables is a user-space utility program in Linux used to configure the Linux kernel's firewall, which is implemented as Netfilter modules
    • A chain refers to a sequence of defined rules within the iptables system. Each chain is a list of rules which can match a set of packets. Each rule specifies what to do with a packet that matches. This is called a target, which may be a jump to a user-defined chain in the same table
  • Connection Tracking (conntrack): a kernel module that keeps a record of all active connections. It is critical for features like NAT and stateful firewalls, allowing the kernel to identify a packet as part of an existing conversation and process it accordingly
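
Each of these components can be inspected from the command line. A quick sketch; the conntrack tool comes from the conntrack-tools package and may not be installed by default.

# the routing table
ip route show

# the netfilter rules in the nat table
iptables -t nat -L -n -v

# the active tracked connections
conntrack -L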

5.2. Use nat table in iptables for load balancing

There are currently five independent tables in the iptables system: filter, nat, mangle, raw, and security. Each table contains a set of built-in chains, which are lists of rules that match packets and specify actions to take on them. The nat table is used for Network Address Translation (NAT) operations, such as modifying the source or destination IP address of packets.

The NAT table is consulted when a packet that creates a new connection is encountered. It consists of four built-in chains:

  • PREROUTING: for altering packets as soon as they come in
  • INPUT: for altering packets destined for local sockets
  • OUTPUT: for altering locally-generated packets before routing
  • POSTROUTING: for altering packets as they are about to go out

The nat table is where network address translation happens. Load balancing is fundamentally a form of address translation, where the destination IP (the Kubernetes service's ClusterIP) is rewritten to the IP of a specific backend pod. This is handled by the DNAT (Destination NAT) target in the PREROUTING chain.

The nat table is the only iptables table that can directly perform load balancing. We can also potentially use the mangle table in combination with other tools to achieve a similar effect. The filter, raw, and security tables are not suitable for load balancing.

Besides DNAT, there is also SNAT (Source NAT). For more information about iptables and SNAT, refer to this resource.

5.3. CNI Plugins and kube-proxy

There is a clear separation of concerns in Kubernetes networking. CNI Plugins (eg. Flannel, Calico, Cilium) are responsible for pod networking. When a pod is created, the CNI plugin does the following:

  • Create a network namespace for the pod
  • Assign a unique IP address to the pod from a defined cluster subnet
  • Set up the pod's network interface (eg. a veth pair)
  • Ensure that traffic can be routed between pods, even on different nodes

In Kubernetes, a Service is an abstraction that defines a logical set of pods and a policy by which to access them. A Kubernetes Service provides a stable IP address and DNS name that can be used to access the pods, even if the underlying pods change over time.

kube-proxy is responsible for service networking. It creates the iptables rules (or IPVS rules) that intercept traffic destined for a service's ClusterIP. These rules perform DNAT, rewriting the service's ClusterIP to the IP of one of the healthy backend pods.
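
On a real cluster running kube-proxy in iptables mode, you can see these generated chains for yourself. A sketch, assuming shell access to a node; the KUBE-SVC and KUBE-SEP chain names are hashed per Service and endpoint.

# dump the nat table and show the kube-proxy service/endpoint chains
iptables-save -t nat | grep -E 'KUBE-(SERVICES|SVC|SEP)' | head -n 20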

Some advanced CNI plugins, like Cilium, can replace kube-proxy entirely by using a more efficient technology called eBPF. In such cases, the CNI plugin itself handles service load balancing, but it's not using iptables in the traditional sense.

5.4. Implementation

In a Kubernetes cluster, the service-cluster-ip-range option defines the Classless Inter-Domain Routing (CIDR) block from which IP addresses are allocated to Services within a cluster. These are known as ClusterIPs, and they provide a stable virtual IP address for a Service. The service-cluster-ip-range must be mutually exclusive with other IP ranges used within the cluster, such as the Pod CIDR range and the IP addresses of the cluster nodes, to prevent IP conflicts.

In this solution, we will create a Kubernetes Service that load balances traffic across multiple pods running on different nodes. We will use iptables to implement the load balancing logic. The Service will have a stable IP address that can be used to access the pods, and iptables will be used to route traffic to the appropriate pod based on the load balancing rules.

This solution assumes:

  • Cluster CIDR is 10.200.0.0/16
  • node-0:
    • It has IP 192.168.64.4
    • It has pod subnet 10.200.0.0/24
    • It has a bridge br0 with IP 10.200.0.1/24
    • Pod A0 is running in node-0 with IP 10.200.0.2/24
    • Pod A1 is running in node-0 with IP 10.200.0.3/24
  • node-1:
    • It has IP 192.168.64.5
    • It has pod subnet 10.200.1.0/24
    • It has a bridge br1 with IP 10.200.1.1/24
    • Pod B is running in node-1 with IP 10.200.1.2/24
  • node-2:
    • It has IP 192.168.64.6
    • It has pod subnet 10.200.2.0/24
    • It has a bridge br2 with IP 10.200.2.1/24
    • Pod C is running in node-2 with IP 10.200.2.2/24
  • The Kubernetes Service CIDR is 10.96.0.0/12
    • There is one Kubernetes Service named KUBE-SVC-1 with IP 10.96.0.2. This Service is configured to load balance traffic across pods A0 and B, to simulate a scenario where the Service has multiple endpoints across different nodes

The diagram below illustrates the flow of packets when pod C sends traffic to the Service KUBE-SVC-1 at IP 10.96.0.2 on port 8080. The traffic is load balanced across pods A0 and B, which are running on node-0 and node-1 respectively.

[Diagram: traffic from pod C to Service KUBE-SVC-1 load balanced across pods A0 and B]

In this solution, we will test these three cases:

  • Case 1: send traffic from pod C to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is not one of the Service's endpoints and lives in a different subnet (different node) from the Service's endpoints
  • Case 2: send traffic from pod A1 to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is not one of the Service's endpoints but lives in the same subnet (same node) as one of the Service's endpoints
  • Case 3: send traffic from pod A0 to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is itself one of the Service's endpoints

On each node, let's create the corresponding pods, bridges, and veth pairs as we did in the previous sections. For simplicity, we will connect the nodes using the static routing approach from section 4.1.

On each node, add the following routes to enable communication between the pod subnets. This will allow traffic destined for the pod subnets to be routed through the respective node's IP address.

# in node-0
ip route add 10.200.1.0/24 via 192.168.64.5
ip route add 10.200.2.0/24 via 192.168.64.6
 
# in node-1
ip route add 10.200.0.0/24 via 192.168.64.4
ip route add 10.200.2.0/24 via 192.168.64.6
 
# in node-2
ip route add 10.200.0.0/24 via 192.168.64.4
ip route add 10.200.1.0/24 via 192.168.64.5

Let's verify that the routes are set up correctly on each node.

# in A0, ping A1, B, and C
ping 10.200.0.3
ping 10.200.1.2
ping 10.200.2.2
 
# in A1, ping A0, B, and C
ping 10.200.0.2
ping 10.200.1.2
ping 10.200.2.2
 
# in B, ping A0, A1, and C
ping 10.200.0.2
ping 10.200.0.3
ping 10.200.2.2
 
# in C, ping A0, A1, and B
ping 10.200.0.2
ping 10.200.0.3
ping 10.200.1.2

Now, we will create a Kubernetes Service named KUBE-SVC-1, which in our setup is just a custom chain in the nat table of iptables. We will then add rules to the KUBE-SVC-1 chain to load balance traffic across pods A0 and B.

# create a custom iptables chain for our service, call it KUBE-SVC-1
iptables -t nat -N KUBE-SVC-1
 
# add a rule to the PREROUTING chain to send all traffic for the Service IP to this new custom chain
iptables -t nat -A PREROUTING -d 10.96.0.2/32 -p tcp --dport 8080 -j KUBE-SVC-1
# verify
iptables -t nat -L PREROUTING -v -n --line-numbers
 
# add rule 1: redirects 50% of the traffic to Pod A0
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.200.0.2:8080
# add rule 2: redirects the remaining 50% of traffic to Pod B
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -j DNAT --to-destination 10.200.1.2:8080
# verify
iptables -t nat -L KUBE-SVC-1 -v -n --line-numbers
 
# verify all rules
iptables -L -v -n --line-numbers

Explain the command iptables -t nat -N KUBE-SVC-1:

  • -t nat: Specify that we are working with the nat table, which is used for Network Address Translation
  • -N KUBE-SVC-1: Create a new chain named KUBE-SVC-1 in the nat table. This chain will be used to define rules for handling traffic destined for the Kubernetes Service IP 10.96.0.2/32

Explain the command iptables -t nat -A PREROUTING -d 10.96.0.2/32 -p tcp --dport 8080 -j KUBE-SVC-1:

  • -A PREROUTING: Append a rule to the PREROUTING chain, which is the first chain that packets traverse when they arrive at the system
  • -d 10.96.0.2/32 -p tcp --dport 8080: Specify that this rule applies to packets destined for the IP address 10.96.0.2/32 on TCP port 8080
  • -j KUBE-SVC-1: Jump to the KUBE-SVC-1 chain, where the actual load balancing rules are defined

Explain rule 1 command iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.200.0.2:8080:

  • -A KUBE-SVC-1: Append a rule to the KUBE-SVC-1 chain
  • -p tcp --dport 8080: Specify that this rule applies to TCP packets destined for port 8080
  • -m statistic --mode random --probability 0.5: Use the statistic module to randomly select 50% of the packets that match this rule
  • -j DNAT --to-destination 10.200.0.2:8080: Perform Destination Network Address Translation (DNAT) on the selected packets, changing their destination IP address to 10.200.0.2:8080, which is the IP address of pod A0

Explain rule 2 command iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -j DNAT --to-destination 10.200.1.2:8080:

  • -A KUBE-SVC-1: Append a rule to the KUBE-SVC-1 chain
  • -p tcp --dport 8080: Specify that this rule applies to TCP packets destined for port 8080
  • -j DNAT --to-destination 10.200.1.2:8080: Perform Destination Network Address Translation (DNAT) on the remaining packets, changing their destination IP address to 10.200.1.2:8080, which is the IP address of pod B
  • Note: for a Service with N pods, we will set the probabilities in a cascading manner: 1/N, 1/(N-1), 1/(N-2), etc., down to 1/1 for the last pod
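
As an illustration, here is a sketch of how the cascading probabilities would look for a hypothetical Service with three endpoints. The chain name KUBE-SVC-EXAMPLE and the POD1_IP/POD2_IP/POD3_IP placeholders are made up for this example.

# hypothetical chain with three endpoints; replace the POD*_IP placeholders with real pod IPs
iptables -t nat -N KUBE-SVC-EXAMPLE
# rule 1: 1/3 of new connections go to the first endpoint
iptables -t nat -A KUBE-SVC-EXAMPLE -p tcp --dport 8080 -m statistic --mode random --probability 0.3333 -j DNAT --to-destination ${POD1_IP}:8080
# rule 2: 1/2 of the remaining connections go to the second endpoint
iptables -t nat -A KUBE-SVC-EXAMPLE -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination ${POD2_IP}:8080
# rule 3: everything left goes to the last endpoint
iptables -t nat -A KUBE-SVC-EXAMPLE -p tcp --dport 8080 -j DNAT --to-destination ${POD3_IP}:8080
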
Show me the Kubernetes code

In pkg/proxy/iptables/proxier.go, syncProxyRules() is where all of the iptables-save/restore calls happen. It's called by OnServiceSynced() and OnEndpointSlicesSynced().


In syncProxyRules(), it calls proxier.writeServiceToEndpointRules() to create DNAT rules for service-to-endpoint mapping.

5.5. Test Load Balancing

In pod A0 and pod B, start a netcat server to listen on port 8080.

# in A0
nc -lk -p 8080 -e /bin/sh
# in B
nc -lk -p 8080 -e /bin/sh

Case 1: Send traffic from pod C to the Service KUBE-SVC-1 using nc -v 10.96.0.2 8080. Pod C is running on node-2, which doesn't have any endpoints of the Service KUBE-SVC-1. Break the connection and retry a few times; we should see traffic reach pod A0 and pod B roughly equally.

Case 2: Send traffic from pod A1. Pod A1 is running on node-0, which has one endpoint of the Service KUBE-SVC-1 (pod A0). Break the connection and retry a few times; we should see traffic reach pod A0 and pod B roughly equally.

For this case, Kubernetes has a feature called Topology Aware Routing and the internalTrafficPolicy: Local setting. These features can change the default behavior to prefer routing traffic to pods on the same node or in the same availability zone.

Case 3: Send traffic from pod A0. Pod A0 is itself one endpoint of the Service KUBE-SVC-1. Break the connection and retry a few times; we should see traffic reach pod B only.

# in node-0, enter pod A0's network namespace
# find A0_PID yourself
nsenter -t $A0_PID -a chroot /root/tung/a0-merged /bin/sh
 
# in the new shell, send traffic to the Service
nc -v 10.96.0.2 8080

For this case, when a pod sends traffic to a service it's a part of, the traffic will be routed to a different, random pod within that service, not back to itself. This is a behavior known as hairpinning.

Hairpinning is when a pod's traffic goes out to the service's ClusterIP and then back in to a pod. In Kubernetes, when pod A0 sends traffic to KUBE-SVC-1, kube-proxy will randomly select one of two pods A0 and B as the destination. Because the goal is to load balance, it's highly unlikely that it would route the traffic back to pod A0 itself. This ensures that the load is distributed and prevents a single pod from becoming overwhelmed by its own requests, which could lead to deadlocks or other issues. We could change this default behavior using the hairpinMode option.

In conclusion, we have set up a Kubernetes Service that load balances traffic across multiple pods. We use iptables to create a custom chain for the Service and add rules to load balance traffic between the pods. This allows us to distribute traffic across multiple pods, providing high availability and scalability for our applications.

I will leave the setup that simulates a scenario where the Service has multiple endpoints in the same node for you to explore because I'm very lazy now.

6. Conclusion

This post has explored the fundamental concepts of Kubernetes pod networking, including pod-to-pod communication on the same node and across nodes. We discussed two solutions for pod-to-pod communication on the same node: one using veth pairs and another using bridges. We also set up pod-to-pod communication across nodes using static routing, IP-in-IP tunneling, and VXLAN tunneling, and implemented load balancing for pod-to-service communication using iptables.

This post didn't cover all the details of Kubernetes networking, such as how to route traffic to external destinations (eg. the internet) using SNAT, how to configure network policies, or how to troubleshoot networking issues in Kubernetes. However, it provides a solid foundation for understanding how pods communicate with each other in a Kubernetes cluster.