
1. Introduction
I'm quite overwhelmed by the complexity of Kubernetes networking. There are many concepts and technologies involved, such as network namespaces, veth pairs, bridges, iptables, load balancing, CNI plugins, and more. I've searched the internet, but I haven't found a comprehensive article that demonstrates how we could use Linux commands to set up Kubernetes pod networking from scratch, building up from the basic concepts to the advanced features. I think this is a great opportunity to write a blog post that fills this gap.
In this post, I will show you how to implement Kubernetes pod networking using Linux commands with minimal dependencies. I will also provide the necessary background knowledge and concepts to help you understand the topic better. However, there are still some prerequisites that I assume you already know:
- Familiar with basic networking concepts (eg. the OSI model)
- Familiar with Linux namespaces and Linux virtual filesystems (eg. procfs, overlayfs)
- How Kubernetes components work, especially kubelet and kube-proxy
- How to create multiple Linux containers sharing the same namespaces using the unshare, chroot, and nsenter commands. Check my previous post to learn more
In the following sections, we will use Linux commands to implement:
- Container-to-container communication in the same pod via the loopback interface
- Pod-to-pod communication on the same node using either veth pairs or bridge-based solutions
- Pod-to-pod communication across nodes using static routing, IP-in-IP tunneling, or VXLAN tunneling
- Pod-to-service communication using Network Address Translation (NAT) technology
Let's start with how containers in the same pod could communicate with each other.
2. Container-to-Container Communication in the Same Pod
This section describes how containers within the same pod communicate with each other. We will configure containers in the same pod to share the same network namespace, which allows them to communicate over localhost, the loopback interface.
2.1. Create Pause Container
In Kubernetes, the Pause container is a special container that maintains cgroups and namespaces for the pod. It is responsible for providing a shared network namespace for all other containers in the pod. This solution assumes that we have one pod with one Pause container and two application containers A0 and A1.
Let's set up an overlay filesystems for the containers. We will create a directory structure to hold the lower and upper layers of the overlay filesystem, and then extract a minimal Alpine Linux root filesystem into it. Next, we will create overlay mounts for each container in the pod. Each container will have its own upper layer, while sharing the same lower layer.
mkdir -p /root/tung/{lower,pause-upper,pause-work,pause-merged,a0-upper,a0-work,a0-merged,a1-upper,a1-work,a1-merged}
cd /root/tung
# You may want to change to your OS architecture, eg. `x86_64` or `aarch64`
wget https://dl-cdn.alpinelinux.org/alpine/v3.20/releases/aarch64/alpine-minirootfs-3.20.3-aarch64.tar.gz
tar -xzf alpine-minirootfs-3.20.3-aarch64.tar.gz -C lower
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/pause-upper,workdir=/root/tung/pause-work /root/tung/pause-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
We will start the Pause container using the unshare command to create a new set of namespaces.
unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
# in pause container, mount the proc filesystem
mount -t proc proc /proc
2.2. Create Application Containers
To interact with the Pause container, we first need to find its PID. This is the process that was forked from the unshare command. Next, we will use nsenter to enter the Pause container's namespaces and start the application containers A0 and A1, which will share the same network namespace as the Pause container.
# in host, find the PAUSE process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
# for example, let's say the output is:
root 535 0.0 0.0 5260 820 pts/0 S 01:37 0:00 unshare -Cunimpf chroot /root/tung/pause-merged /bin/sh
root 536 0.0 0.0 1816 1016 pts/0 S+ 01:37 0:00 /bin/sh
# we can see that the PID of the Pause container is 536
PAUSE_PID=<pause-pid>
# in host, create application containers A0 by entering the Pause container's namespaces
# and changing the root directory to the overlay filesystem for A0
nsenter -t $PAUSE_PID -a chroot /root/tung/a0-merged /bin/sh
# in host, in a new terminal, similarly, create application containers A1
nsenter -t $PAUSE_PID -a chroot /root/tung/a1-merged /bin/sh
Inside each application container, we can verify that they share the same network namespace as the Pause container by checking the /proc/self/ns/net symlink. This symlink should point to the same network namespace ID for all containers in the pod.
# in each container, mount the proc filesystem to access process information
mount -t proc proc /proc
# in each container, check the network namespace
ls -l /proc/self/ns/net
# All should point to the same net:[ID], eg. net:[4026532321]
2.3. Enable Loopback Interface
By default, when we use unshare or create a new network namespace, the loopback interface is down. We can verify this by checking the network interfaces inside the Pause container.
# in PAUSE, list all network interface
ip link
# should see
1: lo: <LOOPBACK> mtu 65536 qdisc noop state DOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# check network interface config
ifconfig
# should see empty
Let's bring the loopback interface up so that containers A0 and A1 can communicate with each other using localhost.
# in pause container
ip link set lo up
Explain the command:
- ip link set: bring a network interface up or down
- lo: the loopback interface
- up: bring it up
Show me the containerd code
In internal/cri/server/sandbox_run.go, in RunPodSandbox(): it calls c.setupPodNetwork(),
which then calls c.bringUpLoopback().
In internal/cri/server/sandbox_run_linux.go, in c.bringUpLoopback(), it calls netlink.LinkSetUp().
Note: runc also implements the loopback interface setup,
which may be used by containerd or other container runtimes.
Let's verify that the loopback interface is now up.
# in pause container, verify
ip link
# should see
1: lo: <LOOPBACK,UP,LOWER_UP> mtu 65536 qdisc noqueue state UNKNOWN qlen 1000
link/loopback 00:00:00:00:00:00 brd 00:00:00:00:00:00
# check network interface config
ifconfig
# should see
lo Link encap:Local Loopback
inet addr:127.0.0.1 Mask:255.0.0.0
inet6 addr: ::1/128 Scope:Host
UP LOOPBACK RUNNING MTU:65536 Metric:1
RX packets:0 errors:0 dropped:0 overruns:0 frame:0
TX packets:0 errors:0 dropped:0 overruns:0 carrier:0
collisions:0 txqueuelen:1000
RX bytes:0 (0.0 B) TX bytes:0 (0.0 B)
2.4. Test Communication
In container A0, start a netcat server to listen on port 8080 and execute a shell when a connection is made. In container A1, connect to container A0 to verify that they can communicate over localhost.
# in container A0
nc -lk -p 8080 -e /bin/sh
# in container A1
nc -v localhost 8080
# should see a shell prompt in container A1
Explain the command nc -lk -p 8080 -e /bin/sh:
- nc: the netcat command, a networking utility
- -l: Listen mode, for incoming connections
- -k: Keep the server running after a connection is closed. In my nc version, the -k flag only works if we specify a program to run with the -e flag
- -p 8080: Specify the port to listen on
- -e /bin/sh: Execute a shell when a connection is made
Explain the command nc -v localhost 8080:
- -v: Verbose mode, to show connection details
- localhost: Connect to the local loopback interface
- 8080: The port to connect to
In conclusion, we have set up a pod with two containers that can communicate with each other over localhost. This is achieved by sharing the same network namespace through the Pause container, which allows both containers to access the loopback interface and communicate using standard networking tools like netcat. This is similar to how containers in a Kubernetes pod communicate with each other in a real-world scenario.
3. Pod-to-Pod Communication on the Same Node
In this section, we will explore how pods communicate with each other on the same node. The technology used for this is typically a virtual Ethernet (veth) pair or a bridge-based networking solution. From this section onward, we won't have multiple containers in the same pod. Hence, we will use the term pod to refer to a group of containers that share the same network namespace, similar to how Kubernetes pods work.
3.1. Solution 1: Direct veth Pair
veth stands for virtual Ethernet and is a pair of virtual network interfaces that are connected to each other. Think of a veth pair as a virtual Ethernet cable directly connecting two network namespaces. Each end of the veth pair is in a different network namespace, allowing them to communicate with each other as if they were connected by a physical Ethernet cable.
When one end of a veth pair sends a packet, it appears on the other end as if it were received from a physical network interface. veth operates at the Data Link layer (Layer 2) of the OSI model, which means it can carry Ethernet frames between network namespaces. While veth interfaces operate at Layer 2, they can be used in conjunction with Layer 3 (IP addresses, routing) to establish more complex network topologies. For example, we can assign IP addresses to veth interfaces and configure routes to enable communication between different network namespaces or containers, even if they are in different subnets.
The solution 1's idea is to create a virtual Ethernet (veth) pair for each pair of pods that need to communicate with each other. Each pod will have one end of the veth pair, allowing them to communicate directly.
In this solution, we assume:
- Cluster CIDR is 10.200.0.0/16
- node-0, where pod A0 and pod A1 are running, has pod subnet 10.200.0.0/24
- Pod A0 has IP 10.200.0.2/24
- Pod A1 has IP 10.200.0.3/24
The communication between pods A0 and A1 is described in the diagram below.
Who assigns pod subnets to nodes and IP addresses to pods?
In Kubernetes, the Kubernetes controller manager is responsible for IP address management (IPAM) at the cluster level. For each new node joining the cluster, it chooses an unused subnet from the Cluster CIDR and assigns this unique subnet to the new node. The controller manager then records this assignment in the etcd database, making it available to all other cluster components.
Based on the pod subnet allocated to the node, the CNI plugin will assign IP addresses to pods. The CNI plugin knows which IPs are available by maintaining an IPAM system, which contains a local IPAM database (eg. a file, directory, or in-memory store).
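As a concrete illustration, the reference host-local IPAM plugin typically persists its allocations as plain files on the node. The path and the network name mynet below are assumptions for illustration; they depend on your CNI configuration:
# on a node using the host-local IPAM plugin, list allocated IPs for a network named mynet
ls /var/lib/cni/networks/mynet/
# each file is named after an allocated pod IP and records which container owns it
cat /var/lib/cni/networks/mynet/10.200.0.2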
3.1.1. Create Pods and a veth Pair
First, let's create two pods A0 and A1 in two different network namespaces. We will use the same overlay filesystem structure as before.
# in a new terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
# in another terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
Next, let's create a veth pair with two ends named veth-a0 and veth-a1.
# in host
ip link add veth-a0 type veth peer name veth-a1
Explain the ip link command:
- ip link add: Command to create a new network interface
- veth-a0: Name of the first end of the veth pair
- type veth: Specifies that the interface is a virtual Ethernet interface
- peer name veth-a1: Specifies the name of the second end of the veth pair
Let's verify that the veth pair has been created.
# in host
ip link
# should see
3: veth-a1@veth-a0: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether 3a:24:90:cb:cd:83 brd ff:ff:ff:ff:ff:ff
4: veth-a0@veth-a1: <BROADCAST,MULTICAST,M-DOWN> mtu 1500 qdisc noop state DOWN mode DEFAULT group default qlen 1000
link/ether de:68:63:b1:1b:4c brd ff:ff:ff:ff:ff:ff
Now, we need to move the veth interfaces to the corresponding pods' network namespaces. We will use the ip link set command to do this. First, we need to find the PIDs of the A0 and A1 processes that were forked from the unshare command.
# in host, find the A0 process's PID, the process that is forked from the unshare command
ps aux | grep /bin/sh
# for example, let's say the output is:
root 4304 0.0 0.0 5260 816 pts/1 S 09:48 0:00 unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
root 4305 0.0 0.0 1828 1212 pts/1 S+ 09:48 0:00 /bin/sh
root 4328 0.0 0.0 5260 804 pts/2 S 09:49 0:00 unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
root 4329 0.0 0.0 1820 1160 pts/2 S+ 09:49 0:00 /bin/sh
root 4331 0.0 0.0 6088 1948 pts/0 S+ 09:49 0:00 grep /bin/sh
# we can see that the PID of the pod A0 is 4305 and the pod A1 is 4329
A0_PID=<a0-pid>
A1_PID=<a1-pid>
# move the veth interfaces to the corresponding pods' network namespaces
ip link set veth-a0 netns $A0_PID
ip link set veth-a1 netns $A1_PID
In host, if we run ip link, we will see that the veth interfaces have disappeared from the host's network namespace because they were moved to the corresponding pods' network namespaces.
3.1.2. Assign IP Addresses for veth Interfaces
In order for the pods to communicate with each other via IP addresses, we need to assign an IP address to the veth interface in each pod. We will assign IP addresses in the same subnet, for example, assign IP 10.200.0.2/24 for A0 and IP 10.200.0.3/24 for A1.
# in A0
ip addr add 10.200.0.2/24 dev veth-a0
ip link set veth-a0 up
# in A1
ip addr add 10.200.0.3/24 dev veth-a1
ip link set veth-a1 up
Explain the command ip addr add 10.200.0.2/24 dev veth-a0:
- ip addr add: Command to add an IP address to a network interface
- dev veth-a0: Specifies the network interface to which the IP address should be assigned
We can verify that the IP addresses are assigned correctly by checking the network interfaces in each pod.
# in A0, verify the IP address
ip addr
# should see
4: veth-a0@if3: <BROADCAST,MULTICAST,UP,LOWER_UP,M-DOWN> mtu 1500 qdisc noqueue state UP qlen 1000
link/ether de:68:63:b1:1b:4c brd ff:ff:ff:ff:ff:ff
inet 10.200.0.2/24 scope global veth-a0
valid_lft forever preferred_lft forever
inet6 fe80::dc68:63ff:feb1:1b4c/64 scope link
valid_lft forever preferred_lft forever
3.1.3. Test Communication
In pod A0, start a netcat server to listen on port 8080. In pod A1, connect to pod A0 to verify that they can communicate over the IP addresses assigned to the veth interfaces.
# in pod A0, try to ping pod A1
ping 10.200.0.3
# in pod A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
# in pod A1, try to ping pod A0
ping 10.200.0.2
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1
We have just set up pod-to-pod communication on the same node using a veth pair. Each pod has one end of the veth pair, allowing them to communicate directly with each other using IP addresses. However, this solution has some limitations. For example, it requires a veth pair for each pair of pods that need to communicate with each other, which can lead to a large number of veth pairs if there are many pods. This can also lead to performance issues due to the overhead of managing many veth pairs. In practice, we can use a more scalable solution based on bridges. The next section will discuss the bridge-based networking solution.
3.2. Solution 2: Bridge-Based Networking
In this solution, we will use a bridge to connect multiple pods on the same node. A bridge is a virtual network switch that allows multiple network interfaces to communicate with each other as if they were connected by a physical switch. This solution is more scalable than solution 1, as it allows multiple pods to communicate with each other without the need for a separate veth pair for each pair of pods.
Bridges operate at the Data Link layer (Layer 2) of the OSI model, allowing them to forward Ethernet frames between network interfaces. They can also be used in conjunction with Layer 3 (IP addresses, routing) to establish more complex network topologies. For example, we can assign IP addresses to the bridge interface and configure routes to enable communication between different network namespaces. This is how Kubernetes commonly sets up networking via CNI plugins, which create a bridge on each node for the pod network. We will mimic this behavior by creating a bridge and connecting the pods to it.
In this solution, we assume:
- Cluster CIDR is 10.200.0.0/16
- node-0, where pod A0 and pod A1 are running, has pod subnet 10.200.0.0/24
- Pod A0 has IP 10.200.0.2/24
- Pod A1 has IP 10.200.0.3/24
The communication between pods A0 and A1 is described in the diagram below.
3.2.1. Create Pods, Bridge Interface and veth Pairs
First, let's create two pods A0 and A1 in two different network namespaces. We will use the same overlay filesystem structure as before.
# in a new terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
# in another terminal
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
Next, let's create a bridge interface in the host network namespace.
# in host
ip link add name br0 type bridge
# bring it up
ip link set br0 up
Explain the ip link command:
- ip link add: Command to create a new network interface
- name br0: Name of the bridge interface
- type bridge: Specifies that the interface is a bridge
- This command will also assign a unique MAC address to the bridge interface, which will be used for communication between pods at Layer 2
Let's verify that the bridge interface has been created.
# in host
ip link
# should see
3: br0: <NO-CARRIER,BROADCAST,MULTICAST,UP> mtu 1500 qdisc noqueue state DOWN mode DEFAULT group default qlen 1000
link/ether 3a:b0:8e:d6:55:38 brd ff:ff:ff:ff:ff:ff
Now, we need to create a veth pair for each pod and connect one end of the veth pair to the bridge interface. The other end of the veth pair will be moved to the corresponding pod's network namespace.
# in host, create veth pairs for A0 and A1
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br0
ip link set veth-a1 up
# find A1_PID yourself
ip link set veth-a1-c netns $A1_PID
Explain the ip link set command:
- ip link set veth-a0: specify the network interface (veth-a0) we want to modify
- master br0: this is the action that sets the br0 device as the master for veth-a0
- This makes veth-a0 a port on the br0 bridge. Any network traffic that comes into veth-a0 will now be handled by the br0 bridge, allowing the traffic to be forwarded to other interfaces connected to the same bridge
- The Linux bridge forwards traffic by using a forwarding database (FDB), also known as a MAC address table. The bridge operates at Layer 2 of the OSI model and makes forwarding decisions based on MAC addresses, not IP addresses
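We can inspect both the bridge's ports and its forwarding database with iproute2 (the MAC addresses shown will differ on your machine):
# in host, list the interfaces currently attached to br0 as ports
ip link show master br0
# in host, list the FDB entries the bridge has learned
bridge fdb show br br0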
Next, let's assign IP addresses to the veth interfaces in each pod. We will assign IP addresses in the same subnet, for example, assign IP 10.200.0.2/24 for A0 and IP 10.200.0.3/24 for A1. This allows them to communicate with each other using IP addresses.
# in A0
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
# in A1
ip addr add 10.200.0.3/24 dev veth-a1-c
ip link set veth-a1-c up
Show me the containerd and CNI plugin code
In the containerd repo, in internal/cri/server/sandbox_run.go, in RunPodSandbox(): it calls c.setupPodNetwork(),
which then calls netPlugin.Setup().
In internal/cri/server/service.go, in NewCRIService(), it calls c.initPlatform()
to initialize the c.netPlugin. c.initPlatform() is implemented in internal/cri/server/service_linux.go.
One implementation of netPlugin.Setup() is in vendor/github.com/containerd/go-cni/cni.go.
This is a Go library that provides the necessary functions and data structures for a container runtime to:
- Find the CNI plugin executables on the host system
- Run the CNI plugins to set up, tear down, or check the status of a pod's network
- Handle the CNI plugin's configuration and results
In Setup(), it calls c.attachNetworks(),
which then calls asynchAttach(),
which then calls n.Attach().
In vendor/github.com/containerd/go-cni/namespace.go, in Attach(), it calls n.cni.AddNetworkList(),
which is implemented in vendor/github.com/containernetworking/cni/libcni/api.go.
In AddNetworkList(), it calls c.addNetwork(),
which then calls invoke.ExecPluginWithResult(),
which is the ADD command of the CNI plugin.
Let's check one of the implementations of CNI plugins in the plugins repo. This repo contains a collection of CNI plugins, which are reference and example networking plugins maintained by the CNI team.
In plugins/main/bridge/bridge.go, in cmdAdd(), it calls setupBridge()
and setupVeth().
setupBridge() then calls ensureBridge(),
which creates the bridge interface and brings it up.
In setupVeth(), it creates the veth pair in
the host, moves one end into the container's network namespace, and connects the host veth end to the bridge.
3.2.2. Test Communication
In pod A0, start a netcat server to listen on port 8080. In pod A1, connect to pod A0 to verify that they can communicate over the IP addresses assigned to the veth interfaces.
# in pod A0, try to ping pod A1
ping 10.200.0.3
# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
# in pod A1, try to ping pod A0
ping 10.200.0.2
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1
In conclusion, we have set up pod-to-pod communication on the same node using a bridge. The bridge allows multiple pods to communicate with each other without the need for a separate veth pair for each pair of pods. This is a more scalable solution and is the one used in Kubernetes networking.
In reality, the Flannel CNI plugin does implement a bridge-based networking solution, similar to the one we just built, to connect pods on the same node. However, other CNI plugins like Calico and Cilium use different approaches. Covering all CNI plugins is out of the scope of this post, but you can refer to their documentation for more details.
In the next section, we will explore how pods communicate with each other across different nodes in a Kubernetes cluster.
4. Pod-to-Pod Communication Across Nodes
In this section, we will explore how pods communicate with each other across different nodes in a Kubernetes cluster. We will discuss solutions based on static routing and on IP tunneling (IP-in-IP and VXLAN).
4.1. Solution 1: Static routing
This solution is based on the idea of routing traffic between bridges on different nodes. Each node has its own bridge interface, and pods on different nodes can communicate with each other by routing traffic through their respective bridges. From the bridge, traffic is forwarded to the appropriate network interface based on the destination IP address configured in the routing table of the node.
In Kubernetes, each node is assigned one IP address range, known as the pod subnet, which is used to assign IP addresses to pods running on that node. In this solution, we assume:
- Cluster CIDR is 10.200.0.0/16
- node-0:
  - It has IP 192.168.64.4
  - It has pod subnet 10.200.0.0/24
  - It has a bridge br0 with IP 10.200.0.1/24
  - Pod A0 is running on node-0 with IP 10.200.0.2/24
- node-1:
  - It has IP 192.168.64.5
  - It has pod subnet 10.200.1.0/24
  - It has a bridge br1 with IP 10.200.1.1/24
  - Pod A1 is running on node-1 with IP 10.200.1.2/24
The communication between pods A0 and A1 across nodes is described in the diagram below.
4.1.1. Create Pods, Bridges, and veth Pairs
First, let's create pod A0 in node-0 and pod A1 in node-1. We will use the same overlay filesystem structure as before.
# in node-0
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
# in node-1
# remember to download the minimal Alpine Linux and extract it into the lower directory
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
In node-0, let's create a bridge interface and a veth pair for pod A0. We will also assign IP addresses to the bridge and the veth interface.
# in node-0, create bridge, assign IP
ip link add br0 type bridge
ip addr add 10.200.0.1/24 dev br0
ip link set br0 up
# create veth pair and connect it to the bridge
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
# move veth-a0-c into A0's namespace
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID
In node-1, we also create a bridge interface and a veth pair for pod A1. We will assign IP addresses to the bridge and the veth interface.
# in node-1, create bridge, assign IP
ip link add br1 type bridge
ip addr add 10.200.1.1/24 dev br1
ip link set br1 up
# create veth pair and connect it to the bridge
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br1
ip link set veth-a1 up
# move veth-a1-c into A1's namespace and assign IP
ip link set veth-a1-c netns $A1_PID
As you may notice, unlike in Solution 2: Bridge-Based Networking, this time we assigned an IP address to the bridge interface. In that solution, the bridge was only used for communication between pods on the same node. In this solution, the bridge needs an IP address so that it can be used for routing between the pods and the host network namespace. More specifically, we will later configure each pod to route traffic to the bridge by default, allowing the bridge to act as a gateway for traffic destined for pods on other nodes.
Next, we assign IP addresses to the veth interfaces in each pod. For example, assign IP 10.200.0.2/24 for A0 and IP 10.200.1.2/24 for A1.
# in A0, assign IP to veth-a0-c
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
# in A0, verify that it can reach the bridge br0
ping 10.200.0.1
# in A1, assign IP to veth-a1-c
ip addr add 10.200.1.2/24 dev veth-a1-c
ip link set veth-a1-c up
# in A1, verify that it can reach the bridge br1
ping 10.200.1.1We don't assign an IP to the host's end (veth-a0 and veth-a1) of the veth pair because its primary purpose is to act as a bridge port or physical interface connecting a container's network namespace to the host's network. It functions as a virtual cable, and its role is to forward traffic, not to act as an endpoint with its own IP address.
In pod A0 in node-0, if we run ip route, we should see that pod A0 knows how to route traffic in the pod subnet 10.200.0.0/24 via the bridge br0. However, it does not know how to route traffic to pod A1 with IP 10.200.1.2/24 in node-1, which is in a different subnet 10.200.1.0/24.
# in A0, verify the route
ip route
# should see
10.200.0.0/24 dev veth-a0-c scope link src 10.200.0.2
Therefore, we need to configure the default route in each pod to route traffic to the bridge. Otherwise, the pods will not be able to identify the bridge as the next hop for traffic destined for other pods.
# in A0, config to route traffic to the bridge by default
ip route add default via 10.200.0.1
# in A1, config to route traffic to the bridge by default
ip route add default via 10.200.1.1
Now, if we run ip route in pod A0, we should see that it has a default route to the bridge br0. When a request destined for pod A1 at 10.200.1.2/24 in node-1 is sent, it will be routed to the bridge br0 at 10.200.0.1/24.
# in A0, verify the route
ip route
# should see
default via 10.200.0.1 dev veth-a0-c
10.200.0.0/24 dev veth-a0-c scope link src 10.200.0.2
Alright, any request destined for pod A1 at 10.200.1.2/24 in node-1 will be routed to the bridge br0, but it will not be able to reach pod A1 yet. This is because the bridge br0 in node-0 does not know how to route traffic to the pod subnet 10.200.1.0/24. To solve this, we need to guide the host network namespace to route traffic to the pod subnet 10.200.1.0/24 via node-1's IP address.
# in node-0, tell the kernel to send packets destined for node-1's subnet to node-1's IP
ip route add 10.200.1.0/24 via 192.168.64.5
# in node-1, tell the kernel to send packets destined for node-0's subnet to node-0's IP
ip route add 10.200.0.0/24 via 192.168.64.4
We can verify that node-0's network namespace now has a route to the pod subnet 10.200.1.0/24 via node-1's IP address 192.168.64.5.
# in node-0, verify the route
ip route
# should see
default via 192.168.64.1 dev enp0s1
10.200.0.0/24 dev br0 proto kernel scope link src 10.200.0.1
10.200.1.0/24 via 192.168.64.5 dev enp0s1
192.168.64.0/24 dev enp0s1 proto kernel scope link src 192.168.64.4
On each node, we need to enable IPv4 forwarding.
# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1
IPv4 forwarding allows the host network namespace to forward packets between different network interfaces. For example:
- Let's send a packet from pod A0 in node-0 to pod A1 in node-1
- The packet first arrives at the bridge br0 in node-0
- node-0 finds the route in its routing table that matches the destination subnet of pod A1. That route is 10.200.1.0/24 via 192.168.64.5 dev enp0s1
- node-0 then forwards the packet out of the enp0s1 interface toward node-1's IP address 192.168.64.5
- node-1 receives the packet and forwards it to pod A1
- Without IPv4 forwarding enabled, node-0 would not forward packets between the bridge br0 and the enp0s1 interface. The packet would be dropped because it is not destined for node-0 itself
We can watch this forwarding in action with tcpdump, as sketched below.
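This is only a sketch, assuming tcpdump is installed on node-0 and the uplink interface is named enp0s1 as in our setup:
# in node-0, watch ICMP packets arriving on the bridge from pod A0
tcpdump -ni br0 icmp
# in node-0, in another terminal, watch the same packets leaving via the uplink toward node-1
tcpdump -ni enp0s1 icmp
# in A0, generate some traffic
ping 10.200.1.2
If IPv4 forwarding is enabled, the echo requests should show up in both captures; if it is disabled, they appear on br0 but never on enp0s1.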
4.1.2. Test Communication
In pod A0 in node-0, start a netcat server to listen on port 8080. In pod A1 in node-1, connect to pod A0 to verify that they can communicate over the network.
# in pod A0, try to ping pod A1
ping 10.200.1.2
# create a netcat server in pod A0
nc -lk -p 8080 -e /bin/sh
# in pod A1, try to ping pod A0
ping 10.200.0.2
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1
In conclusion, we have set up pod-to-pod communication across nodes using routing between bridges. Each node has its own bridge interface, and the host network namespace is configured to route traffic to the appropriate pod subnet via the corresponding node's IP address. This allows pods on different nodes to communicate with each other.
While this solution works for basic direct pod-to-pod communication across nodes, it is a simplified example and not how production Kubernetes clusters handle networking. As mentioned in the previous section, this approach is closest to the Flannel CNI plugin's host-gw (host gateway) mode, but it lacks the scalability and automatic route management features. For example, in a cluster with 1000 nodes, each with its own pod subnet, every single node would need 999 routing table entries just to handle inter-node pod communication. As the cluster grows, the routing tables would become unmanageable and slow. In the next section, we will explore another solution using IP tunneling.
4.2. Solution 2.1: IP-in-IP Tunneling
IP tunneling is a more elegant and scalable solution for pod-to-pod communication across nodes. IP tunneling works by encapsulating packets from one network inside packets that can be routed through another network. In other words, traffic from a pod on Node A destined for a pod on Node B is wrapped in a packet with the destination IP of Node B.
IP-in-IP tunneling is one of the simplest forms of IP tunneling, where an entire IP packet (including headers) is encapsulated as the payload of another IP packet. This creates a tunnel between two endpoints, allowing packets to be routed through networks that might not otherwise be able to route them.
In this IP-in-IP tunneling solution, we assume the same network topology as the previous section:
- Cluster CIDR is 10.200.0.0/16
- node-0:
  - It has IP 192.168.64.4
  - It has pod subnet 10.200.0.0/24
  - It has a bridge br0 with IP 10.200.0.1/24
  - Pod A0 is running on node-0 with IP 10.200.0.2/24
- node-1:
  - It has IP 192.168.64.5
  - It has pod subnet 10.200.1.0/24
  - It has a bridge br1 with IP 10.200.1.1/24
  - Pod A1 is running on node-1 with IP 10.200.1.2/24
The packet flow for IP-in-IP tunneling is illustrated in the diagram below.
4.2.1. Create Pods, Bridges, and veth Pairs
First, let's create the same pod and bridge setup as in the previous section:
# in node-0
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a0-upper,workdir=/root/tung/a0-work /root/tung/a0-merged
unshare -Cunimpf chroot /root/tung/a0-merged /bin/sh
# create bridge, assign IP
ip link add br0 type bridge
ip addr add 10.200.0.1/24 dev br0
ip link set br0 up
# create veth pair and connect it to the bridge
ip link add veth-a0 type veth peer name veth-a0-c
ip link set veth-a0 master br0
ip link set veth-a0 up
# move veth-a0-c into A0's namespace
# find A0_PID yourself
ip link set veth-a0-c netns $A0_PID
# in node-1
mount -t overlay overlay -o lowerdir=/root/tung/lower,upperdir=/root/tung/a1-upper,workdir=/root/tung/a1-work /root/tung/a1-merged
unshare -Cunimpf chroot /root/tung/a1-merged /bin/sh
# create bridge, assign IP
ip link add br1 type bridge
ip addr add 10.200.1.1/24 dev br1
ip link set br1 up
# create veth pair and connect it to the bridge
ip link add veth-a1 type veth peer name veth-a1-c
ip link set veth-a1 master br1
ip link set veth-a1 up
# move veth-a1-c into A1's namespace
# find A1_PID yourself
ip link set veth-a1-c netns $A1_PID
Assign IP addresses to the veth interfaces in each pod:
# in A0, assign IP to veth-a0-c
ip addr add 10.200.0.2/24 dev veth-a0-c
ip link set veth-a0-c up
ip route add default via 10.200.0.1
# in A1, assign IP to veth-a1-c
ip addr add 10.200.1.2/24 dev veth-a1-c
ip link set veth-a1-c up
ip route add default via 10.200.1.1
4.2.2. Create IP-in-IP Tunnel Interfaces
Instead of adding static routes, we'll create IP-in-IP tunnel interfaces. IP-in-IP tunneling uses the IPIP protocol (IP protocol number 4).
# in node-0, create an IP-in-IP tunnel interface
ip tunnel add ipip0 mode ipip remote 192.168.64.5 local 192.168.64.4
ip addr add 10.200.0.1/32 dev ipip0
ip link set ipip0 up
# in node-1, create an IP-in-IP tunnel interface
ip tunnel add ipip0 mode ipip remote 192.168.64.4 local 192.168.64.5
ip addr add 10.200.1.1/32 dev ipip0
ip link set ipip0 up
Explain the IP-in-IP tunnel creation command:
- ip tunnel add ipip0 mode ipip: Create a new IP-in-IP tunnel interface named ipip0
- remote 192.168.64.5: The remote endpoint IP address for this tunnel
- local 192.168.64.4: The local IP address used as the source for tunneled packets
Explain the IP address assignment command:
- ip addr add 10.200.0.1/32 dev ipip0: Assign a point-to-point IP address to the tunnel interface so that each end of the tunnel can communicate with the other
- The /32 subnet mask indicates a single host address, which is typical for point-to-point links
4.2.3. Configure Routing for IP-in-IP Tunneling
IP-in-IP tunneling relies on IP routing to direct traffic through the tunnel:
# in node-0, add route to reach node-1's pod subnet via the tunnel
ip route add 10.200.1.0/24 dev ipip0
# in node-1, add route to reach node-0's pod subnet via the tunnel
ip route add 10.200.0.0/24 dev ipip0
These routes tell the kernel:
- Any traffic destined for the remote pod subnet should be sent through the IP-in-IP tunnel interface
- The tunnel interface will automatically encapsulate the packets with the outer IP headers
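As a quick check, the routing table should now send node-1's pod subnet through the tunnel. The output below is a sketch; flags and ordering may differ on your system:
# in node-0, verify the tunnel route
ip route
# should include something like
10.200.1.0/24 dev ipip0 scope link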
Also ensure IPv4 forwarding is enabled on both nodes:
# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1
When pod A0 sends a packet to pod A1:
- Pod A0 sends a packet destined for 10.200.1.2 (pod A1)
- The packet travels through veth-a0-c → veth-a0 → br0
- Bridge br0 forwards the packet to ipip0 based on the routing table entry
- The IP-in-IP interface ipip0 encapsulates the packet:
  - Adds an outer IP header with source 192.168.64.4 and destination 192.168.64.5
  - Uses IP protocol 4 (IPPROTO_IPIP) to indicate IP-in-IP encapsulation
  - Adds an Ethernet header for the physical network
- The encapsulated packet travels over the physical network from node-0 to node-1
- node-1 receives the packet on its enp0s1 interface and recognizes protocol 4
- The IP-in-IP interface ipip0 on node-1 decapsulates the packet:
  - Removes the outer IP and Ethernet headers
  - Forwards the original inner packet to br1
- Bridge br1 forwards the packet to veth-a1 → veth-a1-c → pod A1
We can confirm this encapsulation on the wire with tcpdump, as sketched below.
4.2.4. Test Communication
Now let's test if pods can communicate across nodes using IP-in-IP tunneling:
# in pod A0, try to ping pod A1
ping 10.200.1.2
# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1
In conclusion, we have set up pod-to-pod communication across nodes using IP-in-IP tunneling. The IP-in-IP tunnel encapsulates packets, enabling them to traverse networks that would otherwise cause problems, such as:
- When configuring route add 10.200.0.0/24 via 192.168.64.4, our network admin says "NO! We don't allow custom pod subnets in our corporate network!"
  - With IP-in-IP tunneling, the packets are encapsulated with outer IP headers, making them look like regular traffic between node IPs to the network
- When our Kubernetes cluster spans different cloud regions with NAT gateways between them, we run ping 10.200.1.2. This will fail because the target pod is in a different region and the NAT gateway doesn't know how to route the pod subnet 10.200.1.0/24
  - With IP-in-IP tunneling, the command works because the NAT gateway only sees normal traffic between known node IPs
However, IP-in-IP tunneling does not solve the scalability problem of routing table entries as mentioned in the previous section. We still need to maintain a route for every pod subnet in the cluster, which can become unmanageable as the cluster grows. In the next section, we will explore a more scalable solution using VXLAN tunneling.
4.3. Solution 2.2: VXLAN Tunneling
VXLAN (Virtual Extensible LAN) is one of the most popular tunneling protocols used by Kubernetes CNI plugins like Calico and Flannel. VXLAN creates a Layer 2 overlay network over a Layer 3 infrastructure.
How VXLAN Solves the Scalability Problem
Unlike static routing or IP-in-IP tunneling, VXLAN provides true scalability through:
- Single Virtual Network: All nodes participate in one large Layer 2 network segment identified by a VXLAN Network Identifier (VNI)
- Dynamic MAC Learning: VXLAN can automatically learn MAC-to-IP mappings without manual configuration
- Single Route Entry: Only one route entry is needed regardless of cluster size
- Multicast-based Discovery: Nodes can discover each other automatically using multicast
In this solution, we assume the same network topology as the previous section:
- Cluster CIDR is 10.200.0.0/16
- node-0:
  - It has IP 192.168.64.4
  - It has pod subnet 10.200.0.0/24
  - It has a bridge br0 with IP 10.200.0.1/24
  - Pod A0 is running on node-0 with IP 10.200.0.2/24
- node-1:
  - It has IP 192.168.64.5
  - It has pod subnet 10.200.1.0/24
  - It has a bridge br1 with IP 10.200.1.1/24
  - Pod A1 is running on node-1 with IP 10.200.1.2/24
4.3.1. Create Pods, Bridges, veth Pairs, and VXLAN Interfaces
First, let's create the same pod and bridge setup as in the previous section. Then, instead of adding static routes, we'll create VXLAN tunnel interfaces with multicast support for automatic discovery.
# in node-0, create a VXLAN interface with multicast group
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.4 dstport 4789 dev enp0s1
ip link set vxlan0 master br0
ip link set vxlan0 up
# in node-1, create a VXLAN interface with the same multicast group
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.5 dstport 4789 dev enp0s1
ip link set vxlan0 master br1
ip link set vxlan0 up
Explain the scalable VXLAN creation command:
- ip link add vxlan0 type vxlan: Create a new VXLAN interface named vxlan0
- id 100: VXLAN Network Identifier (VNI), a unique identifier for this VXLAN segment
- group 239.1.1.1: Multicast group for automatic neighbor discovery (replaces manual FDB entries)
- local 192.168.64.4: The local IP address used as the source for VXLAN tunneled packets
- dstport 4789: The destination UDP port for VXLAN traffic (4789 is the standard VXLAN port)
- dev enp0s1: The physical interface to send the encapsulated packets through
How the Multicast Discovery Works:
When pod A0 needs to communicate with pod A1:
- Pod A0 sends an ARP request for pod A1's IP (10.200.1.2)
- Since pod A1's MAC address is unknown, the VXLAN interface sends the ARP request to the multicast group (239.1.1.1)
- All nodes in the multicast group receive this ARP request
- node-1 (which hosts pod A1) responds with pod A1's MAC address
- node-0's VXLAN interface automatically learns the mapping: pod A1's MAC → node-1's IP (192.168.64.5)
- This mapping is stored in the FDB, enabling direct communication for future packets
Let's verify that no manual FDB entries are needed:
# in node-0, check FDB entries before any communication
bridge fdb show dev vxlan0
# should only show the multicast entry:
3a:08:d1:b9:30:f0 vlan 1 master br0 permanent
3a:08:d1:b9:30:f0 master br0 permanent
00:00:00:00:00:00 dst 239.1.1.1 via enp0s1 self permanent
# in A0, ping A1
ping 10.200.1.2
# after pod A0 pings pod A1, the FDB will automatically learn the mapping:
# in node-0, check FDB entries again
bridge fdb show dev vxlan0
# should now show learned entry for pod A1:
f6:2e:13:01:62:96 master br0
f6:89:f1:bf:25:77 master br0
3a:08:d1:b9:30:f0 vlan 1 master br0 permanent
3a:08:d1:b9:30:f0 master br0 permanent
00:00:00:00:00:00 dst 239.1.1.1 via enp0s1 self permanent
f6:2e:13:01:62:96 dst 192.168.64.5 self
f6:89:f1:bf:25:77 dst 192.168.64.5 self
Explain the MAC addresses in the FDB entries:
- f6:2e:13:01:62:96: the MAC address of br1 on node-1
- f6:89:f1:bf:25:77: the MAC address of veth-a1-c in pod A1
Why Do We See Duplicate MAC Entries?
Notice that each MAC address appears twice with different suffixes:
f6:2e:13:01:62:96 master br0
f6:2e:13:01:62:96 dst 192.168.64.5 self
This happens because two separate forwarding databases are involved with the vxlan0 interface:
- Bridge FDB (master br0):
  - Maintained by the bridge br0
  - Records: "MAC f6:2e:13:01:62:96 is reachable through the vxlan0 port"
  - Standard Layer 2 bridge learning
- VXLAN FDB (dst 192.168.64.5 self):
  - Maintained by the VXLAN interface vxlan0
  - Records: "To reach MAC f6:2e:13:01:62:96, tunnel to node IP 192.168.64.5"
  - VXLAN-specific tunnel endpoint mapping
Why the Packet Flow Requires Both Entries:
- Pod A0 → Bridge: a packet destined for f6:2e:13:01:62:96 arrives at bridge br0
- Bridge Lookup: the bridge checks the master br0 entries and finds the MAC is reachable via the vxlan0 port
- VXLAN Lookup: the VXLAN interface checks the self entries to find the tunnel destination 192.168.64.5
- Encapsulation: VXLAN encapsulates the packet and sends it to node-1 at 192.168.64.5
We can also watch the VXLAN encapsulation on the wire, as sketched below.
4.3.2. Configure Scalable Routing
The key to VXLAN's scalability is that we only need one route entry regardless of cluster size. This single route entry covers ALL pod subnets in the cluster. Combined with the automatic MAC learning described above, this is what makes VXLAN truly scalable.
# in node-0, add single route for entire cluster CIDR
ip route add 10.200.0.0/16 dev br0
# in node-1, add single route for entire cluster CIDR
ip route add 10.200.0.0/16 dev br1
Also ensure IPv4 forwarding is enabled on both nodes:
# in both node-0 and node-1
sysctl -w net.ipv4.ip_forward=1
4.3.3. Test Communication
Now let's test if pods can communicate across nodes using VXLAN tunneling:
# in A0, create a netcat server
nc -lk -p 8080 -e /bin/sh
# in pod A1, connect to pod A0 via netcat
nc -v 10.200.0.2 8080
# should see a shell prompt in pod A1
4.3.4. Test Adding a New Node
In static routing and IP-in-IP tunneling, adding node-2 requires updating every existing node:
# must run on ALL existing nodes when adding node-2
ip route add 10.200.2.0/24 via 192.168.64.6 # Static routing
ip route add 10.200.2.0/24 dev ipip0 # IP-in-IP tunneling
In VXLAN, adding node-2 requires zero configuration on existing nodes:
# only run on the NEW node-2
ip link add vxlan0 type vxlan id 100 group 239.1.1.1 local 192.168.64.6 dstport 4789 dev enp0s1
ip link set vxlan0 master br2
ip link set vxlan0 up
ip route add 10.200.0.0/16 dev br2
# Existing nodes automatically discover node-2 through multicast
4.5. Routing Solutions Comparison
| Aspect | Static Routing | IP-in-IP | VXLAN |
|---|---|---|---|
| Encapsulation Protocol | None | IP (Protocol 4) | UDP (Port 4789) |
| Overhead | None | ~20 bytes (IP header only) | ~50 bytes (UDP + VXLAN headers) |
| OSI Layer | Layer 3 (Network) | Layer 3 (Network) | Layer 2 (Data Link) |
| Routes per Node | O(N) | O(N) | O(1) |
| Adding New Node | Update all existing nodes | Update all existing nodes | Zero config on existing nodes |
| FDB Entries | N/A | N/A | Dynamic learning |
| Complexity | Simple | Medium | More complex |
| Scalability | Poor | Poor | Excellent |
| CNI Examples | Flannel host-gateway | Calico IPIP mode | Flannel VXLAN, Calico VXLAN |
In conclusion, VXLAN provides the truly scalable solution for pod-to-pod communication across nodes in Kubernetes. While static routing and IP-in-IP tunneling both require O(N) configuration entries that grow linearly with cluster size, VXLAN achieves O(1) scalability through:
- Single route entry regardless of cluster size
- Automatic MAC learning through multicast discovery
- Dynamic FDB population without manual configuration
- Zero-touch node addition - new nodes are automatically discovered
This makes VXLAN the preferred choice for large-scale production Kubernetes clusters, despite its slightly higher network overhead compared to IP-in-IP tunneling.
5. Pod-to-Service Communication
The previous sections focused on pod-to-pod communication across nodes. In this section, we will explore how pods communicate with services in Kubernetes, specifically how kube-proxy uses iptables to implement load balancing for services. Before we dive into the solution, let's check some Linux networking concepts and iptables's role in load balancing.
5.1. The main routing stack in Linux
The main routing stack in Linux is made up of several interconnected components:
- Routing Table: the core of routing decision-making. The kernel uses the routing table to determine the outgoing interface and gateway for a packet based on its destination IP address. We can view it with the ip route command
- Netfilter Framework: a framework within the Linux kernel that provides a flexible and powerful way to handle network packets. It consists of multiple tables (filter, nat, mangle, raw, security) and built-in chains (PREROUTING, INPUT, OUTPUT, FORWARD, POSTROUTING) where rules are placed. iptables is a user-space utility program in Linux used to configure the Linux kernel's firewall, which is implemented as Netfilter modules
  - A chain refers to a sequence of defined rules within the iptables system. Each chain is a list of rules which can match a set of packets. Each rule specifies what to do with a packet that matches. This is called a target, which may be a jump to a user-defined chain in the same table
- Connection Tracking (conntrack): a kernel module that keeps a record of all active connections. It is critical for features like NAT and stateful firewalls, allowing the kernel to identify a packet as part of an existing conversation and process it accordingly
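Each of these components can be inspected from the command line. The commands below are a quick tour; the conntrack utility may need to be installed separately (eg. from the conntrack-tools package):
# view the routing table
ip route
# view the Netfilter rules in the nat table via iptables
iptables -t nat -L -n -v
# view the currently tracked connections (requires the conntrack tool)
conntrack -L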
5.2. Use nat table in iptables for load balancing
There are currently five independent tables in the iptables system: filter, nat, mangle, raw, and security. Each table contains a set of built-in chains, which are lists of rules that match packets and specify actions to take on them. The nat table is used for Network Address Translation (NAT) operations, such as modifying the source or destination IP address of packets.
The NAT table is consulted when a packet that creates a new connection is encountered. It consists of four built-in chains:
- PREROUTING: for altering packets as soon as they come in
- INPUT: for altering packets destined for local sockets
- OUTPUT: for altering locally-generated packets before routing
- POSTROUTING: for altering packets as they are about to go out
The nat table is where network address translation happens. Load balancing is fundamentally a form of address translation, where the destination IP (the Kubernetes service's ClusterIP) is rewritten to the IP of a specific backend pod. This is handled by the DNAT (Destination NAT) target in the PREROUTING chain.
The nat table is the only iptables table that can directly perform load balancing. We can also potentially use the mangle table in combination with other tools to achieve a similar effect. The filter, raw, and security tables are not suitable for load balancing.
Besides DNAT, there is also SNAT (Source NAT). For more information about iptables and SNAT, refer to this resource.
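Although SNAT is out of scope for this post, a common example is the MASQUERADE target, which CNI plugins and kube-proxy typically use so that pod traffic leaving the cluster appears to come from the node's IP. The rule below is only an illustrative sketch for our 10.200.0.0/16 cluster CIDR, not something required for the setup in this post:
# on a node: source-NAT traffic leaving the pod network for destinations outside the cluster CIDR
iptables -t nat -A POSTROUTING -s 10.200.0.0/16 ! -d 10.200.0.0/16 -j MASQUERADE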
5.3. CNI Plugins and kube-proxy
There is a clear separation of concerns in Kubernetes networking. CNI Plugins (eg. Flannel, Calico, Cilium) are responsible for pod networking. When a pod is created, the CNI plugin does the following:
- Create a network namespace for the pod
- Assign a unique IP address to the pod from a defined cluster subnet
- Set up the pod's network interface (eg. a veth pair)
- Ensure that traffic can be routed between pods, even on different nodes
In Kubernetes, a Service is an abstraction that defines a logical set of pods and a policy by which to access them. A Kubernetes Service provides a stable IP address and DNS name that can be used to access the pods, even if the underlying pods change over time.
kube-proxy is responsible for service networking. It creates the iptables rules (or IPVS rules) that intercept traffic destined for a service's ClusterIP. These rules perform DNAT, rewriting the service's ClusterIP to the IP of one of the healthy backend pods.
Some advanced CNI plugins, like Cilium, can replace kube-proxy entirely by using a more efficient technology called eBPF. In such cases, the CNI plugin itself handles service load balancing, but it's not using iptables in the traditional sense.
5.4. Implementation
In a Kubernetes cluster, the service-cluster-ip-range option defines the Classless Inter-Domain Routing (CIDR) block from which IP addresses are allocated to Services within a cluster. These are known as ClusterIPs, and they provide a stable virtual IP address for a Service. The service-cluster-ip-range must be mutually exclusive with other IP ranges used within the cluster, such as the Pod CIDR range and the IP addresses of the cluster nodes, to prevent IP conflicts.
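For reference, this range is passed to the control plane as a flag. The excerpt below is a sketch; the value matches the Service CIDR we assume later in this section, and the other flags are omitted:
# excerpt of kube-apiserver flags (kube-controller-manager accepts the same flag)
kube-apiserver --service-cluster-ip-range=10.96.0.0/12 ...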
In this solution, we will create a Kubernetes Service that load balances traffic across multiple pods running on different nodes. We will use iptables to implement the load balancing logic. The Service will have a stable IP address that can be used to access the pods, and iptables will be used to route traffic to the appropriate pod based on the load balancing rules.
This solution assumes:
- Cluster CIDR is 10.200.0.0/16
- node-0:
  - It has IP 192.168.64.4
  - It has pod subnet 10.200.0.0/24
  - It has a bridge br0 with IP 10.200.0.1/24
  - Pod A0 is running in node-0 with IP 10.200.0.2/24
  - Pod A1 is running in node-0 with IP 10.200.0.3/24
- node-1:
  - It has IP 192.168.64.5
  - It has pod subnet 10.200.1.0/24
  - It has a bridge br1 with IP 10.200.1.1/24
  - Pod B is running in node-1 with IP 10.200.1.2/24
- node-2:
  - It has IP 192.168.64.6
  - It has pod subnet 10.200.2.0/24
  - It has a bridge br2 with IP 10.200.2.1/24
  - Pod C is running in node-2 with IP 10.200.2.2/24
- The Kubernetes Service CIDR is 10.96.0.0/12
  - There is one Kubernetes Service named KUBE-SVC-1 with IP 10.96.0.2. This Service is configured to load balance traffic across pods A0 and B. This is to simulate a scenario where the Service has multiple endpoints across different nodes
The diagram below illustrates the flow of packets when a pod C sends traffic to the Service KUBE-SVC-1 at IP 10.96.0.2/32 on port 8080. The traffic is load balanced across pods A0 and B, which are living in node-0 and node-1 respectively.
In this solution, we will test these three cases:
- Case 1: send traffic from pod C to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is not one of the Service's endpoints and lives in a different subnet (a different node) from the Service's endpoints
- Case 2: send traffic from pod A1 to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is not one of the Service's endpoints but lives in the same subnet (the same node) as one of the Service's endpoints
- Case 3: send traffic from pod A0 to the Service KUBE-SVC-1. This simulates the scenario where traffic comes from a pod that is itself one of the Service's endpoints
On each node, let's create corresponding pods, bridges, veth pairs and VXLAN interfaces as we did in the previous section.
On each node, add the following routes to enable communication between the pod subnets. This will allow traffic destined for the pod subnets to be routed through the respective node's IP address.
# in node-0
ip route add 10.200.1.0/24 via 192.168.64.5
ip route add 10.200.2.0/24 via 192.168.64.6
# in node-1
ip route add 10.200.0.0/24 via 192.168.64.4
ip route add 10.200.2.0/24 via 192.168.64.6
# in node-2
ip route add 10.200.0.0/24 via 192.168.64.4
ip route add 10.200.1.0/24 via 192.168.64.5Let's verify that the routes are set up correctly on each node.
# in A0, ping A1, B, and C
ping 10.200.0.3
ping 10.200.1.2
ping 10.200.2.2
# in A1, ping A0, B, and C
ping 10.200.0.2
ping 10.200.1.2
ping 10.200.2.2
# in B, ping A0, A1, and C
ping 10.200.0.2
ping 10.200.0.3
ping 10.200.2.2
# in C, ping A0, A1, and B
ping 10.200.0.2
ping 10.200.0.3
ping 10.200.1.2
Now, we will create a Kubernetes Service named KUBE-SVC-1, which is actually just a custom chain in the nat table of iptables. We will then add rules to the KUBE-SVC-1 chain to load balance traffic across the pods A0 and B.
# create a custom iptables chain for our service, call it KUBE-SVC-1
iptables -t nat -N KUBE-SVC-1
# add a rule to the PREROUTING chain to send all traffic for the Service IP to this new custom chain
iptables -t nat -A PREROUTING -d 10.96.0.2/32 -p tcp --dport 8080 -j KUBE-SVC-1
# verify
iptables -t nat -L PREROUTING -v -n --line-numbers
# add rule 1: redirects 50% of the traffic to Pod A0
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.200.0.2:8080
# add rule 2: redirects the remaining 50% of traffic to Pod B
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -j DNAT --to-destination 10.200.1.2:8080
# verify
iptables -t nat -L KUBE-SVC-1 -v -n --line-numbers
# verify all rules
iptables -L -v -n --line-numbers
Explain the command iptables -t nat -N KUBE-SVC-1:
- -t nat: Specify that we are working with the nat table, which is used for Network Address Translation
- -N KUBE-SVC-1: Create a new chain named KUBE-SVC-1 in the nat table. This chain will be used to define rules for handling traffic destined for the Kubernetes Service IP 10.96.0.2/32
Explain the command iptables -t nat -A PREROUTING -d 10.96.0.2/32 -p tcp --dport 8080 -j KUBE-SVC-1:
- -A PREROUTING: Append a rule to the PREROUTING chain, which is the first chain that packets traverse when they arrive at the system
- -d 10.96.0.2/32 -p tcp --dport 8080: Specify that this rule applies to packets destined for the IP address 10.96.0.2/32 on TCP port 8080
- -j KUBE-SVC-1: Jump to the KUBE-SVC-1 chain, where the actual load balancing rules are defined
Explain rule 1 command iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.200.0.2:8080:
- -A KUBE-SVC-1: Append a rule to the KUBE-SVC-1 chain
- -p tcp --dport 8080: Specify that this rule applies to TCP packets destined for port 8080
- -m statistic --mode random --probability 0.5: Use the statistic module to randomly select 50% of the packets that match this rule
- -j DNAT --to-destination 10.200.0.2:8080: Perform Destination Network Address Translation (DNAT) on the selected packets, changing their destination to 10.200.0.2:8080, which is the IP address of pod A0
Explain rule 2 command iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -j DNAT --to-destination 10.200.1.2:8080:
- -A KUBE-SVC-1: Append a rule to the KUBE-SVC-1 chain
- -p tcp --dport 8080: Specify that this rule applies to TCP packets destined for port 8080
- -j DNAT --to-destination 10.200.1.2:8080: Perform Destination Network Address Translation (DNAT) on the remaining packets, changing their destination to 10.200.1.2:8080, which is the IP address of pod B
- Note: for a Service with N pods, we set the probabilities in a cascading manner: 1/N, 1/(N-1), 1/(N-2), and so on, down to 1/1 for the last pod (see the sketch below)
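For example, a hypothetical Service with three backend pods at 10.200.0.2, 10.200.1.2, and 10.200.2.2 would get rules like the following, so that each pod receives roughly one third of the connections:
# rule 1: 1/3 of all traffic goes to the first pod
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.3333 -j DNAT --to-destination 10.200.0.2:8080
# rule 2: 1/2 of the remaining traffic (1/3 overall) goes to the second pod
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -m statistic --mode random --probability 0.5 -j DNAT --to-destination 10.200.1.2:8080
# rule 3: everything left (the final 1/3) goes to the last pod
iptables -t nat -A KUBE-SVC-1 -p tcp --dport 8080 -j DNAT --to-destination 10.200.2.2:8080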
Show me the Kubernetes code
In pkg/proxy/iptables/proxier.go, syncProxyRules()
is where all of the iptables-save/restore calls happen. It's called by OnServiceSynced() and OnEndpointSlicesSynced().
In syncProxyRules(), it calls proxier.writeServiceToEndpointRules()
to create DNAT rules for service-to-endpoint mapping.
5.5. Test Load Balancing
In pod A0 and pod B, start a netcat server to listen on port 8080.
# in A0
nc -lk -p 8080 -e /bin/sh
# in B
nc -lk -p 8080 -e /bin/sh
Case 1: Send traffic from pod C to the Service KUBE-SVC-1 using nc -v 10.96.0.2 8080. Pod C is living in node-2, which doesn't have any endpoints of the Service KUBE-SVC-1. Break the connection and retry several times; we should see traffic reach pod A0 and pod B roughly equally.
Case 2: Send traffic from pod A1. Pod A1 is living in node-0, which hosts one endpoint of the Service KUBE-SVC-1 (pod A0). Break the connection and retry several times; we should still see traffic reach pod A0 and pod B roughly equally.
For this case, Kubernetes has a feature called Topology Aware Routing and the internalTrafficPolicy: Local setting. These features can change the default behavior to prefer routing traffic to pods on the same node or in the same availability zone.
Case 3: Send traffic from pod A0. Pod A0 is itself one endpoint of the Service KUBE-SVC-1. Break the connection and retry several times; we should see traffic reach pod B only.
# in node-0, enter pod A0's network namespace
# find A0_PID yourself
nsenter -t $A0_PID -a chroot /root/tung/a0-merged /bin/sh
# in the new shell, send traffic to the Service
nc -v 10.96.0.2 8080
For this case, when a pod sends traffic to a service it's a part of, the traffic will be routed to a different, random pod within that service, not back to itself. This behavior is related to hairpinning.
Hairpinning is when a pod's traffic goes out to the service's ClusterIP and then back in to a pod. In Kubernetes, when pod A0 sends traffic to KUBE-SVC-1, kube-proxy will randomly select one of the two pods A0 and B as the destination. Because the goal is to load balance, it's highly unlikely that it would route the traffic back to pod A0 itself. This ensures that the load is distributed and prevents a single pod from becoming overwhelmed by its own requests, which could lead to deadlocks or other issues. We could change this default behavior using the hairpinMode option.
In conclusion, we have set up a Kubernetes Service that load balances traffic across multiple pods. We use iptables to create a custom chain for the Service and add rules to load balance traffic between the pods. This allows us to distribute traffic across multiple pods, providing high availability and scalability for our applications.
I will leave the setup that simulates a scenario where the Service has multiple endpoints in the same node for you to explore because I'm very lazy now.
6. Conclusion
This post has explored the fundamental concepts of Kubernetes pod networking, including pod-to-pod communication on the same node and across nodes, as well as pod-to-service communication. We discussed two solutions for pod-to-pod communication on the same node: one using veth pairs and another using bridges. We also set up pod-to-pod communication across nodes using static routing, IP-in-IP tunneling, and VXLAN tunneling, and implemented load balancing for pod-to-service communication using iptables.
This post didn't cover all the details of Kubernetes networking, such as how to route traffic to external sources (eg. the internet) using SNAT, how to configure network policies, or how to troubleshoot networking issues in Kubernetes. However, it provides a solid foundation for understanding how pods communicate with each other in a Kubernetes cluster.