Kubernetes networking in the kernel
Core Connection
For many self-managed Kubernetes clusters, networking is little more than an afterthought. Administrators install whichever network plugin they've used in the past, and as long as pods and services can be contacted, that's the end of the matter. In this article, I describe the networking requirements of a Kubernetes cluster and how the Container Network Interface (CNI) fosters innovation and choice in this vital area.
I focus on the open source Cilium network plugin, which is one of a few CNI choices that leverages eBPF (the successor to the Berkeley Packet Filter) to provide high performance, control, and observability. You'll install Cilium into a test cluster and compare its performance in unencrypted and encrypted forms with that of Flannel, implement network policies, and observe their effectiveness with the help of Hubble, Cilium's companion user interface (UI).
Networking in Kubernetes
Kubernetes is sometimes described as an orchestration layer, and that term is helpful when you think of deploying an application in a pod (or container) whose environment is abstracted from the underlying cluster of physical nodes, to the point where you don't have to know or care about those nodes. To realize this abstraction, any pod in the cluster should be able to communicate with any other pod as though they were independent "hosts" on a routable IP network. This model offers many advantages over other ways that containers can interact with networks (e.g., mapping ports on a host to ports in the container) because it increases the capacity of the address and port spaces.
Another requirement is that external users should be able to access application pods through consistent ingress points, regardless of which node the pods are currently running on, and with no intervention required if a pod is rescheduled from one host to another. This is where Services, another vital object in Kubernetes networking, come into play.
Administrators might want to encrypt pod-to-pod traffic, as well, to further ensure the cluster's isolation from external influences (virtual extensible LAN (VxLAN) traffic is trivial to collect and snoop on, as you'll see later); to exert granular control over which pods can talk to which other pods and for what purposes, especially in a multitenant environment; and to have detailed observability of packet flows and network policy decisions.
Kubernetes delegates the implementation of all of these requirements to a network plugin (or add-on), with which container runtimes interact by means of the CNI, a Cloud Native Computing Foundation (CNCF) construct intended to foster choice and innovation in this important area of orchestration. The construct defines a simple set of methods that a network plugin must implement.
Network plugins must be able to provide interfaces and IP addresses to newly created pods (ADD method), handle the removal of a pod and the freeing up of its IP address (DEL method), and verify that a pod's networking is still configured as expected (CHECK method). As long as a network plugin implements the CNI correctly, it can be installed in a Kubernetes cluster and have complete freedom over how it gets the job done. For example, some network plugins have no way to encrypt VxLAN traffic, and others don't implement Kubernetes NetworkPolicy objects. (Cilium does both, and much more!)
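To make the construct concrete, here is a hedged sketch of the kind of configuration a CNI plugin is driven by. The runtime reads a file like this from /etc/cni/net.d/ on each node and invokes the plugin binary named in type (found under /opt/cni/bin/ by convention) with the ADD, DEL, or CHECK command. The name, plugin type, and subnet below are purely illustrative and are not taken from a real Cilium installation.

{
  "cniVersion": "0.4.0",
  "name": "example-pod-network",
  "plugins": [
    {
      "type": "example-cni-plugin",
      "ipam": {
        "type": "host-local",
        "subnet": "192.168.1.0/24"
      }
    }
  ]
}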
Self-managed Kubernetes clusters encompass a broad scope, with many control plane implementation choices available to the administrator. In a "vanilla" kubeadm install, the core control plane services of kube-apiserver, etcd, kube-scheduler, and kube-controller-manager run as static pods on the master host and provide their services over the master host's own network interfaces. For example, kube-apiserver uses default port 6443 of the master host's primary interface. Therefore, the static pods come to the Running state before any CNI has been installed. However, other pods that need to use the pod network cannot be scheduled until a network plugin has been installed on the cluster.
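You can see this state for yourself on a freshly initialized kubeadm control plane before any network plugin is present; the commands below are a quick sketch and assume a working kubeconfig for the new cluster.

# The node typically reports NotReady until a network plugin is installed
kubectl get nodes
# The CoreDNS pods usually sit in Pending because they need the pod network
kubectl get pods -n kube-system -o wide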
If you've ever followed the official kubeadm documentation, you'll have noticed a step after running the kubeadm init command that directs you to the Addons page [1] and tells you not to proceed until you've selected and installed a network plugin from a bewildering list of 19 possible options, all described in completely different terms. How do you choose? Unless you have vendor-specific requirements, safe general choices include Calico, Flannel, Weave, and Cilium, which is the focus of this article. All these choices use VxLAN for internode communication. VxLAN is a reliable choice because its only requirements of the underlay network are that Layer 3 (L3) connectivity exists between all hosts and that each host can receive UDP packets on port 8472.
Pods on the Same Node
When a node's container runtime (e.g., containerd) creates a new pod, it leverages the Linux namespace concept to create a dedicated network namespace for that pod, isolated from all the other network namespaces on the node, including the default network namespace (which is the one containing the host's primary interface). At this stage, the pod remains disconnected from the pod network – not much use for participating in the world of microservices! To connect the pod, the container runtime invokes the network plugin through the CNI.
The network plugin uses IP address management (IPAM) to obtain a unique IP address within the host's pod network subnet. It creates a network interface within the pod's network namespace and assigns the IP address (Figure 1; eth0@if … boxes). The plugin creates an interface on the host's Linux bridge (think of a virtual L2 switch inside each host) and connects it to the pod's network interface (Figure 1; lxc…@ boxes). If you exec into the pod and run ip a, you'll see the pod's interface and none of the host's interfaces; likewise, if you run ip a on the node, you'll see the host's interfaces, including the bridge port interfaces – one corresponding to each pod.
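To see the separation for yourself, compare the two interface lists. This is only a sketch: mypod stands in for one of your own pods, its image is assumed to include the ip utility, and the last two commands are run directly on the worker node that hosts the pod.

# Interfaces visible inside the pod's network namespace
kubectl exec mypod -- ip a
# Interfaces visible in the node's default namespace,
# including one lxc* peer interface per pod
ip a
ip -br link | grep lxc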
Now the pod can send IP packets to other pods on the same node by arping for their MAC addresses. I haven't found a straightforward way of figuring out which Linux bridge port is connected to which pod, but if you generate some traffic from one pod to another, you can look at the pod's ARP table,
kubectl exec mypod -- arp -an
and see the MAC address corresponding to the port on the Linux bridge to which the pod is connected.
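For example, one hedged way to do this (assuming a pod named mypod whose image includes ping and arp, plus the IP address of a second pod) is to generate the traffic with a quick ping, read the ARP table, and then search the node's interface list for the matching MAC address:

# Generate some pod-to-pod traffic
kubectl exec mypod -- ping -c 3 <other-pod-IP>
# Read the pod's ARP table
kubectl exec mypod -- arp -an
# On the node: find the interface whose MAC matches the ARP entry
ip -br link | grep -i <MAC-from-ARP-table>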
Pods on Different Nodes
If you run
kubectl get po -o wide
you'll note that each node's pods are in a subnet unique to that node. In the examples given in this article, the pod network is 192.168.0.0/16; the pods for worker 1 have IP addresses assigned from 192.168.1.0/24, the pods for worker 2 are in 192.168.2.0/24, and so on. Figure 1 shows the interface on each node's Linux bridge called cilium_host, the default gateway over which traffic destined for another node is sent. Run ip route inside a pod and on the node to see that this is so.
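For example, you might compare the two views with the commands below (mypod is a placeholder for one of your pods, and its image is assumed to include the ip utility):

# Routing table inside the pod: cilium_host is the default gateway
kubectl exec mypod -- ip route
# Routing table and the cilium_host interface on the node itself
ip route
ip addr show cilium_host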
Although it all makes sense in the context of the pod network, clearly, a packet being sent from a pod on one node to a pod on another node has to traverse the "real" network between the hosts somehow. How does this happen? There are two possible answers: native routing or encapsulation.
1. Native routing. Pod IP addresses exist in the same network the hosts themselves use, so the packets can traverse the network in their native form, just like a packet generated by the node. Interhost routing is performed by the native routing tables on the hosts. The size of that native subnet limits the total number of pods that can be created in the cluster, and because the Kubernetes cluster and the host network are both trying to manage the same pool of IP addresses, conflicts could arise.
2. Encapsulation. To avoid the complexity and conflict potential of making your pods exist in the same network as the real hosts, you can leverage the kernel's built-in VxLAN functionality (or an alternative such as WireGuard or IPsec) to encapsulate the packets before sending them across the physical network to the destination node. Encapsulation is the default mode for most CNIs, including Cilium, so all of the following examples use encapsulated traffic between nodes. Figure 1 shows encapsulated traffic (which could be TCP traffic on the pod network) being sent between nodes as UDP packets.
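Because this encapsulated traffic is plain UDP on port 8472, anyone with access to the underlay network can capture it, which is one reason the article later compares unencrypted and encrypted modes. The tcpdump sketch below assumes eth0 is the node's physical interface.

# Capture VxLAN-encapsulated pod traffic on the underlay network
sudo tcpdump -ni eth0 udp port 8472
# Or write it to a file for later analysis in Wireshark
sudo tcpdump -ni eth0 udp port 8472 -w vxlan.pcap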