I have put this post together as I spend an awful lot of time selling companies on the benefits of Kubernetes and other container orchestration platforms. This usually involves talking about patterns to optimise adoption of a PaaS, but I realised the other day that my knowledge was a little bit light on the process of how pods are scheduled on to nodes.
When thinking about Kubernetes architecture, the above diagram is probably what pops into your head. It was the diagram that came into my head, and while I could place all the components in the right places and knew that the Kubelet talks to the Docker daemon over its REST API to ensure the containers are running, there were clearly some gaps.
Be warned: this post is probably going to be more technical than my usual posts, as it is really about understanding and explaining how some of the internals of Kubernetes actually work. Although I will try to explain the more complicated concepts with easier-to-understand examples and analogies, there is some stuff that just requires technical language.
So how does the process work when we tell Kubernetes that we want a pod to run? We should start off with our context diagram, just so that we are all on the same page about what we are trying to achieve.
So, first things first: what are the components of a Kubernetes cluster? Most of us will already know that Kubernetes is generally split into what people refer to as the Control Plane and the Worker Nodes, the latter of course being where we want our container to be running once it has been successfully scheduled by the cluster.
Below is a list of the components, grouped by whether they run on the Control Plane or on the Worker Nodes.

Control Plane:

- The API Server
- The Scheduler
- The Controller Manager

Worker Nodes:

- The Kubelet
- The Kube Proxy (Service Proxy)
- The Container Runtime
But what actually happens with all of the components listed above when we, as users, tell our Kubernetes cluster to run a pod for us?
All cluster state is stored in etcd, which sits behind the API Server component listed above. The other components communicate with the API Server in order to modify the cluster state.
So this is a fairly standard three-tier architecture: all clients communicate with the API Server, which abstracts the storage logic away, meaning that the database could be swapped out later if required.
This is also probably as good a time as any to discuss the optimistic concurrency control implemented within Kubernetes. Optimistic locking means that if the version number has changed between the time a client read the data and the time it attempts the update, the update is rejected. A nice by-product of this is that when two clients attempt to update the same data, the first update always succeeds. A slight downside is the occasional stale read, but there are no deadlocks: a client waiting to read or write data never has to wait for a lock to be released by a previous client.
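To make the idea concrete, here is a toy sketch of version-based optimistic locking in Python. It loosely mimics how the API server rejects stale writes using a resourceVersion-style counter; the class and error names are made up for illustration and this is not the real Kubernetes implementation.

```python
# Toy sketch of optimistic concurrency with a resourceVersion-style counter.
# Not the real Kubernetes code: just the shape of the idea.

class ConflictError(Exception):
    pass

class Store:
    def __init__(self, value):
        self.value = value
        self.resource_version = 1

    def read(self):
        # Clients receive the data together with the version they read it at.
        return self.value, self.resource_version

    def update(self, new_value, seen_version):
        # Reject the write if someone else updated the object since our read.
        if seen_version != self.resource_version:
            raise ConflictError("stale resourceVersion, re-read and retry")
        self.value = new_value
        self.resource_version += 1

store = Store({"replicas": 1})
_, version = store.read()
store.update({"replicas": 3}, version)      # first writer wins
try:
    store.update({"replicas": 5}, version)  # same version is now stale
except ConflictError as err:
    print(err)                              # prints: stale resourceVersion, re-read and retry
```

Note that there is never any blocking: the losing client simply re-reads the object at its new version and retries its update.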
When we run kubectl to schedule a pod, it performs an HTTP POST, passing a deployment object to the API Server. This data is then subjected to a gauntlet of authentication plugins, authorisation plugins and admission controller plugins. Once the API Server has determined that the request comes from the right identity, that this identity has the authority to perform these tasks, and that the pod meets all admission requirements (e.g. it is not requesting elevated privileges), it validates the posted resource and only then writes it into etcd. This is the first step on our journey to getting our container running within our cluster.
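For reference, the sort of object being POSTed looks like the manifest below: a minimal Deployment (the names and image here are made up for illustration).

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hello-web          # illustrative name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: hello-web
  template:
    metadata:
      labels:
        app: hello-web
    spec:
      containers:
      - name: web
        image: nginx:1.21
        ports:
        - containerPort: 80
```

Running `kubectl apply -f deployment.yaml` is what triggers the HTTP POST described above.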
Once this object is stored in etcd, our request as a user has finished: the response is returned to the client and the API Server has done its job. Clients can watch for changes in the Kubernetes system by opening an HTTP connection to the API Server, which gives them a stream of events indicating which objects have been modified in the cluster. Most users of a Linux system will be familiar with passing the -f flag to the tail command to see new lines as they are appended to a file; the watch stream works in much the same way.
The next step in our all-important quest starts with the Kube Scheduler. The Scheduler also makes use of this watch functionality, watching for pods which have been created and then deciding which node each pod should be assigned to. However, to assign the pod to a node, it never speaks with the node itself. This is all handled via the API Server again, updating the resource in question with the node upon which it should run. Once the pod resource has been updated, the Kubelet is notified, as it is also watching the API Server for updated resources. The Kubelet then runs the pod's containers using the container runtime present on the machine.
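In API terms, the scheduler's decision amounts to posting a Binding object for the pod, naming the chosen node (the pod and node names below are made up for illustration):

```yaml
apiVersion: v1
kind: Binding
metadata:
  name: hello-web-7d9c4        # the pod being scheduled
target:
  apiVersion: v1
  kind: Node
  name: worker-node-1          # the node the scheduler picked
```

This is why the scheduler never needs to talk to the node directly: it just updates state via the API Server, and the Kubelet on worker-node-1 reacts to the change.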
A side note about container runtimes: rkt is another container runtime, similar to Docker, but built to be more secure, and it does not require a separate daemon to be running in order to schedule containers.
There are different scheduling algorithms that can be used in Kubernetes; for the purposes of this blog post we will just focus on the default scheduling algorithm.
Default scheduling is split into two phases, Filtering and Prioritisation. If there are multiple nodes which meet the criteria with similar scores, a round-robin is used so that pods are spread across those nodes adequately. Notice that we never tell the cluster which node we want the pod to run on (okay, maybe not never, but most of the time we do not give the cluster that information). So how does Kubernetes figure this out? Let's walk through this bit now.
How does Kubernetes determine which nodes are acceptable for the pod? Well, the scheduler effectively runs a series of functions which determine whether a given node would be acceptable. It checks whether the node can meet the pod's hardware requests: does it have sufficient CPU, RAM and disk for the pod to run happily on this node? Some pods specify a particular host port that they want to bind to, so the scheduler checks whether that port is available. A pod can also request a particular node, in which case the scheduler of course needs to check whether this is that node. The scheduler also checks node taints and tolerations, along with pod affinity and anti-affinity rules.
So once there is a list of all the acceptable nodes, the scheduler needs to work out which of them would be the best fit to run this pod on. For example, why would we choose a node which is already running more pods than another acceptable node? Or, if we are making an application highly available, we would not want the pod to be scheduled on a node which is already running an instance of it.
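The two phases above can be sketched in a few lines of Python. This is only an illustration of the filter-then-score shape, not the real kube-scheduler: the field names and the headroom-based scoring rule are invented for the example.

```python
# Toy sketch of the two default scheduling phases: filtering removes nodes
# that cannot satisfy the pod's resource requests, prioritisation scores
# the survivors (here simply by remaining headroom).

def filter_nodes(pod, nodes):
    """Keep only nodes with enough free CPU and memory for the pod."""
    return [n for n in nodes
            if n["free_cpu"] >= pod["cpu"] and n["free_mem"] >= pod["mem"]]

def score(pod, node):
    """Prefer emptier nodes: score by capacity left after placement."""
    return (node["free_cpu"] - pod["cpu"]) + (node["free_mem"] - pod["mem"])

def schedule(pod, nodes):
    feasible = filter_nodes(pod, nodes)
    if not feasible:
        return None  # no acceptable node: the pod stays Pending
    return max(feasible, key=lambda n: score(pod, n))["name"]

nodes = [
    {"name": "node-a", "free_cpu": 2, "free_mem": 4},
    {"name": "node-b", "free_cpu": 8, "free_mem": 16},
]
print(schedule({"cpu": 1, "mem": 2}, nodes))  # prints: node-b
```

The real scheduler runs many more predicates (ports, taints, affinity) and priority functions, but the overall flow is the same: filter first, then pick the highest-scoring survivor.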
Now that a node has been picked for the pod, Kubernetes needs to action the next set of steps to get us closer to our end goal: our pod scheduled on the right node for the job. This task is handled by the Controller Manager, which actually contains a series of controllers that manipulate the Kubernetes resources required. These include, but are not limited to, the Deployment Controller, the Node Controller and the Namespace Controller.
Controllers only operate on Kubernetes objects via the API Server; they do not communicate directly with one another and they do not know how to talk to the Kubelets. Once the Controller Manager has finished those tasks, the Kubelet can start provisioning the pod's containers.

The other steps we have discussed up to this point have all taken place within the Control Plane, but we are finally here on the worker which has been determined as the right node for our pod to perform its duty. So what happens next? Well, the Kubelet takes the specification from the API Server, which we uploaded via kubectl at the beginning of this process, and passes the required information about the containers to the configured container runtime on this node. The Kubelet monitors these containers, reporting their status back to the API Server so that clients have the information they require. Our containers' liveness probes are also handled by the Kubelet, along with restarting and terminating containers and pods.
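As an example of what the Kubelet acts on, a liveness probe is declared in the pod spec like this (the path, port and timings are illustrative):

```yaml
# Snippet from a pod spec: the Kubelet runs this probe periodically and
# restarts the container if it keeps failing.
containers:
- name: web
  image: nginx:1.21
  livenessProbe:
    httpGet:
      path: /healthz
      port: 80
    initialDelaySeconds: 5
    periodSeconds: 10
```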
So our container is finally running on the selected node, but if Kubernetes stopped here, neither we nor any other clients within the system would be able to communicate with it. This is where the Kubernetes Service Proxy (aka kube-proxy) steps in to save the day, ensuring that connections to a service actually reach one of the containers backing that service. The proxy also performs load balancing if there are multiple containers backing a service.
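The glue between a service and its containers is a label selector. A minimal Service manifest, matching the illustrative deployment from earlier, might look like this:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: hello-web
spec:
  selector:
    app: hello-web       # pods carrying this label back the service
  ports:
  - port: 80             # port the service exposes
    targetPort: 80       # port on the pods' containers
```

kube-proxy watches services and endpoints like this one and programs the node so that traffic to the service address is spread across the matching pods.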
The next port of call on our cruise through the Kubernetes internals is DNS. All pods within the cluster, by default, are configured to look up services by name using the internal DNS server. This server is deployed to the Kubernetes cluster just like any other service we deploy, and is wired in through the resolv.conf file in each container, a mechanism that will be pretty familiar to most Linux users. The DNS system uses the watch mechanism to listen for changes to services and endpoints, then updates its records accordingly. This can result in briefly stale DNS records, due to the delay between a resource being updated and the DNS pod receiving the notification.
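Inside a pod, the resolv.conf plumbing typically looks something like the snippet below (the nameserver IP is whatever the DNS service's cluster IP happens to be, and the cluster domain is assumed to be the common default, cluster.local):

```
# /etc/resolv.conf as seen inside a pod in the "default" namespace
nameserver 10.96.0.10
search default.svc.cluster.local svc.cluster.local cluster.local
options ndots:5
```

The search domains are what let a pod resolve a service by its short name, e.g. `hello-web` expanding to `hello-web.default.svc.cluster.local`.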
Now that our container has its service registered in the DNS server, we should be able to communicate with it, at least from inside the cluster. But if we wanted to communicate with our pod from outside the cluster, we would need an ingress controller to be in place. These are mostly reverse proxy servers which forward traffic directly to a service's pods. The ingress resource definition references the service, and the controller resolves the service's endpoints behind it; if either of these changes during the lifecycle of the pod, the ingress controller picks up the new details, as it too uses the watch mechanism we have discussed throughout this post. Notably, this means that traffic from outside the cluster bypasses the service route and goes directly into the pod.
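A minimal Ingress resource routing an external hostname to our illustrative service might look like this (hostname and names are invented for the example):

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: hello-web
spec:
  rules:
  - host: hello.example.com        # external hostname to match
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: hello-web        # service whose endpoints receive traffic
            port:
              number: 80
```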
So we have finally achieved everything we set out to do at the beginning of this article: following the pod all the way from a resource definition posted into the cluster, through being scheduled onto a node, to being able to communicate with it from outside the cluster. We have documented what happens at each of these steps, giving us a much more thorough understanding of the chain of events that results in our containers being scheduled onto our Kubernetes cluster.