
Resilient Kubernetes Deployments with Readiness Probes

Introduction

Containers make CI much more manageable: reproducible, isolated build environments produce portable and predictable deployment artifacts. Continuously delivering those containers to production, however, turns out to be quite a difficult problem, one we collectively call container orchestration.

What do we need for continuous delivery? An automated and safe way of applying changes to our production environments. In the early days of containers, I once wrote a Node.js server that ssh’d into a host, updated a docker-compose manifest, and ran a restart command every time a new user signed up for the service. (It was automated, but not particularly safe, let me just say.)

The first time I heard about Kubernetes was when a team of actually talented engineers inherited my monstrosity and went on to build a proper system that would not crash and burn five times a day. Organizations turn to Kubernetes to facilitate container-based, automated, and safe delivery.

Kubernetes is truly an amazing piece of software. The brainchild of some 10k committers, it has grown into a very complete platform for organizations to run containerized applications, with very fine-grained configuration options. The problem is that once you have so many options in your hands, some of the permutations you can roll out will be wrong or incomplete: failing to set the correct configuration options can lead to sub-optimal (or even destructive) behavior of your applications.

Today I want to discuss one feature in the Kubernetes API which I have found to be particularly important to make our applications more resilient in production: readiness probes.

Our Dummy Application

For the purpose of this post, we will be exploring different configuration options of the Kubernetes Deployment API by playing with a tiny webserver example written in Go. This is the baseline:

package main

import (
	"fmt"
	"log"
	"net/http"
)

func main() {
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, Kubernetes!")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

To build it into a Docker image which we can then deploy to a Kubernetes cluster, we will use this Dockerfile:

# build the binary using a normal golang image
FROM golang:1.15-buster as build

WORKDIR /go/src/app
ADD . /go/src/app

RUN go build -o /go/bin/app

# then copy the binary into a distroless image
FROM gcr.io/distroless/base-debian10
COPY --from=build /go/bin/app /
CMD ["/app"]

In the above example, we are using a Docker multi-stage build: we first build the server binary in a golang image, then copy it into a barebones distroless image. This keeps the image size minimal and reduces the deployment time spent downloading it.

To build and push it to the public Docker Hub:

$ docker build -t rotemtam/k8s-deployment-blogpost:baseline .
$ docker push rotemtam/k8s-deployment-blogpost:baseline

What is the purpose of Deployment objects?

(Diagram: the Deployment -> ReplicaSet -> Pod hierarchy. Source: wiki.ciscolinux.co.uk)

The Kubernetes designers did a fine job of providing us with an orthogonal design: each part of the API is responsible for a specific task, and for that task alone. This is what the hierarchy looks like:

  • Pods are the basic unit of scheduling compute, they are ephemeral and short-lived. They specify how to run a group of containers on a host. It is the Kubernetes control plane’s responsibility to then schedule this Pod on a specific node and run it.
  • ReplicaSets are simple controllers whose task is to keep a specific number of pods of a certain PodSpec up and running on the cluster. If a pod dies, and there is now a gap between the desired state (I want 3 of this thing running) and the current state (I now have only 2 of this thing running), it is the ReplicaSet’s responsibility to schedule a new pod of said spec in its place. A ReplicaSet lives longer than a pod, but it is (usually) pinned to a specific spec: the ReplicaSet lives for the lifetime of a single revision of your application.
  • A Deployment is a high-level construct that is supposed to live for the entire lifecycle of an application, through many versions and releases. Deployments control how a cluster rolls out a new revision and allow for version rollbacks if needed. So unless you have some very specific orchestration requirements, your interface for scheduling (stateless) applications onto your cluster should be the Deployment API. (For deploying stateful applications we use a similar API - StatefulSet.)
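
To see this hierarchy on a live cluster, we can ask kubectl for all three object kinds in one call (the exact names and counts will, of course, vary per cluster):

$ kubectl get deployments,replicasets,pods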

Our baseline Deployment object

A minimal example for a deployment of our application would be:

# declare the object type: 
apiVersion: apps/v1
kind: Deployment

# define metadata about our deployment
metadata:
  name: webserver-deployment
  labels:
    app: webserver-deployment

# define the spec for our deployment 
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webserver-deployment

  # define the template for the Pods created by this deployment
  template:
    metadata:
      labels:
        app: webserver-deployment
    spec:
      containers:
      # define a single container
      - name: webserver
        # using our pushed docker image
        image: rotemtam/k8s-deployment-blogpost:baseline
        ports:
        # expose port 8080
        - containerPort: 8080

To be able to make requests against our app we will expose it with a Service object:

# service.yaml:
apiVersion: v1
kind: Service
metadata:
  name: k8s-blogpost-svc
spec:
  selector:
    app: webserver-deployment
  ports:
    - protocol: TCP
      port: 8080
      targetPort: 8080

To test that everything is up and running:

$ kubectl run curl --image=curlimages/curl --rm --restart=Never -it -- curl http://k8s-blogpost-svc:8080

Hello, Kubernetes!

Hooray!

Dealing with slow starting containers

Assume that our server needs to perform some initial work before it is ready to serve traffic; perhaps it is downloading some data from storage and processing it into an in-memory data structure which it uses to answer queries. By default, Kubernetes routes traffic to our Pod as soon as the main processes of its containers (not including initContainers) are running. This means there will be a period of time in which traffic is routed to our Pod before it is able to serve it; depending on the kube-proxy mode of our cluster, this could result in a spike of 5xx errors whenever a new Pod is scheduled successfully.

To see this in action, let’s make our webserver slow to start by adding a sleep before it begins listening:

package main

import (
	"fmt"
	"log"
	"net/http"
	"time"
)

func main() {
	fmt.Println("Taking a nap..")
	time.Sleep(time.Second * 30)
	fmt.Println("Ready to serve traffic!")
	http.HandleFunc("/", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprint(w, "Hello, Kubernetes!")
	})
	log.Fatal(http.ListenAndServe(":8080", nil))
}

Building and pushing the new version:

$ docker build -t rotemtam/k8s-deployment-blogpost:slow-boot .
$ docker push rotemtam/k8s-deployment-blogpost:slow-boot

Next, we modify our deployment YAML to point at the new Docker image:

# ... unchanged stuff
    spec:
      containers:
      # define a single container
      - name: webserver
        # using our pushed docker image
        image: rotemtam/k8s-deployment-blogpost:slow-boot
# more unchanged stuff ...

If we immediately run our curl command we will see:

$ kubectl run curl --image=curlimages/curl --rm --restart=Never -it -- curl http://k8s-blogpost-svc:8080

curl: (7) Failed to connect to k8s-blogpost-svc port 8080: Connection refused

How does this happen?

  1. We update the Deployment object with a new PodSpec
  2. The Deployment creates a new ReplicaSet for the new revision and rolls out the new pods
  3. As new pods from the new ReplicaSet enter the Ready state, old ones from the existing one are terminated.
  4. As soon as pods are in a Ready state, they are connected to the k8s-blogpost-svc Service and will get traffic directed to them.
  5. We make our curl calls from within the cluster and try to connect to port 8080 in our new pods, but they are still asleep waiting for their 30-second nap to end before opening the webserver socket.
  6. We get a connection refused error message.

How do we mitigate this? If we examine the flow of events, it is easy to see that the culprit is step 4: “As soon as pods are in a Ready state”. Kubernetes thinks our app is ready (because the container image was downloaded and the process started successfully) when it obviously isn’t. Surely there must be a way to make Kubernetes aware of when it should transition a Pod’s state to Ready!

Luckily, when we define a deployment’s PodSpec, we can specify for each container something called a readinessProbe, the docs state it is a “Periodic probe of container service readiness. Container will be removed from service endpoints if the probe fails” and that it is of type Probe v1 Core, which “describes a health check to be performed against a container to determine whether it is alive or ready to receive traffic.”

The probe object is quite rich, allowing us to run arbitrary commands in the container, make HTTP requests, and more. In our example, it would be beneficial to make sure the webserver TCP socket is open before we start directing traffic at it. We could do this by changing our deployment to look like:

# ... 
    spec:
      containers:
      # define a single container
      - name: webserver
        # using our pushed docker image
        image: rotemtam/k8s-deployment-blogpost:slow-boot
        ports:
        # expose port 8080
        - containerPort: 8080

        # wait 30s, then every 5s check if the port is ready
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5

Immediately after deploying the new manifest, no matter how many times we curl our service, we keep getting responses from live containers. What happens when we specify a readiness probe?

  1. We update the Deployment object with a new PodSpec
  2. The Deployment creates a new ReplicaSet for the new revision and rolls out the new pods
  3. As new pods from the new ReplicaSet start, the kubelet waits 30 seconds, then tries to open a TCP connection to port 8080 on the pod at 5-second intervals. Once it succeeds, the pod is declared Ready. Only after a pod in the new ReplicaSet is Ready is a pod terminated from the old one.
  4. As soon as pods are in a Ready state, they are connected to the k8s-blogpost-svc Service and will get traffic directed to them.
  5. We make our curl calls from within the cluster and try to connect to port 8080, but we only ever hit Ready containers, thus never receiving a connection refused error.

Hooray!
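
By the way, tcpSocket is just one of the available probe handlers; httpGet and exec are the others. Here is a sketch of both (the /healthz endpoint and the /tmp/ready file are assumptions - our demo app implements neither):

# ... inside the container spec ...
        # an HTTP probe: the kubelet treats any 2xx/3xx response as a success
        readinessProbe:
          httpGet:
            path: /healthz   # hypothetical endpoint returning 200 only when ready
            port: 8080
          periodSeconds: 5

# ... or, running an arbitrary command inside the container ...
        # an exec probe: success means the command exited with code 0
        readinessProbe:
          exec:
            command: ["cat", "/tmp/ready"]  # succeeds once the app creates this file
          periodSeconds: 5

Note that an exec probe would not actually work with our distroless image, which ships without cat or a shell.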

Safely dealing with crashing versions

Whether you pushed a new binary of your app without updating some required configuration, or simply put the wrong docker image tag in your deployment manifest - despite all of the best intentions and CI practices our industry is developing - sooner or later you are bound to push a version that crashes and burns as soon as it starts to run in production. Let’s see how we can utilize readiness probes to prevent our users from being impacted by our flawed nature.

To model the scenario of a crashing version, we will be using this code:

package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	fmt.Println("Waiting before crashing..")
	time.Sleep(time.Second * 10)
	os.Exit(1)
}

Building and pushing this version:

$ docker build -t rotemtam/k8s-deployment-blogpost:crashing .
$ docker push rotemtam/k8s-deployment-blogpost:crashing

Deploying our app without a readiness probe:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  labels:
    app: webserver-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webserver-deployment
  template:
    metadata:
      labels:
        app: webserver-deployment
    spec:
      containers:
      - name: webserver
        image: rotemtam/k8s-deployment-blogpost:crashing
        ports:
        - containerPort: 8080

Our deployment appears to be up and running:

$ kubectl get deployment webserver-deployment

NAME                   READY   UP-TO-DATE   AVAILABLE   AGE
webserver-deployment   3/3     3            3           10d

But querying it obviously leads to connection refused errors:

$ kubectl run curl --image=curlimages/curl --rm --restart=Never -it -- curl http://k8s-blogpost-svc:8080
curl: (7) Failed to connect to k8s-blogpost-svc port 8080: Connection refused

Without a readiness probe, Kubernetes has no way to discern whether our deployed pods are ready, so it keeps rolling out the new version and connecting the new pods to the service. This is bad news for our users.

What would have happened if we had rolled out our Deployment with a readiness probe like:

apiVersion: apps/v1
kind: Deployment
metadata:
  name: webserver-deployment
  labels:
    app: webserver-deployment
spec:
  replicas: 3
  selector:
    matchLabels:
      app: webserver-deployment
  template:
    metadata:
      labels:
        app: webserver-deployment
    spec:
      containers:
      - name: webserver
        image: rotemtam/k8s-deployment-blogpost:crashing
        ports:
        - containerPort: 8080
        readinessProbe:
          tcpSocket:
            port: 8080
          initialDelaySeconds: 30
          periodSeconds: 5

After deploying our new version, if we try to curl our service we will keep getting correct responses:

$ kubectl run curl --image=curlimages/curl --rm --restart=Never -it -- curl http://k8s-blogpost-svc:8080
Hello, Kubernetes!

How can this be? We deliberately shipped a version that cannot possibly work! Well, thanks to readiness probes, Kubernetes has our back! Let’s inspect our cluster for a better understanding of what’s going on:

$ kubectl get replicaset

NAME                               DESIRED   CURRENT   READY   AGE
webserver-deployment-c9587cf5      3         3         3       1d
webserver-deployment-dc59485db     1         1         0       1d

Right off the bat, we can see that there are 2 ReplicaSets related to our deployment, one with 3/3 pods ready, the other with 0/1. Re-running the query a few minutes later, the state is the same. This means that Kubernetes has created a ReplicaSet for the new version of our deployment’s pods and spun up a single pod to see whether it becomes ready before continuing with the rollout. From our prior knowledge of the code in the container the cluster is trying to run, we know it will keep crashing and never satisfy the readiness probe.

Let’s get a bit more detail on what the faulty ReplicaSet’s pods look like:

$ kubectl get pods
NAME                                     READY   STATUS             RESTARTS   AGE
webserver-deployment-c9587cf5-8ng57      1/1     Running            0          2d16h
webserver-deployment-c9587cf5-bz7dw      1/1     Running            0          2d16h
webserver-deployment-c9587cf5-dc6xm      1/1     Running            0          42h
webserver-deployment-dc59485db-lf9v6     0/1     CrashLoopBackOff   6          9m8s

Here we have our 3 ready pods from the previous revision, and the pod from the new ReplicaSet is in a CrashLoopBackOff, as expected!

So what will become of our poor, unsuccessful deployment? Let’s inspect its state after more than 10 minutes of trying to roll out our bad version:

$ kubectl describe deployment webserver-deployment
Name:                   webserver-deployment
Namespace:              default
# ... redacted for brevity ...
Replicas:               3 desired | 1 updated | 4 total | 3 available | 1 unavailable
# ... redacted for brevity ...
Conditions:
  Type           Status  Reason
  ----           ------  ------
  Available      True    MinimumReplicasAvailable
  Progressing    False   ProgressDeadlineExceeded
OldReplicaSets:  webserver-deployment-c9587cf5 (3/3 replicas created)
NewReplicaSet:   webserver-deployment-dc59485db (1/1 replicas created)
# ... redacted for brevity ...

Here we see some important details:

  • We have 3 desired, 3 available replicas (these map to our original, healthy version), and 1 updated and 1 unavailable (this is the ReplicaSet that’s “in-flight” with the broken version)
  • The deployment is reporting two conditions: Available=True, due to reason MinimumReplicasAvailable and Progressing=False, due to ProgressDeadlineExceeded

Let’s unpack what is going on:

  1. We update the Deployment object such that the PodSpec now requests a container with a broken version.
  2. Kubernetes responds to this by creating a new ReplicaSet with one pod which tries to run our new container.
  3. This container goes into a crash-loop, never satisfying the Deployment’s declared readiness probe.
  4. Kubernetes never decreases the desired pod count on the previous ReplicaSet, nor does it connect broken pods from the new version to the service.
  5. After 10 minutes (the default progressDeadlineSeconds) of no progress, Kubernetes marks the Deployment as not making any progress beyond the defined deadline, with reason ProgressDeadlineExceeded.

The implication is amazing - by defining readiness probes on our Deployment, we can protect our users from suffering downtime due to crashing versions!

We should note that aside from not making the bad pods available for serving, and not downsizing the previous version, Kubernetes will not take any action to roll back our desired state. The docs state:

Kubernetes takes no action on a stalled Deployment other than to report a status condition with Reason=ProgressDeadlineExceeded. Higher level orchestrators can take advantage of it and act accordingly, for example, rollback the Deployment to its previous version.

This means it’s up to us to provide the deployment monitoring mechanism that can look at this signal and make decisions that make sense for our specific system.
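
For example, a minimal sketch of such a mechanism can be built with kubectl alone: kubectl rollout status exits with a non-zero code once the Deployment exceeds its progress deadline, which we can use to trigger a rollback to the previous revision:

$ kubectl rollout status deployment/webserver-deployment || \
    kubectl rollout undo deployment/webserver-deployment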

Caveat: unstable readiness probes

An interesting attribute of readiness probes is that they keep running throughout the application’s lifecycle; once Kubernetes sees that a probe is failing, it marks the pod as not ready, meaning it is disconnected from any Services directing traffic to it. If your readiness probe is unstable, you may want to raise its failureThreshold, so that it must fail multiple consecutive times for your pod to be considered not ready.
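
For example, a sketch building on our probe from before, requiring six consecutive failures (30 seconds of failing checks) before the pod is pulled out of rotation:

# ... inside the container spec ...
        readinessProbe:
          tcpSocket:
            port: 8080
          periodSeconds: 5
          # the default failureThreshold is 3; with 6, the pod is only marked
          # not ready after 30 seconds of consecutively failing checks
          failureThreshold: 6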

Another way around this issue is a relatively new feature called startupProbe, which probes the container only while it is starting up and stops once it has succeeded; the readiness (and liveness) probes only kick in afterwards.
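
A sketch of what that might look like for our slow-booting server: the startup probe gets a generous failure budget, and the readiness probe, which only begins running once the startup probe succeeds, no longer needs an initialDelaySeconds:

# ... inside the container spec ...
        startupProbe:
          tcpSocket:
            port: 8080
          periodSeconds: 5
          failureThreshold: 12  # allow up to 60s of startup time
        readinessProbe:
          tcpSocket:
            port: 8080
          periodSeconds: 5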

A note on Deployment strategies

Another part of the Deployment API is worth mentioning in our discussion: Strategy. Kubernetes supports two types of strategies:

  • Recreate - kill the old ReplicaSet before creating a new one. In most cases I’ve run into, this isn’t the desired behavior, but if you cannot tolerate two versions of the same application running concurrently and can tolerate downtime until your new version is live, this is probably the way to go.
  • RollingUpdate - this is the default behavior, and the one we implicitly relied on in our discussion above.

The RollingUpdate type of strategy has two important fields which we should be aware of: MaxUnavailable and MaxSurge. To quote the docs:

.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.

and:

.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if MaxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.

How did these defaults interact with our example above? Since we defined replicas=3 and the defaults on Deployment are 25% maxUnavailable and 25% maxSurge, this meant that:

  • Kubernetes will not allow more than 0.75 (3 replicas * 25%) pods to be unavailable during transitions (rounded down to zero in effect)
  • Kubernetes will not use more than 0.75 pods (rounded up to 1 in effect) surge capacity during ReplicaSet transitions.

If we recall the actual behavior we observed in our tinkering, we will see that when we updated a deployment:

  • Kubernetes first created one (==surge capacity) new pod
  • It did not terminate any existing pods (due to effective maxUnavailable of zero) before new ones became ready
  • After a pod was ready, it killed a pod from the old ReplicaSet and increased the size of the new one.
  • In this way our minimum availability and max capacity requirements were met.
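
Spelled out explicitly, the defaults we were implicitly relying on look like this in our Deployment spec:

# ...
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%  # 0.75 pods, rounded down: 0 may be unavailable
      maxSurge: 25%        # 0.75 pods, rounded up: 1 extra pod allowed
# ...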

maxUnavailable vs maxSurge in practice

The maxUnavailable and maxSurge configuration options give us controls to make trade-offs between transition safety, resource utilization (costs), and deployment rollout time:

(Figure: trading off deployment rollout speed, cost, and transition safety.)

Consider a few extremes on these axes:

  1. 100% maxUnavailable, 0 maxSurge - a cheap, fast, and unsafe configuration: when you roll out, your cluster will quickly kill the old generation of replicas and then roll out the entire requested replica count at once. You will not be able to use safe, readiness-probe-based rollout techniques like the ones we discussed in this post.
  2. 100% maxSurge, 0 maxUnavailable - an expensive, fast, and safe configuration: when you roll out, your cluster will first spin up a full new set of replicas while the previous one is still running. Only as new replicas pass their readiness probes will the cluster drop replicas from the previous generation. To make this fast you need extra capacity in your cluster to place the surge containers, which can be an expensive choice.
  3. 1 maxSurge, 0 maxUnavailable - a cheap, slow, and safe configuration (see the sketch after this list). As in the example we've seen in this post, the cluster rolls out by provisioning one new pod, waiting for it to pass its readiness probe, and only then terminating a pod from the previous generation. Imagine your readiness probe takes 1 minute to pass and you have 120 replicas in your deployment - it will take you over 2 hours to finish rolling out! This isn't necessarily a bad thing, but only choose this option if you can spare the time.
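
As a concrete sketch, that third configuration is spelled like this:

# ...
spec:
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0  # never drop below the desired replica count
      maxSurge: 1        # roll out one new pod at a time
# ...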

Liveness probes and readiness probes

The PodSpec API has a configuration option similar to readiness probes that we should be aware of as well: livenessProbe. Liveness probes allow us to provide Kubernetes with a probe that checks whether a container is still alive, but they are used for a very different purpose than readiness probes: Kubernetes will actively restart any container that does not pass its liveness probe. This is a powerful, but very dangerous, option that should rarely be used, as described in depth by Henning Jacobs on srcco.de.
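
If you do decide you need one, it is spelled just like a readiness probe, under a different key. A sketch (the /healthz path is an assumption; such an endpoint should check only the process itself, never downstream dependencies, or a hiccup in a dependency will restart your entire fleet):

# ... inside the container spec ...
        livenessProbe:
          httpGet:
            path: /healthz  # hypothetical; should check only this process's health
            port: 8080
          periodSeconds: 10
          failureThreshold: 3  # the container is restarted after 3 straight failures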

To sum up

  • Kubernetes is a full featured, highly complex container orchestration platform
  • Our main way of managing deployments of applications to production is via the Deployment API
  • Rolling out changes to applications safely requires us to provide Kubernetes with a readinessProbe so it can make informed decisions about roll-out procedures.
  • The speed vs cost vs safety of a deployment can be tweaked with the maxUnavailable and maxSurge configuration options.
  • Do not confuse readinessProbe with livenessProbe, which is most likely a bad choice for you.

Special thanks to @tapudov, @assaflavi, and @alonisser for technically reviewing this post.