Running Highly Available Apps on Kubernetes

As Kubernetes becomes the de facto standard for deploying applications, many of us are either running our applications on Kubernetes or trying to migrate to it. In this blog post, I will go through some tips for running highly available applications on Kubernetes.

Kubernetes:

Kubernetes, also known as K8s, is an open-source system for automating the deployment, scaling, and management of containerized applications. Designed on the same principles that allow Google to run billions of containers a week, Kubernetes can scale without increasing your ops team.

High Availability:

In computing, the term availability is used to describe the period when a service is available, as well as the time required by a system to respond to a request made by a user. High availability is a quality of a system or component that assures a high level of operational performance for a given period of time.

When running applications on Kubernetes, you can follow these points to make your applications highly available:

1. Don’t manage `Pod`s directly. Instead, use workload resources that manage a set of pods on your behalf.

In Kubernetes, a Pod represents a set of running containers on your cluster. Kubernetes pods have a defined lifecycle. For example, once a pod is running in your cluster, a critical fault on the node where that pod is running means that all the pods on that node fail. Kubernetes treats that level of failure as final: you would need to create a new Pod to recover, even if the node later becomes healthy.

However, you don’t need to manage each Pod directly. Instead, you can use workload resources that manage a set of pods on your behalf. These resources configure controllers that make sure the right number of the right kind of pods are running, to match the state you specified.

Kubernetes provides several built-in workload resources:

Deployment and ReplicaSet: Deployment is a good fit for managing a stateless application workload on your cluster, where any Pod in the Deployment is interchangeable and can be replaced if needed.
StatefulSet: Lets you run one or more related Pods that track state. A StatefulSet matches each Pod with a PersistentVolume to persist the data. This is useful for stateful applications such as databases and queues.
DaemonSet: Defines Pods that provide node-local facilities. It schedules a Pod for the DaemonSet onto every node that matches the DaemonSet specification.
Job and CronJob: Define tasks that run to completion and then stop. Jobs represent one-off tasks, whereas CronJobs recur according to a schedule.

When using workload resources like Deployment or StatefulSet, set the replicas value under spec.replicas to more than one to make your application highly available.
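As a minimal sketch of that advice, the Deployment below runs three interchangeable replicas. The name `my-app`, the `app: my-app` label, and the image are placeholders chosen for illustration, not values from a real cluster.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                # hypothetical name
spec:
  replicas: 3                 # more than one replica, so losing a single Pod does not take the app down
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app           # must match spec.selector.matchLabels
    spec:
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0   # placeholder image
```

If one of the three Pods fails, the Deployment's controller creates a replacement while the remaining replicas keep serving.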

2. Use the `RollingUpdate` update strategy for workload resources.

An update strategy is used by Kubernetes to roll out new pods when you update existing workload resources (like Deployment, StatefulSet, etc.).

If all pods of a Deployment are replaced at once, as happens with the Recreate strategy, your application may become unavailable during the update.

Kubernetes allows you to configure an update strategy for your workload using:

spec.updateStrategy field for DaemonSet and StatefulSet
spec.strategy field for Deployment

For your apps, set the type of the update strategy to RollingUpdate. When you set the strategy type to RollingUpdate, you can also specify maxUnavailable and maxSurge to control the rolling update process.

For example:

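...
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1   # at most one Pod may be unavailable during the update
      maxSurge: 0         # no extra Pods are created above the desired count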

Max Unavailable:

.spec.strategy.rollingUpdate.maxUnavailable is an optional field that specifies the maximum number of Pods that can be unavailable during the update process. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The absolute number is calculated from the percentage by rounding down. The value cannot be 0 if .spec.strategy.rollingUpdate.maxSurge is 0. The default value is 25%.

For example, when this value is set to 30%, the old ReplicaSet can be scaled down to 70% of desired Pods immediately when the rolling update starts. Once new pods are ready, the old ReplicaSet can be scaled down further, followed by scaling up the new ReplicaSet, ensuring that the total number of Pods available at all times during the update is at least 70% of the desired Pods.

Max Surge:

.spec.strategy.rollingUpdate.maxSurge is an optional field that specifies the maximum number of Pods that can be created over the desired number of Pods. The value can be an absolute number (for example, 5) or a percentage of desired Pods (for example, 10%). The value cannot be 0 if .spec.strategy.rollingUpdate.maxUnavailable is 0. The absolute number is calculated from the percentage by rounding up. The default value is 25%.

For example, when this value is set to 30%, the new ReplicaSet can be scaled up immediately when the rolling update starts, such that the total number of old and new Pods does not exceed 130% of desired Pods. Once old Pods have been killed, the new ReplicaSet can be scaled up further, ensuring that the total number of Pods running at any time during the update is at most 130% of desired Pods.
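To make the two rounding rules concrete, here is a sketch that uses the default percentages with a replica count of 10, which is an assumption picked so that the percentage does not divide evenly. The numbers in the comments follow from rounding maxUnavailable down and maxSurge up.

```yaml
# Fragment of a Deployment spec; the replica count is chosen only for illustration.
spec:
  replicas: 10
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 25%   # 25% of 10 = 2.5, rounded down to 2: at least 8 Pods stay available
      maxSurge: 25%         # 25% of 10 = 2.5, rounded up to 3: at most 13 Pods exist at once
```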

DaemonSet

With RollingUpdate update strategy, after you update a DaemonSet template, old DaemonSet pods will be killed, and new DaemonSet pods will be created automatically, in a controlled fashion. At most one pod of the DaemonSet will be running on each node during the whole update process.
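As a rough sketch, a DaemonSet that sets this strategy explicitly might look like the following. The `node-agent` name, label, and image are hypothetical.

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent              # hypothetical name
spec:
  selector:
    matchLabels:
      app: node-agent
  updateStrategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1         # update the DaemonSet Pod on one node at a time
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: node-agent
        image: registry.example.com/node-agent:1.0   # placeholder image
```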

StatefulSet

When a StatefulSet's .spec.updateStrategy.type is set to RollingUpdate, the StatefulSet controller will delete and recreate each Pod in the StatefulSet. It will proceed in the same order as Pod termination (from the largest ordinal to the smallest), updating each Pod one at a time.
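A minimal StatefulSet sketch with the RollingUpdate strategy is shown below. The name `my-db`, the headless Service it refers to, and the image are assumptions for illustration, and volume claims are left out to keep the example short.

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-db                   # hypothetical name
spec:
  serviceName: my-db            # headless Service, assumed to exist already
  replicas: 3
  selector:
    matchLabels:
      app: my-db
  updateStrategy:
    type: RollingUpdate         # Pods are replaced one at a time: my-db-2, then my-db-1, then my-db-0
  template:
    metadata:
      labels:
        app: my-db
    spec:
      containers:
      - name: my-db
        image: registry.example.com/my-db:1.0   # placeholder image
```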

3. Spread Pods across your cluster among failure domains such as regions, zones, and nodes.

Node Failures

Kubernetes automatically spreads the Pods for workload resources (such as Deployment or StatefulSet) across different nodes in a cluster. This spreading helps reduce the impact of failures of nodes.

The cluster admin also needs to make sure that there are enough nodes in the Kubernetes cluster and that each node has enough resources so that each new pod of the workload can be scheduled on different nodes.

Zone Failures

The concepts of zones and regions are not intrinsic to Kubernetes; they come from public cloud providers.

Kubernetes is designed so that a single Kubernetes cluster can run across multiple failure zones, typically where these zones fit within a logical grouping called a region. Major cloud providers define a region as a set of failure zones (also called availability zones) that provide a consistent set of features: within a region, each zone offers the same APIs and services.

Typical cloud architectures aim to minimize the chance that a failure in one zone also impairs services in another zone.

Distribute nodes across zones

To make your application resilient to a zone failure, you first need to distribute your Kubernetes cluster nodes across multiple zones or regions. Kubernetes’ core does not create nodes for you; you need to do that yourself or use a tool such as the Cluster API to manage nodes on your behalf.

If your cluster spans multiple zones or regions, you can use node labels in conjunction with Pod topology spread constraints to control how Pods are spread across your cluster among fault domains: regions, zones, and even specific nodes. These hints enable the scheduler to place Pods for better-expected availability, reducing the risk that a correlated failure affects your whole workload.
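As a sketch of a topology spread constraint, the Deployment below asks the scheduler to keep its Pods balanced across zones. It assumes the nodes carry the well-known `topology.kubernetes.io/zone` label; the `my-app` name, label, and image are placeholders.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                  # hypothetical name
spec:
  replicas: 6
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      topologySpreadConstraints:
      - maxSkew: 1                                 # zones may differ by at most one matching Pod
        topologyKey: topology.kubernetes.io/zone   # node label that identifies the zone
        whenUnsatisfiable: ScheduleAnyway          # soft constraint; DoNotSchedule makes it a hard one
        labelSelector:
          matchLabels:
            app: my-app                            # Pods counted when computing the skew
      containers:
      - name: my-app
        image: registry.example.com/my-app:1.0    # placeholder image
```

ScheduleAnyway keeps Pods schedulable when a zone is full or down, at the cost of a less even spread. DoNotSchedule gives a stricter spread but can leave Pods pending.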

There are different methods that can be used for assigning pods to nodes:

Using node selectors with node labels
Affinity and anti-affinity rules
nodeName field

To schedule your pods into different failure domains, you can use pod anti-affinity with a topologyKey whose value is the node label key that identifies the failure domain.

There are two types of pod affinity:

requiredDuringSchedulingIgnoredDuringExecution: The scheduler can’t schedule the Pod unless the rule is met. This functions like nodeSelector, but with a more expressive syntax.
preferredDuringSchedulingIgnoredDuringExecution: The scheduler tries to find a node that meets the rule. If a matching node is not available, the scheduler still schedules the Pod.

topologyKey is the key of node labels. If two Nodes are labeled with this key and have identical values for that label, the scheduler treats both Nodes as being in the same topology. The scheduler tries to place a balanced number of Pods into each topology domain.

For example,

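apiVersion: apps/v1
kind: Deployment
metadata:
  name: component
spec:
  selector:
    matchLabels:
      component: component
  template:
    metadata:
      labels:
        component: component
    spec:
      affinity:
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100  # required for preferred rules; valid range is 1-100
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: component
                  operator: In
                  values:
                  - component
              topologyKey: "topology.kubernetes.io/zone"  # replaces the deprecated failure-domain.beta.kubernetes.io/zone
      containers:
      - name: component
        image: registry.example.com/component:1.0  # placeholder image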

The above anti-affinity rule asks the scheduler to prefer placing the pods in different zones. Because it is a preferred rule, the scheduler still places a Pod when no other zone is available.

If you use cloud provider controllers, labels for zone, region, and so on are added to each node automatically. Otherwise, the cluster admin needs to add these labels while provisioning new nodes for the Kubernetes cluster.
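For reference, the topology labels on a node look roughly like the fragment below. The label keys are the well-known Kubernetes ones; the node name, region, and zone values are made up for illustration.

```yaml
# Fragment of a Node object, as it might appear in `kubectl get node <name> -o yaml`.
apiVersion: v1
kind: Node
metadata:
  name: worker-1                               # hypothetical node name
  labels:
    kubernetes.io/hostname: worker-1           # node-level failure domain
    topology.kubernetes.io/region: region-a    # hypothetical region
    topology.kubernetes.io/zone: region-a-1    # hypothetical zone
```

Whichever label key you put in topologyKey must be present on the nodes, otherwise the scheduler has no failure domains to spread across.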

4. Handle voluntary disruptions using a pod disruption budget

There are two types of disruptions that can happen to your pods running on a Kubernetes cluster.

Involuntary disruptions

Involuntary disruptions are unavoidable hardware or system software errors. For example:

A hardware failure of the physical machine backing the node
Cluster administrator deletes VM (instance) by mistake
Cloud provider or hypervisor failure makes VM disappear
A kernel panic
The node disappears from the cluster due to cluster network partition
Eviction of a pod due to the node being out-of-resources.

Voluntary Disruptions

Voluntary disruptions are disruptions caused by actions initiated by the application owner or the cluster administrator.

Application owner actions include:

Deleting the deployment or other controller that manages the pod
Updating a deployment’s pod template causes a restart
Directly deleting a pod (e.g. by accident)

Cluster administrator actions include:

Draining a node for repair or upgrade.
Draining a node from a cluster to scale the cluster down (learn about Cluster Autoscaling).
Removing a pod from a node to permit something else to fit on that node.

These actions might be taken by the cluster administrator or the automation run by the cluster administrator or your cloud hosting provider.

Some ways to handle involuntary disruptions are:

Ensure your pod requests the resources it needs.
Replicate your application for higher availability.
When running replicated applications, spread applications across racks or across zones.
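The first point can be sketched as a resources block in the workload's Pod template. The container name, image, and all quantities below are assumptions; the right values depend on what your application actually uses.

```yaml
# Fragment of a workload's Pod template (spec.template.spec); values are illustrative.
containers:
- name: my-app
  image: registry.example.com/my-app:1.0   # placeholder image
  resources:
    requests:
      cpu: 250m          # the scheduler only picks nodes with this much unreserved CPU
      memory: 256Mi      # and this much unreserved memory
    limits:
      memory: 256Mi      # upper bound on the container's memory use
```

With requests set, the scheduler avoids packing the Pod onto a node that cannot fit it, which lowers the chance of an out-of-resources eviction later.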

Handling Voluntary disruptions

As an application owner, you can create a PodDisruptionBudget (PDB) for each application. A PDB limits the number of Pods of a replicated application that are down simultaneously from voluntary disruptions. For example, a quorum-based application would like to ensure that the number of replicas running is never brought below the number needed for a quorum. A web front end might want to ensure that the number of replicas serving load never falls below a certain percentage of the total.

A PDB specifies the number of replicas that an application can tolerate having, relative to how many it is intended to have. For example, a Deployment that has a .spec.replicas: 5 is supposed to have 5 pods at any given time. If its PDB allows for there to be 4 at a time, then the Eviction API will allow voluntary disruption of one (but not two) pods at a time.

The group of pods that comprise the application is specified using a label selector, the same as the one used by the application’s controller (deployment, stateful-set, etc).

A PodDisruptionBudget has three fields:

A label selector .spec.selector specifies the set of pods to which it applies. This field is required.
.spec.minAvailable which is a description of the number of pods from that set that must still be available after the eviction, even in the absence of the evicted pod. minAvailable can be either an absolute number or a percentage.
.spec.maxUnavailable (available in Kubernetes 1.7 and higher) which is a description of the number of pods from that set that can be unavailable after the eviction. It can be either an absolute number or a percentage.

apiVersion: policy/v1 
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app

You can specify only one of maxUnavailable and minAvailable in a single PodDisruptionBudget.

Example 1: With a minAvailable of 5, evictions are allowed as long as they leave behind 5 or more healthy pods among those selected by the PodDisruptionBudget’s selector.

Example 2: With a minAvailable of 30%, evictions are allowed as long as at least 30% of the number of desired replicas are healthy.

Example 3: With a maxUnavailable of 5, evictions are allowed as long as there are at most 5 unhealthy replicas among the total number of desired replicas.

Example 4: With a maxUnavailable of 30%, evictions are allowed as long as no more than 30% of the desired replicas are unhealthy.
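Example 4 could be written as the manifest below. This is a sketch that reuses the `my-app` label from the earlier PodDisruptionBudget; the budget name is hypothetical.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb-percent    # hypothetical name
spec:
  maxUnavailable: 30%         # with 10 desired replicas, at most 3 may be unavailable
  selector:
    matchLabels:
      app: my-app             # same labels the application's controller selects on
```

A percentage keeps the budget proportional when the replica count changes, whereas an absolute minAvailable has to be revisited whenever you scale the workload.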

Note:

Involuntary disruptions cannot be prevented by PDBs; however, they do count against the budget.
Cluster managers and hosting providers should use tools that respect PodDisruptionBudgets by calling the Eviction API instead of directly deleting pods or deployments.
