Disaster Recovery and Business Continuity for Kubernetes Clusters

Kubernetes is the de facto container management system due to its high adoption and active community. However, Kubernetes is not a batteries-included platform for enterprise-grade disaster recovery (DR) and business continuity. This blog will focus on critical elements of Kubernetes for DR and discuss approaches for automation to improve continuity.

Let’s start with the types of disasters organisations face today in the new microservices world and how they have changed compared to the past.

Titanic, the Not-So-Unsinkable Ship

In the past, when virtual machines (VMs) were the building blocks of cloud operations, backup and recovery focused on the VM level. You would back up the state of the VM where the monolithic application ran and be on the safe side until the next disaster. But with the cloud-native revolution, applications got smaller and became distributed via containers to data centres around the world. Today, this new world carries its own disaster risk, bringing with it the need for revamped DR plans.

The success of Kubernetes relies on abstractions to reduce complexity and operational tasks. It makes it easier to develop, deploy, monitor and upgrade cloud-native applications. In addition, Kubernetes provides out-of-the-box self-healing capabilities so that applications on it are always up and running. Because of this, there is a common misconception that Kubernetes is the platform you can trust in every circumstance.

Unfortunately, like the supposedly unsinkable Titanic, any cloud platform (including Kubernetes) can fail when things go wrong. Kubernetes itself does not provide data protection and migration capabilities to bring systems back online in the same place or via another provider. This means you need a DR plan that takes into consideration the architecture and constraints of Kubernetes.

Two essential indicators to consider as part of a disaster recovery plan are Recovery Time Objective (RTO) and Recovery Point Objective (RPO). RTO is the length of downtime you can tolerate before recovery, and RPO is the amount of data, measured in time, that you can afford to lose. Ideally, both would be zero, which means no data loss and instant recovery when a disaster hits. That objective is not practical, and costs spike as the two values approach zero because you need to allocate more resources. The realistic option is to assess your business requirements, SLAs and budget to determine a satisfactory level of RTO and RPO.
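As a rough illustration of how these two indicators turn into numbers, the sketch below estimates a worst-case RPO from a backup schedule and an RTO from a list of recovery steps. The function names and all figures are hypothetical and only serve the example; they are not measurements from a real cluster.

```python
from datetime import timedelta


def worst_case_rpo(backup_interval: timedelta, backup_duration: timedelta) -> timedelta:
    """Upper bound on lost data: the disaster hits just before the next backup completes."""
    return backup_interval + backup_duration


def estimated_rto(step_durations: list[timedelta]) -> timedelta:
    """Recovery steps run one after another, so their durations add up."""
    return sum(step_durations, timedelta())


# Hypothetical schedule: hourly backups that each take 5 minutes to finish.
rpo = worst_case_rpo(timedelta(hours=1), timedelta(minutes=5))

# Hypothetical recovery steps: provision a cluster, restore etcd, redeploy applications.
rto = estimated_rto([timedelta(minutes=20), timedelta(minutes=10), timedelta(minutes=15)])

print(f"worst-case RPO: {rpo}, estimated RTO: {rto}")
```

Halving the backup interval roughly halves the worst-case RPO but doubles the number of backups to store and transfer, which is the cost trade-off described above.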

In the following two sections, we will focus on the two critical parts of your disaster recovery plan for Kubernetes: what to back up/recover, and how to operate the DR plan.

Women and Children First!

Disaster recovery requires careful thinking about what to recover and bring back online. This is essential because you can only restore what you backed up beforehand, and how recently you backed it up determines your RPO. For Kubernetes, there are two groups of data and configurations to save before your ship goes down: resources/components and infrastructure.

Kubernetes Resources and Components

Kubernetes stores all of its cluster state in etcd, a distributed, strongly consistent key-value store, so you need to take regular etcd backups and keep them outside your blast radius.
Kubernetes resources are declarative definitions of what you are planning to run in the cluster. Similarly, if you create Kubernetes clusters from declarative definitions with a tool such as kops, you need to back up those definitions to create similar clusters in case of failure.
Certificates used for communication inside and outside the cluster are mainly generated on the fly when the cluster is created, so for minimal RTO and seamless operations, you should back them up.
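A regular etcd backup could be scripted along the following lines. This is a minimal sketch, assuming `etcdctl` (v3 API) is installed on a control plane node and that the certificates follow a kubeadm-style layout (`ca.crt`, `server.crt`, `server.key` in one directory); the helper names, paths and endpoint are hypothetical and need adjusting to your cluster.

```python
import subprocess
from datetime import datetime, timezone
from pathlib import Path


def build_snapshot_command(snapshot_path: Path, endpoint: str, cert_dir: Path) -> list[str]:
    """Build an `etcdctl snapshot save` command for a TLS-secured etcd member."""
    return [
        "etcdctl",
        f"--endpoints={endpoint}",
        f"--cacert={cert_dir / 'ca.crt'}",    # assumed kubeadm-style file names
        f"--cert={cert_dir / 'server.crt'}",
        f"--key={cert_dir / 'server.key'}",
        "snapshot", "save", str(snapshot_path),
    ]


def snapshot_name(now: datetime) -> str:
    """Timestamped file name so older snapshots are never overwritten."""
    return f"etcd-{now:%Y%m%dT%H%M%SZ}.db"


def take_snapshot(backup_dir: Path, endpoint: str, cert_dir: Path) -> Path:
    """Run the snapshot and return the path of the file that was written."""
    target = backup_dir / snapshot_name(datetime.now(timezone.utc))
    subprocess.run(build_snapshot_command(target, endpoint, cert_dir), check=True)
    return target


# Example call (hypothetical paths and endpoint):
# take_snapshot(Path("/var/backups/etcd"), "https://127.0.0.1:2379",
#               Path("/etc/kubernetes/pki/etcd"))
```

The snapshot file is only useful in a disaster if it is then copied somewhere outside the blast radius, for example to object storage in another region; that upload step is deliberately left out here.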

Infrastructure

Nodes: Kubernetes workloads are packaged as containers and run on nodes, which makes interaction with the nodes inevitable. If you have custom node configuration files and plugin binaries, you need to back them up to create new nodes when the current nodes become unavailable.
Storage: Stateful applications such as databases run with persistent volumes attached to their pods. This means your actual data lives in these volumes, and you need to back them up regularly as well. Keep in mind that the size and architecture of storage-heavy applications directly affect the RPO you can achieve, since larger volumes generally take longer to back up.
Networking: Elements such as DNS records and load balancers are outside the scope of Kubernetes, so you need to back up their configuration separately to bring your new services up and minimise RTO.
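For the storage item above, one possible approach is the Kubernetes `VolumeSnapshot` API. The sketch below only builds the manifest; it assumes a CSI driver with snapshot support and the snapshot CRDs are installed in the cluster. The PVC name, namespace and snapshot class are hypothetical.

```python
import json


def volume_snapshot_manifest(pvc_name: str, namespace: str,
                             snapshot_class: str, suffix: str) -> dict:
    """Describe a point-in-time snapshot of one PersistentVolumeClaim."""
    return {
        "apiVersion": "snapshot.storage.k8s.io/v1",
        "kind": "VolumeSnapshot",
        "metadata": {"name": f"{pvc_name}-{suffix}", "namespace": namespace},
        "spec": {
            "volumeSnapshotClassName": snapshot_class,
            "source": {"persistentVolumeClaimName": pvc_name},
        },
    }


# Hypothetical names; replace them with the ones used in your cluster.
manifest = volume_snapshot_manifest("postgres-data", "production", "csi-snapclass", "20240102")

# kubectl accepts JSON as well as YAML, e.g. `kubectl apply -f snapshot.json`.
print(json.dumps(manifest, indent=2))
```

Depending on the storage backend, a snapshot may live next to the original volume, so it may still need to be exported elsewhere to survive the loss of the whole site.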

You may think we have forgotten the applications running inside the cluster, such as your deployments or pods. Their configuration is already covered by the etcd backups, and for stateful applications, the volumes are backed up with the storage as part of the infrastructure.

Disaster Recovery Strategies – diagram from AWS Documentation

Lifeboats

Disaster recovery plans are not complete without a solid recovery model. In other words, you need to have a plan to evacuate your applications and get them back up and running again, similar to the lifeboats on ships—but, of course, with better planning than the Titanic.

There are four well-known models suitable for a cloud-native DR plan:

Back Up & Restore

In this model, you take regular backups of your infrastructure, cluster and applications. When a disaster occurs, you create a new Kubernetes cluster, configure it and then deploy your applications. Because the applications cannot become ready until the new cluster is up and running, this model results in a very long RTO.

Pilot Light

This model is an intermediate approach where only the core services are kept up and running in the standby environment. This means you need synchronised copies of any frequently changing data, such as databases or document stores. The standby Kubernetes cluster may already exist, but you only deploy the applications to it after a disaster. This model can significantly reduce RTO compared to ‘Back Up & Restore’, although the actual recovery time depends on the number of applications and services you are moving.

Warm Standby

The second intermediate approach is to have a standby Kubernetes cluster where all applications are deployed but scaled to zero. In this approach, RTO will be low since you only need to wait until your applications scale up and become ready.
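The scale-up step in this model can be scripted. The following is a minimal sketch that builds the `kubectl` commands to scale deployments in the standby cluster back up and wait for them to become ready; the deployment names, replica counts, namespace and kubeconfig context are hypothetical.

```python
def scale_up_commands(desired_replicas: dict[str, int], namespace: str,
                      context: str) -> list[list[str]]:
    """Build kubectl commands that scale each deployment, then wait for its rollout."""
    base = ["kubectl", "--context", context, "--namespace", namespace]
    commands = []
    for name, count in sorted(desired_replicas.items()):
        commands.append(base + ["scale", "deployment", name, f"--replicas={count}"])
    for name in sorted(desired_replicas):
        # Blocks until the deployment reports all replicas as ready.
        commands.append(base + ["rollout", "status", f"deployment/{name}"])
    return commands


# Hypothetical production replica counts for the standby cluster.
commands = scale_up_commands({"web": 4, "api": 3}, namespace="production",
                             context="standby-cluster")

for command in commands:
    print(" ".join(command))
    # To execute for real: subprocess.run(command, check=True)
```

Issuing all scale commands before waiting on any rollout lets the deployments start in parallel, which keeps the recovery time close to that of the slowest application rather than the sum of all of them.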

Multi-Site Active/Active

In this model, you need to have another Kubernetes cluster with all the applications running in it. In other words, you need to have two production environments. This may seem costly, but if you configure auto-scaling, your second cluster will be tiny while in standby mode. With a quick change of DNS records, your services in the standby cluster will be active, resulting in the lowest possible RTO and RPO.

Choosing a model based on your SLAs and application requirements is critical to a successful DR plan. But you also need to select the level of automation for backup and recovery: human operators or full automation. When you rely on humans, you can create a list of manual steps or scripts to run during recovery. However, humans are prone to error—which is the last thing you want in case of a failure.

On the other hand, automating disaster recovery is the cloud-native path. Similar to the self-healing applications of Kubernetes, you can have a complete platform up and running, including the Kubernetes cluster itself. Our business continuity services make this possible with the following features:

Backup as a Service: Automated, compressed and encrypted backup of all data in the cluster
Disaster Recovery as a Service: Brings your applications back to life in case of a cyber attack, natural disaster, human error or hardware failure
High Availability Assessment: Assessment of every possible factor, ensuring your applications are available 24/7/365

Get in touch with us today to learn more about how to recover from disaster at the speed of modern business.