Kubernetes Platform Design Considerations for Tanzu Part 4: Recoverability
This blog post focuses on recoverability, specifically disaster recovery (DR).
Organizations should have a DR plan: an IT-focused plan that consists of step-by-step safety measures and is designed to restore operability of the target system, application, or facility at an alternate site after a disaster at the source site. The best DR plans generally begin with a framework of clear action steps, including technical recovery plans for each key business application. When a disaster occurs, it is essential to have accurate documentation and procedures that include guidelines and policies. The documentation should be readily available, up to date, easy to read, and stored off-site.
Consider the following points when planning for disaster recovery:
The disaster recovery team, roles and responsibilities, and contact information. The team may include the DR lead, management team, facility team, and the network, server, application, and storage teams.
The plan has a description and location of the recovery facilities (remote site or cloud), transportation and accommodation details, and the location of backups.
The communication procedures and contacts should include authorities, employees, clients, vendors, and partners.
The process for activating the plan needs to be defined clearly. The minimum information is the who, what, when, where, and how. Who decides and who needs to be contacted? What kind of disaster is it and what is the scope? What is the timeline and what needs to happen when? Where is the data? How is access going to be cut over?
An organization needs to test and revise the disaster recovery plan regularly.
When it comes to Kubernetes and DR / Backup, we should consider the following:
Recover specific applications (for example after data corruption)
Recover entire clusters in case of a disaster
Migrate a cluster from one environment to another (on-prem to cloud or vice versa)
Disaster Recovery: Technological Requirements
Recovery Point Objective (RPO) and Recovery Time Objective (RTO) drive the choice of a technology solution that meets an organization's requirements. For example, if the RPO requirement for a particular application is in seconds, then synchronous replication would be the choice for meeting it. Similarly, if the RTO requirement is in seconds, then clustering of servers would enable meeting it. When designing a backup strategy, the architect also considers the backup schedule (full, incremental, or cumulative) after analyzing the organization's environment. The key objective is to restore the environment after a disaster with acceptable cost, RTO, and RPO.
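The rule of thumb above can be written down as a small decision helper. This is only a sketch: the 60-second cutoff for "in seconds", the DrRequirement type, and the wording of the fallback suggestions are assumptions made for illustration, not values from any product or standard.

```python
from dataclasses import dataclass
from typing import List

# Assumed cutoff: an objective at or below this value counts as "in seconds".
SECONDS_SCALE_LIMIT = 60


@dataclass
class DrRequirement:
    """Hypothetical record of one application's recovery objectives."""
    app_name: str
    rpo_seconds: int  # maximum tolerable data loss, expressed as time
    rto_seconds: int  # maximum tolerable downtime


def suggest_technologies(req: DrRequirement) -> List[str]:
    """Map an application's RPO and RTO to candidate technologies."""
    suggestions = []
    # RPO drives how data is copied to the alternate site.
    if req.rpo_seconds <= SECONDS_SCALE_LIMIT:
        suggestions.append("synchronous replication")
    else:
        suggestions.append("asynchronous replication or scheduled backups")
    # RTO drives how quickly service must be available again.
    if req.rto_seconds <= SECONDS_SCALE_LIMIT:
        suggestions.append("server clustering")
    else:
        suggestions.append("restore or failover at the recovery site")
    return suggestions


if __name__ == "__main__":
    for req in (
        DrRequirement("payments", rpo_seconds=5, rto_seconds=10),
        DrRequirement("reporting", rpo_seconds=86400, rto_seconds=14400),
    ):
        print(req.app_name, "->", suggest_technologies(req))
```

In practice cost also enters the decision, so a real assessment would weigh these suggestions against budget for each application.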
The key technological requirements for the successful DR implementation are:
Automated site recovery
It is generally a good idea to set up a backup environment outside of your platform, on isolated network segments behind a firewall, to protect the backups in a security event such as ransomware. If you have two datacenter locations, consider keeping a local backup at each site and replicating those backups to the opposite site. You can also back up your data to the public cloud if building a physical installation is not feasible.
Replication can be performed at the compute, storage (storage system, backup device), and network levels, based on the requirements. The two basic modes of remote replication are synchronous and asynchronous. In synchronous replication, the replica is always identical to the source, which provides a near-zero Recovery Point Objective (RPO). In asynchronous replication, the replica is behind the source by a finite time (finite RPO). A dedicated or shared network must be in place for remote replication.
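To make the "finite RPO" of asynchronous replication concrete, here is a deliberately simplified model. It assumes replication runs on a fixed interval and that each cycle takes a known time to transfer, so the worst-case data loss window is roughly the interval plus the transfer time. The function names and the example numbers are hypothetical.

```python
def worst_case_rpo_seconds(replication_interval_s: float,
                           transfer_time_s: float) -> float:
    """Estimate the worst-case RPO of periodic asynchronous replication.

    Simplified model (an assumption, not a vendor formula): data written
    just after a cycle starts is only protected once the next cycle has
    finished transferring, so the exposure is interval + transfer time.
    """
    if replication_interval_s < 0 or transfer_time_s < 0:
        raise ValueError("durations must be non-negative")
    return replication_interval_s + transfer_time_s


def meets_rpo(required_rpo_s: float,
              replication_interval_s: float,
              transfer_time_s: float) -> bool:
    """Check whether a replication schedule satisfies a required RPO."""
    estimate = worst_case_rpo_seconds(replication_interval_s, transfer_time_s)
    return estimate <= required_rpo_s


if __name__ == "__main__":
    # Replicate every 15 minutes; each cycle takes about 2 minutes to copy.
    print(worst_case_rpo_seconds(900, 120))
    print(meets_rpo(3600, 900, 120))
    print(meets_rpo(300, 900, 120))
```

Under this model a one-hour RPO is met by a 15-minute schedule, while a five-minute RPO is not and would call for a shorter interval or synchronous replication.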
Sizing Backup Targets
When sizing backup targets, you need to know how much data must be backed up, how many copies you need to keep, and how long you will keep each copy. This information is used to calculate the total storage media required.
Capacity required for backup = (Total data to be backed up × Frequency of backup × Retention period)
Example: Calculate the total capacity of media required:
The total amount of data to be backed up = 500 GB
Full backup once a week
The retention period for full backups = 2 months
Daily incremental backups (assume average 10% change in data).
The retention period for incremental backups = 1 month
Total capacity required for backup:
Size of full backups = 500 GB × 4 per month × 2 months = 4000 GB
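Size of incremental backups = (10% of 500 GB) × 26 per month × 1 month = 1300 GB (26 incrementals assumes roughly 30 days in a month minus the 4 days on which a full backup runs)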
Total capacity required = 4000 GB + 1300 GB = 5300 GB
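The same arithmetic can be captured in a short script so it is easy to rerun with different retention policies. This is a minimal sketch: the function and parameter names are made up for illustration, and it ignores compression, deduplication, and data growth.

```python
def backup_capacity_gb(total_data_gb: float,
                       fulls_per_month: int,
                       full_retention_months: int,
                       daily_change_percent: float,
                       incrementals_per_month: int,
                       incremental_retention_months: int) -> dict:
    """Apply capacity = data size x backup frequency x retention period."""
    # Every full backup stores a complete copy of the data.
    full_gb = total_data_gb * fulls_per_month * full_retention_months
    # Every incremental stores only the portion of the data that changed.
    changed_gb = total_data_gb * daily_change_percent / 100
    incremental_gb = (changed_gb * incrementals_per_month
                      * incremental_retention_months)
    return {
        "full_gb": full_gb,
        "incremental_gb": incremental_gb,
        "total_gb": full_gb + incremental_gb,
    }


if __name__ == "__main__":
    # Inputs taken from the worked example above.
    result = backup_capacity_gb(
        total_data_gb=500,
        fulls_per_month=4,
        full_retention_months=2,
        daily_change_percent=10,
        incrementals_per_month=26,
        incremental_retention_months=1,
    )
    for name, value in result.items():
        print(name, value)
```

With the inputs from the example, this yields 4000 GB for full backups, 1300 GB for incrementals, and 5300 GB in total, matching the manual calculation.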
Backing up Kubernetes clusters is challenging. A single application in production may consist of hundreds of components, including containers/pods, ConfigMaps, cluster configuration files, certificates, secrets, and volumes. Backing up an application running on Kubernetes is not like backing up an app running in a virtual machine. Kubernetes environments are far more dynamic. VM backups will NOT work. It is possible to perform manual backups using scripts, but this is highly error-prone and does not scale. It is important to select a backup product that can back up all of the internal stateless items inside of Kubernetes.
VMware has partnered with MinIO and Velero to deliver a backup solution for Tanzu. MinIO basically creates a virtual S3 volume that can be used with vSAN. Velero manages and orchestrates the backup operations. I do not have any experience with Velero at this point so I will not go into more details at this time.
Pure Storage's Portworx is a backup and DR / HA product that is certified on Tanzu. In addition to handling and orchestrating k8s backups with all of their internal components, you can also use Portworx as a DR and site-to-site HA platform. With the heart and soul of the Kubernetes environment being the hard drive, you can even use Portworx to move clusters between clouds. So if you developed an application with Tanzu on premises, you can essentially move the storage volumes into AWS or Azure and then run those clusters there. In effect, Portworx is an application and cluster mobility tool.
The Tanzu + Portworx reference architecture can be found here:
Backups are a bit more complicated due to their fundamental nature. To summarize the key requirements:
Backups need to be able to be performed regardless of WAN link connectivity.
Determine a DR plan and build the DRaaS solution.
Reduce the time required for the backup and disaster recovery processes.
Make the system fully automated, secure and cost-effective.