Kubernetes Platform Design Considerations for Tanzu Part 1: Availability
Updated: Feb 21, 2022
One of my current focus areas is the selection, design and implementation of Kubernetes infrastructure solutions. I am working with several different solutions, including OpenShift, Amazon EKS, VMware Tanzu & "vanilla" Kubernetes. I wanted to outline some of the major design considerations that should be taken into account when designing and building the VMware Tanzu platform. Tanzu should be looked at as a platform, which includes the infrastructure layers as well if you are deploying on premises. This post focuses on the vSphere cloud provider.
I try to use the AMPRS design methodology whenever possible. This applies to any cloud provider, whether it be vSphere or public cloud. Each area contributes directly to the success of the platform.
Availability or High-Availability (HA) is about setting up components in such a way that there is no single point of failure. A single point of failure (SPOF) is any individual component or aspect of a system whose failure can make the entire system or service unavailable.
IT infrastructure is expected to be fault tolerant in order to reduce downtime and to handle increased workloads. An important consideration when implementing high availability strategies is to avoid a single point of failure. HA should be designed and built across the entire stack, from the application down through the infrastructure layers.
The sections below list some main areas that should be addressed when considering Kubernetes platform HA.
Consider Infrastructure Failure Scenarios
Kubernetes - Master Nodes & Multi-Master Design
One of the reasons that Kubernetes is being widely adopted is that high availability and self-healing are natively baked into the platform. It contains components that make infrastructure distributed and can recover from multiple failures without administrative intervention. The first component on my list to consider is the k8s master nodes.
Each master node runs its own copy of the Kube API server. In a multi-master design, requests can be load balanced across the API servers on all master nodes. Master nodes also run their own copy of the etcd database, which stores all the k8s cluster data. In addition to the API server and etcd database, the master nodes also run the k8s controller manager, which handles replication, and the scheduler, which assigns pods to nodes. Ultimately the master nodes are the control plane for the Kubernetes cluster.
Single master deployment:
In a single master cluster the important components like the API server and controller manager exist only on the single master node. If that node fails, you will not be able to create more services, pods, etc. A single master deployment is a SPOF in the control plane.
Implementing a multi-master design provides the following advantages:
High availability for a single cluster
Improved network performance because all masters behave like a unified data center
Protects from loss of individual worker nodes
Protects from the failure of an individual master node's etcd service
Runs the k8s API service on more than one node (losing this service can render the cluster inoperable)
At least three master nodes are required so that etcd can keep a quorum when one node fails. TKG/TCE workload clusters created with CLUSTER_PLAN: prod automatically deploy three master (control plane) nodes. The count can be changed manually with the CONTROL_PLANE_MACHINE_COUNT: value in the cluster configuration file, which is located in the ~/.config/tanzu/tkg/clusterconfigs directory.
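The reason an odd number such as three is the usual minimum comes from etcd's majority quorum. Here is a minimal sketch of that arithmetic. The function names are my own and purely illustrative, they are not part of any Tanzu or Kubernetes tooling.

```python
def etcd_quorum(members: int) -> int:
    """Smallest majority of an etcd cluster with `members` voting members."""
    if members < 1:
        raise ValueError("a cluster needs at least one member")
    return members // 2 + 1


def failures_tolerated(members: int) -> int:
    """How many members can be lost while a majority is still available."""
    return members - etcd_quorum(members)


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print(
            f"{n} control plane node(s): "
            f"quorum={etcd_quorum(n)}, "
            f"tolerated failures={failures_tolerated(n)}"
        )
```

Note that two nodes tolerate no more failures than one, which is why control plane counts are normally 1, 3 or 5.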
Kubernetes - Multiple-Clusters
A multi-cluster deployment consists of two or more clusters.
The benefits of multi-cluster deployments are:
Multi-cluster deployments can improve availability by using one cluster as a backup or fail-over environment in the event of a cluster outage. Although this could be considered a recoverability design quality, the same principles can be applied here.
Kubernetes - Load Balancers
Usually k8s clusters are deployed with load balancers. Load balancing in Kubernetes is an extensive topic and I will not go into too much detail here. You can configure load balancers for the following:
Kubernetes Clusters: Configuring a load balancer for each new cluster enables you to run Kubernetes CLI (kubectl) commands on the cluster.
Workloads: Configuring a load balancer for your application workloads enables external access to the services that run on your cluster.
Load balancers can detect node and pod failures and reroute traffic to a healthy node or cluster.
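To illustrate the rerouting behaviour, here is a small sketch of a round-robin balancer that skips backends marked unhealthy. The class and the backend names (cp-1, cp-2, cp-3) are hypothetical and only model the idea. A real load balancer discovers failures through its own health probes.

```python
class RoundRobinBalancer:
    """Toy round-robin balancer that only hands out healthy backends."""

    def __init__(self, backends):
        self.backends = list(backends)
        self.healthy = set(self.backends)
        self._next = 0  # index of the next backend to try

    def mark_down(self, backend):
        """Record a failed health check for a backend."""
        self.healthy.discard(backend)

    def mark_up(self, backend):
        """Record that a known backend passed its health check again."""
        if backend in self.backends:
            self.healthy.add(backend)

    def pick(self):
        """Return the next healthy backend, skipping unhealthy ones."""
        for _ in range(len(self.backends)):
            candidate = self.backends[self._next % len(self.backends)]
            self._next += 1
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy backends available")


if __name__ == "__main__":
    lb = RoundRobinBalancer(["cp-1", "cp-2", "cp-3"])
    lb.mark_down("cp-2")  # simulate a failed control plane node
    print([lb.pick() for _ in range(4)])  # cp-2 never appears
```

If every backend is down, the sketch raises an error instead of returning a dead endpoint. That is the situation a multi-master design is meant to make unlikely.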
Kubernetes - System Pods, DaemonSets & ReplicaSets
Kubernetes has a few native availability features "baked-in" at the pod level, namely ReplicaSets and DaemonSets.
A ReplicaSet ensures that the number of pods defined in our config file is always running, regardless of which worker nodes they run on. The scheduler places each pod on any suitable node based on resource availability. In this way it is basically a "DRS for k8s" workload placement mechanism.
A DaemonSet ensures that one copy of a defined pod is running on each worker node at all times. TKG utilizes DaemonSets in its architecture.
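The difference between the two can be sketched as two small reconciliation checks. This is a simplified model with made-up function, pod and node names, not the actual Kubernetes controller code. A ReplicaSet cares only about the pod count, while a DaemonSet cares about which nodes are covered.

```python
def replicaset_delta(desired: int, running_pods: list) -> int:
    """Pods to create (positive) or remove (negative) to match the desired count."""
    return desired - len(running_pods)


def daemonset_missing_nodes(nodes: list, nodes_with_pod: list) -> list:
    """Worker nodes that do not yet run a copy of the DaemonSet pod."""
    return sorted(set(nodes) - set(nodes_with_pod))


if __name__ == "__main__":
    workers = ["worker-1", "worker-2", "worker-3"]

    # ReplicaSet: three replicas wanted, two running, node placement is irrelevant.
    print(replicaset_delta(3, ["web-a", "web-b"]))

    # DaemonSet: every worker must run one copy, worker-2 is missing it.
    print(daemonset_missing_nodes(workers, ["worker-1", "worker-3"]))
```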
Cloud Provider - Virtual Machines
Tanzu Kubernetes Clusters are deployed onto virtual machines. These are pre-validated images that are deployed into vCenter. Virtual machines provide an extra layer of protection and security isolation that is not easily achieved when running on bare metal, for example. vSphere provides several native high availability and resource balancing mechanisms for these workload clusters.
Each TKG cluster is configured in a stacked deployment, with Kube-API and etcd services available on each controller (master node) VM, as described in the sections above. vSphere HA provides high availability for controllers and powers them back on in the event of a failure.
DRS provides the resource balancing mechanism for the virtual machines (k8s nodes).
Soft anti-affinity (should run) rules should be configured to separate controller VMs across different hosts when possible.
You will notice that if you try to delete a controller VM in vCenter, the VM will be re-created and started up again. This is because controller VMs are lifecycle managed by the vSphere ESX Agent Manager (EAM) and cannot easily be powered off by administrators. If they are powered off manually, they will immediately be restarted on the same host.
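The idea of separating controller VMs across hosts with a soft rule can be sketched as a simple spreading function: prefer the host with the fewest controllers, but still place the VM if there are fewer hosts than controllers. This only illustrates the "should" semantics. It is not how DRS is implemented, and the VM and host names are hypothetical.

```python
def place_controllers(vms: list, hosts: list) -> dict:
    """Spread controller VMs across hosts, preferring the least-loaded host.

    The rule is soft: when there are more VMs than hosts, VMs are still
    placed and simply end up sharing a host.
    """
    if not hosts:
        raise ValueError("at least one host is required")
    load = {host: 0 for host in hosts}
    placement = {}
    for vm in vms:
        # min() returns the first host on ties, so placement is deterministic.
        target = min(hosts, key=lambda host: load[host])
        placement[vm] = target
        load[target] += 1
    return placement


if __name__ == "__main__":
    controllers = ["cp-1", "cp-2", "cp-3"]
    print(place_controllers(controllers, ["esx-1", "esx-2", "esx-3"]))
    print(place_controllers(controllers, ["esx-1", "esx-2"]))
```

With three hosts each controller lands on its own host. With two hosts, one host carries two controllers, which is exactly the case where a single host failure can cost the control plane its quorum.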
Cloud Provider - Storage
The infrastructure storage should be redundant and shared across all of the hosts that make up the cluster. vSAN is a distributed storage solution that has been optimized to run in a software defined environment.
Cloud Provider - Multi-AZ and Stretched Clusters
Multi-Availability Zone (AZ) is a term used to describe a physical multi-site deployment. Stretched clusters are a combination of AZs. Typically k8s clusters are deployed in a single AZ, which is then replicated to another for disaster recovery reasons. But a lot of customers are using stretched clusters today for workload mobility and disaster recovery.
Currently TKG / vSphere with Tanzu does not support stretched clusters. There are several reasons why, the main one being latency.
K8s clusters are extremely sensitive to latency. If latency cannot be guaranteed to stay under 5 ms, the cluster can become unstable and the containers may become inoperable.
Also, there are very strict pod affinity and anti-affinity rules that need to be applied for the solution to work. Some applications might work under perfect circumstances, but typically this cannot be easily guaranteed, so the platform would not be very reliable.
There are other companies that understand Kubernetes backup and HA. One such company is Portworx, which effectively creates a stretched cluster on the storage layer to provide both site disaster recovery and k8s cluster backups. Portworx is also certified to work with Tanzu.
They have an excellent article detailing their solution here.
Consider Infrastructure Failure Scenarios
Modern platforms are getting very good at being redundant and self-healing. But it's important to be prepared for possible failure scenarios. Here are some additional failure scenarios to consider:
ESXi Host Failures