Kubernetes Platform Design Considerations for Tanzu Part 3: Performance
Updated: Feb 7, 2022
Moving forward from my last post about Manageability, we will focus on performance.
Performance is an extensive topic, with far too much to cram into a single post. With Tanzu there are a few areas to consider. As with the other posts in this series, we have to understand that we are building a k8s management platform. With that come specific design considerations for the k8s part. But because some deployments will be on premises, we must also consider the underlying platform.
For the Kubernetes layer, consider the following performance best practices:
Limit containers to a single function (e.g. database OR web server). This makes them easier to troubleshoot and manage.
Use smaller container images where possible, because large images are slower to pull and cumbersome to port.
Implement container-friendly operating systems (e.g. Photon OS), because they are more resistant to misconfigurations and have a smaller attack surface.
Configure health and readiness endpoint checks. This enables Kubernetes to take appropriate action if your container is down or not ready to serve traffic.
Optimize etcd cluster deployments. Etcd is the heart and soul (hard drive) of the cluster: all cluster state is persisted to the etcd members on the control plane nodes through the API server. Deploy the nodes that run etcd on servers with solid state disks (SSDs) with low I/O latency and high throughput.
Deploy k8s near your customers. K8s is an extremely latency-sensitive technology, so it may make sense to run the k8s clusters as close to the end user as possible.
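To make the health and readiness item above more concrete, here is a minimal sketch of what such endpoints could look like inside a container, using only the Python standard library. The paths `/healthz` and `/readyz`, the port 8080, and the `STATE` flag are assumptions chosen for illustration; on the Kubernetes side you would point the pod's `livenessProbe` and `readinessProbe` (`httpGet`) at whatever paths your application actually exposes.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical application state. A real service would set "ready" to True
# only once its dependencies (database, cache, ...) are reachable.
STATE = {"ready": False}


def probe_status(path: str) -> int:
    """Return the HTTP status code to send for a probe path."""
    if path == "/healthz":
        # Liveness: the process is up and able to answer requests.
        return 200
    if path == "/readyz":
        # Readiness: 503 tells Kubernetes not to route traffic here yet.
        return 200 if STATE["ready"] else 503
    return 404


class ProbeHandler(BaseHTTPRequestHandler):
    """Answers GET requests with the status chosen by probe_status."""

    def do_GET(self):
        self.send_response(probe_status(self.path))
        self.end_headers()


def serve(port: int = 8080) -> None:
    """Start the probe server; this call blocks until the process exits."""
    HTTPServer(("", port), ProbeHandler).serve_forever()
```

Calling `serve()` at startup and flipping `STATE["ready"]` once initialization completes gives Kubernetes two separate signals: restart the container when the liveness check fails, and hold traffic back while the readiness check returns 503.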
For the Infrastructure layer / Cloud Provider, consider the following performance areas:
Virtual Machine size for control plane & worker nodes. The size selection controls the number of CPUs and the amount of memory allocated to the virtual machine that hosts a cluster node for the TKG cluster. Note that node VMs cannot be scaled up once deployed. If not enough resources were allocated, you will need to create a new cluster and then migrate the workloads over to it.
Determine the number of worker nodes. You can scale up or scale down the number of worker nodes in a cluster created by the TKG service if you initially over-allocate or under-allocate nodes.
vSphere cluster CPU & Memory capacity. For CPU, consider the total number of physical cores on all servers in the cluster x the clock speed in MHz of each core, which gives the total MHz available in the cluster.
Consider the overhead involved with transport protocols, hypervisors, storage, and other dependencies.
Include the distribution of services across hosts in order to factor in any intra-host traffic.
Understand the network bandwidth requirements between services, and between services and any source external to the cloud.
Factor in the expected growth rate when calculating the overall bandwidth required.
Aggregation: To increase bandwidth capabilities at the host, aggregate host traffic across multiple ports using teaming or bonding. It is appropriate to use high bandwidth connections, such as 10 Gb links, when supporting converged network traffic, many VLANs, or a dense service allocation. When using multiple NICs for redundancy, plan for sufficient bandwidth if a path fails.
Avoid Oversubscription: It is important to understand the amount of traffic being sent and received by each host, and the network path of that traffic. Traffic originating from the hosts may be aggregated into a few switch uplinks, which could lead to oversubscription and bottlenecks.
Host Affinity: Virtual machines that are on the same host may communicate without the traffic ever leaving the host. If you strategically place related systems such as application and database servers, and a well-defined process is in place to maintain that placement, a significant amount of traffic may be removed from the physical switch infrastructure.
Bandwidth Priority: Some hypervisors have mechanisms to control or limit bandwidth for specific types of traffic. Although this mechanism does not guarantee that there is enough bandwidth for all traffic, it does provide a guarantee for higher-priority traffic. Setting a bandwidth priority for storage traffic can ensure that storage performance does not suffer. Research is required to determine which hypervisors and physical infrastructure support this feature, and what the available options and limitations are.
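The bandwidth points above (overhead, growth, NIC redundancy and oversubscription) come down to some simple arithmetic. The following is only a rough sketch in Python: the function names are hypothetical, and the default overhead and growth figures are placeholder assumptions, not recommendations. Use your own measurements.

```python
def required_bandwidth_gbps(service_gbps, overhead=0.10,
                            annual_growth=0.20, years=3):
    """Estimate peak host bandwidth after overhead and compound growth.

    service_gbps: peak bandwidth of each service placed on the host.
    overhead: assumed fraction added by transport protocols, hypervisor, etc.
    annual_growth: assumed yearly growth rate, compounded over `years`.
    """
    base = sum(service_gbps)
    return base * (1 + overhead) * (1 + annual_growth) ** years


def surviving_capacity_gbps(nic_gbps, nic_count, failed=1):
    """Bandwidth left on a teamed/bonded host after `failed` paths are lost."""
    return nic_gbps * max(nic_count - failed, 0)


def oversubscription_ratio(host_count, host_uplink_gbps, switch_uplink_gbps):
    """Ratio of total host-facing bandwidth to the switch uplink bandwidth."""
    return (host_count * host_uplink_gbps) / switch_uplink_gbps
```

For example, comparing `required_bandwidth_gbps([2.0, 1.5, 0.5])` against `surviving_capacity_gbps(10, 2)` shows whether a host with two 10 Gb links still has headroom when one path fails, and `oversubscription_ratio(16, 20, 80)` shows how heavily sixteen such hosts lean on 80 Gb of switch uplinks.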
To reduce network latency, the infrastructure design may include Quality of Service (QoS) policies, minimal hop counts, and dedicated infrastructure supporting network function components such as load balancers and firewalls.
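Returning to the compute side of the infrastructure list, the vSphere cluster CPU capacity and worker node count can be estimated with a short calculation. This is a back-of-the-envelope sketch in Python, not a sizing tool: the function names are hypothetical, and it assumes one vCPU maps to one physical core (no overcommit) and that a fixed fraction of capacity is held back for overhead and failover.

```python
def total_cpu_mhz(hosts, cores_per_host, mhz_per_core):
    """Total cluster CPU capacity: physical cores x MHz per core."""
    return hosts * cores_per_host * mhz_per_core


def max_worker_nodes(cluster_mhz, vcpus_per_node, mhz_per_core, reserve=0.2):
    """Number of worker node VMs that fit in the cluster's CPU capacity.

    reserve: assumed fraction of capacity kept free for hypervisor
    overhead, control plane nodes and host failure (placeholder value).
    Assumes one vCPU consumes one full physical core.
    """
    usable_mhz = cluster_mhz * (1 - reserve)
    return int(usable_mhz // (vcpus_per_node * mhz_per_core))
```

With four hosts of 32 cores at 2600 MHz, `total_cpu_mhz(4, 32, 2600)` gives the cluster total, which can then be passed to `max_worker_nodes` together with the vCPU count of the VM size you picked. Because node VMs cannot be resized after deployment, it is worth running this kind of estimate before choosing the VM size.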