Understanding Architecture Failure Modes & Weaknesses
Every cloud platform designed and built by humans has weaknesses and faults. Part of the value a platform brings to its users comes from its ability to withstand and mitigate hardware and software failures. Faults and downtime can result from human error, software defects, hardware failures, and other causes.
Part of my role as an architect is to design systems that are not only fault tolerant but also highly available, while addressing our customers' requirements.
When creating an architecture design, one useful technique is to build a matrix that covers each domain of the overall platform. Let's take VMware Cloud Foundation (VCF) in a stretched cluster with vSAN as an example.
VCF at a conceptual level is made up of Storage, Network, Compute and Platform Management subdomains. vSphere abstracts these layers and turns them into something that can be consumed by workload applications and a human operator.
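As a rough illustration of what such a matrix could look like, here is a minimal Python sketch. The four domain names come from the description above. The `FailureMode` type, the `uncovered_domains` helper, and the three sample rows are hypothetical placeholders, not entries from an actual design.

```python
from dataclasses import dataclass

# The four subdomains named in the text.
DOMAINS = ("Storage", "Network", "Compute", "Platform Management")


@dataclass(frozen=True)
class FailureMode:
    """One row of the matrix (hypothetical structure)."""
    domain: str
    failure: str
    mitigation: str


# Illustrative rows only; a real matrix would come from your own design review.
MATRIX = [
    FailureMode("Compute", "ESXi host failure",
                "vSphere HA restarts VMs on surviving hosts"),
    FailureMode("Storage", "Capacity disk failure",
                "vSAN rebuilds components according to the storage policy"),
    FailureMode("Network", "Physical uplink failure",
                "Redundant uplinks with NIC teaming"),
]


def uncovered_domains(matrix):
    """Return the domains that have no failure modes recorded yet."""
    covered = {fm.domain for fm in matrix}
    return [d for d in DOMAINS if d not in covered]


print(uncovered_domains(MATRIX))
```

A check like `uncovered_domains` is a cheap way to spot a domain that has not been reviewed yet. In this sample, Platform Management has no rows.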
Within each subdomain, there are several things that can potentially go wrong. Understanding what can go wrong, and how, allows us to build platforms that are more reliable and resilient against failures. Modern platforms like VCF provide several safety measures that protect against common failure scenarios.
Here are some examples of different failure modes within the context of VCF with vSAN. Please note that these are just examples used in one of my designs. Different solutions have different strengths and weaknesses and should be evaluated separately, and your situation might differ based on many factors.
Once you have evaluated potential failure modes within your environment, you can use that information to develop test cases that validate your solution.
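One way to turn evaluated failure modes into test cases is sketched below in Python. Everything here is an assumption made for illustration: the `TestCase` fields, the `to_test_case` helper, and the two sample failure modes are hypothetical and would be replaced by the findings from your own environment.

```python
from dataclasses import dataclass


@dataclass
class TestCase:
    """A validation test derived from one failure mode (hypothetical structure)."""
    name: str
    inject: str          # how the failure is simulated
    expected: str        # behavior the design is supposed to deliver
    status: str = "not run"


def to_test_case(domain, failure, expected):
    """Build a test case from a (domain, failure, expected behavior) tuple."""
    return TestCase(
        name=f"{domain}: {failure}",
        inject=f"Simulate: {failure}",
        expected=expected,
    )


# Illustrative failure modes only; substitute the ones from your own evaluation.
failure_modes = [
    ("Compute", "ESXi host failure",
     "Workloads restart on surviving hosts"),
    ("Storage", "Loss of one site in the stretched cluster",
     "Data remains accessible from the surviving site"),
]

test_plan = [to_test_case(*fm) for fm in failure_modes]

for tc in test_plan:
    print(f"[{tc.status}] {tc.name} -> expect: {tc.expected}")
```

Keeping the expected behavior next to each failure mode makes it easy to record a pass or fail once the test is actually run against the platform.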