Storage System Design Implications for Modern Applications
- William B

- Oct 10
Modern applications designed for enterprise organizations have specific system availability demands that are critical for ensuring performance, reliability, and user satisfaction. These demands arise from the need for continuous operation, scalability, and responsiveness in a dynamic environment.
Below are the key aspects of these demands:
1. High Availability (HA)
Modern applications require high availability to minimize downtime and ensure that services are consistently accessible. This often involves:
Redundant components to eliminate single points of failure.
Automated failover mechanisms to switch to backup systems seamlessly.
Load balancing to distribute traffic and workload evenly across servers.
2. Scalability
Applications must be able to scale horizontally or vertically to accommodate varying workloads. This includes:
Dynamic resource allocation to respond to traffic spikes.
Elasticity to scale resources up or down based on demand.
Support for containerization and microservices architectures for efficient scaling.
3. Disaster Recovery
Robust disaster recovery strategies are essential for maintaining availability in the event of failures. Key components include:
Regular data backups to prevent data loss.
Geographically distributed data centers to mitigate risks from local disasters.
Comprehensive recovery plans that are regularly tested and updated.
4. Performance Monitoring and Management
Continuous monitoring of system performance is vital to ensure availability. This involves:
Real-time analytics to detect and resolve performance issues quickly.
Automated alerts for system anomalies or failures.
Performance tuning based on usage patterns and metrics.
5. Security and Compliance
Maintaining system availability also requires robust security measures to protect against attacks that could lead to downtime. This includes:
Regular security audits and vulnerability assessments.
Implementation of firewalls, intrusion detection systems, and encryption.
Compliance with industry standards and regulations to prevent legal issues.
6. User Experience and Responsiveness
Finally, applications must deliver a responsive user experience to meet user expectations. This entails:
Optimizing application performance to reduce latency.
Implementing caching strategies to speed up content delivery.
Ensuring consistent performance across different devices and networks.
What are you protecting against?
If you do not understand the risks, then you cannot protect against them. Make sure you fully define this in your design. The most common failure scenarios you should protect against are:
• Single Point of Failure (Hardware or Software) – a failure within the data center, normally protected against with High Availability mechanisms such as application clustering, database clustering, and load balancing.
• Single Site Failure – your main data center experiences a complete failure that renders it unusable (natural, man-made, or accidental disaster).
• Accidental Deletion of Data – an administrator accidentally “fat fingers” a database, disk, datastore, etc. The deletion may or may not be replicated to the remote site.
• Malicious Deletion of Data – an administrator deliberately targets a mission-critical system and deletes data, and the deletion is replicated to the remote site. This is the most difficult scenario to protect against and the most difficult to recover from.
Your application data physically lives on the storage system beneath the cloud. Careful consideration should be put into how this storage system is designed and configured.
There are two main storage design personas within the scope of this post:
"Zonal" (AZ) and "Cross-Zonal" (Cross-AZ) storage systems. Note that the terms "Zone" and "Availability Zone" (AZ) are used interchangeably here.
Zonal Storage (AZ)
In the context of vSphere (or VMware Cloud Foundation) these are vSphere datastores that are attached to all ESXi hosts within an AZ.
The storage can offer better availability within the zone by replicating data. For example, vSAN with FTT=1 (failures to tolerate) keeps 2 replicas of the data within a Zone.
In other words, you can provide availability vertically within a Zone plus availability across Zones. This is a good example of a "Shared-Nothing" architecture.
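As a rough illustration of the FTT example above, here is a minimal sketch that assumes the vSAN policy uses RAID-1 mirroring, where tolerating n failures takes n + 1 full copies and 2n + 1 hosts (the extra hosts hold witness components). The function name `raid1_layout` and the returned keys are hypothetical, chosen only for this example; erasure-coding policies follow different arithmetic.

```python
def raid1_layout(ftt: int) -> dict:
    """Replica and host counts for an assumed RAID-1 (mirroring) policy."""
    if ftt < 0:
        raise ValueError("ftt must be non-negative")
    replicas = ftt + 1       # full copies of the data
    min_hosts = 2 * ftt + 1  # hosts needed for the copies plus witness components
    return {
        "replicas": replicas,
        "min_hosts": min_hosts,
        "raw_capacity_multiplier": replicas,  # each copy consumes full capacity
    }


print(raid1_layout(1))
```

Under these assumptions, FTT=1 doubles the raw capacity consumed inside the Zone, which is worth factoring in before layering application-level replication across Zones on top.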

Volumes in zonal storage are accessible only to workloads within that zone, meaning that a zonal disaster event could bring down all the volumes in that zone's storage. With this design, data persistence and replication across zones need to be solved in the application layer.
Let's take an example: a Kubernetes workload cluster with a MongoDB StatefulSet spread across three Zones. The failure scenarios play out as follows:
One AZ goes down
MongoDB will still have quorum and the data service is still available. Note that MongoDB is in degraded mode.
One AZ is isolated (network partitioned)
MongoDB will still have quorum and the data service is still available. Note that MongoDB is in degraded mode.
Two AZs go down
MongoDB will not have quorum: the surviving member cannot elect a primary, so writes stop and the data service is effectively down.
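The three scenarios above come down to a majority check. The sketch below assumes one voting replica set member per AZ (the zone names and the helper `has_majority` are hypothetical) and treats an isolated AZ the same as a failed one from the point of view of the remaining members.

```python
def has_majority(members_per_zone: dict, unavailable_zones: set) -> bool:
    """Return True if the reachable voting members form a strict majority."""
    total = sum(members_per_zone.values())
    reachable = sum(
        count
        for zone, count in members_per_zone.items()
        if zone not in unavailable_zones
    )
    return reachable >= total // 2 + 1


# Assumed layout: one voting member in each of three zones.
replica_set = {"az1": 1, "az2": 1, "az3": 1}

for failed in [set(), {"az1"}, {"az1", "az2"}]:
    if has_majority(replica_set, failed):
        state = "quorum held, primary can be elected"
    else:
        state = "no quorum, no primary"
    print(sorted(failed), "->", state)
```

With three members the majority is two, so losing any single AZ is survivable and losing two is not. The same check shows why an even split across two zones does not help: neither side of a partition would hold a strict majority.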
Cross-Zonal Storage (Cross-AZ)
With a Cross-AZ design, the datastore is shared across Zones. For example, an NFS datastore deployed in a global location can be mounted in multiple Zones. The volumes will survive one or more AZs going down as long as the Cross-Zonal storage is accessible from at least one AZ.
The main use case here is an application that does not replicate its data but needs the data to remain available after an AZ failure. It expects the storage to replicate across Zones, or the storage to be global and attached to all or some Zones.
There are other considerations with this design. For example, if a failure occurs at the storage system layer, all applications across all Zones could potentially be affected. This is an example of a "Shared-Everything" architecture.

One AZ goes down
Workers in that AZ will be marked as NotReady by Kubernetes. Kubernetes reschedules the pod to a healthy node in another AZ.
One AZ is isolated (network partitioned)
Workers in that AZ will be marked as NotReady by Kubernetes. Kubernetes reschedules the pod to a healthy node in another AZ.
Two AZs go down
etcd will not have quorum, so the Kubernetes control plane will be effectively read-only and pretty much nothing that requires a cluster change works. If the app was running in an AZ that is still available, the app continues to work because the data path is still up. If the app was running in an AZ that went down, the app is down since Kubernetes cannot reschedule it.
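The Cross-AZ scenarios above can be summarized in a small decision function. This is a sketch under stated assumptions, not a model of real scheduler behavior: it assumes one etcd member per AZ, that the Cross-Zonal storage stays reachable from every surviving AZ, and that rescheduling happens as described in this post. The name `app_state` and the zone labels are hypothetical.

```python
def app_state(app_zone: str, failed_zones: set, all_zones: list) -> str:
    """Classify the app as 'running', 'rescheduled', or 'down'."""
    surviving = [zone for zone in all_zones if zone not in failed_zones]
    # Assumption: one etcd member per AZ, so quorum needs a strict majority of AZs.
    etcd_quorum = len(surviving) >= len(all_zones) // 2 + 1

    if app_zone not in failed_zones:
        # The data path is still up, even if the control plane lost quorum.
        return "running"
    if etcd_quorum:
        # The control plane can still move the pod to a healthy AZ.
        return "rescheduled"
    return "down"


zones = ["az1", "az2", "az3"]
print(app_state("az1", {"az1"}, zones))
print(app_state("az3", {"az1", "az2"}, zones))
print(app_state("az1", {"az1", "az2"}, zones))
```

The sketch makes the trade-off visible: shared storage keeps the data available, but recovering the workload still depends on the control plane holding quorum.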
The system availability demands of modern applications in private clouds are multifaceted and require a comprehensive approach to architecture, infrastructure, and operations. By addressing these demands, organizations can ensure that their applications remain reliable, scalable, and secure in a competitive landscape.
The aspects I have laid out in this blog are just the tip of the iceberg when it comes to the careful planning that's needed to implement a well-designed storage platform (and potentially a MultiZone architecture).
If you are planning a project like this and would like VCDX expert assistance, please feel free to reach out to me or your Professional Services point of contact.



