
The (Unofficial) VKS Workload Latency Verification Framework

  • Writer: William B
  • 2 minutes ago
  • 6 min read

With the introduction of VCF 9, many organizations are opting to protect their workloads using geo-redundancy methods such as MultiZone workload domain technologies. A MultiZone design brings considerable availability benefits, but it also introduces unique challenges, such as increased latency caused by the greater physical distance between sites.


Checking latency between all core VCF components is critical to the success of a project like this. We should also take things a step further and verify the latency at the pod level, to ensure there are no issues that would prevent our application from working properly.


This updated vSphere Kubernetes Service (VKS) testing and validation blueprint for VCF 9.0.2 includes dedicated VKS (Kubernetes) Pod Latency Verification frameworks. These steps explicitly measure cross-zone container network interface (CNI) processing times, Inter-Pod network jitter, and application response tolerances within a multi-zone configuration.


Note: This is an example blueprint. Never make configuration changes to any VCF system components unless explicitly directed to by Broadcom support or documentation. Always customize your testing plans according to your design and desired KPIs.


Topology Overview

📊 Phase 1: Pre-Deployment Physical & Kernel Latency Verification


Before deploying multi-zone guest workloads, the physical hypervisor network must meet strict latency targets so that stateful volume replication and cluster consensus stay in sync.


### 1.1 Performance SLA Boundaries

* **Inter-Zone Worker Round-Trip Time (RTT)**: Maximum 10ms RTT (Ideal: < 2ms). High latency directly triggers etcd election timeouts.
* **Storage Replication Links**: Maximum 5ms RTT with jitter variance staying under 0.5ms for vSAN active-active mirroring.
* **Packet Drop Threshold**: Strict 0% packet loss tolerated under peak interface load.
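The RTT and loss limits above can be checked mechanically against a ping summary. The following is a minimal sketch, not part of the original blueprint: `check_ping_sla` is a hypothetical helper name, and it assumes the ping tool prints a `% packet loss` line and a `min/avg/max` summary line (the common layout for both `vmkping` and Linux `ping`).

```bash
# check_ping_sla <max_avg_rtt_ms>   (hypothetical helper; reads ping output on stdin)
# Passes only when packet loss is 0% and the average RTT is within the limit.
check_ping_sla() {
  awk -v limit="$1" '
    /packet loss/ {
      # The loss figure is the field that ends in "%".
      for (i = 1; i <= NF; i++) if ($i ~ /%$/) { loss = $i + 0; seen_loss = 1 }
    }
    /min\/avg\/max/ {
      # Works for "min/avg/max = a/b/c ms" and "min/avg/max/mdev = a/b/c/d ms".
      split($0, halves, "= ")
      split(halves[2], v, "[/ ]")
      avg = v[2] + 0; seen_rtt = 1
    }
    END {
      if (!seen_loss || !seen_rtt) { print "FAIL: no ping summary found"; exit 2 }
      verdict = (loss == 0 && avg <= limit) ? "PASS" : "FAIL"
      printf "%s: loss=%s%% avg=%.3fms limit=%sms\n", verdict, loss, avg, limit
      exit (verdict == "PASS" ? 0 : 1)
    }'
}

# Example usage on an ESXi host or any Linux shell:
#   vmkping -I vmk3 -s 8972 -d -c 100 <ZONE_B_STORAGE_IP> | check_ping_sla 5
```

The function returns a non-zero exit code on failure, so it can gate a larger validation script.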


### 1.2 Latency Validation Execution

Log into an ESXi host terminal in Zone A and execute an MTU-enforced, payload-stressed ping sweep targeting the transport interfaces in Zone B and Zone C:


```bash
# Test cross-zone vSAN storage interfaces using jumbo frames (MTU 9000)
vmkping -I vmk3 -s 8972 -d -c 100 <ZONE_B_STORAGE_IP>

# Test inter-zone host management interfaces (standard MTU 1500)
# Overlay TEP vmknics usually live in a separate netstack (for example ++netstack=vxlan);
# adjust the interface and netstack to match your design.
vmkping -I vmk0 -s 1472 -d -c 100 <ZONE_C_MGMT_IP>
```

* **Success Rule**: Round-trip standard deviation (`mdev`) must not exceed 0.5ms across 100 sequential packets. If your `vmkping` build prints only `min/avg/max`, the spread between `min` and `max` can serve as a rough jitter indicator.

### 1.3 Hypervisor Interface Drop Audit

Ensure that high-throughput inter-zone storage syncing does not exhaust physical switch buffers:

```bash
# Query active performance counters on the primary uplink adapter
esxcli network nic stats get -n vmnic0
```

* **Required Baseline Metric**: The fields `Receive packets dropped` and `Transmit packets dropped` must reflect exactly 0.

🚀 Phase 2: Day-1 Control Plane Activation Post-Checks

This phase checks the architectural health of the Multi-Zone Supervisor after activation to ensure proper separation across fault domains.

### 2.1 Anti-Affinity Quorum Layout Verification

Validate that the Supervisor control plane VMs are cleanly and uniformly split across your physical zones:

```bash
# Quote the column spec so the shell keeps the escaped dots in the label key
kubectl get nodes --selector=node-role.kubernetes.io/control-plane \
  -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
```

#### Expected Balanced Matrix Output:

```text
NAME               ZONE
Supervisor-VM-01   zone-alpha
Supervisor-VM-02   zone-bravo
Supervisor-VM-03   zone-charlie
```
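The balanced layout check can also be automated rather than read by eye. This is a minimal sketch under stated assumptions: `check_zone_balance` is a hypothetical helper that reads the two-column `NAME ZONE` output shown above on stdin and treats the layout as balanced when the expected number of zones is present and every zone holds the same number of nodes.

```bash
# check_zone_balance <expected_zone_count>   (hypothetical helper; reads NAME/ZONE rows on stdin)
check_zone_balance() {
  awk -v want="$1" '
    NR == 1 && $1 == "NAME" { next }   # skip the header row
    NF >= 2 { count[$2]++ }            # tally nodes per zone
    END {
      zones = 0; min = -1; max = 0
      for (z in count) {
        zones++
        if (min < 0 || count[z] < min) min = count[z]
        if (count[z] > max) max = count[z]
      }
      if (min < 0) min = 0
      ok = (zones == want && min == max)
      printf "%s: zones=%d min_per_zone=%d max_per_zone=%d\n", (ok ? "BALANCED" : "UNBALANCED"), zones, min, max
      exit (ok ? 0 : 1)
    }'
}

# Example usage:
#   kubectl get nodes --selector=node-role.kubernetes.io/control-plane \
#     -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone' \
#     | check_zone_balance 3
```

Nodes without a zone label show up as `<none>` in the custom-columns output and are counted as their own zone, which makes a missing label surface as an unbalanced result. The same helper applies to the guest cluster worker check in Phase 3.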


### 2.2 Etcd Commit Latency Performance Check

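etcd write latency largely dictates multi-zone control plane stability. Query the Supervisor API server metrics endpoint to evaluate Write-Ahead Log (WAL) write duration:

```bash
kubectl get --raw /metrics | grep etcd_disk_wal_write_duration_seconds_bucket
```

If the grep returns nothing, the metric may not be exposed through this endpoint in your build. The API server typically publishes client-side histograms such as `etcd_request_duration_seconds`, while etcd's own WAL histogram is commonly named `etcd_disk_wal_fsync_duration_seconds` and is served from the etcd metrics endpoint.

* **Failure Threshold**: If 99th percentile WAL write operations spike above 10ms, the storage or inter-zone network path is likely unstable, putting the control plane quorum at high risk of repeated leader elections and lost quorum.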
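Prometheus histograms expose cumulative bucket counts rather than a ready-made percentile, so the 99th percentile has to be derived from the `_bucket` lines. The sketch below is an illustration only: `histogram_p99_upper_bound` is a hypothetical helper, and it assumes a single series for the metric with buckets listed in ascending `le` order and no trailing timestamps. It prints the smallest bucket boundary (in seconds) that contains at least 99% of the samples, which is an upper bound on the true p99.

```bash
# histogram_p99_upper_bound <metric_base_name>   (hypothetical helper; reads metrics text on stdin)
histogram_p99_upper_bound() {
  awk -v metric="$1_bucket" '
    index($0, metric "{") == 1 {
      # Pull the bucket boundary out of the le="..." label.
      if (!match($0, /le="[^"]+"/)) next
      le = substr($0, RSTART + 4, RLENGTH - 5)
      n++; bound[n] = le; cum[n] = $NF + 0
      if (le == "+Inf") total = $NF + 0   # the +Inf bucket holds the total sample count
    }
    END {
      if (total == 0) { print "no samples"; exit 2 }
      for (i = 1; i <= n; i++)
        if (cum[i] >= 0.99 * total) { print bound[i]; exit 0 }
    }'
}

# Example usage, with the metric name taken from the command above:
#   kubectl get --raw /metrics | histogram_p99_upper_bound etcd_disk_wal_write_duration_seconds
```

A printed boundary of `0.008` or lower means the 99th percentile sits under the 10ms threshold; `0.016` or higher means it may exceed it.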


🛠️ Phase 3: Guest Cluster Multi-Zone Balance Testing


### 3.1 Declarative Cluster Topology Verification

Deploy a multi-zone VKS guest cluster manifest using the Local Consumption Interface (LCI) or direct `kubectl` CLI tools.


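```yaml
apiVersion: cluster.x-k8s.io/v1beta1   # assumption: confirm the API version for your VKS release
kind: Cluster
metadata:
  name: vks-multizone-cluster
  namespace: vks-testing-zone
spec:
  topology:
    class: tanzukubernetescluster
    version: v1.30.x   # placeholder: substitute a full release version available in your environment
    controlPlane:
      replicas: 3
    workers:
      # Assumption: one node pool per zone; pool names are illustrative and the
      # zone names must match the zones bound to the vSphere Namespace.
      # Required class variables (such as VM class and storage class) are omitted here.
      machineDeployments:
      - {class: node-pool, name: pool-alpha, replicas: 1, failureDomain: zone-alpha}
      - {class: node-pool, name: pool-bravo, replicas: 1, failureDomain: zone-bravo}
      - {class: node-pool, name: pool-charlie, replicas: 1, failureDomain: zone-charlie}
```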


Verify worker node balance across failure boundaries once status switches to `Ready`:


```bash
# Quote the column spec so the shell keeps the escaped dots in the label key
kubectl get nodes -o custom-columns='NAME:.metadata.name,ZONE:.metadata.labels.topology\.kubernetes\.io/zone'
```


* **Verification Check**: Ensure exactly one worker node instance settles onto each of `zone-alpha`, `zone-bravo`, and `zone-charlie`.


⚡ Phase 4: Kubernetes Pod-to-Pod Latency Testing


This phase provides programmatic validation of the Container Network Interface (CNI) datapath overlay spanning across separate physical zones.


### 4.1 Deploying the Cross-Zone Latency Probe Pods

Apply the following manifest to pin two network benchmark pods to explicit infrastructure zones (`zone-alpha` and `zone-bravo`) using `nodeAffinity`.


```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-latency-server-zone-a
  namespace: vks-testing-zone
spec:
  replicas: 1
  selector:
    matchLabels:
      app: latency-server
  template:
    metadata:
      labels:
        app: latency-server
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-alpha
      containers:
      - name: network-tool
        image: amir20/iperf3:latest
        command: ["iperf3", "-s"]
        ports:
        - containerPort: 5201
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pod-latency-client-zone-b
  namespace: vks-testing-zone
spec:
  replicas: 1
  selector:
    matchLabels:
      app: latency-client
  template:
    metadata:
      labels:
        app: latency-client
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
            - matchExpressions:
              - key: topology.kubernetes.io/zone
                operator: In
                values:
                - zone-bravo
      containers:
      - name: network-tool
        image: amir20/iperf3:latest
        command: ["sleep", "3600"]
```
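The manifest above covers a single zone pair (`zone-alpha` to `zone-bravo`). In a three-zone design every pair is a separate network path, so it may be worth repeating the probe for each one. As a small sketch, the hypothetical helper `zone_pairs` below lists every unordered pair from the zone names passed to it, which can drive a loop that re-applies the manifest with different `values:` entries.

```bash
# zone_pairs <zone> <zone> ...   (hypothetical helper)
# Prints each unordered pair of zones once, one pair per line.
zone_pairs() {
  while [ "$#" -gt 1 ]; do
    first="$1"
    shift
    for other in "$@"; do
      printf '%s %s\n' "$first" "$other"
    done
  done
}

# Example usage:
#   zone_pairs zone-alpha zone-bravo zone-charlie | while read -r server_zone client_zone; do
#     echo "probe: server in $server_zone, client in $client_zone"
#   done
```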


### 4.2 Executing Inter-Pod Network Performance Tests

Once both pods report a status of `Running`, retrieve the internal Pod IP of the server instance and execute the evaluation checks from the client pod in the neighboring zone.


```bash
# Get the server pod IP (first pod matching the label)
SERVER_POD_IP=$(kubectl get pod -l app=latency-server -n vks-testing-zone -o jsonpath='{.items[0].status.podIP}')

# Test cross-zone pod network jitter and packet loss via UDP
kubectl exec -it deployment/pod-latency-client-zone-b -n vks-testing-zone \
  -- iperf3 -c "$SERVER_POD_IP" -u -b 10M -t 30 --json
```


#### Verification Criteria (UDP Execution)

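* **Average Target Latency (RTT)**: Under 5ms within standard data centers. The iperf3 UDP test reports jitter and packet loss rather than RTT, so read this value from the `ping` test in this section.
* **Jitter Performance Variance**: Must remain below 0.8ms. Higher jitter points to congestion or pinning errors within the NSX/VDS virtual switch overlay.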


```bash
# Test cross-zone pod round-trip times (RTT) via ICMP echo
kubectl exec -it deployment/pod-latency-client-zone-b -n vks-testing-zone \
  -- ping -c 50 "$SERVER_POD_IP"
```


* **Success Metric**: Zero (0%) packet loss across 50 echo requests. Standard deviation (`mdev`) must be under 1.0ms.
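Not every ping implementation prints `mdev` in its summary (BusyBox-based images and `vmkping` typically report only `min/avg/max`). In that case the deviation can be derived from the per-packet `time=` values. The sketch below uses a hypothetical helper, `rtt_stddev`, and computes the population standard deviation, sqrt(mean(x²) − mean(x)²), over the samples it finds on stdin.

```bash
# rtt_stddev   (hypothetical helper; reads ping output on stdin)
# Prints the sample count, mean RTT, and standard deviation in milliseconds.
rtt_stddev() {
  awk '
    match($0, /time=[0-9.]+/) {
      t = substr($0, RSTART + 5, RLENGTH - 5) + 0
      n++; sum += t; sumsq += t * t
    }
    END {
      if (n == 0) { print "no samples"; exit 2 }
      mean = sum / n
      var = sumsq / n - mean * mean
      if (var < 0) var = 0   # guard against tiny negative values from rounding
      printf "samples=%d avg=%.3f mdev=%.3f\n", n, mean, sqrt(var)
    }'
}

# Example usage:
#   kubectl exec deployment/pod-latency-client-zone-b -n vks-testing-zone \
#     -- ping -c 50 "$SERVER_POD_IP" | rtt_stddev
```

A `samples` value below 50 indicates lost replies, which already fails the zero-loss criterion.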


🔋 Phase 5: Workload Core Capabilities & Storage Latency Checks


### 5.1 Topology-Aware Persistent Volume (CNS) Mounting

Deploy a stateful manifest requiring independent storage volume tracking across zones:


```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: storage-latency-validation-set
  namespace: vks-testing-zone
spec:
  serviceName: "zonal-storage-test"
  replicas: 3
  selector:
    matchLabels:
      app: database-mock
  template:
    metadata:
      labels:
        app: database-mock
    spec:
      containers:
      - name: db-node
        image: nginx:alpine
        volumeMounts:
        - name: data-vol
          mountPath: /usr/share/nginx/html
  volumeClaimTemplates:
  - metadata:
      name: data-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "vks-multizone-storage-policy"
      resources:
        requests:
          storage: 5Gi
```


Execute verification steps to trace persistent storage attachment latency:


```bash
# Step 1: Record pod deletion and reschedule timestamp
date +"%T.%N" && kubectl delete pod storage-latency-validation-set-0 -n vks-testing-zone

# Step 2: Track cross-zone volume migration and remount loops
```


* **Target Metric**: The volume unbind, cross-zone cluster configuration sync, and host remount loop must fully complete in under 30 seconds.
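One way to measure the target above is to poll a readiness command and report how long it took to succeed. This is a sketch only, not the author's tracking step: `time_until` is a hypothetical helper that retries any command once per second until it succeeds or the timeout expires.

```bash
# time_until <timeout_seconds> <command...>   (hypothetical helper)
# Re-runs the command every second and reports the elapsed whole seconds.
time_until() {
  timeout_s="$1"
  shift
  start="$(date +%s)"
  until "$@" >/dev/null 2>&1; do
    now="$(date +%s)"
    if [ $(( now - start )) -ge "$timeout_s" ]; then
      echo "TIMEOUT after ${timeout_s}s"
      return 1
    fi
    sleep 1
  done
  echo "ready after $(( $(date +%s) - start ))s"
}

# Example usage, run right after the pod deletion in Step 1:
#   time_until 30 kubectl wait --for=condition=Ready \
#     pod/storage-latency-validation-set-0 -n vks-testing-zone --timeout=1s
```

The resolution is one second, which is coarse but sufficient for a 30-second budget. The same helper can wrap any other readiness check with a time limit.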


### 5.2 Avi Layer 4 / Layer 7 Virtual Service Automation Sync

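Expose the StatefulSet pods through a standard Kubernetes LoadBalancer service. `kubectl expose` does not accept a StatefulSet as its target, so the command below exposes the first pod and overrides the selector to match every replica:

```bash
kubectl expose pod storage-latency-validation-set-0 --type=LoadBalancer \
  --name=avi-loadbalancer-check --port=80 \
  --selector=app=database-mock -n vks-testing-zone
```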

* **Verification Check**: Confirm that the Avi Kubernetes Operator (AKO) captures the configuration change, instructs the Avi Controller via API, maps a Virtual Service onto the Service Engines, and switches the `EXTERNAL-IP` status field out of `<pending>` in under 45 seconds.


💥 Phase 6: Disaster & Resiliency Failover Simulations


### 6.1 Simulated Zone Failure Outage

Simulate a catastrophic facility outage by disconnecting the network for every host in one zone. The checks below assume `zone-charlie` is the zone taken offline.


#### Verification Execution Sequence during Outage:

1. Control Plane Quorum Reachability: Run `kubectl cluster-info`. The API endpoint must respond immediately via the Avi Supervisor VIP because the surviving two control plane VMs retain cluster quorum.

2. Stateless Workload Eviction: Run `kubectl get pods -A -o wide -w`. Verify that stateless pods running on the lost zone nodes are flagged, terminated, and safely rescheduled onto `zone-alpha` and `zone-bravo` worker pools.

3. Avi Real-Time Rerouting: Use an external client to run continuous curl checks:

   ```bash

   while true; do curl -s -o /dev/null -w "%{http_code}\n" http://<AVI_VIP_IP>; sleep 0.5; done

   ```

   **Expected Result**: Avi Service Engines must detect the backend failures and remove dead container destinations within 10 seconds, maintaining user uptime.
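To turn the continuous curl check into a number that can be compared with the 10-second target, the status codes can be logged one per line and scanned for the longest run of failures. The sketch below assumes one HTTP status code per line (curl prints `000` when the connection fails) and a fixed polling interval; `longest_outage` is a hypothetical helper name. Because the time spent inside each request is ignored, the result is a lower bound on the real outage.

```bash
# longest_outage <poll_interval_seconds>   (hypothetical helper; reads one status code per line)
# Reports the longest consecutive run of non-200 responses, converted to seconds.
longest_outage() {
  awk -v step="$1" '
    {
      if ($1 == "200") run = 0
      else if (++run > worst) worst = run
    }
    END { printf "longest_outage=%.1fs\n", worst * step }'
}

# Example usage, after capturing the codes to a file during the outage:
#   while true; do curl -s -o /dev/null -m 1 -w '%{http_code}\n' http://<AVI_VIP_IP>; sleep 0.5; done > codes.log
#   longest_outage 0.5 < codes.log
```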


📋 Phase 7: Post-Deployment Verification Matrix Checklist


### 🔳 SECTION A: HARDWARE & KERNEL VERIFICATION

- [X] Inter-Zone RTT Check: Matrix ping tests between host zone clusters confirm round-trip times remain stable under 10ms.

- [X] Frame Transit Consistency: MTU-enforced configurations (`-s 8972`) pass through cross-zone switches without dropping or fragmenting packets.

- [X] Interface Droppage Pass: Uplink analytics reports confirm zero buffer packet discards under high storage loads.


### 🔳 SECTION B: KUBERNETES POD-LEVEL LATENCY CHECK

- [X] CNI Overlay Performance: Pod-to-pod cross-zone ping metrics reflect an average RTT under 5ms.

- [X] Inter-Pod Jitter Variance: Network diagnostic telemetry returns a cross-zone UDP packet jitter under 0.8ms.

- [X] Zero Fabric Loss: Under network-stressed conditions, cross-zone container data pathways register 0% packet loss.


### 🔳 SECTION C: CONTROL PLANE & CLUSTER LAUNCH CHECKS

- [X] Supervisor Balance: System logs confirm exactly 1 Supervisor Control VM is located inside each independent physical zone.

- [X] Etcd Write Tolerance: Write-Ahead Logging latency metrics track strictly under the 10ms safety barrier.

- [X] Worker Node Equalization: Guest cluster node allocations balance completely across all three defined fault domains.


### 🔳 SECTION D: AUTOMATION INTEGRATION & HIGH AVAILABILITY

- [X] AKO Log Compliance: Operator pods register clear configurations, returning no persistent REST API sync failures.

- [X] Dynamic VIP Allocation: Service resources of type `LoadBalancer` automatically receive an IP from the Avi network allocation block within 45 seconds.

- [X] Cross-Zone Storage Migration: Forcefully terminating stateful instances shows persistent storage unbinding and attaching to a new zone in under 30 seconds.

- [X] Outage Self-Healing Uptime: Simulating a total zone disaster proves that the remaining control plane VMs maintain cluster quorum and that ingress network traffic is rerouted in under 10 seconds.



© 2021 SEVENLOGIC.IO
