Overview

Operationally, a cluster is a collection of ESXi hosts. As a result, its basic counters for CPU, memory, disk and network are essentially the sum of its member hosts' counters.

vSphere Cluster

What makes a cluster more complex than the sum of its hosts are the various cluster-level features and configurations. I see vSphere Cluster as the smallest logical building block. From an operations management perspective, it’s basically a single computer. It’s a huge and complex machine, much more than just a group of ESXi hosts sharing a common network and storage.

Let’s start by looking at its 2 most basic features:

| Feature | Impact | Details |
|---|---|---|
| HA | Capacity | The various HA options complicate the usable capacity calculation. |
| HA | Availability | HA results in 2 metrics: actual availability and operational availability. An HA event requires VM availability to be verified, as application dependencies could be affected. The boot order needs to be kept up to date. An HA event needs to be reported and investigated. This typically requires log analysis to find the root cause. |
| HA | Configuration | ESXi hosts in the cluster should have identical hardware and software configuration. Customers typically have multiple clusters, and need them to be consistently configured. |
| HA | Inventory | Actual needs to match plan. Not only the amount, but also the movement and their status. |
| DRS | Performance | Degradation of vMotion stun time could give a clue to overall ESXi performance. vMotion may impact latency-sensitive applications. The rate of vMotion should be measured against expectation. |
| DRS | Configuration | Various DRS settings, such as automation level, should match the plan and standard. VM-level exceptions can get buried in a large environment. Customers typically have multiple clusters, and need them to be consistently configured. |
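
To illustrate why HA complicates usable capacity, here is a minimal sketch of two common ways to reserve failover capacity: tolerating a number of host failures, and reserving a percentage of cluster resources. The function names and the sample numbers are hypothetical, the hosts are assumed to be identical, and real HA admission control has more options and nuances than this arithmetic captures.

```python
def usable_capacity_host_failures(host_capacity, host_count, failures_to_tolerate):
    """Usable capacity when N identical hosts are held back for failover.

    Assumption: all hosts have the same capacity (any unit, e.g. GHz or GB).
    """
    if not 0 <= failures_to_tolerate < host_count:
        raise ValueError("failures_to_tolerate must be between 0 and host_count - 1")
    return host_capacity * (host_count - failures_to_tolerate)


def usable_capacity_percentage(total_capacity, reserved_pct):
    """Usable capacity when a percentage of the cluster total is reserved."""
    if not 0 <= reserved_pct < 100:
        raise ValueError("reserved_pct must be between 0 and 100 (exclusive)")
    return total_capacity * (1 - reserved_pct / 100)


# Hypothetical cluster: 8 hosts with 100 GHz of CPU each.
by_host = usable_capacity_host_failures(100.0, 8, failures_to_tolerate=1)
by_pct = usable_capacity_percentage(800.0, reserved_pct=25)
print(by_host, by_pct)
```

The same 800 GHz cluster yields different usable capacity depending on which option is chosen, so a capacity report has to know the HA setting of each cluster.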

You can see that the above complicates operations, especially in a very large environment with hundreds of clusters. If you add the following features on top, you further increase the complexity of your operations.

| Feature | Impact | Details |
|---|---|---|
| Affinity | Configuration | The affinity and anti-affinity settings should match the plan. In a large environment with hundreds of clusters, these can get buried and hence overlooked. |
| Resource Pool | Capacity | Shares, Limit and Reservation set at the resource pool level need to be compatible with those of its child VMs. A resource pool should not be a peer of a VM. |
| Resource Pool | Performance, Configuration | Complication from cascading resource pools. Need to ensure VMs are not siblings of a resource pool. |
| DPM | Capacity | DPM impacts capacity as it changes the total capacity. |
| DPM | Performance | DPM only considers the ESXi utilization metrics. It does not check the VM contention metrics. |
| DPM | Configuration | DPM settings need to match the plan. |
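
Several rows above boil down to "settings should match the plan". A minimal sketch of such a check follows, assuming the planned and actual settings have already been collected into plain dictionaries. The setting keys, values and cluster names are hypothetical and do not correspond to any specific API.

```python
def find_drift(plan, clusters):
    """Return, per cluster, the settings whose actual value differs from the plan.

    The result maps cluster name -> {setting: (expected, actual)}.
    Clusters that fully match the plan are omitted.
    """
    drift = {}
    for name, actual in clusters.items():
        diffs = {
            key: (expected, actual.get(key))
            for key, expected in plan.items()
            if actual.get(key) != expected
        }
        if diffs:
            drift[name] = diffs
    return drift


# Hypothetical standard and two hypothetical clusters.
plan = {"ha_enabled": True, "drs_automation": "fullyAutomated", "dpm_enabled": False}
clusters = {
    "cluster-a": {"ha_enabled": True, "drs_automation": "fullyAutomated", "dpm_enabled": False},
    "cluster-b": {"ha_enabled": True, "drs_automation": "manual", "dpm_enabled": False},
}

drift = find_drift(plan, clusters)
print(drift)
```

Reporting only the exceptions keeps the output short even with hundreds of clusters, which is where a deviating setting would otherwise get buried.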

The above covers the standard vSphere cluster. There are 2 other variants, which raise the operational complexity further.

| Feature | Impact | Details |
|---|---|---|
| Stretched Cluster | Configuration | The configuration of each site needs to be checked so that VMs always access local storage. |
| Stretched Cluster | Capacity | The utilization of the 2 physical sites may be intentionally unbalanced, because one acts as the primary site while the other acts as the DR site. |
| Stretched Cluster | Performance | Horse-shoe traffic between VMs on the same site. Traffic ping-pong between VMs on different sites. |
| Stretched Cluster | Availability | The whole purpose of a stretched cluster is that the sites protect one another. This should be tested at least once a year. |
| vSAN Cluster | All | vSAN impacts all aspects of operations management. It impacts Day 0, Day 1, and Day 2. |

In addition, there are complications simply because there are multiple members in the cluster. For example, is cluster utilization simply the average of all its hosts? What if the cluster is imbalanced? The imbalance will get buried in the average if the cluster has many hosts.

While a cluster focuses on compute, it is where VMs run and consume network and storage. This means network and storage counters must be considered as appropriate. If you’re using vSAN, then it’s mandatory.
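
The averaging problem can be shown with a small sketch. It assumes per-host utilization is already available as percentages. The host names and values are hypothetical, and the choice of spread and standard deviation as imbalance indicators is an illustration rather than a prescribed method.

```python
from statistics import mean, pstdev


def summarize_cluster(host_util_pct):
    """Summarize per-host utilization (in percent) for one cluster."""
    if not host_util_pct:
        raise ValueError("cluster has no hosts")
    values = list(host_util_pct.values())
    return {
        "average": mean(values),
        "highest": max(values),
        "lowest": min(values),
        "spread": max(values) - min(values),  # gap between busiest and idlest host
        "std_dev": pstdev(values),
    }


# Hypothetical 8-host cluster: seven hosts at 40%, one nearly saturated.
hosts = {f"esxi-{i:02d}": 40.0 for i in range(1, 8)}
hosts["esxi-08"] = 96.0

summary = summarize_cluster(hosts)
print(summary)
```

Here the average is 47%, which looks healthy, while one host sits at 96%. Reporting the highest value or the spread next to the average keeps that host visible.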

Base Metrics

vSphere Client displays only a basic set of metrics. They are grouped into 4 groups, as shown in the following screenshot:

[Screenshot: the 4 metric groups in vSphere Client]

Each group has a basic set of metrics. Here is the set for memory:

[Screenshot: memory metrics in vSphere Client]

The Cluster Services group provides only 3 metrics:

[Screenshot: Cluster Services metrics in vSphere Client]

VM Operations

vSphere Cluster, being the main object where VMs run, has a set of event metrics. They count the number of times an event, such as a VM deletion, happens. This provides insight into the dynamics of the environment.

[Screenshot: VM operations event metrics]

Take note that these metrics are cumulative: they count from the day the cluster was created. VCF Operations converts them into rates, and also makes them available at higher-level objects (Data Center, vCenter and vSphere World).
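
Converting a cumulative counter into a rate can be sketched as follows. This is only an illustration of the idea, not how VCF Operations implements it. The sample values are hypothetical, and the sketch assumes the counter never resets between samples.

```python
from datetime import datetime


def counter_to_rate(samples, per_seconds=3600):
    """Convert cumulative (timestamp, count) samples into rates.

    Returns one (timestamp, rate) pair per interval, where rate is the
    number of events per `per_seconds` (default: per hour).
    Assumption: samples are in time order and the counter never resets.
    """
    rates = []
    for (t0, c0), (t1, c1) in zip(samples, samples[1:]):
        elapsed = (t1 - t0).total_seconds()
        if elapsed <= 0:
            raise ValueError("samples must be in strictly increasing time order")
        rates.append((t1, (c1 - c0) * per_seconds / elapsed))
    return rates


# Hypothetical cumulative vMotion count sampled hourly.
samples = [
    (datetime(2024, 6, 3, 8, 0), 1200),
    (datetime(2024, 6, 3, 9, 0), 1204),
    (datetime(2024, 6, 3, 10, 0), 1204),
    (datetime(2024, 6, 3, 11, 0), 1219),
]

rates = counter_to_rate(samples)
print([rate for _, rate in rates])
```

The raw counter only ever grows, so it says little on its own. The per-interval rate is what shows that the third hour had far more vMotions than the first two.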

| Category | Metric Name | Description |
|---|---|---|
| Change of State | VM guest reboot count | Only a reboot. The underlying VM is not powered off. |
| Change of State | VM guest shutdown count | I think this triggers VM Power Off too. |
| Change of State | VM standby guest count | My guess is that this also powers off the VM. |
| Change of State | VM power off count | I think this is a direct, abrupt power off. It does not include a proper shutdown from the Guest OS. |
| Change of State | VM power on count | |
| Change of State | VM reset count | Power cycle. Different from a Guest OS restart, as the VM is momentarily powered off. |
| Change of State | VM suspend count | Deeper than Guest OS Standby. Is this like hibernate in Windows? |
| Change of Inventory | VM create count | All creation, be it from template, direct, or cloning. So this is the total amount. |
| Change of Inventory | VM clone count | Creation via cloning only. |
| Change of Inventory | VM template deploy count | Counted separately to distinguish them from VMs not deployed from a template. |
| Change of Inventory | VM reconfigure count | Log Insight tracks the actual changes. |
| Change of Inventory | VM register count | Added into the vSphere inventory. |
| Change of Inventory | VM unregister count | Take note that the VM files can still exist in the datastore and LUN. |
| Change of Inventory | VM delete count | All deletion, be it via API or UI. |
| Change of Location | vMotion count | Change of ESXi host only. |
| Change of Location | Storage Motion count | Change of datastore only. |
| Change of Location | VM host and datastore change count | Both change in one event. Powered-on VMs only. |
| Change of Location | VM datastore change count | Only for powered-off VMs. |
| Change of Location | VM host and datastore change count | Only for powered-off VMs. |
| Change of Location | VM host change count | Only for powered-off VMs. |

You certainly have some expectations about the dynamics of your environment. Does the reality match your expectations?

In a production environment, these numbers should be low. Some numbers, such as shutdowns, should also match the change requests and happen during the green zone. Some exceptions apply, such as a VDI design that includes a scheduled reboot on the weekend.
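
One way to compare reality against expectation is to check each event against the approved change windows. The sketch below assumes that the green zone can be expressed as a list of start and end timestamps, and that events are available as (VM name, timestamp) pairs. The VM names, timestamps and function name are all hypothetical.

```python
from datetime import datetime


def events_outside_windows(events, change_windows):
    """Return the events that fall outside every approved change window.

    events: list of (vm_name, timestamp) pairs.
    change_windows: list of (start, end) timestamp pairs, inclusive.
    """
    return [
        (vm, when)
        for vm, when in events
        if not any(start <= when <= end for start, end in change_windows)
    ]


# Hypothetical green zone: one overnight window.
change_windows = [(datetime(2024, 6, 8, 22, 0), datetime(2024, 6, 9, 4, 0))]

# Hypothetical shutdown events.
events = [
    ("vm-app-01", datetime(2024, 6, 8, 23, 15)),  # inside the window
    ("vm-db-02", datetime(2024, 6, 11, 14, 30)),  # outside any window
]

suspicious = events_outside_windows(events, change_windows)
print(suspicious)
```

Known exceptions, such as a scheduled weekend reboot in a VDI design, can be modelled as additional windows so they do not show up as noise.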
