What does “good” look like?

Underpinning a private cloud is effective and proactive management of performance, capacity, and cost within the infrastructure, backed up by SLAs that explicitly define what level of service your customers can expect.

SLAs are a sign of mature, cloud-like operations. The SLAs for the service you provide, which is likely the software-defined data center and its associated components, must be complete, correct, and accurate.

Complete means you have SLAs for performance, and potentially for compliance, not just for availability (the most common form of SLA in IT). Performance and compliance are crucial components of the overall service you are providing. There is limited value in ensuring a workload is available if its performance is so poor that the application running on it is unusable, or if a non-compliant environment leads to a security incident or a data breach.

Correct means the SLA is measured on each paying VM, not at the infrastructure level, because the ultimate measure of success is not the health of the infrastructure platform but the health of the workloads that run the applications. Correct also means you are using the right metrics to track the health and performance of your service.

Accurate means the metrics must be collected every 5 minutes. Longer intervals do not provide the granularity you need to catch problems. Shorter intervals add load on the infrastructure layers that collect, process, and store the additional data points.
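To make the three properties concrete, here is a minimal sketch of how a per-VM performance SLA could be evaluated from 5-minute samples. It is illustrative only: the `VmSamples` type, the function names, the metric, the threshold, and the target percentage are all assumptions made for this example, not values or APIs defined by this whitepaper or by any product.

```python
from __future__ import annotations

from dataclasses import dataclass

SAMPLE_INTERVAL_MINUTES = 5  # collection interval stated in the text
SAMPLES_PER_DAY = 24 * 60 // SAMPLE_INTERVAL_MINUTES  # 288 samples per VM per day


@dataclass
class VmSamples:
    """Hypothetical container: one metric value per 5-minute interval for one VM."""
    name: str
    values: list[float]


def sla_compliance_pct(vm: VmSamples, threshold: float) -> float:
    """Percentage of samples at or below the (assumed) performance threshold."""
    if not vm.values:
        raise ValueError(f"no samples collected for {vm.name}")
    within = sum(1 for value in vm.values if value <= threshold)
    return 100.0 * within / len(vm.values)


def vms_missing_sla(vms: list[VmSamples], threshold: float, target_pct: float) -> list[str]:
    """Names of VMs whose compliance falls below the target.

    The check is made per VM, not on an infrastructure-wide average,
    so one unhealthy workload cannot hide behind many healthy ones.
    """
    return [vm.name for vm in vms if sla_compliance_pct(vm, threshold) < target_pct]


if __name__ == "__main__":
    # Made-up sample data purely for illustration.
    fleet = [
        VmSamples("vm-a", [1.0, 2.0, 1.5, 2.5]),
        VmSamples("vm-b", [1.0, 9.0, 8.0, 2.0]),
    ]
    print(vms_missing_sla(fleet, threshold=5.0, target_pct=99.0))
```

Evaluating each VM separately reflects the "correct" property: averaged across the two sample VMs the metric might look acceptable, yet one of them spends half of its intervals above the threshold.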

To support your infrastructure and operations teams in ensuring the service meets the defined SLAs, you will have SLA Leading Indicators, which give you a forward-looking prediction of how a service is tracking towards its SLA. You will also have Key Performance Indicators (KPIs). Numerous metrics give valuable insight into the performance of a workload (or cluster) across the different resource types, and a KPI condenses them into a single score that makes performance health more visible and aids proactive troubleshooting.
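The two ideas above can be sketched in a few lines. This is one possible interpretation, not a prescribed formula: the linear scoring, the averaging of per-metric scores, the "breach budget" style of leading indicator, and every name and threshold below are assumptions chosen for illustration.

```python
import math


def metric_score(value: float, green: float, red: float) -> float:
    """Map a raw metric to 0..100 (lower raw values are assumed to be better).

    100 at or below the green threshold, 0 at or above the red threshold,
    and a linear scale in between.
    """
    if value <= green:
        return 100.0
    if value >= red:
        return 0.0
    return 100.0 * (red - value) / (red - green)


def kpi_score(readings: dict[str, float], thresholds: dict[str, tuple[float, float]]) -> float:
    """Condense several metrics into one score by averaging their 0..100 scores.

    Averaging is an assumption; taking the minimum would instead surface
    the worst resource type.
    """
    scores = [metric_score(readings[name], green, red)
              for name, (green, red) in thresholds.items()]
    return sum(scores) / len(scores)


def breach_budget_remaining(samples_in_period: int, target_pct: float,
                            breaches_so_far: int) -> int:
    """Leading indicator: how many more breaching samples the period can absorb.

    A negative result means the SLA target for the period is already missed.
    """
    allowed = math.floor(round(samples_in_period * (100.0 - target_pct) / 100.0, 6))
    return allowed - breaches_so_far


if __name__ == "__main__":
    # Hypothetical metric names and thresholds, for illustration only.
    thresholds = {"cpu_contention": (0.0, 10.0), "disk_latency_ms": (10.0, 30.0)}
    readings = {"cpu_contention": 5.0, "disk_latency_ms": 10.0}
    print(kpi_score(readings, thresholds))
    print(breach_budget_remaining(samples_in_period=1000, target_pct=99.0, breaches_so_far=4))
```

A shrinking breach budget early in the period is the kind of forward-looking signal a leading indicator is meant to give, while the KPI score lets an operator see at a glance which workloads deserve a closer look at the underlying metrics.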

This is a journey, and maturity in these areas must be built step by step. This whitepaper is written from the top down because that gives a better conceptual understanding of what we’re trying to achieve and provides context for the subsequent focus on some of the lower-level details. It’s a bit like showing someone a house: you start with the conceptual-level information, such as how big the house is and what rooms and features it has. However, also like a house, to embark upon the journey and build maturity in these areas, you need to start from the bottom up.