Introduction

VMware Cloud Foundation (VCF) is Broadcom’s software-defined data center, built upon a reference architecture that is used at scale by customers around the world, and wrapped with automated operations for deployment, scaling, and lifecycle management.

Broadcom’s objective with VCF 9 is to enable customers to be their own cloud provider. We want to help our customers to build and operate a private cloud, and to do so cost-effectively on-premises in their data center, or within a service provider or hyperscaler environment.

We don’t use terms like “private cloud” to capitalize on buzzwords. Describing something as a “cloud” implies a lot about how customers can consume the services available to build and power their applications. In the context of a private cloud, “cloud” also implies things about how the infrastructure team operates and manages performance, capacity, and cost to facilitate consumption and to provide a credible alternative to public cloud offerings.

Cloud computing is not about where you do computing; it’s about how you do computing. So, providing a private cloud to your business is about more than just having a virtualization platform to run VMs and containers. A private cloud depends upon frictionless, cloud-like consumption of infrastructure and services (databases, load balancers, object storage, and so on) to simplify the way your customers design and build their applications. Underpinning a private cloud is the effective and proactive management of the performance, capacity, and cost of the infrastructure and workloads, backed up by Service Level Agreements (SLAs) that explicitly define what level of service you are providing to your customers.

Included within VCF 9 is VCF Operations, an operations tool which forms the cloud management component of the software-defined data center. It provides customers with the tools they need to effectively operate their software-defined data center as a private cloud. However, the implementation of SLAs to cover availability and performance is not an out-of-the-box experience, as it requires tailoring to your organization’s policies and procedures.

This whitepaper aims to articulate the value of implementing service-oriented operating principles for your private cloud using SLAs. It outlines the supporting components that are required for, or support, the proactive management of your private cloud in accordance with the SLAs you define.

For those of you who have thought about, or perhaps tried to implement, Site Reliability Engineering (SRE) concepts within the IT part of your business, the principles we’re discussing in this whitepaper are foundational for that. There is little point providing SLAs and Service Level Objectives (SLOs) for your applications if they are running on infrastructure that provides no guarantees at all: that is just hoping for the best.

This whitepaper does not attempt to provide all the technical detail required to implement these principles within VCF Operations. Implementation can be technically complex, and there is no generic one-size-fits-all approach. We hope that the content in this whitepaper will get your operations-focused technical specialists excited about the benefits of implementing these principles and start them thinking about what such an implementation might look like for your specific business. VMware Professional Services can be engaged to help your operational teams move from concept to implementation.

The intended audience for this whitepaper is IT decision-makers who want a conceptual understanding of what service-oriented operations should look like for a private cloud on VCF. It explains the importance of implementing service-oriented operating principles that are based on proactive and effective management of performance, capacity, and cost, and that are wrapped with SLAs which explicitly define the service you are providing.

Do you need this whitepaper?

The following questions will help you gauge the maturity of your existing cloud operations approach and your maturity in providing a private cloud infrastructure to your business. If you can answer “yes” to any of these questions, we think you will find value in this whitepaper.

Do application teams and SRE teams blame you when things go wrong?

If this is the case, there is a high chance you are relying on complaints to drive your operations. The working assumption is that if there are no complaints, there are no problems. We call this “complaint-based operations”.

Some customers run their infrastructure via complaint-based operations because the operations team has no other means by which to measure success. They have not defined the acceptable performance of their infrastructure and have no benchmark for “good”. Solving this challenge is one of the goals of this whitepaper.

Does troubleshooting mean all hands on deck?

If a troubleshooting event means all hands on deck, that indicates that you don’t have the process or data required to triage a problem and engage specific specialists, so you engage everybody. Do you have a troubleshooting process that is followed by all teams (including network, storage, server, OS, application, etc.)? Does that process end with Root Cause Analysis (RCA)?

As part of RCA, do you set up alerts so the same issue can be detected faster if it happens again? Without an alert configured, the RCA is not complete.

Do Help Desk support tickets often require escalation?

If Help Desk simply passes issues through to the next level, you need to look at why.

Help Desk is your first line of defense. They do not go as technically deep as the specialists higher in the support framework. Equip them with simple dashboards so that they can handle complaints by answering the following questions:

Is the problem caused by the infrastructure not serving the VM well?
If yes, which part of the infrastructure? Is the problem at the CPU, memory, disk, or network layers?
If not, how can we prove this convincingly to the application owners?
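
The questions above can be sketched as a simple threshold check, of the kind a Help Desk dashboard might surface. This is an illustrative sketch only: the metric names and threshold values below are hypothetical, chosen for the example, and are not VCF Operations metric names or recommended values.

```python
# Hypothetical per-VM metrics and thresholds, for illustration only.
# These are NOT VCF Operations metric names or recommended values.
THRESHOLDS = {
    "cpu_ready_pct": 2.5,
    "memory_contention_pct": 1.0,
    "disk_latency_ms": 20.0,
    "network_dropped_pct": 0.5,
}

# Which infrastructure layer each hypothetical metric represents.
LAYER_BY_METRIC = {
    "cpu_ready_pct": "CPU",
    "memory_contention_pct": "memory",
    "disk_latency_ms": "disk",
    "network_dropped_pct": "network",
}


def triage(metrics):
    """Return the layers whose metric exceeds its threshold.

    Metrics missing from the input are treated as healthy (0.0).
    """
    return [
        LAYER_BY_METRIC[name]
        for name, limit in THRESHOLDS.items()
        if metrics.get(name, 0.0) > limit
    ]


def summarize(vm_name, metrics):
    """Produce a one-line answer a Help Desk agent could give."""
    layers = triage(metrics)
    if layers:
        return (f"{vm_name}: infrastructure is not serving the VM well "
                f"({', '.join(layers)})")
    return f"{vm_name}: all infrastructure metrics are within thresholds"


if __name__ == "__main__":
    print(summarize("vm-01", {"cpu_ready_pct": 5.0, "disk_latency_ms": 35.0}))
    print(summarize("vm-02", {"cpu_ready_pct": 1.0}))
```

An empty result supports the third question: if every layer is within its agreed threshold, the Help Desk has evidence that the infrastructure is serving the VM well.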

Is proving the cost effectiveness of private cloud a challenge?

The commoditization of infrastructure means your private cloud is being compared with public cloud platforms like Amazon Web Services (AWS), Microsoft Azure, and Google Cloud.

If your private cloud is not demonstrably cheaper and better, or if you cannot measure cost on a per-workload basis at all, the business may question the value of the private cloud. One of the primary reasons for running a private cloud, alongside privacy, compliance, and security, is cost-effectiveness. This cost equation must also include the cost of the staff and facilities required to operate the platform.
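
As a rough illustration of measuring cost on a per-workload basis, the sketch below splits a total monthly platform cost (including staff and facilities) across workloads in proportion to their allocated vCPUs and memory. The cost figures, the workload names, and the 50/50 weighting between CPU and memory are assumptions made for the example; a real pricing model would reflect your own cost drivers.

```python
def cost_per_workload(monthly_cost, workloads, cpu_weight=0.5):
    """Split a monthly platform cost across workloads.

    workloads maps a workload name to (vcpus, memory_gb).
    cpu_weight is the assumed fraction of cost attributed to CPU;
    the remainder is attributed to memory.
    """
    total_cpu = sum(cpu for cpu, _ in workloads.values())
    total_mem = sum(mem for _, mem in workloads.values())
    if total_cpu == 0 or total_mem == 0:
        raise ValueError("workloads must allocate some CPU and memory")

    costs = {}
    for name, (cpu, mem) in workloads.items():
        share = (cpu_weight * cpu / total_cpu
                 + (1 - cpu_weight) * mem / total_mem)
        costs[name] = round(monthly_cost * share, 2)
    return costs


if __name__ == "__main__":
    # Hypothetical monthly costs, including staff and facilities.
    platform = {"hardware": 60000, "licensing": 25000,
                "staff": 30000, "facilities": 5000}
    workloads = {"web": (4, 16), "db": (12, 48)}
    print(cost_per_workload(sum(platform.values()), workloads))
```

Allocation-based splitting is only one option; charging on measured utilization instead would reward efficient workloads differently.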

Do you worry about running out of capacity in the private cloud?

Your customers expect frictionless consumption of IT services to build and run their applications, so IT cannot act as a blocker in order to protect the capacity of the infrastructure. Public cloud, against which private cloud is often compared, is perceived as being able to scale endlessly. Public clouds benefit from economies of scale in this respect. A private cloud typically cannot rely on the same economies of scale, so an effective capacity management process is of paramount importance, particularly when considering the lead times for purchasing and provisioning new hardware. This whitepaper will help give you confidence around capacity by showing you how to understand the current capacity of your private cloud and how to forecast future consumption against current and future capacity.
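
To make the idea of forecasting consumption against capacity concrete, here is a minimal sketch that fits a straight-line trend to daily usage samples and estimates how many days remain before usage reaches capacity. It assumes linear growth and evenly spaced daily samples, which is a simplification; it is not a description of how VCF Operations forecasts capacity, and the function name and sample numbers are hypothetical.

```python
def days_until_full(daily_usage, capacity):
    """Estimate days from the last sample until usage reaches capacity.

    Fits an ordinary least-squares line to the samples (one per day).
    Returns None when the trend is flat or decreasing.
    """
    n = len(daily_usage)
    if n < 2:
        raise ValueError("need at least two samples to fit a trend")

    mean_x = (n - 1) / 2
    mean_y = sum(daily_usage) / n
    sxx = sum((x - mean_x) ** 2 for x in range(n))
    sxy = sum((x - mean_x) * (y - mean_y)
              for x, y in enumerate(daily_usage))

    slope = sxy / sxx
    if slope <= 0:
        return None  # usage is not growing; no exhaustion predicted

    intercept = mean_y - slope * mean_x
    # Day index at which the trend line crosses capacity.
    day_full = (capacity - intercept) / slope
    # Express it relative to the most recent sample; never negative.
    return max(0.0, day_full - (n - 1))


if __name__ == "__main__":
    usage_ghz = [100, 110, 120, 130, 140]  # hypothetical daily CPU demand
    print(days_until_full(usage_ghz, capacity=200))
```

Comparing the estimate with your hardware procurement lead time indicates whether an order needs to be placed now or can wait.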

Do you struggle with over-provisioned VMs?

This is an indicator that you are operating in a “system builder” function and not as a service provider. As a system builder, you are touching and customizing individual VMs. You size them and argue with the application teams, who are the customers/consumers of the infrastructure. As a result, you are constantly busy, because there are many applications and you are outnumbered.

If you are operating as a service provider of private cloud to the business, you should not be “in the way” of the business. You should be using an effective pricing model to drive the right behavior. Does a public cloud provider block customers from buying a 40-CPU VM when they only need 2 CPUs? Of course not.

This does not mean there is no value in “right-sizing” workloads and helping to drive efficient consumption by your internal customers. The tools described in this whitepaper can help you understand how to do that too.