IsvaraIsvara
The Guide
Part 1/SP5 Framework: People

SP5 Framework: People

There are many personas required to keep operations running well. Some are directly involved in the day-to-day operations, while others focus on the big picture, hence requiring a longer time frame. In small branches of a large organisation, the roles are played by the same few people, backing each other up. You can have 3 people doing everything with no structure, or 300 people with clear demarcation and formal hierarchy. Regardless, the jobs still need to be done, so document all the roles and responsibilities. Now you know why I make this document editable.

There is a many-to-many relationship between “jobs to be done” and persona. It reflects the dynamic nature of live operations. You have team members taking leave or away, hence you need someone else to step in on an ad-hoc basis.

A screenshot of a computer program AI-generated content may be incorrect.

Day to Day Roles

Level 1 Operators

Deal with the production environment. Perform a regular check on the overall environment. Use both insight and alerts.

Responsible for closing alerts. Alerts should be closed only when root cause is known, not when symptoms disappear. Closing alerts without knowing why they happened prevents lesson learned and can potentially backfire.

Perform simple troubleshooting, following SOP. Typically, these SOPs do not require reading logs.

SOP is ideally automated, taking input parameters, so the chance of human error is minimized if the number of manual steps or frequency is high.

Focus on Availability, Performance and Security.

Typically stationed at Network Operations Center room and called the Help Desk.

Platform

Activated when Level 1 is unable to solve the problem. For each problem solved, this role should update the troubleshooting guide so Level 1 can be empowered.

Focus on insight, not alerts. Look at the big picture and try to prevent alerts from happening. They focus on risk (e.g. configuration risk, compliance risk).

More senior than Level 1. May specialize is some areas (e.g. vSAN, networking). They document the knowledge base and develop SOP. They automate the SOP as much as possible.

Perform advanced troubleshooting, which often requires logs analyzis. Work with the Architecture Team.

Lead or involved in the evaluation of operations management tools. Design and maintain VCF Operations dashboards and alerts.

In larger organization, there can be more levels.

Operations Manager

Manage the level 1, level 2, level 3 operators.

Deal with tenants on SLA.

Other Roles

There are other roles that Ops Teams need to deal with. Here are some of them. In larger organisation, each could be a team on their own.

Security

In large organizations, this could be a separate department by itself. It could also report outside IT if the scope covers beyond computer system.

They work closely with the Compliance persona.

Compliance

Set the compliance settings to agreed internal and industry standard.

Verify that non-compliance alert was addressed timely and correctly by the operations team. Report & discuss the compliance status with upper management.

Focus on Risk (Configuration, Compliance).

Capacity

Plan the supply side of capacity, working with the architect role.

Plan the demand side of capacity, working with line of business or sales team.

Does not get involved in the day-to-day capacity. ESXi hosts going into maintenance mode is an operational problem, not capacity management matter.

IT FinanceTypically work with Capacity Team on purchasing.
IT Management

There can be multiple levels here, all the way to the CIO and CTO.

Look at the big picture, trends over time, and future. Not so much what’s going on hour-by-hour.

Weekly report, focusing on the overall health and not individual users or pools.

Monthly presentation and review, supported for live dashboard for an interactive discussion.

Generally, does not get involved in troubleshooting and architecture.

Primary focus is Compliance and Cost. Performance is not the focus, as that was likely promised to be “good” by the Architect as part of the design.

Architecture

Design the system architecture. Needs to get input from Day 2 team, and design with the end in mind.

Look into the future. Evaluate new technology and assess if technology refresh or migration to a new architecture makes business sense.

May lend support on complex troubleshooting.

In larger organizations:

  • The storage team is typically separate, unless it’s virtual such as vSAN.

  • OS-specific teams. They are responsible for setting the build standard for each OS.

NetworkThe network team is typically separate. This is due to the nature of networks.

Guess what role is missing?

Yes, Site Reliability Engineer (SRE).

The role focuses on automation. I see them as a variant of platform team, because in operations you should only automate what you can operate. Doing automation without mastery of operations have resulted in many DevOps turned into Dev Oops!

Exit Criteria | Entry Criteria

Among the roles and departments, there are handovers and boundaries. This is especially important for escalation. What’s the Exit Criteria and Entry Criteria?

A close-up of a sign Description automatically generated

Take for example the escalation process from Level 1 to Level 2 support.

L1 Exit Criteria defines the set of analyzis they have performed on the issue. If it’s not in their capability, they have the right to escalate to L2.

L1 Staff

Perform check following their SOP to ensure the Exit Criteria is satisfied.

The guide to analyze is provided by the respective L2 as part of the alert definition.

Add additional information that L2 needs, giving all the context an L2 personnel needs to perform their job. The context is part of the handoff process, meaning Level 2 will not accept the escalation if the necessary details are not provided

The list of information can be basic information, such as VM owner. It can also be a set of additional low-level counters or test results that the L1 Team ran but does not have the skills to analyze. It’s analogous to seeing a specialist doctor who asks you to bring your test results.

Since L1 is generalist, while L2 is specialist, the L1 staff must escalate to the right Level 2 group.
L2 StaffThe 2 pieces of information above becomes the Entry Criteria for L2.
L2 Team automatically accept the issue and now the ball is in
“It’s on my court”

Each team need provide the standard list of metrics, events, error that shows that the problem is on their court. These items are monitored by their respective team, and alerts are in-place.

The following list examples metrics and events for different team.

Application TeamNaturally, the actual metric and event are application-specific, and even version-specific.

There is general abnormal behaviour regardless of application. Examples:

  • Memory that grows over time and does not go down despite loads going down. This indicates memory leak, or application does not release the memory back to Linux or Windows.

  • CPU on a thread that runs 100% constantly. While high utilization is good, a perfectly flat line over minutes could indicate process run away.

Database Team
Windows TeamCPU: Context Switch, Run Queue
Linux TeamCPU: Context Switch, Run Queue
K8 TeamK8 Pod: CPU Throttle, Out of Memory event.
Virtualisation Team

For VMware vSphere VM:

  • VM CPU: Ready, Co-Stop, Overlap, Other Wati

  • VM Memory: Latency, Zipped, Swapped, Ballooned

  • VM Storage

  • VM Network:

  • VM System: vMotion down time

For Infrastructure :

  • ESXi Performance (%)

  • ESXi Performance Risk (%)

  • Cluster Performance (%)

  • Cluster Performance Risk (%)

Previous
SP5 Framework
Home
Next
SP5 Framework: Process