Recently on platform X, I shared a poll on the topic of capacity modeling in a Security Operations Center (SOC).
The results are mostly in, and frankly, they align with what I anticipated. Now, with 394 votes, we might question the statistical significance. But consider how often we encounter discussions emphasizing the need for solid operations management; it's not as often as you'd think. In today's post, I'll walk through each response, explain what it means and how it works, and note some implications of each.
Here’s the summary for those pressed for time: A simple capacity model does more than just show you what you can do now; it also helps you improve. It points out unnecessary tasks that can be cut, allowing your team to focus on more rewarding and important work. This leads to happier employees, better work output, and helps the whole organization move forward.
With that, let’s dive in.
What’s a capacity model?
A significant 60.5% of survey participants expressed uncertainty regarding the concept and benefits of a capacity model. Let's begin by addressing this gap in understanding.
Capacity modeling is a set of techniques that attempt to understand an organization's ability to use resources (people, technology, finances) to meet current and future demands.
Imagine you run a Help Desk and want to understand the current demand for new support tickets. How often do requests show up? What are the various request types, and how long do they take? Armed with this information, the next step is to evaluate whether your current staff, technology, and budget are sufficient to meet these needs. This is where a capacity model comes into play: it's designed to help you answer these critical operational questions.
The same applies to a SOC. How often do alerts show up in your alert queue? How long does it take to make the right decision? How often does an alert detect an incident? How long do incidents take to investigate and remediate? Now, given those demands, do you have enough people, tech, and finances to keep up? Or are you in a perpetual state of "just trying to keep up"?
In both our Help Desk and SOC scenarios, we're taking a deep dive into the system's present conditions. The core questions are: What is the volume of work arriving, and do we have the necessary resources to manage it effectively? Moreover, it's crucial to recognize that this methodology isn't just for assessing the status quo; it's equally valuable for enhancing it. We're looking to pinpoint the current system state and identify what enhancements in staffing, technology, and financial investments are required to elevate performance.
If you're part of or leading a SOC team, reflect on the past month's resource utilization. Did you operate over or under capacity? What's the forecast for next month's demands? Comment below.
Basic & rough model.
A modest 9.5% of survey participants indicated that they employ a basic, preliminary model for grasping the relationship between capacity and workload. In the following section, I intend to demonstrate that initiating this practice is straightforward and the benefits are substantial.
Understanding the utility of a capacity model is crucial, and now, let's sketch out a simple, initial model for our SOC. Remember, the reliability of our model is tightly linked to the validity of our assumptions.
We'll begin by conceptualizing the 'ideal' SOC analyst. To clarify, we're not imagining a superhuman with multitasking prowess; we're basing our model on a realistic profile. For our model we’ll assume that each analyst is expected to have an 8-hour workday and be available for 22 days in a typical month. However, they are realistically productive for only 70% of that time. This 'loading factor' helps us account for real-world elements like time off, daily meetings, 1-1s with their manager, and necessary breaks. To summarize, for every 8 hours, we’ll get 5.6 hours of work time (8 x 0.7 = 5.6).
Our model also requires the inclusion of service times for various types of alerts, mirroring the way different requests carry different time commitments. Think of our illustrative Help Desk scenario: resetting a password is typically a swift task, often just taking a few minutes, unlike the more time-intensive troubleshooting of an MDM policy issue. Similarly, in a SOC environment, the time taken to address alerts varies. For instance, high-priority alerts from Endpoint Detection and Response (EDR) tools might be triaged in minutes, whereas sifting through a suspicious login from your identity provider or SaaS applications could take significantly longer. For the sake of our foundational model, we'll work with these estimated service times to map out our capacity plan.
With a grasp of the profile for our ideal team member and rough estimates of service times, we can begin to compute the long-term average operational capacity of the SOC. Let's kick off by figuring out our weekly capacity in hours, guided by some foundational assumptions:
Here's the breakdown for our simplified model:
We have a team of 4 SOC analysts. Adjust this figure according to your team’s structure.
Each analyst is scheduled for an 8-hour workday.
To reflect the reality of the work environment, we're incorporating a productivity ratio of 70%, accounting for breaks, meetings, and other non-task-related activities.
The standard workweek is set at 5 days.
With these parameters, our weekly capacity can be calculated as follows:
Weekly capacity = 8 hours/day * 5 days/week * 70% productivity * 4 analysts, which equals 112 hours.
This gives us a baseline for understanding how much work our SOC team can handle over the course of a week.
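To make this concrete, here's a minimal sketch of that capacity calculation in Python. The numbers simply mirror the illustrative assumptions above; swap in your own team's figures.

```python
# Minimal sketch of the weekly capacity calculation.
# All values are the illustrative assumptions from above; adjust for your team.
HOURS_PER_DAY = 8
DAYS_PER_WEEK = 5
PRODUCTIVITY = 0.70   # the "loading factor": breaks, meetings, 1-1s, etc.
NUM_ANALYSTS = 4

weekly_capacity_hours = HOURS_PER_DAY * DAYS_PER_WEEK * PRODUCTIVITY * NUM_ANALYSTS
print(f"Weekly capacity: {weekly_capacity_hours:.0f} hours")  # 112 hours
```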
To evaluate our SOC's workload, let's categorize and quantify the types of tasks they manage.
For illustrative purposes, we'll consider:
The team triaging 100 alerts.
Out of these, 20 alerts were escalated for in-depth investigation.
From these investigations, 3 evolved into confirmed security incidents.
In our initial model, we'll apply estimated average service times to each task category: alert triage takes 15 minutes, in-depth investigations take an hour (60 minutes), and security incidents take 3 hours (180 minutes).
[For a more refined approach, it's possible to measure actual service times. For example, in past roles, leveraging a SIEM’s API allowed us to track when an alert was generated and when it was resolved. By calculating the interval—usually in minutes—from the alert's creation to its closure, you get a precise service time.]
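[As a rough illustration of that measurement approach, here's a sketch that computes per-alert service times from creation and closure timestamps. The records and field names are hypothetical; every SIEM API exposes this data a little differently.]

```python
from datetime import datetime

# Hypothetical alert records pulled from a SIEM API; field names will vary by product.
alerts = [
    {"id": "A-1", "created": "2024-01-08T09:02:00", "closed": "2024-01-08T09:14:00"},
    {"id": "A-2", "created": "2024-01-08T09:30:00", "closed": "2024-01-08T10:45:00"},
]

def service_time_minutes(alert):
    """Minutes from alert creation to closure."""
    created = datetime.fromisoformat(alert["created"])
    closed = datetime.fromisoformat(alert["closed"])
    return (closed - created).total_seconds() / 60

times = [service_time_minutes(a) for a in alerts]
print(f"Average service time: {sum(times) / len(times):.1f} minutes")
```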
Given our estimated work times, we can determine the weekly workload for our SOC with the following calculation:
Weekly loading is the sum of the time taken for each task:
100 alerts, each requiring 15 minutes of triage.
20 alerts, each necessitating 60 minutes for further investigation.
3 incidents, with each demanding 180 minutes to manage.
Here's the math laid out:
Weekly loading = (100 alerts * 15 minutes) + (20 investigations * 60 minutes) + (3 incidents * 180 minutes) all divided by 60 to convert minutes into hours, which gives us a total of 54 hours of work for the week.
Upon dividing the weekly demand of 54 hours by the available weekly capacity of 112 hours, we find that the team is operating at 48% capacity. This means the team utilized less than half of their available time to manage the week's workload.
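Continuing the earlier sketch, here's how the weekly loading and utilization fall out in code; again, the volumes and service times are the illustrative estimates above, not measurements.

```python
# Weekly loading and utilization, using the illustrative volumes and service times above.
tasks = {
    # task: (count per week, average service time in minutes)
    "alert triage":  (100, 15),
    "investigation": (20, 60),
    "incident":      (3, 180),
}

weekly_loading_hours = sum(count * minutes for count, minutes in tasks.values()) / 60
weekly_capacity_hours = 112  # from the capacity calculation above

utilization = weekly_loading_hours / weekly_capacity_hours
print(f"Loading: {weekly_loading_hours:.0f} h, utilization: {utilization:.0%}")  # 54 h, 48%
```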
This evaluation stems from what's known as a "gross" model, which gives us an overarching view of the SOC’s long-term average state. Such a foundational model can be instrumental in informing staffing and budgetary strategies. However, it's important to remember that a capacity model represents the starting point for discussion, not the conclusion. The data should spur ongoing dialogue and iterative steps to refine and improve your analysis.
High-end predictive model.
In the previous section we built a basic model to show our SOC’s long-term average state. We now have a basic understanding of “long-term” (weekly, in this context) capacity versus loading. What this model doesn't reveal, though, is how capacity affects the handling of each individual alert. To gain insight into that, we'd need to employ a latency-based model that uses simulation to understand the time-sensitive dynamics of alert management.
There are various Python libraries that provide discrete event simulation capabilities, like SimPy or Ciw. There are also graphical process-modeling tools (like Simul8 or Simio) that let you simulate your process with various parameters to understand the impacts on the system. Walking through the intricacies of these tools is beyond the scope of this post, but consider taking a look at these libraries and applications, along with their documentation, to dive deeper.
The advantage of a high-end predictive model lies in its forecasting power, especially for understanding how changes within the system, such as an increase in staff, will affect the handling of individual alerts. Suppose you have a rough estimate that for every 100 employees added, one additional alert is generated per week. That assumption implies that as the workforce grows, demand on the system grows with it. By employing discrete event simulation, you can discern the effects of this increased demand, or "loading", on various aspects of alert management: how long an alert sits in the queue before being addressed (wait times), how long resolution takes (service times), and what happens to the backlog of work (queue length). You may find that if your organization were to double its staff, your alert wait times would increase by 50%! Your service times (how long it takes to review an alert) don't change, but because alerts now wait around longer, an attacker has more time to go undetected. That becomes the start of the conversation about how to use people, tech, and budget to make sure it doesn't happen. A model like this is a powerful tool: it lets you look into the future and, with reasonable certainty, understand how those future events will impact your system's performance.
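To give a feel for what such a model looks like, here's a minimal SimPy sketch that treats the SOC as an alert queue served by a pool of analysts. The arrival rate, service time, and team size are placeholder assumptions, not measurements; the point is the mechanic, not the numbers.

```python
import random
import simpy

# Minimal discrete-event sketch of a SOC alert queue (assumed parameters, not measurements).
NUM_ANALYSTS = 4
MEAN_INTERARRIVAL_MIN = 20   # one alert arriving every ~20 minutes on average
MEAN_SERVICE_MIN = 15        # average triage time per alert
SIM_MINUTES = 8 * 60 * 5     # one simulated work week
wait_times = []

def alert(env, analysts):
    arrived = env.now
    with analysts.request() as req:       # wait for a free analyst
        yield req
        wait_times.append(env.now - arrived)
        yield env.timeout(random.expovariate(1 / MEAN_SERVICE_MIN))  # triage the alert

def alert_source(env, analysts):
    while True:
        yield env.timeout(random.expovariate(1 / MEAN_INTERARRIVAL_MIN))
        env.process(alert(env, analysts))

env = simpy.Environment()
analysts = simpy.Resource(env, capacity=NUM_ANALYSTS)
env.process(alert_source(env, analysts))
env.run(until=SIM_MINUTES)

print(f"Alerts handled: {len(wait_times)}")
print(f"Average wait before triage: {sum(wait_times) / len(wait_times):.1f} minutes")
```

Re-running the same simulation with a higher arrival rate or a different analyst count is how you explore the "what happens if headcount doubles" question above: wait times and queue length respond even when per-alert service time stays flat.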
A noteworthy 5.4% of survey respondents use an advanced predictive model to forecast how different variables will influence their system's future performance. That's quite impressive. Are you leveraging third-party software, or have you developed a custom solution, perhaps with tools like Jupyter Notebooks? I'd love to hear more about your approach—please share your insights in the comments!
No plan. Just improv.
About a quarter of survey respondents admitted that instead of using a capacity model, they rely on improvisation. This suggests that while these teams might not have a formal model in place, they are likely monitoring certain metrics and their trends to informally assess whether their team is overburdened. While this informal approach can work initially, as your team and its responsibilities grow, adopting even a simple capacity model could be invaluable. The benefits of even the most basic capacity model are not just in understanding whether you have enough people, tech, and budget to keep up with the demand, but in the questions the model should provoke.
Consider this likely scenario: a SOC where half of the team's resources are allocated to addressing benign alerts. This scenario begs the question: "Is this the most efficient use of our time?" Such a question should naturally lead to further inquiries, such as "Are we spending time on alerts that could be prevented by other controls?" or "If our analysts are productive for 70% of their time, how can we optimize that time for the most value?" and "How frequently should analysts be detecting genuine threats?"
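To put rough numbers behind that first question, using the illustrative figures from the basic model: if we treat the 80 triaged alerts that never escalated as benign, they alone consume a meaningful share of the week. A quick sketch:

```python
# Rough illustration: share of weekly loading spent triaging alerts that never escalate,
# using the example volumes above (100 alerts triaged, 20 escalated, 15-minute triage).
benign_hours = (100 - 20) * 15 / 60      # 20 hours of triage on non-escalated alerts
share_of_loading = benign_hours / 54     # 54-hour weekly loading from the earlier model
print(f"{benign_hours:.0f} hours, {share_of_loading:.0%} of the week's loading")  # 20 hours, 37%
```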
These questions, and their subsequent answers, are crucial for driving change. SOC analysts thrive on learning and innovation. A basic capacity model is not just a measure of current capabilities; it's a catalyst for improvement. It helps in identifying and eliminating non-essential or redundant tasks, freeing up analysts to engage in more fulfilling and value-added work. This, in turn, contributes to higher job satisfaction, productivity, and overall progress for the organization.
Starting with improvisation isn't wrong, but it's clear that even the simplest of models can provide substantial benefits and be a stepping stone to greater efficiency and satisfaction within your team.
Parting Insights
There's a frequent focus on achieving "100% MITRE ATT&CK coverage" in the cybersecurity community, but discussions about the practical aspects—like the amount of time it takes for a team to sort through both real threats and false alarms—are less common. Yet, these aspects are crucial. Success isn't just about having comprehensive coverage; it's about the capability to respond effectively and efficiently when a threat is detected.
To put it simply, I'm not criticizing the MITRE ATT&CK framework. In fact, I believe it's a critical and very helpful resource for our field. What I'm trying to underline is that aiming for "100% coverage" without a strong foundation in operations management is like having a map without a compass—you won't be as prepared as you need to be when it's time to act. Operational savvy is essential for the efficacy of any SOC and the well-being of your analysts.