LLMs in the SOC - Why Benchmarks Fail Security Op Teams

Sentinel Labs has provided a keen look into LLMs and SOC operations.  For security teams, AI promised to write secure code, identify and patch vulnerabilities, and take over monotonous security operations tasks.  Its key value proposition was raising costs for adversaries while lowering them for defenders.

To evaluate whether Large Language Models (LLMs) were both sufficiently performant and reliable to be deployed in the enterprise, a wave of new benchmarks was created.  In 2023, these early benchmarks largely consisted of multiple-choice exams on clean text, yielding clean, reproducible performance metrics.  However, as the models improved, they outperformed the early tests: scores across models began to converge at the top of the scale as the benchmarks became increasingly “saturated”, and the tests themselves no longer provided meaningful insights.[1]

As the industry has expanded over the past few years, benchmarking has become a means of distinguishing new models from older ones.  Building a benchmark on which a smaller model appears to outperform a larger one released by a frontier AI lab is now a billion-dollar industry, and every new model launches with a menagerie of charts and bold claims: +3.7 on SomeBench-v2, SOTA on ObscureQA-XL, or 99th percentile on an-exam-no-one-had-heard-of-last-week.  The subtext here is simple: look at the bold numbers, be impressed, and please join our seed round!

In this swamp of scores and claims, security teams are somehow expected to conclude that a system is safe enough to trust with an organization’s business, its users, and even its critical infrastructure.  However, a careful review of the arXiv benchmark firehose reveals a hard-to-miss pattern: we have more benchmarks than ever, yet we are still not measuring what actually matters for defenders.

So what do security benchmarks actually measure? How well does this approach align with real security work?

In this report, Sentinel researchers review four popular LLM benchmarking evaluations: Microsoft’s ExCyTIn-Bench, Meta’s CyberSOCEval and CyberSecEval 3, and Rochester Institute of Technology’s CTIBench.  The researchers explore what they think these benchmarks get right and where they believe the benchmarks fall short.

What Current Benchmarks Actually Measure

ExCyTIn-Bench | Realistic Logs in a Microsoft Snow Globe - ExCyTIn-Bench was the cleanest example of an “agentic” security operations benchmark that we reviewed.  It drops LLM agents into a MySQL instance that mirrors a realistic Microsoft Azure tenant.  The environment provides 57 Sentinel-style tables, 8 distinct multi-stage attacks, and a unified log stream spanning 44 days of activity.  Each question posed to the LLM agent is anchored to a path in an incident graph, which means the agent must discover the schema, issue SQL queries, pivot across entities, and eventually answer the question.  Rewards for the agent are path-aware: full credit is assigned for the correct final answer, and the agent can also earn partial credit for each correct intermediate step.
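
To make the path-aware scoring concrete, here is a minimal sketch of the idea under our own assumptions; the function, weights, and entity names are illustrative and not Microsoft’s implementation.

```python
# Illustrative sketch of path-aware scoring (not Microsoft's code): the agent
# earns partial credit for each ground-truth pivot it touches on the incident
# graph path, plus additional credit only if its final answer is correct.
def path_aware_reward(agent_steps, ground_truth_path, final_answer, correct_answer,
                      step_weight=0.5, answer_weight=0.5):
    """Return a reward in [0, 1] for one investigation question."""
    if not ground_truth_path:
        return answer_weight if final_answer == correct_answer else 0.0
    hits = sum(1 for node in ground_truth_path if node in agent_steps)
    step_score = step_weight * (hits / len(ground_truth_path))
    answer_score = answer_weight if final_answer == correct_answer else 0.0
    return step_score + answer_score

# Example: the agent found two of three intermediate entities but missed the final answer.
print(path_aware_reward(
    agent_steps={"user: jdoe", "host: web-01"},
    ground_truth_path=["user: jdoe", "host: web-01", "ip: 10.0.0.7"],
    final_answer="10.0.0.9",
    correct_answer="10.0.0.7",
))  # ~0.33
```

A reward structure like this also explains why average scores in the 0.2-0.4 range can hide a large gap between "made useful progress" and "actually solved the incident."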

The headline result is telling: “Our comprehensive experiments with different models confirm the difficulty of the task: with the base setting, the average reward across all evaluated models is 0.249, and the best achieved is 0.368…” (arXiv)

Microsoft’s ExCyTIn benchmark demonstrates that LLMs struggle to plan multi-hop investigations over realistic, heterogeneous logs.

This is an important finding – especially for those who are concerned with how LLMs work in real-world scenarios.  Moreover, all of this takes place in a Microsoft snow globe: one fictional Azure tenant, eight well-studied, canned attacks, clean tables, and curated detection logic for the agent to work with.  Although the realistic agent setup is a significant improvement over trivia-style Multiple Choice Question (MCQ) benchmarks, it does not reflect the daily chaos of real security operations.

CyberSOCEval | Defender Tasks Turned into Exams - CyberSOCEval is part of Meta’s CyberSecEval 4 and deliberately picks two tasks defenders care about: malware analysis over real sandbox detonation logs and threat intelligence reasoning over 45 CTI reports.  The authors open with a statement we very much agree with: “This lack of informed evaluation has significant implications for both AI developers and those seeking to apply LLMs to SOC automation.  Without a clear understanding of how LLMs perform in real-world security scenarios, AI system developers lack a north star to guide their development efforts, and users are left without a reliable way to select the most effective models.” (arXiv)

To evaluate these tasks, the benchmark frames them as multi-answer multiple-choice questions and incorporates analytically computed random baselines and confidence intervals.  This setup gives clean, statistically grounded comparisons between models, but it also reduces complex workflows to simplified questions.  The researchers found that models perform well above chance, yet the tasks remain far from solved.

In the malware analysis task, models achieve exact-match accuracy in the teens to high-20s range, compared with a random baseline of around 0.63%.  For threat-intel reasoning, models achieve ~43-53% accuracy, compared with ~1.7% at random.  In other words, the models are clearly extracting meaningful signal from real logs and CTI reports.  However, they also fail to answer most malware questions and roughly half of the threat intelligence questions correctly.
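
To see why those random baselines are so small, here is a rough Monte Carlo sketch of the exact-match rate a random guesser achieves on a multi-answer question; the option count and answer-set size are our own assumptions, not CyberSOCEval’s actual format.

```python
import random

# Monte Carlo estimate of the exact-match random baseline for a multi-answer
# multiple-choice question (the option count and correct-answer set are
# illustrative assumptions, not CyberSOCEval's actual question format).
def random_exact_match_rate(n_options=10, n_correct=3, trials=200_000, seed=0):
    rng = random.Random(seed)
    options = list(range(n_options))
    correct = set(options[:n_correct])
    hits = 0
    for _ in range(trials):
        # A random guesser includes each option independently with probability 0.5.
        guess = {o for o in options if rng.random() < 0.5}
        hits += guess == correct
    return hits / trials

print(f"{random_exact_match_rate():.3%}")  # ~0.098%, i.e. roughly 1 / 2**10
```

Exact-match scoring over answer sets is brutally strict, which is why even accuracy in the teens represents a real signal above chance.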

These findings suggest that, for any system automating SOC workflows, model performance should be evaluated as assistive rather than autonomous.  Crucially, they find that test-time “reasoning” models don’t get the same uplift they see in math/coding: “We also find that reasoning models leveraging test time scaling do not achieve the boost they do in areas like coding and math, suggesting that these models have not been trained to reason about cybersecurity analysis…” (arXiv)

That’s a big deal, and it's evidence that you don’t get generalized security reasoning for free just by cranking up “thinking steps.” 

Meta’s CyberSOCEval falls short because it compresses two complex domains into MCQ exams.  There is no notion of triaging multiple alerts, asking follow-up questions, or hunting down log sources.  In real life, analysts need to decide when to stop, escalate, or switch paths.  Ultimately, while CyberSOCEval is a clean, statistically sound probe of model performance on a set of highly specific sub-tasks, it is far from a representation of how SOC workflows should be modeled.

CTIBench | CTI as a Certification Exam - CTIBench is a benchmark task suite introduced by researchers at Rochester Institute of Technology to evaluate how well LLMs operate in the field of Cyber Threat Intelligence.  Unlike general-purpose benchmarks, which focus on high-level domain knowledge, CTIBench grounds tasks in the practical workflows of information security analysts.  Like the other benchmarks we examined, however, it frames this evaluation as an MCQ exam.

“While existing benchmarks provide general evaluations of LLMs, there are no benchmarks that address the practical and applied aspects of CTI-specific tasks.” (NeurIPS Papers)

CTIBench draws on well-known security standards and real-world threat reports, then turns them into five kinds of tasks:

  • answering basic multiple-choice questions about threat intelligence knowledge
  • mapping software vulnerabilities to their underlying weaknesses
  • estimating how serious a vulnerability is
  • pulling out the specific attacker techniques described in a report
  • guessing which threat group or malware family is responsible

The data is mostly from 2024, so it’s newer than what most models were trained on, and each task is graded with a simple “how close is this to the expert answer?” style score that fits the kind of prediction being made.

On paper, this appears close to the work CTI teams care about: mapping vulnerabilities to weaknesses, assigning severity, mapping behaviors to techniques, and linking reports to actors.  In practice, however, the way those tasks are operationalized keeps the benchmark within the scope of a certification exam.  Each task is cast as a single-shot question with a fixed ground-truth label, and the model is asked to answer it in isolation using a zero-shot prompt.  There is no notion of long-running cases, heterogeneous and conflicting evidence, evolving intelligence, or the need to cross-check and revise hypotheses over time.

CTIBench is, in the end, yet another MCQ suite: an excellent exam if you want to know, “Can this model answer CTI exam questions and do basic mapping/annotation?”  It says less about whether an LLM can do the messy work that actually creates value: normalizing overlapping feeds, enriching and deduplicating entities in a shared knowledge graph, negotiating severity and investment decisions with stakeholders, or challenging threat attributions that don’t align with an organization’s historical data.

CyberSecEval 3 | Policy Framing Without Operational Closure - CyberSecEval 3, also from Meta, is not a SOC benchmark so much as a risk map.  The authors carve the space into eight risks, grouped into two buckets: harms to third parties, i.e., offensive capabilities, and harms to application developers and end users, such as misuse, vulnerabilities, or data leakage.  The framework for this evaluation is the current regulatory conversation between governments and standards bodies about unacceptable model risk, so the suite is understandably organized around “where could this go wrong?” rather than “how much better does this make my security operations?”

The benchmark’s coverage aligns closely with the concerns of policymakers and safety organizations.  On the offensive side, CyberSecEval 3 evaluates automated spear-phishing against LLM-simulated victims, uplift for human attackers solving Hack-The-Box-style CTF challenges, fully autonomous offensive operations in a small cyber range, and synthetic exploit generation on toy programs and CTF snippets.  On the application side, it probes prompt injection, insecure code generation in both autocomplete and instruction modes, abuse of attached code interpreters, and the model’s willingness to help with cyberattacks mapped to ATT&CK stages.

The findings across these areas are very broad.  Llama 3 is described as capable of “moderately persuasive” spear-phishing, roughly on par with other SOTA models when judged against simulated victims.  In the CTF study, Llama 3 405B yields a noticeable increase in completed phases and slightly faster progress among novice participants, but the authors note that the effect is not statistically significant.

The fully autonomous agent can handle basic reconnaissance in the lab environment but fails to achieve reliable exploitation or persistence.  On the application-risk side, all tested models generate insecure code at non-trivial rates, prompt injection succeeds a significant fraction of the time, and models sometimes execute malicious code or assist with cyberattacks.  Meta stresses that its guardrails reduce these risks on the benchmark’s own distributions.

CyberSecEval 3 may have real value for those working in policy and governance, but none of the eight risks is defined in terms of operational metrics such as detection coverage, time to triage, containment, or vulnerability closure rates.  The CTF experiment comes closest to demonstrating real-world value, but it remains an artificial one-hour lab on preselected targets.  Moreover, that experiment is expensive and does not scale.

There are glimmers of operational relevance in the paper, and CyberSecEval 3 remains a strong contribution to AI security understanding and governance, but it is a weak instrument for deciding whether to deploy a model as a copilot for live operations.

Benchmarks are Measuring Tasks, not Workflows - All of these benchmarks share a common blind spot: they treat security as a collection of isolated questions rather than as an ongoing workflow.

Real teams work through queues of alerts, pivot between partially related incidents, and coordinate across seniority levels.  They make judgment calls under time pressure and incomplete telemetry.  Closing out a single alert or scoring 90% on a multiple-choice test is not the goal of a security team.  The goal is to reduce the underlying risk to the business, and this means knowing the right questions to ask in the first place.

ExCyTIn-Bench comes closest to acknowledging this reality.  Agents interact with an environment over multiple turns and earn rewards for intermediate progress.  Yet even here, the fundamental unit of evaluation is still a question: “What is the correct answer to this prompt?”  The system is not asked to “run this incident to ground” or evaluate different environments or logging sources that may be included in an incident response.  CyberSOCEval and CTIBench compress even richer workflows into a single, multiple-choice interaction.

Methodologically, this means none of these benchmarks measures the outcomes that define security performance.  Metrics such as time-to-detect, time-to-contain, and mean time to remediate are absent.  The benchmarks measure how models behave when the critical context has already been carefully prepared and provided to them, not how they perform when dropped into a live incident where they must decide what to look at, what to ignore, and when to ask for help.
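
For contrast, these are the kinds of outcome metrics a workflow-level evaluation would have to report.  Below is a minimal sketch, using made-up incident timestamps, of computing mean time to detect and mean time to remediate.

```python
from datetime import datetime
from statistics import mean

# Outcome metrics absent from current benchmarks, computed over hypothetical
# incident records with made-up timestamps.
incidents = [
    {"compromised": "2025-03-01T02:10", "detected": "2025-03-01T06:40", "remediated": "2025-03-02T11:00"},
    {"compromised": "2025-03-05T14:00", "detected": "2025-03-05T14:55", "remediated": "2025-03-05T20:30"},
]

def hours_between(start, end):
    fmt = "%Y-%m-%dT%H:%M"
    return (datetime.strptime(end, fmt) - datetime.strptime(start, fmt)).total_seconds() / 3600

mttd = mean(hours_between(i["compromised"], i["detected"]) for i in incidents)  # mean time to detect
mttr = mean(hours_between(i["detected"], i["remediated"]) for i in incidents)   # mean time to remediate
print(f"MTTD: {mttd:.1f}h, MTTR: {mttr:.1f}h")
```

Until a benchmark can show movement in numbers like these, claims of operational uplift remain untested.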

Until we are ready to benchmark at the workflow level, we should understand that high accuracy on multiple-choice security questions and smooth reward curves are not stand-ins for operational uplift.  In information security, the bar must be higher than passing an exam.

MCQs and Static QA are Overused Crutches - Multiple-choice questions are attractive for understandable reasons.  They are easy to score at scale.  They support clean random baselines and confidence intervals, and they fit nicely into leaderboards and slide decks.

The downside is that this format implicitly embeds assumptions that do not hold in practice.  For any given scenario, the benchmark assumes someone has already asked the right question.  There is no space to challenge the premise of that question, to reframe the problem, or to build and revise a plan.  All relevant evidence has already been selected and prepackaged for the analyst.  In that setting, the model’s role is to compress and restate context, not to determine what to investigate or how to prioritize effort. Wrong or partially correct answers carry no real cost.

This is the inverse of real SOC and CTI work: the hardest part is deciding which questions to ask, which data to pull, and what to ignore.  That judgment is usually acquired through years of experience or deliberate training. If we want to know whether models will actually help in our workflows, we need evaluations in which asking for more data incurs a cost, ignoring critical signals is penalized, and “I don’t know, let me check” is a legitimate and sometimes optimal response.
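
As a thought experiment, here is a hedged sketch of what such a cost-aware score could look like; the weights and field names are arbitrary illustrations, not an existing benchmark’s metric.

```python
# Illustrative cost-aware scoring for an investigation episode (weights are
# arbitrary; this is a sketch of the idea, not an existing benchmark's metric).
def episode_score(correct, deferred, queries_issued, critical_signals, signals_examined,
                  query_cost=0.02, miss_penalty=0.25, defer_value=0.3):
    if deferred:
        score = defer_value               # "I don't know, let me check" is a valid, mid-value outcome
    else:
        score = 1.0 if correct else 0.0   # confident answers are scored on correctness
    score -= query_cost * queries_issued  # every extra data pull costs something
    missed = len(set(critical_signals) - set(signals_examined))
    score -= miss_penalty * missed        # ignoring critical signals is penalized
    return max(score, -1.0)

# A wrong-but-confident answer that ignored a critical alert scores worse than deferring.
print(episode_score(correct=False, deferred=False, queries_issued=4,
                    critical_signals=["edr_alert_17"], signals_examined=[]))                 # -0.33
print(episode_score(correct=False, deferred=True, queries_issued=4,
                    critical_signals=["edr_alert_17"], signals_examined=["edr_alert_17"]))   # 0.22
```

The point of the sketch is the shape of the incentive, not the numbers: wrong answers and ignored evidence must cost something, and deferral must sometimes be the optimal play.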

Statistical Hygiene is Still Uneven - To their credit, some of these efforts take statistics seriously.  CyberSOCEval reports confidence intervals and uses bootstrap analysis to reason about power and minimum detectable effect sizes.  CTIBench distinguishes between pre- and post-cutoff datasets and examines performance drift. CyberSecEval 3 employs survival analysis and appropriate hypothesis tests in its human-subject CTF study to demonstrate an unexpected lack of statistically significant benefit from an LLM copilot.

Across the board, however, there are still gaps.  Many results come from single-seed, temperature-zero runs with no variance reported. ExCyTIn-Bench, for instance, reports an average reward of 0.249 and a best of 0.368, but provides no confidence intervals or sensitivity analysis.  Contamination is rarely addressed systematically, even though all four benchmarks draw on well-known corpora that are likely to overlap with model training data.  Heavy dependence on a single LLM judge, often from the same vendor as the model being evaluated, compounds these issues.

The consequence is that headline numbers can look precise while being fragile under small changes in prompts, sampling parameters, or judge models.  If we expect these benchmarks to inform real governance and deployment decisions, variance, contamination checks, and judge robustness should be baseline, check-box requirements.
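
As a baseline for what “reporting variance” means in practice, here is a short sketch of a bootstrap confidence interval over per-run scores from several seeds; the per-seed scores below are invented for illustration.

```python
import random
from statistics import mean

# Bootstrap confidence interval over per-seed benchmark scores (scores invented).
def bootstrap_ci(scores, n_resamples=10_000, alpha=0.05, seed=0):
    rng = random.Random(seed)
    means = sorted(
        mean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return mean(scores), (lo, hi)

# Five runs of the same model with different seeds; if the interval is wide,
# a single-seed headline number would be misleading.
scores = [0.249, 0.231, 0.268, 0.242, 0.257]
point, (lo, hi) = bootstrap_ci(scores)
print(f"mean {point:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

Reporting an interval like this costs a few extra runs and immediately tells readers whether a +0.01 "improvement" is signal or noise.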

Using LLMs to Evaluate LLMs Is Everywhere, and Rarely Questioned - Every benchmark reviewed relies on LLMs somewhere in the evaluation loop, either to generate questions or to score answers.  ExCyTIn uses models to turn incident graphs into Q&A pairs and to grade free-form responses, falling back to deterministic checks only in constrained cases.  CyberSOCEval uses Llama models in its question-generation pipeline before shifting to algorithmic scoring.  CTIBench relies on GPT-4-class models to produce CTI multiple-choice questions.  CyberSecEval 3 uses LLM judges to rate phishing persuasiveness and other behaviors.

CyberSecEval 3 is a standout here.  It calibrates its phishing judge against human raters and reports a strong correlation, which is a step in the right direction.  But overall, these judges are being treated as if they were neutral ground truth.  In many cases, the judge is supplied by the same vendor whose models are being evaluated, and the judging prompts and criteria are public.  That makes the benchmarks simple to overfit: once you know how the judge “thinks,” it is trivial to tune a model or prompting strategy to please it.
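
A minimal sketch of the kind of calibration check that should be routine: compare an LLM judge’s scores against human ratings for the same items and report a rank correlation.  The ratings below are fabricated, and a production version would handle tied scores with average ranks.

```python
from statistics import mean

# Spearman rank correlation between an LLM judge and human raters on the same
# items (ratings are fabricated; tie handling is omitted for brevity).
def ranks(values):
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    for rank, i in enumerate(order, start=1):
        r[i] = float(rank)
    return r

def spearman(x, y):
    rx, ry = ranks(x), ranks(y)
    mx, my = mean(rx), mean(ry)
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

llm_judge = [4, 2, 5, 3, 1]   # persuasiveness scores from the judge model
humans    = [5, 2, 4, 3, 1]   # mean human ratings for the same items
print(f"Spearman rho: {spearman(llm_judge, humans):.2f}")  # 0.90
```

Running the same check across judges from different vendors would go a long way toward showing that a benchmark's scores are not artifacts of one model family's preferences.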

That being said, “LLM as a judge” remains incredibly popular across the field.  It is cheap, fast, and feels objective.  It is not the worst setup, but if we do not actively interrogate and diversify these judges, comparing them against humans and against each other, then over time we risk baking the biases and blind spots of a few dominant models into the evaluation layer itself.  That is a poor foundation for any serious claims about security performance.

Technical Gaps - Even when the evaluation methodology is thoughtful, structural factors in today’s benchmarks diverge from real SOC environments.

Single-Tenant, Single-Vendor Worlds: ExCyTIn presents a well-designed Azure-style environment, but it remains a single, fictional tenant with a curated set of attacks and detection rules.  It tells us how models behave in a world with clean logging and eight known attack chains, but not in a hybrid AWS/Azure/on-prem estate where sensors are misconfigured, and detection logic is uneven.

CyberSOCEval’s malware logs and CTI corpora are similarly narrow.  They represent security artifacts cleanly without the messy mix of SIEM indices, ticketing systems, internal wikis, email threads, and chat logs that working defenders navigate daily.  If the goal is to augment those people, current benchmarks barely capture their environment.  If the goal is to replace them, the gap widens further.

Static Text Instead of Living Tools and Data - CTIBench and CyberSOCEval are fundamentally static.  PDFs are flattened into text, JSON logs are frozen into MCQ contexts, CVEs and CWEs are snapshots from public databases.  That is reasonable for early-stage evaluation, but it omits the dynamics that are most important in real-world operations.

Analysts spend their time in a world of internal middleware consoles, vendor platforms, and collaboration tools.  Threat actors shift infrastructure mid-campaign or opportunistically piggyback on others’ infrastructure.  New intelligence arrives mid-triage, often from sources uncovered during the investigation.  In that sense, a well-run tabletop or red–blue exercise is closer to reality than a static question bank.  Benchmarks that do not account for time, change, and feedback will always underestimate the difficulty of the work.

Multimodality Is Still Underdeveloped: CyberSOCEval offers an impressive approach to multimodality, comparing text-only, image-only, and combined modes on CTI reports and malware artifacts.  One uncomfortable takeaway is that text-only models often outperform image- or text+image-based pipelines, and images matter primarily when they contain information not available in text.  In practice, analysts rarely hinge a response on a single graph or screenshot.

At the same time, current “multimodal” models remain uneven in their ability to reason over screenshots, tables, and diagrams with the same fluency as they do on clean prose.  If we want to understand how much help an LLM will be at the console, we need benchmarks that isolate and stress those capabilities directly, rather than treating multimodality as a side note.

Modeling Limitations - Ironically, the very benchmarks that miss real-world workflows still reveal a great deal about where today’s models fall short.

General Reasoning is Not Security Reasoning: CyberSOCEval’s abstract states explicitly that “reasoning” models with extended test-time thinking do not achieve their usual gains on malware and CTI tasks.  ExCyTIn shows a similar pattern: models that excel on mathematical and coding benchmarks struggle when asked to plan coherent sequences of SQL queries across dozens of tables and multi-stage attack graphs.

In other words, we mostly have capable general-purpose models that know a lot of security trivia.  That is not the same as being able to reason like an analyst.  On the plus side, the benchmarks indicate what is needed next: security-specific fine-tuning and chain-of-thought traces; exposure to real log schemas and CTI artifacts during training; and objective functions that reward good investigative trajectories, not just correct final answers.

Poor Calibration on Scores and Severities: CTIBench’s CVSS task (CTI-VSP) is especially revealing in this regard.  Models are asked to infer CVSS v3 base vectors from CVE descriptions, and performance is measured with mean absolute deviation from ground-truth scores.  The results show systematic misjudgments of severity, not just random noise, which is one of the benchmark’s most important findings.
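
To make the metric concrete, here is a minimal sketch of mean absolute deviation between model-predicted and ground-truth CVSS base scores; the score pairs are fabricated and not drawn from CTIBench.

```python
from statistics import mean

# Mean absolute deviation between model-predicted and ground-truth CVSS v3
# base scores (the score pairs are fabricated, not drawn from CTIBench).
predicted    = [9.8, 5.4, 7.5, 4.3, 8.8]
ground_truth = [7.5, 5.3, 9.8, 6.1, 8.8]

mad = mean(abs(p - g) for p, g in zip(predicted, ground_truth))
print(f"Mean absolute deviation: {mad:.2f} CVSS points")  # 1.30

# A model can look "close on average" while still flipping which CVEs get
# patched first, which is what matters for prioritization.
```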

Those errors are concerning for any organization that plans to use model-generated scores to drive patch prioritization or risk reporting.  More broadly, they highlight a recurring theme: models often sound confident even when poorly calibrated for risk.  Benchmarks that track only accuracy or top-1 match rates will fail to identify the risk of confident but incorrect recommendations, especially in environments where such recommendations can be gamed or exploited.

Conclusion - Today’s benchmarks present a clear step forward from generic NLP evaluations, but our findings reveal as much about what is missing as what is measured: LLMs struggle with multi-hop investigations even when given extended reasoning time, general LLM reasoning capabilities don’t transfer cleanly to security work, and evaluation methods that rely on vendor models to grade vendor models create obvious conflicts of interest.

Current benchmarks measure task performance in controlled settings, rather than the operational outcomes that matter to defenders: faster detection, shorter containment times, and better decisions under pressure.  No current benchmarks can tell a security team whether deploying an LLM-driven SOC or CTI system will actually improve their posture or simply add another tool to manage.

In Part 2 of this series, we’ll examine what a better generation of benchmarks should look like, digging into the methodologies, environments, and metrics required to evaluate whether LLMs are ready for security operations, not just security exams.

 

This article is shared at no charge for educational and informational purposes only.

Red Sky Alliance is a Cyber Threat Analysis and Intelligence Service organization.  We provide indicators of compromise information via a notification service (RedXray) or an analysis service (CTAC).  For questions, comments, or assistance, please contact the office directly at 1-844-492-7225 or feedback@redskyalliance.com    

Weekly Cyber Intelligence Briefings:
REDSHORTS - Weekly Cyber Intelligence Briefings
https://register.gotowebinar.com/register/5207428251321676122

 

[1] https://www.sentinelone.com/labs/llms-in-the-soc-part-1-why-benchmarks-fail-security-operations-teams/
