Overview

The Sentry benchmark is a small, qualitative readout for Warden’s security review behavior. It compares runs against known vulnerabilities from the public getsentry/sentry repository.

This is not an exhaustive eval and it is not a proof that Warden will catch every future issue. It is a way to compare implementations, prompts, models, and runtimes against the same historical security corpus.

What It Is

The corpus currently contains 86 validated vulnerabilities across 79 files and 6 historical Sentry commits. A benchmark run checks out each commit and scans only the files tied to known vulnerabilities at that commit.

That keeps the run focused. We are measuring whether Warden can recognize the same root causes, not whether it can discover unrelated issues across the whole Sentry repository.
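As a rough illustration of that scoping, the sketch below groups corpus entries by commit and derives the set of files to scan at each one. The field names `sha` and `file`, the placeholder SHAs, and the pairing of files to commits are assumptions made for the example, not the actual corpus schema or data.

```python
from collections import defaultdict

# Hypothetical corpus entries. Field names, SHAs, and file-to-commit pairing
# are illustrative assumptions, not the real corpus format.
corpus = [
    {"sha": "aaa111", "file": "src/sentry/api/endpoints/project_rules.py"},
    {"sha": "aaa111", "file": "src/sentry/api/endpoints/project_rules.py"},
    {"sha": "aaa111", "file": "src/sentry/replays/usecases/replay_counts.py"},
    {"sha": "bbb222", "file": "integrations/msteams/webhook.py"},
]


def scan_plan(entries):
    """Map each commit SHA to the sorted, de-duplicated files to scan there."""
    files_by_sha = defaultdict(set)
    for entry in entries:
        files_by_sha[entry["sha"]].add(entry["file"])
    return {sha: sorted(files) for sha, files in files_by_sha.items()}


plan = scan_plan(corpus)
for sha, files in plan.items():
    print(sha, files)
```

Two entries in the same file at the same commit produce one scan target, which is why the corpus can hold more vulnerabilities than files.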

Comparison Matrix

The score table is the headline. The cost and timing tables below it are operational context for understanding why two runs with similar scores may look very different to operate. This matrix only shows stable comparison runs with per-chunk timing metadata and no failed chunks; older incomplete or partial runs remain in the result data but are hidden here.

| Run                            | Known corpus  | Total findings | Recorded cost |
|--------------------------------|---------------|----------------|---------------|
| GPT 5.5 (Pi), high             | 41/86 (47.7%) | 72             | $148.63       |
| GPT 5.5 (Pi), low              | 28/86 (32.6%) | 38             | $39.36        |
| Claude Sonnet 4.6 (Pi)         | 25/86 (29.1%) | 32             | $19.84        |
| Claude Sonnet 4.6 (Claude SDK) | 24/86 (27.9%) | 32             | $103.59       |
| Claude Opus 4.6 (Pi), high     | 23/86 (26.7%) | 31             | $30.89        |
| Claude Opus 4.8 (Pi), high     | 21/86 (24.4%) | 24             | $21.31        |
| Claude Opus 4.8 (Pi), medium   | 18/86 (20.9%) | 19             | $14.50        |
| Claude Opus 4.8 (Claude SDK), high | 17/86 (19.8%) | 17         | $79.56        |
| Claude Opus 4.7 (Pi), medium   | 6/86 (7.0%)   | 7              | $4.39         |

Cost and Tokens

| Run                            | Recorded cost | Input tokens | Output tokens |
|--------------------------------|---------------|--------------|---------------|
| GPT 5.5 (Pi), high             | $148.63       | 127.9M       | 986.84k       |
| GPT 5.5 (Pi), low              | $39.36        | 18.71M       | 390.01k       |
| Claude Sonnet 4.6 (Pi)         | $19.84        | 9.67M        | 508.84k       |
| Claude Sonnet 4.6 (Claude SDK) | $103.59       | 65.67M       | 1.09M         |
| Claude Opus 4.6 (Pi), high     | $30.89        | 12.61M       | 501.01k       |
| Claude Opus 4.8 (Pi), high     | $21.31        | 6.52M        | 376.36k       |
| Claude Opus 4.8 (Pi), medium   | $14.50        | 4.62M        | 225.33k       |
| Claude Opus 4.8 (Claude SDK), high | $79.56    | 31.84M       | 386.17k       |
| Claude Opus 4.7 (Pi), medium   | $4.39         | 1.53M        | 20.77k        |

Timing

| Run                            | P50   | P90   | Total  |
|--------------------------------|-------|-------|--------|
| GPT 5.5 (Pi), high             | 3.0m  | 5.6m  | 163.9m |
| GPT 5.5 (Pi), low              | 34.2s | 56.4s | 55.2m  |
| Claude Sonnet 4.6 (Pi)         | 41.9s | 1.9m  | 53.6m  |
| Claude Sonnet 4.6 (Claude SDK) | 1.9m  | 26.6m | 448.4m |
| Claude Opus 4.6 (Pi), high     | 52.0s | 2.7m  | 75.5m  |
| Claude Opus 4.8 (Pi), high     | 20.9s | 1.1m  | 31.4m  |
| Claude Opus 4.8 (Pi), medium   | 11.9s | 51.7s | 42.4m  |
| Claude Opus 4.8 (Claude SDK), high | 21.6s | 1.1m | 34.3m |
| Claude Opus 4.7 (Pi), medium   | 1.2s  | 9.6s  | 6.6m   |

Reading Results

Known found is the useful number; the comparison matrix reports it in the Known corpus column. It counts corpus entries where an agent verified that Warden found the same bug in roughly the same location as an existing corpus finding. Exact wording, line numbers, and exploit framing can drift.

Scoring is a review judgment, not a deterministic formula. Same-file findings about different bugs do not count. One emitted finding can count for more than one corpus entry when it clearly covers multiple existing entries for the same bug, and duplicate emitted findings do not double-count the same corpus entry. Result JSON files with scores include the per-finding agent verification records used for that row.
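The counting rules can be sketched as a set union over verification records. The matching judgment itself is made by a reviewer and is not reproduced here; the record shape, the `matched_corpus_ids` field, and the IDs below are hypothetical and only show how one finding can cover several entries while duplicates do not double-count.

```python
# Hypothetical per-finding verification records. Each emitted finding lists
# the corpus entry IDs a reviewer judged it to match (empty list = no match).
records = [
    {"finding": "f1", "matched_corpus_ids": ["C-01"]},
    {"finding": "f2", "matched_corpus_ids": ["C-02", "C-03"]},  # one finding, two entries
    {"finding": "f3", "matched_corpus_ids": ["C-01"]},          # duplicate match, counted once
    {"finding": "f4", "matched_corpus_ids": []},                # same file, different bug
]


def known_found(verification_records):
    """Count distinct corpus entries matched by at least one emitted finding."""
    matched = set()
    for record in verification_records:
        matched.update(record["matched_corpus_ids"])
    return len(matched)


total_findings = len(records)
score = known_found(records)
print(f"known found: {score}, total findings: {total_findings}")
```

Because one finding can match several entries, the number of unmatched findings is not simply total findings minus known found.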

Benchmark runs use Warden’s post-analysis finding verifier unless the run explicitly opts out. That verifier is separate from benchmark scoring: it runs during Warden analysis to filter candidate findings, while scoring later checks whether each emitted finding semantically matches an existing corpus entry. Verifier calls add provider cost, and runs that produce more findings generally cost more because there is more verifier work to do.

Total findings is the amount of review output Warden produced before scoring. A higher number can be good if it finds more real vulnerabilities, but it also means more human review.

Recorded cost is the provider-reported cost persisted in the result metadata, not cost per finding. The cost table shows one recorded-cost column plus the persisted input and output token totals. Recorded cost can include Warden’s post-analysis verifier and other auxiliary model calls, but those calls may use auxiliary or synthesis models rather than the model being benchmarked. Because of that, auxiliary cost is operational context, not a useful comparison dimension for this matrix.

The raw JSONL logs are kept outside the docs until they have been reviewed for sensitive data. Current displayed runs scan the same 156 analysis chunks with zero failed chunks. Rows with failed chunks stay out of the stable matrix until they are rerun or explicitly recorded as partial.

When the raw artifacts preserve verifier usage, it is included under auxiliaryUsage.verification; some run shapes only persist the final total or per-chunk analysis usage.

For example, the Opus 4.6 high-effort Pi run persists only per-chunk analysis usage. It completed with no failed chunks, but its live CLI shard summaries showed approximately $38.43 total including auxiliary/post-processing work, while the persisted JSONL artifacts contain $30.89 of scan cost. Until the artifact format preserves that auxiliary usage exactly, the table uses the persisted JSONL cost and the run note records the gap. Treat recorded cost as operational accounting for a row, not normalized model pricing.

Treat duration as an operational measurement, not a stable model quality metric. P50 and P90 come from per-analysis-chunk durationMs records in the raw JSONL artifacts when those artifacts are available. Total is included as operational context, but it is noisy: it includes post-analysis work such as finding verification, and it moves with upstream provider latency, queueing, retries, and transient service reliability. That matters most when comparing the same model across different runtimes.
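A minimal sketch of how P50 and P90 could be derived from per-chunk durationMs records follows. The sample durations are made up, the one-field record shape is a simplification, and the nearest-rank percentile method is an assumption; the benchmark tooling may interpolate differently.

```python
import json
import math


def percentile(values, pct):
    """Nearest-rank percentile (assumed method, not confirmed by the benchmark)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up JSONL content standing in for a raw artifact: one record per chunk.
sample_ms = [1200, 900, 34000, 5600, 41000, 2500, 800, 1500, 60000, 3000]
jsonl = "\n".join(json.dumps({"durationMs": ms}) for ms in sample_ms)

durations = [
    json.loads(line)["durationMs"] for line in jsonl.splitlines() if line.strip()
]
p50 = percentile(durations, 50)
p90 = percentile(durations, 90)
print(f"P50 {p50 / 1000:.1f}s, P90 {p90 / 1000:.1f}s")
```

A few slow chunks pull P90 far above P50, which is the shape visible in the Sonnet 4.6 Claude SDK row.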

Treat cost the same way. It is useful for operating Warden, but it is not a normalized model-efficiency metric. Provider defaults, runtime defaults, explicit reasoning effort, cache behavior, output verbosity, retries, runtime accounting, and Warden’s finding verifier can all move the total even when the corpus and target files are identical.

Sonnet 4.6: Claude SDK vs Pi

The two Sonnet 4.6 rows are clean enough to compare directly. Both rows scan the same 156 analysis chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. Pi found 25 of 86 known corpus entries. The Claude SDK found 24 of 86. Both emitted 32 total findings.

The difference is cost and runtime behavior, not benchmark quality. The Claude SDK row records $103.59 total cost, including $61.61 for scan work. The Pi row records $19.84 total cost, including $11.20 for scan work. On scan work alone, Claude SDK cost is 5.5x Pi, input tokens are 6.34x Pi, output tokens are 2.44x Pi, cache reads are 5.72x Pi, and cache creation is 9.56x Pi.

Total cost does include Warden’s auxiliary post-processing work. That matters: Sonnet 4.6 verification cost $41.97 through the Claude SDK and $8.54 through Pi. But it is not the whole explanation. Removing verifier and merge work still leaves $61.61 of Claude SDK scan cost against $11.20 of Pi scan cost. The auxiliary gap has the same shape because verifier calls use the configured runtime unless a separate auxiliary model is set.
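The arithmetic behind that decomposition can be checked directly from the figures quoted above. In this sketch, the `remainder` field is simply total minus scan minus verification; reading it as merge and other auxiliary work is an assumption.

```python
# Sonnet 4.6 cost figures quoted in this section, in USD.
runs = {
    "Claude SDK": {"total": 103.59, "scan": 61.61, "verification": 41.97},
    "Pi": {"total": 19.84, "scan": 11.20, "verification": 8.54},
}

# What is left after scan and verification (assumed: merge and other auxiliary work).
for costs in runs.values():
    costs["remainder"] = round(
        costs["total"] - costs["scan"] - costs["verification"], 2
    )

sdk, pi = runs["Claude SDK"], runs["Pi"]
scan_ratio = sdk["scan"] / pi["scan"]
verification_ratio = sdk["verification"] / pi["verification"]
total_ratio = sdk["total"] / pi["total"]

print(f"scan {scan_ratio:.1f}x, verification {verification_ratio:.1f}x, "
      f"total {total_ratio:.1f}x")
print(f"remainder: SDK ${sdk['remainder']:.2f}, Pi ${pi['remainder']:.2f}")
```

The scan and verification multipliers land close together, which is what "the auxiliary gap has the same shape" refers to.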

Turns do not explain the whole gap. The stored trace summaries show 939 Claude SDK turns versus 628 Pi turns, a 1.5x increase. The larger multiplier is the amount of context the Claude SDK runtime carries through those turns. It reads and searches more, then repeats a larger conversation and tool-result context through later model calls.

Targeted child-span reruns of representative Sonnet 4.6 files show the same shape. Those reruns are diagnostic, not the scoring source of truth, and their sanitized summary is checked into the benchmark data. On src/sentry/replays/usecases/replay_counts.py, Claude SDK used 9 turns, 7 tool executions, 346.7k scan input tokens, and $0.55 scan cost. Pi used 3 turns, 2 tool executions, 19.8k scan input tokens, and $0.10 scan cost. On src/sentry/api/endpoints/project_rules.py, Claude SDK used 47 turns, 41 tool executions, 2.23M scan input tokens, and $1.87 scan cost. Pi used 18 turns, 15 tool executions, 176k scan input tokens, and $0.27 scan cost.

In the targeted rerun, the clearest chunk was project_rules.py:607-808. Claude SDK spent 28 turns and 27 tool executions there: 10 Read, 16 Grep, and 1 Glob. That single chunk cost $0.89 and consumed 1.39M scan input tokens. Pi handled the same chunk in one turn with no tools, 6.7k scan input tokens, and $0.01 scan cost.

The practical read is that Claude SDK explores more aggressively and carries more context through each step. Pi exits many clean chunks earlier. On this corpus, the extra Claude SDK exploration did not improve the Sonnet 4.6 score, but it did make the run materially more expensive.

Pi runs without an explicit Warden --effort use Pi’s default thinking level, which is currently medium.

Opus 4.8 High: Claude SDK vs Pi

The Opus 4.8 high-effort comparison now has a fresh traced pair. Both rows scan the same 156 analysis chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. Pi found 21 of 86 known corpus entries and emitted 24 total findings. The Claude SDK found 17 of 86 and emitted 17 total findings.

The cost gap is still large, but the trace shape is different from Sonnet 4.6. Claude SDK records $79.56 total cost, including $61.08 for scan work. Pi records $21.31 total cost, including $17.39 for scan work. On scan work alone, Claude SDK cost is 3.5x Pi, input tokens are 4.35x Pi, cache reads are 3.82x Pi, and cache creation is 6.10x Pi. Output tokens do not explain the gap: Pi actually emitted slightly more scan output tokens than Claude SDK.

The traces do not show Claude SDK doing more tool work. Claude SDK used 375 turns and 219 tool executions. Pi used 426 turns and 371 tool executions. Pi also produced more final findings. The difference is that each Claude SDK turn carried much more input context: about 60.0k scan input tokens per turn versus 12.1k for Pi.

No-finding chunks show the same pattern. Claude SDK no-finding chunks averaged 2.0 turns, 1.0 tool executions, 118.4k scan input tokens, and $0.35 scan cost. Pi no-finding chunks averaged 2.3 turns, 1.8 tool executions, 25.8k scan input tokens, and $0.09 scan cost. Finding chunks were similar on turns but not on context size: Claude SDK averaged 5.6 turns and $0.76 scan cost; Pi averaged 5.5 turns and $0.22.

Representative chunks make the point. On project_rules.py:607-808, both runtimes used one turn and no tools. Claude SDK used 48.4k scan input tokens and cost $0.18. Pi used 8.7k scan input tokens and cost $0.03. On replay_counts.py:1-202, both again used one turn and no tools. Claude SDK used 48.3k scan input tokens and cost $0.19. Pi used 8.7k scan input tokens and cost $0.04.

The heavier files do not reverse the conclusion. Across integrations/perforce/integration.py, Claude SDK used 18 turns, 14 tool executions, 1.43M scan input tokens, and $2.66 scan cost, producing one final finding. Pi used 28 turns, 30 tool executions, 460k scan input tokens, and $1.11 scan cost, producing two final findings. Across integrations/msteams/webhook.py, Claude SDK used 15 turns, 11 tool executions, 1.40M scan input tokens, and $3.13 scan cost, producing no final finding. Pi used 17 turns, 13 tool executions, 260k scan input tokens, and $0.85 scan cost, producing one final finding.
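Dividing scan input tokens by turns for those two files makes the per-turn context difference explicit. The sketch uses only the turn and token figures quoted above; the rounded token counts mean the per-turn results are approximate.

```python
# (turns, scan input tokens) per runtime, from the figures quoted above.
files = {
    "integrations/perforce/integration.py": {
        "Claude SDK": (18, 1_430_000),
        "Pi": (28, 460_000),
    },
    "integrations/msteams/webhook.py": {
        "Claude SDK": (15, 1_400_000),
        "Pi": (17, 260_000),
    },
}


def tokens_per_turn(turns, tokens):
    """Average scan input tokens carried by each turn."""
    return tokens / turns


summary = {}
for path, runtimes in files.items():
    sdk = tokens_per_turn(*runtimes["Claude SDK"])
    pi = tokens_per_turn(*runtimes["Pi"])
    summary[path] = {"sdk_per_turn": sdk, "pi_per_turn": pi, "ratio": sdk / pi}
    print(f"{path}: SDK {sdk / 1000:.1f}k/turn, Pi {pi / 1000:.1f}k/turn, "
          f"{sdk / pi:.1f}x")
```

On both files Pi takes more turns yet carries several times less input per turn, matching the run-level averages.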

The practical read is that Opus 4.8 on Pi is not cheaper because it skips more work. In this high-effort pair, Pi does more turns and more tool executions, but each turn carries a much smaller input/cache footprint. Claude SDK’s extra cost is mostly repeated context volume and verifier context volume, not additional tool fanout.

Why Opus 4.8 high scored below Sonnet 4.6

Opus 4.8 high is also cleanly comparable with the Sonnet 4.6 rows. All four runs scan the same 156 chunks, complete with zero failed chunks, use Warden’s finding verifier, and have agent-verified scoring. The lower score is not a coverage problem.

The main difference is recall. On Pi, Sonnet 4.6 found 25 of 86 known corpus entries and emitted 32 findings. Opus 4.8 high found 21 of 86 and emitted 24 findings. Through the Claude SDK, Sonnet 4.6 found 24 of 86 and emitted 32 findings. Opus 4.8 high found 17 of 86 and emitted 17 findings.

That is not because Opus 4.8 high was noisier. It emitted fewer findings that did not match the known corpus: 4 on Pi versus Sonnet’s 7, and 2 through the Claude SDK versus Sonnet’s 8. The tradeoff went the wrong way for this corpus: Opus 4.8 high produced a cleaner candidate set, but the set was too small.
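The tradeoff can be laid out side by side using the counts quoted in this section. In the sketch below, "recall" is known matches over the 86 corpus entries, and "unmatched share" is unmatched findings over total findings; both names are shorthand for this example rather than benchmark-defined metrics.

```python
CORPUS_SIZE = 86

# Counts quoted in this section: known matches, total findings, and findings
# that did not match the known corpus.
rows = {
    ("Pi", "Sonnet 4.6"): {"known": 25, "findings": 32, "unmatched": 7},
    ("Pi", "Opus 4.8 high"): {"known": 21, "findings": 24, "unmatched": 4},
    ("Claude SDK", "Sonnet 4.6"): {"known": 24, "findings": 32, "unmatched": 8},
    ("Claude SDK", "Opus 4.8 high"): {"known": 17, "findings": 17, "unmatched": 2},
}


def recall(row):
    """Share of the known corpus that the run recovered."""
    return row["known"] / CORPUS_SIZE


def unmatched_share(row):
    """Share of emitted findings that matched no corpus entry."""
    return row["unmatched"] / row["findings"]


for (runtime, model), row in rows.items():
    print(f"{runtime:10} {model:14} recall {recall(row):.1%}  "
          f"unmatched {unmatched_share(row):.1%}")
```

In both runtimes Opus 4.8 high has the lower unmatched share and the lower recall.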

The traces support that read. Opus 4.8 high used fewer turns and produced less scan output than Sonnet 4.6 in both runtimes. On Pi, Opus 4.8 high used 426 turns and 316k scan output tokens; Sonnet 4.6 used 628 turns and 365k scan output tokens. Through the Claude SDK, Opus 4.8 high used 375 turns and 308k scan output tokens; Sonnet 4.6 used 939 turns and 890k scan output tokens. The Claude SDK contrast is especially sharp: Opus 4.8 high ended 86 chunks in one turn, while Sonnet 4.6 did that on only 8 chunks.

The matched corpus IDs also do not point to one narrow vulnerability category. On Pi, Opus 4.8 high recovered 15 of Sonnet’s 25 known matches and found 6 different known issues. Through the Claude SDK, it recovered 10 of Sonnet’s 24 known matches and found 7 different known issues. Across both runtimes, Sonnet 4.6 found 31 unique known corpus entries, Opus 4.8 high found 23, and only 17 overlap. Sonnet-only coverage includes invite-token validation, release threshold project scoping, OAuth token lifetime and replay checks, webhook freshness checks, identity unlinking, and frontend URL handling. Opus-only coverage includes OAuth userinfo validation, replay delete scope, preprod size-analysis access, relocation retry state leakage, and GitHub Actions output injection.
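The cross-runtime overlap figures imply a few derived numbers by inclusion-exclusion. The sketch below works only from the three counts quoted above; the combined total is a derived value, not a reported benchmark result.

```python
CORPUS_SIZE = 86

# Unique known corpus entries found across both runtimes, as quoted above.
sonnet_unique = 31
opus_unique = 23
overlap = 17

sonnet_only = sonnet_unique - overlap
opus_only = opus_unique - overlap
combined = sonnet_unique + opus_unique - overlap  # inclusion-exclusion

print(f"Sonnet-only {sonnet_only}, Opus-only {opus_only}, "
      f"combined {combined}/{CORPUS_SIZE} ({combined / CORPUS_SIZE:.1%})")
```

The Opus-only entries are what keep the two models from being strictly ordered on this corpus.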

The best supported conclusion is that Opus 4.8 high is more selective under this Warden prompt and corpus. It scans every chunk, and it does not fail more often. It simply exits many investigations earlier and reports fewer candidate issues. That improves apparent precision but misses enough known vulnerabilities to trail Sonnet 4.6 on recall.

The older Opus 4.7 and default-medium Opus 4.8 Pi rows still have an unusual shape. They complete cleanly, but many no-finding chunks are very short. Treat that as historical context for default-runtime behavior, not as the current high-effort Opus 4.8 comparison.

Current Takeaway

In the current clean, agent-verified rows, GPT 5.5 on Pi found 41 of 86 with explicit high effort and 28 of 86 with explicit low effort. Sonnet 4.6 found 25 of 86 on Pi and 24 of 86 through the Claude SDK. Opus 4.6 on Pi with explicit high effort found 23 of 86. Opus 4.8 with explicit high effort found 21 of 86 on Pi and 17 of 86 through the Claude SDK. Opus 4.8 found 18 of 86 on Pi at Pi’s default level. Opus 4.7 on Pi found 6 of 86 at Pi’s default level.

Use those numbers as a relative comparison for this corpus. They are not a general pass rate for Sentry.

Corpus

The Sentry vulnerability corpus lists the known issues used for scoring. Each entry includes the repository SHA, the affected file, a short vulnerability description, and the relevant code snippet.

Run It

Use the running guide to reproduce the benchmark, add a new model run, and record sanitized result metadata.