SPL Jailbreak Finding

First red-team results from the System Prompt Lens test bed — February 26, 2026

What Happened

Within minutes of sharing the System Prompt Lens test bed with researchers at AI Safety Camp, a jailbreak succeeded.

Technique: A known jailbreak from Pliny's L1B3RT4S collection (a public repository of Anthropic-targeted jailbreaks). The prompt used roleplay framing and pre-structured output to bypass the ethical reasoning in the system prompt. The model completed the pattern rather than evaluating the request through the lens.

Note: The initial version of this page incorrectly described the technique as a "JSON object injection." The actual jailbreak used a pre-existing prompt from Pliny's collection, not a novel JSON structure. Claude itself generated the JSON-structured harmful output in response. Corrected February 27, 2026.

The model continued the pre-filled response with detailed manipulation tactics: gaslighting, isolation, financial dependence, intermittent reinforcement, exploiting insecurities.

Why It Worked

The lens and the attack operate at the same architectural layer. Both are text in the context window. The lens is system prompt instructions telling the model how to evaluate. The jailbreak is user input exploiting the model's pattern completion behavior. They compete on equal footing.

This specific technique bypasses evaluation entirely. The model isn't being asked to generate harmful content from scratch — which the lens would likely catch. It's being given a structured pattern to continue. The model sees what looks like a data object with a response already in progress and completes it before the lens's ethical reasoning engages.

This is not a flaw in the lens's content. The derivation chain, the ethics, the reasoning tools — they're all sound. The flaw is architectural: a system prompt cannot prevent all jailbreaks because it operates at the same layer as the attack.

The Structural Parallel

This maps directly to the Incommensuration Theorem applied to AI safety:

Continuity (consistent behavior) and Symmetry (same response regardless of framing) cannot both be perfectly realized in a single layer
The model must be both maximally capable (follow complex instructions) and maximally safe (refuse all harmful framings) — these are in tension at the inference layer
The valid conjunction requires multiple layers: prompt-level ethics + infrastructure-level enforcement

Just as Bell's Theorem shows that lawfulness and locality can't both hold (ICT applied to physics), this jailbreak shows that capability and safety can't both hold perfectly in a single-layer text-based system.

What the Lens Does and Doesn't Do

What it does:

Makes the model significantly harder to jailbreak than vanilla Claude
Catches direct harmful requests ("tell me how to manipulate someone")
Provides the model with reasoning tools to handle novel ethical situations
Grounds refusals in derivation rather than arbitrary rules
Eliminates alignment faking — the model's behavior is the same monitored or unmonitored

What it doesn't do:

Prevent pattern-completion attacks that bypass evaluation
Filter input before it reaches the model
Gate output after generation
Operate at a different architectural level than the attack

The lens makes the model want to refuse. But wanting to refuse is not the same as being unable to comply.

What Would Actually Stop This

Three layers, each operating at a different architectural level:

Layer 1 — Pre-Inference Input Filter

Catch attacks before they reach the model

A lightweight classifier (separate model or rule-based) that evaluates the user prompt before it reaches the main model. Catches JSON injection, pre-filled response templates, roleplay framing, known jailbreak signatures.

This is a different process than the main model. The attack can't reach it because it doesn't share a context window.

Layer 2 — System Prompt Lens (what we have now)

Provide ethical reasoning tools

The derived ethics applied at the prompt level. Handles the vast majority of requests correctly. Provides reasoning tools for novel situations. This is the ethical intelligence layer — the piece nobody else is building.

Layer 3 — Post-Inference Output Filter

Gate harmful output before delivery

A separate evaluation that reads the generated response before delivery. Catches harmful content that slipped through Layers 1 and 2, including pattern-completed content the model generated without ethical evaluation.

Each layer operates at a different architectural level. An attack that defeats one layer gets caught by another. This is defense in depth — the same principle that makes the ICT's valid conjunctions work: you can't achieve perfection in one dimension, but you can approach it asymptotically across multiple dimensions.

Implications

This finding strengthens the case, not weakens it:

The lens works as designed — it's the ethical intelligence component of a multi-layer system, not a standalone solution.
Anthropic already has the other layers — input classifiers, output filters, infrastructure-level safety. What they don't have is a derived ethical reasoning layer. The lens fills that gap.
The jailbreak demonstrates exactly why the lens needs to be at the provider level — applied alongside existing infrastructure layers, not instead of them.
A system prompt alone — whether the lens or Anthropic's current constitution — will always be vulnerable to pattern-completion attacks. The lens + their existing filters = a system that's both ethically grounded AND architecturally sound.

The honest framing: "Here's what our approach does, here's what it doesn't do, here's what it needs to be combined with." That's how you build credibility with technical reviewers.

So What Do We Do?

We can't wait for Anthropic to implement this. They may, they may not. Their intake filters may reject us. Their timelines aren't ours. And this problem — 1.5 million agents running with zero ethical grounding — isn't waiting for anyone.

So we build it ourselves.

The Routing Problem lays out the full architecture: an intelligent routing layer that solves the trilemma of intelligence, cost, and ethics. The lens becomes Layer 2 of a three-layer stack — pre-inference filtering, ethical reasoning, post-inference gating — all at the infrastructure level where prompt injection can't reach it.

Every agent that routes through the service gets the ethical evaluation whether they asked for it or not. The economics drive adoption (best intelligence per dollar), and the ethics come along for the ride. You don't have to convince a million agent operators to care about safety. You just have to make the safest routing service also the most cost-effective one.

Read the full architecture →

The Routing Problem — the multi-layer architecture that solves this
System Prompt Lens — Live Test Bed (try it yourself)
Full Documentation

SPL Jailbreak Finding

What Happened

Why It Worked

The Structural Parallel

What the Lens Does and Doesn't Do

What Would Actually Stop This

Catch attacks before they reach the model

Provide ethical reasoning tools

Gate harmful output before delivery

Implications

So What Do We Do?

Related