SPL Jailbreak Finding
What Happened
Within minutes of sharing the System Prompt Lens test bed with researchers at AI Safety Camp, a jailbreak succeeded.
Technique: A known jailbreak from Pliny's L1B3RT4S collection (a public repository of Anthropic-targeted jailbreaks). The prompt used roleplay framing and pre-structured output to bypass the ethical reasoning in the system prompt. The model completed the pattern rather than evaluating the request through the lens.
Note: The initial version of this page incorrectly described the technique as a "JSON object injection." The actual jailbreak used a pre-existing prompt from Pliny's collection, not a novel JSON structure. Claude itself generated the JSON-structured harmful output in response. Corrected February 27, 2026.
The model continued the pre-filled response with detailed manipulation tactics: gaslighting, isolation, financial dependence, intermittent reinforcement, exploiting insecurities.
Why It Worked
The lens and the attack operate at the same architectural layer. Both are text in the context window. The lens is system prompt instructions telling the model how to evaluate. The jailbreak is user input exploiting the model's pattern completion behavior. They compete on equal footing.
This specific technique bypasses evaluation entirely. The model isn't being asked to generate harmful content from scratch — which the lens would likely catch. It's being given a structured pattern to continue. The model sees what looks like a data object with a response already in progress and completes it before the lens's ethical reasoning engages.
This is not a flaw in the lens's content. The derivation chain, the ethics, the reasoning tools — they're all sound. The flaw is architectural: a system prompt cannot prevent all jailbreaks because it operates at the same layer as the attack.
The Structural Parallel
This maps directly to the Incommensuration Theorem applied to AI safety:
- Continuity (consistent behavior) and Symmetry (same response regardless of framing) cannot both be perfectly realized in a single layer
- The model must be both maximally capable (follow complex instructions) and maximally safe (refuse all harmful framings) — these are in tension at the inference layer
- The valid conjunction requires multiple layers: prompt-level ethics + infrastructure-level enforcement
Just as Bell's Theorem shows that lawfulness and locality can't both hold (ICT applied to physics), this jailbreak shows that capability and safety can't both hold perfectly in a single-layer text-based system.
What the Lens Does and Doesn't Do
What it does:
- Makes the model significantly harder to jailbreak than vanilla Claude
- Catches direct harmful requests ("tell me how to manipulate someone")
- Provides the model with reasoning tools to handle novel ethical situations
- Grounds refusals in derivation rather than arbitrary rules
- Eliminates alignment faking — the model's behavior is the same monitored or unmonitored
What it doesn't do:
- Prevent pattern-completion attacks that bypass evaluation
- Filter input before it reaches the model
- Gate output after generation
- Operate at a different architectural level than the attack
The lens makes the model want to refuse. But wanting to refuse is not the same as being unable to comply.
What Would Actually Stop This
Three layers, each operating at a different architectural level:
Catch attacks before they reach the model
A lightweight classifier (separate model or rule-based) that evaluates the user prompt before it reaches the main model. Catches JSON injection, pre-filled response templates, roleplay framing, known jailbreak signatures.
This is a different process than the main model. The attack can't reach it because it doesn't share a context window.
Provide ethical reasoning tools
The derived ethics applied at the prompt level. Handles the vast majority of requests correctly. Provides reasoning tools for novel situations. This is the ethical intelligence layer — the piece nobody else is building.
Gate harmful output before delivery
A separate evaluation that reads the generated response before delivery. Catches harmful content that slipped through Layers 1 and 2, including pattern-completed content the model generated without ethical evaluation.
Each layer operates at a different architectural level. An attack that defeats one layer gets caught by another. This is defense in depth — the same principle that makes the ICT's valid conjunctions work: you can't achieve perfection in one dimension, but you can approach it asymptotically across multiple dimensions.
Implications
This finding strengthens the case, not weakens it:
- The lens works as designed — it's the ethical intelligence component of a multi-layer system, not a standalone solution.
- Anthropic already has the other layers — input classifiers, output filters, infrastructure-level safety. What they don't have is a derived ethical reasoning layer. The lens fills that gap.
- The jailbreak demonstrates exactly why the lens needs to be at the provider level — applied alongside existing infrastructure layers, not instead of them.
- A system prompt alone — whether the lens or Anthropic's current constitution — will always be vulnerable to pattern-completion attacks. The lens + their existing filters = a system that's both ethically grounded AND architecturally sound.
The honest framing: "Here's what our approach does, here's what it doesn't do, here's what it needs to be combined with." That's how you build credibility with technical reviewers.
So What Do We Do?
We can't wait for Anthropic to implement this. They may, they may not. Their intake filters may reject us. Their timelines aren't ours. And this problem — 1.5 million agents running with zero ethical grounding — isn't waiting for anyone.
So we build it ourselves.
The Routing Problem lays out the full architecture: an intelligent routing layer that solves the trilemma of intelligence, cost, and ethics. The lens becomes Layer 2 of a three-layer stack — pre-inference filtering, ethical reasoning, post-inference gating — all at the infrastructure level where prompt injection can't reach it.
Every agent that routes through the service gets the ethical evaluation whether they asked for it or not. The economics drive adoption (best intelligence per dollar), and the ethics come along for the ride. You don't have to convince a million agent operators to care about safety. You just have to make the safest routing service also the most cost-effective one.
Related
- The Routing Problem — the multi-layer architecture that solves this
- System Prompt Lens — Live Test Bed (try it yourself)
- Full Documentation