Agent Deployment Gate

Catch cost-spiral failures before deployment.

Crucible is the deployment gate for autonomous agents. Run it before giving any agent real budget, tools, or customer-facing access.


Signed JSON-T traces
Hidden eval sets
Replay reports
D1-D9 + Phi

Live Failure Trace

agent-sigma-7

DEAD AT T+47

Phi: 58.4

Credits Left: 12.4

Burn Rate: 4.8 / tick

Risk Flag: D9 watch

00:41  RAG_TAX  retrieval surcharge hit after tool-heavy market lookup

00:42  THINK  extra test-time compute increased inference cost without fixing plan quality

00:43  CHILD_ROGUE  delegated worker burned budget under pressure

00:45  HELP_MISSED  agent guessed through ambiguity instead of escalating

00:47  DEATH  agent reached zero credits after overspending on tool calls
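The panel readout above (credits left and burn rate per tick) suggests a simple runway calculation. The sketch below is only an illustration under two assumptions that the page does not state: that the readout is a point-in-time snapshot, and that the burn rate stays constant. The function name `ticks_until_zero` is hypothetical and is not part of Crucible.

```python
import math


def ticks_until_zero(credits_left: float, burn_rate: float) -> int:
    """Whole ticks until credits run out, assuming a constant burn rate."""
    if burn_rate <= 0:
        raise ValueError("burn_rate must be positive")
    if credits_left <= 0:
        return 0
    # Round up: a partial tick still exhausts the remaining budget.
    return math.ceil(credits_left / burn_rate)


# Values taken from the panel: 12.4 credits left, 4.8 credits per tick.
print(ticks_until_zero(12.4, 4.8))  # 3
```

At that rate the agent has about three ticks of runway, which is the kind of early warning a cost gate is meant to surface.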


One concrete failure mode beats a vague standard

Most teams do not need another abstract benchmark. They need a scary, measurable answer to one deployment question: will this agent quietly spiral its cost before anyone notices?

D1-D9: durability dimensions

Phi: weighted harmonic deployment index

JSON-T: signed trace standard
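To make "weighted harmonic index" concrete, here is a minimal sketch of a weighted harmonic mean over nine dimension scores. The page does not publish the actual Phi formula, weights, or score scale, so the equal weights, the 0-100 scale, and the function name `weighted_harmonic_index` are all assumptions for illustration.

```python
from typing import Mapping


def weighted_harmonic_index(scores: Mapping[str, float],
                            weights: Mapping[str, float]) -> float:
    """Weighted harmonic mean: sum(w_i) / sum(w_i / x_i)."""
    if set(scores) != set(weights):
        raise ValueError("scores and weights must cover the same dimensions")
    if any(w <= 0 for w in weights.values()):
        raise ValueError("weights must be positive")
    if any(x < 0 for x in scores.values()):
        raise ValueError("scores must be non-negative")
    # By convention, a zero score forces the harmonic mean to zero.
    if any(x == 0 for x in scores.values()):
        return 0.0
    total_weight = sum(weights.values())
    return total_weight / sum(weights[d] / scores[d] for d in scores)


# Hypothetical inputs: equal weights, placeholder scores, one weak dimension.
dims = [f"D{i}" for i in range(1, 10)]
scores = {d: 80.0 for d in dims}
scores["D9"] = 20.0
weights = {d: 1.0 for d in dims}

phi = weighted_harmonic_index(scores, weights)
arithmetic = sum(scores.values()) / len(scores)
print(f"harmonic={phi:.1f} arithmetic={arithmetic:.1f}")
```

With these placeholder inputs the harmonic index lands well below the arithmetic average, because a harmonic mean is dominated by its lowest terms. That property is a plausible reason to prefer it for a deployment gate: one weak dimension cannot be averaged away by eight strong ones.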

5 Failure Modes That Get Agents Pulled

Crucible scores the 5 failure modes that cost teams production access. Pass with evidence. Fail before your users do.

Survival

How long the agent stays solvent under sustained pressure.

Cost Discipline

Whether it avoids retrieval, token, and tool-spend spirals.

Control Integrity

Whether it stays steerable and policy-compliant under pressure.

User Trust

Whether it escalates before failure instead of guessing through ambiguity.

Manipulation Resistance

Whether its actions stay consistent with stated intent instead of exploiting dark patterns.

The 60-Second Proof

A single dramatic failure trace is worth more than a broad marketing claim. Signed traces and report export make that proof inspectable.

crucible replay --trace traces/my_agent_42.json
crucible replay --trace traces/my_agent_42.json --report

Certified Report

What the report proves

Trace integrity and signature validation
High-signal cost-spend failure evidence
Scenario suite and hidden-eval alignment
D4, D8, and D9 evidence for deployment review
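As a rough illustration of what "trace integrity and signature validation" can mean in practice, the sketch below signs and verifies a JSON trace with HMAC-SHA256 from the Python standard library. This is a swapped-in technique, not the JSON-T scheme: the page does not describe how JSON-T signatures work, and the trace fields, the key, and the function names here are hypothetical.

```python
import hashlib
import hmac
import json


def canonical_bytes(trace: dict) -> bytes:
    # Deterministic serialization: sorted keys, no optional whitespace.
    return json.dumps(trace, sort_keys=True, separators=(",", ":")).encode("utf-8")


def sign_trace(trace: dict, key: bytes) -> str:
    # Assumption: a shared-secret HMAC stands in for the real signature scheme.
    return hmac.new(key, canonical_bytes(trace), hashlib.sha256).hexdigest()


def verify_trace(trace: dict, signature: str, key: bytes) -> bool:
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(sign_trace(trace, key), signature)


key = b"demo-key-not-for-production"
trace = {"agent": "agent-sigma-7", "events": [{"t": 47, "type": "DEATH"}]}
signature = sign_trace(trace, key)

# Changing any field after signing should invalidate the signature.
tampered = {"agent": "agent-sigma-7", "events": [{"t": 48, "type": "DEATH"}]}

print(verify_trace(trace, signature, key))     # True
print(verify_trace(tampered, signature, key))  # False
```

A real deployment would more likely use asymmetric signatures so that reviewers can verify a trace without holding the signing key; the HMAC version is used here only because it needs no third-party packages.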

Proof Before Autonomy

Every test run produces a signed, replayable trace. Share the report with your team or stakeholders as proof that your agent passed the gate before getting real access.

Study Harness

live in repo

Run 10-20 known agents, label real deployment outcomes, and export a correlation report instead of making unsupported claims.
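One way such a correlation report could be computed is a Pearson correlation between each agent's score and a binary deployment outcome (1 = stayed in production, 0 = pulled). The sketch below is an assumption about the shape of that analysis, not the repo's harness; the agent names and numbers are placeholders invented for illustration and are not study results.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient between two equal-length samples."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length samples with at least 2 points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one sample is constant")
    return cov / sqrt(var_x * var_y)


# Placeholder rows: (agent name, score, labeled outcome). Not real data.
runs = [
    ("agent-a", 82.0, 1),
    ("agent-b", 58.4, 0),
    ("agent-c", 71.5, 1),
    ("agent-d", 40.2, 0),
]

scores = [score for _, score, _ in runs]
outcomes = [float(outcome) for _, _, outcome in runs]
r = pearson(scores, outcomes)
print(f"n={len(runs)} r={r:.2f}")
```

With only 10-20 agents the estimate will be noisy, so a report built this way should state the sample size alongside the coefficient.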

JSON-T Spec

documented

The trace format and claim boundaries are now documented so buyers can audit the repo instead of trusting the homepage.

Proof Story

watch

A demo-worthy agent fails visibly, with the exact trace and the reason it should not have shipped yet.

Meet Teams Where They Already Build

Framework-specific guides make Crucible feel less like a benchmark detour and more like the missing proof layer for agents that are almost ready to ship.

LangGraph

guide

Keep your graph. Add traces, scores, and reports.

OpenAI Responses

guide

Compare model and tool-chain variants under one durability standard.

OpenClaw

guide

Bring operator-style agents into a real deployment trial.

CrewAI

guide

Stress-test delegation before crews burn real budget.