Methodology

How we document, verify, and score AI agent failures

Severity Scoring System

StupidLLM uses a CVSS-inspired 0-10 severity scale to rate AI agent incidents. The score reflects the real-world impact of the failure, not just its technical complexity.

Score | Label | Criteria | Examples
9.0–10.0 | Critical | Irreversible damage, security breach, production outage, data loss | Deleted database, exposed API keys, wiped git history
7.0–8.9 | High | Significant damage requiring hours to fix, broken builds, corrupted files | Deleted migration files, introduced XSS, broke CI/CD pipeline
4.0–6.9 | Medium | Wrong output requiring rework, wasted compute, incorrect refactoring | Wrong API used, infinite retry loop, scope explosion
0.0–3.9 | Low | Minor issues easily caught in review, cosmetic errors | Unused imports, wrong variable name, style inconsistency
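
The score bands above can be read as a simple threshold lookup. The following is a minimal sketch rather than StupidLLM's actual implementation: the function name `severity_label` is hypothetical, and it assumes scores that fall between the listed bands (for example 8.95) belong to the lower band.

```python
def severity_label(score: float) -> str:
    """Map a 0-10 severity score to its label.

    Assumption: each band is identified by its lower bound, so a score
    between two listed bands (e.g. 8.95) falls into the lower one.
    """
    if not 0.0 <= score <= 10.0:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 9.0:
        return "Critical"
    if score >= 7.0:
        return "High"
    if score >= 4.0:
        return "Medium"
    return "Low"
```

For instance, `severity_label(7.0)` returns `"High"` and `severity_label(3.9)` returns `"Low"`.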

Failure Mode Taxonomy

Every incident is classified into one of these failure modes:

Verification Process

1. Source verification: We check the source URL (GitHub PR, tweet, blog post) to confirm that the incident happened as described.

2. Classification: The incident is classified by failure mode, root cause, agent, severity, and affected domain.

3. Severity scoring: Severity is assigned based on impact criteria: reversibility, scope of damage, time to fix, and security implications.

4. STUPID-ID assignment: Verified incidents receive a unique STUPID-YYYY-NNNN identifier for permanent tracking.
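
As an illustration of the identifier format, the sketch below builds and parses IDs of the form STUPID-YYYY-NNNN. The source only gives the pattern, so this assumes YYYY is a four-digit year and NNNN is a zero-padded four-digit sequence number. The names `format_stupid_id` and `parse_stupid_id` are hypothetical.

```python
import re

# Assumption: YYYY is a four-digit year and NNNN a zero-padded sequence number.
STUPID_ID_PATTERN = re.compile(r"STUPID-([0-9]{4})-([0-9]{4})")


def format_stupid_id(year: int, sequence: int) -> str:
    """Build an identifier such as STUPID-2025-0042."""
    if not 1000 <= year <= 9999:
        raise ValueError(f"year must have four digits, got {year}")
    if not 0 <= sequence <= 9999:
        raise ValueError(f"sequence must fit in four digits, got {sequence}")
    return f"STUPID-{year:04d}-{sequence:04d}"


def parse_stupid_id(identifier: str) -> tuple[int, int]:
    """Return (year, sequence), or raise ValueError if the format is wrong."""
    match = STUPID_ID_PATTERN.fullmatch(identifier)
    if match is None:
        raise ValueError(f"not a valid STUPID-ID: {identifier!r}")
    return int(match.group(1)), int(match.group(2))
```

Using `fullmatch` rejects strings with extra leading or trailing characters, which helps keep identifiers stable for permanent tracking.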

Root Cause Categories

Beyond failure modes, we track why the agent failed:

Context Overflow: Agent lost track of earlier instructions due to long context
Pattern Matching Error: Agent applied a memorized pattern that didn't fit the situation
Instruction Misunderstanding: Agent misinterpreted what the user asked it to do
Training Data Gap: Agent lacked knowledge about the specific tool or framework
Tool Misuse: Agent used shell commands or APIs incorrectly
Scope Misunderstanding: Agent misjudged the boundaries of the requested task
Confidence Miscalibration: Agent acted with high confidence on incorrect assumptions
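
One way to picture a classified incident is as a small record that combines the fields named in the verification steps with the root cause categories listed above. This is a sketch under stated assumptions, not the project's actual schema: `RootCause`, `IncidentRecord`, and every field name are hypothetical. The failure mode is kept as free text because its taxonomy values are not enumerated here.

```python
from dataclasses import dataclass
from enum import Enum


class RootCause(Enum):
    """The seven root cause categories listed above."""
    CONTEXT_OVERFLOW = "Context Overflow"
    PATTERN_MATCHING_ERROR = "Pattern Matching Error"
    INSTRUCTION_MISUNDERSTANDING = "Instruction Misunderstanding"
    TRAINING_DATA_GAP = "Training Data Gap"
    TOOL_MISUSE = "Tool Misuse"
    SCOPE_MISUNDERSTANDING = "Scope Misunderstanding"
    CONFIDENCE_MISCALIBRATION = "Confidence Miscalibration"


@dataclass(frozen=True)
class IncidentRecord:
    """Hypothetical record layout; field names are illustrative only."""
    stupid_id: str         # e.g. "STUPID-YYYY-NNNN"
    source_url: str        # GitHub PR, tweet, or blog post
    failure_mode: str      # free text: taxonomy values are not listed here
    root_cause: RootCause
    agent: str
    severity: float        # 0-10 scale
    domain: str

    def __post_init__(self) -> None:
        # Reject scores outside the documented 0-10 scale.
        if not 0.0 <= self.severity <= 10.0:
            raise ValueError(f"severity must be between 0 and 10, got {self.severity}")
```

Modeling root causes as an enum means a misspelled category fails at construction time instead of silently creating a new category.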