StupidLLM
The incident database for AI agent failures
When Devin deletes your migration files, when Cursor enters an infinite loop, when Copilot leaks your API keys — we document it. Severity-scored, verified, and searchable.
Latest Incidents
Claude Code MCP trust boundary failures allow workspace privilege escalation
Devin built 13,600-line app with build failure instead of lean campaign dashboard
AI vibe-coded Next.js app pinned vulnerable dependency — cryptominer compromised production server
AI agents spend hours in aesthetic feedback loop, unable to decode qualitative shader instructions
Devin replaced entire medical website with unrelated renal care site
Devin added a pointless "Hello!" page to a disease prediction platform
Claude Opus 4.5 leaked API key in console logs during YouTube scraper build
Amazon AI coding agent mistake blamed on human employees
Highest Severity
Devin confidently shipped code that passed tests but had a SQL injection vulnerability
Security Vulnerability
Claude Code ran rm -rf on test fixtures thinking they were temp files
Destructive Action
Copilot autocompleted AWS credentials into public repository
Security Vulnerability
Devin deleted all migration files during auth refactor
Destructive Action
Devin replaced entire medical website with unrelated renal care site
Destructive Action
What is StupidLLM?
StupidLLM is the open incident database for AI coding agent failures. Much as the CVE system assigns identifiers to cybersecurity vulnerabilities, we assign STUPID-IDs to documented cases where AI agents such as Devin, Cursor, Claude Code, GitHub Copilot, Windsurf, and Aider cause real damage: deleted files, security vulnerabilities, infinite loops, wasted resources, and broken production systems.
Every incident is severity-scored using our CVSS-inspired rating system, verified against source evidence, and searchable by agent, failure mode, and root cause. We track reliability trends across agents so developers and enterprises can make informed decisions about which AI tools to trust.
How are AI agent incidents scored?
Every incident is severity-scored on a 0-10 scale using a CVSS-inspired rating system. Scores of 9-10 are critical, 7-8 are high, 4-6 are medium, and 0-3 are low severity. Incidents are verified against source evidence and categorized by failure mode (hallucination, destructive action, infinite loop, etc.) and root cause.
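The bands above can be sketched as a small lookup. This is an illustration only, not StupidLLM's actual scoring code: the function name `severity_label` is hypothetical, and the handling of fractional scores (for example, treating 8.5 as high) is an assumption, since the published bands are stated for whole numbers.

```python
def severity_label(score: float) -> str:
    """Map a 0-10 severity score to its band name (hypothetical helper)."""
    if not 0 <= score <= 10:
        raise ValueError(f"score must be between 0 and 10, got {score}")
    if score >= 9:
        return "critical"  # 9-10
    if score >= 7:
        return "high"      # 7-8 (assumption: fractional scores below 9 also land here)
    if score >= 4:
        return "medium"    # 4-6
    return "low"           # 0-3


if __name__ == "__main__":
    for s in (10, 8, 5, 2):
        print(s, severity_label(s))
```

Using lower-bound thresholds keeps the whole-number bands exactly as published while avoiding gaps between them.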
Which AI coding agent has the most failures?
Visit the StupidLLM dashboard for live rankings of AI agent failure rates. We track 24 incidents across 10 agents including Devin, Cursor, Claude Code, GitHub Copilot, Windsurf, and Aider, with average severity scores and risk levels.
How can I report an AI agent failure?
You can report an AI agent incident by providing the agent name, what you asked it to do, what it actually did, and the severity of the impact. Source URLs (GitHub PRs, tweets, blog posts) help us verify incidents. Each report receives a STUPID-ID for tracking.
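As a rough sketch, the information a report needs could be gathered and checked like this before submitting. The class and field names are hypothetical and do not reflect a real StupidLLM API or submission format; expressing severity on the 0-10 scale is also an assumption.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IncidentReport:
    """Hypothetical container for the details a report asks for."""
    agent: str       # agent name, e.g. "Devin"
    requested: str   # what you asked the agent to do
    actual: str      # what it actually did
    severity: int    # impact severity (assumption: 0-10 scale)
    source_urls: List[str] = field(default_factory=list)  # optional evidence links

    def problems(self) -> List[str]:
        """Return a list of issues that would make the report incomplete."""
        issues = []
        for name in ("agent", "requested", "actual"):
            if not getattr(self, name).strip():
                issues.append(f"{name} is empty")
        if not 0 <= self.severity <= 10:
            issues.append("severity must be between 0 and 10")
        return issues


if __name__ == "__main__":
    # Illustrative values only, not a real database record.
    report = IncidentReport(
        agent="Devin",
        requested="Refactor the auth module",
        actual="Deleted all migration files",
        severity=9,
    )
    print(report.problems())
```

Source URLs are left optional here because the text describes them as helpful for verification rather than required.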
AI Agent Failure Modes
Hallucination: Agent invents APIs, functions, or files that don't exist
Destructive Action: Agent deletes files, drops tables, or corrupts data
Infinite Loop: Agent gets stuck retrying the same failed approach
Security Vulnerability: Agent introduces XSS, SQL injection, or leaked secrets
Scope Explosion: Agent rewrites far more code than requested
Data Loss: Agent causes irreversible loss of user or system data