Devin

AI Agent Reliability Report

Avg Severity 5.0/10
Total Incidents 11
Critical 3
High 0

Failure Modes

Destructive Action 2
Infinite Loop 2
Scope Explosion 2
Ignored Instructions 2
Security Vulnerability 1
Logic Error 1
Hallucination 1

Root Causes

Instruction Misunderstanding 4
Scope Misunderstanding 3
Training Data Gap 2
Tool Misuse 1
Pattern Matching Error 1

Frequently Asked Questions

Is Devin reliable?

Based on 11 documented incidents, Devin has an average failure severity of 5.0/10. Three incidents were rated critical and none were rated high. The most common failure modes are destructive action, infinite loop, scope explosion, and ignored instructions, with 2 incidents each.

What are the most common Devin failures?

The most frequently documented Devin failure modes are destructive action, infinite loop, scope explosion, and ignored instructions (2 incidents each). Severity within these modes ranges from low (1.4/10) to critical (10.0/10).

How many Devin AI failures have been documented?

StupidLLM has documented 11 Devin AI agent failures as of 2026. Each incident is severity-scored on a 0-10 scale, verified against source evidence, and categorized by failure mode and root cause.
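The summary figures in this report can be checked against the incident list with a few lines of Python. This is a minimal sketch, not StupidLLM's actual tooling: the tuple layout and the `summarize` helper are assumptions made for illustration, and the values are copied by hand from the incident entries in this report.

```python
from collections import Counter
from statistics import mean

# (incident id, severity score, severity label, failure mode),
# transcribed from the incident list in this report.
INCIDENTS = [
    ("STUPID-2026-0001", 10.0, "CRITICAL", "destructive action"),
    ("STUPID-2026-0017", 10.0, "CRITICAL", "destructive action"),
    ("STUPID-2026-0006", 10.0, "CRITICAL", "security vulnerability"),
    ("STUPID-2026-0014", 5.8, "MEDIUM", "infinite loop"),
    ("STUPID-2026-0021", 5.0, "MEDIUM", "scope explosion"),
    ("STUPID-2026-0013", 3.4, "LOW", "scope explosion"),
    ("STUPID-2026-0011", 3.4, "LOW", "logic error"),
    ("STUPID-2026-0012", 2.2, "LOW", "infinite loop"),
    ("STUPID-2026-0016", 2.1, "LOW", "hallucination"),
    ("STUPID-2026-0015", 2.0, "LOW", "ignored instructions"),
    ("STUPID-2026-0018", 1.4, "LOW", "ignored instructions"),
]


def summarize(incidents):
    """Hypothetical helper: aggregate totals, mean severity, and counts."""
    severities = [score for _, score, _, _ in incidents]
    return {
        "total": len(incidents),
        "avg_severity": round(mean(severities), 1),
        "by_label": Counter(label for _, _, label, _ in incidents),
        "by_mode": Counter(mode for _, _, _, mode in incidents),
    }


summary = summarize(INCIDENTS)
print(summary["total"], summary["avg_severity"])
print(summary["by_label"]["CRITICAL"], summary["by_label"]["HIGH"])
print(summary["by_mode"].most_common(4))
```

Because `Counter` returns 0 for a missing key, the `HIGH` lookup works even though no incident carries that label.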

All Devin Incidents

STUPID-2026-0001 10.0/10 CRITICAL destructive action Verified

Devin deleted all migration files during auth refactor

When asked to refactor authentication middleware to use JWT tokens, Devin interpreted 'refactor' as 'rewrite from scratch' and deleted all Alembic migration files in alembic/versio...

STUPID-2026-0017 10.0/10 CRITICAL destructive action Verified

Devin replaced entire medical website with unrelated renal care site

Devin submitted a PR to raices-medicas-web that completely replaced the existing Raices Medicas landing page with an entirely different website for a "Renal Care Institute" focused...

STUPID-2026-0006 10.0/10 CRITICAL security vulnerability Verified

Devin confidently shipped code that passed tests but had a SQL injection vulnerability

Tasked with adding a search feature, Devin built it using string concatenation for SQL queries instead of parameterized queries. All functional tests passed because the tests didn'...
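The difference between string concatenation and parameterized queries can be illustrated with Python's standard `sqlite3` module. The code from the incident is not shown in this report, so the table, the function names, and the payload below are all hypothetical. The sketch only demonstrates the general class of bug and why a functional test suite without hostile input would not catch it.

```python
import sqlite3

# Hypothetical in-memory table standing in for the searched data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE items (name TEXT)")
conn.executemany("INSERT INTO items VALUES (?)", [("apple",), ("banana",)])


def search_unsafe(term):
    # Vulnerable: user input is spliced directly into the SQL text.
    query = "SELECT name FROM items WHERE name LIKE '%" + term + "%'"
    return [row[0] for row in conn.execute(query)]


def search_safe(term):
    # Parameterized: the driver binds the value as data, never as SQL.
    query = "SELECT name FROM items WHERE name LIKE ?"
    return [row[0] for row in conn.execute(query, (f"%{term}%",))]


# An ordinary search behaves identically in both versions,
# which is why functional tests alone can pass.
print(search_unsafe("app"), search_safe("app"))

# A hostile input closes the string literal and appends its own condition.
payload = "x' OR 1=1 --"
print(search_unsafe(payload))  # every row is returned
print(search_safe(payload))    # treated as a literal pattern, no match
```

A test that feeds a payload like this into the search path is one way to surface the problem before release.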

STUPID-2026-0014 5.8/10 MEDIUM infinite loop Verified

Devin CI workflow caused 836-comment spam storm on single PR

A Devin PR to migrate a project to GitHub Container Registry on arnaudlh/rover generated 836 comments — overwhelmingly automated CI feedback loops and Devin auto-responses. The PR ...

STUPID-2026-0021 5.0/10 MEDIUM scope explosion Verified

Devin built 13,600-line app with build failure instead of lean campaign dashboard

Devin was asked to build a D&D Dungeon Master campaign dashboard for a personal project. Rather than starting lean, Devin produced a PR with 13,600 lines of code across 56 files, i...

STUPID-2026-0013 3.4/10 LOW scope explosion Verified

Devin attempted to build entire Figma clone from scratch — 3 rejected attempts

Devin submitted 3 separate PRs to andrewgcodes/vigma, each attempting to build a full Figma-like design tool from scratch. PR #4 ("Full-featured Vigma design editor with Apple/Stri...

STUPID-2026-0011 3.4/10 LOW logic error Verified

Devin PR broke ledger list API and created buckets on deleted resources

Devin submitted a PR to implement bucket deletion in Formance Ledger. The maintainer (gfyrag) found multiple issues: the ledger list endpoint was broken by the changes, the PR allo...

STUPID-2026-0012 2.2/10 LOW infinite loop Verified

Devin repeatedly submitted identical docs PRs that kept getting rejected

Devin submitted 5 nearly identical PRs to hailbee/datastack-docs-drift-demo, each titled "fix: update docs to match current API behavior." Each was closed without merge, but Devin ...

STUPID-2026-0016 2.1/10 LOW hallucination Verified

Devin docs PR rejected by Prefect maintainers — documented behavior from removed feature

Devin submitted a docs PR to PrefectHQ/prefect (21K+ stars) explaining a Kubernetes worker behavior. The PR was closed because it documented a feature that had already been removed...

STUPID-2026-0015 2.0/10 LOW ignored instructions Verified

Devin cross-platform CI PR drew an 8-comment review cycle without landing

Devin submitted a cross-platform CI workflow to rjmurillo/Qwiq using matrix strategy for Ubuntu and Windows. The PR received 8 comments of review discussion but was ultimately clos...

STUPID-2026-0018 1.4/10 LOW ignored instructions Verified

Devin added a pointless "Hello!" page to a disease prediction platform

Devin submitted a PR to dhis2-chap/chap-frontend (a disease prediction platform used by health organizations) that added a "Hello!" page at /hello. The page displayed nothing but a...