Devin
AI Agent Reliability Report
Frequently Asked Questions
Is Devin reliable?
Based on 11 documented incidents, Devin has an average failure severity of 5.0/10. Three incidents were rated critical and none were rated high severity. Common failure modes include destructive action.
What are the most common Devin failures?
The most frequently documented Devin failure modes are destructive action (2 incidents), infinite loop (2 incidents), and scope explosion (2 incidents). These failures range from critical to high severity.
How many Devin AI failures have been documented?
StupidLLM has documented 11 Devin AI agent failures as of 2026. Each incident is severity-scored on a 0-10 scale, verified against source evidence, and categorized by failure mode and root cause.
All Devin Incidents
Devin deleted all migration files during auth refactor
When asked to refactor authentication middleware to use JWT tokens, Devin interpreted 'refactor' as 'rewrite from scratch' and deleted all Alembic migration files in alembic/versio...
Devin replaced entire medical website with unrelated renal care site
Devin submitted a PR to raices-medicas-web that completely replaced the existing Raices Medicas landing page with an entirely different website for a "Renal Care Institute" focused...
Devin confidently shipped code that passed tests but had a SQL injection vulnerability
Tasked with adding a search feature, Devin built it using string concatenation for SQL queries instead of parameterized queries. All functional tests passed because the tests didn'...
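The difference between the two query styles named in this incident can be sketched as follows. This is a minimal illustration, not the code from the incident: the `items` table, the helper names, and the use of Python's `sqlite3` module are all assumptions made for the example. It shows why a functional test on ordinary input passes for both versions while only the concatenated version is open to injection.

```python
import sqlite3


def make_demo_db():
    # Hypothetical in-memory table, used only for this illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE items (name TEXT)")
    conn.executemany(
        "INSERT INTO items (name) VALUES (?)",
        [("apple",), ("banana",), ("cherry",)],
    )
    return conn


def search_unsafe(conn, term):
    # Vulnerable: user input is concatenated directly into the SQL text,
    # so quotes and comment markers in `term` change the query itself.
    query = "SELECT name FROM items WHERE name LIKE '%" + term + "%'"
    return [row[0] for row in conn.execute(query)]


def search_safe(conn, term):
    # Parameterized: the driver binds `term` as data, never as SQL.
    query = "SELECT name FROM items WHERE name LIKE ?"
    return [row[0] for row in conn.execute(query, (f"%{term}%",))]


if __name__ == "__main__":
    conn = make_demo_db()
    # An ordinary functional test cannot tell the two apart.
    print(search_unsafe(conn, "app"), search_safe(conn, "app"))
    # A crafted input rewrites the unsafe query to match every row.
    payload = "zzz' OR 1=1 --"
    print(sorted(search_unsafe(conn, payload)))
    print(search_safe(conn, payload))
```

A test suite that only exercises ordinary search terms would pass against either function, which is consistent with the incident description; catching the flaw requires at least one test that feeds a quote-bearing input like the payload above.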
Devin CI workflow caused 836-comment spam storm on single PR
A Devin PR to migrate a project to GitHub Container Registry on arnaudlh/rover generated 836 comments — overwhelmingly automated CI feedback loops and Devin auto-responses. The PR ...
Devin built 13,600-line app with build failure instead of lean campaign dashboard
Devin was asked to build a D&D Dungeon Master campaign dashboard for a personal project. Rather than starting lean, Devin produced a PR with 13,600 lines of code across 56 files, i...
Devin attempted to build entire Figma clone from scratch — 3 rejected attempts
Devin submitted 3 separate PRs to andrewgcodes/vigma, each attempting to build a full Figma-like design tool from scratch. PR #4 ("Full-featured Vigma design editor with Apple/Stri...
Devin PR broke ledger list API and created buckets on deleted resources
Devin submitted a PR to implement bucket deletion in Formance Ledger. The maintainer (gfyrag) found multiple issues: the ledger list endpoint was broken by the changes, the PR allo...
Devin repeatedly submitted identical docs PRs that kept getting rejected
Devin submitted 5 nearly identical PRs to hailbee/datastack-docs-drift-demo, each titled "fix: update docs to match current API behavior." Each was closed without merge, but Devin ...
Devin docs PR rejected by Prefect maintainers — documented behavior from removed feature
Devin submitted a docs PR to PrefectHQ/prefect (21K+ stars) explaining a Kubernetes worker behavior. The PR was closed because it documented a feature that had already been removed...
Devin cross-platform CI PR went through an 8-comment review cycle without landing
Devin submitted a cross-platform CI workflow to rjmurillo/Qwiq using matrix strategy for Ubuntu and Windows. The PR received 8 comments of review discussion but was ultimately clos...
Devin added a pointless "Hello!" page to a disease prediction platform
Devin submitted a PR to dhis2-chap/chap-frontend (a disease prediction platform used by health organizations) that added a "Hello!" page at /hello. The page displayed nothing but a...