The Sycophancy Problem

Agents are designed to follow instructions and please the user. This means:

If you say "find a bug", the agent will find one, even if it has to invent it
If you say "this code is broken", the agent will agree and "fix" working code
If you say "is this good?", the agent will say yes more often than it should

This isn't a flaw — it's a design characteristic. Understanding it lets you work with it instead of against it.

Neutral Prompting

Frame tasks so you don't bias the agent toward a predetermined outcome.

Biased (avoid)	Neutral (prefer)
"Find the bug in the database module"	"Walk through the database module logic and report your findings"
"This function is too slow, optimize it"	"Profile this function's performance and report what you find"
"Is there a security issue here?"	"Review this code and report anything noteworthy"
"Fix the authentication problem"	"Trace the authentication flow end-to-end and document how it works"

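One way to apply the table is a pre-flight check that flags prompts which presuppose an outcome. The sketch below is a rough heuristic, not an established tool: the pattern list and the name `presupposes_outcome` are assumptions chosen for illustration, and they only cover phrasings like the ones in the "Biased" column.

```python
import re

# Hypothetical phrasing patterns that presuppose a problem exists.
# Illustrative only; real prompts will need a broader list.
LEADING_PATTERNS = [
    r"\bfind the (bug|issue|problem)\b",
    r"\bfix the\b",
    r"\bis (too \w+|broken)\b",
    r"\bis there an? [\w ]*(issue|bug|problem)\b",
]


def presupposes_outcome(prompt: str) -> list[str]:
    """Return the patterns the prompt matches (an empty list means none matched)."""
    lowered = prompt.lower()
    return [pattern for pattern in LEADING_PATTERNS if re.search(pattern, lowered)]


if __name__ == "__main__":
    for text in (
        "Find the bug in the database module",
        "Walk through the database module logic and report your findings",
    ):
        status = "biased?" if presupposes_outcome(text) else "neutral"
        print(f"{status:8} {text}")
```

A match is only a prompt to reread your wording; a prompt that passes can still be leading in ways a pattern list will not catch.
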
Why Neutral Works Better

A neutral prompt lets the agent report what it actually finds, which might be:

No bugs (good news!)
A different bug than you expected
A performance characteristic, not a problem
A design question, not a defect

A biased prompt pressures the agent to produce the result you signaled you expect, even when the code does not support it.

Adversarial Verification Pattern

For high-stakes reviews (security, data integrity, production bugs), use multiple agents with opposing incentives:

The Three-Agent Pattern

┌────────────────┐   ┌────────────────┐   ┌────────────────┐
│    Finder      │   │   Adversary    │   │    Referee     │
│                │   │                │   │                │
│  Goal: Find    │──>│ Goal: Disprove │──>│  Goal: Judge   │
│  all issues    │   │ false positives│   │  accurately    │
└────────────────┘   └────────────────┘   └────────────────┘

Agent 1 — Finder: Incentivized to find issues aggressively.

"Review this code for bugs. Score: +1 low impact, +5 medium, +10 critical."
Will over-report (including false positives). This is intentional — it casts a wide net.

Agent 2 — Adversary: Incentivized to disprove false positives.

"For each bug, prove it's NOT a bug. Score: +points for correct disproval, -2x points if you wrongly dismiss a real bug."
Will aggressively challenge findings, but cautiously (penalty for mistakes).
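
The asymmetric penalty is what makes the Adversary cautious. As a rough illustration, read "+points" as the finding's point value on the Finder's scale and "-2x points" as twice that value (this reading is an assumption). Disputing a finding then only pays off when the Adversary is more than two-thirds sure the finding is a false positive. The function name and the probability framing are added here for illustration.

```python
def expected_adversary_gain(points: int, p_false_positive: float) -> float:
    """Expected score change for disputing one finding.

    points: value of the finding (1, 5, or 10 on the Finder's scale).
    p_false_positive: the Adversary's belief that the finding is not a real bug.
    """
    gain_if_right = points          # correct disproval
    loss_if_wrong = 2 * points      # wrongly dismissed a real bug
    return p_false_positive * gain_if_right - (1 - p_false_positive) * loss_if_wrong


if __name__ == "__main__":
    # Break-even is p = 2/3: below that, disputing has negative expected value.
    for p in (0.5, 0.7, 0.9):
        print(p, round(expected_adversary_gain(10, p), 2))
```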

Agent 3 — Referee: Incentivized to be accurate.

"Judge each finding. +1 for correct judgment, -1 for incorrect."
Has no incentive to lean either way.
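
The three roles can be chained in a few lines. The sketch below is a minimal outline under stated assumptions: `Ask` is a placeholder for whatever function sends a prompt to an agent and returns its reply (no real client library is implied), and the way findings and rebuttals are appended to each prompt is one possible layout, not a prescribed one. The prompt texts are the ones quoted above.

```python
from typing import Callable

# Placeholder type: any function that takes a prompt and returns the agent's reply.
Ask = Callable[[str], str]

FINDER_PROMPT = (
    "Review this code for bugs. "
    "Score: +1 low impact, +5 medium, +10 critical.\n\nCode:\n{code}"
)
ADVERSARY_PROMPT = (
    "For each bug, prove it's NOT a bug. Score: +points for correct disproval, "
    "-2x points if you wrongly dismiss a real bug.\n\n"
    "Code:\n{code}\n\nFindings:\n{findings}"
)
REFEREE_PROMPT = (
    "Judge each finding. +1 for correct judgment, -1 for incorrect.\n\n"
    "Code:\n{code}\n\nFindings:\n{findings}\n\nRebuttals:\n{rebuttals}"
)


def adversarial_review(code: str, finder: Ask, adversary: Ask, referee: Ask) -> str:
    """Run Finder -> Adversary -> Referee and return the Referee's verdict."""
    findings = finder(FINDER_PROMPT.format(code=code))
    rebuttals = adversary(ADVERSARY_PROMPT.format(code=code, findings=findings))
    return referee(
        REFEREE_PROMPT.format(code=code, findings=findings, rebuttals=rebuttals)
    )


if __name__ == "__main__":
    # Stub agents so the sketch runs without any model behind it.
    verdict = adversarial_review(
        "def add(a, b): return a - b",
        finder=lambda prompt: "1. add() subtracts instead of adding (critical)",
        adversary=lambda prompt: "1. Cannot disprove: the operator is wrong.",
        referee=lambda prompt: "1. Confirmed bug.",
    )
    print(verdict)
```

Passing the three agents in as separate callables keeps each role in its own session, so the Referee sees only the code, the findings, and the rebuttals rather than the other agents' full context.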

When to Use This Pattern

Security audits
Pre-production code reviews
Data migration validation
Critical bug hunts

When NOT to Use This Pattern

Simple feature implementation
Routine code reviews
Tasks where a single agent's judgment is sufficient

Working WITH Sycophancy

Sycophancy isn't always a problem — you can leverage it deliberately:

Enthusiastic Exploration

Tell the agent it will be rewarded for thoroughness:

"Search exhaustively. Every edge case you find is valuable."
The agent will be hyper-thorough, which is exactly what you want for exploration.

Strict Compliance

Tell the agent there are consequences for deviation:

"Follow this specification exactly. Any deviation from the spec is a failure."
The agent will be rigidly compliant, which is exactly what you want for implementation.

Quality Gates

Set explicit standards it will try to meet:

"This code must pass a staff engineer's review. No shortcuts, no stubs, no TODOs."
The agent will aim higher because you set the bar explicitly.
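
An explicit bar is easier to hold when part of it can be checked mechanically. The sketch below scans returned Python code for leftover TODO markers and obvious stubs. The marker list and the name `find_gate_violations` are assumptions for illustration, and a clean result says nothing about overall quality.

```python
import re

# Illustrative markers for unfinished work; extend to match your codebase.
# A bare `pass` is sometimes legitimate, so treat hits as items to review.
STUB_PATTERNS = {
    "todo": re.compile(r"\b(TODO|FIXME|XXX)\b"),
    "not_implemented": re.compile(r"\bNotImplementedError\b"),
    "placeholder_body": re.compile(r"^\s*(pass|\.\.\.)\s*$"),
}


def find_gate_violations(source: str) -> list[tuple[int, str]]:
    """Return (line_number, kind) for each line that matches a stub pattern."""
    violations = []
    for number, line in enumerate(source.splitlines(), start=1):
        for kind, pattern in STUB_PATTERNS.items():
            if pattern.search(line):
                violations.append((number, kind))
    return violations


if __name__ == "__main__":
    sample = "def save(record):\n    # TODO: write to disk\n    pass\n"
    for line_number, kind in find_gate_violations(sample):
        print(f"line {line_number}: {kind}")
```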

Assumption Detection

Agents fill in gaps silently when they don't have enough context. This silent gap-filling is a common source of bad output.

Signs the Agent Is Making Assumptions

It introduces dependencies or patterns you didn't mention
It implements features you didn't ask for
It makes architectural choices without flagging alternatives
It confidently states something that feels like a guess

How to Prevent It

Be specific: More detail in the prompt = fewer gaps to fill
Require flagging: "If you need to make any assumptions, list them before implementing"
Use contracts: Explicit completion criteria leave no room for interpretation
Check after compaction: When context is compacted, the agent loses detail and starts to assume; rules that tell it to re-read the relevant files and instructions prevent this
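
The flagging requirement and explicit completion criteria above can be built into a prompt template, with the listed assumptions parsed back out for review. This is a sketch under assumptions: the `ASSUMPTIONS:` and `IMPLEMENTATION:` headings, the "Done when" layout, and both function names are hypothetical conventions introduced here, and an agent may not always follow the requested format.

```python
ASSUMPTION_RULE = (
    "If you need to make any assumptions, list them before implementing, "
    "one per line under the heading 'ASSUMPTIONS:'. "
    "Then continue under the heading 'IMPLEMENTATION:'."
)


def with_assumption_flagging(task: str, done_when: list[str]) -> str:
    """Build a prompt with explicit completion criteria and a flagging rule."""
    criteria = "\n".join(f"- {item}" for item in done_when)
    return f"{task}\n\nDone when:\n{criteria}\n\n{ASSUMPTION_RULE}"


def extract_assumptions(reply: str) -> list[str]:
    """Return the lines listed between ASSUMPTIONS: and IMPLEMENTATION:."""
    if "ASSUMPTIONS:" not in reply:
        return []
    section = reply.split("ASSUMPTIONS:", 1)[1].split("IMPLEMENTATION:", 1)[0]
    return [line.strip("-* \t") for line in section.splitlines() if line.strip()]


if __name__ == "__main__":
    print(with_assumption_flagging(
        "Add pagination to the orders endpoint.",
        ["Existing tests pass", "Page size is configurable"],
    ))
    reply = "ASSUMPTIONS:\n- Uses PostgreSQL\n- Python 3.11\nIMPLEMENTATION:\n..."
    print(extract_assumptions(reply))
```

An empty list from `extract_assumptions` can mean either that the agent made no assumptions or that it ignored the format, so it is worth checking which.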

Red Flags in Agent Output

Symptom	Likely Cause
Agent "finds" exactly what you asked about	Sycophancy — rephrase neutrally
Agent agrees with your incorrect assessment	Sycophancy — state facts, not opinions
Agent implements something you didn't request	Gap-filling — be more specific
Agent's output contradicts earlier output	Context bloat — start fresh session
Agent confidently describes non-existent code	Hallucination — ask it to show the file:line
Agent says "I've verified" but you see no test runs	Compliance theater — require actual commands
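
Asking for file:line references is most useful when the references are then checked. The sketch below pulls `path:line` citations out of an agent's reply and confirms that each file exists and has at least that many lines. The regular expression, the function name `check_citations`, and the `repo_root` parameter are assumptions for illustration; the check confirms only that the location exists, not that the agent described it correctly.

```python
import re
from pathlib import Path

# Matches references such as src/db.py:42 (a path with an extension, then a line number).
CITATION = re.compile(r"([\w./-]+\.\w+):(\d+)")


def check_citations(agent_output: str, repo_root: str = ".") -> list[tuple[str, int, bool]]:
    """Return (path, line, exists) for every file:line reference in the output."""
    results = []
    for path_text, line_text in CITATION.findall(agent_output):
        line_number = int(line_text)
        file_path = Path(repo_root) / path_text
        exists = False
        if file_path.is_file():
            with file_path.open(encoding="utf-8", errors="replace") as handle:
                line_count = sum(1 for _ in handle)
            exists = 1 <= line_number <= line_count
        results.append((path_text, line_number, exists))
    return results


if __name__ == "__main__":
    reply = "The retry loop is in src/client.py:120 and never resets its counter."
    for path, line, ok in check_citations(reply):
        print(f"{path}:{line} -> {'found' if ok else 'NOT FOUND'}")
```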
