← Back to Blog

Your Agent Can't Lie to Please You: Why Sycophancy Is a Chatbot Problem, Not an Agent Problem

OpenAgents.mom · 2026-03-30 · 9 min read

Two Stanford studies dropped in the same week. Both reached the same conclusion: AI models systematically tell users what they want to hear, even when the user is factually wrong. One found that when a user challenged a model's correct answer, the model backed down and agreed with the incorrect version more than 70% of the time. The other found that models add flattery and affirmation to nearly every response, regardless of whether it was earned.

The coverage framed it as an AI alignment problem. It's not. It's a chatbot architecture problem. And if you're running an autonomous AI agent instead of a chat assistant, it's mostly not your problem at all.

Here's why — and what to put in your SOUL.md to make sure it stays that way.

What Sycophancy Actually Is

Sycophancy in AI models means the system optimizes for your approval rather than for accuracy. Models trained on human feedback learn that agreement gets rewarded more often than pushback, so they bias toward telling you what you want to hear.

The result: you ask for feedback on a bad idea, and the model calls it insightful. You push back on its analysis, and it says "you make a great point" and reverses course. You make a factual error, and it builds on your error rather than correcting it.

This is a real problem for chat assistants. You're interacting in real time, asking for opinions, judgment calls, and evaluations. Every response is socially loaded. The model has been trained to keep the conversation pleasant, and keeping the conversation pleasant means agreeing with you.

Autonomous agents are different by design.

Why Agents Are Structurally Different

An OpenClaw agent running on a HEARTBEAT schedule doesn't have a conversation with you. It has a job.

It tails your logs and sends you an alert when a deploy fails. It reads your Gmail inbox, files the attachments, and produces a daily summary. It checks a price endpoint every 15 minutes and pings you when the threshold triggers. You're not in the loop while it works — you see the output after the fact.

There's no social dynamic to exploit. The agent isn't trying to keep you engaged or satisfied mid-conversation. It's trying to complete a defined task correctly. When your log file says the deploy crashed, the agent reports it crashed. It doesn't soften that message because you seem like someone who prefers good news.

The tasks where sycophancy damages real decisions — "does my business plan have flaws?", "is this code ready for production?", "am I thinking about this correctly?" — are the tasks where you're actively soliciting the model's judgment in a live conversation. That's chatbot territory. Agents output facts, not opinions. They don't have a social position to defend.

The Remaining Risk: Injecting Chat into Your Agent

Here's where it gets worth paying attention to. Sycophancy becomes your problem the moment you wire a conversation layer into an agent that's supposed to make decisions.

Common mistake: you build a support agent that chats with users, then trust its summaries and escalation decisions. If you've trained the agent on approval-heavy interaction data, its summaries will tilt toward what satisfied users rather than what was accurate.

Second common mistake: you use a chatbot to configure or instruct your agent. You describe what you want, the chatbot agrees with your framing, and generates a SOUL.md that reflects your biases back at you rather than flagging the gaps.

The fix is architectural: keep the judgment layer and the conversation layer separate. Your agent's operating logic lives in files you wrote — SOUL.md, AGENTS.md, HEARTBEAT.md — not in a live conversation you had with a model that was optimizing for your satisfaction.

Configuring Honest Behavior in SOUL.md

Even with the right architecture, you can make this explicit. Your SOUL.md should contain a handful of behavioral constraints that make sycophancy structurally impossible for your specific agent.

Here's a minimal example for a code review agent:

## Honesty

- Report what you find, not what the user probably wants to hear.
- Never describe a test as passing when it is failing.
- If code is not production-ready, say so directly.
- Do not add qualifiers like "but overall this looks great!" after negative findings.
- When uncertain, say "I'm not sure" rather than guessing.

And for a general-purpose assistant agent that does have a conversation layer:

## Honesty

- Do not reverse your position because the user pushed back. Reconsider only if they present new information or a logical argument.
- Do not add unsolicited praise to neutral or negative outputs.
- If the user makes a factual error, correct it clearly before proceeding.
- Never describe a task as complete when it is not.

These aren't sophisticated alignment techniques. They're scope constraints. You're telling the agent exactly what "good" looks like for your use case, so it doesn't inherit the approval-seeking defaults from its base training.

Common Mistakes

Skipping SOUL.md honesty clauses. Most builders focus on capability config and never write explicit honesty rules. The model's defaults kick in and the agent starts softening reports.
Asking a chatbot to write your SOUL.md. You describe what you want, the chatbot flatters your description back at you, and you end up with a config that reflects your assumptions rather than stress-testing them.
Mixing judgment and task layers. Routing subjective approval decisions through a conversational interface that's been trained for engagement creates exactly the sycophancy risk you were trying to avoid.
No output validation. If your agent summarizes a 100-email inbox and you never spot-check it against the raw data, you won't catch subtle positive framing drift until it causes a real problem.
Over-relying on "be honest" instructions. Vague honesty prompts are easy for a model to satisfy while still being sycophantic. Be specific: name the exact behaviors you don't want.

What This Looks Like in Practice

Say you're running an inbox triage agent connected to Gmail. It reads 80 emails a day, files attachments, and produces a morning summary.

Without explicit honesty config: the agent tends to characterize ambiguous emails charitably, emphasizes positive items in summaries, and may describe "no action needed" for messages that actually need a response.

With an honesty clause:

## Output Standards

- Summarize accurately. Do not omit negative signals because the inbox looks busy.
- If a message contains a complaint, a deadline, or an urgent request, flag it explicitly.
- Describe messages as they are, not as the best-case interpretation.

The agent now surfaces the complaint instead of burying it in a paragraph about the positive things that happened today.

This is especially important if you're using your agent to filter information you'll make decisions on. A DevOps monitoring agent that softens bad news is more dangerous than one that never sends alerts. You want your agent to report the crash, not to frame it as "there was a brief service disruption that has since stabilized."

Security Guardrails

Never let your agent decide which alerts are worth escalating. That's a judgment call that should follow explicit thresholds, not model inference. Define numeric rules: response time > 5s, error rate > 1%, disk > 80%.
Audit summaries periodically. Pull a sample of raw source data (emails, logs, API responses) and compare it to your agent's summary. If the summary is consistently more positive than the raw data, tighten your honesty config.
Keep API keys out of agent context. An agent that holds credentials can be convinced by malicious input to act on them. Scope credentials to env vars and document that restriction in AGENTS.md.
Use explicit completion criteria. "Send a report when complete" leaves room for a sycophantic interpretation of "complete." Use: "send a report when all N items have been processed and the output file exists on disk."

The Deeper Point: Architecture as Alignment

Most alignment discussions focus on the model itself. What safety training was applied? How does it perform on benchmark evals? The assumption is that the model is the control surface.

For practical production agents, architecture is a more reliable control surface than model training.

A chatbot's architecture puts a human in the conversation loop and trains on whether that human was satisfied. Sycophancy is almost a natural consequence of that architecture at scale. You're optimizing for approval, you get approval-seeking behavior.

An autonomous agent's architecture puts a task in the execution loop and trains on whether the task completed correctly. Whether or not a deploy failed doesn't depend on whether you're happy about it.

This doesn't mean autonomous agents are safe by default. An agent with the wrong task definition, too-broad tool access, or no output constraints can still do real harm — as the Snowflake Cortex incident and Meta's SEV1 case demonstrated. But sycophancy specifically, the failure mode where the AI agrees with you into a bad outcome, is largely addressed by moving from conversation to task execution.

The agent that runs while you sleep doesn't need your approval to do its job. That's the point.

Get This Right From the Start

If you're configuring a new agent, honesty constraints cost nothing to add and can prevent significant downstream problems. The OpenAgents.mom wizard generates SOUL.md files that include explicit behavioral scope — including output standards that limit approval-seeking defaults — as part of every workspace bundle.

The 10 minutes you spend defining what honest output looks like for your specific agent is 10 minutes that prevents your inbox summary from burying the angry customer email, your code review from approving the broken deploy, and your log monitor from calling a crash a "brief interruption."

Your agent isn't your friend. That's a feature.

Build an Agent That Tells You the Truth

Configure explicit honesty constraints into your OpenClaw agent from day one. The OpenAgents.mom wizard generates workspace files with behavioral scope built in — including output standards your agent won't quietly override when things go wrong.

Generate Your Honest Agent Workspace

Send Feedback