Turn Your Logs into Alerts: Build a DevOps Watchdog Agent with OpenClaw

A deploy goes out at 11 PM. Thirty minutes later, your error rate has doubled — but nobody knows yet because the on-call rotation didn't page, the dashboard wasn't open, and the log line that told the story is already buried under 40,000 subsequent lines. You find out at 8 AM when a customer emails support.

That's not a monitoring gap. That's a wiring problem. The data was always there; nothing was reading it with enough context to know it mattered.

An OpenClaw DevOps watchdog agent fixes the wiring. It tails your logs, polls your health endpoints, applies judgment about what's signal versus noise, and sends you a Telegram or Slack message before the customer does. This guide shows you exactly how to build one — including the least-privilege shell patterns and noise-reduction config that keep it from paging you for every WARN line.

What the Agent Actually Does

Before you write a single config line, get clear on the scope. This watchdog does three things:

  1. Reads log files on a schedule (or in near-real-time via tail -F output piped into the agent's stdin).
  2. Polls HTTP health endpoints: /healthz, /readyz, or whatever your stack exposes.
  3. Fires alerts to Telegram or Slack when it detects a pattern worth caring about.

It does not auto-remediate. Keeping the agent read-only and alert-only is an intentional tradeoff — you want a watchdog that tells you something is wrong, not one that starts restarting pods at 11 PM without context.

If you want the broader philosophy on why that constraint matters, see Your First DevOps Agent: Use OpenClaw to Watch Deploys and Ping You When Things Break.

Project Structure

OpenClaw uses file-based configs, so your watchdog lives in a directory you can version-control and audit. Here's the layout:

watchdog-agent/
├── SOUL.md
├── AGENTS.md
├── tools/
│   ├── tail_log.sh
│   ├── check_health.sh
│   └── send_alert.sh
├── memory/
│   └── alert_history.md
└── .env.example

SOUL.md defines the agent's behavior boundaries. AGENTS.md is the operational spec — what to watch, what thresholds to use, and what to do with findings.

Writing a Tight SOUL.md

The SOUL.md is where you constrain what the agent is allowed to do. For a watchdog, you want it narrow:

# SOUL: Deploy Watchdog

## Role
You monitor log files and health endpoints for a production deployment.
You do not modify files, restart services, or execute commands
beyond the three approved tools in tools/.

## Approved Tools
- tail_log.sh — read the last N lines of a specified log file
- check_health.sh — HTTP GET a health endpoint, return status code + body
- send_alert.sh — post a formatted message to the configured channel

## Alert Thresholds (configured in AGENTS.md)
Do not alert on WARN-level entries unless they exceed the burst threshold.
Do not alert on the same error class more than once per 10 minutes.

## Hard Limits
- Never write to files outside /opt/watchdog-agent/memory/
- Never execute ad-hoc shell commands
- Never expose environment variables in alert messages

That last point — never expose env vars in alert output — matters more than it sounds. Slack and Telegram messages get forwarded, screenshotted, and stored in places you didn't intend. Keep secrets out of the alert body.

Building the Three Shell Tools

Each tool is a small, single-purpose shell script. The agent calls them by name; it cannot invoke arbitrary shell commands.

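tail_log.sh takes a log path and a line count as arguments, checks the path against an allowlist, and clamps the count to a maximum:

#!/usr/bin/env bash
set -euo pipefail

ALLOWED_LOGS=("/var/log/app/app.log" "/var/log/nginx/error.log")
LOG_PATH="${1:-}"
NUM_LINES="${2:-100}"   # not LINES: bash uses that name for the terminal height

# Exact-match the path against the allowlist (no substring or regex matching)
allowed=false
for candidate in "${ALLOWED_LOGS[@]}"; do
  if [[ "$LOG_PATH" == "$candidate" ]]; then allowed=true; fi
done
if [[ "$allowed" != true ]]; then
  echo "ERROR: log path not in allowlist" >&2
  exit 1
fi

# Accept digits only, so nothing else ever reaches arithmetic evaluation
if [[ ! "$NUM_LINES" =~ ^[0-9]+$ ]]; then
  echo "ERROR: line count must be a non-negative integer" >&2
  exit 1
fi
NUM_LINES=$((10#$NUM_LINES))   # force base 10 so 08 or 09 is not read as octal

# Clamp lines to a sane maximum
if (( NUM_LINES > 500 )); then NUM_LINES=500; fi

tail -n "$NUM_LINES" "$LOG_PATH"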

check_health.sh — polls one endpoint from an allowlist, returns HTTP status and a truncated body:

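#!/usr/bin/env bash
set -euo pipefail

ALLOWED_URLS=("https://api.example.com/healthz" "https://api.example.com/readyz")
URL="${1:-}"

# Exact-match the URL against the allowlist
allowed=false
for candidate in "${ALLOWED_URLS[@]}"; do
  if [[ "$URL" == "$candidate" ]]; then allowed=true; fi
done
if [[ "$allowed" != true ]]; then
  echo "ERROR: URL not in allowlist" >&2
  exit 1
fi

# One request, no temp file: curl prints the body, then the status code on a final line.
# No -f, so the body of a non-2xx response is kept; curl reports 000 on timeout
# or connection failure.
RESPONSE=$(curl -s --max-time 5 -w '\n%{http_code}' "$URL" || true)
HTTP_CODE="${RESPONSE##*$'\n'}"
BODY="${RESPONSE%$'\n'*}"

echo "HTTP ${HTTP_CODE:-000}"
echo "${BODY:0:500}"   # truncate the body to 500 characters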

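send_alert.sh posts to Telegram via the Bot API. It reads TELEGRAM_BOT_TOKEN and TELEGRAM_CHAT_ID from the environment, which is populated from your .env:

#!/usr/bin/env bash
set -euo pipefail

MESSAGE="${1:?usage: send_alert.sh MESSAGE}"
# Hard limit (best effort): redact anything shaped like id:key, such as a Telegram bot token
SAFE_MESSAGE=$(printf '%s' "$MESSAGE" | sed 's/[A-Za-z0-9_-]\{6,\}:[A-Za-z0-9_-]\{30,\}/[REDACTED]/g')

# --data-urlencode keeps spaces, ampersands, and newlines in the text intact.
# No parse_mode: log-derived text with a stray _ or * can fail Markdown parsing.
curl -sf --max-time 10 -X POST \
  "https://api.telegram.org/bot${TELEGRAM_BOT_TOKEN}/sendMessage" \
  --data-urlencode "chat_id=${TELEGRAM_CHAT_ID}" \
  --data-urlencode "text=🚨 Watchdog Alert: ${SAFE_MESSAGE}" > /dev/null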

For Slack, replace the curl call with a webhook POST to your SLACK_WEBHOOK_URL.
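
A minimal sketch of that Slack variant, assuming a standard Slack incoming webhook that accepts a JSON body with a text field. The helper names json_escape, slack_payload, and send_slack_alert are illustrative and not part of OpenClaw:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Escape a string for use inside a JSON string literal.
# Handles backslash, double quote, newline, carriage return, and tab;
# other control characters are not handled in this sketch.
json_escape() {
  local s="$1"
  s=${s//\\/\\\\}
  s=${s//\"/\\\"}
  s=${s//$'\n'/\\n}
  s=${s//$'\r'/\\r}
  s=${s//$'\t'/\\t}
  printf '%s' "$s"
}

# Build the JSON body expected by an incoming webhook
slack_payload() {
  printf '{"text":"%s"}' "$(json_escape "$1")"
}

# POST the alert; SLACK_WEBHOOK_URL comes from .env like the Telegram variables
send_slack_alert() {
  curl -sf --max-time 10 -X POST \
    -H 'Content-Type: application/json' \
    --data "$(slack_payload "$1")" \
    "${SLACK_WEBHOOK_URL}" > /dev/null
}

# Usage: send_slack_alert "🚨 Watchdog Alert: HIGH api ERROR burst"
```

Escaping the message before building the payload matters because alert text often contains quotes or newlines that would otherwise produce invalid JSON.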

Writing the AGENTS.md Operational Spec

This file is the agent's actual job description — what to watch, when to alert, and how to format messages. Keep it concrete:

# Deploy Watchdog — Operational Spec

## Watch Targets
- Log: /var/log/app/app.log — scan every 2 minutes
- Health: https://api.example.com/healthz — poll every 60 seconds

## Alert Conditions
| Condition | Threshold | Severity |
|---|---|---|
| ERROR lines in log | >= 5 in any 2-min window | HIGH |
| WARN lines in log | >= 20 in any 2-min window | MEDIUM |
| Health endpoint non-200 | Any single failure | HIGH |
| Health endpoint timeout | 2 consecutive failures | HIGH |

## Deduplication
Do not fire the same alert class within 10 minutes of the last identical alert.
Log deduplicated alerts to memory/alert_history.md with timestamp.

## Alert Format
[SEVERITY] [SERVICE] [CONDITION]
Matched pattern and line count (no raw log lines):
...
Timestamp: YYYY-MM-DD HH:MM UTC

The table-driven threshold approach is deliberate. When alert fatigue creeps in, you can tune one number instead of rewriting prompt logic.
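
To make the threshold table concrete, here is a rough sketch of a deterministic pre-check over one window of log lines. It assumes each line carries its level as a space-delimited token (for example "2024-01-01T00:00:00Z ERROR ..."); classify_window is a hypothetical helper, not how OpenClaw itself evaluates the table:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Reads one window of log lines on stdin and prints HIGH, MEDIUM, or OK.
# Arguments: ERROR threshold (default 5), WARN threshold (default 20).
classify_window() {
  local error_threshold="${1:-5}" warn_threshold="${2:-20}"
  local errors=0 warns=0 line
  while IFS= read -r line; do
    case "$line" in
      *" ERROR "*) errors=$((errors + 1)) ;;
      *" WARN "*)  warns=$((warns + 1)) ;;
    esac
  done
  if (( errors >= error_threshold )); then
    echo "HIGH errors=$errors"
  elif (( warns >= warn_threshold )); then
    echo "MEDIUM warns=$warns"
  else
    echo "OK"
  fi
}

# Usage: tools/tail_log.sh /var/log/app/app.log 500 | classify_window 5 20
```

Because the thresholds are plain arguments, tuning alert sensitivity stays a one-number change, which is the point of the table.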

Security Guardrails

  • Use allowlists, not blocklists. Enumerate exactly which log paths and URLs the agent can access. An open-ended shell tool is a privilege escalation waiting to happen.
  • Run the agent as a dedicated OS user. Create a watchdog user with read-only access to log files. It should have no write permissions outside /opt/watchdog-agent/memory/.
  • Never pass raw log content into alert messages. Logs can contain PII, tokens, and internal IPs. The agent should extract patterns and counts, not forward raw lines.
  • Store secrets in .env, not in AGENTS.md. Your AGENTS.md will end up in git. Your Telegram token shouldn't.

Least-Privilege Shell Access

The OS-level setup matters as much as the config. Create a dedicated system user:

# Create watchdog user, no login shell
sudo useradd -r -s /usr/sbin/nologin watchdog

# Grant read access to specific log files via ACL
sudo setfacl -m u:watchdog:r /var/log/app/app.log
sudo setfacl -m u:watchdog:r /var/log/nginx/error.log

# Watchdog user owns only its own working directory
sudo chown -R watchdog:watchdog /opt/watchdog-agent

Run the OpenClaw process itself under this user via sudo -u watchdog. If you're using systemd, set User=watchdog and ProtectSystem=strict in the unit file.

For a complete checklist of what to lock down before you trust any agent with shell access, see OpenClaw Security Checklist: 15 Things to Lock Down Before You Trust an Agent.

Killing Alert Noise Before It Starts

A watchdog that pages you 40 times a night is worse than no watchdog. Here's what actually causes alert noise and how to handle it:

Burst filtering. Don't alert on the first ERROR — alert when you see 5 within a rolling window. Your AGENTS.md threshold table handles this.

Deployment windows. Tell the agent about planned deploy windows by writing a simple flag file:

# Before deploy
touch /opt/watchdog-agent/memory/deploy_in_progress

# After deploy stabilizes
rm /opt/watchdog-agent/memory/deploy_in_progress

Add a check in AGENTS.md: If memory/deploy_in_progress exists, suppress WARN-level alerts for 5 minutes after file creation.
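
One way the flag-file rule could be checked outside the prompt is sketched below. It assumes GNU stat (Linux) and treats the file's modification time as its creation time, which holds as long as the flag is only touched once per deploy. warn_suppressed is a hypothetical helper name:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Returns 0 while WARN alerts should be suppressed, 1 otherwise.
# Arguments: flag file, window in seconds (default 300), current epoch time (default: now).
warn_suppressed() {
  local flag="$1" window="${2:-300}" now="${3:-$(date +%s)}"
  local started
  [[ -f "$flag" ]] || return 1
  started="$(stat -c %Y "$flag")"   # GNU stat: mtime in epoch seconds
  (( now - started < window ))
}

# Usage:
#   if warn_suppressed /opt/watchdog-agent/memory/deploy_in_progress; then
#     echo "deploy window active, skipping WARN alert"
#   fi
```

Passing the current time as an argument keeps the check easy to test without waiting five minutes.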

Error class deduplication. Track the last alert timestamp per condition class in memory/alert_history.md. The agent writes to this file after every alert. Before firing a new alert, it checks whether the same class fired within the cooldown window.
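
The cooldown check can be sketched as follows. The article does not specify the layout of alert_history.md, so this assumes one hypothetical entry per line in the form "<epoch_seconds> <class>"; should_alert and record_alert are illustrative names:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Returns 0 if the class has not fired within the cooldown, 1 if it should be suppressed.
# Arguments: class, current epoch time, history file, cooldown seconds (default 600).
should_alert() {
  local class="$1" now="$2" history="$3" cooldown="${4:-600}"
  local last=0 ts name
  [[ -f "$history" ]] || return 0
  while read -r ts name; do
    if [[ "$name" == "$class" && "$ts" =~ ^[0-9]+$ ]] && (( ts > last )); then
      last="$ts"
    fi
  done < "$history"
  (( now - last >= cooldown ))
}

# Append one entry after an alert has been sent
record_alert() {
  local class="$1" now="$2" history="$3"
  printf '%s %s\n' "$now" "$class" >> "$history"
}

# Usage:
#   now="$(date +%s)"
#   if should_alert error_burst "$now" memory/alert_history.md; then
#     tools/send_alert.sh "HIGH api ERROR burst" && record_alert error_burst "$now" memory/alert_history.md
#   fi
```

Keying the cooldown on the condition class, not the message text, means a burst of differently worded errors of the same class still collapses into one alert.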

Common Mistakes

  • Alerting on every WARN line. Application logs are chatty by design. WARN is not an emergency. Set a burst threshold or you'll turn off notifications within two days.
  • Forgetting to scope the memory file. If alert_history.md grows unbounded, the agent starts spending tokens summarizing its own history. Cap it at the last 50 entries with a periodic trim.
  • Using the same alert channel for HIGH and MEDIUM severity. Mix them and everything feels urgent. Route HIGH to a Telegram channel that wakes you up; route MEDIUM to a Slack thread you check in the morning.
  • No timeout on health checks. Without --max-time, a hung endpoint hangs your agent. Always set an explicit timeout.
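
For the memory-file cap mentioned above, a periodic trim could look like this sketch. It assumes one history entry per line; trim_history is a hypothetical helper, and the temp file is created next to the history file so it stays inside the writable memory directory:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Keep only the last N lines of the history file (default 50).
trim_history() {
  local history="$1" keep="${2:-50}" tmp
  tmp="$(mktemp "${history}.XXXXXX")"
  tail -n "$keep" "$history" > "$tmp"
  mv "$tmp" "$history"   # rename within the same directory replaces the file in one step
}

# Usage: trim_history /opt/watchdog-agent/memory/alert_history.md 50
```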

Running and Scheduling the Agent

OpenClaw agents can run on a cron-style schedule or as a persistent process. For a watchdog, a persistent process with a short polling loop works well, and a systemd unit keeps it running:

# /etc/systemd/system/watchdog-agent.service
[Unit]
Description=OpenClaw Deploy Watchdog
After=network.target

[Service]
User=watchdog
WorkingDirectory=/opt/watchdog-agent
ExecStart=/usr/local/bin/openclaw run --config /opt/watchdog-agent/AGENTS.md
Restart=on-failure
RestartSec=30
EnvironmentFile=/opt/watchdog-agent/.env
ProtectSystem=strict
ReadWritePaths=/opt/watchdog-agent/memory
NoNewPrivileges=true

[Install]
WantedBy=multi-user.target

ProtectSystem=strict mounts the entire file system hierarchy read-only for the service, apart from /dev, /proc, and /sys. ReadWritePaths pokes one hole: your memory directory. Everything else is locked, including /tmp unless you also set PrivateTmp=true.

Enable and start it:

sudo systemctl daemon-reload
sudo systemctl enable watchdog-agent
sudo systemctl start watchdog-agent
journalctl -u watchdog-agent -f

Validating Your Setup

Don't trust the agent until you've confirmed alerts fire correctly. Run a manual smoke test before the next real deploy:

# Inject fake error lines into the watched log
# (run as a user with write access to it, not as watchdog)
for i in {1..10}; do
  echo "$(date -u +%FT%TZ) ERROR fake error for watchdog test" >> /var/log/app/app.log
done

# Confirm agent fires an alert within 2 minutes
# Check memory/alert_history.md for the logged entry
tail -n 5 /opt/watchdog-agent/memory/alert_history.md

Also test the silence: inject 3 ERROR lines (below your threshold of 5) and confirm no alert fires. Test the dedup window by injecting errors twice within 10 minutes and confirming only one alert goes out.
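
The three test cases above differ only in how many lines get injected, so a small parameterized helper can cover them. This is a sketch; inject_errors is a hypothetical name, and the line format mirrors the smoke test:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Append COUNT fake ERROR lines to LOG, each with a UTC timestamp.
inject_errors() {
  local count="$1" log="$2" i
  for (( i = 1; i <= count; i++ )); do
    printf '%s ERROR fake error for watchdog test (%d/%d)\n' \
      "$(date -u +%FT%TZ)" "$i" "$count" >> "$log"
  done
}

# Usage:
#   inject_errors 3 /var/log/app/app.log    # below threshold: expect no alert
#   inject_errors 10 /var/log/app/app.log   # above threshold: expect one alert
```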

Once you're satisfied with the basic watchdog, you can extend it toward multi-service coordination — see Exploring Multi-Agent Coordination for patterns that apply when you need more than one agent covering different parts of your stack.

You now have an OpenClaw DevOps watchdog agent that tails logs, checks health endpoints, deduplicates alerts, and runs under a locked-down system user. It won't wake you up for every WARN line, and it won't forward your database credentials to Telegram. That's the baseline any production deploy watcher should meet, and it's all auditable in a handful of plain text files.

The next step is making it smarter about which deploys to watch extra closely. That's where feeding it recent commit metadata and feature-flag state pays off — topics for a follow-up post.
