Monitoring and troubleshooting n8n on a VPS

Learn how to monitor and troubleshoot n8n on a VPS. Metrics, logs, and debugging tips for production reliability.

Why monitoring matters before troubleshooting

Most people only look at monitoring after something is already broken. That’s backwards.

When you run n8n in production, you want to catch problems before they become user-visible. Monitoring gives you early signals: workflows taking longer than normal, queues backing up, or errors piling up silently.

Troubleshooting without metrics is blind guesswork.

What you should be watching

System-level signals

  • CPU and memory usage of main and worker processes
  • Disk I/O and available space for Postgres
  • Network latency to critical APIs

Application-level metrics

  • Workflow execution times (average, p95, p99)
  • Webhook request rate and error codes
  • Queue depth and worker health in queue mode
  • Database connection count and slow queries

If you already have Prometheus and Grafana set up, scrape the n8n /metrics endpoint. If not, start small: even a tool like Uptime Kuma checking webhook URLs gives you valuable signal.
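
The /metrics endpoint is off by default. A minimal way to switch it on and sanity-check it, assuming n8n is configured through environment variables and listens on localhost:5678:

# Enable the Prometheus endpoint in n8n's environment
N8N_METRICS=true

# Quick check that metrics are actually being served
curl -s http://localhost:5678/metrics | head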

Logs: Your first stop for answers

n8n logs to stdout by default. If you run it in Docker, capture the container logs and ship them to a log system such as Loki or Elasticsearch if possible. For deeper troubleshooting, raise the log level and, if you want a persistent copy, write to a file as well:

N8N_LOG_LEVEL=debug
N8N_LOG_OUTPUT=console,file
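
If n8n runs in Docker, a quick way to follow the logs and surface problems (the container name n8n is an assumption, adjust it to yours):

# Follow recent container logs and filter for likely trouble
docker logs -f --tail 200 n8n 2>&1 | grep -iE "error|timeout|econnrefused"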

When troubleshooting:

  • Search for error stacks in failed executions.
  • Look at timestamps: slow queries or API timeouts often cluster.
  • Check for credential issues: expired tokens or misconfigured OAuth cause silent failures.

Reverse proxy logs

Do not forget the proxy. Nginx or Caddy logs show if requests even reached n8n. Many “broken webhooks” turn out to be proxy misconfigurations.
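
A couple of minutes in the proxy logs usually settles it. A sketch for Nginx with default log locations (the paths are an assumption; Caddy users can check its access logs instead):

# Did webhook requests reach the proxy at all, and what status did they get?
grep "/webhook" /var/log/nginx/access.log | tail -n 20

# 502/504 responses usually mean the proxy could not reach n8n
grep -E '" 50[24] ' /var/log/nginx/access.log | tail -n 20
tail -n 50 /var/log/nginx/error.log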

Debugging workflows inside n8n

Execution data

Open the execution list in the editor. Each run shows inputs, outputs, and errors for every node. You can pin data in a node to replay the workflow without hitting external APIs.

Breakpoints with the Wait node

The Wait node can halt an execution for a set amount of time or until it is resumed externally, for example by a webhook call. This is useful for debugging flows that trigger downstream effects you don’t want during testing.

Error workflows

n8n lets you designate a global error workflow, built around the Error Trigger node, that runs whenever another workflow fails. Set one up to capture context (workflow name, execution ID, error message) and alert you when things break.

Troubleshooting queue mode

If you run n8n in queue mode, issues often show up in Redis or Postgres rather than the editor.

  • Jobs stuck waiting: not enough workers or Redis unreachable.
  • Jobs stuck active: worker crashed mid-execution.
  • Workers idle: wrong database settings or workers pointing at the wrong Redis.

Check n8n_scaling_mode_queue_jobs_waiting and n8n_scaling_mode_queue_jobs_failed metrics if you expose them.
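
You can also inspect the queue directly in Redis. A sketch assuming the default Bull key prefix and queue name (bull:jobs), which may differ on your install:

# Jobs waiting to be picked up vs. currently being processed
redis-cli llen bull:jobs:wait
redis-cli llen bull:jobs:active

# Failed jobs accumulate in a sorted set
redis-cli zcard bull:jobs:failed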

Database-related failures

Postgres is a common choke point.

  • Connections exceeded: too many workers or bad pooling.
  • Slow queries: the execution history table has bloated. Enable pruning (see the snippet after this list).
  • Disk full: backups, WAL files, or oversized logs eating space.
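
Pruning is controlled through environment variables. A minimal sketch, with illustrative values that keep roughly two weeks of execution data:

# Prune old execution data automatically
EXECUTIONS_DATA_PRUNE=true
# Maximum age in hours (336 ≈ 14 days)
EXECUTIONS_DATA_MAX_AGE=336
# Optional hard cap on the number of stored executions
EXECUTIONS_DATA_PRUNE_MAX_COUNT=50000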

Use pg_stat_activity and pg_stat_statements to diagnose.
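
Two starting points, run with psql against the n8n database (the user and database name n8n are assumptions):

# How many connections are open right now?
psql -U n8n -d n8n -c "SELECT count(*) AS connections FROM pg_stat_activity;"

# Which non-idle queries have been running the longest?
psql -U n8n -d n8n -c "SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
                       FROM pg_stat_activity
                       WHERE state <> 'idle'
                       ORDER BY runtime DESC NULLS LAST
                       LIMIT 10;"

# If the pg_stat_statements extension is enabled (column is total_time before Postgres 13):
psql -U n8n -d n8n -c "SELECT calls, round(total_exec_time) AS total_ms, left(query, 80) AS query
                       FROM pg_stat_statements
                       ORDER BY total_exec_time DESC
                       LIMIT 10;"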

When workflows are just slow

Not every problem is a crash. Some workflows are slow because:

  • External APIs throttle you or respond with high latency.
  • Large JSON payloads bog down parsing.
  • Code nodes do heavy loops instead of efficient transforms.
  • Workers compete for limited CPU on a small VPS.

Mitigation: break workflows into smaller pieces, use queues, and scale workers.
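
If the bottleneck is worker capacity rather than one slow node, adding workers is often the cheapest fix. A sketch with Docker Compose, assuming your worker service is called n8n-worker:

# Run three worker containers instead of one
docker compose up -d --scale n8n-worker=3

# Alternatively, raise per-worker concurrency via the worker command, e.g.:
# command: n8n worker --concurrency=10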

Alerts that actually help

Configure alerts that point you in the right direction:

  • Webhook down for more than 2 minutes
  • Queue backlog above 50 jobs for 10 minutes
  • Failed job rate over 1 per second
  • Database connections hitting 90% of max
  • Disk space below 15% free

Route these alerts somewhere real people see them, not buried in logs.

A workflow for troubleshooting itself

Some teams build a meta workflow in n8n to watch itself. Example:

  • Cron node every 5 minutes → call Prometheus API → check queue depth → if backlog > 50, send a Slack alert (the raw checks are sketched below).
  • Another branch checks SSL expiry on your domain → send email if under 14 days.
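
Under the hood, those branches boil down to two simple checks. Rough command-line equivalents (the Prometheus address, metric name, and domain are assumptions):

# Current queue depth, as the HTTP Request node would fetch it from Prometheus
curl -s 'http://prometheus:9090/api/v1/query?query=n8n_scaling_mode_queue_jobs_waiting' \
  | jq -r '.data.result[0].value[1]'

# Expiry date of the certificate on your webhook domain
echo | openssl s_client -connect n8n.example.com:443 -servername n8n.example.com 2>/dev/null \
  | openssl x509 -noout -enddate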

It sounds silly but it works: n8n keeping tabs on n8n.

FAQ

How do I see what caused a workflow to fail?

Check the execution details in the editor. Every node shows its input and error message.

Why do webhook URLs show up wrong in logs?

Usually a proxy misconfiguration. Make sure N8N_HOST, N8N_PROTOCOL, and WEBHOOK_URL are set correctly.
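
For a typical setup behind a TLS-terminating proxy, the three usually end up looking like this (the domain is a placeholder):

N8N_HOST=n8n.example.com
N8N_PROTOCOL=https
WEBHOOK_URL=https://n8n.example.com/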

My workers are idle but jobs are queued. What now?

Check Redis connectivity. Also confirm that the workers and the main process point to the same Postgres database.

Do I need Prometheus and Grafana to monitor n8n?

They’re the best option at scale, but for small setups Uptime Kuma or even simple log tailing is better than nothing.

Can I debug workflows without hitting external APIs every time?

Yes. Use pinned data in the editor to replay runs with stored inputs.