If you run n8n in production you want more than a green status page. You want to know when webhooks slow down, why workers stall, which workflows fail more than usual, and whether your database or Redis is the real bottleneck. This guide shows how to expose n8n metrics, scrape them with Prometheus, visualize them in Grafana, and wire useful alerts that catch problems early.
I’ll keep the setup Docker-first since most teams deploy n8n with Compose. We will cover single-node and queue mode, include exporters for Redis and Postgres, and point out the sharp edges that bite busy systems.
Why observability matters for n8n in production
n8n is the glue between APIs, databases and internal tools. When it is healthy, everything else feels faster. When it degrades, you get quiet failures and angry webhooks. A practical monitoring stack gives you:
- Fast incident triage: see if the issue is n8n, Redis, Postgres or the host
- Throughput and latency: track executions per second and webhook timing
- Queue health: watch waiting jobs and failed jobs trend before users notice
- Capacity planning: learn when to add workers or move to a bigger VPS
- Change safety: validate upgrades and workflow changes with real numbers
What n8n exposes out of the box
n8n can expose a Prometheus /metrics endpoint. It uses prom-client under the hood and ships several toggles to include extra labels and queue metrics.
Enable metrics:
# n8n container
N8N_METRICS=true
# optional: include queue metrics on main
N8N_METRICS_INCLUDE_QUEUE_METRICS=true
# optional: sampling period in ms for queue metrics
N8N_METRICS_QUEUE_METRICS_INTERVAL=5000
# labels and extras (enable only what you need)
N8N_METRICS_INCLUDE_DEFAULT_METRICS=true
N8N_METRICS_INCLUDE_API_ENDPOINTS=false
N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=false
N8N_METRICS_INCLUDE_NODE_TYPE_LABEL=false
N8N_METRICS_INCLUDE_CREDENTIAL_TYPE_LABEL=false
A few useful points:
- Both main and worker processes can expose /metrics when N8N_METRICS=true is set. In queue mode, queue metrics are exposed by the main instance.
- Queue metrics include gauges and counters such as n8n_scaling_mode_queue_jobs_waiting, n8n_scaling_mode_queue_jobs_active, n8n_scaling_mode_queue_jobs_failed and n8n_scaling_mode_queue_jobs_completed.
- You can also keep the default Node.js process metrics from prom-client, such as CPU time and heap usage. These are handy early-warning signals.
Tip: verify each container exposes /metrics locally with curl http://127.0.0.1:5678/metrics inside the network before you wire Prometheus.
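For example, a quick check from the host, assuming the Compose service names used later in this guide and that the n8n image ships BusyBox wget (swap in curl if your image has it):

# run from the Compose project directory
docker compose exec n8n-main wget -qO- http://127.0.0.1:5678/metrics | grep '^n8n_' | head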
Single node vs queue mode: what to watch
- Single node: everything runs in one process. Watch process CPU and memory, Postgres latency, webhook request rate and execution duration. If CPU is fine but latency grows, it is often database I/O or an external API.
- Queue mode: editor and API live in the main instance, jobs run on one or more workers, Redis is the queue, Postgres stores state. Watch:
  - jobs_waiting and jobs_active on the main
  - Redis memory and connected clients
  - Worker process memory and restarts
  - Postgres connections and slow queries
If jobs_waiting keeps climbing while CPU is idle, you are under-provisioned on workers or blocked on Postgres or a slow external API.
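A quick way to confirm that is to compare the backlog trend with worker CPU. A PromQL sketch, assuming the Prometheus job names used in the scrape config later in this guide:

# is the backlog growing? positive values mean the queue fills faster than it drains
deriv(n8n_scaling_mode_queue_jobs_waiting[15m])

# worker CPU for comparison: low CPU alongside a growing backlog points at Postgres or an external API
sum by (instance) (rate(process_cpu_seconds_total{job="n8n-workers"}[5m]))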
A Docker Compose stack that you can extend
Below is a compact but production-friendly Compose file. It runs n8n in queue mode with Postgres and Redis, plus Prometheus, Grafana, and exporters for Redis, Postgres, container and host metrics.
Use pinned images in real deployments. Replace volumes and passwords with your own secrets. Put Grafana behind your reverse proxy or a VPN.
version: "3.9"

services:
  n8n-main:
    image: docker.n8n.io/n8nio/n8n:latest
    restart: unless-stopped
    depends_on:
      - postgres
      - redis
    environment:
      - N8N_METRICS=true
      - N8N_METRICS_INCLUDE_QUEUE_METRICS=true
      - N8N_METRICS_QUEUE_METRICS_INTERVAL=5000
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - QUEUE_BULL_REDIS_HOST=redis
      - EXECUTIONS_MODE=queue
      - WEBHOOK_URL=${PUBLIC_BASE_URL}
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_LOG_LEVEL=info
      # share one encryption key so workers can decrypt credentials created on the main
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
    ports:
      - "5678:5678"
    volumes:
      - n8n_data:/home/node/.n8n

  n8n-worker-1:
    image: docker.n8n.io/n8nio/n8n:latest
    restart: unless-stopped
    depends_on:
      - redis
      - postgres
    command: n8n worker
    environment:
      - N8N_METRICS=true
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - QUEUE_BULL_REDIS_HOST=redis
      - EXECUTIONS_MODE=queue
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}

  postgres:
    image: postgres:16
    restart: unless-stopped
    environment:
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=n8n
    volumes:
      - pg_data:/var/lib/postgresql/data

  redis:
    image: redis:7
    restart: unless-stopped

  # Metrics stack
  prometheus:
    image: prom/prometheus:v2.54.1
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom_data:/prometheus
    ports:
      - "9090:9090"
    extra_hosts:
      # lets Prometheus reach node-exporter, which runs on the host network
      - "host.docker.internal:host-gateway"

  grafana:
    image: grafana/grafana:11.1.4
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

  redis-exporter:
    image: oliver006/redis_exporter:v1.62.0
    restart: unless-stopped
    environment:
      - REDIS_ADDR=redis:6379

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    restart: unless-stopped
    environment:
      - DATA_SOURCE_NAME=postgresql://n8n:${POSTGRES_PASSWORD}@postgres:5432/n8n?sslmode=disable

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.2
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node-exporter:
    image: prom/node-exporter:v1.8.2
    restart: unless-stopped
    pid: host
    network_mode: host

volumes:
  n8n_data:
  pg_data:
  prom_data:
  grafana_data:
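A matching .env sketch. The values below are placeholders, not real secrets; generate your own long random strings:

# .env (example values only)
POSTGRES_PASSWORD=change-me
PUBLIC_BASE_URL=https://n8n.example.com/
N8N_ENCRYPTION_KEY=generate-a-long-random-string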
Reverse proxy notes
- Set WEBHOOK_URL to the public HTTPS URL that clients reach.
- If you are behind multiple proxies, set N8N_PROXY_HOPS so n8n trusts the correct X-Forwarded-* chain.
- Terminate TLS at your proxy and keep Prometheus private. Grafana can sit behind the proxy with basic auth or SSO.
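For reference, a minimal nginx sketch for the n8n front door. The hostname is an assumption, and the upstream name only resolves if nginx sits on the same Compose network; adapt it to your proxy and TLS setup:

server {
    listen 443 ssl;
    server_name n8n.example.com;   # assumed hostname

    # ssl_certificate / ssl_certificate_key omitted; reuse your existing TLS config

    location / {
        proxy_pass http://n8n-main:5678;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # the n8n editor uses websockets for push updates
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}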
Prometheus scrape configuration
Create ./prometheus/prometheus.yml and point it at your services. This example uses static targets inside the Compose network. Adjust ports if you map them differently.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "n8n-main"
    metrics_path: /metrics
    static_configs:
      - targets: ["n8n-main:5678"]
        labels:
          role: main

  - job_name: "n8n-workers"
    metrics_path: /metrics
    static_configs:
      - targets: ["n8n-worker-1:5678"]
        labels:
          role: worker

  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]

  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "node"
    static_configs:
      # node-exporter runs on the host network, so scrape it via the host gateway
      # (requires the extra_hosts entry on the prometheus service above)
      - targets: ["host.docker.internal:9100"]
Start the stack:
docker compose up -d
Open Prometheus on http://<host>:9090 and check Status → Targets. You should see green up states. If a target is down, test from the Prometheus container (the image is BusyBox-based, so use wget rather than curl):
docker exec -it <prometheus_container> wget -qO- http://n8n-main:5678/metrics | head
Dashboards that answer real questions
Build a folder in Grafana named n8n Production and add panels that map to the questions you ask during incidents.
Is the queue healthy
- Queue depth: n8n_scaling_mode_queue_jobs_waiting (gauge)
- Active jobs: n8n_scaling_mode_queue_jobs_active
- Failed jobs rate: rate(n8n_scaling_mode_queue_jobs_failed[5m])
- Completed jobs rate: rate(n8n_scaling_mode_queue_jobs_completed[5m])
If waiting climbs for 10 minutes and active stays flat, you likely need more workers or you are blocked on Postgres or an external API.
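A useful companion panel is the job success ratio. A PromQL sketch built from the queue counters above:

# share of jobs completing successfully over the last 5 minutes
rate(n8n_scaling_mode_queue_jobs_completed[5m])
  /
(rate(n8n_scaling_mode_queue_jobs_completed[5m]) + rate(n8n_scaling_mode_queue_jobs_failed[5m]))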
Are workflows slower than usual
If you enabled API endpoint metrics, graph the request duration histogram or latency percentiles. Otherwise pull execution duration from Postgres directly into a Grafana panel with a read-only database user. It is common to track p50 and p95 execution time for a small set of critical workflows.
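For the Postgres route, a hedged SQL sketch for a Grafana panel. The execution table and column names vary between n8n versions, so check your schema first; this assumes an execution_entity table with startedAt and stoppedAt columns:

-- p50 / p95 execution time per workflow over the last 24 hours
-- table and column names are assumptions; adjust to your n8n version's schema
SELECT
  "workflowId",
  percentile_cont(0.50) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))) AS p50_seconds,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))) AS p95_seconds
FROM execution_entity
WHERE "startedAt" > now() - interval '24 hours'
  AND "stoppedAt" IS NOT NULL
GROUP BY "workflowId"
ORDER BY p95_seconds DESC
LIMIT 10;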
Are we resource bound
- Main and worker memory: process_resident_memory_bytes filtered by container
- CPU saturation: rate(process_cpu_seconds_total[5m]) paired with host CPU
- Postgres: pg_stat_activity_count, pg_up, slow query count if you expose it
- Redis: memory used, keys, clients from the exporter
Webhook front door
Add panels from your proxy or load balancer. Request rate, 4xx and 5xx, and upstream time tell you if the slowdown is at the edge or inside n8n.
Alert rules that catch real problems
Store alert rules in a file like prometheus/alerts.yml and include it from your Prometheus config. Calibrate thresholds for your traffic.
groups:
  - name: n8n.rules
    rules:
      # n8n main or worker down
      - alert: N8nInstanceDown
        expr: up{job=~"n8n-main|n8n-workers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "n8n {{ $labels.role }} is down"
          description: "Target {{ $labels.instance }} not scraping for 2m"

      # queue is backing up
      - alert: N8nQueueBacklog
        expr: n8n_scaling_mode_queue_jobs_waiting > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue backlog growing"
          description: "jobs_waiting > 50 for 10 minutes"

      # error bursts
      - alert: N8nFailedJobsSpike
        expr: rate(n8n_scaling_mode_queue_jobs_failed[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Failed jobs rate spike"
          description: "More than 1 failed job per second over 5m"

      # main memory ballooning
      - alert: N8nMainHighMemory
        expr: process_resident_memory_bytes{job="n8n-main"} > 1.5e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Main memory high"
          description: "Resident memory > 1.5 GB for 10m"
Wire Alertmanager to Slack, email or PagerDuty and you have a basic but useful early-warning system.
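To load the rules and route alerts, add something like this to prometheus.yml. The alertmanager target is an assumption; add that service to your Compose file if you use it, and remember to mount prometheus/alerts.yml into the Prometheus container next to prometheus.yml:

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed service name and default port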
Logs and traces support your metrics
Metrics show what is wrong. Logs show why. n8n logs to stdout by default. You can bump verbosity or log to files:
N8N_LOG_LEVEL=debug # error | warn | info | debug
N8N_LOG_OUTPUT=console # console, file or both (comma separated)
Use a centralized system if you can. Loki, OpenSearch or Elasticsearch work well with Docker. If you keep logs local, mount a volume and rotate aggressively. For HTTP traces consider enabling logs at the proxy layer and sampling only error cases.
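If you stay on local json-file logs, a per-service rotation stanza in Compose keeps disks safe. A sketch; tune sizes to your volume and add it to the chatty services such as n8n-main and the workers:

    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"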
Common pitfalls and fixes
- Metrics endpoint closed: you set N8N_METRICS=true only on main. Set it on workers too so you get process metrics there.
- Queue metrics missing: enable N8N_METRICS_INCLUDE_QUEUE_METRICS=true. Remember, they come from the main process.
- Wrong public URLs: set WEBHOOK_URL to the exact external HTTPS URL users hit. Behind chains of proxies, set N8N_PROXY_HOPS so n8n trusts forwarded headers properly.
- Prometheus cannot reach /metrics: network mode mismatch or the proxy blocks internal paths. Scrape the containers directly on the Docker network.
- CPU is fine but everything is slow: check Postgres I/O and slow queries first, then Redis memory. Many bottlenecks are storage or external API latency.
Hardening quick wins
- Keep Prometheus private. Do not expose it to the internet.
- Put Grafana behind your proxy with authentication.
- Pin image tags and keep a rollback tag around for n8n main and workers.
- Back up Postgres daily. Store dumps off-box.
- Limit who can see /metrics if you expose it via the proxy. Better to scrape internally.
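If /metrics does pass through your proxy, a small nginx sketch to keep it internal (the allowed range is an assumption; use deny all if you scrape only inside the Docker network):

location /metrics {
    allow 10.0.0.0/8;    # assumed internal range
    deny all;
    proxy_pass http://n8n-main:5678;
}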
Where to go from here
- Add workflow-level metrics with labels only for a few critical flows to avoid high-cardinality explosions.
- Stream n8n logs to a central store and attach sample logs to alerts.
- Track business KPIs in Grafana using Postgres queries next to your system panels.
- If you run multiple mains, use the leader label to avoid double counting queue metrics.
FAQ
How do I turn on metrics in n8n
Set N8N_METRICS=true on the containers. Visit /metrics on the main process to confirm you see output.
Do workers expose /metrics too
Yes. Set N8N_METRICS=true on workers for process metrics. Queue metrics are exposed by the main.
Which metric tells me the queue is backing up
n8n_scaling_mode_queue_jobs_waiting is the one to watch. If it climbs for several minutes it is time to add workers or find the bottleneck.
Can I see which workflows fail most
Enable API endpoint and workflow labels if you need them, but be careful with cardinality. Many teams pull failure counts from Postgres with a scheduled query and graph the top offenders.
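A hedged SQL sketch for that scheduled query, with the same caveat as the earlier example: table and column names depend on your n8n version, so verify against your schema:

-- top failing workflows over the last 24 hours
SELECT "workflowId", count(*) AS failures
FROM execution_entity
WHERE "startedAt" > now() - interval '24 hours'
  AND status = 'error'
GROUP BY "workflowId"
ORDER BY failures DESC
LIMIT 10;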
Should I monitor Postgres and Redis separately
Yes. Use exporters for both. Problems that look like n8n are often database or queue pressure.
How do I alert on failed jobs without noise
Use a rate over a window, not raw counters. For example rate(n8n_scaling_mode_queue_jobs_failed[5m]) crossing a threshold for 5 minutes is a better signal than a single spike.
Where do webhook metrics live
You get request rate and status at the proxy. Pair those with execution metrics in Grafana to see the full path from request to job completion.