If you run n8n in production you want more than a green status page. You want to know when webhooks slow down, why workers stall, which workflows fail more than usual, and whether your database or Redis is the real bottleneck. This guide shows how to expose n8n metrics, scrape them with Prometheus, visualize them in Grafana, and wire useful alerts that catch problems early.
I’ll keep the setup Docker-first since most teams deploy n8n with Compose. We will cover single-node and queue mode, include exporters for Redis and Postgres, and point out the sharp edges that bite busy systems.
Why observability matters for n8n in production
n8n is the glue between APIs, databases and internal tools. When it is healthy, everything else feels faster. When it degrades, you get quiet failures and angry webhooks. A practical monitoring stack gives you:
- Fast incident triage: see if the issue is n8n, Redis, Postgres or the host
- Throughput and latency: track executions per second and webhook timing
- Queue health: watch waiting jobs and failed jobs trend before users notice
- Capacity planning: learn when to add workers or move to a bigger VPS
- Change safety: validate upgrades and workflow changes with real numbers
What n8n exposes out of the box
n8n can expose a Prometheus /metrics endpoint. It uses prom-client under the hood and ships several toggles to include extra labels and queue metrics.
Enable metrics:
# n8n container
N8N_METRICS=true
# optional: include queue metrics on main
N8N_METRICS_INCLUDE_QUEUE_METRICS=true
# optional: sampling period in ms for queue metrics
N8N_METRICS_QUEUE_METRICS_INTERVAL=5000
# labels and extras (enable only what you need)
N8N_METRICS_INCLUDE_DEFAULT_METRICS=true
N8N_METRICS_INCLUDE_API_ENDPOINTS=false
N8N_METRICS_INCLUDE_WORKFLOW_ID_LABEL=false
N8N_METRICS_INCLUDE_NODE_TYPE_LABEL=false
N8N_METRICS_INCLUDE_CREDENTIAL_TYPE_LABEL=false
A few useful points:
- Both main and worker processes can expose /metrics when N8N_METRICS=true is set. In queue mode, queue metrics are exposed by the main instance.
- Queue metrics include gauges and counters such as n8n_scaling_mode_queue_jobs_waiting, n8n_scaling_mode_queue_jobs_active, n8n_scaling_mode_queue_jobs_failed and n8n_scaling_mode_queue_jobs_completed.
- You can also keep the default Node.js process metrics from prom-client, such as CPU time and heap usage. These are handy early-warning signals.
Tip: verify each container exposes /metrics locally with curl http://127.0.0.1:5678/metrics inside the network before you wire Prometheus.
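For example, a quick check from the host, assuming the Compose service names used later in this guide and that the n8n image ships BusyBox wget (swap in curl if your image has it):

# run from the Compose project directory
docker compose exec n8n-main wget -qO- http://127.0.0.1:5678/metrics | grep '^n8n_' | head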
Single node vs queue mode: what to watch
- Single node: everything runs in one process. Watch process CPU and memory, Postgres latency, webhook request rate and execution duration. If CPU is fine but latency grows, it is often database I/O or an external API.
- Queue mode: editor and API live in the main instance, jobs run on one or more workers, Redis is the queue, Postgres stores state. Watch:
  - jobs_waiting and jobs_active on the main
  - Redis memory and connected clients
  - Worker process memory and restarts
  - Postgres connections and slow queries
If jobs_waiting keeps climbing while CPU is idle, you are under-provisioned on workers or blocked on Postgres or a slow external API.
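A quick way to confirm that is to compare the backlog trend with worker CPU. A PromQL sketch, assuming the Prometheus job names used in the scrape config later in this guide:

# is the backlog growing? positive values mean the queue fills faster than it drains
deriv(n8n_scaling_mode_queue_jobs_waiting[15m])

# worker CPU for comparison: low CPU alongside a growing backlog points at Postgres or an external API
sum by (instance) (rate(process_cpu_seconds_total{job="n8n-workers"}[5m]))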
A Docker Compose stack that you can extend
Below is a compact but production-friendly Compose file. It runs n8n in queue mode with Postgres and Redis, plus Prometheus, Grafana, and exporters for Redis, Postgres, container and host metrics.
Use pinned images in real deployments. Replace volumes and passwords with your own secrets. Put Grafana behind your reverse proxy or a VPN.
version: "3.9"

services:
  n8n-main:
    image: docker.n8n.io/n8nio/n8n:latest
    restart: unless-stopped
    depends_on:
      - postgres
      - redis
    environment:
      - N8N_METRICS=true
      - N8N_METRICS_INCLUDE_QUEUE_METRICS=true
      - N8N_METRICS_QUEUE_METRICS_INTERVAL=5000
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - QUEUE_BULL_REDIS_HOST=redis
      - EXECUTIONS_MODE=queue
      - WEBHOOK_URL=${PUBLIC_BASE_URL}
      - N8N_HOST=0.0.0.0
      - N8N_PORT=5678
      - N8N_LOG_LEVEL=info
      # share one encryption key so workers can decrypt credentials created on the main
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}
    ports:
      - "5678:5678"
    volumes:
      - n8n_data:/home/node/.n8n

  n8n-worker-1:
    image: docker.n8n.io/n8nio/n8n:latest
    restart: unless-stopped
    depends_on:
      - redis
      - postgres
    command: n8n worker
    environment:
      - N8N_METRICS=true
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=${POSTGRES_PASSWORD}
      - QUEUE_BULL_REDIS_HOST=redis
      - EXECUTIONS_MODE=queue
      - N8N_ENCRYPTION_KEY=${N8N_ENCRYPTION_KEY}

  postgres:
    image: postgres:16
    restart: unless-stopped
    environment:
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=${POSTGRES_PASSWORD}
      - POSTGRES_DB=n8n
    volumes:
      - pg_data:/var/lib/postgresql/data

  redis:
    image: redis:7
    restart: unless-stopped

  # Metrics stack
  prometheus:
    image: prom/prometheus:v2.54.1
    restart: unless-stopped
    volumes:
      - ./prometheus/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prom_data:/prometheus
    ports:
      - "9090:9090"
    extra_hosts:
      # lets Prometheus reach node-exporter, which runs on the host network
      - "host.docker.internal:host-gateway"

  grafana:
    image: grafana/grafana:11.1.4
    restart: unless-stopped
    ports:
      - "3000:3000"
    volumes:
      - grafana_data:/var/lib/grafana

  redis-exporter:
    image: oliver006/redis_exporter:v1.62.0
    restart: unless-stopped
    environment:
      - REDIS_ADDR=redis:6379

  postgres-exporter:
    image: prometheuscommunity/postgres-exporter:v0.15.0
    restart: unless-stopped
    environment:
      - DATA_SOURCE_NAME=postgresql://n8n:${POSTGRES_PASSWORD}@postgres:5432/n8n?sslmode=disable

  cadvisor:
    image: gcr.io/cadvisor/cadvisor:v0.49.2
    restart: unless-stopped
    privileged: true
    ports:
      - "8080:8080"
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker/:/var/lib/docker:ro

  node-exporter:
    image: prom/node-exporter:v1.8.2
    restart: unless-stopped
    pid: host
    network_mode: host

volumes:
  n8n_data:
  pg_data:
  prom_data:
  grafana_data:
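A matching .env sketch. The values below are placeholders, not real secrets; generate your own long random strings:

# .env (example values only)
POSTGRES_PASSWORD=change-me
PUBLIC_BASE_URL=https://n8n.example.com/
N8N_ENCRYPTION_KEY=generate-a-long-random-string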
Reverse proxy notes
- Set WEBHOOK_URL to the public HTTPS URL that clients reach.
- If you are behind multiple proxies, set N8N_PROXY_HOPS so n8n trusts the correct X-Forwarded-* chain.
- Terminate TLS at your proxy and keep Prometheus private. Grafana can sit behind the proxy with basic auth or SSO.
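For reference, a minimal nginx sketch for the n8n front door. The hostname is an assumption, and the upstream name only resolves if nginx sits on the same Compose network; adapt it to your proxy and TLS setup:

server {
    listen 443 ssl;
    server_name n8n.example.com;   # assumed hostname

    # ssl_certificate / ssl_certificate_key omitted; reuse your existing TLS config

    location / {
        proxy_pass http://n8n-main:5678;
        proxy_http_version 1.1;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        # the n8n editor uses websockets for push updates
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection "upgrade";
    }
}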
Prometheus scrape configuration
Create ./prometheus/prometheus.yml and point it at your services. This example uses static targets inside the Compose network. Adjust ports if you map them differently.
global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: "n8n-main"
    metrics_path: /metrics
    static_configs:
      - targets: ["n8n-main:5678"]
        labels:
          role: main

  - job_name: "n8n-workers"
    metrics_path: /metrics
    static_configs:
      - targets: ["n8n-worker-1:5678"]
        labels:
          role: worker

  - job_name: "redis"
    static_configs:
      - targets: ["redis-exporter:9121"]

  - job_name: "postgres"
    static_configs:
      - targets: ["postgres-exporter:9187"]

  - job_name: "cadvisor"
    static_configs:
      - targets: ["cadvisor:8080"]

  - job_name: "node"
    static_configs:
      # node-exporter runs on the host network, so scrape it via the host gateway
      # (requires the extra_hosts entry on the prometheus service above)
      - targets: ["host.docker.internal:9100"]
Start the stack:
docker compose up -d
Open Prometheus on http://<host>:9090 and check Status → Targets. You should see green up states. If a target is down, test from the Prometheus container (the image is BusyBox-based, so use wget rather than curl):
docker exec -it <prometheus_container> wget -qO- http://n8n-main:5678/metrics | head
Dashboards that answer real questions
Build a folder in Grafana named n8n Production and add panels that map to the questions you ask during incidents.
Is the queue healthy
- Queue depth: n8n_scaling_mode_queue_jobs_waiting (gauge)
- Active jobs: n8n_scaling_mode_queue_jobs_active
- Failed jobs rate: rate(n8n_scaling_mode_queue_jobs_failed[5m])
- Completed jobs rate: rate(n8n_scaling_mode_queue_jobs_completed[5m])
If waiting climbs for 10 minutes and active stays flat, you likely need more workers or you are blocked on Postgres or an external API.
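A useful companion panel is the job success ratio. A PromQL sketch built from the queue counters above:

# share of jobs completing successfully over the last 5 minutes
rate(n8n_scaling_mode_queue_jobs_completed[5m])
  /
(rate(n8n_scaling_mode_queue_jobs_completed[5m]) + rate(n8n_scaling_mode_queue_jobs_failed[5m]))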
Are workflows slower than usual
If you enabled API endpoint metrics, graph the request duration histogram or latency percentiles. Otherwise pull execution duration from Postgres directly into a Grafana panel with a read-only database user. It is common to track p50 and p95 execution time for a small set of critical workflows.
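For the Postgres route, a hedged SQL sketch for a Grafana panel. The execution table and column names vary between n8n versions, so check your schema first; this assumes an execution_entity table with startedAt and stoppedAt columns:

-- p50 / p95 execution time per workflow over the last 24 hours
-- table and column names are assumptions; adjust to your n8n version's schema
SELECT
  "workflowId",
  percentile_cont(0.50) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))) AS p50_seconds,
  percentile_cont(0.95) WITHIN GROUP (ORDER BY EXTRACT(EPOCH FROM ("stoppedAt" - "startedAt"))) AS p95_seconds
FROM execution_entity
WHERE "startedAt" > now() - interval '24 hours'
  AND "stoppedAt" IS NOT NULL
GROUP BY "workflowId"
ORDER BY p95_seconds DESC
LIMIT 10;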
Are we resource bound
- Main and worker memory: process_resident_memory_bytes filtered by container
- CPU saturation: rate(process_cpu_seconds_total[5m]) paired with host CPU
- Postgres: pg_stat_activity_count, pg_up, slow query count if you expose it
- Redis: memory used, keys, clients from the exporter
Webhook front door
Add panels from your proxy or load balancer. Request rate, 4xx and 5xx, and upstream time tell you if the slowdown is at the edge or inside n8n.
Alert rules that catch real problems
Store alert rules in a file like prometheus/alerts.yml and include it from your Prometheus config. Calibrate thresholds for your traffic.
groups:
  - name: n8n.rules
    rules:
      # n8n main or worker down
      - alert: N8nInstanceDown
        expr: up{job=~"n8n-main|n8n-workers"} == 0
        for: 2m
        labels:
          severity: critical
        annotations:
          summary: "n8n {{ $labels.role }} is down"
          description: "Target {{ $labels.instance }} not scraping for 2m"

      # queue is backing up
      - alert: N8nQueueBacklog
        expr: n8n_scaling_mode_queue_jobs_waiting > 50
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Queue backlog growing"
          description: "jobs_waiting > 50 for 10 minutes"

      # error bursts
      - alert: N8nFailedJobsSpike
        expr: rate(n8n_scaling_mode_queue_jobs_failed[5m]) > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Failed jobs rate spike"
          description: "More than 1 failed job per second over 5m"

      # main memory ballooning
      - alert: N8nMainHighMemory
        expr: process_resident_memory_bytes{job="n8n-main"} > 1.5e9
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Main memory high"
          description: "Resident memory > 1.5 GB for 10m"
Wire Alertmanager to Slack, email or PagerDuty and you have a basic but useful early-warning system.
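To load the rules and route alerts, add something like this to prometheus.yml. The alertmanager target is an assumption; add that service to your Compose file if you use it, and remember to mount prometheus/alerts.yml into the Prometheus container next to prometheus.yml:

rule_files:
  - /etc/prometheus/alerts.yml

alerting:
  alertmanagers:
    - static_configs:
        - targets: ["alertmanager:9093"]   # assumed service name and default port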
Logs and traces support your metrics
Metrics show what is wrong. Logs show why. n8n logs to stdout by default. You can bump verbosity or log to files:
N8N_LOG_LEVEL=debug # error | warn | info | debug
N8N_LOG_OUTPUT=console # console, file or both (comma separated)
Use a centralized system if you can. Loki, OpenSearch or Elasticsearch work well with Docker. If you keep logs local, mount a volume and rotate aggressively. For HTTP traces consider enabling logs at the proxy layer and sampling only error cases.
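If you stay on local json-file logs, a per-service rotation stanza in Compose keeps disks safe. A sketch; tune sizes to your volume and add it to the chatty services such as n8n-main and the workers:

    logging:
      driver: json-file
      options:
        max-size: "10m"
        max-file: "5"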
Common pitfalls and fixes
- Metrics endpoint closed: you set N8N_METRICS=true only on main. Set it on workers too so you get process metrics there.
- Queue metrics missing: enable N8N_METRICS_INCLUDE_QUEUE_METRICS=true. Remember, they come from the main process.
- Wrong public URLs: set WEBHOOK_URL to the exact external HTTPS URL users hit. Behind chains of proxies, set N8N_PROXY_HOPS so n8n trusts forwarded headers properly.
- Prometheus cannot reach /metrics: network mode mismatch or the proxy blocks internal paths. Scrape the containers directly on the Docker network.
- CPU is fine but everything is slow: check Postgres I/O and slow queries first, then Redis memory. Many bottlenecks are storage or external API latency.
Hardening quick wins
- Keep Prometheus private. Do not expose it to the internet.
- Put Grafana behind your proxy with authentication.
- Pin image tags and keep a rollback tag around for n8n main and workers.
- Back up Postgres daily. Store dumps off-box.
- Limit who can see /metrics if you expose it via the proxy. Better to scrape internally.
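If /metrics does pass through your proxy, a small nginx sketch to keep it internal (the allowed range is an assumption; use deny all if you scrape only inside the Docker network):

location /metrics {
    allow 10.0.0.0/8;    # assumed internal range
    deny all;
    proxy_pass http://n8n-main:5678;
}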
Where to go from here
- Add workflow-level metrics with labels only for a few critical flows to avoid high-cardinality explosions.
- Stream n8n logs to a central store and attach sample logs to alerts.
- Track business KPIs in Grafana using Postgres queries next to your system panels.
- If you run multiple mains, use the leader label to avoid double counting queue metrics.
FAQ
How do I turn on metrics in n8n
Set N8N_METRICS=true on the containers. Visit /metrics on the main process to confirm you see output.
Do workers expose /metrics too
Yes. Set N8N_METRICS=true on workers for process metrics. Queue metrics are exposed by the main.
Which metric tells me the queue is backing up
n8n_scaling_mode_queue_jobs_waiting is the one to watch. If it climbs for several minutes it is time to add workers or find the bottleneck.
Can I see which workflows fail most
Enable API endpoint and workflow labels if you need them, but be careful with cardinality. Many teams pull failure counts from Postgres with a scheduled query and graph the top offenders.
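A hedged SQL sketch for that scheduled query, with the same caveat as the earlier example: table and column names depend on your n8n version, so verify against your schema:

-- top failing workflows over the last 24 hours
SELECT "workflowId", count(*) AS failures
FROM execution_entity
WHERE "startedAt" > now() - interval '24 hours'
  AND status = 'error'
GROUP BY "workflowId"
ORDER BY failures DESC
LIMIT 10;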
Should I monitor Postgres and Redis separately
Yes. Use exporters for both. Problems that look like n8n are often database or queue pressure.
How do I alert on failed jobs without noise
Use a rate over a window, not raw counters. For example rate(n8n_scaling_mode_queue_jobs_failed[5m]) crossing a threshold for 5 minutes is a better signal than a single spike.
Where do webhook metrics live
You get request rate and status at the proxy. Pair those with execution metrics in Grafana to see the full path from request to job completion.