PM2 monitoring is not monitoring
You have your Node.js app running on a LumaDock VPS with PM2 keeping it alive. pm2 monit shows you CPU and memory right now. But "right now" is not enough. You can't see what happened at 3 AM when traffic spiked. You can't scroll back and spot the memory leak that develops over hours. You can't set an alert that wakes you up when things go wrong. You're flying blind.
PM2 is excellent for process management. It restarts crashes, handles logs and scales across cores. But it's not a monitoring system. Monitoring requires history, trends and alerts.
Prometheus and Grafana solve this. Prometheus scrapes metrics from your app every few seconds and stores them for 15 days (configurable). Grafana reads those metrics and builds dashboards you can check anytime. You set alert rules: if memory exceeds 80%, if error rate climbs, if latency hits 500 ms. Notifications fire to Slack or email. You see trends before they become disasters.
Both tools are light enough to run on your VPS alongside your app. A 2GB instance handles them easily. And they're open-source, free forever.
Install Prometheus on Ubuntu 24.04
Package installation
The simplest path is the Ubuntu package repository:
sudo apt update
sudo apt install -y prometheus
This installs Prometheus as a system package. It creates a prometheus user, a systemd service and default configuration.
Enable and start it:
sudo systemctl enable prometheus
sudo systemctl start prometheus
sudo systemctl status prometheus
Verify it's running on port 9090:
curl http://localhost:9090/-/healthy
You should see a short confirmation such as "Prometheus Server is Healthy." The exact wording varies between Prometheus versions.
Understanding prometheus.yml configuration
Prometheus reads its configuration from /etc/prometheus/prometheus.yml. This file tells Prometheus what to monitor and how often to scrape metrics. Data retention is not set here; it is controlled by command-line flags. Edit the file:
sudo nano /etc/prometheus/prometheus.yml
The file has a top-level global section that sets defaults, then a scrape_configs section that lists jobs. Each job groups one or more targets that are scraped the same way:
global:
  scrape_interval: 15s # Scrape every 15 seconds
  scrape_timeout: 10s
  evaluation_interval: 15s # Evaluate alert rules every 15 seconds
scrape_configs:
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['localhost:3001']
The first job (prometheus) monitors Prometheus itself. The second job (nodejs-app) monitors your Node.js app on port 3001. You'll expose metrics on that port using prom-client.
Adding scrape targets for your Node.js app
When you install prom-client in your Node.js app, you'll expose a /metrics endpoint on a port of your choice (3001 in this guide). Prometheus will make an HTTP GET request to that endpoint once per scrape_interval, parse the metrics and store them in its time-series database.
The static_configs targets list is simple: each target is host:port. If your app runs on 127.0.0.1:3001, add it like above. If you have multiple instances (PM2 cluster mode on ports 3001, 3002, 3003, 3004), add all of them:
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['localhost:3001', 'localhost:3002', 'localhost:3003', 'localhost:3004']
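If the number of instances changes often, the target list can be generated instead of typed by hand. A minimal sketch, assuming the instances use consecutive ports starting at a base port; buildTargets is a hypothetical helper, not part of Prometheus or PM2:

```javascript
// Hypothetical helper: build the host:port strings for a static_configs targets list.
function buildTargets(basePort, instanceCount, host = 'localhost') {
  return Array.from({ length: instanceCount }, (_, i) => `${host}:${basePort + i}`);
}

const targets = buildTargets(3001, 4);
// A JSON array of strings is also valid YAML flow syntax for the targets field.
console.log(JSON.stringify(targets));
// ["localhost:3001","localhost:3002","localhost:3003","localhost:3004"]
```

The printed array can be pasted after "targets:" in prometheus.yml.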
Data retention and storage configuration
By default, Prometheus keeps 15 days of metrics. As a rough guide, this uses about 1-5 GB of disk per application depending on scrape frequency and the number of unique time series. Retention is not a prometheus.yml setting; it is set with command-line flags. With the Ubuntu package, the flags normally go in the ARGS variable in /etc/default/prometheus:
# /etc/default/prometheus
# Keep 30 days instead of 15, and cap disk usage at 10 GB
ARGS="--storage.tsdb.retention.time=30d --storage.tsdb.retention.size=10GB"
If you set both flags, whichever limit is reached first applies. Longer retention means more disk usage. For a small VPS, 15 days is reasonable. For critical systems, consider 30 days or higher.
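To get a feel for how retention affects disk usage, the Prometheus documentation gives a rule of thumb: retention time in seconds, times ingested samples per second, times bytes per sample (roughly 1-2 bytes after compression). The sketch below applies that formula. The workload numbers are assumptions for illustration, and the result ignores write-ahead log and index overhead:

```javascript
// Rough TSDB disk estimate: retention seconds * samples per second * bytes per sample.
function estimateDiskBytes({ retentionDays, activeSeries, scrapeIntervalSeconds, bytesPerSample = 2 }) {
  const retentionSeconds = retentionDays * 24 * 60 * 60;
  const samplesPerSecond = activeSeries / scrapeIntervalSeconds;
  return retentionSeconds * samplesPerSecond * bytesPerSample;
}

// Assumed workload: 3,000 active series scraped every 15 seconds.
const bytes15d = estimateDiskBytes({ retentionDays: 15, activeSeries: 3000, scrapeIntervalSeconds: 15 });
const bytes30d = estimateDiskBytes({ retentionDays: 30, activeSeries: 3000, scrapeIntervalSeconds: 15 });
console.log(`${(bytes15d / 1e6).toFixed(0)} MB for 15 days, ${(bytes30d / 1e6).toFixed(0)} MB for 30 days`);
```

Doubling retention doubles the estimate, and so does doubling the number of series or halving the scrape interval.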
After editing prometheus.yml, restart Prometheus:
sudo systemctl restart prometheus
Prometheus loads the new configuration and starts scraping.
Install node-exporter for system metrics
Prometheus scrapes metrics from targets. Your Node.js app will expose application metrics (requests, memory, GC pauses). But you also need system-level metrics: CPU usage, disk space, network throughput. That's what node-exporter does.
Install it:
sudo apt install -y prometheus-node-exporter
Enable and start:
sudo systemctl enable prometheus-node-exporter
sudo systemctl start prometheus-node-exporter
sudo systemctl status prometheus-node-exporter
node-exporter listens on port 9100. Add it to your prometheus.yml scrape_configs:
  - job_name: 'system'
    static_configs:
      - targets: ['localhost:9100']
Restart Prometheus:
sudo systemctl restart prometheus
Now you have system metrics and application metrics flowing into Prometheus.
Expose metrics from your Node.js app with prom-client
Installing prom-client and collecting default metrics
prom-client is the de facto standard Prometheus client for Node.js. It is a community-maintained library rather than an official Prometheus project. Install it in your project:
npm install prom-client
In your app, require it and collect default metrics. These include CPU, memory, GC pause duration, event loop lag and Node.js version info:
const express = require('express');
const client = require('prom-client');
const app = express();
// Collect default metrics automatically
client.collectDefaultMetrics();
// Expose /metrics endpoint
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', client.register.contentType);
  res.end(await client.register.metrics());
});
app.listen(3001, () => {
  console.log('App with metrics on :3001');
});
Test it:
curl http://localhost:3001/metrics | head -50
You'll see output like:
process_resident_memory_bytes 45670400
process_virtual_memory_bytes 2147483648
nodejs_gc_duration_seconds_bucket{kind="major",le="0.001"} 0
Each line is a metric name, optional labels in braces (kind, le, etc.) and a value. Once you add the custom metrics described below, lines such as http_request_duration_seconds_bucket{method="GET",route="/",status_code="200",le="0.1"} 10 appear here too. Prometheus scrapes these every 15 seconds and stores them.
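To make the line format concrete, here is a small sketch that splits one sample line into its name, labels and value. parseSampleLine is a hypothetical helper written for illustration, not part of prom-client, and it is deliberately simplified: it assumes no escaped quotes in label values, no trailing timestamp, and plain numeric values.

```javascript
// Parse one sample line of the Prometheus text format: name{labels} value
function parseSampleLine(line) {
  const match = /^([a-zA-Z_:][a-zA-Z0-9_:]*)(?:\{(.*)\})?\s+(\S+)$/.exec(line.trim());
  if (!match) return null; // comments (# HELP, # TYPE), blank lines or unsupported syntax
  const [, name, rawLabels = '', rawValue] = match;
  const labels = {};
  // Simplification: label values are assumed to contain no escaped quotes.
  for (const m of rawLabels.matchAll(/([a-zA-Z_][a-zA-Z0-9_]*)="([^"]*)"/g)) {
    labels[m[1]] = m[2];
  }
  return { name, labels, value: Number(rawValue) };
}

const plain = parseSampleLine('process_resident_memory_bytes 45670400');
const labelled = parseSampleLine('nodejs_gc_duration_seconds_bucket{kind="major",le="0.001"} 0');
console.log(plain, labelled);
```

Note that label values such as le are always strings in the exposition format, even when they look numeric.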
Creating custom histograms for HTTP latency
Default metrics are good but not specific to your app. Custom metrics give you business insight. A histogram tracks the distribution of values: how many requests took 10-100 ms, 100-500 ms, 500+ ms. This gives you P50, P95 and P99 latencies.
Create a histogram middleware:
const httpDuration = new client.Histogram({
  name: 'http_request_duration_seconds',
  help: 'Duration of HTTP requests in seconds',
  labelNames: ['method', 'route', 'status_code'],
  buckets: [0.05, 0.1, 0.5, 1, 2, 5]
});
app.use((req, res, next) => {
  const start = Date.now();
  res.on('finish', () => {
    const duration = (Date.now() - start) / 1000;
    httpDuration
      .labels(req.method, req.route?.path || req.path, res.statusCode)
      .observe(duration);
  });
  next();
});
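This tracks every HTTP request and buckets the duration. Buckets are cumulative: each one counts the requests that finished at or below its upper bound (50 ms, 100 ms, 500 ms and so on). Prometheus uses the bucket counts to estimate percentiles. One caveat: falling back to req.path for unmatched routes creates a new label value for every distinct URL, so consider a fixed placeholder if your app sees many 404s.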
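The bucket counts that show up on /metrics are cumulative, which is easy to misread. The following sketch reproduces the counting logic in plain JavaScript using the same upper bounds as the histogram above. It is an illustration of the idea, not prom-client's internal code, and the sample durations are made up:

```javascript
const bounds = [0.05, 0.1, 0.5, 1, 2, 5]; // same upper bounds as the histogram above

// Count observations cumulatively: each bucket holds every value <= its bound,
// and the final +Inf bucket holds all observations.
function cumulativeBuckets(durations, upperBounds) {
  const counts = upperBounds.map((le) => ({ le: String(le), count: 0 }));
  counts.push({ le: '+Inf', count: 0 });
  for (const d of durations) {
    upperBounds.forEach((le, i) => {
      if (d <= le) counts[i].count += 1;
    });
    counts[counts.length - 1].count += 1;
  }
  return counts;
}

// Hypothetical request durations in seconds.
const buckets = cumulativeBuckets([0.02, 0.07, 0.3, 0.3, 1.4], bounds);
console.log(buckets.map((b) => `le=${b.le}: ${b.count}`).join(', '));
// le=0.05: 1, le=0.1: 2, le=0.5: 4, le=1: 4, le=2: 5, le=5: 5, le=+Inf: 5
```

A request that took 20 ms is counted in every bucket, so the number of requests between 100 ms and 500 ms is the difference between two adjacent buckets (4 - 2 = 2 here).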
Adding custom counters for request counts
Counters go up and never down. Use them to count total requests, errors or completed jobs:
const requestCounter = new client.Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'status_code']
});
app.use((req, res, next) => {
  res.on('finish', () => {
    requestCounter.labels(req.method, res.statusCode).inc();
  });
  next();
});
Now you can query total requests, or filter by status code (200, 404, 500) to see error rates.
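A counter only becomes useful once it is turned into a rate. The sketch below shows the basic idea behind a per-second rate over a series of counter samples, including what happens when the process restarts and the counter drops back to zero. It is a simplified illustration: the real PromQL rate() function also extrapolates to the edges of the time window, which is omitted here. The sample data is hypothetical.

```javascript
// Per-second increase across counter samples, treating any decrease as a reset to zero.
function counterRate(samples) {
  if (samples.length < 2) return 0;
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    // After a reset the counter restarted from zero, so its current value is the increase.
    increase += delta >= 0 ? delta : samples[i].value;
  }
  const seconds = samples[samples.length - 1].t - samples[0].t;
  return seconds > 0 ? increase / seconds : 0;
}

// Scrapes every 15 seconds; the app restarted between t=15 and t=30.
const scrapes = [
  { t: 0, value: 100 },
  { t: 15, value: 130 },
  { t: 30, value: 10 },
  { t: 45, value: 40 },
];
const requestsPerSecond = counterRate(scrapes);
console.log(requestsPerSecond.toFixed(2)); // 1.56
```

This is why a PM2 restart does not produce a negative spike on a request-rate graph.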
Using gauges for active connection tracking
Gauges go up or down. Use them for current state: active connections, queue depth, current temperature:
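const activeConnections = new client.Gauge({
  name: 'http_active_connections',
  help: 'Current active HTTP connections'
});
app.use((req, res, next) => {
  activeConnections.inc();
  // 'close' fires for completed and aborted responses; 'finish' alone would miss aborts
  res.once('close', () => {
    activeConnections.dec();
  });
  next();
});
This gauge goes up when a request arrives and down when the response completes or the client disconnects, so strictly speaking it counts in-flight requests. Decrementing only on 'finish' would leave the gauge permanently inflated whenever a client aborts. Grafana can graph this to spot requests that never complete.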
The /metrics endpoint and response format
All metrics (default and custom) are exposed at /metrics in Prometheus text format. The format is simple: metric_name{labels} value. Prometheus scrapes this every 15 seconds.
The Content-Type header is important. Set it to client.register.contentType so Prometheus knows what format to expect:
res.set('Content-Type', client.register.contentType);
res.end(await client.register.metrics());
Never expose /metrics to the public internet. It reveals your app's internals. Keep the metrics port reachable only from localhost (bind the listener to 127.0.0.1 or block the port in your firewall) and let Prometheus scrape from the same machine.
Handling PM2 cluster mode and metrics aggregation
Per-instance metrics on separate ports
When PM2 runs multiple instances (via instances: 'max'), all instances share the same port and incoming connections are distributed among the workers. Each scrape reaches whichever worker happens to accept the connection, so consecutive scrapes mix values from different processes: counters appear to jump up and down, and no single scrape shows the whole picture.
Solution: each instance listens on a different port. Configure this in your PM2 ecosystem file:
module.exports = {
  apps: [
    {
      name: 'api',
      script: './server.js',
      instances: 'max',
      exec_mode: 'cluster',
      env: {
        METRICS_PORT: 3001
      }
    }
  ]
};
In your app, calculate a unique port for each instance:
const basePort = parseInt(process.env.METRICS_PORT || '3001', 10);
// PM2 sets NODE_APP_INSTANCE to 0, 1, 2, ... for each instance of an app
const instanceId = parseInt(process.env.NODE_APP_INSTANCE || '0', 10);
const metricsPort = basePort + instanceId;
// Metrics on metricsPort
// App on the regular port
If you have 4 CPU cores, instances 0-3 run on ports 3001, 3002, 3003, 3004. Each has its own metrics.
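The port calculation can be wrapped in a small function that is easy to test and fails loudly on bad input. This is a sketch that assumes PM2's default NODE_APP_INSTANCE variable carries the instance index (PM2 lets you rename it with the instance_var option); metricsPortFor is a hypothetical name:

```javascript
// Derive a per-worker metrics port from the base port and PM2's instance index.
function metricsPortFor(env, defaultBase = 3001) {
  const base = Number.parseInt(env.METRICS_PORT ?? String(defaultBase), 10);
  const instance = Number.parseInt(env.NODE_APP_INSTANCE ?? '0', 10);
  if (Number.isNaN(base) || Number.isNaN(instance)) {
    throw new Error('METRICS_PORT and NODE_APP_INSTANCE must be integers');
  }
  return base + instance;
}

// Simulated environments for the third worker and for a process started without PM2.
console.log(metricsPortFor({ METRICS_PORT: '3001', NODE_APP_INSTANCE: '2' })); // 3003
console.log(metricsPortFor({})); // 3001
```

In the app you would call metricsPortFor(process.env) and pass the result to the listener that serves /metrics.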
Aggregating metrics across instances
Prometheus can scrape all instances separately. In prometheus.yml, list each port:
  - job_name: 'nodejs-app'
    static_configs:
      - targets: ['localhost:3001', 'localhost:3002', 'localhost:3003', 'localhost:3004']
Prometheus stores metrics per instance (with an instance label showing which target each sample came from). Grafana can aggregate: sum requests across instances, average memory, etc.
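Aggregating across instances amounts to dropping the instance label and adding up the series that become identical, which is what a PromQL expression like sum without (instance) does. A plain JavaScript sketch of that idea, using hypothetical sample data:

```javascript
// Sum series that differ only by their instance label.
function sumAcrossInstances(series) {
  const totals = new Map();
  for (const { labels, value } of series) {
    const { instance, ...rest } = labels; // drop the instance label
    const key = JSON.stringify(Object.entries(rest).sort());
    totals.set(key, (totals.get(key) ?? 0) + value);
  }
  return [...totals].map(([key, value]) => ({
    labels: Object.fromEntries(JSON.parse(key)),
    value,
  }));
}

// Hypothetical http_requests_total values scraped from two PM2 workers.
const perInstance = [
  { labels: { instance: 'localhost:3001', method: 'GET' }, value: 120 },
  { labels: { instance: 'localhost:3002', method: 'GET' }, value: 80 },
  { labels: { instance: 'localhost:3001', method: 'POST' }, value: 5 },
];
const combined = sumAcrossInstances(perInstance);
console.log(JSON.stringify(combined));
// [{"labels":{"method":"GET"},"value":200},{"labels":{"method":"POST"},"value":5}]
```

Summing suits counters such as request totals. For gauges like memory, an average or maximum across instances is usually more meaningful.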
Install Grafana on Ubuntu 24.04
Adding the Grafana repository
Grafana doesn't come in the default Ubuntu repos, so add the official Grafana repository:
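sudo apt-get install -y apt-transport-https software-properties-common curl gnupg
sudo mkdir -p /etc/apt/keyrings
curl -fsSL https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt update
This adds Grafana's apt repo with its signing key stored in a dedicated keyring (apt-key is deprecated), so Grafana updates arrive through your normal apt upgrades.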
Installing and starting Grafana
Install Grafana:
sudo apt install -y grafana
Enable and start:
sudo systemctl enable grafana-server
sudo systemctl start grafana-server
sudo systemctl status grafana-server
Grafana listens on port 3000. Open your browser to http://your-vps-ip:3000.
Initial login and changing the default password
Default login is admin / admin. Grafana will prompt you to change the password on first login. Do it immediately. This password protects access to your dashboards and alert configuration.
Connect Prometheus as a data source in Grafana
Grafana needs to know where Prometheus is. Go to Connections > Data sources > Add new data source (older Grafana versions list this under Configuration, the gear icon). Select Prometheus. In the URL field, enter http://localhost:9090. Save and test. Grafana will confirm it can reach Prometheus.
Now Grafana can query Prometheus metrics and build dashboards.
Importing a community Node.js dashboard
Building dashboards from scratch takes time. The Grafana community has shared dashboards for common scenarios. For Node.js monitoring, import dashboard ID 11159 (Node.js Application Dashboard).
Go to Dashboards > Import. Enter the dashboard ID 11159 in the "Import via grafana.com" field. Grafana downloads the dashboard definition. Select your Prometheus data source and import.
You now have a full Node.js dashboard with panels for memory, CPU, request rate, error rate and GC pauses. It's ready to use.
Alternative dashboards worth exploring:
ID 11074: Node Exporter dashboard (system metrics like CPU, disk, network)
ID 15172: Modified version of 11074 with better defaults for production
Building custom dashboard panels
Memory usage over time
Create a new panel. Panel title: "Memory Usage". Query type: Prometheus. In the query field, enter:
process_resident_memory_bytes / 1024 / 1024
This divides bytes by 1024 twice to get megabytes. The graph shows memory growth over time. If it always climbs and never drops, you might have a leak.
Request rate and error rate
Create two panels side by side. First panel, "Request Rate":
rate(http_requests_total[5m])
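This calculates requests per second, averaged over the last 5 minutes, with one series per method and status code combination. Wrap the query in sum() if you want a single line. Second panel, "Error Rate":
rate(http_requests_total{status_code=~"5.."}[5m])
This shows 5xx responses per second. If this spikes, something is broken.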
P95 and P99 latency
Create a panel with two queries:
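histogram_quantile(0.95, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
These show the 95th and 99th percentile latencies across all routes and instances. The sum by (le) step merges the per-label bucket counters before the quantile is estimated; without it you get a separate line for every method, route and status code. If P99 is 2 seconds but P50 is 50 ms, most requests are fast but a few are slow. The estimate can never exceed the highest finite bucket bound (5 seconds with the buckets defined earlier).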
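Percentiles from a histogram are estimates, not exact values. The sketch below follows the general approach histogram_quantile takes for classic histograms: find the bucket that contains the target rank, then interpolate linearly inside it. It is a simplified illustration in plain JavaScript, assuming sorted cumulative buckets with at least one finite bound and 0 < q <= 1. The bucket counts are made up.

```javascript
// Estimate a quantile from cumulative buckets by linear interpolation inside
// the bucket that contains the target rank. The last bucket must be le = Infinity.
function histogramQuantile(q, buckets) {
  const total = buckets[buckets.length - 1].count;
  if (total === 0) return NaN;
  const rank = q * total;
  const i = buckets.findIndex((b) => b.count >= rank);
  // Rank falls in the +Inf bucket: report the highest finite bound.
  if (i === buckets.length - 1) return buckets[i - 1].le;
  const lower = i === 0 ? 0 : buckets[i - 1].le;
  const below = i === 0 ? 0 : buckets[i - 1].count;
  return lower + (buckets[i].le - lower) * ((rank - below) / (buckets[i].count - below));
}

// Hypothetical cumulative counts for 100 requests.
const latencyBuckets = [
  { le: 0.05, count: 10 },
  { le: 0.1, count: 60 },
  { le: 0.5, count: 90 },
  { le: 1, count: 98 },
  { le: 2, count: 100 },
  { le: 5, count: 100 },
  { le: Infinity, count: 100 },
];
const p95 = histogramQuantile(0.95, latencyBuckets);
console.log(p95); // 0.8125
```

The 95th request lands in the 0.5-1 second bucket, so the answer is interpolated within that range. Narrower buckets around the latencies you care about give tighter estimates.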
GC pause duration distribution
Node.js garbage collection pauses the entire process. Monitor these:
histogram_quantile(0.99, rate(nodejs_gc_duration_seconds_bucket[5m]))
If the 99th percentile GC pause exceeds 100 ms, your app is being stopped for too long at a time. This suggests memory pressure or inefficient allocation patterns.
Alerting rules
Memory threshold alerts
Go to Alerting > Alert rules > New alert rule (menu names vary slightly between Grafana versions). Give it a name: "High Memory Usage". Add a query:
process_resident_memory_bytes / 1024 / 1024 / 1024 > 1.5
This alerts if memory exceeds 1.5 GB. Set the pending duration to 5 minutes so brief spikes don't trigger false alarms. Attach a contact point for notifications and label the rule with a severity such as warning or critical.
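The duration setting means the condition has to hold on every evaluation for the whole period before the alert fires; a single dip below the threshold resets the clock. A minimal sketch of that behavior, with hypothetical memory readings in GB taken once a minute (firstFiringTime is an illustrative helper, not a Grafana API):

```javascript
// Return the timestamp (in seconds) at which the alert would start firing, or null.
function firstFiringTime(evaluations, threshold, forSeconds) {
  let pendingSince = null;
  for (const { t, value } of evaluations) {
    if (value > threshold) {
      if (pendingSince === null) pendingSince = t;
      if (t - pendingSince >= forSeconds) return t;
    } else {
      pendingSince = null; // condition cleared, so the pending state resets
    }
  }
  return null;
}

// Hypothetical readings, one evaluation every 60 seconds.
const readings = [1.2, 1.6, 1.4, 1.6, 1.7, 1.8, 1.9, 1.6, 1.7].map((value, i) => ({ t: i * 60, value }));
const firedAt = firstFiringTime(readings, 1.5, 300);
console.log(firedAt); // 480
```

The spike at t=60 is ignored because memory drops again at t=120. The alert only fires once usage has stayed above 1.5 GB from t=180 through t=480.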
Latency alerts
Alert rule: "High P95 Latency". Query:
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 0.5
This fires if the 95th percentile latency exceeds 500 ms, meaning at least 5% of requests take longer than that. Duration: 2 minutes.
Error rate alerts
Alert rule: "High Error Rate". Query:
rate(http_requests_total{status_code=~"5.."}[5m]) > 0.1
This fires if more than 0.1 errors per second (6 errors per minute) happen. Duration: 1 minute.
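An absolute threshold such as 0.1 errors per second means different things at different traffic levels. The sketch below converts the threshold to errors per minute and shows, as one possible alternative, the share of failing requests at two assumed traffic levels. The function names and traffic numbers are hypothetical:

```javascript
// Convert a per-second rate to a per-minute rate.
function perMinute(perSecond) {
  return perSecond * 60;
}

// Share of requests that fail, given two per-second rates.
function errorRatio(errorsPerSecond, requestsPerSecond) {
  return requestsPerSecond > 0 ? errorsPerSecond / requestsPerSecond : 0;
}

const thresholdPerMinute = perMinute(0.1);
const quietSite = errorRatio(0.1, 2); // assumed 2 requests per second
const busySite = errorRatio(0.1, 200); // assumed 200 requests per second
console.log(thresholdPerMinute, quietSite, busySite);
```

At 2 requests per second, the threshold corresponds to 5% of requests failing; at 200 requests per second it is 0.05%. If your traffic varies a lot, a ratio-based condition may be easier to tune than an absolute one.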
Notification channels
Alerts are useless if nobody knows about them. Set up notifications. Go to Alerting > Contact points > Add contact point (older Grafana versions call these notification channels).
Email: Enter your email. Grafana sends alert notifications to your inbox.
Slack: Paste a Slack webhook URL. Grafana posts alerts to your Slack channel. This is faster than email.
PagerDuty: If you're on-call, integrate PagerDuty. Critical alerts page you immediately.
Test each notification channel before relying on it.
Securing Prometheus and Grafana
Firewall rules for metric endpoints
Prometheus (port 9090) and Grafana (port 3000) should never be public. ufw does not filter loopback traffic, so local access keeps working; what you need is to block these ports from outside. Do the same for node-exporter (port 9100):
sudo ufw deny 9090/tcp
sudo ufw deny 3000/tcp
sudo ufw deny 9100/tcp
If you need remote access (checking dashboards from your laptop), use an SSH tunnel instead of exposing the ports publicly:
ssh -L 3000:localhost:3000 user@your-vps-ip
This tunnels port 3000 on your laptop through SSH to port 3000 on the VPS. Access localhost:3000 locally; it's encrypted end-to-end.
Reverse proxy with authentication
If you need remote access frequently, put Prometheus and Grafana behind Nginx with authentication. Create an Nginx config:
upstream prometheus {
    server 127.0.0.1:9090;
}
upstream grafana {
    server 127.0.0.1:3000;
}
server {
    listen 80;
    server_name monitoring.example.com;
    location /prometheus/ {
        proxy_pass http://prometheus/;
        auth_basic "Prometheus";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
    location / {
        proxy_pass http://grafana/;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        auth_basic "Grafana";
        auth_basic_user_file /etc/nginx/.htpasswd;
    }
}
Create the .htpasswd file (htpasswd command from apache2-utils):
sudo apt install apache2-utils
sudo htpasswd -c /etc/nginx/.htpasswd youruser
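Use SSL with Let's Encrypt (install certbot and its Nginx plugin first if they are missing):
sudo apt install -y certbot python3-certbot-nginx
sudo certbot --nginx -d monitoring.example.com
Now Grafana is accessible remotely at https://monitoring.example.com and Prometheus at https://monitoring.example.com/prometheus/, both with username and password protection. For the Prometheus UI to generate correct links under that sub-path, Prometheus usually needs to be started with --web.external-url set to the public URL and, because the proxy strips the /prometheus/ prefix, --web.route-prefix=/.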
Going deeper: custom dashboards and queries
The community dashboard is a starting point. Build custom panels specific to your app:
Track custom business metrics (orders processed, signups, payments). Expose them as counters in prom-client. Build panels to track KPIs.
Create alerts on business metrics: if signups drop below expected levels, something is wrong. If payment processing fails, alert immediately.
Use dashboard variables to switch between environments (staging vs production) or apps. Variables in Grafana let you build one dashboard that works for many targets.
Integrating monitoring into your deployment workflow
After deploying with zero downtime using PM2, watch your dashboards for the first 5 minutes. Look for memory spikes, error rate changes or latency degradation. If metrics look good, you're safe.
If deploying triggers alerts, roll back immediately and investigate. This tight feedback loop catches regressions fast.
Use dashboards to understand normal behavior. Then set alerts 20% above normal. For example, if memory typically peaks at 400 MB, alert at 480 MB. This catches leaks before they cause crashes.
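The 20% rule is simple enough to compute from whatever peaks your dashboards show. A small sketch, using hypothetical daily memory peaks; suggestThreshold is an illustrative helper, not part of any tool mentioned here:

```javascript
// Suggest an alert threshold a fixed margin above the highest observed value.
function suggestThreshold(observedValues, margin = 0.2) {
  if (observedValues.length === 0) throw new Error('need at least one observation');
  const peak = Math.max(...observedValues);
  return Math.round(peak * (1 + margin));
}

const dailyPeaksMb = [380, 395, 400, 372]; // hypothetical daily memory peaks in MB
const memoryAlertMb = suggestThreshold(dailyPeaksMb);
console.log(memoryAlertMb); // 480
```

Recompute the threshold after significant releases, since a new feature can legitimately shift the baseline.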

