Building reliable ETL pipelines with n8n on a VPS

Use n8n on a VPS to build ETL pipelines that extract, transform, and load data reliably. Step-by-step examples included.

Why ETL matters in workflow automation

Almost every automation project ends up moving data. A CRM needs to sync with a billing system, logs need to flow into analytics, or marketing data has to be cleaned and normalized before reporting. That’s basically ETL: extract, transform, load.

Dedicated ETL tools exist, but they’re often heavy, pricey, or locked down. n8n gives you enough flexibility to build your own pipelines, and when hosted on a VPS you’re not fighting quotas or usage limits.

I’ve used n8n to pull from APIs, transform records into business-friendly shapes, then load them into PostgreSQL, BigQuery, or even plain CSV exports. If you design carefully, these flows can be as robust as many “real” ETL tools.

Core building blocks of an ETL workflow in n8n

Extraction

  • HTTP Request node for APIs
  • Database nodes for pulling directly from MySQL, PostgreSQL, or SQLite
  • IMAP/POP3 nodes for ingesting emails
  • File nodes for CSV, Excel, JSON uploads

Extraction is often the least predictable step. APIs throttle, emails arrive late, CSV headers change. In practice, this means adding retries and sanity checks right after extraction.
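
A quick sanity check can live in a Code node directly after the extract step. This is a minimal sketch; the required field names are assumptions you would swap for whatever your source actually guarantees.

// Code node sketch: sanity-check extracted records before transforming.
const required = ['email', 'name'];

const broken = items.filter(item =>
  required.some(field => item.json[field] === undefined || item.json[field] === null)
);

if (broken.length > 0) {
  // Fail loudly so the error workflow alerts you instead of loading bad data.
  throw new Error(`${broken.length} extracted records are missing required fields`);
}

return items;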

Transformation

This is where most of the work lives:

  • Function node for reshaping JSON
  • Spreadsheet File node for flattening or pivoting data
  • Code node for heavier transformations (normalizing dates, regex parsing, type conversions)
  • Merge node for joining multiple sources

You want to keep transforms idempotent: re-running the pipeline on the same input tomorrow should produce the same output. Avoid random ordering or timestamps baked into transformed records.
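
As a small sketch of what that looks like in practice, derive anything time-related from the source data and sort on a stable key. The id and updated_at fields here are placeholders for whatever your source actually provides.

// Code node sketch: keep the transform deterministic.
return items
  .map(item => ({
    json: {
      ...item.json,
      // Derive timestamps from the source record, not from the current run time.
      last_changed: item.json.updated_at,
    },
  }))
  // Sort on a stable key so output order never depends on API paging quirks.
  .sort((a, b) => String(a.json.id).localeCompare(String(b.json.id)));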

Loading

Where data ends up:

  • Database nodes (Postgres, MySQL, MongoDB)
  • Google Sheets node for simpler reporting
  • S3-compatible storage for backups or raw dumps
  • HTTP Request into a downstream API

Loading should include validation too: insert rows in batches, and roll back or mark failures if the target system rejects something.
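
n8n's Split In Batches node handles batching without code; if you would rather shape the batches yourself, a Code node sketch might look like this (the batch size of 500 is just an assumption to tune against your target database):

// Code node sketch: group items into fixed-size batches before the load step.
const batchSize = 500;
const batches = [];

for (let i = 0; i < items.length; i += batchSize) {
  batches.push({
    json: { rows: items.slice(i, i + batchSize).map(item => item.json) },
  });
}

return batches;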

Designing ETL pipelines that survive production

Schedule vs event-driven

Some pipelines run on a cron (nightly syncs). Others fire on events (a new customer record triggers an extract). n8n supports both.

  • Use Cron node for scheduled ETL.
  • Use Webhook node when downstream systems can push.
  • Use Polling nodes sparingly — they are simple but eat API quotas.

Queue mode and scaling

For high-volume ETL (thousands of rows or large files), run n8n in queue mode with Redis. Workers handle transformations in parallel. This keeps the editor responsive and prevents bottlenecks.
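
As a rough sketch, queue mode comes down to pointing the main instance and the workers at the same database and Redis, then starting workers as separate processes. Variable names can shift between releases, so confirm against the docs for your version.

# Shared by the main instance and every worker
EXECUTIONS_MODE=queue
QUEUE_BULL_REDIS_HOST=127.0.0.1
QUEUE_BULL_REDIS_PORT=6379

# Start one or more worker processes alongside the main instance
n8n worker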

Checkpoints

If your ETL job moves 10k records, you don’t want to lose progress on record 9,999. Add state: store checkpoints in Postgres or Redis so the next run resumes where the last one left off.
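
A minimal sketch using n8n’s workflow static data, which only persists between executions of an active workflow, not manual test runs. A small Postgres checkpoint table works the same way and also survives exports and restores.

// Code node sketch: only keep records the previous run hasn't processed yet.
const staticData = $getWorkflowStaticData('global');
const lastId = staticData.lastProcessedId || 0;

// Hypothetical numeric id; use whatever cursor or timestamp your source offers.
return items.filter(item => item.json.id > lastId);

A second Code node placed after a successful load can then write the highest processed id back into lastProcessedId, so a failed load never advances the checkpoint.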

An end-to-end ETL example with PostgreSQL

Let’s say you want to pull leads from a SaaS API, clean them, and load them into Postgres.

  1. Cron node: triggers nightly.
  2. HTTP Request: fetch leads from API with pagination.
  3. Function node: normalize names, format phone numbers, validate emails.
  4. If node: drop incomplete or duplicate records.
  5. Postgres node: upsert into the leads table.
  6. Slack node: send a short summary report of records inserted vs rejected.

Transformation snippet

// Code node: normalize lead records before loading.
return items.map(item => {
  const lead = item.json;
  return {
    json: {
      // Guard against missing fields so one bad record doesn't throw.
      email: (lead.email || "").toLowerCase().trim(),
      name: (lead.name || "").replace(/\s+/g, ' ').trim(),
      phone: lead.phone?.replace(/[^0-9]/g, ''),
      source: lead.source || "unknown",
      // If the API exposes a created timestamp, prefer it so re-runs stay idempotent.
      created_at: lead.created_at || new Date().toISOString()
    }
  };
});

Loading with upserts

Using Postgres, you can set the node to run INSERT ... ON CONFLICT DO UPDATE so duplicates don’t create extra rows. This is key to making ETL idempotent.
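
A sketch of the query you might run from the Postgres node, with $1-style query parameters mapped from the incoming items. The table and column names are assumptions here, and the email column needs a unique constraint for ON CONFLICT to do anything.

INSERT INTO leads (email, name, phone, source, created_at)
VALUES ($1, $2, $3, $4, $5)
ON CONFLICT (email)
DO UPDATE SET
  name   = EXCLUDED.name,
  phone  = EXCLUDED.phone,
  source = EXCLUDED.source;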

Handling errors and retries in ETL

  • API failures: wrap requests in try/catch Code nodes or use n8n’s built-in retry (see the sketch after this list).
  • Partial loads: batch insert so one bad row doesn’t break the entire set.
  • Alerts: use a global error workflow to notify you on Slack if ETL jobs fail.
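
A sketch of the first pattern, assuming a recent n8n version where the Code node exposes this.helpers.httpRequest, and a hypothetical leads endpoint:

// Code node sketch: fetch with a simple retry and backoff.
const url = 'https://api.example.com/leads?page=1'; // hypothetical endpoint
const maxAttempts = 3;

for (let attempt = 1; attempt <= maxAttempts; attempt++) {
  try {
    const response = await this.helpers.httpRequest({ method: 'GET', url });
    return (response.leads || []).map(lead => ({ json: lead }));
  } catch (error) {
    if (attempt === maxAttempts) {
      // Give up; the global error workflow takes over from here.
      throw error;
    }
    // Back off before retrying (1s, then 2s).
    await new Promise(resolve => setTimeout(resolve, attempt * 1000));
  }
}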

Good ETL isn’t about never failing; it’s about failing predictably and resuming cleanly.

Monitoring ETL pipelines on a VPS

You can monitor pipelines just like any other n8n workflow, but for ETL I track:

  • Records processed per run
  • Failures vs successes
  • Execution time per stage
  • API latency and quota usage

If you already run Prometheus and Grafana, export these metrics there. Otherwise, log them to a Postgres table for quick reporting.
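
A minimal sketch of the summary row a final Code node could hand to a Postgres node, assuming earlier nodes tagged each item with a status field:

// Code node sketch: one metrics row per run.
const inserted = items.filter(item => item.json.status === 'inserted').length;
const rejected = items.filter(item => item.json.status === 'rejected').length;

return [{
  json: {
    workflow: $workflow.name,     // built-in n8n variable with workflow metadata
    execution_id: $execution.id,  // built-in n8n variable identifying this run
    finished_at: new Date().toISOString(),
    records_inserted: inserted,
    records_rejected: rejected,
  },
}];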

Extending ETL with AI

A new trend is to inject AI into ETL. Examples I’ve actually seen:

  • Extracting fields from invoices with an LLM before loading into accounting.
  • Cleaning free-text job titles into standard categories.
  • Summarizing long customer notes into one-line CRM fields.

This makes ETL smarter, though you’ll need to enforce validation so AI doesn’t introduce garbage.
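
For example, a small guard after the LLM step keeps invented values out of the load stage. The category list and field names below are made up for illustration.

// Code node sketch: validate LLM output before it reaches the load step.
const allowedCategories = ['engineering', 'sales', 'marketing', 'operations', 'other'];

return items.map(item => {
  const category = String(item.json.ai_category || '').toLowerCase().trim();
  return {
    json: {
      ...item.json,
      // Anything the model invents outside the taxonomy gets bucketed as 'other'.
      job_category: allowedCategories.includes(category) ? category : 'other',
    },
  };
});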

FAQ

Is n8n good enough to replace dedicated ETL tools?

For many use cases yes. If you’re syncing CRM data, cleaning CSVs, or pushing into Postgres, n8n works well. For petabyte-scale streaming, a specialized tool is better.

How do I avoid hitting API rate limits?

Use the Wait node to throttle requests, respect vendor quotas, and schedule batch runs instead of hammering APIs continuously.

Should I use SQLite or PostgreSQL as a target?

Always PostgreSQL for production. SQLite is fine for testing but not for high-volume ETL.

How do I keep ETL jobs from duplicating records?

Make your inserts idempotent. Use upserts or check for existing records by unique keys.

Can ETL pipelines run in parallel on a small VPS?

Yes, with queue mode and multiple workers. Just watch CPU and memory. For heavy jobs, move Postgres to its own machine.