Who this guide is for
You run n8n on a VPS and want to add AI. Maybe you need smarter routing for support tickets, lead scoring for sales, content cleanup for marketing, or data extraction from messy PDFs. You care about uptime, privacy, and cost. You also want examples that actually run, not vague advice.
I’ll keep this practical. We will cover OpenAI and Anthropic for hosted models, plus local LLMs through Ollama on a GPU VPS. We will design full patterns like RAG and human-in-the-loop approvals. We will talk about prompts, structured outputs, monitoring, security, and cost so your workflows do not surprise you later.
What can AI do inside n8n right now
High-value workflows you can ship this week
- Ticket triage with summaries and intent: Reduce first response time. Summarize incoming messages, detect sentiment, tag priority, route to the right queue.
- Lead qualification and enrichment: Classify inbound leads by intent and ICP fit. Normalize company names, extract the website, fetch context, score with a small rubric.
- Data extraction from documents: Split PDFs, isolate tables, extract fields into JSON. Good for invoices, contracts, or forms.
- Content cleanup and SEO metadata: Turn raw drafts into clean text. Generate titles, slugs, and meta descriptions. Tag posts by topic.
- Ops copilots: Auto-draft incident updates from metrics and logs. Suggest remediation steps. Ask a human to approve before sending.
Each of these can be done with hosted models or a local LLM. The tradeoffs are privacy, latency, cost, and how much control you need.
Architecture patterns that work in production
Pattern 1: Simple AI step in the middle of a flow
Webhook → prepare prompt → LLM call → JSON result → next node. This covers classification, summarization, or field extraction. Keep the prompt short. Force JSON output. Validate before moving on.
Pattern 2: AI with tools
Let the model decide which action to take, then call real services in n8n. You can simulate “tool use” by asking the model to output a function name and parameters in JSON, then route that to branches in your workflow. You stay in control of side effects.
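A minimal sketch of this pattern, assuming two hypothetical tools named lookup_order and create_ticket: the prompt asks the model to pick one and emit parameters, and a Switch node routes on the tool name so the model never executes anything itself.
Routing prompt sketch
You are a router that returns JSON only.
Available tools:
- lookup_order(order_id)
- create_ticket(subject, priority)
Return: {"tool": "<tool name or none>", "params": { ... }}
Message:
{{ $json.body }}
A Switch node on {{ $json.tool }} then sends each branch to the real HTTP Request or helpdesk node, which keeps side effects in your hands.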
Pattern 3: Retrieval augmented generation
For answers grounded in your data, build a small RAG pipeline. Ingest documents, chunk, embed, store in a vector DB, retrieve relevant chunks on each query, then ask the model to answer using only the retrieved context. Works well for internal knowledge, runbooks, or product docs.
Pattern 4: Batch jobs
Nightly content rewriting, bulk classification, or data cleanup. Use Cron → split into batches → queue mode with workers. Batches reduce timeouts and memory spikes.
Pattern 5: Human in the loop
For risky actions like refunds, send the AI suggestion to Slack or email for approval. Only execute downstream if a human clicks approve.
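One way to wire the approval, sketched with n8n's Wait node set to resume on a webhook call; orderId and summary are placeholder fields, and the exact resume-URL variable may differ in your n8n version.
Slack message body sketch
{
  "channel": "#approvals",
  "text": "Refund suggested for order {{ $json.orderId }}: {{ $json.summary }}. Approve: {{ $execution.resumeUrl }}?approved=true"
}
After the Slack node, the Wait node pauses the execution until the link is hit, and an If node on the approved query parameter gates the refund branch.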
Choosing a model: Hosted vs local
Hosted models
- OpenAI: Strong general performance, function calling, embeddings, good tooling. You send data to a third party, so mind compliance.
- Anthropic: Strong at instruction following and long contexts. Good for careful summarization and safer content. Similar privacy considerations.
Hosted models are fast to set up. They can be expensive at scale. They make compliance teams nervous if you send personal data.
Local models on your VPS
- Ollama with Llama 3 or Mistral on a GPU VPS: Data stays on your box. Fixed monthly cost for the server. Lower cost per token, especially at high volume. You manage updates and performance.
- Tradeoffs: Small models are cheaper but weaker. Bigger models need more VRAM. You tune and cache to hit your latency goals.
I usually suggest starting with a hosted model for the first workflow, then moving that workload to a local model once volume or privacy pushes you that way.
Prompts that survive real traffic
Shape instructions as contracts
Explain exactly what you want. Ask for JSON with a schema. Give one or two examples. Avoid long stories. No creative fluff in production prompts.
Template snippet
You are an API that returns JSON only.
Task: Summarize the message and classify intent.
Return a JSON object with fields:
- summary: string
- intent: one of ["bug-report","billing","feature-request","general"]
- priority: integer 1..5
- route: one of ["support","sales","product"]
Constraints:
- Use at most 60 words in summary.
- If unsure, set intent "general" and priority 3.
Message:
{{ $json.body }}
Enforce structure in n8n
- Ask for JSON in the prompt.
- Parse with the JSON Parse node.
- Validate fields with a small Code node. If parsing fails, retry with a simplified prompt once.
Deterministic responses when you need them
Set temperature to 0 for classification or extraction. Use a higher value only for creative drafting where variety helps.
Structured outputs and validation
The double parse trick
Many models return valid JSON most of the time. For the rest, parse gently then sanitize.
Code node example
const raw = $json.response;

// Fast path: the model returned clean JSON.
try {
  const parsed = JSON.parse(raw);
  return [{ json: parsed }];
} catch (e) {}

// Fallback: grab the outermost JSON-looking substring, even if the model
// wrapped it in prose or code fences.
const match = raw.match(/\{[\s\S]*\}/);
if (!match) {
  throw new Error('No JSON found in model output');
}

// Strip trailing commas, a common model mistake, then parse again.
const cleaned = match[0]
  .replace(/,\s*}/g, '}')
  .replace(/,\s*]/g, ']');
return [{ json: JSON.parse(cleaned) }];
If strict correctness matters, add a check node that verifies intent is one of the allowed values and priority is in range. If invalid, branch to a fix-up step or send to manual review.
RAG on a VPS without a giant bill
Ingestion pipeline
- Fetch docs from a source folder or a CMS API.
- Split into chunks, for example 700 tokens with 100 overlap.
- Create embeddings. Use a small embedding model to keep cost sane.
- Upsert to a vector database.
Good vector DB options for small teams:
- Qdrant or Weaviate in Docker
- pgvector inside PostgreSQL if you prefer one database
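To make the chunking step concrete, here is a rough Code node sketch that approximates tokens with words (about 0.75 words per token) and emits one n8n item per chunk; it assumes the document text arrives in $json.text.
// Approximate 700-token chunks with 100-token overlap using a word window.
const words = ($json.text || '').split(/\s+/).filter(Boolean);
const chunkWords = 525;   // roughly 700 tokens
const overlapWords = 75;  // roughly 100 tokens
const chunks = [];
for (let start = 0; start < words.length; start += chunkWords - overlapWords) {
  chunks.push(words.slice(start, start + chunkWords).join(' '));
}
return chunks.map((text, i) => ({ json: { chunkIndex: i, text } }));
Each item then flows to your embedding call and an upsert request to the vector DB.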
Query pipeline
- User asks a question.
- Embed the question with the same embedding model.
- Retrieve top 5 chunks.
- Build a strict prompt that tells the model to use only the provided context.
- Return a grounded answer. If confidence is low, ask a human.
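A sketch of the prompt-assembly step in a Code node, assuming the retrieved chunks arrive as an array in $json.chunks with a text field (adjust to your vector DB's response shape):
const chunks = $json.chunks || [];
const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
const prompt = [
  'Answer using ONLY the context below. If the answer is not there, say "I do not know".',
  'Context:',
  context,
  'Question:',
  $json.question,
].join('\n\n');
return [{ json: { prompt } }];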
n8n nodes to glue it together
- HTTP Request to your vector DB REST API.
- Function or Code node to format the prompt with retrieved chunks.
- OpenAI or HTTP Request to the local model.
- If node to branch on confidence score.
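For example, with Qdrant the HTTP Request node can POST the question embedding to the search endpoint. The collection name docs is a placeholder, and the short vector is a stand-in for the full embedding array (easiest to build the body in a Code node).
URL: http://localhost:6333/collections/docs/points/search
Body:
{
  "vector": [0.12, -0.03, 0.44],
  "limit": 5,
  "with_payload": true
}
The hits come back with their payloads, which is where your chunk text should live.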
Running a local LLM with Ollama
Quick start on a GPU VPS
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
ollama serve
Ollama listens on http://localhost:11434. In n8n, use an HTTP Request node:
- Method: POST
- URL: http://localhost:11434/api/generate
- Body: JSON
- Payload example:
{
"model": "mistral",
"prompt": "Summarize in 60 words: {{ $json.body }}",
"stream": false
}
For chat-style conversations use /api/chat. For embeddings, run an embedding model like nomic-embed-text and call /api/embeddings.
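Payload shapes for those endpoints look roughly like this (sketches; double-check against your Ollama version's API docs):
/api/chat
{
  "model": "mistral",
  "messages": [{"role": "user", "content": "{{ $json.body }}"}],
  "stream": false
}
/api/embeddings
{
  "model": "nomic-embed-text",
  "prompt": "{{ $json.text }}"
}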
GPU notes
- NVIDIA T4 works well for small to medium models.
- Keep contexts modest or latency climbs.
- Cache frequent prompts or partial results in Redis to save time and tokens.
Anthropic with the HTTP Request node
If you prefer Anthropic, set an API key in n8n credentials then call the Messages API via HTTP Request. Always request compact JSON. Keep safety settings aligned with your content.
Headers
Content-Type: application/json
x-api-key: {{ $credentials.anthropicApi.apiKey }}
anthropic-version: 2023-06-01
Body
{
"model": "claude-3-haiku-20240307",
"max_tokens": 400,
"messages": [
{"role":"user","content":"{{ $json.prompt }}"}
]
}
Return the text then parse it to your schema. Same double parse technique applies.
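The Messages API returns the generated text inside a content array, so a small Code node can lift it out first, exposing it under the response field the double parse snippet above expects:
// Anthropic puts the generated text in content[0].text.
const text = $json.content?.[0]?.text ?? '';
return [{ json: { response: text } }];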
Cost control without blunt limits
Estimate before you ship
Make a small gold set of 50 typical inputs. Run them through your prompt with a dry-run workflow and record token counts and latency. Multiply by daily volume. This tells you whether the project stays cheap or demands a local model.
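As a back-of-the-envelope sketch with made-up numbers:
// Illustrative only; plug in your own gold-set averages and current pricing.
const tokensPerRun = 800 + 150;      // average input + output tokens from the gold set
const runsPerDay = 2000;
const pricePer1kTokens = 0.0005;     // placeholder price, check your provider
const dailyCost = (tokensPerRun * runsPerDay / 1000) * pricePer1kTokens;
// about 1.9M tokens per day, roughly $0.95/day at this placeholder price
If that figure rivals the flat monthly cost of a GPU VPS, a local model starts to make sense.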
Simple savings that work
- Force short outputs. Cap summary length.
- Trim inputs before sending to the model. Remove extra fields.
- Cache embedding results. Do not re-embed the same content.
- Pick the right model size. Use a small model for classification and a bigger one for complex synthesis only when needed.
Backoff and retries
Vendors rate limit. Implement exponential backoff with jitter in a Code node or use n8n’s retry options on the HTTP Request node. Respect the vendor’s headers for remaining quota.
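If you roll your own in a Code node, a sketch of exponential backoff with jitter could look like this; callModel stands in for whatever performs the request:
// Retry a flaky call with exponential backoff plus random jitter.
async function withBackoff(callModel, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callModel();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const base = 500 * 2 ** attempt;          // 500ms, 1s, 2s, 4s
      const jitter = Math.random() * base;      // spread retries so clients do not sync up
      await new Promise((resolve) => setTimeout(resolve, base + jitter));
    }
  }
}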
Privacy and compliance
- Do not send secrets or personal data unless you must.
- Redact payloads before logging.
- Know the vendor’s data retention policy. Enterprise plans often disable training usage.
- Prefer local models for sensitive content.
- Keep all credentials in n8n’s credential store. Never hardcode in Code nodes.
Monitoring AI steps like an adult
What to record
- Prompt template version
- Model name and version
- Token count in and out
- Latency
- A tiny hash of the input for deduplication
Log these to Postgres or a lightweight store. Build a Grafana panel to watch p95 latency and error rates. Alert when retries spike or latency drifts.
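A sketch of the per-call record, assuming OpenAI-style usage fields, a latency value you stamp yourself, and that built-in modules like crypto are allowed in your Code node settings:
const crypto = require('crypto');
const record = {
  promptVersion: 'triage-v3',                       // bump when the template changes
  model: $json.model,
  tokensIn: $json.usage?.prompt_tokens ?? null,     // adjust field names for other providers
  tokensOut: $json.usage?.completion_tokens ?? null,
  latencyMs: $json.latencyMs,
  inputHash: crypto.createHash('sha256').update($json.input || '').digest('hex').slice(0, 16),
  createdAt: new Date().toISOString(),
};
return [{ json: record }];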
Failure plan
When the model errors out, your workflow should not explode; it should degrade. Send a default classification, mark the item unknown, or route to a human. Keep the system helpful even when AI is offline.
End-to-end example: Support ticket triage on a VPS
Goals
- Summarize the ticket
- Classify intent and priority
- Route to support or product or sales
- Respect privacy
- Keep costs predictable
Flow outline
- Webhook receives a ticket payload from your help desk.
- Code node trims fields and builds a compact prompt.
- LLM call via OpenAI node or HTTP Request to Anthropic or Ollama.
- JSON Parse with validation.
- Router node picks a queue based on intent and priority.
- Slack or Email notifies the right team.
- Logging node writes model metadata to a table.
- If node catches low confidence then creates a manual review task.
Compact prompt
Return JSON:
{
"summary": string,
"intent": "bug-report" | "billing" | "feature-request" | "general",
"priority": 1 | 2 | 3 | 4 | 5,
"route": "support" | "sales" | "product",
"confidence": 0..1
}
Rules:
- Summarize in at most 60 words.
- Use "general" if unsure.
- Priority 1 if payment failed, 2 if outage keywords found, else 3.
Ticket:
{{ $json.message }}
Validation snippet
const d = { ...$json }; // copy so we can safely adjust fields
const intents = ["bug-report","billing","feature-request","general"];
const routes = ["support","sales","product"];
if (!intents.includes(d.intent)) d.intent = "general";
if (!routes.includes(d.route)) d.route = "support";
if (typeof d.priority !== "number" || d.priority < 1 || d.priority > 5) d.priority = 3;
if (typeof d.confidence !== "number") d.confidence = 0.5;
return [{ json: d }];
Cost and privacy
- Redact emails or card fragments before sending to the model.
- Set temperature 0.
- Use a small model for classification. Use a larger one only if the text is complex.
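A rough redaction pass before the model call could run in a Code node like this; the patterns are illustrative, not exhaustive:
// Mask obvious emails and long digit runs (possible card numbers) in the ticket text.
let text = $json.message || '';
text = text.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]');
text = text.replace(/\b(?:\d[ -]?){13,19}\b/g, '[card]');
return [{ json: { ...$json, message: text } }];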
Running this reliably on a VPS
Single node vs queue mode
For occasional AI calls, a single node is fine. For batch jobs or bursty webhooks, move to queue mode with Redis and multiple workers. AI calls often block on network or GPU, so workers keep the editor responsive.
Resource planning
- Hosted models mainly use CPU for HTTP and JSON.
- Local models use GPU heavily. Budget VRAM for context length and model size.
- Keep Postgres fast with NVMe so logging and state do not stall workers.
Secrets and deployment
- Store API keys in n8n credentials.
- Pin container versions so you can roll back.
- Snapshots before upgrades.
- Monitor logs for prompt drift and timeouts.
FAQ
Can I force strict JSON from models without constant parsing fixes?
Use a constrained prompt plus JSON mode if your provider supports it. Still run a validator since edge cases happen in production.
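For example, OpenAI's Chat Completions API accepts a response_format field; a sketch of the request body (the model name is just an example):
{
  "model": "gpt-4o-mini",
  "temperature": 0,
  "response_format": { "type": "json_object" },
  "messages": [
    {"role": "system", "content": "Return JSON only."},
    {"role": "user", "content": "{{ $json.prompt }}"}
  ]
}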
How do I pick between OpenAI, Anthropic, or a local LLM?
Start with hosted for speed. If volume grows or data sensitivity is high, move to a local model on a GPU VPS. Mix both when it makes sense.
Do I need LangChain to build real workflows in n8n?
No. You can do RAG with plain nodes and HTTP calls. LangChain nodes help when you need chains, tools, or retrievers out of the box.
How big should chunks be for RAG?
Between 500 and 800 tokens with some overlap works well for most docs. Test with your content. Smaller chunks reduce irrelevant context but increase embedding calls.
What is the easiest way to add human review?
Send the AI result to Slack with Approve or Reject buttons. Only proceed on Approve. Keep a timeout path so items do not stall forever.
Will a local model be fast enough?
With a T4 or similar GPU, small to medium models can hit low seconds per request. Use caching, keep prompts tight, and avoid very long contexts.
How do I keep costs under control with hosted models?
Trim inputs, cap output length, use temperature 0 for deterministic tasks, cache embeddings, and pick the smallest model that passes your tests.