Who this guide is for
You run n8n on a VPS and want to add AI. Maybe you need smarter routing for support tickets, lead scoring for sales, content cleanup for marketing, or data extraction from messy PDFs. You care about uptime, privacy, and cost. You also want examples that actually run, not vague advice.
I’ll keep this practical. We will cover OpenAI and Anthropic for hosted models, plus local LLMs through Ollama on a GPU VPS. We will design full patterns like RAG and human-in-the-loop approvals. We will talk about prompts, structured outputs, monitoring, security, and cost so your workflows do not surprise you later.
What can AI do inside n8n right now
High-value workflows you can ship this week
- Ticket triage with summaries and intent: Reduce first response time. Summarize incoming messages, detect sentiment, tag priority, route to the right queue.
- Lead qualification and enrichment: Classify inbound leads by intent and ICP fit. Normalize company names, extract the website, fetch context, score with a small rubric.
- Data extraction from documents: Split PDFs, isolate tables, extract fields into JSON. Good for invoices, contracts, or forms.
- Content cleanup and SEO metadata: Turn raw drafts into clean text. Generate titles, slugs, and meta descriptions. Tag posts by topic.
- Ops copilots: Auto-draft incident updates from metrics and logs. Suggest remediation steps. Ask a human to approve before sending.
Each of these can be done with hosted models or a local LLM. The tradeoffs are privacy, latency, cost, and how much control you need.
Architecture patterns that work in production
Pattern 1: Simple AI step in the middle of a flow
Webhook → prepare prompt → LLM call → JSON result → next node. This covers classification, summarization, or field extraction. Keep the prompt short. Force JSON output. Validate before moving on.
Pattern 2: AI with tools
Let the model decide which action to take, then call real services in n8n. You can simulate “tool use” by asking the model to output a function name and parameters in JSON, then route that to branches in your workflow. You stay in control of side effects.
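A minimal sketch of this pattern, assuming two hypothetical tools named lookup_order and create_ticket: the prompt asks the model to pick one and emit parameters, and a Switch node routes on the tool name so the model never executes anything itself.
Routing prompt sketch
You are a router that returns JSON only.
Available tools:
- lookup_order(order_id)
- create_ticket(subject, priority)
Return: {"tool": "<tool name or none>", "params": { ... }}
Message:
{{ $json.body }}
A Switch node on {{ $json.tool }} then sends each branch to the real HTTP Request or helpdesk node, which keeps side effects in your hands.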
Pattern 3: Retrieval augmented generation
For answers grounded in your data, build a small RAG pipeline. Ingest documents, chunk, embed, store in a vector DB, retrieve relevant chunks on each query, then ask the model to answer using only the retrieved context. Works well for internal knowledge, runbooks, or product docs.
Pattern 4: Batch jobs
Nightly content rewriting, bulk classification, or data cleanup. Use Cron → split into batches → queue mode with workers. Batches reduce timeouts and memory spikes.
Pattern 5: Human in the loop
For risky actions like refunds, send the AI suggestion to Slack or email for approval. Only execute downstream if a human clicks approve.
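One way to wire the approval, sketched with n8n's Wait node set to resume on a webhook call; orderId and summary are placeholder fields, and the exact resume-URL variable may differ in your n8n version.
Slack message body sketch
{
  "channel": "#approvals",
  "text": "Refund suggested for order {{ $json.orderId }}: {{ $json.summary }}. Approve: {{ $execution.resumeUrl }}?approved=true"
}
After the Slack node, the Wait node pauses the execution until the link is hit, and an If node on the approved query parameter gates the refund branch.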
Choosing a model: Hosted vs local
Hosted models
- OpenAI: Strong general performance, function calling, embeddings, good tooling. You send data to a third party, so mind compliance.
- Anthropic: Strong at instruction following and long contexts. Good for careful summarization and safer content. Similar privacy considerations.
Hosted models are fast to set up. They can be expensive at scale. They make compliance teams nervous if you send personal data.
Local models on your VPS
- Ollama with Llama 3 or Mistral on a GPU VPS: Data stays on your box. Fixed monthly cost for the server. Lower cost per token, especially at high volume. You manage updates and performance.
- Tradeoffs: Small models are cheaper but weaker. Bigger models need more VRAM. You tune and cache to hit your latency goals.
I usually suggest starting with a hosted model for the first workflow, then moving that workload to a local model once volume or privacy pushes you that way.
Prompts that survive real traffic
Shape instructions as contracts
Explain exactly what you want. Ask for JSON with a schema. Give one or two examples. Avoid long stories. No creative fluff in production prompts.
Template snippet
You are an API that returns JSON only.
Task: Summarize the message and classify intent.
Return a JSON object with fields:
- summary: string
- intent: one of ["bug-report","billing","feature-request","general"]
- priority: integer 1..5
- route: one of ["support","sales","product"]
Constraints:
- Use at most 60 words in summary.
- If unsure, set intent "general" and priority 3.
Message:
{{ $json.body }}
Enforce structure in n8n
- Ask for JSON in the prompt.
- Parse with the JSON Parse node.
- Validate fields with a small Code node. If parsing fails, retry with a simplified prompt once.
Deterministic responses when you need them
Set temperature to 0 for classification or extraction. Use a higher value only for creative drafting where variety helps.
Structured outputs and validation
The double parse trick
Many models return valid JSON most of the time. For the rest, parse gently then sanitize.
Code node example
const raw = $json.response;

// Fast path: the model returned clean JSON.
try {
  const parsed = JSON.parse(raw);
  return [{ json: parsed }];
} catch (e) {}

// Fallback: grab the outermost JSON-looking substring, even if the model
// wrapped it in prose or code fences.
const match = raw.match(/\{[\s\S]*\}/);
if (!match) {
  throw new Error('No JSON found in model output');
}

// Strip trailing commas, a common model mistake, then parse again.
const cleaned = match[0]
  .replace(/,\s*}/g, '}')
  .replace(/,\s*]/g, ']');
return [{ json: JSON.parse(cleaned) }];
If strict correctness matters, add a check node that verifies intent is one of the allowed values and priority is in range. If invalid, branch to a fix-up step or send to manual review.
RAG on a VPS without a giant bill
Ingestion pipeline
- Fetch docs from a source folder or a CMS API.
- Split into chunks, for example 700 tokens with 100 overlap.
- Create embeddings. Use a small embedding model to keep cost sane.
- Upsert to a vector database.
Good vector DB options for small teams:
- Qdrant or Weaviate in Docker
- pgvector inside PostgreSQL if you prefer one database
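To make the chunking step concrete, here is a rough Code node sketch that approximates tokens with words (about 0.75 words per token) and emits one n8n item per chunk; it assumes the document text arrives in $json.text.
// Approximate 700-token chunks with 100-token overlap using a word window.
const words = ($json.text || '').split(/\s+/).filter(Boolean);
const chunkWords = 525;   // roughly 700 tokens
const overlapWords = 75;  // roughly 100 tokens
const chunks = [];
for (let start = 0; start < words.length; start += chunkWords - overlapWords) {
  chunks.push(words.slice(start, start + chunkWords).join(' '));
}
return chunks.map((text, i) => ({ json: { chunkIndex: i, text } }));
Each item then flows to your embedding call and an upsert request to the vector DB.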
Query pipeline
- User asks a question.
- Embed the question with the same embedding model.
- Retrieve top 5 chunks.
- Build a strict prompt that tells the model to use only the provided context.
- Return a grounded answer. If confidence is low, ask a human.
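A sketch of the prompt-assembly step in a Code node, assuming the retrieved chunks arrive as an array in $json.chunks with a text field (adjust to your vector DB's response shape):
const chunks = $json.chunks || [];
const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join('\n\n');
const prompt = [
  'Answer using ONLY the context below. If the answer is not there, say "I do not know".',
  'Context:',
  context,
  'Question:',
  $json.question,
].join('\n\n');
return [{ json: { prompt } }];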
n8n nodes to glue it together
- HTTP Request to your vector DB REST API.
- Function or Code node to format the prompt with retrieved chunks.
- OpenAI or HTTP Request to the local model.
- If node to branch on confidence score.
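For example, with Qdrant the HTTP Request node can POST the question embedding to the search endpoint. The collection name docs is a placeholder, and the short vector is a stand-in for the full embedding array (easiest to build the body in a Code node).
URL: http://localhost:6333/collections/docs/points/search
Body:
{
  "vector": [0.12, -0.03, 0.44],
  "limit": 5,
  "with_payload": true
}
The hits come back with their payloads, which is where your chunk text should live.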
Running a local LLM with Ollama
Quick start on a GPU VPS
curl -fsSL https://ollama.com/install.sh | sh
ollama pull mistral
ollama serve
Ollama listens on http://localhost:11434. In n8n, use an HTTP Request node:
- Method: POST
- URL: http://localhost:11434/api/generate
- Body: JSON
- Payload example:
{
"model": "mistral",
"prompt": "Summarize in 60 words: {{ $json.body }}",
"stream": false
}
For chat-style conversations use /api/chat. For embeddings, run an embedding model like nomic-embed-text and call /api/embeddings.
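Payload shapes for those endpoints look roughly like this (sketches; double-check against your Ollama version's API docs):
/api/chat
{
  "model": "mistral",
  "messages": [{"role": "user", "content": "{{ $json.body }}"}],
  "stream": false
}
/api/embeddings
{
  "model": "nomic-embed-text",
  "prompt": "{{ $json.text }}"
}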
GPU notes
- NVIDIA T4 works well for small to medium models.
- Keep contexts modest or latency climbs.
- Cache frequent prompts or partial results in Redis to save time and tokens.
Anthropic with the HTTP Request node
If you prefer Anthropic, set an API key in n8n credentials then call the Messages API via HTTP Request. Always request compact JSON. Keep safety settings aligned with your content.
Headers
Content-Type: application/json
x-api-key: {{ $credentials.anthropicApi.apiKey }}
anthropic-version: 2023-06-01
Body
{
"model": "claude-3-haiku-20240307",
"max_tokens": 400,
"messages": [
{"role":"user","content":"{{ $json.prompt }}"}
]
}
Return the text then parse it to your schema. Same double parse technique applies.
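The Messages API returns the generated text inside a content array, so a small Code node can lift it out first, exposing it under the response field the double parse snippet above expects:
// Anthropic puts the generated text in content[0].text.
const text = $json.content?.[0]?.text ?? '';
return [{ json: { response: text } }];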
Cost control without blunt limits
Estimate before you ship
Make a small gold set of 50 typical inputs. Run them through your prompt with a dry-run workflow and record token counts and latency. Multiply by daily volume. This tells you whether the project stays cheap or demands a local model.
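As a back-of-the-envelope sketch with made-up numbers:
// Illustrative only; plug in your own gold-set averages and current pricing.
const tokensPerRun = 800 + 150;      // average input + output tokens from the gold set
const runsPerDay = 2000;
const pricePer1kTokens = 0.0005;     // placeholder price, check your provider
const dailyCost = (tokensPerRun * runsPerDay / 1000) * pricePer1kTokens;
// about 1.9M tokens per day, roughly $0.95/day at this placeholder price
If that figure rivals the flat monthly cost of a GPU VPS, a local model starts to make sense.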
Simple savings that work
- Force short outputs. Cap summary length.
- Trim inputs before sending to the model. Remove extra fields.
- Cache embedding results. Do not re-embed the same content.
- Pick the right model size. Use a small model for classification and a bigger one for complex synthesis only when needed.
Backoff and retries
Vendors rate limit. Implement exponential backoff with jitter in a Code node or use n8n’s retry options on the HTTP Request node. Respect the vendor’s headers for remaining quota.
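If you roll your own in a Code node, a sketch of exponential backoff with jitter could look like this; callModel stands in for whatever performs the request:
// Retry a flaky call with exponential backoff plus random jitter.
async function withBackoff(callModel, maxAttempts = 4) {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await callModel();
    } catch (err) {
      if (attempt === maxAttempts - 1) throw err;
      const base = 500 * 2 ** attempt;          // 500ms, 1s, 2s, 4s
      const jitter = Math.random() * base;      // spread retries so clients do not sync up
      await new Promise((resolve) => setTimeout(resolve, base + jitter));
    }
  }
}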
Privacy and compliance
- Do not send secrets or personal data unless you must.
- Redact payloads before logging.
- Know the vendor’s data retention policy. Enterprise plans often disable training usage.
- Prefer local models for sensitive content.
- Keep all credentials in n8n’s credential store. Never hardcode in Code nodes.
Monitoring AI steps like an adult
What to record
- Prompt template version
- Model name and version
- Token count in and out
- Latency
- A tiny hash of the input for deduplication
Log these to Postgres or a lightweight store. Build a Grafana panel to watch p95 latency and error rates. Alert when retries spike or latency drifts.
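A sketch of the per-call record, assuming OpenAI-style usage fields, a latency value you stamp yourself, and that built-in modules like crypto are allowed in your Code node settings:
const crypto = require('crypto');
const record = {
  promptVersion: 'triage-v3',                       // bump when the template changes
  model: $json.model,
  tokensIn: $json.usage?.prompt_tokens ?? null,     // adjust field names for other providers
  tokensOut: $json.usage?.completion_tokens ?? null,
  latencyMs: $json.latencyMs,
  inputHash: crypto.createHash('sha256').update($json.input || '').digest('hex').slice(0, 16),
  createdAt: new Date().toISOString(),
};
return [{ json: record }];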
Failure plan
When the model errors out, your workflow should not explode; it should degrade. Send a default classification, mark the item unknown, or route to a human. Keep the system helpful even when AI is offline.
End-to-end example: Support ticket triage on a VPS
Goals
- Summarize the ticket
- Classify intent and priority
- Route to support or product or sales
- Respect privacy
- Keep costs predictable
Flow outline
- Webhook receives a ticket payload from your help desk.
- Code node trims fields and builds a compact prompt.
- LLM call via OpenAI node or HTTP Request to Anthropic or Ollama.
- JSON Parse with validation.
- Router node picks a queue based on intent and priority.
- Slack or Email notifies the right team.
- Logging node writes model metadata to a table.
- If node catches low confidence then creates a manual review task.
Compact prompt
Return JSON:
{
"summary": string,
"intent": "bug-report" | "billing" | "feature-request" | "general",
"priority": 1 | 2 | 3 | 4 | 5,
"route": "support" | "sales" | "product",
"confidence": 0..1
}
Rules:
- Summarize in at most 60 words.
- Use "general" if unsure.
- Priority 1 if payment failed, 2 if outage keywords found, else 3.
Ticket:
{{ $json.message }}
Validation snippet
const d = { ...$json }; // copy so we can safely adjust fields
const intents = ["bug-report","billing","feature-request","general"];
const routes = ["support","sales","product"];
if (!intents.includes(d.intent)) d.intent = "general";
if (!routes.includes(d.route)) d.route = "support";
if (typeof d.priority !== "number" || d.priority < 1 || d.priority > 5) d.priority = 3;
if (typeof d.confidence !== "number") d.confidence = 0.5;
return [{ json: d }];
Cost and privacy
- Redact emails or card fragments before sending to the model.
- Set temperature 0.
- Use a small model for classification. Use a larger one only if the text is complex.
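A rough redaction pass before the model call could run in a Code node like this; the patterns are illustrative, not exhaustive:
// Mask obvious emails and long digit runs (possible card numbers) in the ticket text.
let text = $json.message || '';
text = text.replace(/[\w.+-]+@[\w-]+\.[\w.]+/g, '[email]');
text = text.replace(/\b(?:\d[ -]?){13,19}\b/g, '[card]');
return [{ json: { ...$json, message: text } }];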
Running this reliably on a VPS
Single node vs queue mode
For occasional AI calls, a single node is fine. For batch jobs or bursty webhooks, move to queue mode with Redis and multiple workers. AI calls often block on network or GPU, so workers keep the editor responsive.
Resource planning
- Hosted models mainly use CPU for HTTP and JSON.
- Local models use GPU heavily. Budget VRAM for context length and model size.
- Keep Postgres fast with NVMe so logging and state do not stall workers.
Secrets and deployment
- Store API keys in n8n credentials.
- Pin container versions so you can roll back.
- Snapshots before upgrades.
- Monitor logs for prompt drift and timeouts.
FAQ
Can I force strict JSON from models without constant parsing fixes?
Use a constrained prompt plus JSON mode if your provider supports it. Still run a validator since edge cases happen in production.
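For example, OpenAI's Chat Completions API accepts a response_format field; a sketch of the request body (the model name is just an example):
{
  "model": "gpt-4o-mini",
  "temperature": 0,
  "response_format": { "type": "json_object" },
  "messages": [
    {"role": "system", "content": "Return JSON only."},
    {"role": "user", "content": "{{ $json.prompt }}"}
  ]
}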
How do I pick between OpenAI, Anthropic, or a local LLM?
Start with hosted for speed. If volume grows or data sensitivity is high, move to a local model on a GPU VPS. Mix both when it makes sense.
Do I need LangChain to build real workflows in n8n?
No. You can do RAG with plain nodes and HTTP calls. LangChain nodes help when you need chains, tools, or retrievers out of the box.
How big should chunks be for RAG?
Between 500 and 800 tokens with some overlap works well for most docs. Test with your content. Smaller chunks reduce irrelevant context but increase embedding calls.
What is the easiest way to add human review?
Send the AI result to Slack with Approve or Reject buttons. Only proceed on Approve. Keep a timeout path so items do not stall forever.
Will a local model be fast enough?
With a T4 or similar GPU, small to medium models can hit low seconds per request. Use caching, keep prompts tight, and avoid very long contexts.
How do I keep costs under control with hosted models?
Trim inputs, cap output length, use temperature 0 for deterministic tasks, cache embeddings, and pick the smallest model that passes your tests.