Rolling Updates for n8n Using Docker

Ahmed

I’ve run n8n in production behind reverse proxies where even a 10-second restart could break inbound webhooks and annoy paying users.


Rolling updates for n8n on Docker let you ship new versions while keeping triggers alive, the editor reachable, and executions safe.



What “zero downtime” actually means in n8n

If you update the container and your instance disappears for a moment, you usually see one (or more) of these failures:

  • Missed inbound webhooks: external services retry later (or never), and you lose events.
  • Broken OAuth callback flows: users get redirected into a dead request path.
  • Half-finished executions: long-running workflows get killed mid-flight if your shutdown is abrupt.
  • Editor disconnects: not fatal, but it looks like instability.

Your goal is simple: keep at least one healthy n8n “main” available to accept web traffic while updates happen, and make shutdowns graceful so in-flight work finishes safely.


Non-negotiables before you attempt rolling updates

1) Put n8n behind a reverse proxy and set the webhook URL correctly

If you run n8n behind a proxy (which you should for TLS and routing), set WEBHOOK_URL so n8n generates and registers the correct external URLs. This is the difference between “it works in the editor” and “it works in production.”


Use the official reference for reverse proxy webhook configuration here: n8n reverse proxy webhook URL configuration


Real-world challenge: The most common rolling-update “bug” isn’t Docker—it’s n8n producing internal URLs (like port 5678) when WEBHOOK_URL isn’t set correctly.


Fix: Set WEBHOOK_URL to your public HTTPS URL and verify new webhook registrations show the correct domain.
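
For reference, here is a minimal sketch of the variables involved, matching the full stack examples later in this guide (your-domain.com is a placeholder for your public domain):

# Minimal proxy-aware environment sketch (Compose-style)
environment:
  - N8N_HOST=your-domain.com
  - N8N_PROTOCOL=https
  - WEBHOOK_URL=https://your-domain.com/
  - N8N_PROXY_HOPS=1   # number of reverse proxies in front of n8n
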


2) Use a real database (not SQLite) and keep it stable across replicas

Rolling updates only make sense when your state survives container churn. That means a persistent database (commonly Postgres) and predictable connectivity from each n8n instance.


If you’re using Docker, follow the official n8n Docker hosting guidance: n8n Docker installation docs


Real-world challenge: Database latency spikes during deployments can look like “n8n is unhealthy,” causing your proxy or orchestrator to flap containers.


Fix: Add health checks that test the app, not the database directly, and keep your DB on stable storage with proper resources.
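
On the n8n side, pointing every replica at the same Postgres is just a handful of environment variables; this sketch uses the same connection settings as the stack examples below (values are placeholders):

# Postgres connection sketch; every n8n instance points at the same database
environment:
  - DB_TYPE=postgresdb
  - DB_POSTGRESDB_HOST=postgres
  - DB_POSTGRESDB_DATABASE=n8n
  - DB_POSTGRESDB_USER=n8n
  - DB_POSTGRESDB_PASSWORD=REPLACE_ME
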


3) Decide if you need Queue Mode (you probably do if you scale)

If you plan to run multiple containers, Queue Mode is the cleanest way to separate web-facing “main” instances from execution “workers.” It reduces the risk of concurrency conflicts and gives you predictable scaling.


Reference the official Queue Mode documentation: n8n Queue Mode configuration


Real-world challenge: Queue Mode adds Redis and introduces more moving parts to monitor.


Fix: Keep Redis private on an internal network, use a strong Redis password if applicable, and start with one worker, then scale gradually while watching execution throughput.
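
As a minimal sketch of what Queue Mode changes in a Compose or Swarm file (same variables as the stack example below): the main instance and every worker share the Redis settings, and workers run the same image started with the worker command.

# Shared by the main and every worker (hostnames are placeholders from the stack example)
environment:
  - EXECUTIONS_MODE=queue
  - QUEUE_BULL_REDIS_HOST=redis
  - QUEUE_BULL_REDIS_PORT=6379

# Workers are the same image started with the worker command:
# command: worker
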


Two production-ready patterns for rolling updates

You can do rolling updates in Docker in two different ways depending on your environment:

  • Pattern A (recommended for true rolling updates): Docker Swarm services with rolling update policy.
  • Pattern B (works on a single host with Docker Compose): Blue/green using two n8n “main” containers behind a proxy.

Pattern A: Docker Swarm rolling updates (true start-first behavior)

Swarm gives you first-class rolling updates: it can start a new task, wait until healthy, then stop the old task. That “overlap” is what prevents downtime.


Core idea

  • Run 2+ replicas of the n8n “main” service behind a reverse proxy.
  • Use update order = start-first so a new container is ready before the old one exits.
  • Use health checks so Swarm only routes traffic to healthy tasks.

Swarm stack example (main + workers in queue mode)

version: "3.8"

services: n8n-main: image: n8nio/n8n:latest environment: - NODE_ENV=production - N8N_HOST=your-domain.com - N8N_PROTOCOL=https - WEBHOOK_URL=https://your-domain.com/ - N8N_PROXY_HOPS=1 - N8N_ENCRYPTION_KEY=REPLACE_WITH_LONG_RANDOM_SECRET - DB_TYPE=postgresdb - DB_POSTGRESDB_HOST=postgres - DB_POSTGRESDB_DATABASE=n8n - DB_POSTGRESDB_USER=n8n - DB_POSTGRESDB_PASSWORD=REPLACE_ME - EXECUTIONS_MODE=queue - QUEUE_BULL_REDIS_HOST=redis - QUEUE_BULL_REDIS_PORT=6379 networks: - internal - edge deploy: replicas: 2 update_config: parallelism: 1 delay: 10s order: start-first failure_action: rollback rollback_config: parallelism: 1 order: stop-first restart_policy: condition: on-failure healthcheck: test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"] interval: 10s timeout: 3s retries: 6 start_period: 25s n8n-worker: image: n8nio/n8n:latest environment: - NODE_ENV=production - N8N_ENCRYPTION_KEY=REPLACE_WITH_LONG_RANDOM_SECRET - DB_TYPE=postgresdb - DB_POSTGRESDB_HOST=postgres - DB_POSTGRESDB_DATABASE=n8n - DB_POSTGRESDB_USER=n8n - DB_POSTGRESDB_PASSWORD=REPLACE_ME - EXECUTIONS_MODE=queue - QUEUE_BULL_REDIS_HOST=redis - QUEUE_BULL_REDIS_PORT=6379 command: worker networks: - internal deploy: replicas: 2 update_config: parallelism: 1 delay: 10s order: start-first postgres: image: postgres:16 environment: - POSTGRES_DB=n8n - POSTGRES_USER=n8n - POSTGRES_PASSWORD=REPLACE_ME volumes: - postgres_data:/var/lib/postgresql/data networks: - internal redis: image: redis:7 networks: - internal networks: internal: driver: overlay edge: driver: overlay volumes:
postgres_data:

Real-world challenge: Swarm is solid for rolling updates, but the ecosystem is smaller than Kubernetes, and some teams don’t have standardized Swarm ops.


Fix: Keep your stack minimal, document your update commands, and test updates in staging with a realistic webhook load before you do it on your primary domain.


Deploy and update commands

# First deploy
docker stack deploy -c stack.yml n8n

# Update image (rolling update happens automatically due to update_config)
docker service update --image n8nio/n8n:latest n8n_n8n-main
docker service update --image n8nio/n8n:latest n8n_n8n-worker

# Optional: force a rolling restart (even if the tag didn't change)
docker service update --force n8n_n8n-main

For official Docker service update behavior, use: Docker service update reference
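
If an update misbehaves despite failure_action: rollback, you can also roll back and inspect task state manually:

# Roll the service back to its previously deployed spec
docker service rollback n8n_n8n-main

# Watch task states while an update or rollback progresses
docker service ps n8n_n8n-main
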


Pattern B: Docker Compose blue/green (practical on a single VPS)

If you’re on a single VPS and don’t want Swarm, you can still get near-zero downtime by running two “main” containers and switching traffic at the proxy.


Core idea

  • Run n8n-a and n8n-b (only one is “active” at a time).
  • Both point to the same database and use the same N8N_ENCRYPTION_KEY.
  • Your reverse proxy routes traffic to the active container.
  • Update the inactive container first, verify it’s healthy, then flip the proxy route.

Compose example for blue/green mains

services:
  n8n-a:
    image: n8nio/n8n:latest
    environment:
      - NODE_ENV=production
      - WEBHOOK_URL=https://your-domain.com/
      - N8N_PROXY_HOPS=1
      - N8N_ENCRYPTION_KEY=REPLACE_WITH_LONG_RANDOM_SECRET
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=REPLACE_ME
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 6
      start_period: 25s
    networks: [internal]

  n8n-b:
    image: n8nio/n8n:latest
    environment:
      - NODE_ENV=production
      - WEBHOOK_URL=https://your-domain.com/
      - N8N_PROXY_HOPS=1
      - N8N_ENCRYPTION_KEY=REPLACE_WITH_LONG_RANDOM_SECRET
      - DB_TYPE=postgresdb
      - DB_POSTGRESDB_HOST=postgres
      - DB_POSTGRESDB_DATABASE=n8n
      - DB_POSTGRESDB_USER=n8n
      - DB_POSTGRESDB_PASSWORD=REPLACE_ME
    healthcheck:
      test: ["CMD-SHELL", "wget -qO- http://localhost:5678/healthz || exit 1"]
      interval: 10s
      timeout: 3s
      retries: 6
      start_period: 25s
    networks: [internal]

  postgres:
    image: postgres:16
    environment:
      - POSTGRES_DB=n8n
      - POSTGRES_USER=n8n
      - POSTGRES_PASSWORD=REPLACE_ME
    volumes:
      - postgres_data:/var/lib/postgresql/data
    networks: [internal]

networks:
  internal:

volumes:
  postgres_data:

Real-world challenge: Compose can’t natively “roll” replicas the way Swarm does, so you must manage the traffic switch at the proxy.


Fix: Automate your switch with a simple runbook: update inactive → health verify → flip route → drain old → remove old.


Traefik or NGINX: which proxy should you choose?

Traefik is popular with Docker because it can discover containers automatically and health-check upstreams.


Official Traefik service/healthcheck docs: Traefik routing services & health checks


Real-world challenge: Label-based configuration can become messy, especially when you’re switching between n8n-a and n8n-b.


Fix: Keep one stable router name (your domain) and only change the service target, not the public route.
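
One way to do that is with Traefik's file provider, so the blue/green flip is a one-line change; this is a sketch, and the entrypoint name websecure plus the service name n8n-active are assumptions you should adapt to your setup:

# Traefik dynamic configuration sketch: the router stays stable, only the server URL flips
http:
  routers:
    n8n:
      rule: "Host(`your-domain.com`)"
      entryPoints: ["websecure"]
      service: n8n-active
  services:
    n8n-active:
      loadBalancer:
        servers:
          - url: "http://n8n-a:5678"   # flip to http://n8n-b:5678 during a blue/green switch
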


NGINX is a strong option if you prefer explicit config files and predictable behavior.


Official NGINX documentation: NGINX docs


Real-world challenge: Misconfigured reloads or upstream timeouts can cause short request failures during a flip.


Fix: Use a health-checked upstream, a quick reload command, and verify the new upstream responds before switching traffic.
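
As a sketch of that pattern, here is a minimal upstream and location pair, assuming the container names from the Compose example above and that the location sits inside your existing HTTPS server block. Open-source NGINX only does passive health checks (max_fails / fail_timeout), so pair this with the pre-flip verification described later.

# Passive health checking: n8n-a is primary, n8n-b only takes traffic if n8n-a fails
upstream n8n_backend {
    server n8n-a:5678 max_fails=3 fail_timeout=10s;
    server n8n-b:5678 backup;
}

# Inside your existing server block for your-domain.com:
location / {
    proxy_pass http://n8n_backend;
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-Proto $scheme;
    proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    # the n8n editor keeps a push connection open, which may run over WebSockets
    proxy_http_version 1.1;
    proxy_set_header Upgrade $http_upgrade;
    proxy_set_header Connection "upgrade";
}

After editing, nginx -t followed by nginx -s reload applies the change gracefully: old worker processes finish their in-flight requests before exiting.
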


Comparison table: pick the right update approach

| Approach | Downtime risk | Operational complexity | Best fit |
| --- | --- | --- | --- |
| Single container (stop/start) | High | Low | Internal testing, non-critical workflows |
| Compose blue/green (n8n-a / n8n-b) | Low (if the proxy flip is clean) | Medium | Single VPS where you want near-zero downtime without Swarm |
| Docker Swarm rolling service updates | Very low (with health checks) | Medium | Production webhooks, multiple replicas, repeatable deployments |

Hardening your rolling updates: the details that prevent surprises

Use graceful shutdown so executions don’t get cut off

When you stop a container, Docker sends a termination signal and eventually kills the process if it doesn’t exit quickly. Long webhooks, retries, and active executions can get disrupted if your shutdown window is too short.


Practical move: extend stop timeouts so n8n can finish current work before exit.

# Docker Compose example
services:
  n8n:
    image: n8nio/n8n:latest
    stop_grace_period: 60s

Health checks must reflect “ready for traffic,” not just “process is alive”

If your health check is too weak, your proxy may route traffic to a container that hasn’t fully started. If it’s too strict, you’ll get unnecessary restarts.


Real-world challenge: Many teams only check the TCP port, which passes even when the app isn’t ready.


Fix: Hit a lightweight HTTP endpoint and require consistent success before the instance is considered healthy.
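
For the Compose blue/green flow, that can be as simple as requiring a few consecutive successful responses from the candidate before you flip; a sketch assuming the service names and /healthz check from the Compose example above:

# Require five consecutive healthy responses from n8n-b before flipping traffic to it
for i in 1 2 3 4 5; do
  docker compose exec -T n8n-b wget -qO- http://localhost:5678/healthz >/dev/null || { echo "n8n-b not ready"; exit 1; }
  sleep 2
done
echo "n8n-b is ready for traffic"
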


Sticky sessions: helpful for the editor, risky during flips

Sticky sessions can reduce “log in again” moments in some setups, but they can also pin a user to an old instance that you’re trying to drain.


Real-world challenge: During a flip, a sticky cookie can keep sending editor requests to a container that is stopping.


Fix: If you enable stickiness, shorten cookie lifetime and always do start-first updates so an alternate healthy instance is ready.
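
If you do enable stickiness with Traefik, it is configured per service; a sketch using Traefik v2+ labels, where the service name n8n and cookie name are placeholders:

# Sticky cookie sketch for a Traefik-managed n8n service
labels:
  - "traefik.http.services.n8n.loadbalancer.sticky.cookie=true"
  - "traefik.http.services.n8n.loadbalancer.sticky.cookie.name=n8n_sticky"
  - "traefik.http.services.n8n.loadbalancer.sticky.cookie.secure=true"
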


A practical blue/green runbook you can execute in minutes

This sequence works reliably on a single VPS with Compose and a proxy flip:

# 1) Pick the inactive main (example: n8n-b is inactive)
docker compose pull n8n-b
docker compose up -d n8n-b

# 2) Verify it becomes healthy (use your own check)
docker inspect --format='{{json .State.Health.Status}}' $(docker compose ps -q n8n-b)

# 3) Flip the reverse proxy to route traffic to n8n-b
#    (This step depends on your proxy: update upstream/labels and reload)

# 4) Drain old main (n8n-a) by removing it after traffic is confirmed stable
docker compose stop n8n-a
docker compose rm -f n8n-a

Common mistakes that quietly cause downtime

  • Updating the only “main” container: if there’s no second main, there’s always downtime.
  • Forgetting WEBHOOK_URL: external integrations register incorrect callback URLs and stop triggering.
  • Changing N8N_ENCRYPTION_KEY between replicas: credentials become unreadable across instances.
  • Not validating health before routing: your proxy routes traffic to a container still warming up.
  • Deploying big version jumps without staging: migrations happen under live traffic and amplify risk.

FAQ

Can you run multiple n8n “main” containers at the same time?

Yes, as long as they share the same database and the same N8N_ENCRYPTION_KEY. For higher throughput and safer scaling, use Queue Mode so workers handle executions while mains focus on web traffic and orchestration.


Do rolling updates break active webhooks?

They don’t have to. If your proxy always has one healthy upstream and you update with start-first behavior (Swarm) or blue/green flip (Compose), inbound webhooks keep landing on a live instance.


What’s the safest way to update when you’re running heavy workflows?

Use Queue Mode and scale workers separately. Update workers one at a time, then update mains. Keep a longer stop grace period so running executions have time to finish cleanly.
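
In Swarm terms, that ordering is just two explicit update commands, workers first; the service names follow the stack example earlier and the tag is a placeholder:

# Update workers first; update_config paces the rollout one task at a time
docker service update --image n8nio/n8n:<pinned-tag> n8n_n8n-worker

# Then update the web-facing mains once workers are healthy
docker service update --image n8nio/n8n:<pinned-tag> n8n_n8n-main
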


How do you prevent credential issues during rolling updates?

Keep N8N_ENCRYPTION_KEY identical across every instance, and don’t rotate it casually. If you must rotate secrets, plan it as a controlled maintenance event and validate credential decryption on a staging clone first.


Should you pin the n8n image tag instead of using latest?

Pinning a version reduces surprise changes and makes rollbacks predictable. Use a controlled update cadence, validate the version in staging, then deploy the exact same tag in production.
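
In practice that means replacing latest with an explicit tag in your stack or Compose file; the tag below is a placeholder for whatever version you validated in staging:

services:
  n8n-main:
    image: n8nio/n8n:<tested-version>   # deploy the exact tag you validated in staging, not latest
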


Can you do rolling updates without a reverse proxy?

Not reliably. You need something that can keep a stable public endpoint (TLS + routing) while containers come and go. That stable edge is what makes “start-first” and blue/green switches possible.



Conclusion

If you want the most reliable rolling updates, run two n8n mains behind a proxy and let Docker Swarm handle start-first updates with health checks. If you’re on a single VPS and prefer Compose, blue/green with a clean proxy flip gets you very close to zero downtime—just make health verification and WEBHOOK_URL configuration non-negotiable.

