Building an Enterprise Translation Platform: The Infrastructure Lessons We Learned the Hard Way

Translation technology is fascinating to build because the problem is genuinely complex. It is not enough to convert words — you need to handle dialect differences, right-to-left scripts, locale-specific formatting, translation memory for consistency, and reviewer workflows for quality control. The linguistic problem is interesting.

The infrastructure problem is less glamorous. But infrastructure failures are what actually stop a translation platform from working.

We built the Enterprise Translation System as a multi-service Docker application: separate containers for the API gateway, translation engine, memory storage, reviewer workflow, and a reverse proxy sitting in front of all of it. We hit three infrastructure failures in the first month that had nothing to do with translation logic. All three were entirely avoidable.

This is the documentation of those failures — what they looked like, why they happened, and what we changed.

The Architecture

The failures only make sense with the architecture in view.

[Figure: Enterprise Translation System container architecture. Caddy is the only external-facing service; the Translation API, Engine, Memory, Review Workflow and Redis are on the internal network only.]

Five containers. One reverse proxy. Internal networking between services. Clean on paper.

Failure 1: Port Conflict Took Down a Container at Startup

Symptom:

The Translation API container failed to start with:

Bind for 0.0.0.0:3001 failed: port is already allocated

We had recently added a separate client dashboard tool to the same server. It happened to also use port 3001. Docker refused to start the Translation API because the port was already claimed.

The Caddy reverse proxy had been running fine. Only the Translation API was down. Users could reach the proxy but got a 503 error on all translation endpoints.

Why it happened:

The docker-compose.yml file for the Translation API had:

# Simplified representation
ports:
  - "3001:3001"  # published on the host

This bound port 3001 on the host machine to the container’s port 3001. Because the dashboard tool had already claimed host port 3001, Docker rejected the Translation API's binding and the container never started.

The fundamental mistake was exposing a port to the host at all. The Translation API does not need to be reachable directly from the internet. Caddy is the only entry point. The API only needs to be reachable from Caddy — and Caddy is in the same Docker network.

The fix:

[Figure: Docker network configuration before and after. Before: ports published to the host, causing the port conflict. After: expose only, internal network, no host binding possible.]

The rule we now apply to every containerised service: only the reverse proxy exposes host ports. Backend services use internal Docker networking only.
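
A minimal sketch of what the "after" configuration might look like is given here. The service names, image names, and network name are illustrative assumptions, not the project's actual file.

```yaml
# Hypothetical docker-compose.yml sketch; names and images are assumptions.
services:
  caddy:
    image: caddy:2
    ports:
      - "80:80"      # the only host bindings in the stack
      - "443:443"
    networks:
      - internal

  translation-api:
    image: example/translation-api:latest   # placeholder image name
    expose:
      - "3001"       # documents the port; nothing is bound on the host
    networks:
      - internal

networks:
  internal:
    driver: bridge
```

With a layout like this, Caddy would proxy to translation-api:3001 by service name. Note that expose is informational: containers on the same network can reach each other's listening ports either way. What removes the conflict is the absence of a ports: section on the backend service.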

Failure 2: Containers Could Not Reach a Service Running Directly on the Host

Context:

During a transitional period, we were running the Translation Memory service directly on the host — not containerised yet. The other containers needed to connect to it.

Symptom:

The Translation Engine container logged:

Error: connect ECONNREFUSED 127.0.0.1:5432

And:

Error: dial tcp [::1]:5432: connect: connection refused

The Translation Memory was definitely running. psql from the host connected immediately. But from inside the container, localhost:5432 and 127.0.0.1:5432 both failed.

Why it happened:

This is a fundamental Docker networking concept that catches many teams the first time.

Inside a Docker container:
──────────────────────────
  localhost        = this container's loopback interface
  127.0.0.1        = this container's loopback interface
  
  NOT the host machine's localhost.
  NOT the host machine's 127.0.0.1.

  The container is an isolated network namespace.
  Its "localhost" has nothing at port 5432.
  The host's port 5432 is on a different network entirely.

The common suggestion for this problem is host.docker.internal, a name that Docker Desktop resolves to the host machine’s IP. It works out of the box on Docker Desktop for Mac and Windows. On Linux Docker Engine, where most production deployments run, the name is not defined by default, so results depend on how the host has been set up.

On our Linux server, host.docker.internal either did not resolve at all or resolved to the wrong network gateway.

The fix:

Correct approach for Linux Docker host access:
──────────────────────────────────────────────
Step 1: Find the Docker bridge gateway IP
  
  On the host:
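    docker network inspect bridge
    → Look for the "Gateway" field
    → 172.17.0.1 on a default install; Compose project
      networks have their own gateway (e.g. 172.19.0.1),
      so inspect the network your containers actually use
    
  This IP is reachable from containers as the "host",
  provided the host service listens on it and not only
  on 127.0.0.1.

Step 2: Use gateway IP in container configuration
  
  In docker-compose.yml environment:
    DB_HOST: 172.17.0.1  # host gateway IP, not localhost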
    
  Or, use Docker Compose extra_hosts:
    extra_hosts:
      - "host-services:172.17.0.1"
    
  Then connect to: host-services:5432

Step 3 (better long-term): Containerise the host service
  
  The real fix is to run the Translation Memory
  in Docker alongside the other services.
  Then all services share an internal network
  and communicate by service name:
    DB_HOST: translation-memory  # Docker service name
  No IP addresses needed.
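
As an alternative to hard-coding the gateway IP in Step 2, Docker Engine 20.10 and later accept the special value host-gateway in extra_hosts, which Docker replaces with the host's gateway address. The sketch below is offered as an option, not as what was deployed here; the image name and DB_PORT variable are assumptions.

```yaml
# Hypothetical sketch; image name and DB_PORT are assumptions.
services:
  translation-engine:
    image: example/translation-engine:latest   # placeholder image name
    extra_hosts:
      # host-gateway is resolved by Docker to the host's gateway IP
      - "host.docker.internal:host-gateway"
    environment:
      DB_HOST: host.docker.internal
      DB_PORT: "5432"
```

The host service still has to listen on an address reachable from the Docker bridge; a database bound only to 127.0.0.1 on the host would keep refusing connections.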

We containerised the Translation Memory service that weekend. Host access is now a solved problem — everything runs in the same Docker network.
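
A containerised layout along these lines might look like the sketch below. It assumes the Translation Memory store is PostgreSQL (suggested by the psql and port 5432 details above); the image tags, the TM_DB_PASSWORD variable, and the volume name are hypothetical.

```yaml
# Hypothetical sketch; images, credentials variable, and volume are assumptions.
services:
  translation-engine:
    image: example/translation-engine:latest   # placeholder image name
    environment:
      DB_HOST: translation-memory   # resolved by Docker's embedded DNS
      DB_PORT: "5432"
    depends_on:
      - translation-memory
    networks:
      - internal

  translation-memory:
    image: postgres:16              # assumed from the psql / 5432 details
    environment:
      POSTGRES_PASSWORD: ${TM_DB_PASSWORD}   # hypothetical variable
    volumes:
      - tm-data:/var/lib/postgresql/data
    networks:
      - internal                    # no ports: section, so nothing on the host

networks:
  internal:

volumes:
  tm-data:
```

Because both services share the internal network, the engine connects to translation-memory:5432 by name and no gateway IP appears anywhere in the configuration.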

Failure 3: Updating `.env` Did Not Update the Running Container

Context:

We had configured API keys and model endpoints in a .env file. After switching to a higher-tier translation model, we updated .env with the new endpoint and key — then ran:

docker compose restart translation-engine

The container restarted. The new model endpoint was not being used. The old API key was still active. Requests were going to the old endpoint.

Symptom:

This failure had no error. The service ran fine. Logs showed successful translation requests. But the translations were coming from the old model — noticeable because quality characteristics had not changed.

We spent over an hour convinced the model switch was not working, testing different configurations, before realising the environment had not changed at all.

Why it happened:

docker compose restart — what it actually does:
────────────────────────────────────────────────
  Stops the container process
  Starts the container process
  
  Does NOT recreate the container.
  Does NOT re-read the env_file directive.
  The container's environment is frozen at creation time.
  
  .env changes are invisible to docker compose restart.

Timeline of what we thought happened:
  Update .env  →  restart  →  new env vars active

Timeline of what actually happened:
  Update .env  →  restart  →  same container, same env vars
  .env file changed on disk  
  container does not know and does not care

env_file: in Docker Compose is only read when the container is created, not when it is restarted.
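
For reference, the two usual ways values reach a container are sketched below. The image name and the MODEL_ENDPOINT variable are assumptions for illustration; the project's real variable names are not given in this article.

```yaml
# Hypothetical sketch; image and variable names are assumptions.
services:
  translation-engine:
    image: example/translation-engine:latest   # placeholder image name
    # Every KEY=value line in the file becomes a container env var.
    # Read when the container is created, not on `restart`.
    env_file:
      - .env
    environment:
      # ${...} is substituted by Compose from the shell or the project's
      # .env file when the command runs; the resolved value is then
      # fixed in the container at creation time as well.
      MODEL_ENDPOINT: ${MODEL_ENDPOINT}
```

In both cases the values are baked into the container when it is created, which is why restarting an existing container cannot pick up an edited file.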

The fix:

Correct commands for env var changes:
───────────────────────────────────────

# WRONG — does not re-read env_file
docker compose restart translation-engine

# CORRECT — recreates container, re-reads env_file
docker compose up -d --force-recreate translation-engine

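# ALSO CORRECT: stops and removes the container, then starts fresh
# (passing a service name to `down` needs a recent Compose v2;
#  `docker compose rm -sf translation-engine` is the older equivalent)
docker compose down translation-engine
docker compose up -d translation-engine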

# Verify env vars are what you expect (before going to prod):
docker compose exec translation-engine env | grep API_KEY

We now add this to every deployment runbook: “If you changed .env, use --force-recreate, not restart.”

The Pattern Across All Three Failures

Looking at these three failures together, they share a common root cause: a gap between what we assumed Docker did and what Docker actually does.

Mental model vs reality:

Failure 1:
  Assumed: Every service needs a port published on the host
  Reality: Services on the same Docker network reach each
           other by service name; published ports only add
           host-level conflicts

Failure 2:
  Assumed: localhost means the host machine
  Reality: Inside a container, localhost is the container

Failure 3:
  Assumed: restart = fresh start with current config
  Reality: restart = same container, same frozen config

Each assumption is reasonable if you have not used Docker deeply. Each is wrong. And each was invisible until it caused an outage.

What We Added to the Team’s Docker Runbook

After these three failures, we added a Docker Infrastructure Checklist to the project’s runbook. Any new containerised service must pass these checks before going to production:

[Figure: Docker Deployment Checklist covering networking, host access, environment variables, and pre-deployment verification steps.]

The Translation Platform Today

The Enterprise Translation System now handles:

Supported workflows:
  ─ Document translation (PDF, DOCX, HTML)
  ─ Translation memory (consistency across documents)
  ─ Human review queue (reviewer assignment + approval)
  ─ Multi-variant support (e.g., Simplified and Traditional Chinese)
  ─ Right-to-left layout handling (Arabic, Hebrew, Urdu)
  ─ API integration (webhook callbacks on job completion)

Performance after infrastructure fixes:
  Service uptime:         99.7%
  Port conflict incidents: 0 (since removing host bindings)
  Env var incidents:      0 (since updating runbook)
  Failed deployments:     ~1/month (down from ~1/week)

The translation logic — the part that is actually hard — was never the problem. The infrastructure was the problem, and infrastructure problems are fixable with the right patterns.

For Businesses Building on Docker

If you are containerising a multi-service application — whether a translation platform, a content management system, or any microservices architecture — these three failure patterns are worth bookmarking.

They are not obvious from the Docker documentation. They only appear when your assumptions about networking and environment management are tested in production.

If you are building a multilingual platform, localisation system, or any application that needs to serve content in multiple languages, the Saya team works with Australian businesses across all stages of multilingual implementation — from initial architecture to deployment to ongoing translation operations.
