Module A-2: Production Operations & Observability — Docker In-Depth

Module A-2·25 min read

CPU/Memory resource limits, logging drivers, container metrics (docker stats), and robust health checks.

Introduction

If a container crashes in production and no one is around to see it, how do you know what went wrong?

In Module 1, we learned that containers share the host machine's hardware. Without proper limits, a single memory-leaking container can bring down the entire server. Without proper logging, debugging is impossible.

In this module, we will explore production operations: implementing Resource Limits (cgroups), managing Logging Drivers, analyzing Container Metrics, and writing robust Healthchecks.

Resource Limits (cgroups in Action)

By default, a Docker container has no resource constraints and can use as much of the host's CPU and memory as the kernel scheduler allows. If you deploy a Node.js API to an 8GB VPS and a bug causes an infinite loop that keeps allocating memory, the container can consume all 8GB of RAM. The host then runs out of memory, and the Linux OOM killer starts terminating processes, which may include other containers and critical system services.

You must place boundaries on your containers.

Memory Limits

yaml
# compose.yaml
services:
  api:
    image: my-api
    deploy:
      resources:
        limits:
          memory: 512M
        reservations:
          memory: 128M

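Limit: The absolute maximum memory allowed. If the container tries to exceed 512MB and the kernel cannot reclaim enough memory, the Linux Out Of Memory (OOM) killer terminates a process in the container with SIGKILL, which Docker reports as Exit Code 137 (128 + signal 9).
Reservation: A soft limit. It is not a hard guarantee: when the host is short on memory, the kernel tries to reclaim memory from containers that exceed their reservation, and Swarm uses the value to schedule the service only on nodes with at least 128MB available.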
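
Exit code 137 follows the common convention of 128 plus the signal number, and signal 9 is SIGKILL. As a minimal sketch (the function name `describeExitCode` is hypothetical and not part of Docker or Node.js), a deploy script could decode a container's exit code like this. A manual docker kill produces the same code, so the `likelyOomKill` flag is a hint rather than proof.

```javascript
const os = require("node:os");

// Convention: a process killed by signal N is reported as exit code 128 + N.
function describeExitCode(code) {
  if (code <= 128) {
    return { code, signal: null, likelyOomKill: false };
  }
  const signalNumber = code - 128;
  // os.constants.signals maps names such as "SIGKILL" to their numbers.
  const entry = Object.entries(os.constants.signals)
    .find(([, number]) => number === signalNumber);
  const signal = entry ? entry[0] : null;
  return { code, signal, likelyOomKill: signal === "SIGKILL" };
}
```

Here `describeExitCode(137).signal` is `"SIGKILL"`. To confirm that the kill really came from the memory limit, check the container's `State.OOMKilled` field with docker inspect.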

CPU Limits

yaml
    deploy:
      resources:
        limits:
          cpus: '0.50'

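If you specify 0.50, Docker configures the cgroup's CPU quota so the container gets at most 50 ms of CPU time in every 100 ms scheduling period, which is equivalent to half of one CPU core. The time can be spread across several cores, but the total never exceeds the quota.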
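
The arithmetic behind a cpus value can be sketched as follows. This assumes Docker's default scheduling period of 100,000 microseconds; the helper name `cpusToQuota` is hypothetical and only illustrates the conversion, it is not a Docker API.

```javascript
// Docker's default CFS scheduling period is 100 ms, expressed in microseconds.
const DEFAULT_PERIOD_US = 100_000;

// Convert a `cpus` value into the CPU time (in microseconds) the container
// may consume during each scheduling period.
function cpusToQuota(cpus, periodUs = DEFAULT_PERIOD_US) {
  if (!(cpus > 0)) {
    throw new RangeError("cpus must be a positive number");
  }
  return { periodUs, quotaUs: Math.round(cpus * periodUs) };
}
```

For example, `cpusToQuota(0.5)` yields a quota of 50000 microseconds per 100000 microsecond period, and `cpusToQuota(2)` yields 200000, meaning up to two cores' worth of time per period.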

[!TIP] Node.js and CPU Limits: Node.js runs your JavaScript on a single thread. If you allocate cpus: '2.0', your application code still uses only about one core unless you use the cluster module or worker_threads.
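
A minimal sketch of the cluster approach is shown here, assuming you pass the container's CPU limit in explicitly (for example from an environment variable you define yourself). The names `workerCountFor`, `runClustered`, and `startApp` are hypothetical. The limit is passed in because `os.cpus().length` typically reports the host's cores rather than the container's quota.

```javascript
const cluster = require("node:cluster");

// One worker per whole CPU granted to the container, and never fewer than one.
function workerCountFor(cpuLimit) {
  return Math.max(1, Math.floor(cpuLimit));
}

// Fork workers in the primary process; run the application in each worker.
// `startApp` is a hypothetical function that starts your HTTP server.
function runClustered(cpuLimit, startApp) {
  if (cluster.isPrimary) {
    for (let i = 0; i < workerCountFor(cpuLimit); i += 1) {
      cluster.fork();
    }
    // Replace a worker whenever one exits so capacity stays constant.
    cluster.on("exit", () => cluster.fork());
  } else {
    startApp();
  }
}
```

With cpus: '2.0' you would call `runClustered(2.0, startApp)` and get two workers; with cpus: '0.50' the sketch falls back to a single worker.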

Container Metrics and Observability

How do you know what limits to set? You need to observe how your container behaves under load.

`docker stats`

The easiest way to view real-time metrics is the built-in stats command:

bash
docker stats

This opens a live dashboard in your terminal showing:

CPU % usage
Memory usage vs Limit
Network I/O
Block (Disk) I/O

If memory creeps up continuously over hours without ever dropping, even after traffic subsides, your Node.js application likely has a memory leak.
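
If you collect the memory column periodically, a small script can flag that pattern. The sketch below assumes the field looks like `256MiB / 512MiB` (used, then limit, with binary units), which is how docker stats usually prints it; treat that format and the helper names as assumptions. The trend check is deliberately naive, since real memory usage fluctuates with garbage collection.

```javascript
const UNIT_BYTES = { B: 1, KiB: 1024, MiB: 1024 ** 2, GiB: 1024 ** 3 };

// Parse a size such as "21.5MiB" into bytes.
function parseSize(text) {
  const match = /^([\d.]+)\s*(B|KiB|MiB|GiB)$/.exec(text.trim());
  if (!match) {
    throw new Error(`Unrecognized size: ${text}`);
  }
  return Number(match[1]) * UNIT_BYTES[match[2]];
}

// Parse a "used / limit" memory field into bytes and a usage fraction.
function parseMemUsage(field) {
  const [usedBytes, limitBytes] = field.split("/").map((part) => parseSize(part));
  return { usedBytes, limitBytes, fraction: usedBytes / limitBytes };
}

// True when the series ends higher than it started and never decreases.
function growsWithoutDropping(values) {
  if (values.length < 2) {
    return false;
  }
  const neverDrops = values.every((v, i) => i === 0 || v >= values[i - 1]);
  return neverDrops && values[values.length - 1] > values[0];
}
```

For instance, `parseMemUsage("256MiB / 512MiB").fraction` is `0.5`, and `growsWithoutDropping([100, 120, 150])` is `true` while `growsWithoutDropping([100, 150, 120])` is `false`.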

Advanced Observability (Prometheus)

In serious production environments, docker stats is not enough. You need historical data. A widely used approach is to run Prometheus alongside your containers.

You can configure the Docker daemon to expose its internal metrics to Prometheus by setting metrics-addr in daemon.json. These are engine-level metrics; per-container CPU and memory figures usually come from an exporter such as cAdvisor. Prometheus scrapes this data at a configured interval (often every 15 seconds) and stores it. You then use a tool like Grafana to build dashboards and set up alerts (e.g., "Slack the engineering team if the API uses more than 80% CPU for 5 minutes").
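
The "for 5 minutes" part of such an alert means the condition must hold continuously, not just at a single scrape. The following sketch illustrates that idea on an array of scraped samples. It is not Prometheus's implementation, and the function name and sample shape (`timestampMs`, `value`) are assumptions made for illustration.

```javascript
// Returns true when the value stays above `threshold` for an unbroken run
// lasting at least `windowMs`. Samples must be sorted by `timestampMs`.
function breachedFor(samples, threshold, windowMs) {
  let streakStart = null;
  for (const { timestampMs, value } of samples) {
    if (value > threshold) {
      if (streakStart === null) {
        streakStart = timestampMs;
      }
      if (timestampMs - streakStart >= windowMs) {
        return true;
      }
    } else {
      // Any sample at or below the threshold resets the streak.
      streakStart = null;
    }
  }
  return false;
}
```

Calling `breachedFor(samples, 80, 5 * 60 * 1000)` on CPU percentages would fire only after five uninterrupted minutes above 80%, so a brief spike does not page anyone.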

Centralized Logging

By default, Docker captures everything written to stdout and stderr and stores it in a JSON file on the host disk.

bash
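# Print the path of the container's log file
docker inspect --format='{{.LogPath}}' my-container
# Check how large that file has grown (reading it usually requires root)
sudo du -h "$(docker inspect --format='{{.LogPath}}' my-container)"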

If your application logs heavily, this default JSON file can grow to hundreds of gigabytes and completely fill the host's hard drive.

Log Rotation

You should strictly limit log file sizes using the json-file logging driver options:

yaml
# compose.yaml
services:
  api:
    image: my-api
    logging:
      driver: "json-file"
      options:
        max-size: "10m" # Rotate once the current file reaches 10MB
        max-file: "3"   # Keep at most 3 files; the oldest is deleted

External Logging Drivers

In a multi-server deployment, you don't want to SSH into individual machines to run docker logs. You want all logs flowing into a centralized system like Elasticsearch, Datadog, or AWS CloudWatch.

Docker supports plugging in external logging drivers:

yaml
    logging:
      driver: "syslog"
      options:
        syslog-address: "tcp://logs.papertrailapp.com:12345"

When configured this way, Docker streams logs directly to your observability platform in real time.
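
Whichever driver you choose, Docker only sees what the application writes to stdout and stderr. One possible way to make those lines easy for a central system to parse is to emit one JSON object per line, as in this sketch. The function names and field names (`time`, `level`, `message`) are assumptions, not a format that Docker requires.

```javascript
// Build one JSON log line. `now` is injectable so the output is predictable.
function formatLogLine(level, message, fields = {}, now = new Date()) {
  return JSON.stringify({ time: now.toISOString(), level, message, ...fields });
}

// Errors go to stderr, everything else to stdout; Docker captures both streams.
function log(level, message, fields) {
  const stream = level === "error" ? process.stderr : process.stdout;
  stream.write(formatLogLine(level, message, fields) + "\n");
}
```

A call such as `log("info", "server started", { port: 3000 })` writes a single line that the json-file or syslog driver forwards unchanged.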

Robust Healthchecks

A container is considered "Running" as long as PID 1 hasn't exited. But what if your Node.js application is technically running but has lost its connection to the database and is returning 500 Internal Server Error to every user?

Docker needs a way to verify if the application is actually healthy.

You can build a HEALTHCHECK directly into your Dockerfile:

dockerfile
# Run this command every 30 seconds.
# If it fails 3 times in a row, mark the container as 'unhealthy'
HEALTHCHECK --interval=30s --timeout=3s --retries=3 \
  CMD wget --no-verbose --tries=1 --spider http://localhost:3000/api/health || exit 1

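If the container is marked as unhealthy, Docker Swarm will automatically kill it and spin up a fresh instance. A standalone Docker Engine only reports the status (visible in docker ps) and does not restart the container on its own, and Kubernetes ignores the Dockerfile HEALTHCHECK in favor of its own liveness and readiness probes.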

[!IMPORTANT] Your /api/health route should actually test internal systems! It should verify a SELECT 1 against the database and ping Redis. If those fail, the route should return a 500 status code, failing the Docker healthcheck.
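
A health route along those lines could look roughly like the sketch below, which uses only Node's built-in http module. The function names are hypothetical, and the actual database and Redis checks are passed in by the caller, because the clients your application uses are not specified here.

```javascript
const http = require("node:http");

// Run every dependency check in parallel; a check passes if its promise resolves.
async function runChecks(checks) {
  const names = Object.keys(checks);
  // The async wrapper turns synchronous throws into rejections as well.
  const settled = await Promise.allSettled(names.map(async (name) => checks[name]()));
  const results = {};
  settled.forEach((outcome, i) => {
    results[names[i]] = outcome.status === "fulfilled" ? "ok" : "failed";
  });
  const healthy = settled.every((outcome) => outcome.status === "fulfilled");
  return { healthy, results };
}

// Build a server that answers /api/health with 200 or 500. Call .listen() yourself.
function createHealthServer(checks) {
  return http.createServer(async (req, res) => {
    if (req.url !== "/api/health") {
      res.writeHead(404).end();
      return;
    }
    const { healthy, results } = await runChecks(checks);
    res.writeHead(healthy ? 200 : 500, { "Content-Type": "application/json" });
    res.end(JSON.stringify(results));
  });
}

// Example wiring (hypothetical `pool` and `redis` clients):
// createHealthServer({
//   database: () => pool.query("SELECT 1"),
//   redis: () => redis.ping(),
// }).listen(3000);
```

Because the route returns 500 when any check fails, the wget command in the HEALTHCHECK exits non-zero and Docker counts it as a failed probe. In practice you would likely also give each check its own timeout shorter than the HEALTHCHECK timeout.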

Key Takeaways

Memory Limits: Always set memory limits to prevent a single container from causing a host-level Out Of Memory crash.
Metrics: Use docker stats for local debugging and integrate Prometheus for historical production tracking.
Log Rotation: Configure the json-file logging driver with max-size and max-file to prevent logs from eating the entire host disk.
Healthchecks: Provide Docker with a way to verify application health, not just process state.

Knowledge Check

Question 1: If you set a memory limit of 512M on a container, and the application attempts to allocate 600M, what will happen?

A) Docker will write the excess memory to disk using Swap space.
B) The Linux kernel's OOM Killer will forcefully terminate the container process (Exit Code 137).
C) Docker will throw a Node.js exception that you can catch in a try/catch block.
D) The host machine will crash.

Correct Answer: B

When a process hits the hard memory limit defined by its cgroup, the Linux kernel protects the rest of the system by instantly executing an OOM (Out Of Memory) Kill. The application has no time to run cleanup code; it is destroyed immediately.

Question 2: What is the primary operational risk of relying solely on the default Docker logging configuration in a long-running production environment without applying any additional options?

A) The logs will not capture stderr output, masking critical application crashes.
B) Docker writes all logs to an unrotated JSON file on the host's filesystem; over time, a high-traffic application can generate hundreds of gigabytes of logs, eventually exhausting the host's disk space and crashing the server.
C) The default logging driver encrypts the log data, making it impossible to read with the standard docker logs command.
D) Docker will automatically delete logs older than 24 hours to save space, meaning historical debugging data is lost.

Correct Answer: B

By default, the json-file logging driver simply appends stdout and stderr to a file on the host. It does not rotate or cap the size of this file automatically. In production, you must explicitly configure max-size and max-file options in your Compose file, or configure the Docker daemon to use an external logging driver (like syslog or awslogs) to prevent catastrophic disk exhaustion.

Question 3: A Node.js API container experiences a connection timeout to its database. The Node process remains alive (PID 1 hasn't crashed), but all incoming HTTP requests return 500 Internal Server Error. Why is defining a custom HEALTHCHECK directive essential for handling this scenario automatically?

A) The HEALTHCHECK will automatically restart the database container to fix the connection.
B) Without a HEALTHCHECK, Docker only monitors the process state; it assumes the container is perfectly fine since the Node process is still running, meaning orchestration tools won't know to replace the degraded instance.
C) A HEALTHCHECK intercepts 500 errors and returns cached responses to users until the database recovers.
D) It prevents Docker from writing the continuous stream of 500 errors to the log file.

Correct Answer: B

Docker's default monitoring is naive: if the main process is running, the container is considered "up." A custom HEALTHCHECK (which actually hits an internal /health route validating dependencies) provides application-level insight. If the database connection fails, the health check fails, the container is marked as "unhealthy," and an orchestrator like Docker Swarm can automatically intervene and replace the container to attempt a recovery. Kubernetes ignores the Dockerfile HEALTHCHECK and achieves the same result with its own liveness and readiness probes.
