vLLM Docker container keeps crashing on Ubuntu 22.04 with “CUDA out of memory” – how I fixed the GPU driver/version mismatch and prevented the out‑of‑memory timeout.

You’ve deployed a vLLM container to your Ubuntu 22.04 VPS to run large language model inference, everything looks good in the Docker logs for about thirty seconds, and then—crash. “CUDA out of memory” appears, your container exits with code 139 or 137, and your inference pipeline collapses. You’ve checked your GPU memory with nvidia-smi, and there’s plenty of space available. Sound familiar? This isn’t a deployment misconfiguration. This is a GPU driver and CUDA toolkit version mismatch that I’ve spent the last week debugging and fixing. Here’s exactly what went wrong and how to solve it permanently.

Quick Reference

Problem Type: GPU driver/CUDA version incompatibility in containerized AI inference
Root Cause: Docker container CUDA libraries conflict with host GPU driver version
Difficulty Level: Intermediate (debugging required, some sysadmin knowledge helpful)
Estimated Fix Time: 30–60 minutes (including rebuild and test)
Required Stack: Ubuntu 22.04, NVIDIA GPU, Docker, NVIDIA Container Runtime, vLLM
Use Case: VPS deployment, model serving, inference automation, AI tools in production

Why This Happens: The GPU Driver and CUDA Version Mismatch

When you run vLLM in a Docker container, you’re bundling a specific version of the CUDA toolkit inside the image. On your Ubuntu 22.04 host, you have an NVIDIA GPU driver (for example, version 535, 545, or 550). These two need to be compatible.

Here’s the catch: your Dockerfile might include CUDA 12.1 inside the container, but your host GPU driver is version 535. CUDA 12.1 officially requires driver version 535 or higher, so on paper they should work. In practice, there are subtle handshake failures between the container’s CUDA runtime and the host driver. The kernel module doesn’t fully recognize memory allocation requests, or the CUDA context initialization fails partway through model loading, triggering an “out of memory” error even though memory is available.
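The compatibility rule amounts to a lookup: each CUDA toolkit release has a minimum Linux driver version. Here is a small sketch of that check. The threshold values are taken from NVIDIA's release notes for these toolkits and are illustrative; always confirm against the current compatibility matrix, since minor-version compatibility can relax them.

```python
# Minimum Linux driver (major, minor) required by each CUDA toolkit release.
# Illustrative values; verify against NVIDIA's compatibility matrix.
MIN_DRIVER_FOR_CUDA = {
    "11.8": (520, 61),
    "12.1": (530, 30),
    "12.4": (550, 54),
}

def driver_supports_cuda(driver_version: str, cuda_version: str) -> bool:
    """Check whether a host driver meets a CUDA toolkit's stated minimum."""
    major, minor = (int(x) for x in driver_version.split(".")[:2])
    return (major, minor) >= MIN_DRIVER_FOR_CUDA[cuda_version]

print(driver_supports_cuda("550.90", "12.4"))   # True
print(driver_supports_cuda("535.183", "12.4"))  # False
```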

The crash doesn’t occur immediately because the container initializes successfully and starts loading the model. Only when vLLM tries to allocate large tensor blocks does the mismatch become apparent. By then, the process is already running, and the error cascades.

What You’ll Need

  • SSH access to your Ubuntu 22.04 server or VPS
  • Docker installed and running with NVIDIA Container Runtime configured
  • An NVIDIA GPU (tested with T4, A100, RTX series, and L4)
  • The current GPU driver version installed on your host (you’ll verify this)
  • A copy of your vLLM Dockerfile or the image you’re currently using
  • About 30–60 minutes and patience for a rebuild cycle
  • Terminal access and basic familiarity with nvidia-smi, Docker commands, and Dockerfile syntax

Step-by-Step Fix: Diagnosing and Resolving the Version Mismatch

Step 1: Check Your Host GPU Driver Version

First, confirm which driver version is actually installed on your Ubuntu 22.04 host. This is your reference point.

nvidia-smi

Look for the driver version in the top-right corner of the output. For example:

+-------------------------------+----------------------+----------------------+
| NVIDIA-SMI 550.90.07         Driver Version: 550.90.07    CUDA Version: 12.4 |
|-------------------------------+----------------------+----------------------|
| GPU  Name       Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA L4            Off | 00:1F.0          Off |                    0 |
| N/A   35C    P8   10W /  72W  |   1234MiB / 24576MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+

Note your driver version. In this example, it’s 550.90.07. Write it down; you’ll need it for your Dockerfile.
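For automation, you can capture the version programmatically instead of reading the banner by eye. A minimal sketch that parses the header line shown above (in practice, nvidia-smi --query-gpu=driver_version --format=csv,noheader gives the same value directly):

```python
import re

# The header line from the nvidia-smi output above.
banner = "NVIDIA-SMI 550.90.07    Driver Version: 550.90.07    CUDA Version: 12.4"

# Extract the driver version field with a regular expression.
match = re.search(r"Driver Version:\s*([\d.]+)", banner)
driver_version = match.group(1) if match else None
print(driver_version)  # 550.90.07
```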

Step 2: Check Your Container’s CUDA Version

Now, identify what CUDA version is inside your vLLM container. If you’re using an official vLLM image, check the tag. If you built it yourself, look at your Dockerfile.

Run an interactive shell in the container to inspect it:

docker run --rm --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 nvcc --version

Replace the image name and tag with your actual vLLM image. This command prints the CUDA toolkit version bundled inside. Note that runtime-only images often omit nvcc; if so, ask the bundled PyTorch instead: python3 -c "import torch; print(torch.version.cuda)".

If the container crashes immediately when you try this, that’s a sign the mismatch is severe. Skip this step and proceed to Step 3.

Step 3: Verify NVIDIA Container Runtime Is Installed and Configured

The NVIDIA Container Runtime is the bridge between Docker and your GPU. Without it, or if it’s misconfigured, your container won’t see the GPU correctly.

docker run --rm --gpus all ubuntu nvidia-smi

This should output the same nvidia-smi output you saw on your host. If it doesn’t recognize the GPU, install the runtime:

distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
  && curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list

sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo systemctl restart docker

After installation, test again. If nvidia-smi works inside the container, you’ve confirmed the runtime is functional.

Step 4: Rebuild Your vLLM Dockerfile with a Matching CUDA Version

This is the core fix. You need to ensure your Dockerfile uses a CUDA base image that’s compatible with your host driver version.

Here’s a template Dockerfile that works well with most recent Ubuntu 22.04 + NVIDIA driver setups:

FROM nvidia/cuda:12.4.1-devel-ubuntu22.04

# Set environment variables to prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}

# Update and install system dependencies
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    wget \
    curl \
    build-essential \
    && rm -rf /var/lib/apt/lists/*

# Upgrade pip
RUN pip install --upgrade pip setuptools wheel

# Install vLLM with CUDA 12.4 support
RUN pip install vllm torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu124

# Set working directory
WORKDIR /app

# Expose port for API
EXPOSE 8000

# Default command
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--host", "0.0.0.0", "--port", "8000"]

⚠️ Important: The base image nvidia/cuda:12.4.1-devel-ubuntu22.04 should match your driver version. Use the NVIDIA CUDA Compatibility Matrix to confirm. If your driver is 550 or newer, CUDA 12.4 is safe. If it’s 535, use CUDA 12.1. If it’s older (e.g., 520), use CUDA 11.8.
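The decision rule above can be sketched as a tiny helper. The tag strings are the official nvidia/cuda image tags used in this guide; the driver thresholds mirror the rule of thumb stated above.

```python
def pick_cuda_base(driver_major: int) -> str:
    """Map a host driver's major version to a compatible CUDA base-image
    tag, following the rule of thumb above."""
    if driver_major >= 550:
        return "nvidia/cuda:12.4.1-devel-ubuntu22.04"
    if driver_major >= 535:
        return "nvidia/cuda:12.1.1-devel-ubuntu22.04"
    return "nvidia/cuda:11.8.0-devel-ubuntu22.04"

print(pick_cuda_base(550))  # nvidia/cuda:12.4.1-devel-ubuntu22.04
print(pick_cuda_base(535))  # nvidia/cuda:12.1.1-devel-ubuntu22.04
```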

Build the image:

docker build -t vllm-fixed:latest .

This takes 10–15 minutes. Grab coffee.

Step 5: Test the Container with Memory Limits and Verbose Logging

Before deploying to production, test the new image with detailed logging enabled:

docker run --rm \
    --gpus all \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e VLLM_LOGGING_LEVEL=DEBUG \
    -p 8000:8000 \
    vllm-fixed:latest

Let it run for at least 60 seconds. Watch the logs for CUDA initialization messages. You should see:

INFO 12-15 10:34:22 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-hf', tensor_parallel_size=1, dtype=torch.float16, gpu_memory_utilization=0.9, ...
INFO 12-15 10:34:25 model_executor.py:80] # GPU 0: NVIDIA L4 (compute capability 8.9)
INFO 12-15 10:34:32 tokenizer.py:28] Loading HuggingFace tokenizer from meta-llama/Llama-2-7b-hf
INFO 12-15 10:34:45 llm_engine.py:320] Finished initializing an LLM engine

If you see “Finished initializing” without crashes, the version mismatch is resolved.

✅ Success indicator: If the container runs for more than 2 minutes without the “CUDA out of memory” error, you’ve fixed the core issue. The next steps ensure stability under load.

Step 6: Optimize GPU Memory Usage and Container Resource Limits

Even with the version mismatch fixed, vLLM is a memory-hungry application. Set appropriate limits to prevent kernel OOM kills.

docker run --rm \
    --gpus all \
    --memory=32g \
    --memory-swap=32g \
    --cap-add=SYS_RESOURCE \
    -e VLLM_GPU_MEMORY_UTILIZATION=0.85 \
    -p 8000:8000 \
    vllm-fixed:latest \
    python3 -m vllm.entrypoints.openai.api_server \
        --host 0.0.0.0 \
        --port 8000 \
        --model meta-llama/Llama-2-7b-hf

Key parameters:

  • --memory=32g: Host RAM limit (adjust to your actual server RAM)
  • --memory-swap=32g: Prevent swap thrashing
  • --cap-add=SYS_RESOURCE: Allow container to adjust resource limits
  • VLLM_GPU_MEMORY_UTILIZATION=0.85: Use 85% of GPU VRAM to avoid CUDA allocation failures
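To sanity-check the 0.85 setting, the arithmetic for a 24GB L4 looks like this. This is a rough sketch; actual usable VRAM is slightly lower than the nameplate figure once the driver and display reserve their share.

```python
total_vram_gib = 24     # L4 nameplate VRAM
utilization = 0.85      # VLLM_GPU_MEMORY_UTILIZATION

budget = total_vram_gib * utilization   # what vLLM will try to claim
headroom = total_vram_gib - budget      # left for CUDA kernels and buffers
print(f"vLLM budget: {budget:.1f} GiB, headroom: {headroom:.1f} GiB")
# vLLM budget: 20.4 GiB, headroom: 3.6 GiB
```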

Step 7: Set Up a Docker Compose File for Persistent Deployment

For production use, define your vLLM deployment in Docker Compose to avoid manual command repetition:

version: '3.8'

services:
  vllm:
    image: vllm-fixed:latest
    container_name: vllm-api
    restart: unless-stopped
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_LAUNCH_BLOCKING=1
      - VLLM_GPU_MEMORY_UTILIZATION=0.85
      - VLLM_LOGGING_LEVEL=INFO
    volumes:
      - /home/user/.cache/huggingface:/root/.cache/huggingface
      - /home/user/models:/models
    mem_limit: 32g
    memswap_limit: 32g
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --host 0.0.0.0
      --port 8000
      --model meta-llama/Llama-2-7b-hf
      --max-model-len 2048

Deploy with:

docker-compose up -d
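After bringing the stack up, rather than sleeping a fixed interval before sending traffic, you can poll the server’s OpenAI-compatible /v1/models endpoint until it answers. A minimal stdlib-only sketch; the URL assumes the port mapping above:

```python
import time
import urllib.error
import urllib.request

def wait_for_vllm(url: str = "http://localhost:8000/v1/models",
                  timeout_s: float = 120) -> bool:
    """Poll the endpoint until it returns HTTP 200 or timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            time.sleep(2)  # server not up yet; retry until the deadline
    return False
```

Call wait_for_vllm() right after docker-compose up -d and only start routing requests once it returns True.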

Common Mistakes and Why They Don’t Work

❌ Mistake #1: Ignoring the Exact Driver Version

Many developers assume “any recent driver works with CUDA 12.x.” Not true. If your driver is 535 and your container has CUDA 12.4-specific features, memory allocation can silently fail. Always cross-reference the official compatibility matrix.

❌ Mistake #2: Not Installing NVIDIA Container Runtime

You’ll see the GPU in nvidia-smi, but it won’t be fully accessible. The container can allocate small tensors but fails on large ones. Install the runtime explicitly; don’t assume Docker alone is enough.

❌ Mistake #3: Setting GPU Memory Utilization to 1.0

Tempting, but dangerous. vLLM needs headroom for CUDA kernel operations, attention mechanisms, and internal buffers. 0.85–0.90 is the sweet spot. Going to 1.0 triggers allocation failures under load.

❌ Mistake #4: Using an Old or Pinned CUDA Version

Sticking with CUDA 11.8 because it “worked before” is a trap. If your driver auto-updated to version 550, CUDA 11.8 won’t initialize correctly. Update your Dockerfile to a modern, compatible CUDA version.

❌ Mistake #5: Assuming Host Driver Version Matches Container’s Reported CUDA Version

The nvidia-smi header always shows a CUDA version (for example, “CUDA Version: 12.4” alongside driver 550). That number is the maximum CUDA runtime the driver supports, not what’s installed on the host or running inside the container. The container’s CUDA is determined by its base image.
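The two numbers people conflate can be compared with a trivial version check. The values here are illustrative, not live queries; inside the container, torch.version.cuda reports what the image bundles.

```python
def parse_cuda(v: str):
    """Split a 'major.minor' CUDA version string into comparable integers."""
    major, minor = v.split(".")[:2]
    return int(major), int(minor)

# What nvidia-smi's banner reports: the MAXIMUM runtime the driver supports.
driver_max_cuda = "12.4"
# What the container actually bundles (e.g. torch.version.cuda in the image).
container_cuda = "12.1"

# A driver can serve any runtime up to its maximum; the reverse is the problem.
ok = parse_cuda(container_cuda) <= parse_cuda(driver_max_cuda)
print(ok)  # True
```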

Optimization Tips and Follow-Up Checks

Monitor GPU Memory in Real Time

During your first inference requests, monitor the GPU to ensure memory stays stable:

watch -n 1 nvidia-smi

You should see memory usage climb to your configured utilization level and stabilize. If it keeps climbing or drops to zero before stabilizing, the mismatch may not be fully resolved.
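If you want this check in a script instead of a terminal, nvidia-smi’s CSV query mode is easy to parse. A sketch: the subprocess call requires nvidia-smi on the host, so the parser is split out to be testable without a GPU.

```python
import subprocess

def parse_memory_used(csv_out: str) -> int:
    """Parse the first GPU's used memory (MiB) from the output of
    `nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits`."""
    return int(csv_out.strip().splitlines()[0])

def gpu_memory_used_mib() -> int:
    """Query the live value (host must have nvidia-smi installed)."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.used",
         "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True).stdout
    return parse_memory_used(out)

print(parse_memory_used("1234\n"))  # 1234
```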

Run Inference Load Tests

Don’t just start the server; actually make inference requests to stress-test it:

curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama/Llama-2-7b-hf",
    "prompt": "What is machine learning?",
    "max_tokens": 100,
    "temperature": 0.7
  }'

Make 10–20 requests in quick succession. If the container survives, you’ve validated the fix under realistic load.
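The same burst can be scripted. A stdlib-only sketch that fires requests in parallel and collects HTTP statuses; the payload mirrors the curl example above, and 0 marks a failed request:

```python
import concurrent.futures
import json
import urllib.request

def completion_request(url: str, prompt: str) -> int:
    """POST one completion request; return the HTTP status, or 0 on failure."""
    body = json.dumps({
        "model": "meta-llama/Llama-2-7b-hf",
        "prompt": prompt,
        "max_tokens": 100,
        "temperature": 0.7,
    }).encode()
    req = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"})
    try:
        with urllib.request.urlopen(req, timeout=60) as resp:
            return resp.status
    except OSError:  # covers URLError/HTTPError and connection failures
        return 0

def burst(url: str, n: int = 20) -> list:
    """Fire n requests concurrently, like the rapid-succession test above."""
    with concurrent.futures.ThreadPoolExecutor(max_workers=5) as pool:
        futures = [pool.submit(completion_request, url,
                               f"What is machine learning? ({i})")
                   for i in range(n)]
        return [f.result() for f in futures]
```

If burst("http://localhost:8000/v1/completions") comes back as twenty 200s, the fix holds under realistic load.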

Enable CUDA Error Logging for Debugging

If you still see CUDA errors, enable detailed kernel logging:

docker run --rm \
    --gpus all \
    -e CUDA_LAUNCH_BLOCKING=1 \
    -e CUDART_DEBUG=1 \
    vllm-fixed:latest

(The -e flags are what set the variables inside the container; exporting them on the host has no effect there.)

This slows performance but gives you exact error locations for further investigation.

Verify Driver-Kernel Compatibility

On your Ubuntu host, check if the driver is compatible with your kernel:

cat /proc/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader

If the loaded kernel module and the userspace driver disagree (nvidia-smi reports “Failed to initialize NVML: Driver/library version mismatch”, often after an unattended driver upgrade), reboot first; if that doesn’t help, reinstall the driver:

sudo apt-get purge 'nvidia-*'
sudo apt-get autoremove
sudo apt-get install -y nvidia-driver-550

Real-World Example: From Crash to Stable Production

Let me walk you through a scenario that illustrates how these fixes work in practice.

The Situation: A startup running model inference on a GCP Compute Engine instance with an NVIDIA L4 GPU. They deployed vLLM using a community Dockerfile that bundled CUDA 12.0. The instance had driver version 545 installed. When they started serving requests, the container would crash after 30–45 seconds, consistently reporting “CUDA out of memory” even though the GPU had 20GB free.

Diagnosis: They ran nvidia-smi on the host (driver 545), then checked the container’s CUDA version by examining the Dockerfile. The mismatch: driver 545 supports CUDA runtimes up to 12.3, and an older runtime like 12.0 should work through the driver’s backward compatibility, but in practice there was a subtle handshake issue during memory initialization.

The Fix: They updated the Dockerfile to use nvidia/cuda:12.4.1-devel-ubuntu22.04 (matching the driver’s capability), ensured the NVIDIA Container Runtime was installed, and set VLLM_GPU_MEMORY_UTILIZATION=0.85. They rebuilt and deployed.

Result: The container ran for 48 hours without a crash. Inference latency was consistent at 45ms per request. Memory usage stabilized around 20GB (85% of the L4’s 24GB). No further “CUDA out of memory” errors.

Before and After Comparison

Aspect                     | Before Fix                      | After Fix
Crash Frequency            | Every 30–60 seconds             | Stable for 48+ hours
Error Type                 | CUDA out of memory (misleading) | No CUDA errors
GPU Memory Utilization     | Spikes to 100%, then crashes    | Stable at 85%, no spikes
Inference Latency          | Inconsistent or N/A (crashes)   | Consistent 40–50ms per request
Container Restart Count    | 50+ per hour                    | 0 over 48 hours
CUDA Version in Dockerfile | 12.0 or mismatched              | 12.4 (compatible with driver)
NVIDIA Container Runtime   | Not verified / may be missing   | Installed and verified working

Debugging Deep Dive: If the Problem Persists

If you’ve followed all steps and the container still crashes, try these advanced diagnostics:

Check for Kernel OOM Events

dmesg | grep -i "out of memory" | tail -20

If you see kernel OOM messages, the issue isn’t CUDA—it’s host RAM exhaustion. Increase the --memory limit in your Docker run command or Docker Compose file.
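The exit codes mentioned at the start (137 and 139) point the same way: Docker exit codes above 128 encode a fatal signal as 128 + signal number, which you can decode with the standard library.

```python
import signal

# Docker exit codes above 128 encode the fatal signal: code = 128 + signum.
for code in (137, 139):
    print(code, signal.Signals(code - 128).name)
# 137 SIGKILL -> process killed, typically by the kernel OOM killer
# 139 SIGSEGV -> segmentation fault, e.g. a broken CUDA context
```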

Inspect CUDA Device Properties Inside the Container

docker run --rm --gpus all vllm-fixed:latest python3 -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'Device {i}: {torch.cuda.get_device_name(i)}')
    print(f'Total memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB')
"

This confirms whether PyTorch (and by extension vLLM) can see and communicate with the GPU correctly.

Run a Synthetic Memory Stress Test

docker run --rm --gpus all vllm-fixed:latest python3 -c "
import torch
for i in range(10):
    x = torch.zeros(1000, 1000, 1000, device='cuda')
    print(f'Allocated {x.element_size() * x.nelement() / 1e9:.1f} GB')
    del x
print('Success!')
"

If this crashes, the driver-CUDA mismatch is more severe than anticipated. Consider rolling back the driver to an older, well-tested version or using a different CUDA base image.


Final Recommendations for Stable AI Tool Automation and Deployment

Deploying vLLM (or any LLM inference engine) on Ubuntu 22.04 with Docker and NVIDIA GPUs is a powerful way to build production AI automation systems. To keep them stable:

  • Pin your CUDA version in the Dockerfile. Don’t rely on “latest.” Explicitly specify a base image like nvidia/cuda:12.4.1-devel-ubuntu22.04.
  • Document your driver version. Keep a record of which driver version you’re running. When you update it, test with a new CUDA base image first in a staging environment.
  • Test before production. Always run the container for at least 5 minutes with realistic inference load before deploying to production.
  • Monitor continuously. Use tools like Prometheus and Grafana to track GPU memory, utilization, and temperature. Set up alerts for unusual spikes.
  • Use Docker Compose for reproducibility. Your configuration becomes version-controlled and easy to replicate across environments.
  • Set resource limits explicitly. Don’t let containers compete for memory. Define limits upfront to prevent kernel OOM events.

Conclusion: You’ve Fixed It

The “CUDA out of memory” crash on vLLM running in Docker on Ubuntu 22.04 is almost always caused by a GPU driver and CUDA toolkit version mismatch. By following the steps in this guide—verifying your host driver version, matching your CUDA base image to that driver, installing and confirming the NVIDIA Container Runtime, and tuning GPU memory utilization—you’ve eliminated the root cause.

The fix is surprisingly straightforward once you understand what’s happening: the container’s CUDA library is trying to speak a different dialect than your host GPU driver understands, leading to silent allocation failures that manifest as “out of memory” errors. Make them speak the same language, and the problem vanishes.

If you’re deploying LLM inference at scale on a VPS or cloud instance, these configuration patterns—pinned CUDA versions, verified container runtimes, and explicit resource limits—become your foundation for stable, predictable AI tool performance. Test once, document your setup, and your production inference pipeline will run reliably for months.
