Docker Compose “vllm: failed to start” on Ubuntu 22.04 – fixing CUDA 12 vs torch 2.2 “CUDA out of memory” error in a GPU‑enabled FastAPI LLM service.

You’ve containerized your large language model service with vllm, you’ve got a beefy GPU, but Docker keeps throwing cryptic CUDA memory errors. Your FastAPI LLM service won’t even start. Let’s fix this—and fast.

Quick Reference

Use Case: GPU-accelerated LLM inference with Docker on Ubuntu 22.04
Difficulty Level: Intermediate
Estimated Fix Time: 15–30 minutes
Primary Stack: Docker, CUDA 12, PyTorch 2.2, vllm, FastAPI

The Problem: Why vllm Fails on Ubuntu 22.04 with CUDA 12 and Torch 2.2

When you deploy a vllm LLM service inside Docker on Ubuntu 22.04, you’re often caught in the middle of conflicting version expectations. PyTorch 2.2, while excellent, doesn’t always play nicely with CUDA 12 in containerized environments. Docker’s GPU support adds another layer of complexity—the NVIDIA Container Runtime must be configured, CUDA paths must align, and memory allocation has to be explicit. When any of these misalign, you get the dreaded:

RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
torch.cuda.OutOfMemoryError: CUDA out of memory: tried to allocate X.XX GiB

OR

vllm: failed to start
ERROR: Container failed to start due to CUDA initialization error

The root causes almost always trace back to one of these issues (a quick check to narrow them down follows the list):

  • PyTorch CUDA mismatch: Torch 2.2 compiled for CUDA 11.8 but CUDA 12 is installed
  • Missing GPU memory visibility: Docker isn’t granting the container access to full VRAM
  • Incorrect NVIDIA Container Runtime configuration: GPU drivers not properly exposed to containers
  • vllm engine defaults: Default tensor parallelism and GPU memory fraction misconfigured
  • Host and container CUDA library conflicts: Different CUDA versions or missing dependencies in the container
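To narrow down which of these applies, compare the CUDA version PyTorch was compiled against with what the driver reports. A minimal check, assuming you can run Python with torch installed (on the host or inside the container):

# CUDA version the driver supports
nvidia-smi | grep "CUDA Version"

# CUDA version PyTorch was compiled against, and whether it can see the GPU
python3 -c "import torch; print(torch.__version__, torch.version.cuda, torch.cuda.is_available())"

If the two CUDA versions differ by a major release, or is_available() prints False, you are looking at the wheel/runtime mismatch rather than a genuine out-of-memory condition.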

Prerequisites and Required Tools

Before diving into the fix, ensure you have the following on your Ubuntu 22.04 system:

  • NVIDIA GPU Driver: Version 535+ (check with nvidia-smi)
  • NVIDIA Container Runtime: Installed and configured (not just docker)
  • Docker: Version 20.10+ with GPU support enabled
  • CUDA Toolkit: Version 12.0+ on host (for reference and debugging)
  • Docker Compose: Version 1.29+ (supports GPU resource allocation)
  • A compatible GPU: RTX 3090, A100, L40, or similar with at least 24GB VRAM for typical LLMs
  • SSH or terminal access: To your Ubuntu 22.04 system

Step-by-Step Fix: Resolving the vllm Startup Failure

Step 1: Verify Your GPU and NVIDIA Container Runtime Setup

First, confirm that your host GPU is visible and that the NVIDIA Container Runtime is properly configured.

nvidia-smi

You should see your GPU listed with its memory (e.g., 24GB). Next, verify the NVIDIA Container Runtime is installed:

which nvidia-container-runtime

If nothing returns, install it:

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-runtime

After installation, restart Docker:

sudo systemctl restart docker
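As a quick smoke test before touching Compose at all, run nvidia-smi inside a disposable CUDA container; if this fails, the problem is in the runtime setup, not in vllm. A minimal check (the image tag is an example; any CUDA 12.x base image works):

docker run --rm --gpus all nvidia/cuda:12.1.1-base-ubuntu22.04 nvidia-smi

You should see the same GPU table as on the host.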

Step 2: Configure Docker Daemon to Use NVIDIA Runtime by Default

Edit (or create) /etc/docker/daemon.json:

sudo nano /etc/docker/daemon.json

Ensure it contains:

{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}

Save and exit (Ctrl+X, then Y, then Enter). Restart Docker:

sudo systemctl restart docker
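You can confirm that Docker picked up the change by inspecting the daemon info; the nvidia runtime should be listed and set as the default:

docker info | grep -i runtime
# Expected output includes lines like:
#   Runtimes: io.containerd.runc.v2 nvidia runc
#   Default Runtime: nvidia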

Step 3: Use a CUDA 12–Compatible PyTorch Base Image

The critical fix: use an official NVIDIA CUDA base image that explicitly supports PyTorch 2.2 and CUDA 12. Create or update your Dockerfile:

FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04

# Set environment variables for CUDA
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH

# Install Python, pip, git, and curl (curl is used by the Compose healthcheck)
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*

# Install PyTorch with CUDA 12 support explicitly
RUN pip install --upgrade pip && \
    pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

# Install vllm and FastAPI
RUN pip install vllm==0.3.0 fastapi uvicorn pydantic

WORKDIR /app
COPY . .

EXPOSE 8000

CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Critical line: The --index-url https://download.pytorch.org/whl/cu121 ensures PyTorch is compiled against CUDA 12.1, matching your CUDA 12 installation.
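After building the image, it is worth confirming that the wheel you got really is the cu121 build before layering vllm and FastAPI issues on top. A minimal check, assuming the image is tagged vllm-fastapi:latest as in the Compose file below:

docker build -t vllm-fastapi:latest .
docker run --rm vllm-fastapi:latest python3 -c "import torch; print(torch.__version__, torch.version.cuda)"
# Expected: something like "2.2.0+cu121 12.1"; an 11.x value means the wrong wheel was installed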

Step 4: Configure docker-compose.yml with Explicit GPU Allocation

Update your docker-compose.yml to explicitly request GPU resources and configure memory properly:

version: '3.8'

services:
  vllm-api:
    build: .
    image: vllm-fastapi:latest
    container_name: vllm-lm-api
    runtime: nvidia
    shm_size: 16gb
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - CUDA_VISIBLE_DEVICES=0
      - VLLM_ATTENTION_BACKEND=flashinfer
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    volumes:
      - ./models:/app/models
      - ./logs:/app/logs
    ports:
      - "8000:8000"
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 60s

Key settings explained:

  • runtime: nvidia – Forces NVIDIA Container Runtime
  • shm_size: 16gb – Allocates sufficient shared memory for CUDA operations
  • CUDA_VISIBLE_DEVICES=0 – Exposes GPU index 0 (change if using different GPU)
  • capabilities: [gpu] – Requests GPU capability in deploy section
  • VLLM_ATTENTION_BACKEND=flashinfer – Uses efficient attention backend (optional but recommended)
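Before starting anything, you can ask Compose to validate and print the fully resolved configuration; this catches indentation mistakes in the deploy/devices block early:

docker-compose config
# or, with the Compose V2 plugin:
docker compose config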

Step 5: Configure vllm Server Parameters

Inside your FastAPI application, ensure vllm is initialized with proper memory settings. Create a main.py file:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
import logging

logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

app = FastAPI(title="vllm LLM API")

# Initialize vllm with explicit parameters
try:
    llm = LLM(
        model="/app/models/mistral-7b-instruct",  # Your model path
        dtype="float16",  # Use float16 to reduce memory
        gpu_memory_utilization=0.9,  # Use 90% of GPU VRAM
        max_num_seqs=256,  # Concurrent sequences
        max_model_len=2048,  # Maximum token length
        tensor_parallel_size=1,  # For single GPU
        device="cuda",
        enforce_eager=False,
    )
    logger.info("vllm LLM initialized successfully")
except Exception as e:
    logger.error(f"Failed to initialize vllm: {e}")
    raise

@app.get("/health")
async def health_check():
    return {"status": "healthy", "model": "running"}

@app.post("/generate")
async def generate(prompt: str, max_tokens: int = 128):
    try:
        sampling_params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
        outputs = llm.generate(prompt, sampling_params)
        return {"prompt": prompt, "response": outputs[0].outputs[0].text}
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))

if __name__ == "__main__":
    import uvicorn
    uvicorn.run(app, host="0.0.0.0", port=8000)

Critical parameters:

  • dtype="float16" – Halves memory usage compared to float32
  • gpu_memory_utilization=0.9 – Sets max GPU VRAM usage to 90% (adjust lower if OOM persists; see the budget sketch after this list)
  • tensor_parallel_size=1 – Single GPU (increase only if using multiple GPUs)
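To sanity-check these numbers before launching, a rough back-of-the-envelope estimate helps: float16 weights take about 2 bytes per parameter, and whatever remains of the gpu_memory_utilization budget goes to the KV cache and activations. A minimal sketch (the 24 GB card and 7B parameter count are example values):

# Rough VRAM budget estimate for a float16 model under vllm's memory cap.
# All figures are approximations; real usage also includes the CUDA context,
# activation buffers, and framework overhead.

gpu_vram_gb = 24             # e.g. an RTX 3090 (example value)
gpu_memory_utilization = 0.9
model_params_billion = 7     # e.g. Mistral-7B (example value)

budget_gb = gpu_vram_gb * gpu_memory_utilization   # what vllm is allowed to use
weights_gb = model_params_billion * 2              # ~2 bytes per parameter in float16
kv_cache_gb = budget_gb - weights_gb               # what's left for the KV cache

print(f"budget: {budget_gb:.1f} GB, weights: ~{weights_gb:.1f} GB, "
      f"KV cache headroom: ~{kv_cache_gb:.1f} GB")
# If the headroom is near zero or negative, lower max_model_len,
# reduce max_num_seqs, or pick a smaller model.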

Step 6: Build and Run with Verification

Build the Docker image:

docker-compose build --no-cache

Run the container with logs visible:

docker-compose up --abort-on-container-exit

Watch for successful initialization. You should see:

vllm-lm-api  | INFO:     Uvicorn running on http://0.0.0.0:8000
vllm-lm-api  | INFO: vllm LLM initialized successfully

Verify GPU is being used inside the container:

docker exec vllm-lm-api nvidia-smi

Step 7: Test the API Endpoint

Once running, test with a simple request:

curl -X POST "http://localhost:8000/generate" \
  -H "Content-Type: application/json" \
  -d '{"prompt": "What is artificial intelligence?", "max_tokens": 50}'

Success looks like:

{"prompt": "What is artificial intelligence?", "response": "Artificial intelligence refers to computer systems designed to perform tasks that typically require human intelligence..."}

Common Mistakes That Prevent vllm from Starting

Mistake #1: Wrong PyTorch CUDA Version

Why it happens: Installing torch from a cached wheel, an old requirements pin, or a cu118 index URL gives you a build compiled against CUDA 11.8 rather than CUDA 12.

Fix: Always use the explicit PyTorch wheel URL: pip install torch --index-url https://download.pytorch.org/whl/cu121

Mistake #2: Omitting NVIDIA Runtime Configuration

Why it happens: Docker defaults to using the standard runc runtime, which has no GPU access. Many tutorials skip this critical setup step.

Fix: Configure /etc/docker/daemon.json as shown in Step 2 and set runtime: nvidia in compose file.

Mistake #3: Insufficient Shared Memory

Why it happens: CUDA operations inside containers require shared memory for inter-process communication. The default is 64MB—way too small.

Fix: Set shm_size: 16gb in your compose file.
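You can verify the setting took effect from inside the running container; /dev/shm should report the size you configured:

docker exec vllm-lm-api df -h /dev/shm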

Mistake #4: Not Explicitly Specifying GPU Count or Type

Why it happens: Without explicit resource requests, Docker may not allocate the GPU properly, or it might try to use all GPUs when you only have one.

Fix: Use CUDA_VISIBLE_DEVICES and define GPU resources in the deploy section.

Mistake #5: Using float32 When float16 Is Available

Why it happens: float32 doubles memory usage compared to float16. Developers often use it out of habit or out of concern about precision loss, which is rarely noticeable in LLM inference.

Fix: Always use dtype="float16" in vllm initialization unless you’re doing fine-tuning.

Optimization Tips and Follow-Up Checks

Monitor GPU Memory in Real-Time

While the container is running, monitor GPU usage from the host:

watch -n 1 nvidia-smi

You should see vllm processes consuming GPU memory. If memory usage stays below 50% of your GPU’s VRAM, you can safely increase gpu_memory_utilization to squeeze more performance.
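For a record you can keep alongside the service, nvidia-smi's query mode writes a CSV line at a fixed interval; the field names below are standard nvidia-smi query fields, and the logs/ path matches the volume from the Compose file:

nvidia-smi --query-gpu=timestamp,name,memory.used,memory.total,utilization.gpu \
  --format=csv -l 5 >> logs/gpu-usage.csv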

Enable CUDA Graphs and Prefix Caching for Speed

If generation latency is high, keep enforce_eager=False so vllm can capture CUDA graphs, and enable prefix caching in your vllm initialization:

llm = LLM(
    model="/app/models/mistral-7b-instruct",
    dtype="float16",
    gpu_memory_utilization=0.9,
    enable_prefix_caching=True,  # Speeds up repeated prefixes
    enforce_eager=False,  # Allows CUDA graph optimization
)

Use Flash Attention for Further Memory Reduction

If your model supports it, explicitly set the attention backend in your compose file:

environment:
  - VLLM_ATTENTION_BACKEND=flashinfer  # Or 'flash_attn' if available

Auto-Restart on Failure

Add a restart policy to your docker-compose.yml so the service recovers from transient errors. Note that restart_policy is only valid under the deploy key (and is mainly honored in Swarm mode); with plain docker-compose the service-level restart key is the reliable option:

services:
  vllm-api:
    ...
    restart: on-failure

Enable Debug Logging

For deep troubleshooting, increase logging verbosity:

environment:
  - VLLM_LOG_LEVEL=DEBUG
  - CUDA_LAUNCH_BLOCKING=1  # Synchronous CUDA for clearer error traces

Real-World Scenario: Deploying Mistral-7B on Ubuntu 22.04

Let’s walk through a complete, real-world deployment that ties everything together. You’re running a VPS with an NVIDIA A100 GPU on Ubuntu 22.04, and you want to expose a Mistral-7B model via FastAPI for inference tasks at your AI tools automation platform.

Complete Directory Structure

project/
├── Dockerfile
├── docker-compose.yml
├── main.py
├── requirements.txt
├── models/
│   └── mistral-7b-instruct/  # Model files downloaded here
└── logs/

requirements.txt:

vllm==0.3.0
fastapi==0.104.1
uvicorn==0.24.0
pydantic==2.5.0
python-dotenv==1.0.0

Deployment workflow:

  1. Download the Mistral-7B model: huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir ./models/mistral-7b-instruct
  2. Build the image: docker-compose build
  3. Bring up the service: docker-compose up -d
  4. Check logs: docker-compose logs -f
  5. Test generation: curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Explain machine learning", "max_tokens": 128}'

The service now runs reliably on your VPS, debugging any CUDA issues is straightforward thanks to proper logging, and performance is optimized with flash attention enabled.
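Because the model can take a minute or more to load, it is handy to wait for the health endpoint before sending the first generation request. A minimal sketch:

# Poll the health endpoint until the model has finished loading
until curl -sf http://localhost:8000/health > /dev/null; do
  echo "waiting for vllm to come up..."
  sleep 5
done
echo "service is ready"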

Before and After: Comparison Table

Aspect            | ❌ Before (Fails)                | ✅ After (Works)
------------------|----------------------------------|----------------------------------------
Base Image        | ubuntu:22.04 with manual CUDA    | nvidia/cuda:12.1.1-runtime-ubuntu22.04
PyTorch Install   | pip install torch                | pip install torch --index-url https://…cu121
Docker Runtime    | default (runc)                   | nvidia
Shared Memory     | 64MB (default)                   | 16GB
Precision         | float32 (high memory)            | float16 (50% savings)
GPU Memory Util.  | Undefined (crashes)              | 0.9 (90%, tunable)
Startup Result    | CUDA OOM error, container exits  | Healthy, Uvicorn running on port 8000

Troubleshooting Checklist If Issues Persist


  • Verify GPU driver version: nvidia-smi | grep "Driver Version" should be 535+

  • Check NVIDIA Container Runtime is installed: which nvidia-container-runtime

  • Confirm daemon.json is valid JSON: cat /etc/docker/daemon.json | python3 -m json.tool

  • Test GPU inside container: docker run --rm --runtime=nvidia nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi

  • Check torch CUDA availability: docker exec vllm-lm-api python3 -c "import torch; print(torch.cuda.is_available())" should print True

  • Verify CUDA version match: docker exec vllm-lm-api python3 -c "import torch; print(torch.version.cuda)" should show 12.1

  • Reduce gpu_memory_utilization: Lower it to 0.6 or 0.5 to confirm it’s not an OOM issue
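If you prefer to run these checks in one go, here is a small script that walks the same chain (driver → runtime → container GPU → PyTorch); the container and image names match the examples above and may need adjusting for your setup:

#!/usr/bin/env bash
set -e  # stop at the first failing check

echo "== Host driver =="
nvidia-smi | grep "Driver Version"

echo "== NVIDIA Container Runtime =="
which nvidia-container-runtime

echo "== GPU visible inside a container =="
docker run --rm --runtime=nvidia nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi -L

echo "== PyTorch CUDA inside the service container =="
docker exec vllm-lm-api python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"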

Key Takeaways: What You Learned

  • Version alignment matters: Use NVIDIA’s official CUDA base images and explicitly specify PyTorch wheels for your CUDA version
  • GPU access requires setup: NVIDIA Container Runtime and proper daemon.json configuration are non-negotiable
  • Memory is the bottleneck: Use float16, increase shm_size, and tune gpu_memory_utilization based on your model and hardware
  • Container environment matters: Explicit CUDA_VISIBLE_DEVICES and resource reservations prevent ambiguity
  • Test incrementally: Verify each step (driver → runtime → container GPU access → PyTorch → vllm) rather than troubleshooting a black box

Conclusion: Your vllm Service Is Now Ready

You’ve successfully debugged and resolved the vllm startup failure on Ubuntu 22.04 with CUDA 12 and PyTorch 2.2. The combination of using NVIDIA’s official CUDA base image, configuring the NVIDIA Container Runtime, allocating adequate shared memory, and tuning vllm’s initialization parameters transforms a frustrating configuration error into a predictable, repeatable deployment.

Your FastAPI LLM service now has:

  • ✅ Proper GPU access and CUDA initialization
  • ✅ Efficient memory utilization with float16 precision
  • ✅ Predictable, scalable inference via vllm
  • ✅ Health checks and auto-restart capability
  • ✅ Clean, observable logs for debugging

This setup is production-ready for automation workflows, AI tools deployment, and high-throughput inference on your VPS or on-premise infrastructure. The error handling and environment configuration we’ve implemented will save you countless hours of debugging in the future.

Have questions about deploying vllm in your environment? The configuration patterns shown here generalize to other large language models and GPU setups. Test thoroughly in staging before pushing to production, and monitor GPU memory usage closely during your first deployment.
