You’ve containerized your large language model service with vllm, you’ve got a beefy GPU, but Docker keeps throwing cryptic CUDA memory errors. Your FastAPI LLM service won’t even start. Let’s fix this—and fast.
The Problem: Why vllm Fails on Ubuntu 22.04 with CUDA 12 and Torch 2.2
When you deploy a vllm LLM service inside Docker on Ubuntu 22.04, you’re often caught in the middle of conflicting version expectations. PyTorch 2.2, while excellent, doesn’t always play nicely with CUDA 12 in containerized environments. Docker’s GPU support adds another layer of complexity—the NVIDIA Container Runtime must be configured, CUDA paths must align, and memory allocation has to be explicit. When any of these misalign, you get the dreaded:
RuntimeError: CUDA out of memory. Tried to allocate X.XX GiB
torch.cuda.OutOfMemoryError: CUDA out of memory: tried to allocate X.XX GiB

or:

vllm: failed to start
ERROR: Container failed to start due to CUDA initialization error
The root causes almost always trace back to one of these issues (the triage sketch after this list can tell you which one you're hitting):
- PyTorch CUDA mismatch: Torch 2.2 compiled for CUDA 11.8 but CUDA 12 is installed
- Missing GPU memory visibility: Docker isn’t granting the container access to full VRAM
- Incorrect NVIDIA Container Runtime configuration: GPU drivers not properly exposed to containers
- vllm engine defaults: Default tensor parallelism and GPU memory fraction misconfigured
- Host and container CUDA library conflicts: Different CUDA versions or missing dependencies in the container
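Before changing anything, confirm which of these failure modes you're actually in. The short triage script below is a minimal sketch (the filename is just a suggestion, and it only assumes PyTorch is installed); run it with python3 inside the failing container and the output points at the culprit:

# cuda_triage.py - run inside the container to see which failure mode applies.
import shutil
import subprocess

try:
    import torch
except ImportError:
    raise SystemExit("PyTorch is not installed in this environment")

print(f"torch version:        {torch.__version__}")
print(f"torch built for CUDA: {torch.version.cuda}")         # must be 12.x to match CUDA 12
print(f"CUDA available:       {torch.cuda.is_available()}")  # False points at runtime/driver exposure

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GiB")

# If nvidia-smi exists inside the container, the NVIDIA runtime is wired up.
if shutil.which("nvidia-smi"):
    subprocess.run(["nvidia-smi", "--query-gpu=driver_version,memory.total", "--format=csv"])
else:
    print("nvidia-smi not found in container: NVIDIA Container Runtime likely not configured")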
Prerequisites and Required Tools
Before diving into the fix, ensure you have the following on your Ubuntu 22.04 system:
- NVIDIA GPU Driver: Version 535+ (check with nvidia-smi)
- NVIDIA Container Runtime: Installed and configured (not just Docker)
- Docker: Version 20.10+ with GPU support enabled
- CUDA Toolkit: Version 12.0+ on host (for reference and debugging)
- Docker Compose: Version 1.29+ (supports GPU resource allocation)
- A compatible GPU: RTX 3090, A100, L40, or similar with at least 24GB VRAM for typical LLMs
- SSH or terminal access: To your Ubuntu 22.04 system
Step-by-Step Fix: Resolving the vllm Startup Failure
Step 1: Verify Your GPU and NVIDIA Container Runtime Setup
First, confirm that your host GPU is visible and that the NVIDIA Container Runtime is properly configured.
nvidia-smi
You should see your GPU listed with its memory (e.g., 24GB). Next, verify the NVIDIA Container Runtime is installed:
which nvidia-container-runtime
If nothing returns, install it:
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
distribution=$(. /etc/os-release; echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
  sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt-get update && sudo apt-get install -y nvidia-container-runtime
After installation, restart Docker:
sudo systemctl restart docker
Step 2: Configure Docker Daemon to Use NVIDIA Runtime by Default
Edit (or create) /etc/docker/daemon.json:
sudo nano /etc/docker/daemon.json
Ensure it contains:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "nvidia"
}
Save and exit (Ctrl+X, then Y, then Enter). Restart Docker:
sudo systemctl restart docker
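A sharp edge worth knowing: if daemon.json contains a syntax error, Docker fails to restart and every container on the host stays down. A quick pre-flight validation is cheap insurance; here's a minimal sketch (the filename is just a suggestion):

# validate_daemon_json.py - sanity-check /etc/docker/daemon.json before restarting Docker.
import json
import sys

path = "/etc/docker/daemon.json"
try:
    with open(path) as f:
        config = json.load(f)
except (FileNotFoundError, json.JSONDecodeError) as e:
    sys.exit(f"{path} is missing or not valid JSON: {e}")

# Check for the two settings this step added.
if "nvidia" not in config.get("runtimes", {}):
    sys.exit("no 'nvidia' runtime defined in daemon.json")
if config.get("default-runtime") != "nvidia":
    print("warning: 'default-runtime' is not 'nvidia'; set runtime explicitly per container")
print("daemon.json looks good")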
Step 3: Use a CUDA 12–Compatible PyTorch Base Image
The critical fix: use an official NVIDIA CUDA base image whose CUDA version matches the PyTorch 2.2 wheels you install. Create or update your Dockerfile:
FROM nvidia/cuda:12.1.1-runtime-ubuntu22.04
# Set environment variables for CUDA
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=$CUDA_HOME/bin:$PATH
ENV LD_LIBRARY_PATH=$CUDA_HOME/lib64:$LD_LIBRARY_PATH
# Install Python, pip, and curl (curl is required by the compose healthcheck in Step 4)
RUN apt-get update && apt-get install -y \
    python3.10 \
    python3-pip \
    git \
    curl \
    && rm -rf /var/lib/apt/lists/*
# Install PyTorch with CUDA 12 support explicitly
RUN pip install --upgrade pip && \
pip install torch==2.2.0 torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
# Install vllm and FastAPI
RUN pip install vllm==0.3.0 fastapi uvicorn pydantic
WORKDIR /app
COPY . .
EXPOSE 8000
CMD ["python3", "-m", "uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
Critical line: The --index-url https://download.pytorch.org/whl/cu121 ensures PyTorch is compiled against CUDA 12.1, matching your CUDA 12 installation.
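To catch a wheel mismatch at build time instead of at startup, you can bake a small assertion into the image and call it with a RUN step right after the pip install. This is an optional sketch, not part of the official vllm setup; conveniently, torch.version.cuda is readable even on a GPU-less build machine:

# check_torch_cuda.py - fail the Docker build early if the wrong PyTorch wheel landed.
# Usage in the Dockerfile (hypothetical step): RUN python3 check_torch_cuda.py
import torch

expected = "12.1"
actual = torch.version.cuda  # None means a CPU-only wheel was installed
assert actual is not None, "CPU-only PyTorch wheel; re-run pip with --index-url .../cu121"
assert actual.startswith(expected), f"PyTorch built for CUDA {actual}, expected {expected}"
print(f"OK: torch {torch.__version__} built against CUDA {actual}")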
Step 4: Configure docker-compose.yml with Explicit GPU Allocation
Update your docker-compose.yml to explicitly request GPU resources and configure memory properly:
version: '3.8'
services:
vllm-api:
build: .
image: vllm-fastapi:latest
container_name: vllm-lm-api
runtime: nvidia
shm_size: 16gb
environment:
- NVIDIA_VISIBLE_DEVICES=all
- CUDA_VISIBLE_DEVICES=0
- VLLM_ATTENTION_BACKEND=flashinfer
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
volumes:
- ./models:/app/models
- ./logs:/app/logs
ports:
- "8000:8000"
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
interval: 30s
timeout: 10s
retries: 3
start_period: 60s
Key settings explained:
- runtime: nvidia – Forces the NVIDIA Container Runtime
- shm_size: 16gb – Allocates sufficient shared memory for CUDA operations
- CUDA_VISIBLE_DEVICES=0 – Exposes GPU index 0 (change if using a different GPU)
- capabilities: [gpu] – Requests GPU capability in the deploy section
- VLLM_ATTENTION_BACKEND=flashinfer – Uses an efficient attention backend (optional but recommended)
Step 5: Configure vllm Server Parameters
Inside your FastAPI application, ensure vllm is initialized with proper memory settings. Create a main.py file:
from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
from vllm import LLM, SamplingParams
import os
import logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
app = FastAPI(title="vllm LLM API")
# Initialize vllm with explicit parameters
try:
llm = LLM(
model="/app/models/mistral-7b-instruct", # Your model path
dtype="float16", # Use float16 to reduce memory
gpu_memory_utilization=0.9, # Use 90% of GPU VRAM
max_num_seqs=256, # Concurrent sequences
max_model_len=2048, # Maximum token length
tensor_parallel_size=1, # For single GPU
device="cuda",
enforce_eager=False,
)
logger.info("vllm LLM initialized successfully")
except Exception as e:
logger.error(f"Failed to initialize vllm: {e}")
raise
@app.get("/health")
async def health_check():
return {"status": "healthy", "model": "running"}
# Request body model so the JSON payload in Step 7's curl binds correctly
class GenerateRequest(BaseModel):
    prompt: str
    max_tokens: int = 128

@app.post("/generate")
async def generate(request: GenerateRequest):
    try:
        sampling_params = SamplingParams(temperature=0.7, max_tokens=request.max_tokens)
        outputs = llm.generate(request.prompt, sampling_params)
        return {"prompt": request.prompt, "response": outputs[0].outputs[0].text}
    except Exception as e:
        logger.error(f"Generation failed: {e}")
        raise HTTPException(status_code=500, detail=str(e))
if __name__ == "__main__":
import uvicorn
uvicorn.run(app, host="0.0.0.0", port=8000)
Critical parameters:
- dtype="float16" – Halves memory usage compared to float32
- gpu_memory_utilization=0.9 – Caps vllm's GPU VRAM usage at 90% (adjust lower if OOM persists)
- tensor_parallel_size=1 – Single GPU (increase only if using multiple GPUs)
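It helps to see what gpu_memory_utilization=0.9 actually buys you: vllm claims that fraction of total VRAM, loads the model weights into it, and hands whatever remains to the KV cache. A rough back-of-envelope sketch (the 7B-parameter and float16 figures are illustrative assumptions, not measurements):

# vram_budget.py - rough estimate of how gpu_memory_utilization splits VRAM.
import torch

if not torch.cuda.is_available():
    raise SystemExit("run this on the GPU host or inside the container")

total_vram = torch.cuda.get_device_properties(0).total_memory  # bytes
utilization = 0.9            # mirrors gpu_memory_utilization above
weights = 7e9 * 2            # ~14 GB: 7B parameters at 2 bytes each (float16)
budget = total_vram * utilization
kv_cache = budget - weights  # what's left over for the KV cache

print(f"Total VRAM:      {total_vram / 1024**3:.1f} GiB")
print(f"vllm budget:     {budget / 1024**3:.1f} GiB (at {utilization:.0%})")
print(f"Weights (est.):  {weights / 1024**3:.1f} GiB")
print(f"KV cache (est.): {kv_cache / 1024**3:.1f} GiB")
if kv_cache <= 0:
    print("model does not fit at this utilization: use a smaller model or quantization")

If the KV cache slice comes out small, expect throughput to drop well before you hit a hard OOM.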
Step 6: Build and Run with Verification
Build the Docker image:
docker-compose build --no-cache
Run the container with logs visible:
docker-compose up --abort-on-container-exit
Watch for successful initialization. You should see:
vllm-lm-api | INFO: Uvicorn running on http://0.0.0.0:8000
vllm-lm-api | INFO: vllm LLM initialized successfully
Verify GPU is being used inside the container:
docker exec vllm-lm-api nvidia-smi
Step 7: Test the API Endpoint
Once running, test with a simple request:
curl -X POST "http://localhost:8000/generate" \
-H "Content-Type: application/json" \
-d '{"prompt": "What is artificial intelligence?", "max_tokens": 50}'
Success looks like:
{"prompt": "What is artificial intelligence?", "response": "Artificial intelligence refers to computer systems designed to perform tasks that typically require human intelligence..."}
Common Mistakes That Prevent vllm from Starting
Mistake #1: Wrong PyTorch CUDA Version
Why it happens: Installing torch without specifying the CUDA index URL defaults to an older CUDA version incompatible with CUDA 12.
Fix: Always use the explicit PyTorch wheel URL: pip install torch --index-url https://download.pytorch.org/whl/cu121
Mistake #2: Omitting NVIDIA Runtime Configuration
Why it happens: Docker defaults to using the standard runc runtime, which has no GPU access. Many tutorials skip this critical setup step.
Fix: Configure /etc/docker/daemon.json as shown in Step 2 and set runtime: nvidia in compose file.
Mistake #3: Insufficient Shared Memory
Why it happens: CUDA operations inside containers require shared memory for inter-process communication. The default is 64MB—way too small.
Fix: Set shm_size: 16gb in your compose file.
Mistake #4: Not Explicitly Specifying GPU Count or Type
Why it happens: Without explicit resource requests, Docker may not allocate the GPU properly, or it might try to use all GPUs when you only have one.
Fix: Use CUDA_VISIBLE_DEVICES and define GPU resources in the deploy section.
Mistake #5: Using float32 When float16 Is Available
Why it happens: float32 doubles memory usage compared to float16. Developers often keep it out of habit or fear of precision loss, but the quality difference is negligible for LLM inference.
Fix: Always use dtype="float16" in vllm initialization unless you’re doing fine-tuning.
Optimization Tips and Follow-Up Checks
Monitor GPU Memory in Real-Time
While the container is running, monitor GPU usage from the host:
watch -n 1 nvidia-smi
You should see vllm processes consuming GPU memory. If memory usage stays below 50% of your GPU’s VRAM, you can safely increase gpu_memory_utilization to squeeze more performance.
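To capture these numbers programmatically (for dashboards or alerts), the NVML bindings expose the same counters nvidia-smi reads. A sketch, assuming the nvidia-ml-py package (imported as pynvml) is installed:

# gpu_monitor.py - print GPU memory usage every 5 seconds via NVML.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # GPU 0, matching CUDA_VISIBLE_DEVICES=0

try:
    while True:
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        print(f"GPU memory: {mem.used / 1024**3:.1f} / {mem.total / 1024**3:.1f} GiB "
              f"({mem.used / mem.total:.0%})")
        time.sleep(5)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()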
Enable CUDA Graph Caching for Speed
If generation latency is high, enable CUDA graph caching in your vllm initialization:
llm = LLM(
model="/app/models/mistral-7b-instruct",
dtype="float16",
gpu_memory_utilization=0.9,
enable_prefix_caching=True, # Speeds up repeated prefixes
enforce_eager=False, # Allows CUDA graph optimization
)
Use Flash Attention for Further Memory Reduction
If your model supports it, explicitly set the attention backend in your compose file:
environment:
  - VLLM_ATTENTION_BACKEND=flashinfer  # Or 'flash_attn' if available
Auto-Restart on Failure
Add a restart policy to your docker-compose.yml to ensure the service recovers from transient errors:
services:
vllm-api:
...
    deploy:
      restart_policy:  # restart_policy must sit under the deploy key in the Compose spec
        condition: on-failure
        delay: 10s
        max_attempts: 5
Enable Debug Logging
For deep troubleshooting, increase logging verbosity:
environment:
  - VLLM_LOG_LEVEL=DEBUG
  - CUDA_LAUNCH_BLOCKING=1  # Synchronous CUDA for clearer error traces
Real-World Scenario: Deploying Mistral-7B on Ubuntu 22.04
Let’s walk through a complete, real-world deployment that ties everything together. You’re running a VPS with an NVIDIA A100 GPU on Ubuntu 22.04, and you want to expose a Mistral-7B model via FastAPI for inference tasks at your AI tools automation platform.
Complete Directory Structure
project/
├── Dockerfile
├── docker-compose.yml
├── main.py
├── requirements.txt
├── models/
│   └── mistral-7b-instruct/   # Model files downloaded here
└── logs/
requirements.txt:
vllm==0.3.0
fastapi==0.104.1
uvicorn==0.24.0
pydantic==2.5.0
python-dotenv==1.0.0
Deployment workflow:
- Download the Mistral-7B model:
  huggingface-cli download mistralai/Mistral-7B-Instruct-v0.2 --local-dir ./models/mistral-7b-instruct
- Build the image:
  docker-compose build
- Bring up the service:
  docker-compose up -d
- Check logs:
  docker-compose logs -f
- Test generation (or use the smoke-test sketch after this list):
  curl -X POST "http://localhost:8000/generate" -H "Content-Type: application/json" -d '{"prompt": "Explain machine learning", "max_tokens": 128}'
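Because the model takes a minute or more to load, a curl fired immediately after docker-compose up -d can fail even when everything is configured correctly. Here is the smoke-test sketch referenced above: it polls /health until the service is ready, then runs one generation (assumes the requests package is installed):

# smoke_test.py - wait for the service to come up, then run one generation.
import time
import requests

BASE = "http://localhost:8000"

# Poll /health until the model finishes loading (mirrors the compose healthcheck).
for _ in range(30):
    try:
        if requests.get(f"{BASE}/health", timeout=5).status_code == 200:
            break
    except requests.ConnectionError:
        pass
    time.sleep(10)
else:
    raise SystemExit("service never became healthy")

resp = requests.post(f"{BASE}/generate",
                     json={"prompt": "Explain machine learning", "max_tokens": 128},
                     timeout=120)
resp.raise_for_status()
print(resp.json()["response"])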
The service now runs reliably on your VPS, debugging any CUDA issues is straightforward thanks to proper logging, and performance is optimized with flash attention enabled.
Before and After: Comparison Table
| Aspect | ❌ Before (Fails) | ✅ After (Works) |
|---|---|---|
| Base Image | ubuntu:22.04 with manual CUDA | nvidia/cuda:12.1.1-runtime-ubuntu22.04 |
| PyTorch Install | pip install torch | pip install torch --index-url https://download.pytorch.org/whl/cu121 |
| Docker Runtime | default (runc) | nvidia |
| Shared Memory | 64MB (default) | 16GB |
| Precision | float32 (high memory) | float16 (50% savings) |
| GPU Memory Util. | Undefined (crashes) | 0.9 (90%, tunable) |
| Startup Result | CUDA OOM error, container exits | Healthy, Uvicorn running on port 8000 |
Troubleshooting Checklist If Issues Persist
- Verify GPU driver version: nvidia-smi | grep "Driver Version" should report 535+
- Check the NVIDIA Container Runtime is installed: which nvidia-container-runtime
- Confirm daemon.json is valid JSON: cat /etc/docker/daemon.json | python3 -m json.tool
- Test GPU access inside a container: docker run --rm --runtime=nvidia nvidia/cuda:12.1.1-runtime-ubuntu22.04 nvidia-smi
- Check torch CUDA availability: docker exec vllm-lm-api python3 -c "import torch; print(torch.cuda.is_available())" should print True
- Verify CUDA version match: docker exec vllm-lm-api python3 -c "import torch; print(torch.version.cuda)" should show 12.1
- Reduce gpu_memory_utilization: lower it to 0.6 or 0.5 to confirm the failure isn't a plain OOM
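If you work through this list often, it's easy to automate from the host. A sketch that shells out to the same commands (the container name matches the compose file in this article):

# checklist.py - run the troubleshooting checks above from the host.
import subprocess

CONTAINER = "vllm-lm-api"

checks = [
    ("GPU driver version", ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"]),
    ("NVIDIA runtime on PATH", ["which", "nvidia-container-runtime"]),
    ("daemon.json valid JSON", ["python3", "-m", "json.tool", "/etc/docker/daemon.json"]),
    ("torch sees the GPU", ["docker", "exec", CONTAINER, "python3", "-c",
                            "import torch; print(torch.cuda.is_available())"]),
    ("torch CUDA version", ["docker", "exec", CONTAINER, "python3", "-c",
                            "import torch; print(torch.version.cuda)"]),
]

for name, cmd in checks:
    result = subprocess.run(cmd, capture_output=True, text=True)
    status = "OK  " if result.returncode == 0 else "FAIL"
    print(f"[{status}] {name}: {(result.stdout or result.stderr).strip()}")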
Key Takeaways: What You Learned
- Version alignment matters: Use NVIDIA’s official CUDA base images and explicitly specify PyTorch wheels for your CUDA version
- GPU access requires setup: NVIDIA Container Runtime and proper daemon.json configuration are non-negotiable
- Memory is the bottleneck: Use float16, increase shm_size, and tune gpu_memory_utilization based on your model and hardware
- Container environment matters: Explicit CUDA_VISIBLE_DEVICES and resource reservations prevent ambiguity
- Test incrementally: Verify each step (driver → runtime → container GPU access → PyTorch → vllm) rather than troubleshooting a black box
Conclusion: Your vllm Service Is Now Ready
You’ve successfully debugged and resolved the vllm startup failure on Ubuntu 22.04 with CUDA 12 and PyTorch 2.2. The combination of using NVIDIA’s official CUDA base image, configuring the NVIDIA Container Runtime, allocating adequate shared memory, and tuning vllm’s initialization parameters transforms a frustrating configuration error into a predictable, repeatable deployment.
Your FastAPI LLM service now has:
- ✅ Proper GPU access and CUDA initialization
- ✅ Efficient memory utilization with float16 precision
- ✅ Predictable, scalable inference via vllm
- ✅ Health checks and auto-restart capability
- ✅ Clean, observable logs for debugging
This setup is production-ready for automation workflows, AI tools deployment, and high-throughput inference on your VPS or on-premise infrastructure. The error handling and environment configuration we’ve implemented will save you countless hours of debugging in the future.
Have questions about deploying vllm in your environment? The configuration patterns shown here generalize to other large language models and GPU setups. Test thoroughly in staging before pushing to production, and monitor GPU memory usage closely during your first deployment.