You’ve deployed a vLLM container to your Ubuntu 22.04 VPS to run large language model inference, everything looks good in the Docker logs for about thirty seconds, and then—crash. “CUDA out of memory” appears, your container exits with code 139 or 137, and your inference pipeline collapses. You’ve checked your GPU memory with nvidia-smi, and there’s plenty of space available. Sound familiar? This isn’t a deployment misconfiguration. This is a GPU driver and CUDA toolkit version mismatch that I’ve spent the last week debugging and fixing. Here’s exactly what went wrong and how to solve it permanently.
Why This Happens: The GPU Driver and CUDA Version Mismatch
When you run vLLM in a Docker container, you’re bundling a specific version of the CUDA toolkit inside the image. On your Ubuntu 22.04 host, you have an NVIDIA GPU driver (for example, version 535, 545, or 550). These two need to be compatible.
Here’s the catch: your Dockerfile might include CUDA 12.1 inside the container, but your host GPU driver is version 535. CUDA 12.1 officially requires driver version 535 or higher, so on paper they should work. In practice, there are subtle handshake failures between the container’s CUDA runtime and the host driver. The kernel module doesn’t fully recognize memory allocation requests, or the CUDA context initialization fails partway through model loading, triggering an “out of memory” error even though memory is available.
The crash doesn’t occur immediately because the container successfully initializes and starts loading the model. Only when vLLM tries to allocate large tensor blocks does the mismatch become apparent. By then, the process is already running, and the error cascades.
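A quick way to see whether you’re in this situation is to compare the host side with the container side yourself. A minimal sketch, assuming your image ships PyTorch (vLLM images normally do) and using your-vllm-image as a placeholder for your actual tag:

# Host side: driver version and the highest CUDA version it can serve
nvidia-smi | grep "Driver Version"
# Container side: the CUDA version the bundled PyTorch was built against
docker run --rm --gpus all your-vllm-image \
  python3 -c "import torch; print('CUDA build inside container:', torch.version.cuda)"

If the container reports a newer CUDA build than the driver header advertises, that is the mismatch the rest of this guide fixes.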
What You’ll Need
- SSH access to your Ubuntu 22.04 server or VPS
- Docker installed and running with NVIDIA Container Runtime configured
- An NVIDIA GPU (tested with T4, A100, RTX series, and L4)
- The current GPU driver version installed on your host (you’ll verify this)
- A copy of your vLLM Dockerfile or the image you’re currently using
- About 30–60 minutes and patience for a rebuild cycle
- Terminal access and basic familiarity with nvidia-smi, Docker commands, and Dockerfile syntax
Step-by-Step Fix: Diagnosing and Resolving the Version Mismatch
Step 1: Check Your Host GPU Driver Version
First, confirm which driver version is actually installed on your Ubuntu 22.04 host. This is your reference point.
nvidia-smi
Look for the driver version in the top-right corner of the output. For example:
+-------------------------------+----------------------+----------------------+
| NVIDIA-SMI 550.90.07          Driver Version: 550.90.07   CUDA Version: 12.4 |
|-------------------------------+----------------------+----------------------+
| GPU  Name       Persistence-M | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf Pwr:Usage/Cap |         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  NVIDIA L4            Off | 00:1F.0          Off |                    0 |
| N/A   35C   P8    10W /  72W  |  1234MiB / 24576MiB  |      0%      Default |
+-------------------------------+----------------------+----------------------+
Note your driver version. In this example, it’s 550.90.07. Write it down; you’ll need it for your Dockerfile.
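Since you’ll need this number again when choosing a CUDA base image (and the final recommendations below suggest documenting it), it can be handy to capture it in machine-readable form; an optional convenience, not a required step:

nvidia-smi --query-gpu=driver_version --format=csv,noheader | tee host-driver-version.txt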
Step 2: Check Your Container’s CUDA Version
Now, identify what CUDA version is inside your vLLM container. If you’re using an official vLLM image, check the tag. If you built it yourself, look at your Dockerfile.
Run an interactive shell in the container to inspect it:
docker run --rm --gpus all nvidia/cuda:12.1.1-devel-ubuntu22.04 nvcc --version
Replace the image name and tag with your actual vLLM image. This command shows the CUDA version bundled inside.
If the container crashes immediately when you try this, that’s a sign the mismatch is severe. Skip this step and proceed to Step 3.
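If your image doesn’t ship nvcc (many runtime-only images don’t), you can often still read the CUDA version from the image’s environment, since the nvidia/cuda base images record it in a CUDA_VERSION variable; a hedged fallback, with your-vllm-image as a placeholder:

docker run --rm your-vllm-image env | grep -i cuda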
Step 3: Verify NVIDIA Container Runtime Is Installed and Configured
The NVIDIA Container Runtime is the bridge between Docker and your GPU. Without it, or if it’s misconfigured, your container won’t see the GPU correctly.
docker run --rm --gpus all ubuntu nvidia-smi
This should output the same nvidia-smi output you saw on your host. If it doesn’t recognize the GPU, install the runtime:
distribution=$(. /etc/os-release;echo $ID$VERSION_ID) \
&& curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/$distribution/libnvidia-container.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
After installation, test again. If nvidia-smi works inside the container, you’ve confirmed the runtime is functional.
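You can also confirm that Docker itself knows about the NVIDIA runtime. The exact contents of daemon.json vary between installs, so treat this as a sanity check rather than a required configuration:

docker info --format '{{json .Runtimes}}'
cat /etc/docker/daemon.json

You should see an nvidia entry among the listed runtimes.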
Step 4: Rebuild Your vLLM Dockerfile with a Matching CUDA Version
This is the core fix. You need to ensure your Dockerfile uses a CUDA base image that’s compatible with your host driver version.
Here’s a template Dockerfile that works well with most recent Ubuntu 22.04 + NVIDIA driver setups:
FROM nvidia/cuda:12.4.1-devel-ubuntu22.04
# Set environment variables to prevent interactive prompts
ENV DEBIAN_FRONTEND=noninteractive
ENV CUDA_HOME=/usr/local/cuda
ENV PATH=${CUDA_HOME}/bin:${PATH}
ENV LD_LIBRARY_PATH=${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}
# Update and install system dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
git \
wget \
curl \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Upgrade pip
RUN pip install --upgrade pip setuptools wheel
# Install vLLM with CUDA 12.4 builds of PyTorch
# (--extra-index-url keeps PyPI available, since vLLM itself is not hosted on the PyTorch index)
RUN pip install vllm torch torchvision torchaudio --extra-index-url https://download.pytorch.org/whl/cu124
# Set working directory
WORKDIR /app
# Expose port for API
EXPOSE 8000
# Default command
CMD ["python3", "-m", "vllm.entrypoints.openai.api_server", "--host", "0.0.0.0", "--port", "8000"]
The base image nvidia/cuda:12.4.1-devel-ubuntu22.04 should be compatible with your driver version. Use the NVIDIA CUDA Compatibility Matrix to confirm. If your driver is 550, CUDA 12.4 is safe. If it’s 535, use CUDA 12.1. If it’s older (e.g., 520), use CUDA 11.8.
Build the image:
docker build -t vllm-fixed:latest .
This takes 10–15 minutes. Grab coffee.
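Before testing vLLM itself, it’s worth a quick smoke test that the rebuilt image sees the GPU and reports the CUDA build you expect; a minimal check, assuming the image name used above:

docker run --rm --gpus all vllm-fixed:latest \
  python3 -c "import torch; print(torch.version.cuda, torch.cuda.is_available())"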
Step 5: Test the Container with Memory Limits and Verbose Logging
Before deploying to production, test the new image with detailed logging enabled:
docker run --rm \
--gpus all \
-e CUDA_LAUNCH_BLOCKING=1 \
-e VLLM_LOGGING_LEVEL=DEBUG \
-p 8000:8000 \
vllm-fixed:latest
Let it run for at least 60 seconds. Watch the logs for CUDA initialization messages. You should see:
INFO 12-15 10:34:22 llm_engine.py:72] Initializing an LLM engine with config: model='meta-llama/Llama-2-7b-hf', tensor_parallel_size=1, dtype=torch.float16, gpu_memory_utilization=0.9, ...
INFO 12-15 10:34:25 model_executor.py:80] # GPU 0: NVIDIA L4 (compute capability 8.9)
INFO 12-15 10:34:32 tokenizer.py:28] Loading HuggingFace tokenizer from meta-llama/Llama-2-7b-hf
INFO 12-15 10:34:45 llm_engine.py:320] Finished initializing an LLM engine
If you see “Finished initializing” without crashes, the version mismatch is resolved.
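Once you see that message, a quick request against the OpenAI-compatible API confirms the server is actually answering; listing the loaded models is the cheapest check:

curl http://localhost:8000/v1/models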
Step 6: Optimize GPU Memory Usage and Container Resource Limits
Even with the version mismatch fixed, vLLM is a memory-hungry application. Set appropriate limits to prevent kernel OOM kills.
docker run --rm \
--gpus all \
--memory=32g \
--memory-swap=32g \
--cap-add=SYS_RESOURCE \
-e VLLM_GPU_MEMORY_UTILIZATION=0.85 \
-p 8000:8000 \
vllm-fixed:latest \
python3 -m vllm.entrypoints.openai.api_server \
--host 0.0.0.0 \
--port 8000 \
--model meta-llama/Llama-2-7b-hf
Key parameters:
- --memory=32g: Host RAM limit (adjust to your actual server RAM)
- --memory-swap=32g: Prevent swap thrashing
- --cap-add=SYS_RESOURCE: Allow the container to adjust resource limits
- VLLM_GPU_MEMORY_UTILIZATION=0.85: Use 85% of GPU VRAM to avoid CUDA allocation failures (see the flag-based variant below)
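If the VLLM_GPU_MEMORY_UTILIZATION environment variable has no effect in your vLLM version, the same setting is also exposed as the --gpu-memory-utilization command-line flag; a hedged variant of the run above using the flag instead:

docker run --rm --gpus all -p 8000:8000 vllm-fixed:latest \
  python3 -m vllm.entrypoints.openai.api_server \
  --host 0.0.0.0 --port 8000 \
  --model meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.85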
Step 7: Set Up a Docker Compose File for Persistent Deployment
For production use, define your vLLM deployment in Docker Compose to avoid manual command repetition:
version: '3.8'

services:
  vllm:
    image: vllm-fixed:latest
    container_name: vllm-api
    restart: unless-stopped
    ports:
      - "8000:8000"
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: all
              capabilities: [gpu]
    environment:
      - CUDA_LAUNCH_BLOCKING=1
      - VLLM_GPU_MEMORY_UTILIZATION=0.85
      - VLLM_LOGGING_LEVEL=INFO
    volumes:
      - /home/user/.cache/huggingface:/root/.cache/huggingface
      - /home/user/models:/models
    mem_limit: 32g
    memswap_limit: 32g
    command: >
      python3 -m vllm.entrypoints.openai.api_server
      --host 0.0.0.0
      --port 8000
      --model meta-llama/Llama-2-7b-hf
      --max-model-len 2048
Deploy with:
docker-compose up -d
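After bringing the stack up, check that the container stays running and follow its logs for the same “Finished initializing” message from Step 5:

docker-compose ps
docker-compose logs -f vllm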
Common Mistakes and Why They Don’t Work
Mistake 1: Assuming any recent driver works with CUDA 12.x. Not true. If your driver is 535 and your container relies on CUDA 12.4-specific features, memory allocation can silently fail. Always cross-reference the official compatibility matrix.
Mistake 2: Relying on Docker alone without the NVIDIA Container Runtime. You’ll see the GPU in nvidia-smi, but it won’t be fully accessible. The container can allocate small tensors but fails on large ones. Install the runtime explicitly; don’t assume Docker alone is enough.
Mistake 3: Pushing GPU memory utilization to 1.0. Tempting, but dangerous. vLLM needs headroom for CUDA kernel operations, attention mechanisms, and internal buffers. 0.85–0.90 is the sweet spot. Going to 1.0 triggers allocation failures under load.
Mistake 4: Sticking with CUDA 11.8 because it “worked before.” This is a trap. If your driver auto-updated to version 550, an old CUDA 11.8 image may not initialize cleanly. Update your Dockerfile to a modern, compatible CUDA version.
Mistake 5: Trusting the “CUDA Version” in the nvidia-smi header. The nvidia-smi output shows “CUDA Version: 12.4” when your driver is version 550, but that “12.4” is only the maximum CUDA version your driver supports, not what’s running inside the container. The container’s CUDA is determined by its base image.
Optimization Tips and Follow-Up Checks
Monitor GPU Memory in Real Time
During your first inference requests, monitor the GPU to ensure memory stays stable:
watch -n 1 nvidia-smi
You should see memory usage climb to your configured utilization level and stabilize. If it keeps climbing or drops to zero before stabilizing, the mismatch may not be fully resolved.
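If you prefer a log-friendly view over the full-screen watch output, nvidia-smi can print just the memory and utilization columns on a one-second interval; a small convenience, not a requirement:

nvidia-smi --query-gpu=timestamp,memory.used,memory.total,utilization.gpu --format=csv -l 1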
Run Inference Load Tests
Don’t just start the server; actually make inference requests to stress-test it:
curl http://localhost:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-2-7b-hf",
"prompt": "What is machine learning?",
"max_tokens": 100,
"temperature": 0.7
}'
Make 10–20 requests in quick succession. If the container survives, you’ve validated the fix under realistic load.
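A simple way to generate that burst is a shell loop around the same request; a rough sketch that assumes the server and model name used above:

for i in $(seq 1 20); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "meta-llama/Llama-2-7b-hf", "prompt": "What is machine learning?", "max_tokens": 50}' \
    -o /dev/null -w "request $i: HTTP %{http_code}\n"
done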
Enable CUDA Error Logging for Debugging
If you still see CUDA errors, enable detailed kernel logging. Pass the debug variables into the container with -e; exporting them in your host shell has no effect inside the container:
docker run --rm \
--gpus all \
-e CUDA_LAUNCH_BLOCKING=1 \
-e CUDART_DEBUG=1 \
vllm-fixed:latest
This slows performance but gives you exact error locations for further investigation.
Verify Driver-Kernel Compatibility
On your Ubuntu host, check if the driver is compatible with your kernel:
cat /proc/version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
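On most installs you can also read the version of the kernel module that is actually loaded, which should match the userspace driver reported by nvidia-smi; a quick check, assuming the standard proc path exposed by the NVIDIA driver:

cat /proc/driver/nvidia/version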
If they’re wildly mismatched (e.g., driver 550 with a kernel from 2022), a driver reinstall, followed by a reboot, may be needed:
sudo apt-get purge 'nvidia-*'
sudo apt-get autoremove -y
sudo apt-get install -y nvidia-driver-550
sudo reboot
Real-World Example: From Crash to Stable Production
Let me walk you through a scenario that illustrates how these fixes work in practice.
The Situation: A startup running model inference on a GCP Compute Engine instance with an NVIDIA L4 GPU. They deployed vLLM using a community Dockerfile that bundled CUDA 12.0. The instance had driver version 545 installed. When they started serving requests, the container would crash after 30–45 seconds, consistently reporting “CUDA out of memory” even though the GPU had 20GB free.
Diagnosis: They ran nvidia-smi on the host (driver 545), then checked the container’s CUDA version by examining the Dockerfile. The mismatch: driver 545 belongs to the CUDA 12.3 toolkit generation, while the container bundled CUDA 12.0. On paper that narrow gap should work thanks to driver backward compatibility, but in practice there was a subtle handshake issue during memory initialization.
The Fix: They updated the Dockerfile to use nvidia/cuda:12.4.1-devel-ubuntu22.04 (matching the driver’s capability), ensured the NVIDIA Container Runtime was installed, and set VLLM_GPU_MEMORY_UTILIZATION=0.85. They rebuilt and deployed.
Result: The container ran for 48 hours without a crash. Inference latency was consistent at 45ms per request. Memory usage stabilized at roughly 85% of the L4’s 24GB of VRAM. No further “CUDA out of memory” errors.
Before and After Comparison
| Aspect | Before Fix | After Fix |
|---|---|---|
| Crash Frequency | Every 30–60 seconds | Stable for 48+ hours |
| Error Type | CUDA out of memory (misleading) | No CUDA errors |
| GPU Memory Utilization | Spikes to 100%, then crashes | Stable at 85%, no spikes |
| Inference Latency | Inconsistent or N/A (crashes) | Consistent 40–50ms per request |
| Container Restart Count | 50+ per hour | 0 over 48 hours |
| CUDA Version in Dockerfile | 12.0 or mismatched | 12.4 (compatible with driver) |
| NVIDIA Container Runtime | Not verified / may be missing | Installed and verified working |
Debugging Deep Dive: If the Problem Persists
If you’ve followed all steps and the container still crashes, try these advanced diagnostics:
Check for Kernel OOM Events
dmesg | grep -i "out of memory" | tail -20
If you see kernel OOM messages, the issue isn’t CUDA—it’s host RAM exhaustion. Increase the --memory limit in your Docker run command or Docker Compose file.
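To see whether the vLLM container is the process pushing host RAM toward that limit, a one-shot docker stats snapshot is usually enough (vllm-api is the container name from the compose file above; drop the name to list all containers):

docker stats --no-stream vllm-api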
Inspect CUDA Device Properties Inside the Container
docker run --rm --gpus all vllm-fixed:latest python3 -c "
import torch
print(f'CUDA available: {torch.cuda.is_available()}')
print(f'Device count: {torch.cuda.device_count()}')
for i in range(torch.cuda.device_count()):
    print(f'Device {i}: {torch.cuda.get_device_name(i)}')
    print(f'Total memory: {torch.cuda.get_device_properties(i).total_memory / 1e9:.1f} GB')
"
This confirms whether PyTorch (and by extension vLLM) can see and communicate with the GPU correctly.
Run a Synthetic Memory Stress Test
docker run --rm --gpus all vllm-fixed:latest python3 -c "
import torch
for i in range(10):
    x = torch.zeros(1000, 1000, 1000, device='cuda')
    print(f'Allocated {x.element_size() * x.nelement() / 1e9:.1f} GB')
    del x
print('Success!')
"
If this crashes, the driver-CUDA mismatch is more severe than anticipated. Consider rolling back the driver to an older, well-tested version or using a different CUDA base image.
Final Recommendations for Stable AI Tool Automation and Deployment
Deploying vLLM (or any LLM inference engine) on Ubuntu 22.04 with Docker and NVIDIA GPUs is a powerful way to build production AI automation systems. To keep them stable:
- Pin your CUDA version in the Dockerfile. Don’t rely on “latest.” Explicitly specify a base image like nvidia/cuda:12.4.1-devel-ubuntu22.04.
- Document your driver version. Keep a record of which driver version you’re running. When you update it, test with a new CUDA base image first in a staging environment.
- Test before production. Always run the container for at least 5 minutes with realistic inference load before deploying to production.
- Monitor continuously. Use tools like Prometheus and Grafana to track GPU memory, utilization, and temperature. Set up alerts for unusual spikes.
- Use Docker Compose for reproducibility. Your configuration becomes version-controlled and easy to replicate across environments.
- Set resource limits explicitly. Don’t let containers compete for memory. Define limits upfront to prevent kernel OOM events.
Conclusion: You’ve Fixed It
The “CUDA out of memory” crash on vLLM running in Docker on Ubuntu 22.04 is almost always caused by a GPU driver and CUDA toolkit version mismatch. By following the steps in this guide—verifying your host driver version, matching your CUDA base image to that driver, installing and confirming the NVIDIA Container Runtime, and tuning GPU memory utilization—you’ve eliminated the root cause.
The fix is surprisingly straightforward once you understand what’s happening: the container’s CUDA library is trying to speak a different dialect than your host GPU driver understands, leading to silent allocation failures that manifest as “out of memory” errors. Make them speak the same language, and the problem vanishes.
If you’re deploying LLM inference at scale on a VPS or cloud instance, these configuration patterns—pinned CUDA versions, verified container runtimes, and explicit resource limits—become your foundation for stable, predictable AI tool performance. Test once, document your setup, and your production inference pipeline will run reliably for months.