You’ve got a powerful 24 GB GPU sitting on your Ubuntu 22.04 machine, you’ve containerized your Ollama + vLLM setup in Docker, and everything should be working beautifully. But instead, you’re staring at a dreaded “CUDA out of memory” error that makes absolutely no sense. Your GPU has more than enough memory, the container claims to see it, and yet your large language model deployment fails to allocate space for inference. Sound familiar? You’re not alone, and the fix is closer than you think—it usually comes down to a version mismatch between your CUDA toolkit, driver, and containerized runtime.
Estimated Fix Time: 15–45 minutes (depending on driver/CUDA reinstall needs)
Required Stack: Ubuntu 22.04, NVIDIA GPU (24GB+), Docker, NVIDIA Container Runtime, CUDA Toolkit, cuDNN, Ollama, vLLM
Why This Error Happens (The Real Culprit)
The “CUDA out of memory” error inside a containerized Ollama + vLLM setup on Ubuntu 22.04 rarely means you’ve actually run out of GPU memory. Instead, it almost always indicates one of these underlying issues:
- GPU/CUDA Version Mismatch: Your Docker image was built with CUDA 12.x, but your host machine has CUDA 11.x drivers installed, or vice versa. The container sees a different runtime than what’s available at the kernel level.
- Incomplete NVIDIA Container Runtime Setup: Docker doesn’t have proper access to GPU resources because the NVIDIA Container Runtime isn’t configured or isn’t being invoked correctly.
- Driver Compatibility Issues: The NVIDIA GPU driver on your host machine is outdated or incompatible with your CUDA toolkit version inside the container.
- Memory Fragmentation or Allocation Policies: Even with 24 GB available, the container’s memory allocation strategy might be preventing contiguous allocation for large models.
- Incorrect Docker Runtime Configuration: You’re running the container with the wrong Docker runtime flag, so CUDA support isn’t being passed through to the container.
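A quick first check that exposes most of these at once is to compare the driver version the host reports with what a minimal CUDA container sees (the image tag below is just an example; use one matching your stack):
# Host side: driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Container side: what the GPU looks like through Docker and the NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If the second command errors out or reports a different driver/CUDA pairing than the first, you have already found the layer to fix.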
Before You Start: Tools and Requirements
- NVIDIA GPU with CUDA compute capability 6.0 or higher (24 GB VRAM recommended)
- Ubuntu 22.04 LTS with root or sudo access
- Docker installed (version 20.10 or later recommended)
- NVIDIA Container Runtime installed and configured
- NVIDIA GPU drivers installed on the host
- Basic Linux CLI proficiency
- Access to the nvidia-smi command (included with the driver)
- Text editor (nano, vim, or VS Code with SSH extension)
Step-by-Step Fix: Resolving the CUDA Mismatch and Memory Allocation Issues
Step 1: Verify Your Current GPU and Driver Setup
First, let’s establish what you’re working with. Run these commands on your host machine (not inside the container) to see your GPU, driver version, and CUDA compatibility:
nvidia-smi
Example output to check:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05                |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util             |
|===============================+======================+======================|
|   0  NVIDIA RTX 4090      Off | 00:1F.0          Off |                  N/A |
| 24%  35C    P0    45W / 450W  |      0MiB / 24576MiB |      0%              |
+-------------------------------+----------------------+----------------------+
Note down your driver version (e.g., 535.104.05) and GPU model. Next, check your CUDA version:
nvcc --version
If nvcc isn’t found, your CUDA toolkit might not be installed on the host. That’s okay—we’ll address this. Now, check what CUDA version your Docker image expects:
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu22.04 nvcc --version
This prints the CUDA toolkit version baked into that image (the runtime-only nvidia/cuda tags don't ship nvcc, so use a devel tag for this check). Whatever CUDA base image your Ollama + vLLM container is built from is the version that has to be compatible with your host driver.
Step 2: Verify NVIDIA Container Runtime is Installed and Configured
The NVIDIA Container Runtime (shipped as part of the NVIDIA Container Toolkit) is what allows Docker containers to access your GPU. The older nvidia-docker packages and apt-key based repository setup are deprecated on Ubuntu 22.04, so install the current toolkit if you haven't already:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
After installation, restart the Docker daemon:
sudo systemctl restart docker
Verify the installation:
docker run --rm --gpus all ubuntu nvidia-smi
If you see GPU information printed, you’re good. If you get an error, move to Step 3.
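If the command fails, it is also worth confirming the toolkit itself installed cleanly before moving on; the nvidia-ctk CLI ships with it:
nvidia-ctk --version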
Step 3: Update Docker Daemon Configuration for GPU Support
Edit the Docker daemon configuration file so the NVIDIA runtime is registered and available to containers:
sudo nano /etc/docker/daemon.json
Add or update to this configuration:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "runc"
}
Important: Do not set "default-runtime": "nvidia" for all containers—some non-GPU containers may fail. Instead, specify the runtime at container launch time.
Save the file (Ctrl+O, Enter, Ctrl+X in nano) and restart Docker:
sudo systemctl restart docker
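If you would rather not edit the file by hand, recent NVIDIA Container Toolkit releases can write the same runtimes entry for you (restart Docker afterwards, just as above):
sudo nvidia-ctk runtime configure --runtime=docker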
Step 4: Update Your GPU Driver to Match CUDA Toolkit Requirements
This is where most developers get stuck. Your NVIDIA driver version must support the CUDA version inside your container. Check the compatibility matrix:
| CUDA Version | Minimum Driver Version | Maximum Driver Version |
|---|---|---|
| CUDA 11.8 | 450.xx | Latest (550+) |
| CUDA 12.0 | 525.xx | Latest (550+) |
| CUDA 12.1 | 530.xx | Latest (550+) |
| CUDA 12.2 | 535.xx | Latest (550+) |
| CUDA 12.3 | 545.xx | Latest (550+) |
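If you want to read your driver's major version straight off the host for comparison against the table, here is a minimal sketch:
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
echo "Host driver: $driver (major version ${driver%%.*})"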
If your driver version is below the minimum required for your CUDA version, update it:
sudo apt-get update
sudo apt-get install -y nvidia-driver-550
Replace 550 with the appropriate driver version for your CUDA toolkit. Reboot after installation:
sudo reboot
After reboot, verify with:
nvidia-smi
Step 5: Align Your Ollama + vLLM Docker Image with Host CUDA Version
Now, rebuild or pull your Ollama + vLLM Docker image with the correct CUDA base. If you’re using a pre-built image, check the Dockerfile to see what CUDA version it’s using.
Example Dockerfile using CUDA 12.2:
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
# Install dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
wget \
curl \
git \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install vLLM and the Ollama Python client.
# vLLM pins a compatible PyTorch build itself, so torch is not pinned separately here;
# a mismatched pin (e.g. torch==2.0.1 alongside vllm==0.2.7) makes pip's resolver fail.
RUN pip install --no-cache-dir \
    vllm==0.2.7 \
    ollama==0.0.11
EXPOSE 8000 11434
CMD ["python3", "-m", "vllm.entrypoints.api_server"]
Make sure this CUDA version (12.2.0 in this example) matches or is compatible with your host driver version. Build the image:
docker build -t ollama-vllm:cuda12.2 .
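To double-check what actually ended up in the image, you can print the CUDA toolkit version it ships (this uses the tag from the build above; nvcc is present because the image is based on a devel tag):
docker run --rm ollama-vllm:cuda12.2 nvcc --version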
Step 6: Run Your Container with Proper GPU Access and Memory Configuration
This is critical. Run your container with the correct flags:
docker run -d \
--name ollama-vllm \
--gpus all \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e CUDA_VISIBLE_DEVICES=0 \
-p 8000:8000 \
-p 11434:11434 \
--memory=32g \
--memory-swap=32g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ollama-vllm:cuda12.2
Flag explanation:
- --gpus all – Expose all GPUs to the container
- --runtime=nvidia – Use the NVIDIA container runtime
- -e NVIDIA_VISIBLE_DEVICES=all – Make all GPUs visible inside the container
- -e NVIDIA_DRIVER_CAPABILITIES=compute,utility – Enable compute and monitoring capabilities
- -e CUDA_VISIBLE_DEVICES=0 – Specify which GPU(s) to use (0 for the first GPU)
- --memory=32g – System RAM allocation (adjust to your host capacity)
- --memory-swap=32g – Total RAM plus swap; setting it equal to --memory disables swap for the container
- --ulimit memlock=-1 – Remove the locked-memory limit so GPU buffers can be pinned
- --ulimit stack=67108864 – Increase stack size to prevent stack overflow during tensor allocation
Step 7: Verify GPU Access Inside the Container
After the container starts, verify that GPU access is working:
docker exec ollama-vllm nvidia-smi
You should see your GPU(s) listed with full VRAM available. If the container still reports “CUDA out of memory” at this stage, the issue isn't memory pressure; it's a driver/CUDA mismatch. If nvidia-smi shows all 24 GB available, move to the next step.
Check CUDA inside the container:
docker exec ollama-vllm python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_properties(0))"
This should output:
True
NVIDIA RTX 4090 (or your GPU model)
_CudaDeviceProperties(name='NVIDIA RTX 4090', major=8, minor=9, total_memory=25769803776, multi_processor_count=128)
Step 8: Configure vLLM and Ollama Memory Allocation Policies
Even with correct CUDA setup, vLLM’s default memory allocation strategy might be conservative. Inside your container, set environment variables to optimize memory usage:
Create or update a startup script in your container:
#!/bin/bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export VLLM_GPU_MEMORY_UTILIZATION=0.9
export TORCH_CUDA_MALLOC_ASYNC_DEBUG=0
export NCCL_DEBUG=INFO

# For vLLM API server
python3 -m vllm.entrypoints.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --dtype float16 \
  --max-model-len 4096
Key setting: --gpu-memory-utilization 0.9 tells vLLM to reserve up to 90% of available GPU memory for model weights and KV cache. Set it explicitly rather than trusting whatever default your vLLM version or launch script applies, so your 24 GB GPU isn't left underutilized while a predictable slice stays free for CUDA overhead.
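To see what that cap translates to on your hardware, you can query it from the running container (a quick check, assuming the ollama-vllm container from Step 6 is up):
docker exec ollama-vllm python3 -c "import torch; t = torch.cuda.get_device_properties(0).total_memory; print(f'0.9 utilization is about {0.9*t/2**30:.1f} GiB of {t/2**30:.1f} GiB total')"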
Common Mistakes and Why They Happen
Mistake 1: Forgetting the --runtime=nvidia Flag
Developers often remember --gpus all but forget --runtime=nvidia (or the other way around). On recent Docker releases --gpus all alone is usually enough to trigger the NVIDIA hooks, but setups that depend on the registered nvidia runtime, such as Compose files that use runtime: nvidia, never get the GPU mounted when it isn't selected, so CUDA initialization fails even though the hardware is fine. Passing both the flag and the runtime is harmless and removes the ambiguity.
Mistake 2: Using the Wrong CUDA Base Image Version
Pulling nvidia/cuda:12.0.0-runtime-ubuntu22.04 when your driver supports CUDA 11.8 causes a version mismatch. The container’s CUDA libraries can’t communicate with the host driver. Always check compatibility before building.
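A quick way to see which CUDA version an image was built against without starting it is to read the CUDA_VERSION environment variable that the official nvidia/cuda images set (shown here on an example tag):
docker image inspect --format '{{range .Config.Env}}{{println .}}{{end}}' nvidia/cuda:12.2.0-runtime-ubuntu22.04 | grep ^CUDA_VERSION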
Mistake 3: Not Restarting Docker After Updating the Daemon Config
Changes to /etc/docker/daemon.json don’t take effect until Docker is restarted. Skipping sudo systemctl restart docker leaves developers wondering why the configuration didn’t apply.
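A quick way to confirm the restart actually picked up the change:
docker info | grep -i runtime
You should see nvidia listed among the available runtimes.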
Mistake 4: Assuming vLLM’s Default Memory Settings Are Optimal
vLLM caps GPU memory usage through --gpu-memory-utilization, and launch scripts or older defaults can leave that cap far below what the card can provide. On a 24 GB GPU, a cap of 0.5 leaves only about 12 GB usable, which is not enough for many large models. Set the flag explicitly instead of relying on whatever default your stack ships with.
Mistake 5: Mixing NVIDIA Docker Runtime Versions
Installing both nvidia-docker (v1, deprecated) and nvidia-container-toolkit (v2, current) causes conflicts. Stick with nvidia-container-toolkit only on Ubuntu 22.04.
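To check whether leftover v1 packages are still installed (a quick sketch; remove anything matching nvidia-docker before proceeding):
dpkg -l | grep -E 'nvidia-docker|nvidia-container'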
Optimization Tips and Follow-Up Checks
Monitor GPU Memory During Model Loading
In a separate terminal, while your model is loading, run:
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader'
This shows memory allocation in real-time. You should see memory climbing as the model loads, then stabilizing.
Enable Persistent GPU Mode for Consistent Performance
Reduce GPU initialization latency on subsequent runs:
sudo nvidia-smi -pm 1
Persistence mode is a host-level setting, so it survives container restarts (it resets on host reboot) and avoids driver re-initialization delays when new processes attach to the GPU.
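To confirm the setting took effect:
nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
It should print Enabled for each GPU.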
Use nvidia-smi Queries Inside Your Monitoring Stack
Add health checks to your Docker Compose file:
docker-compose.yml segment:
services:
ollama-vllm:
image: ollama-vllm:cuda12.2
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=all
- CUDA_VISIBLE_DEVICES=0
- VLLM_GPU_MEMORY_UTILIZATION=0.9
healthcheck:
test: ["CMD", "nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"]
interval: 30s
timeout: 10s
retries: 3
ports:
- "8000:8000"
- "11434:11434"
Profile Memory Usage with PyTorch Tools
Inside your container, use PyTorch’s memory profiler:
python3 -c "
import torch
print('GPU Memory Allocated:', torch.cuda.memory_allocated() / 1e9, 'GB')
print('GPU Memory Reserved:', torch.cuda.memory_reserved() / 1e9, 'GB')
print('GPU Memory Free:', (torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated()) / 1e9, 'GB')
"
Real-World Scenario: Debugging a Failed Llama 2 70B Deployment
Let’s walk through a realistic example. You’re trying to deploy Meta’s Llama 2 70B model on your Ubuntu 22.04 machine with an RTX 4090 (24 GB). You’ve containerized it with vLLM and Ollama, but when you try to load the model, you get:
Error in container logs:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.70 GiB total capacity; 21.54 GiB already allocated; 0 bytes free; 2.00 GiB requested)
Your first instinct: “Wait, 23.70 GiB total? That’s not right. I have 24 GB.” A small gap like that is actually normal (the driver reserves a slice of VRAM), but hitting 0 bytes free on an otherwise idle GPU before the model has finished loading is the telltale sign that something else is wrong: memory is being claimed or mis-reported by an incompatible driver/runtime combination rather than by your model.
Solution applied: You check your driver version (535.104.05) and realize your Docker image is built with CUDA 12.3, which requires driver 545.xx or later. You upgrade the driver to 550.107.02, rebuild your image with CUDA 12.2, and run the container with proper runtime flags and --gpu-memory-utilization 0.9. Now the full 24 GB is visible and usable.
Before and After: What Gets Fixed
| Aspect | Before Fix | After Fix |
|---|---|---|
| GPU Memory Visible | 21-23 GB (some reserved) | Full 24 GB available |
| CUDA Operations | Fail with “out of memory” even on large models | Run smoothly; Llama 2 70B fits with quantization |
| Model Load Time | Crashes mid-load | 30–60 seconds for 70B model |
| Inference Speed | N/A (crashes) | 50–100 tokens/second depending on batch size |
| Container Stability | Frequent OOM killer restarts | Stable, predictable performance |
| Driver/CUDA Alignment | Mismatched versions causing conflicts | Aligned, validated via nvidia-smi in container |
Troubleshooting if You’re Still Stuck
If you’ve followed all steps and still see “CUDA out of memory,” try these advanced debugging steps:
- Check CUDA Compute Compatibility: Run nvidia-smi --query-gpu=compute_cap --format=csv,noheader and verify your GPU supports the CUDA version in your container.
- Verify Driver Capability Flags: Run docker run --rm --gpus all --env NVIDIA_DRIVER_CAPABILITIES=compute,utility nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi to confirm the runtime exposes both compute and utility capabilities.
- Check Container Logs for CUDA Initialization Errors: Run docker logs ollama-vllm 2>&1 | grep -i cuda to see detailed initialization messages.
- Test with a Smaller Model First: Try deploying a smaller model (like Llama 2 7B) to isolate whether the issue is memory or a fundamental CUDA problem (see the sketch after this list).
- Rebuild from Official NVIDIA Image: If you built a custom image, pull the official nvidia/cuda:12.2.0-devel-ubuntu22.04 and build from there to rule out image corruption.
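Here is a minimal sketch of the smaller-model test mentioned above, reusing the image and flags from earlier steps (the Llama 2 7B model is gated on Hugging Face, so substitute any small model you have access to):
docker run --rm --gpus all --runtime=nvidia -p 8000:8000 ollama-vllm:cuda12.2 \
  python3 -m vllm.entrypoints.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.9 \
  --dtype float16 \
  --max-model-len 4096
If a 7B model loads and serves cleanly, your CUDA stack is healthy and the original failure is a genuine memory-capacity issue for the larger model.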
Wrapping Up: You’ve Got This
The “CUDA out of memory” error inside your Docker container isn’t really about running out of memory—it’s about communication breakdown between your GPU driver, CUDA toolkit, and container runtime. By aligning these three components (driver version, CUDA version inside the container, and container runtime configuration), and by configuring memory allocation policies correctly, you unlock the full potential of your 24 GB GPU for serious AI model deployment and automation tasks.
The fix involves eight core steps: verifying your current setup, installing the NVIDIA Container Runtime, updating the Docker daemon configuration, aligning your GPU driver with CUDA requirements, using compatible Docker images, running containers with proper flags, verifying GPU access inside the container, and optimizing memory utilization for vLLM and Ollama.
On a well-configured Ubuntu 22.04 VPS with proper GPU support, you can comfortably run inference on 70B parameter models, fine-tune smaller models, or parallelize multiple inference requests—all in a containerized, reproducible environment. The debugging and configuration effort upfront pays dividends in stability and performance.
If you hit issues, go back to Step 1 and verify each layer of the stack methodically. Most “CUDA out of memory” errors resolve within 15–30 minutes once you understand the real cause.
Good luck with your AI automation and VPS deployment! You’re now equipped to handle one of the trickiest GPU debugging scenarios in containerized environments.