You’ve got a powerful 24 GB GPU sitting on your Ubuntu 22.04 machine, you’ve containerized your Ollama + vLLM setup in Docker, and everything should be working beautifully. But instead, you’re staring at a dreaded “CUDA out of memory” error that makes absolutely no sense. Your GPU has more than enough memory, the container claims to see it, and yet your large language model deployment fails to allocate space for inference. Sound familiar? You’re not alone, and the fix is closer than you think—it usually comes down to a version mismatch between your CUDA toolkit, driver, and containerized runtime.
Estimated Fix Time: 15–45 minutes (depending on driver/CUDA reinstall needs)
Required Stack: Ubuntu 22.04, NVIDIA GPU (24GB+), Docker, NVIDIA Container Runtime, CUDA Toolkit, cuDNN, Ollama, vLLM
Why This Error Happens (The Real Culprit)
The “CUDA out of memory” error inside a containerized Ollama + vLLM setup on Ubuntu 22.04 rarely means you’ve actually run out of GPU memory. Instead, it almost always indicates one of these underlying issues:
- GPU/CUDA Version Mismatch: Your Docker image was built with CUDA 12.x, but your host machine has CUDA 11.x drivers installed, or vice versa. The container sees a different runtime than what’s available at the kernel level.
- Incomplete NVIDIA Container Runtime Setup: Docker doesn’t have proper access to GPU resources because the NVIDIA Container Runtime isn’t configured or isn’t being invoked correctly.
- Driver Compatibility Issues: The NVIDIA GPU driver on your host machine is outdated or incompatible with your CUDA toolkit version inside the container.
- Memory Fragmentation or Allocation Policies: Even with 24 GB available, the container’s memory allocation strategy might be preventing contiguous allocation for large models.
- Incorrect Docker Runtime Configuration: You’re running the container with the wrong Docker runtime flag, so CUDA support isn’t being passed through to the container.
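A quick first check that exposes most of these at once is to compare the driver version the host reports with what a minimal CUDA container sees (the image tag below is just an example; use one matching your stack):
# Host side: driver version
nvidia-smi --query-gpu=driver_version --format=csv,noheader
# Container side: what the GPU looks like through Docker and the NVIDIA runtime
docker run --rm --gpus all nvidia/cuda:12.2.0-base-ubuntu22.04 nvidia-smi
If the second command errors out or reports a different driver/CUDA pairing than the first, you have already found the layer to fix.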
Before You Start: Tools and Requirements
- NVIDIA GPU with CUDA compute capability 6.0 or higher (24 GB VRAM recommended)
- Ubuntu 22.04 LTS with root or sudo access
- Docker installed (version 20.10 or later recommended)
- NVIDIA Container Runtime installed and configured
- NVIDIA GPU drivers installed on the host
- Basic Linux CLI proficiency
- Access to the nvidia-smi command (included with the driver)
- Text editor (nano, vim, or VS Code with SSH extension)
Step-by-Step Fix: Resolving the CUDA Mismatch and Memory Allocation Issues
Step 1: Verify Your Current GPU and Driver Setup
First, let’s establish what you’re working with. Run these commands on your host machine (not inside the container) to see your GPU, driver version, and CUDA compatibility:
nvidia-smi
Example output to check:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 535.104.05             Driver Version: 535.104.05                |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util             |
|===============================+======================+======================|
|   0  NVIDIA RTX 4090      Off | 00:1F.0          Off |                  N/A |
| 24%  35C    P0    45W / 450W  |      0MiB / 24576MiB |      0%              |
+-------------------------------+----------------------+----------------------+
Note down your driver version (e.g., 535.104.05) and GPU model. Next, check your CUDA version:
nvcc --version
If nvcc isn’t found, your CUDA toolkit might not be installed on the host. That’s okay—we’ll address this. Now, check what CUDA version your Docker image expects:
docker run --rm --gpus all nvidia/cuda:12.2.0-devel-ubuntu22.04 nvcc --version
This prints the CUDA toolkit version baked into that image (the runtime-only nvidia/cuda tags don't ship nvcc, so use a devel tag for this check). Whatever CUDA base image your Ollama + vLLM container is built from is the version that has to be compatible with your host driver.
Step 2: Verify NVIDIA Container Runtime is Installed and Configured
The NVIDIA Container Runtime (shipped as part of the NVIDIA Container Toolkit) is what allows Docker containers to access your GPU. The older nvidia-docker packages and apt-key based repository setup are deprecated on Ubuntu 22.04, so install the current toolkit if you haven't already:
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
After installation, restart the Docker daemon:
sudo systemctl restart docker
Verify the installation:
docker run --rm --gpus all ubuntu nvidia-smi
If you see GPU information printed, you’re good. If you get an error, move to Step 3.
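If the command fails, it is also worth confirming the toolkit itself installed cleanly before moving on; the nvidia-ctk CLI ships with it:
nvidia-ctk --version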
Step 3: Update Docker Daemon Configuration for GPU Support
Edit the Docker daemon configuration file so the NVIDIA runtime is registered and available to containers:
sudo nano /etc/docker/daemon.json
Add or update to this configuration:
{
"runtimes": {
"nvidia": {
"path": "nvidia-container-runtime",
"runtimeArgs": []
}
},
"default-runtime": "runc"
}
Important: Do not set "default-runtime": "nvidia" for all containers—some non-GPU containers may fail. Instead, specify the runtime at container launch time.
Save the file (Ctrl+O, Enter, Ctrl+X in nano) and restart Docker:
sudo systemctl restart docker
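If you would rather not edit the file by hand, recent NVIDIA Container Toolkit releases can write the same runtimes entry for you (restart Docker afterwards, just as above):
sudo nvidia-ctk runtime configure --runtime=docker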
Step 4: Update Your GPU Driver to Match CUDA Toolkit Requirements
This is where most developers get stuck. Your NVIDIA driver version must support the CUDA version inside your container. Check the compatibility matrix:
| CUDA Version | Minimum Driver Version | Maximum Driver Version |
|---|---|---|
| CUDA 11.8 | 450.xx | Latest (550+) |
| CUDA 12.0 | 525.xx | Latest (550+) |
| CUDA 12.1 | 530.xx | Latest (550+) |
| CUDA 12.2 | 535.xx | Latest (550+) |
| CUDA 12.3 | 545.xx | Latest (550+) |
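If you want to read your driver's major version straight off the host for comparison against the table, here is a minimal sketch:
driver=$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1)
echo "Host driver: $driver (major version ${driver%%.*})"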
If your driver version is below the minimum required for your CUDA version, update it:
sudo apt-get update
sudo apt-get install -y nvidia-driver-550
Replace 550 with the appropriate driver version for your CUDA toolkit. Reboot after installation:
sudo reboot
After reboot, verify with:
nvidia-smi
Step 5: Align Your Ollama + vLLM Docker Image with Host CUDA Version
Now, rebuild or pull your Ollama + vLLM Docker image with the correct CUDA base. If you’re using a pre-built image, check the Dockerfile to see what CUDA version it’s using.
Example Dockerfile using CUDA 12.2:
FROM nvidia/cuda:12.2.0-devel-ubuntu22.04
# Install dependencies
RUN apt-get update && apt-get install -y \
python3.10 \
python3-pip \
wget \
curl \
git \
build-essential \
&& rm -rf /var/lib/apt/lists/*
# Install vLLM and the Ollama Python client.
# vLLM pins a compatible PyTorch build itself, so torch is not pinned separately here;
# a mismatched pin (e.g. torch==2.0.1 alongside vllm==0.2.7) makes pip's resolver fail.
RUN pip install --no-cache-dir \
    vllm==0.2.7 \
    ollama==0.0.11
EXPOSE 8000 11434
CMD ["python3", "-m", "vllm.entrypoints.api_server"]
Make sure this CUDA version (12.2.0 in this example) matches or is compatible with your host driver version. Build the image:
docker build -t ollama-vllm:cuda12.2 .
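To double-check what actually ended up in the image, you can print the CUDA toolkit version it ships (this uses the tag from the build above; nvcc is present because the image is based on a devel tag):
docker run --rm ollama-vllm:cuda12.2 nvcc --version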
Step 6: Run Your Container with Proper GPU Access and Memory Configuration
This is critical. Run your container with the correct flags:
docker run -d \
--name ollama-vllm \
--gpus all \
--runtime=nvidia \
-e NVIDIA_VISIBLE_DEVICES=all \
-e NVIDIA_DRIVER_CAPABILITIES=compute,utility \
-e CUDA_VISIBLE_DEVICES=0 \
-p 8000:8000 \
-p 11434:11434 \
--memory=32g \
--memory-swap=32g \
--ulimit memlock=-1 \
--ulimit stack=67108864 \
ollama-vllm:cuda12.2
Flag explanation:
- --gpus all – Expose all GPUs to the container
- --runtime=nvidia – Use the NVIDIA container runtime
- -e NVIDIA_VISIBLE_DEVICES=all – Make all GPUs visible inside the container
- -e NVIDIA_DRIVER_CAPABILITIES=compute,utility – Enable compute and monitoring capabilities
- -e CUDA_VISIBLE_DEVICES=0 – Specify which GPU(s) to use (0 for the first GPU)
- --memory=32g – System RAM allocation (adjust to your host capacity)
- --memory-swap=32g – Total RAM plus swap; setting it equal to --memory disables swap for the container
- --ulimit memlock=-1 – Remove the locked-memory limit so GPU buffers can be pinned
- --ulimit stack=67108864 – Increase stack size to prevent stack overflow during tensor allocation
Step 7: Verify GPU Access Inside the Container
After the container starts, verify that GPU access is working:
docker exec ollama-vllm nvidia-smi
You should see your GPU(s) listed with full VRAM available. If the container still reports “CUDA out of memory” at this stage, the issue isn't memory pressure; it's a driver/CUDA mismatch. If nvidia-smi shows all 24 GB available, move to the next step.
Check CUDA inside the container:
docker exec ollama-vllm python3 -c "import torch; print(torch.cuda.is_available()); print(torch.cuda.get_device_name(0)); print(torch.cuda.get_device_properties(0))"
This should output:
True
NVIDIA RTX 4090 (or your GPU model)
_CudaDeviceProperties(name='NVIDIA RTX 4090', major=8, minor=9, total_memory=25769803776, multi_processor_count=128)
Step 8: Configure vLLM and Ollama Memory Allocation Policies
Even with correct CUDA setup, vLLM’s default memory allocation strategy might be conservative. Inside your container, set environment variables to optimize memory usage:
Create or update a startup script in your container:
#!/bin/bash
export CUDA_DEVICE_ORDER=PCI_BUS_ID
export CUDA_VISIBLE_DEVICES=0
export VLLM_GPU_MEMORY_UTILIZATION=0.9
export TORCH_CUDA_MALLOC_ASYNC_DEBUG=0
export NCCL_DEBUG=INFO

# For vLLM API server
python3 -m vllm.entrypoints.api_server \
  --model meta-llama/Llama-2-70b-hf \
  --tensor-parallel-size 1 \
  --gpu-memory-utilization 0.9 \
  --dtype float16 \
  --max-model-len 4096
Key setting: --gpu-memory-utilization 0.9 tells vLLM to reserve up to 90% of available GPU memory for model weights and KV cache. Set it explicitly rather than trusting whatever default your vLLM version or launch script applies, so your 24 GB GPU isn't left underutilized while a predictable slice stays free for CUDA overhead.
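To see what that cap translates to on your hardware, you can query it from the running container (a quick check, assuming the ollama-vllm container from Step 6 is up):
docker exec ollama-vllm python3 -c "import torch; t = torch.cuda.get_device_properties(0).total_memory; print(f'0.9 utilization is about {0.9*t/2**30:.1f} GiB of {t/2**30:.1f} GiB total')"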
Common Mistakes and Why They Happen
Mistake 1: Forgetting the --runtime=nvidia Flag
Developers often remember --gpus all but forget --runtime=nvidia (or the other way around). On recent Docker releases --gpus all alone is usually enough to trigger the NVIDIA hooks, but setups that depend on the registered nvidia runtime, such as Compose files that use runtime: nvidia, never get the GPU mounted when it isn't selected, so CUDA initialization fails even though the hardware is fine. Passing both the flag and the runtime is harmless and removes the ambiguity.
Mistake 2: Using the Wrong CUDA Base Image Version
Pulling nvidia/cuda:12.0.0-runtime-ubuntu22.04 when your driver supports CUDA 11.8 causes a version mismatch. The container’s CUDA libraries can’t communicate with the host driver. Always check compatibility before building.
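A quick way to see which CUDA version an image was built against without starting it is to read the CUDA_VERSION environment variable that the official nvidia/cuda images set (shown here on an example tag):
docker image inspect --format '{{range .Config.Env}}{{println .}}{{end}}' nvidia/cuda:12.2.0-runtime-ubuntu22.04 | grep ^CUDA_VERSION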
Mistake 3: Not Restarting Docker After Updating the Daemon Config
Changes to /etc/docker/daemon.json don’t take effect until Docker is restarted. Skipping sudo systemctl restart docker leaves developers wondering why the configuration didn’t apply.
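A quick way to confirm the restart actually picked up the change:
docker info | grep -i runtime
You should see nvidia listed among the available runtimes.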
Mistake 4: Assuming vLLM’s Default Memory Settings Are Optimal
vLLM caps GPU memory usage through --gpu-memory-utilization, and launch scripts or older defaults can leave that cap far below what the card can provide. On a 24 GB GPU, a cap of 0.5 leaves only about 12 GB usable, which is not enough for many large models. Set the flag explicitly instead of relying on whatever default your stack ships with.
Mistake 5: Mixing NVIDIA Docker Runtime Versions
Installing both nvidia-docker (v1, deprecated) and nvidia-container-toolkit (v2, current) causes conflicts. Stick with nvidia-container-toolkit only on Ubuntu 22.04.
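To check whether leftover v1 packages are still installed (a quick sketch; remove anything matching nvidia-docker before proceeding):
dpkg -l | grep -E 'nvidia-docker|nvidia-container'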
Optimization Tips and Follow-Up Checks
Monitor GPU Memory During Model Loading
In a separate terminal, while your model is loading, run:
watch -n 1 'nvidia-smi --query-gpu=index,memory.used,memory.free --format=csv,noheader'
This shows memory allocation in real-time. You should see memory climbing as the model loads, then stabilizing.
Enable Persistent GPU Mode for Consistent Performance
Reduce GPU initialization latency on subsequent runs:
sudo nvidia-smi -pm 1
Persistence mode is a host-level setting, so it survives container restarts (it resets on host reboot) and avoids driver re-initialization delays when new processes attach to the GPU.
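To confirm the setting took effect:
nvidia-smi --query-gpu=persistence_mode --format=csv,noheader
It should print Enabled for each GPU.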
Use nvidia-smi Queries Inside Your Monitoring Stack
Add health checks to your Docker Compose file:
docker-compose.yml segment:
services:
ollama-vllm:
image: ollama-vllm:cuda12.2
runtime: nvidia
environment:
- NVIDIA_VISIBLE_DEVICES=all
- CUDA_VISIBLE_DEVICES=0
- VLLM_GPU_MEMORY_UTILIZATION=0.9
healthcheck:
test: ["CMD", "nvidia-smi", "--query-gpu=memory.free", "--format=csv,noheader"]
interval: 30s
timeout: 10s
retries: 3
ports:
- "8000:8000"
- "11434:11434"
Profile Memory Usage with PyTorch Tools
Inside your container, use PyTorch’s memory profiler:
python3 -c "
import torch
print('GPU Memory Allocated:', torch.cuda.memory_allocated() / 1e9, 'GB')
print('GPU Memory Reserved:', torch.cuda.memory_reserved() / 1e9, 'GB')
print('GPU Memory Free:', (torch.cuda.get_device_properties(0).total_memory - torch.cuda.memory_allocated()) / 1e9, 'GB')
"
Real-World Scenario: Debugging a Failed Llama 2 70B Deployment
Let’s walk through a realistic example. You’re trying to deploy Meta’s Llama 2 70B model on your Ubuntu 22.04 machine with an RTX 4090 (24 GB). You’ve containerized it with vLLM and Ollama, but when you try to load the model, you get:
Error in container logs:
RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB (GPU 0; 23.70 GiB total capacity; 21.54 GiB already allocated; 0 bytes free; 2.00 GiB requested)
Your first instinct: “Wait, 23.70 GiB total? That’s not right. I have 24 GB.” A small gap like that is actually normal (the driver reserves a slice of VRAM), but hitting 0 bytes free on an otherwise idle GPU before the model has finished loading is the telltale sign that something else is wrong: memory is being claimed or mis-reported by an incompatible driver/runtime combination rather than by your model.
Solution applied: You check your driver version (535.104.05) and realize your Docker image is built with CUDA 12.3, which requires driver 545.xx or later. You upgrade the driver to 550.107.02, rebuild your image with CUDA 12.2, and run the container with proper runtime flags and --gpu-memory-utilization 0.9. Now the full 24 GB is visible and usable.
Before and After: What Gets Fixed
| Aspect | Before Fix | After Fix |
|---|---|---|
| GPU Memory Visible | 21-23 GB (some reserved) | Full 24 GB available |
| CUDA Operations | Fail with “out of memory” even on large models | Run smoothly; Llama 2 70B fits with quantization |
| Model Load Time | Crashes mid-load | 30–60 seconds for 70B model |
| Inference Speed | N/A (crashes) | 50–100 tokens/second depending on batch size |
| Container Stability | Frequent OOM killer restarts | Stable, predictable performance |
| Driver/CUDA Alignment | Mismatched versions causing conflicts | Aligned, validated via nvidia-smi in container |
Troubleshooting if You’re Still Stuck
If you’ve followed all steps and still see “CUDA out of memory,” try these advanced debugging steps:
- Check CUDA Compute Compatibility: Run nvidia-smi --query-gpu=compute_cap --format=csv,noheader and verify your GPU supports the CUDA version in your container.
- Verify Driver Capability Flags: Run docker run --rm --gpus all --env NVIDIA_DRIVER_CAPABILITIES=compute,utility nvidia/cuda:12.2.0-runtime-ubuntu22.04 nvidia-smi to confirm the runtime exposes both compute and utility capabilities.
- Check Container Logs for CUDA Initialization Errors: Run docker logs ollama-vllm 2>&1 | grep -i cuda to see detailed initialization messages.
- Test with a Smaller Model First: Try deploying a smaller model (like Llama 2 7B) to isolate whether the issue is memory or a fundamental CUDA problem (see the sketch after this list).
- Rebuild from Official NVIDIA Image: If you built a custom image, pull the official nvidia/cuda:12.2.0-devel-ubuntu22.04 and build from there to rule out image corruption.
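Here is a minimal sketch of the smaller-model test mentioned above, reusing the image and flags from earlier steps (the Llama 2 7B model is gated on Hugging Face, so substitute any small model you have access to):
docker run --rm --gpus all --runtime=nvidia -p 8000:8000 ollama-vllm:cuda12.2 \
  python3 -m vllm.entrypoints.api_server \
  --model meta-llama/Llama-2-7b-hf \
  --gpu-memory-utilization 0.9 \
  --dtype float16 \
  --max-model-len 4096
If a 7B model loads and serves cleanly, your CUDA stack is healthy and the original failure is a genuine memory-capacity issue for the larger model.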
Wrapping Up: You’ve Got This
The “CUDA out of memory” error inside your Docker container isn’t really about running out of memory—it’s about communication breakdown between your GPU driver, CUDA toolkit, and container runtime. By aligning these three components (driver version, CUDA version inside the container, and container runtime configuration), and by configuring memory allocation policies correctly, you unlock the full potential of your 24 GB GPU for serious AI model deployment and automation tasks.
The fix involves eight core steps: verifying your current setup, installing the NVIDIA Container Runtime, updating the Docker daemon configuration, aligning your GPU driver with CUDA requirements, using compatible Docker images, running containers with proper flags, verifying GPU access inside the container, and optimizing memory utilization for vLLM and Ollama.
On a well-configured Ubuntu 22.04 VPS with proper GPU support, you can comfortably run inference on 70B parameter models, fine-tune smaller models, or parallelize multiple inference requests—all in a containerized, reproducible environment. The debugging and configuration effort upfront pays dividends in stability and performance.
If you hit issues, go back to Step 1 and verify each layer of the stack methodically. Most “CUDA out of memory” errors resolve within 15–30 minutes once you understand the real cause.
Good luck with your AI automation and VPS deployment! You’re now equipped to handle one of the trickiest GPU debugging scenarios in containerized environments.