vLLM Docker container keeps crashing with “CUDA out of memory” on Ubuntu 22.04 (RTX 4090): a step‑by‑step fix for the GPU memory leak and version mismatch.

You’ve been running vLLM in Docker for LLM inference, everything seemed fine in development, and then bam: your container crashes with “CUDA out of memory” after a few minutes. Your RTX 4090 has 24 GB of VRAM, but it’s behaving like a laptop with 2 GB. This is one of the most frustrating debugging sessions you can hit.
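Before digging into the full fix, a quick sanity check often narrows the problem down. The sketch below assumes the official `vllm/vllm-openai` image and uses vLLM's real `--gpu-memory-utilization` flag to cap how much VRAM the engine reserves (it pre-allocates ~90% by default for the KV cache, which can look like a leak); `--ipc=host` avoids PyTorch shared-memory errors in containers. The model name is purely illustrative.

```shell
# First, see what is actually holding VRAM before blaming vLLM
# (a stale process from a crashed container can pin gigabytes).
nvidia-smi --query-gpu=memory.used,memory.total --format=csv

# Run vLLM with an explicit VRAM cap and a bounded context length.
# --gpu-memory-utilization: fraction of GPU memory vLLM may reserve (default 0.9).
# --max-model-len: shrinks the per-request KV-cache footprint.
# Model name below is an example; substitute the model you serve.
docker run --rm --gpus all --ipc=host -p 8000:8000 \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --gpu-memory-utilization 0.85 \
  --max-model-len 8192
```

If the crash loop stops after lowering `--gpu-memory-utilization` or `--max-model-len`, the 24 GB card was simply being over-committed rather than leaking.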