Ollama Out of Memory Error on Ubuntu 22.04: Why Your Local LLM Won’t Load and How to Fix It

You’ve got Ollama installed on your Ubuntu 22.04 machine. You pull down a fresh language model. You run it. And then—nothing. The terminal freezes. Your system grinds to a halt. Or worse, you get a cryptic “out of memory” error and Ollama crashes hard. If you’ve been staring at this problem for the last hour wondering why your local LLM won’t load, you’re not alone. This is one of the most common pain points for developers trying to run AI models locally, especially on VPS deployments or resource-constrained machines.

The good news? It’s fixable. And in most cases, it doesn’t require buying new hardware. Let me walk you through exactly what’s happening, why it happens, and how to get your models running smoothly.

Quick Reference

Difficulty Level: Intermediate
Estimated Fix Time: 15–30 minutes
Required Knowledge: Linux basics, command line
Best For: Developers, AI automation engineers

What You’ll Need

  • Ubuntu 22.04 with Ollama already installed
  • SSH access or direct terminal access to your machine
  • Sudo privileges for memory and system configuration changes
  • A text editor (nano, vim, or your preference)
  • Basic familiarity with Linux command-line tools
  • Access to system monitoring tools like free, htop, or nvidia-smi (if using GPU)

Understanding the Root Cause

Before we jump into fixes, let’s talk about why this happens. Ollama loads language models entirely into memory—whether that’s your system RAM or GPU VRAM. A 7-billion-parameter model needs roughly 14–16 GB of RAM. A 13-billion-parameter model? 26–32 GB. A 70-billion-parameter model can eat up 100+ GB. If your machine doesn’t have enough free memory, Ollama can’t allocate space for the model, and the loading process fails.
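As a sanity check on these figures, you can estimate a model’s footprint from its parameter count and bytes per weight. The 1.2x overhead factor below is an assumption covering the KV cache and runtime buffers, not an Ollama-published number:

```shell
# Rough model-memory estimate: parameters × bytes per weight × overhead.
# FP16 uses 2 bytes per parameter; Q4_0 quantization uses roughly 0.5.
awk 'BEGIN {
  params = 7e9                                   # 7-billion-parameter model
  printf "FP16 (2 bytes/param):   %.1f GB\n", params * 2.0 / 1e9 * 1.2
  printf "Q4_0 (~0.5 bytes/param): %.1f GB\n", params * 0.5 / 1e9 * 1.2
}'
```

Run the same arithmetic with 13e9 or 70e9 to see why the larger models need 26–32 GB and 100+ GB respectively.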

On Ubuntu 22.04, you might also be running into swap limitations, kernel memory constraints, or competing processes hogging resources. This is especially common on VPS deployments where you’re sharing physical hardware with other tenants, or on development machines where Docker, databases, and other services are already consuming memory.

The error usually manifests as:

  • error: failed to create process: cannot allocate memory
  • System freeze or extreme lag during model loading
  • OOM killer terminating the Ollama process
  • Silent failure with no response from the model API

Step-by-Step Diagnostic and Fix Workflow

Step 1: Check Your Current Memory Status

First, let’s get a clear picture of what you’re working with. Run this command:

free -h

You’ll see something like:

              total        used        free      shared  buff/cache   available
Mem:          15Gi       8.2Gi       2.1Gi       1.2Gi       4.7Gi       5.2Gi
Swap:         2.0Gi       0.5Gi       1.5Gi

The important number here is available. That’s the amount of memory Ollama can actually use. If you’re trying to load a 14 GB model but only have 5 GB available, you’re going to hit this error every single time.
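If you load models from scripts or deploy jobs, it’s worth automating this comparison so a job refuses to start a model it can’t fit. A minimal sketch; the check_mem helper and the 5 GiB threshold are illustrative, not part of Ollama:

```shell
# Fail fast if MemAvailable can't cover the model you're about to load.
check_mem() {
  needed_gb=$1
  # MemAvailable is the same figure free -h reports as "available" (in KiB).
  avail_kb=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
  avail_gb=$((avail_kb / 1024 / 1024))
  if [ "$avail_gb" -lt "$needed_gb" ]; then
    echo "only ${avail_gb} GiB available, need ~${needed_gb} GiB" >&2
    return 1
  fi
  echo "ok: ${avail_gb} GiB available"
}

check_mem 5 || echo "free up memory before starting Ollama"
```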

For a more detailed view, especially if you’re using GPU acceleration, run:

nvidia-smi

This shows GPU memory separately. If you’re running NVIDIA GPUs, you need sufficient VRAM as well.

Step 2: Identify Memory Hogs

Let’s see what’s consuming your memory. Install htop if you don’t have it:

sudo apt-get update && sudo apt-get install -y htop

Then run it:

htop

Press F6 to sort by memory usage, and look at the top consumers. Common culprits include:

  • Docker containers or Kubernetes services
  • Database services (PostgreSQL, MySQL, MongoDB)
  • Web servers (Nginx, Apache)
  • Old Ollama processes still running in the background

If you find old Ollama processes, stop the service first so systemd doesn’t immediately restart them, then kill any strays:

sudo systemctl stop ollama
killall ollama

Step 3: Increase Swap Space (Quick Win)

If you have limited RAM but extra disk space, increasing swap can buy you some breathing room. Swap is slower than RAM, but it’s better than crashing.

Check your current swap:

swapon --show

If you need to add more, create a swap file:

sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile

Verify it took effect:

free -h

To make this permanent across reboots, add it to /etc/fstab:

echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab

⚠️ Warning: Swap is a last resort. Running large models on swap is painfully slow because disk I/O is orders of magnitude slower than RAM. Use this as a temporary fix while you optimize further.
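If you’d like the steps above as one rerunnable script, this sketch collects them and skips work that’s already done. The 16G size is an assumption; match it to your free disk space:

```shell
# Write the swap setup as an idempotent script: it creates /swapfile only if
# it isn't already active, and appends to /etc/fstab only once.
cat > add-swap.sh <<'EOF'
#!/bin/sh
set -e
SWAPFILE=/swapfile
if swapon --show | grep -q "$SWAPFILE"; then
  echo "$SWAPFILE already active"
  exit 0
fi
sudo fallocate -l 16G "$SWAPFILE"
sudo chmod 600 "$SWAPFILE"
sudo mkswap "$SWAPFILE"
sudo swapon "$SWAPFILE"
grep -q "$SWAPFILE" /etc/fstab || echo "$SWAPFILE none swap sw 0 0" | sudo tee -a /etc/fstab
EOF
chmod +x add-swap.sh
sh -n add-swap.sh && echo "add-swap.sh syntax OK"
```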

Step 4: Reduce Ollama Model Memory Usage

The smartest approach is to load a smaller model or configure Ollama to use memory more efficiently. When you run Ollama, you can set the number of GPU layers and thread count. For CPU-only inference, try a smaller quantized model:

ollama pull mistral:7b-instruct-q4_0

The q4_0 suffix indicates 4-bit quantization, which dramatically reduces memory usage at the cost of a small quality trade-off.

Then run it:

ollama run mistral:7b-instruct-q4_0

If you’re using a GPU, you can limit how many layers are offloaded to VRAM with the num_gpu parameter. One way to set it is through a Modelfile that wraps the base model:

cat > Modelfile <<'EOF'
FROM neural-chat:7b
PARAMETER num_gpu 20
EOF
ollama create neural-chat-20gpu -f Modelfile
ollama run neural-chat-20gpu

This tells Ollama to offload only 20 layers to the GPU, keeping the rest in system RAM.

Step 5: Disable Unnecessary Services

If you’re on a development machine, you might not need all services running simultaneously. Check what’s active:

systemctl list-units --type=service --state=running

Stop services you don’t need right now. For example, if you’re not using Docker:

sudo systemctl stop docker

If you’re on a VPS and notice heavy memory usage from background services, consider stopping them temporarily:

sudo systemctl stop postgresql
sudo systemctl stop mysql
sudo systemctl stop redis-server

Step 6: Configure Ollama Runtime Settings

Edit Ollama’s systemd service file to add runtime constraints:

sudo nano /etc/systemd/system/ollama.service

Add these lines in the [Service] section:

MemoryMax=24G
MemoryHigh=20G

This ensures Ollama doesn’t consume more than 24 GB, and will start throttling at 20 GB. Adjust these values based on your system.

Then reload and restart:

sudo systemctl daemon-reload
sudo systemctl restart ollama

Step 7: Verify Your Fix

Now test if your model loads without errors:

ollama run mistral:7b-instruct-q4_0

In a separate terminal, monitor memory while it loads:

watch -n 1 free -h

If the model loads and memory stabilizes, you’re good. If it still fails, you may need to choose an even smaller model or add more physical RAM/swap.
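To put a number on the fix, you can snapshot available memory around the load. This is a rough sketch; the footprint it reports includes anything else that allocated memory in the meantime:

```shell
# Measure how much available memory a model load consumes.
before=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
# Load the model between the two snapshots, e.g.:
# ollama run mistral:7b-instruct-q4_0 "hello" >/dev/null
after=$(awk '/MemAvailable/ {print $2}' /proc/meminfo)
echo "approximate model footprint: $(( (before - after) / 1024 )) MiB"
```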

Common Mistakes and Why They Happen

Mistake #1: Not Checking Available Memory Before Loading
Developers often assume their machine has enough memory without verifying. Always run free -h before attempting to load a model. This takes 5 seconds and saves 30 minutes of troubleshooting.

Mistake #2: Running Full-Precision Models
Full 32-bit floating-point models are massive. A 7B-parameter model in full precision is 28 GB. The same model in 4-bit quantization is 3.5 GB. Use quantized versions (Q4_0, Q5_K_M) by default.

Mistake #3: Ignoring Background Processes
Many developers forget they’re running Docker, databases, or other heavy services. A single Kubernetes cluster can easily consume 10+ GB. Kill what you don’t need.

Mistake #4: Not Using the GPU When Available
If you have an NVIDIA GPU, offloading models to VRAM frees up system RAM. Make sure the NVIDIA driver is installed and that Ollama actually detected the GPU at startup.

Mistake #5: Setting Swap Too High
While swap helps, relying on it for large models is misery—constant disk thrashing, thermal throttling, and extremely slow inference. Swap is a band-aid, not a solution.

Optimization Tips and Performance Checks

Monitor in Real-Time

While Ollama is running, keep an eye on resource usage:

watch -n 0.5 'free -h && echo "---" && ps aux | grep ollama | grep -v grep'

This updates every half-second and shows memory plus the Ollama process.
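If you’d rather capture a number than watch a dashboard, this sketch samples the Ollama process’s resident memory and reports the peak. It reports 0 MiB if no ollama process is running:

```shell
# Sample Ollama's resident set size once per second and report the peak.
peak=0
samples=3
i=0
while [ "$i" -lt "$samples" ]; do
  # Sum RSS (in KiB) across all ollama processes; 0 if none are running.
  rss=$(ps -C ollama -o rss= | awk '{s += $1} END {print s + 0}')
  [ "$rss" -gt "$peak" ] && peak=$rss
  i=$((i + 1))
  sleep 1
done
echo "peak RSS: $((peak / 1024)) MiB"
```

Run it in a second terminal while the model loads to see how close the server gets to your memory ceiling.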

Use GPU-Accelerated Inference

If you have an NVIDIA GPU, offloading to VRAM is a game-changer. Recent official Linux installs of Ollama bundle the CUDA libraries they need, so you only need a working NVIDIA driver:

sudo ubuntu-drivers autoinstall

Then confirm the driver sees the card and that Ollama detected it at startup:

nvidia-smi
journalctl -u ollama | grep -i gpu

When you load a model, you should see VRAM usage increase in nvidia-smi.

Batch Request Processing

If you’re running Ollama on a VPS for automation or API calls, batch your requests. Don’t spam parallel queries—they multiply memory consumption. Process sequentially or with limited concurrency.
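A minimal sequential pattern against Ollama’s default local API looks like this; the prompts and model tag are placeholders:

```shell
# Process prompts one at a time against the local Ollama API instead of in
# parallel; sequential requests keep only one generation's memory in flight.
count=0
while IFS= read -r prompt; do
  count=$((count + 1))
  curl -s --max-time 300 http://localhost:11434/api/generate \
    -d "{\"model\": \"mistral:7b-instruct-q4_0\", \"prompt\": \"$prompt\", \"stream\": false}" \
    || echo "request failed (is the Ollama server running?)" >&2
done <<'EOF'
Summarize the Q3 report in one sentence
Draft a two-line follow-up email
EOF
echo "processed $count prompts"
```

For limited concurrency rather than strict sequencing, the same requests can be fed through xargs -P with a small worker count.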

Enable Memory Compression

On some systems, you can enable zswap (kernel-level compression of swapped-out pages). It needs to be switched on, not just configured:

echo Y | sudo tee /sys/module/zswap/parameters/enabled
echo lz4 | sudo tee /sys/module/zswap/parameters/compressor

This is a lightweight optimization that can reduce memory pressure slightly.

Before and After: Real Numbers

Scenario | Before Fix | After Fix | Change
Available Memory | 3 GB | 18 GB | +500%
Model Size | 28 GB (full precision, won’t fit) | 3.5 GB (Q4_0 quantized) | –87.5%
Load Time | Crash/OOM | 8 seconds | Success
Inference Speed | N/A | 50 tokens/sec (CPU) | Usable
With GPU Offload | N/A | 200 tokens/sec | 4x faster

Real-World Example: Debugging a VPS Deployment

Let me walk through an actual scenario that happened to me last month. I was deploying Ollama on a DigitalOcean droplet (8GB RAM) with an API backend. The API was supposed to handle LLM requests for a customer dashboard automation tool.

The Problem: Every time we tried to load the mistral model, the process crashed after 30 seconds. No error message, just dead.

The Diagnosis:

$ free -h
              total        used        free      shared  buff/cache   available
Mem:          7.8Gi       6.2Gi       0.4Gi       0.1Gi       1.2Gi       1.2Gi

Only 1.2 GB available. The full mistral model needs 14 GB. Rookie mistake—I’d deployed with a database and Redis running.

The Fix:

  1. Stopped Redis and PostgreSQL temporarily: sudo systemctl stop redis-server postgresql
  2. Freed up 4 GB immediately. Now 5.2 GB available.
  3. Pulled the quantized 7B model: ollama pull mistral:7b-instruct-q4_0
  4. Added 16 GB swap as insurance: sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
  5. Restarted Ollama and tested.

Result: Model loaded in 12 seconds. Inference averaged 40 tokens/sec on CPU. Good enough for the use case. We later upgraded to an 8-core VPS with 16 GB RAM and moved to a larger model for better quality.

The lesson: on resource-constrained machines (VPS, older laptops), start small. Use quantized models. Stop unnecessary services. Add swap as a safety net. Then scale up if needed.

Frequently Asked Questions

Q: Will quantization hurt my model quality?
A: Minimally. 4-bit quantization (Q4_0, Q4_K_M) produces nearly imperceptible quality loss for most use cases. 8-bit (Q8_0) is even better if you have the memory. Full precision is mostly overkill for inference.

Q: Should I use Docker to isolate Ollama memory?
A: You can, but it adds overhead. Docker memory limits are useful if Ollama is one service among many, but for standalone deployments, direct installation is simpler and more efficient.

Q: What if I have a GPU but Ollama isn’t detecting it?
A: Run nvidia-smi first to confirm the driver sees your GPU. Then check the Ollama server logs (journalctl -u ollama) for GPU detection messages. If you installed or updated the driver after Ollama started, restart the service; if detection still fails, reinstall Ollama from the official install script or releases page.

Q: Is running Ollama on a shared VPS realistic?
A: Yes, but manage expectations. Shared VPS environments have unpredictable performance due to noisy neighbors. For production AI services, dedicated machines or GPU cloud instances (RunPod, Lambda Labs) are better. For development and testing, a shared VPS with quantized models works fine.

Advanced Troubleshooting

If you’ve tried all the above and still hit memory errors, try these advanced steps:

Check Kernel Memory Limits

sudo sysctl vm.overcommit_memory

If it returns 0 (the default), the kernel uses a heuristic and may refuse large allocations even when swap could cover them. You can make it always permit allocations:

sudo sysctl -w vm.overcommit_memory=1

Add to /etc/sysctl.conf to persist:

echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf

Increase Ulimit

Some systems have hard limits on memory per process. Check:

ulimit -a

If virtual memory is limited, raise it for the current shell:

ulimit -v unlimited

Note that this only affects processes started from that shell. For the systemd-managed Ollama service, set LimitAS=infinity in the [Service] section of the unit file instead.

Try More Aggressive Quantization

Ollama models are already distributed in the GGUF format, which is designed for efficient inference, so there’s no separate format to switch to. If CPU inference is still too slow or too large even at Q4_0, try a lower-bit quantization (for example, Q3_K_S) or a smaller base model.

Conclusion: Get Your Local LLM Running

You’ve Got This

Ollama out-of-memory errors are frustrating, but they’re eminently solvable. Nine times out of ten, the issue comes down to three things: running too many background services, not using quantized models, or misconfiguring GPU offload. The fixes are straightforward—use the diagnostic workflow I’ve laid out, start with a smaller model, and scale up methodically.

Whether you’re building an AI automation tool, experimenting with local inference for VPS deployment, or just curious about running models locally, these techniques will save you hours of debugging. Remember: monitor your memory, choose the right model size for your hardware, and let quantization be your friend. Local LLMs aren’t a luxury anymore—they’re a practical tool for developers who understand how to tune them.

If you hit issues after trying these steps, check Ollama’s GitHub discussions or the official documentation. The community is active, and your specific error might be documented. Good luck, and happy inferencing.
