You’ve got Ollama installed on your Ubuntu 22.04 machine. You pull down a fresh language model. You run it. And then—nothing. The terminal freezes. Your system grinds to a halt. Or worse, you get a cryptic “out of memory” error and Ollama crashes hard. If you’ve been staring at this problem for the last hour wondering why your local LLM won’t load, you’re not alone. This is one of the most common pain points for developers trying to run AI models locally, especially on VPS deployments or resource-constrained machines.
The good news? It’s fixable. And in most cases, it doesn’t require buying new hardware. Let me walk you through exactly what’s happening, why it happens, and how to get your models running smoothly.
Quick Reference
- Difficulty: Intermediate
- Time required: 15–30 minutes
- Prerequisites: Linux basics, command line
- Audience: Developers, AI automation engineers
What You’ll Need
- Ubuntu 22.04 with Ollama already installed
- SSH access or direct terminal access to your machine
- Sudo privileges for memory and system configuration changes
- A text editor (nano, vim, or your preference)
- Basic familiarity with Linux command-line tools
- Access to system monitoring tools like free, htop, or nvidia-smi (if using GPU)
Understanding the Root Cause
Before we jump into fixes, let’s talk about why this happens. Ollama loads language models entirely into memory, whether that’s your system RAM or GPU VRAM. At full 16-bit precision, a 7-billion-parameter model needs roughly 14–16 GB of RAM, a 13-billion-parameter model needs 26–32 GB, and a 70-billion-parameter model can eat up 100+ GB (quantized variants, which we’ll use later, need far less). If your machine doesn’t have enough free memory, Ollama can’t allocate space for the model, and the loading process fails.
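As a back-of-the-envelope check, you can estimate a model’s footprint from its parameter count and precision: roughly 2 bytes per parameter at 16-bit, about 0.5 bytes per parameter at 4-bit quantization, plus some overhead for the context. Here is a rough sketch; the helper name and the overhead constant are illustrative assumptions, not Ollama internals:

```bash
# Rough estimate: parameters (in billions) x bytes per parameter + overhead.
# fp16 ~ 2 bytes/param, q8_0 ~ 1 byte/param, q4_0 ~ 0.5 bytes/param.
model_mem_gb() {
  local params_b=$1 bytes_per_param=$2
  awk -v p="$params_b" -v b="$bytes_per_param" \
    'BEGIN { printf "~%.1f GB\n", p * b + 1.5 }'   # +1.5 GB of assumed overhead
}
model_mem_gb 7 2     # 7B at fp16  -> ~15.5 GB
model_mem_gb 7 0.5   # 7B at q4_0  -> ~5.0 GB
model_mem_gb 13 2    # 13B at fp16 -> ~27.5 GB
```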
On Ubuntu 22.04, you might also be running into swap limitations, kernel memory constraints, or competing processes hogging resources. This is especially common on VPS deployments where you’re sharing physical hardware with other tenants, or on development machines where Docker, databases, and other services are already consuming memory.
The error usually manifests as:
- error: failed to create process: cannot allocate memory
- System freeze or extreme lag during model loading
- OOM killer terminating the Ollama process
- Silent failure with no response from the model API
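If you suspect the OOM killer specifically, the kernel log will confirm it. A quick check, assuming a standard systemd install (the exact wording of the log line varies by kernel version):

```bash
# Look for kernel OOM-killer entries mentioning ollama
sudo dmesg -T | grep -iE "out of memory|oom-kill" | grep -i ollama
# Or, with journald:
journalctl -k --since "1 hour ago" | grep -i "killed process"
```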
Step-by-Step Diagnostic and Fix Workflow
Step 1: Check Your Current Memory Status
First, let’s get a clear picture of what you’re working with. Run this command:
free -h
You’ll see something like:
total used free shared buff/cache available
Mem: 15Gi 8.2Gi 2.1Gi 1.2Gi 4.7Gi 5.2Gi
Swap: 2.0Gi 0.5Gi 1.5Gi
The important number here is available. That’s the amount of memory Ollama can actually use. If you’re trying to load a 14 GB model but only have 5 GB available, you’re going to hit this error every single time.
For a more detailed view, especially if you’re using GPU acceleration, run:
nvidia-smi
This shows GPU memory separately. If you’re running NVIDIA GPUs, you need sufficient VRAM as well.
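If you want to automate this check before each load, a small pre-flight script like the sketch below works; the expected model size is an assumption you pass in yourself (for example, roughly 5 GB for a q4_0 7B model):

```bash
#!/usr/bin/env bash
# Usage: ./preflight.sh 5   (expected model size in whole GB; hypothetical script name)
MODEL_GB=${1:-5}
# MemAvailable is reported in kB; convert to GB (integer math is fine here)
AVAILABLE_GB=$(( $(awk '/MemAvailable/ {print $2}' /proc/meminfo) / 1024 / 1024 ))
if [ "$AVAILABLE_GB" -lt "$MODEL_GB" ]; then
  echo "Only ${AVAILABLE_GB} GB available; a ~${MODEL_GB} GB model will likely fail. Free memory or add swap."
  exit 1
fi
echo "${AVAILABLE_GB} GB available; should be enough for a ~${MODEL_GB} GB model."
```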
Step 2: Identify Memory Hogs
Let’s see what’s consuming your memory. Install htop if you don’t have it:
sudo apt-get update && sudo apt-get install -y htop
Then run it:
htop
Press F6 and select the memory column to sort by usage (or press Shift+M as a shortcut), and look at the top consumers. Common culprits include:
- Docker containers or Kubernetes services
- Database services (PostgreSQL, MySQL, MongoDB)
- Web servers (Nginx, Apache)
- Old Ollama processes still running in the background
If you find old Ollama processes, kill them:
killall ollama
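One caveat: if Ollama was installed with the official script, it usually runs as a systemd service that restarts automatically, so a bare killall may not stick. Something along these lines is more reliable:

```bash
# See which Ollama processes are still around (full command lines)
pgrep -af ollama
# If it's managed by systemd (the default install), stop the service instead,
# so it doesn't immediately respawn:
sudo systemctl stop ollama
```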
Step 3: Increase Swap Space (Quick Win)
If you have limited RAM but extra disk space, increasing swap can buy you some breathing room. Swap is slower than RAM, but it’s better than crashing.
Check your current swap:
swapon --show
If you need to add more, create a swap file:
sudo fallocate -l 16G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Verify it took effect:
free -h
To make this permanent across reboots, add it to /etc/fstab:
echo '/swapfile none swap sw 0 0' | sudo tee -a /etc/fstab
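Two optional extras are worth considering here: confirm you actually have the disk space before creating the swap file, and keep vm.swappiness low so the kernel only swaps under real pressure instead of pushing Ollama’s hot pages to disk. A quick sketch:

```bash
df -h /                              # confirm you have 16G+ free before fallocate
sudo sysctl -w vm.swappiness=10      # prefer RAM; swap only under real pressure
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf   # persist across reboots
```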
Step 4: Reduce Ollama Model Memory Usage
The smartest approach is to load a smaller model or configure Ollama to use memory more efficiently. When you run Ollama, you can set the number of GPU layers and thread count. For CPU-only inference, try a smaller quantized model:
ollama pull mistral:7b-instruct-q4_0
The q4_0 suffix indicates 4-bit quantization, which dramatically reduces memory usage at the cost of a small quality trade-off.
Then run it:
ollama run mistral:7b-instruct-q4_0
If you’re using a GPU, you can limit how many layers are offloaded to VRAM with the num_gpu option. On recent Ollama versions you can set it per session from the interactive prompt:
ollama run neural-chat:7b
/set parameter num_gpu 20
This tells Ollama to offload only 20 layers to the GPU, keeping the rest in system RAM.
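If you’d rather bake the layer cap in than set it per session, you can create a small variant of the model with a Modelfile. A minimal sketch (the file and model names here are just examples):

```bash
# Modelfile that caps GPU offload at 20 layers for an existing local model
cat > Modelfile.partial-gpu <<'EOF'
FROM neural-chat:7b
PARAMETER num_gpu 20
EOF
ollama create neural-chat-partial -f Modelfile.partial-gpu
ollama run neural-chat-partial
```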
Step 5: Disable Unnecessary Services
If you’re on a development machine, you might not need all services running simultaneously. Check what’s active:
systemctl list-units --type=service --state=running
Stop services you don’t need right now. For example, if you’re not using Docker:
sudo systemctl stop docker
If you’re on a VPS and notice heavy memory usage from background services, consider stopping them temporarily:
sudo systemctl stop postgresql
sudo systemctl stop mysql
sudo systemctl stop redis-server
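If you do this often, a small loop that only stops services that are actually running saves a few keystrokes (adjust the list to whatever you have installed), and re-checking free afterwards confirms the win:

```bash
for svc in docker postgresql mysql redis-server; do
  systemctl is-active --quiet "$svc" && sudo systemctl stop "$svc"
done
free -h   # confirm how much memory was released
```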
Step 6: Configure Ollama Runtime Settings
Edit Ollama’s systemd service file to add runtime constraints:
sudo nano /etc/systemd/system/ollama.service
Add these lines in the [Service] section:
MemoryMax=24G
MemoryHigh=20G
This ensures Ollama doesn’t consume more than 24 GB, and will start throttling at 20 GB. Adjust these values based on your system.
Then reload and restart:
sudo systemctl daemon-reload
sudo systemctl restart ollama
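If you’d rather not touch the unit file itself (direct edits can be overwritten when Ollama is reinstalled or upgraded), a systemd drop-in achieves the same thing. A minimal sketch:

```bash
# Drop-in override; systemd merges this with the shipped ollama.service
sudo mkdir -p /etc/systemd/system/ollama.service.d
cat <<'EOF' | sudo tee /etc/systemd/system/ollama.service.d/memory.conf
[Service]
MemoryHigh=20G
MemoryMax=24G
EOF
sudo systemctl daemon-reload && sudo systemctl restart ollama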
Step 7: Verify Your Fix
Now test if your model loads without errors:
ollama run mistral:7b-instruct-q4_0
In a separate terminal, monitor memory while it loads:
watch -n 1 free -h
If the model loads and memory stabilizes, you’re good. If it still fails, you may need to choose an even smaller model or add more physical RAM/swap.
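If you’re serving Ollama behind an API rather than the interactive CLI, a quick curl against the default endpoint (port 11434) is a better end-to-end check:

```bash
curl -s http://localhost:11434/api/generate \
  -d '{"model": "mistral:7b-instruct-q4_0", "prompt": "Say hello in one sentence.", "stream": false}'
```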
Common Mistakes and Why They Happen
The biggest one is skipping the memory check: always run free -h before attempting to load a model. This takes 5 seconds and saves 30 minutes of troubleshooting. Other frequent missteps are covered above: pulling a full-precision model when a quantized one would do, leaving old Ollama processes or heavy services (Docker, databases) running in the background, and firing parallel API requests that multiply memory use.
Optimization Tips and Performance Checks
Monitor in Real-Time
While Ollama is running, keep an eye on resource usage:
watch -n 0.5 'free -h && echo "---" && ps aux | grep ollama | grep -v grep'
This updates every half-second and shows memory plus the Ollama process.
Use GPU-Accelerated Inference
If you have an NVIDIA GPU, offloading to VRAM is a game-changer. What Ollama actually needs is a working NVIDIA driver (the full CUDA toolkit is optional); on Ubuntu 22.04 the simplest route is usually:
sudo ubuntu-drivers autoinstall
The standard Linux install of Ollama already bundles CUDA support, so there is normally no separate build to download. After a reboot, confirm the server detected the GPU (assuming the usual systemd install) and check where a loaded model actually runs:
journalctl -u ollama | grep -i gpu
ollama ps
ollama ps shows a PROCESSOR column telling you whether the model is on GPU, CPU, or split between the two, and you should see VRAM usage increase in nvidia-smi while the model loads.
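To watch VRAM fill up as the model loads, a narrow nvidia-smi query in a second terminal is handy:

```bash
watch -n 1 nvidia-smi --query-gpu=memory.used,memory.total --format=csv
```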
Batch Request Processing
If you’re running Ollama on a VPS for automation or API calls, batch your requests. Don’t spam parallel queries—they multiply memory consumption. Process sequentially or with limited concurrency.
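As a concrete example of what limited concurrency looks like in practice, here is a minimal sequential loop against the API; prompts.txt is a hypothetical file with one JSON-safe prompt per line:

```bash
# Process prompts one at a time so in-flight requests don't stack up in memory.
while IFS= read -r prompt; do
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\": \"mistral:7b-instruct-q4_0\", \"prompt\": \"${prompt}\", \"stream\": false}"
  echo
done < prompts.txt
```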
Enable Memory Compression
On some systems, you can enable zswap (kernel-level compression of swapped pages):
echo 1 | sudo tee /sys/module/zswap/parameters/enabled
echo lz4 | sudo tee /sys/module/zswap/parameters/compressor
The first line actually turns zswap on (it is off by default on Ubuntu 22.04); the second picks a faster compressor (you may need sudo modprobe lz4 first). To keep it enabled across reboots, add zswap.enabled=1 to the kernel command line in GRUB. This is a lightweight optimization that can reduce memory pressure slightly.
Before and After: Real Numbers
| Scenario | Before Fix | After Fix | Change |
|---|---|---|---|
| Available Memory | 3 GB | 18 GB | +500% |
| Model Size (Full Precision) | 28 GB (won’t fit) | 3.5 GB (Q4_0 quantized) | –89% |
| Load Time | Crash/OOM | 8 seconds | Success |
| Inference Speed | N/A | 50 tokens/sec (CPU) | Usable |
| With GPU Offload | N/A | 200 tokens/sec | 4x faster |
Real-World Example: Debugging a VPS Deployment
Let me walk through an actual scenario that happened to me last month. I was deploying Ollama on a DigitalOcean droplet (8GB RAM) with an API backend. The API was supposed to handle LLM requests for a customer dashboard automation tool.
The Problem: Every time we tried to load the mistral model, the process crashed after 30 seconds. No error message, just dead.
The Diagnosis:
$ free -h
total used free shared buff/cache available
Mem: 7.8Gi 6.2Gi 0.4Gi 0.1Gi 1.2Gi 1.2Gi
Only 1.2 GB available. The full mistral model needs 14 GB. Rookie mistake—I’d deployed with a database and Redis running.
The Fix:
- Stopped Redis and PostgreSQL temporarily: sudo systemctl stop redis-server postgresql. That freed up 4 GB immediately, leaving 5.2 GB available.
- Pulled the quantized 7B model: ollama pull mistral:7b-instruct-q4_0
- Added 16 GB swap as insurance: sudo fallocate -l 16G /swapfile && sudo chmod 600 /swapfile && sudo mkswap /swapfile && sudo swapon /swapfile
- Restarted Ollama and tested.
Result: Model loaded in 12 seconds. Inference averaged 40 tokens/sec on CPU. Good enough for the use case. We later upgraded to an 8-core VPS with 16 GB RAM and moved to a larger model for better quality.
The lesson: on resource-constrained machines (VPS, older laptops), start small. Use quantized models. Stop unnecessary services. Add swap as a safety net. Then scale up if needed.
Frequently Asked Questions
Why isn’t Ollama using my GPU? Run nvidia-smi first to confirm your GPU and driver are recognized. Then verify your Ollama install has CUDA support by checking the server’s startup logs. If it doesn’t, uninstall and reinstall from the official install script or releases page.
Advanced Troubleshooting
If you’ve tried all the above and still hit memory errors, try these advanced steps:
Check Kernel Memory Limits
sudo sysctl vm.overcommit_memory
If it returns 0 (the default), the kernel uses a heuristic and can refuse large allocations even when swap could cover them. Setting it to 1 makes the kernel always allow the allocation:
sudo sysctl -w vm.overcommit_memory=1
Add to /etc/sysctl.conf to persist:
echo "vm.overcommit_memory=1" | sudo tee -a /etc/sysctl.conf
Increase Ulimit
Some systems have hard limits on memory per process. Check:
ulimit -a
If the virtual memory limit is anything other than unlimited, raise it for the current shell (add an entry in /etc/security/limits.conf to make it permanent):
ulimit -v unlimited
Use Alternative Model Formats
Ollama already serves its library models in GGUF format, which is optimized for inference and far more memory-efficient than raw full-precision checkpoints. If CPU inference is still slow even with quantization, try a more aggressive quantization of the same model, or import a custom GGUF file with a Modelfile, as sketched below.
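For reference, importing a GGUF file you’ve downloaded or converted yourself looks roughly like this (the file and model names are placeholders):

```bash
# Modelfile pointing at a local GGUF file
cat > Modelfile.custom <<'EOF'
FROM ./my-model.Q4_K_M.gguf
EOF
ollama create my-custom-model -f Modelfile.custom
ollama run my-custom-model
```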
Conclusion: Get Your Local LLM Running
You’ve Got This
Ollama out-of-memory errors are frustrating, but they’re eminently solvable. Nine times out of ten, the issue comes down to three things: running too many background services, not using quantized models, or misconfiguring GPU offload. The fixes are straightforward—use the diagnostic workflow I’ve laid out, start with a smaller model, and scale up methodically.
Whether you’re building an AI automation tool, experimenting with local inference for VPS deployment, or just curious about running models locally, these techniques will save you hours of debugging. Remember: monitor your memory, choose the right model size for your hardware, and let quantization be your friend. Local LLMs aren’t a luxury anymore—they’re a practical tool for developers who understand how to tune them.
If you hit issues after trying these steps, check Ollama’s GitHub discussions or the official documentation. The community is active, and your specific error might be documented. Good luck, and happy inferencing.