If you’ve been working with Ollama for running large language models, you might have wondered about parallelism and how to get the most performance out of your setup. I recently went down this rabbit hole myself while building a translation service, and I thought I’d share what I learned.
So, Does Ollama Use Parallelism Internally?
This was my first question too! I was running translation tasks on both V100 and H100 GPUs and was surprised to see almost identical performance. What gives?
Well, here’s what I discovered: Ollama can use parallelism, but you need to know how to configure it properly.
The Inside Scoop on Ollama’s Parallelism
As of early 2025, Ollama has come a long way with its parallelism capabilities. If you’re using version 0.1.33 or later, you have some powerful options at your fingertips.
By default, Ollama will automatically choose between handling 1 or 4 parallel requests per model, depending on your available memory. Since version 0.2.0, this concurrency isn’t experimental anymore—it’s fully supported!
How to Configure Parallelism in Ollama
Let me share the key settings you can tweak:
- OLLAMA_NUM_PARALLEL: This controls how many requests each model can handle at once. Ollama picks 4 by default, but falls back to 1 if your system is low on memory.
- OLLAMA_MAX_LOADED_MODELS: This determines how many different models you can have loaded at the same time. By default, it’s 3 times your GPU count (or just 3 if you’re running on CPU).
- OLLAMA_MAX_QUEUE: This sets how many requests can wait in line before Ollama starts rejecting new ones. The default is 512, which is plenty for most use cases.
Want to apply these settings? It’s as simple as:
OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
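If you want to confirm the setting actually took effect, one quick way is to fire a few identical requests at once and compare the wall-clock time against a single request. Here's a minimal sketch using the official ollama Python client; the model name (mistral) and the batch size of 4 are just assumptions you can swap for your own setup:

import asyncio
import time
from ollama import AsyncClient  # pip install ollama

async def probe(n=4, model="mistral", host="http://127.0.0.1:11434"):
    client = AsyncClient(host=host)
    prompt = "Say hello and nothing else."

    # Time a single request as a baseline
    start = time.perf_counter()
    await client.generate(model=model, prompt=prompt)
    single = time.perf_counter() - start

    # Fire n identical requests concurrently
    start = time.perf_counter()
    await asyncio.gather(*[client.generate(model=model, prompt=prompt) for _ in range(n)])
    concurrent = time.perf_counter() - start

    # With OLLAMA_NUM_PARALLEL >= n, the concurrent batch should finish in far less than n * single
    print(f"1 request: {single:.1f}s, {n} concurrent: {concurrent:.1f}s")

asyncio.run(probe())

If the concurrent batch takes roughly n times as long as a single request, your requests are being serialized and the parallel setting hasn't kicked in.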
Got Multiple GPUs? Lucky You!
If you’re fortunate enough to have multiple GPUs, you’ve got even more options for speeding things up.
Option 1: Run Multiple Ollama Instances
This is my personal favorite approach. You can run a separate Ollama container on each GPU:
# First container on GPU 0
docker run -d \
--gpus '"device=0"' \
-v /path/to/models:/root/.ollama \
-p 11434:11434 \
--name ollama1 \
ollama/ollama
# Second container on GPU 1
docker run -d \
--gpus '"device=1"' \
-v /path/to/models:/root/.ollama \
-p 11435:11434 \
--name ollama2 \
ollama/ollama
This strategy works great when:
- You need to run different models at the same time
- You want to avoid the delay of loading and unloading models
- Your code can send requests to the right server
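On that last point, the routing can be as simple as a dictionary mapping each model (or task) to the instance that serves it. This is just a sketch, assuming the two containers above are listening on ports 11434 and 11435; the model names are placeholders for whatever you've pulled:

from ollama import Client  # synchronous client, for brevity

# Hypothetical assignment of models to instances
MODEL_HOSTS = {
    "mistral": "http://127.0.0.1:11434",  # container ollama1 on GPU 0
    "llama3": "http://127.0.0.1:11435",   # container ollama2 on GPU 1
}

clients = {model: Client(host=host) for model, host in MODEL_HOSTS.items()}

def generate(model, prompt):
    # Send the request to the server that keeps this model loaded
    return clients[model].generate(model=model, prompt=prompt)

response = generate("mistral", "translate this French text into English: Bonjour tout le monde")
print(response["response"])

Because each model always lands on the same server, nothing ever gets unloaded to make room for a different model.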
Option 2: Let Ollama Handle It
If managing multiple servers sounds like a headache, you can let Ollama use all your GPUs:
docker run -d \
--gpus=all \
--network=host \
--security-opt seccomp=unconfined \
-v ollama_data:/root/.ollama \
-e OLLAMA_NUM_PARALLEL=8 \
--name ollama \
ollama/ollama
Speeding Up Your Python Code
Let’s talk about how to make your Python code work with all this parallelism. Here’s an async approach that really made a difference for me:
import asyncio
from ollama import AsyncClient

async def translate_text(text, client, model="mistral"):
    try:
        response = await client.generate(
            model=model,
            prompt=f"translate this French text into English: {text}"
        )
        # total_duration is reported by the Ollama API in nanoseconds
        return response['response'].lstrip(), response['total_duration']
    except Exception as e:
        print(f"Oops! Translation error: {e}")
        return None, 0

async def process_batch(texts, host="http://127.0.0.1:11434"):
    # One shared client; Ollama handles the concurrency server-side
    client = AsyncClient(host=host)
    tasks = [translate_text(text, client) for text in texts]
    return await asyncio.gather(*tasks)

async def main():
    reports = ["report_text_1", "report_text_2", "report_text_3", "report_text_4"]
    # Process all reports in parallel
    results = await process_batch(reports)
    for i, (translation, duration) in enumerate(results):
        if translation:
            print(f"Report {i+1} translated in {duration / 1e6:.0f}ms")  # ns -> ms
            print(f"Translation: {translation[:100]}...\n")

if __name__ == "__main__":
    asyncio.run(main())
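One tweak worth considering when the batch gets large: if you launch hundreds of coroutines at once, anything beyond OLLAMA_NUM_PARALLEL just sits in Ollama's queue, and past OLLAMA_MAX_QUEUE new requests get rejected. A client-side cap with asyncio.Semaphore keeps things orderly. This is just a sketch that reuses translate_text and the imports from the code above, with limit=4 assumed to match the server's parallel setting:

# A bounded variant of process_batch: at most `limit` requests in flight at a time
async def process_batch_bounded(texts, host="http://127.0.0.1:11434", limit=4):
    client = AsyncClient(host=host)
    semaphore = asyncio.Semaphore(limit)

    async def bounded_translate(text):
        async with semaphore:  # wait for a free slot before sending the request
            return await translate_text(text, client)

    return await asyncio.gather(*[bounded_translate(t) for t in texts])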
Balancing the Load
If you’re running multiple Ollama instances, you might want a simple way to spread your requests between them:
import asyncio
import random
from ollama import AsyncClient

class OllamaLoadBalancer:
    def __init__(self, hosts):
        self.hosts = hosts
        self.clients = [AsyncClient(host=host) for host in hosts]

    async def generate(self, model, prompt):
        # Randomly pick a server
        client = random.choice(self.clients)
        return await client.generate(model=model, prompt=prompt)

async def main():
    # Define your Ollama servers
    hosts = [
        "http://127.0.0.1:11434",
        "http://127.0.0.1:11435"
    ]
    balancer = OllamaLoadBalancer(hosts)
    reports = ["report1", "report2", "report3", "report4"]

    async def process_report(text):
        response = await balancer.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text}"
        )
        # total_duration is in nanoseconds
        return response['response'], response['total_duration']

    tasks = [process_report(report) for report in reports]
    results = await asyncio.gather(*tasks)
    for i, (translation, duration) in enumerate(results):
        print(f"Report {i+1} translated in {duration / 1e6:.0f}ms")  # ns -> ms

if __name__ == "__main__":
    asyncio.run(main())
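Random selection is fine for a couple of servers, but if you'd rather spread requests strictly evenly, a round-robin picker is a small change using itertools.cycle. A sketch of that alternative, built on the class above:

import itertools

class RoundRobinBalancer(OllamaLoadBalancer):
    def __init__(self, hosts):
        super().__init__(hosts)
        self._cycle = itertools.cycle(self.clients)  # rotate through clients in order

    async def generate(self, model, prompt):
        client = next(self._cycle)  # next server in the rotation
        return await client.generate(model=model, prompt=prompt)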
Things I Learned About Performance
After a lot of testing, here are some nuggets of wisdom I picked up:
- Bigger GPU ≠ Proportionally Faster: Just because you upgrade from a V100 to an H100 doesn’t mean you’ll get dramatically faster inference for every workload. LLM inference often depends more on memory access patterns than raw compute power, especially for single requests.
- Model Size Makes a Difference: The bigger your model, the more you’ll notice improvements from a more powerful GPU.
- Batch When You Can: If you’re processing multiple similar requests, try to batch them together.
- Switching Models Is Expensive: Try to avoid constantly loading and unloading different models if speed is critical.
- Keep an Eye on Things: Use nvidia-smi to monitor your GPU usage and see if you’re actually utilizing your hardware effectively.
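To act on that last tip without staring at a terminal, you can log utilization from Python while a batch runs. Here's a minimal sketch that shells out to nvidia-smi's query mode; the sampling interval, sample count, and fields are just my choices:

import subprocess
import time

def log_gpu_usage(interval=2, samples=10):
    # Query utilization and memory for every GPU visible to nvidia-smi
    query = "index,utilization.gpu,memory.used,memory.total"
    for _ in range(samples):
        out = subprocess.run(
            ["nvidia-smi", f"--query-gpu={query}", "--format=csv,noheader,nounits"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        for line in out.splitlines():
            idx, util, used, total = [v.strip() for v in line.split(",")]
            print(f"GPU {idx}: {util}% busy, {used}/{total} MiB")
        time.sleep(interval)

log_gpu_usage()

If utilization stays low while your batch is running, the bottleneck is probably on the client side or in how few requests you’re sending in parallel, not in the GPU itself.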
Wrapping Up
Finding the sweet spot for Ollama performance comes down to understanding both its built-in capabilities and how to structure your application. For most users, starting with those environment variables is the easiest approach. If you’ve got multiple GPUs, consider running separate instances with a simple load balancing layer.
Remember that real-world testing in your specific environment is crucial—what works for one workflow might not be optimal for another.
Have you found other ways to speed up Ollama? I’d love to hear about your experiences!