
Does Ollama Use Parallelism Internally?


If you’ve been working with Ollama for running large language models, you might have wondered about parallelism and how to get the most performance out of your setup. I recently went down this rabbit hole myself while building a translation service, and I thought I’d share what I learned.

So, Does Ollama Use Parallelism Internally?

This was my first question too! I was running translation tasks on both V100 and H100 GPUs and was surprised to see almost identical performance. What gives?

Well, here’s what I discovered: Ollama can use parallelism, but you need to know how to configure it properly.

The Inside Scoop on Ollama’s Parallelism

As of early 2025, Ollama has come a long way with its parallelism capabilities. If you’re using version 0.1.33 or later, you have some powerful options at your fingertips.

By default, Ollama will automatically choose between handling 1 or 4 parallel requests per model, depending on your available memory. Since version 0.2.0, this concurrency isn’t experimental anymore—it’s fully supported!
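Not sure which version you're on? A quick check from the terminal:

# Prints the installed Ollama version
ollama --version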

How to Configure Parallelism in Ollama

Let me share the key settings you can tweak:

  • OLLAMA_NUM_PARALLEL: This controls how many requests your model can handle at once. The default is usually 4, but it might be set to 1 if your system is limited on memory.
  • OLLAMA_MAX_LOADED_MODELS: This determines how many different models you can have loaded at the same time. By default, it’s 3 times your GPU count (or just 3 if you’re running on CPU).
  • OLLAMA_MAX_QUEUE: This sets how many requests can wait in line before Ollama starts rejecting new ones. The default is 512, which is plenty for most use cases.

Want to apply these settings? It’s as simple as:

OLLAMA_NUM_PARALLEL=4 OLLAMA_MAX_LOADED_MODELS=4 ollama serve
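If you installed Ollama as a Linux systemd service rather than launching ollama serve by hand, the same variables go into a service override instead (a sketch, assuming the default service name ollama.service):

# Open an override file for the service
sudo systemctl edit ollama.service

# Add these lines in the editor, then save:
# [Service]
# Environment="OLLAMA_NUM_PARALLEL=4"
# Environment="OLLAMA_MAX_LOADED_MODELS=4"

# Apply the change
sudo systemctl daemon-reload
sudo systemctl restart ollama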

Got Multiple GPUs? Lucky You!

If you’re fortunate enough to have multiple GPUs, you’ve got even more options for speeding things up.

Option 1: Run Multiple Ollama Instances

This is my personal favorite approach. You can run a separate Ollama container on each GPU:

# First container on GPU 0
docker run -d \
  --gpus device=0 \
  -v /path/to/models:/root/.ollama \
  -p 11434:11434 \
  --name ollama1 \
  ollama/ollama

# Second container on GPU 1
docker run -d \
  --gpus device=1 \
  -v /path/to/models:/root/.ollama \
  -p 11435:11434 \
  --name ollama2 \
  ollama/ollama

This strategy works great when:

  • You need to run different models at the same time
  • You want to avoid the delay of loading and unloading models
  • Your code can send requests to the right server (a quick check with curl is shown below)
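With both containers up, you can hit each instance independently; from the client's point of view the only difference is the port (the model name here is just an example):

# Talk to the instance pinned to GPU 0
curl http://localhost:11434/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello", "stream": false}'

# Talk to the instance pinned to GPU 1
curl http://localhost:11435/api/generate \
  -d '{"model": "mistral", "prompt": "Say hello", "stream": false}'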

Option 2: Let Ollama Handle It

If managing multiple servers sounds like a headache, you can let Ollama use all your GPUs:

docker run -d \
  --gpus=all \
  --network=host \
  --security-opt seccomp=unconfined \
  -v ollama_data:/root/.ollama \
  -e OLLAMA_NUM_PARALLEL=8 \
  --name ollama \
  ollama/ollama
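Once it's running, it's worth a quick sanity check that the container actually sees every GPU and that models land where you expect (using the container name from the command above):

# Confirm all GPUs are visible inside the container
docker exec -it ollama nvidia-smi

# See which models are loaded and whether they sit on GPU or CPU
docker exec -it ollama ollama ps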

Speeding Up Your Python Code

Let’s talk about how to make your Python code work with all this parallelism. Here’s an async approach that really made a difference for me:

import asyncio
from ollama import AsyncClient

async def translate_text(text, client, model="mistral"):
    try:
        response = await client.generate(
            model=model,
            prompt=f"translate this French text into English: {text}"
        )
        # total_duration is reported by Ollama in nanoseconds
        return response['response'].lstrip(), response['total_duration']
    except Exception as e:
        print(f"Oops! Translation error: {e}")
        return None, 0

async def process_batch(texts, host="http://127.0.0.1:11434"):
    client = AsyncClient(host=host)
    tasks = [translate_text(text, client) for text in texts]
    return await asyncio.gather(*tasks)

async def main():
    reports = ["report_text_1", "report_text_2", "report_text_3", "report_text_4"]

    # Process all reports in parallel
    results = await process_batch(reports)

    for i, (translation, duration) in enumerate(results):
        if translation:
            print(f"Report {i+1} translated in {duration / 1e6:.0f}ms")
            print(f"Translation: {translation[:100]}...\n")

if __name__ == "__main__":
    asyncio.run(main())
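One gotcha: if you gather more requests than OLLAMA_NUM_PARALLEL, the extras simply wait in Ollama's queue (up to OLLAMA_MAX_QUEUE). When I wanted a hard cap on in-flight work from the client side as well, an asyncio.Semaphore did the trick. Here's a minimal sketch; the limit of 4 just mirrors the server setting above:

import asyncio
from ollama import AsyncClient

# Cap client-side concurrency to match OLLAMA_NUM_PARALLEL
semaphore = asyncio.Semaphore(4)

async def translate_limited(text, client, model="mistral"):
    async with semaphore:
        response = await client.generate(
            model=model,
            prompt=f"translate this French text into English: {text}"
        )
        return response['response'].lstrip()

async def main():
    client = AsyncClient(host="http://127.0.0.1:11434")
    texts = ["report_text_1", "report_text_2"]
    results = await asyncio.gather(*(translate_limited(t, client) for t in texts))
    print(results)

if __name__ == "__main__":
    asyncio.run(main())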

Balancing the Load

If you’re running multiple Ollama instances, you might want a simple way to spread your requests between them:

import asyncio
import random
from ollama import AsyncClient

class OllamaLoadBalancer:
    def __init__(self, hosts):
        self.hosts = hosts
        self.clients = [AsyncClient(host=host) for host in hosts]

    async def generate(self, model, prompt):
        # Randomly pick a server
        client = random.choice(self.clients)
        return await client.generate(model=model, prompt=prompt)

async def main():
    # Define your Ollama servers
    hosts = [
        "http://127.0.0.1:11434",
        "http://127.0.0.1:11435"
    ]

    balancer = OllamaLoadBalancer(hosts)
    reports = ["report1", "report2", "report3", "report4"]

    async def process_report(text):
        response = await balancer.generate(
            model="mistral",
            prompt=f"translate this French text into English: {text}"
        )
        # total_duration comes back in nanoseconds
        return response['response'], response['total_duration']

    tasks = [process_report(report) for report in reports]
    results = await asyncio.gather(*tasks)

    for i, (translation, duration) in enumerate(results):
        print(f"Report {i+1} translated in {duration / 1e6:.0f}ms")

if __name__ == "__main__":
    asyncio.run(main())
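Random selection was plenty for my small batches, but if you'd rather spread requests evenly, swapping random.choice for a round-robin cycle is a small change; here's a minimal standalone sketch:

import itertools
from ollama import AsyncClient

class RoundRobinBalancer:
    def __init__(self, hosts):
        # Cycle through the clients in order instead of picking at random
        self._clients = itertools.cycle([AsyncClient(host=h) for h in hosts])

    async def generate(self, model, prompt):
        client = next(self._clients)
        return await client.generate(model=model, prompt=prompt)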

Things I Learned About Performance

After a lot of testing, here are some nuggets of wisdom I picked up:

  1. Bigger GPU ≠ Proportionally Faster: Just because you upgrade from a V100 to an H100 doesn’t mean you’ll get dramatically faster inference for every workload. LLM inference often depends more on memory access patterns than raw compute power, especially for single requests.
  2. Model Size Makes a Difference: The bigger your model, the more you’ll notice improvements from a more powerful GPU.
  3. Batch When You Can: If you’re processing multiple similar requests, try to batch them together.
  4. Switching Models Is Expensive: Try to avoid constantly loading and unloading different models if speed is critical.
  5. Keep an Eye on Things: Use nvidia-smi to monitor your GPU usage and see if you’re actually utilizing your hardware effectively (a simple command is shown below).
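For that last point, you don't need anything fancy; a refresh loop on nvidia-smi shows utilization and memory per GPU while your batch runs:

# Refresh GPU utilization and memory usage every second
watch -n 1 nvidia-smi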

Wrapping Up

Finding the sweet spot for Ollama performance comes down to understanding both its built-in capabilities and how to structure your application. For most users, starting with those environment variables is the easiest approach. If you’ve got multiple GPUs, consider running separate instances with a simple load balancing layer.

Remember that real-world testing in your specific environment is crucial—what works for one workflow might not be optimal for another.

Have you found other ways to speed up Ollama? I’d love to hear about your experiences!

Have Queries? Join https://launchpass.com/collabnix

Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning across DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.