Collabnix Team The Collabnix Team is a diverse collective of Docker, Kubernetes, and IoT experts united by a passion for cloud-native technologies. With backgrounds spanning DevOps, platform engineering, cloud architecture, and container orchestration, our contributors bring together decades of combined experience from various industries and technical domains.

Ollama API Integration: Building Production-Ready LLM Applications


Large Language Models (LLMs) are revolutionizing how we build intelligent applications, but deploying them in production environments presents unique challenges. Ollama has emerged as a powerful solution for running LLMs locally and at scale, offering a simple API that makes integration seamless for DevOps engineers and developers alike.

In this comprehensive guide, we’ll explore how to integrate Ollama’s API into your applications, containerize your LLM-powered services, and deploy them in Kubernetes environments with production-grade reliability.

Understanding Ollama’s Architecture

Ollama provides a REST API that abstracts the complexity of running large language models. Unlike cloud-based solutions, Ollama allows you to run models like Llama 2, Mistral, and CodeLlama on your own infrastructure, giving you complete control over data privacy and costs.

The architecture consists of three main components:

  • Ollama Server: The core service that manages model loading, inference, and API endpoints
  • Model Registry: Local storage for downloaded models with automatic version management
  • REST API: HTTP endpoints for chat completions, embeddings, and model management
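To make that API surface concrete, here is a minimal sketch of the request and response shapes for the /api/generate endpoint. The field names follow Ollama's documented API; the response text is a made-up sample, not real model output:

```python
import json

# Request body for POST /api/generate (stream disabled for a single reply)
payload = {"model": "llama2", "prompt": "Why is the sky blue?", "stream": False}
request_body = json.dumps(payload)

# A non-streaming reply is a single JSON object; "response" carries the text
sample_reply = '{"model": "llama2", "response": "Rayleigh scattering.", "done": true}'
reply = json.loads(sample_reply)
print(reply["response"])
```

When `"stream"` is true, the same endpoint instead returns one JSON object per line, which we cover in the streaming section below.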

Setting Up Ollama for Development

Before diving into API integration, let’s set up Ollama in a containerized environment that mirrors production deployments.

Installing Ollama with Docker

The fastest way to get started is using the official Ollama Docker image:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  --gpus all \
  ollama/ollama:latest

For CPU-only environments, omit the --gpus all flag:

docker run -d \
  --name ollama \
  -p 11434:11434 \
  -v ollama_data:/root/.ollama \
  ollama/ollama:latest

Pulling Your First Model

Once the container is running, pull a model to begin development:

# Pull Llama 2 7B model
docker exec ollama ollama pull llama2

# Verify the model is available
docker exec ollama ollama list

Building Your First LLM-Powered Application

Let’s create a practical application that leverages Ollama’s API for intelligent document summarization and question answering.
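A summarization request is ultimately just a well-structured prompt sent to the model. As a small illustrative helper (the function name and template are our own, not part of Ollama's API):

```python
def build_summary_prompt(document: str, max_sentences: int = 3) -> str:
    # Wrap the raw document in an instruction the model can follow
    return (
        f"Summarize the following document in at most {max_sentences} sentences.\n\n"
        f"Document:\n{document}\n\nSummary:"
    )

prompt = build_summary_prompt("Kubernetes schedules containers across a cluster of nodes.")
print(prompt.startswith("Summarize the following document in at most 3 sentences."))
```

The resulting string becomes the `prompt` field in a generate request, as shown in the client below.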

Python Client Implementation

Here’s a robust Python client that handles connection pooling, retries, and error handling:

import requests
import json
from typing import Dict, List, Optional
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

class OllamaClient:
    def __init__(self, base_url: str = "http://localhost:11434"):
        self.base_url = base_url
        self.session = self._create_session()
    
    def _create_session(self) -> requests.Session:
        """Create a session with retry logic and connection pooling."""
        session = requests.Session()
        retry_strategy = Retry(
            total=3,
            backoff_factor=1,
            status_forcelist=[429, 500, 502, 503, 504]
        )
        adapter = HTTPAdapter(max_retries=retry_strategy, pool_maxsize=10)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session
    
    def generate(self, 
                 model: str, 
                 prompt: str, 
                 stream: bool = False,
                 options: Optional[Dict] = None) -> Dict:
        """Generate a completion from the model."""
        url = f"{self.base_url}/api/generate"
        payload = {
            "model": model,
            "prompt": prompt,
            "stream": stream
        }
        if options:
            payload["options"] = options
        
        response = self.session.post(url, json=payload)
        response.raise_for_status()
        return response.json()
    
    def chat(self, 
             model: str, 
             messages: List[Dict[str, str]],
             stream: bool = False) -> Dict:
        """Send a chat completion request."""
        url = f"{self.base_url}/api/chat"
        payload = {
            "model": model,
            "messages": messages,
            "stream": stream
        }
        
        response = self.session.post(url, json=payload)
        response.raise_for_status()
        return response.json()
    
    def embeddings(self, model: str, prompt: str) -> List[float]:
        """Generate embeddings for the given text."""
        url = f"{self.base_url}/api/embeddings"
        payload = {
            "model": model,
            "prompt": prompt
        }
        
        response = self.session.post(url, json=payload)
        response.raise_for_status()
        return response.json()["embedding"]

# Example usage
client = OllamaClient()

# Simple completion
result = client.generate(
    model="llama2",
    prompt="Explain Kubernetes networking in 3 sentences.",
    options={"temperature": 0.7, "top_p": 0.9}
)
print(result["response"])

# Chat-style interaction
messages = [
    {"role": "system", "content": "You are a DevOps expert."},
    {"role": "user", "content": "How do I optimize Docker image builds?"}
]
chat_result = client.chat(model="llama2", messages=messages)
print(chat_result["message"]["content"])
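The embeddings method returns a plain list of floats, which is enough for simple semantic search. A minimal cosine-similarity helper, with toy vectors standing in for real embedding output:

```python
import math

def cosine_similarity(a, b):
    # Embeddings from /api/embeddings are plain lists of floats
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Toy vectors standing in for real embedding output
doc_vec = [1.0, 0.0, 1.0]
query_vec = [1.0, 0.0, 1.0]
unrelated_vec = [0.0, 1.0, 0.0]
print(cosine_similarity(doc_vec, query_vec))      # close to 1.0 (same direction)
print(cosine_similarity(doc_vec, unrelated_vec))  # close to 0.0 (orthogonal)
```

In a real document-QA pipeline you would embed each document chunk once, embed the user's question at query time, and rank chunks by this score before passing the best matches to the chat endpoint.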

Streaming Responses for Better UX

For long-running generations, implement streaming to provide real-time feedback:

def stream_generate(client: OllamaClient, model: str, prompt: str):
    """Stream tokens as they're generated."""
    url = f"{client.base_url}/api/generate"
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": True
    }
    
    with client.session.post(url, json=payload, stream=True) as response:
        response.raise_for_status()
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if not chunk.get("done"):
                    print(chunk["response"], end="", flush=True)
                else:
                    print()  # New line at end
                    return chunk

# Usage
stream_generate(client, "llama2", "Write a Dockerfile for a Python FastAPI app")
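The streaming endpoint emits newline-delimited JSON, one chunk per generated fragment. Simulated chunks (made-up text, not real model output) show how the loop above reassembles the full response:

```python
import json

# Each line of a streaming response is a standalone JSON object
raw_lines = [
    '{"response": "Hello", "done": false}',
    '{"response": ", world", "done": false}',
    '{"done": true}',
]

pieces = []
for line in raw_lines:
    chunk = json.loads(line)
    if not chunk.get("done"):
        pieces.append(chunk["response"])

full_text = "".join(pieces)
print(full_text)  # Hello, world
```

The final chunk with `"done": true` carries no text but, on the real API, includes timing and token-count statistics you can log.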

Containerizing Your LLM Application

Let’s create a production-ready containerized application that includes both Ollama and your custom service.

Multi-Stage Dockerfile

FROM python:3.11-slim AS builder

WORKDIR /app

# Install dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir --user -r requirements.txt

FROM python:3.11-slim

WORKDIR /app

# Copy dependencies from builder
COPY --from=builder /root/.local /root/.local

# Copy application code
COPY . .

# Put user-installed console scripts on PATH
ENV PATH=/root/.local/bin:$PATH

# Health check
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
  CMD python -c "import requests; requests.get('http://localhost:8000/health', timeout=5).raise_for_status()"

EXPOSE 8000

CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]

Docker Compose for Local Development

version: '3.8'

services:
  ollama:
    image: ollama/ollama:latest
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      # The ollama/ollama image does not ship curl, so probe with the CLI instead
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3

  llm-app:
    build: .
    container_name: llm-app
    ports:
      - "8000:8000"
    environment:
      - OLLAMA_BASE_URL=http://ollama:11434
      - MODEL_NAME=llama2
    depends_on:
      ollama:
        condition: service_healthy
    restart: unless-stopped

volumes:
  ollama_data:

Kubernetes Deployment

Deploy your LLM-powered application to Kubernetes with proper resource management and scaling capabilities.

Ollama StatefulSet

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: ollama
  namespace: llm-apps
spec:
  serviceName: ollama
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
          name: http
        resources:
          requests:
            memory: "8Gi"
            cpu: "2000m"
          limits:
            memory: "16Gi"
            cpu: "4000m"
            nvidia.com/gpu: "1"
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
        livenessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 60
          periodSeconds: 30
        readinessProbe:
          httpGet:
            path: /api/tags
            port: 11434
          initialDelaySeconds: 30
          periodSeconds: 10
  volumeClaimTemplates:
  - metadata:
      name: ollama-data
    spec:
      accessModes: ["ReadWriteOnce"]
      storageClassName: fast-ssd
      resources:
        requests:
          storage: 50Gi
---
apiVersion: v1
kind: Service
metadata:
  name: ollama
  namespace: llm-apps
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
    name: http
  clusterIP: None

Application Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: llm-app
  namespace: llm-apps
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llm-app
  template:
    metadata:
      labels:
        app: llm-app
    spec:
      containers:
      - name: llm-app
        image: your-registry/llm-app:latest
        ports:
        - containerPort: 8000
        env:
        - name: OLLAMA_BASE_URL
          value: "http://ollama.llm-apps.svc.cluster.local:11434"
        - name: MODEL_NAME
          value: "llama2"
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1Gi"
            cpu: "1000m"
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 30
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /ready
            port: 8000
          initialDelaySeconds: 10
          periodSeconds: 5
---
apiVersion: v1
kind: Service
metadata:
  name: llm-app
  namespace: llm-apps
spec:
  selector:
    app: llm-app
  ports:
  - port: 80
    targetPort: 8000
  type: LoadBalancer

Performance Optimization and Best Practices

Model Quantization

Use quantized models for better performance with minimal accuracy loss:

# Pull 4-bit quantized model
docker exec ollama ollama pull llama2:7b-chat-q4_0

# Compare model sizes
docker exec ollama ollama list
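Rough arithmetic explains the size difference (approximate; real model files carry extra overhead beyond the raw weights):

```python
# Back-of-envelope memory for a 7B-parameter model
params = 7_000_000_000
fp16_gib = params * 2 / 1024**3   # 16-bit weights: 2 bytes each
q4_gib = params * 0.5 / 1024**3   # 4-bit quantized: ~0.5 bytes each

print(f"fp16: ~{fp16_gib:.1f} GiB, q4_0: ~{q4_gib:.1f} GiB")
```

A roughly 4x reduction in weight size translates directly into lower VRAM requirements and faster model loading, usually at a modest cost in output quality.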

Connection Pooling and Caching

Implement caching for repeated queries to reduce latency:

import hashlib
import json

class CachedOllamaClient(OllamaClient):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self.cache = {}
    
    def generate_cached(self, model: str, prompt: str, **kwargs) -> Dict:
        # Include generation options in the cache key so different
        # settings for the same prompt don't collide
        cache_key = hashlib.md5(
            f"{model}:{prompt}:{json.dumps(kwargs, sort_keys=True)}".encode()
        ).hexdigest()
        
        if cache_key in self.cache:
            return self.cache[cache_key]
        
        result = self.generate(model, prompt, **kwargs)
        self.cache[cache_key] = result
        return result

Troubleshooting Common Issues

Model Loading Failures

If models fail to load, check available disk space and memory:

# Check Ollama logs
docker logs ollama

# Verify model integrity
docker exec ollama ollama show llama2

# Re-pull corrupted models
docker exec ollama ollama rm llama2
docker exec ollama ollama pull llama2

API Timeout Issues

Increase timeout values for larger models or complex prompts:

# requests.Session has no global timeout attribute, so pass a
# timeout on each request instead
client = OllamaClient()
response = client.session.post(
    f"{client.base_url}/api/generate",
    json={"model": "llama2", "prompt": "Complex prompt", "stream": False},
    timeout=300,  # 5 minutes
)

# Alternatively, cap the generation length to keep requests short
result = client.generate(
    model="llama2",
    prompt="Complex prompt",
    options={"num_predict": 2000}  # limit generated tokens
)

Memory Management

Monitor and limit GPU memory usage:

# Check GPU utilization
nvidia-smi

# Set memory limits in Docker
docker run -d \
  --gpus '"device=0"' \
  --memory="8g" \
  --memory-swap="8g" \
  ollama/ollama:latest

Monitoring and Observability

Implement comprehensive monitoring for production deployments:

from prometheus_client import Counter, Histogram, start_http_server
import time

# Metrics
request_count = Counter('ollama_requests_total', 'Total requests', ['model', 'status'])
request_duration = Histogram('ollama_request_duration_seconds', 'Request duration', ['model'])

def monitored_generate(client, model, prompt):
    start_time = time.time()
    try:
        result = client.generate(model, prompt)
        request_count.labels(model=model, status='success').inc()
        return result
    except Exception:
        request_count.labels(model=model, status='error').inc()
        raise
    finally:
        duration = time.time() - start_time
        request_duration.labels(model=model).observe(duration)

# Start metrics server
start_http_server(9090)

Conclusion

Integrating Ollama’s API into your applications provides a powerful foundation for building LLM-powered services that run on your infrastructure. By following the patterns and best practices outlined in this guide, you can create production-ready applications that scale efficiently while maintaining control over your data and costs.

The combination of containerization, Kubernetes orchestration, and proper monitoring ensures your LLM applications remain reliable and performant under production workloads. As the LLM ecosystem continues to evolve, Ollama’s simple yet powerful API makes it easy to adapt and integrate new models as they become available.

Start experimenting with the code examples provided, and remember to monitor resource usage closely as you scale your deployments. The future of AI-powered applications is local-first, and Ollama puts that power directly in your hands.

Have Queries? Join https://launchpass.com/collabnix
