As a developer who’s worked extensively with AI tools, I’ve found Ollama to be an intriguing option for production deployments. While it’s known for local development, its capabilities extend far beyond that. Let’s dive into how we can leverage Ollama in production environments and explore some real-world use cases.
What Makes Ollama Production-Ready?
Before we jump into use cases, let’s address what makes Ollama suitable for production:
- Local Model Deployment: Models run entirely on your infrastructure, ensuring data privacy and reducing latency.
- API-First Architecture: A RESTful API makes integration straightforward (see the quick example after this list).
- Resource Efficiency: Quantized open-source models keep the hardware footprint modest compared to hosting full-scale LLMs yourself.
- Version Control: Supports model versioning through Modelfiles.
- Custom Model Support: Can run various open-source models with customization options.
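To make the API-first point concrete, here is a minimal sketch of calling the REST API from Python. It assumes an Ollama server running on the default port 11434 with the llama2 model already pulled; the prompt is only illustrative.

import requests

# Assumes a local Ollama server on the default port with llama2 pulled
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "In one sentence, what is Ollama?",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=60,
)
print(response.json()["response"])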
Real-World Use Cases
1. Content Moderation System
One of the most practical applications I’ve implemented is using Ollama for content moderation. Here’s a basic example of how to set up a moderation endpoint:
from fastapi import FastAPI
import requests

app = FastAPI()

def check_content(content: str) -> dict:
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            "model": "llama2",
            "prompt": f"Analyze if this content is appropriate: {content}",
            "system": "You are a content moderation system. Respond with JSON containing 'is_appropriate' and 'reason'.",
            # Disable streaming so the endpoint returns a single JSON object
            "stream": False
        }
    )
    response.raise_for_status()
    return response.json()

@app.post("/moderate")
async def moderate_content(content: dict):
    result = check_content(content["text"])
    return {
        "status": "success",
        "moderation_result": result
    }
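In practice the model's free-form answer still has to be turned into a structured verdict before you can act on it. Here is a minimal sketch of that step, using Ollama's JSON output mode ("format": "json") to coax a parseable reply; the parse_moderation_result helper and its fail-closed fallback are my own additions, not part of the setup above.

import json
import requests

def parse_moderation_result(raw_response: str) -> dict:
    """Parse the model's reply into a dict, falling back to a conservative default."""
    try:
        data = json.loads(raw_response)
        return {
            "is_appropriate": bool(data.get("is_appropriate", False)),
            "reason": str(data.get("reason", "No reason provided")),
        }
    except (json.JSONDecodeError, TypeError):
        # If the model returned malformed JSON, fail closed and flag for human review
        return {"is_appropriate": False, "reason": "Unparseable moderation response"}

def check_content_structured(content: str) -> dict:
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            "model": "llama2",
            "prompt": f"Analyze if this content is appropriate: {content}",
            "system": "Respond only with JSON containing 'is_appropriate' and 'reason'.",
            "format": "json",   # ask Ollama to constrain the output to valid JSON
            "stream": False,
        },
        timeout=60,
    )
    response.raise_for_status()
    return parse_moderation_result(response.json()["response"])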
2. Code Documentation Generator
Another powerful use case is automated code documentation. I’ve implemented this in production for maintaining our internal APIs:
import ollama
import ast
import astor
from typing import Optional

def generate_docstring(code: str) -> str:
    """
    Generate a docstring for the given code using the CodeLlama model.

    Args:
        code (str): The Python code to generate a docstring for

    Returns:
        str: The generated docstring
    """
    response = ollama.generate(
        model='codellama',
        prompt=f"Generate a detailed docstring for this code:\n{code}")
    return response['response']

def insert_docstring(node: ast.FunctionDef, docstring: str) -> None:
    """
    Insert a docstring into an AST function definition node.

    Args:
        node (ast.FunctionDef): The function node to modify
        docstring (str): The docstring to insert
    """
    # Create an AST node for the docstring (ast.Constant replaces the deprecated ast.Str)
    docstring_node = ast.Expr(value=ast.Constant(value=docstring))
    # Insert the docstring as the first node in the function body
    node.body.insert(0, docstring_node)

def process_file(filepath: str) -> Optional[str]:
    """
    Process a Python file and add docstrings to functions that don't have them.

    Args:
        filepath (str): Path to the Python file to process

    Returns:
        Optional[str]: The updated code with new docstrings, or None if there's an error
    """
    try:
        with open(filepath, 'r') as file:
            code = file.read()
        # Parse the code into an AST
        tree = ast.parse(code)
        modified = False
        # Walk through all nodes in the AST
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and not ast.get_docstring(node):
                # Generate a docstring for the function
                function_code = astor.to_source(node)
                docstring = generate_docstring(function_code)
                # Insert the docstring into the function
                insert_docstring(node, docstring)
                modified = True
        if modified:
            # Convert the modified AST back to source code
            updated_code = astor.to_source(tree)
            # Write the updated code back to the file
            with open(filepath, 'w') as file:
                file.write(updated_code)
            return updated_code
        return None
    except Exception as e:
        print(f"Error processing file {filepath}: {str(e)}")
        return None

def main():
    """
    Main function to run the docstring generator on a specified file.
    """
    import sys
    if len(sys.argv) != 2:
        print("Usage: python script.py <path_to_python_file>")
        sys.exit(1)
    filepath = sys.argv[1]
    result = process_file(filepath)
    if result:
        print("Successfully updated docstrings in the file.")
    else:
        print("No updates were necessary or an error occurred.")

if __name__ == "__main__":
    main()
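For repository-wide runs I find it convenient to wrap process_file in a small directory walker. A minimal sketch, assuming the script above is importable as a module named docgen and that you want to skip virtual environments; the module name and the skip list are assumptions of mine.

from pathlib import Path

from docgen import process_file  # hypothetical module name for the script above

SKIP_DIRS = {".venv", "venv", "__pycache__", ".git"}  # assumed exclusions

def document_directory(root: str) -> int:
    """Run the docstring generator over every .py file under root; return files updated."""
    updated = 0
    for path in Path(root).rglob("*.py"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if process_file(str(path)) is not None:
            updated += 1
    return updated

if __name__ == "__main__":
    print(f"Updated {document_directory('.')} file(s)")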
3. Automated Customer Support
We’ve implemented Ollama in our support workflow to handle initial customer inquiries:
from fastapi import FastAPI
import ollama
import json

app = FastAPI()

class SupportBot:
    def __init__(self):
        self.context = []
        # ollama.generate is synchronous; AsyncClient provides awaitable calls
        self.client = ollama.AsyncClient()

    async def get_response(self, query: str) -> str:
        response = await self.client.generate(
            model='mistral',
            prompt=self.format_prompt(query)
        )
        return response['response']

    def format_prompt(self, query: str) -> str:
        return f"""
        Context: You are a customer support agent.
        Previous messages: {json.dumps(self.context)}
        Customer query: {query}
        """

@app.post("/support")
async def handle_support_query(query: dict):
    bot = SupportBot()
    response = await bot.get_response(query["message"])
    return {"response": response}
Production Deployment Considerations
Infrastructure Requirements
# Example Docker configuration
FROM ollama/ollama:latest

# Set up monitoring, plus curl for the health check
RUN apt-get update && apt-get install -y prometheus-node-exporter curl \
    && rm -rf /var/lib/apt/lists/*

# Configure server binding, model storage, and GPU visibility
ENV OLLAMA_HOST=0.0.0.0
ENV OLLAMA_MODELS=/models
ENV CUDA_VISIBLE_DEVICES=0,1

# Health check (Ollama has no /api/health endpoint; /api/version works well)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:11434/api/version || exit 1
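Beyond the container-level health check, I also like an application-level readiness probe that confirms the models the service depends on are actually pulled. A minimal sketch, assuming the deployment needs the llama2, codellama, and mistral models used in the examples above; it relies only on Ollama's standard /api/version and /api/tags endpoints.

import requests

OLLAMA_URL = "http://localhost:11434"
REQUIRED_MODELS = {"llama2", "codellama", "mistral"}  # models used in the examples above

def ollama_ready() -> bool:
    """Return True if the Ollama server is up and all required models are present."""
    try:
        requests.get(f"{OLLAMA_URL}/api/version", timeout=5).raise_for_status()
        tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
        tags.raise_for_status()
        available = {m["name"].split(":")[0] for m in tags.json().get("models", [])}
        return REQUIRED_MODELS.issubset(available)
    except requests.RequestException:
        return False

if __name__ == "__main__":
    raise SystemExit(0 if ollama_ready() else 1)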
Monitoring Setup
from prometheus_client import Counter, Histogram
import ollama

request_count = Counter('ollama_requests_total', 'Total requests to Ollama')
response_time = Histogram('ollama_response_time_seconds', 'Response time in seconds')

def monitored_generate(prompt: str, model: str = 'llama2'):
    request_count.inc()
    # Time the generate call and record it in the histogram
    with response_time.time():
        response = ollama.generate(model=model, prompt=prompt)
    return response
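Defining the metrics isn't enough on its own: they have to be exposed somewhere Prometheus can scrape them. A minimal sketch using prometheus_client's built-in HTTP exporter; port 8001 is an arbitrary choice of mine.

import time
from prometheus_client import start_http_server

if __name__ == "__main__":
    # Serve /metrics on port 8001 (arbitrary port choice) for Prometheus to scrape
    start_http_server(8001)
    # Reuses monitored_generate from the block above so something shows up in the metrics
    print(monitored_generate("Say hello in one sentence.")['response'])
    # Keep the process alive so the /metrics endpoint stays up
    while True:
        time.sleep(60)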
Performance Optimization Tips
1. Model Quantization
# Example Modelfile for a quantized model
# Quantization is selected through the model tag rather than a PARAMETER
FROM llama2:7b-chat-q4_0
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
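When I don't want to bake tuning into a Modelfile, the same knobs can be passed per request through the options field of the generate call; a small sketch, with the specific values chosen only for illustration.

import ollama

# Per-request options override the model's defaults; the values here are illustrative
response = ollama.generate(
    model='llama2',
    prompt='Summarize the benefits of local LLM deployment in two sentences.',
    options={
        'num_ctx': 2048,      # smaller context window saves memory
        'temperature': 0.2,   # keep outputs more deterministic for production tasks
        'num_predict': 128,   # cap generation length
    },
)
print(response['response'])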
2. Batch Processing
import asyncio
from typing import List
import ollama

async def process_batch(prompts: List[str]):
    client = ollama.AsyncClient()  # awaitable client so requests run concurrently
    tasks = [client.generate(model='llama2', prompt=p) for p in prompts]
    return await asyncio.gather(*tasks)
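Firing an unbounded gather at a single GPU can queue up more work than the server handles gracefully, so I usually cap concurrency. A sketch using an asyncio.Semaphore; the limit of 4 is an assumption to tune for your hardware.

import asyncio
from typing import List
import ollama

async def process_batch_bounded(prompts: List[str], limit: int = 4):
    """Run prompts concurrently, but never more than `limit` at a time."""
    client = ollama.AsyncClient()
    semaphore = asyncio.Semaphore(limit)  # limit is an assumption; tune per GPU

    async def run_one(prompt: str):
        async with semaphore:
            return await client.generate(model='llama2', prompt=prompt)

    return await asyncio.gather(*(run_one(p) for p in prompts))

# Usage:
# results = asyncio.run(process_batch_bounded(["Prompt A", "Prompt B", "Prompt C"]))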
Conclusion
Ollama has proven to be a robust solution for production deployments, especially when you need local model execution with reasonable performance. While it may not replace cloud-based solutions for all use cases, it fills an important niche in the AI deployment ecosystem.
Remember to:
- Monitor system resources carefully
- Implement proper error handling
- Set up automated model updates (see the sketch after this list)
- Configure appropriate scaling policies
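On that last point about model updates, here is a minimal sketch of a scheduled refresh using the ollama Python client's pull call; the model list and the daily interval are assumptions to adapt to your deployment.

import time
import ollama

MODELS_TO_REFRESH = ["llama2", "codellama", "mistral"]  # assumed model list
REFRESH_INTERVAL_SECONDS = 24 * 60 * 60  # assumed daily refresh

def refresh_models():
    """Pull the latest version of each model; unchanged layers are not re-downloaded."""
    for model in MODELS_TO_REFRESH:
        try:
            ollama.pull(model)
            print(f"Refreshed {model}")
        except Exception as exc:
            print(f"Failed to refresh {model}: {exc}")

if __name__ == "__main__":
    while True:
        refresh_models()
        time.sleep(REFRESH_INTERVAL_SECONDS)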
For those interested in exploring Ollama further, the ollama GitHub repository includes additional examples and deployment documentation.
Note: The code examples provided are simplified for clarity. Production implementations should include proper error handling, logging, and security measures.