As a developer who’s worked extensively with AI tools, I’ve found Ollama to be an intriguing option for production deployments. While it’s known for local development, its capabilities extend far beyond that. Let’s dive into how we can leverage Ollama in production environments and explore some real-world use cases.
What Makes Ollama Production-Ready?
Before we jump into use cases, let’s address what makes Ollama suitable for production:
- Local Model Deployment: Models run entirely on your infrastructure, ensuring data privacy and reducing latency.
- API-First Architecture: A RESTful API makes integration straightforward (see the quick example after this list).
- Resource Efficiency: Quantized open-source models keep the hardware footprint modest compared to hosting full-scale LLMs yourself.
- Version Control: Supports model versioning through Modelfiles.
- Custom Model Support: Can run various open-source models with customization options.
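To make the API-first point concrete, here is a minimal sketch of calling the REST API from Python. It assumes an Ollama server running on the default port 11434 with the llama2 model already pulled; the prompt is only illustrative.

import requests

# Assumes a local Ollama server on the default port with llama2 pulled
response = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "In one sentence, what is Ollama?",
        "stream": False,  # return a single JSON object instead of a stream
    },
    timeout=60,
)
print(response.json()["response"])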
Real-World Use Cases
1. Content Moderation System
One of the most practical applications I’ve implemented is using Ollama for content moderation. Here’s a basic example of how to set up a moderation endpoint:
from fastapi import FastAPI
import requests

app = FastAPI()

def check_content(content: str) -> dict:
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            "model": "llama2",
            "prompt": f"Analyze if this content is appropriate: {content}",
            "system": "You are a content moderation system. Respond with JSON containing 'is_appropriate' and 'reason'.",
            # Disable streaming so the endpoint returns a single JSON object
            "stream": False
        }
    )
    response.raise_for_status()
    return response.json()

@app.post("/moderate")
async def moderate_content(content: dict):
    result = check_content(content["text"])
    return {
        "status": "success",
        "moderation_result": result
    }
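In practice the model's free-form answer still has to be turned into a structured verdict before you can act on it. Here is a minimal sketch of that step, using Ollama's JSON output mode ("format": "json") to coax a parseable reply; the parse_moderation_result helper and its fail-closed fallback are my own additions, not part of the setup above.

import json
import requests

def parse_moderation_result(raw_response: str) -> dict:
    """Parse the model's reply into a dict, falling back to a conservative default."""
    try:
        data = json.loads(raw_response)
        return {
            "is_appropriate": bool(data.get("is_appropriate", False)),
            "reason": str(data.get("reason", "No reason provided")),
        }
    except (json.JSONDecodeError, TypeError):
        # If the model returned malformed JSON, fail closed and flag for human review
        return {"is_appropriate": False, "reason": "Unparseable moderation response"}

def check_content_structured(content: str) -> dict:
    response = requests.post(
        'http://localhost:11434/api/generate',
        json={
            "model": "llama2",
            "prompt": f"Analyze if this content is appropriate: {content}",
            "system": "Respond only with JSON containing 'is_appropriate' and 'reason'.",
            "format": "json",   # ask Ollama to constrain the output to valid JSON
            "stream": False,
        },
        timeout=60,
    )
    response.raise_for_status()
    return parse_moderation_result(response.json()["response"])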
2. Code Documentation Generator
Another powerful use case is automated code documentation. I’ve implemented this in production for maintaining our internal APIs:
import ollama
import ast
import astor
from typing import Optional

def generate_docstring(code: str) -> str:
    """
    Generate a docstring for the given code using the CodeLlama model.

    Args:
        code (str): The Python code to generate a docstring for

    Returns:
        str: The generated docstring
    """
    response = ollama.generate(
        model='codellama',
        prompt=f"Generate a detailed docstring for this code:\n{code}")
    return response['response']

def insert_docstring(node: ast.FunctionDef, docstring: str) -> None:
    """
    Insert a docstring into an AST function definition node.

    Args:
        node (ast.FunctionDef): The function node to modify
        docstring (str): The docstring to insert
    """
    # Create an AST node for the docstring (ast.Constant replaces the deprecated ast.Str)
    docstring_node = ast.Expr(value=ast.Constant(value=docstring))
    # Insert the docstring as the first node in the function body
    node.body.insert(0, docstring_node)

def process_file(filepath: str) -> Optional[str]:
    """
    Process a Python file and add docstrings to functions that don't have them.

    Args:
        filepath (str): Path to the Python file to process

    Returns:
        Optional[str]: The updated code with new docstrings, or None if there's an error
    """
    try:
        with open(filepath, 'r') as file:
            code = file.read()
        # Parse the code into an AST
        tree = ast.parse(code)
        modified = False
        # Walk through all nodes in the AST
        for node in ast.walk(tree):
            if isinstance(node, ast.FunctionDef) and not ast.get_docstring(node):
                # Generate a docstring for the function
                function_code = astor.to_source(node)
                docstring = generate_docstring(function_code)
                # Insert the docstring into the function
                insert_docstring(node, docstring)
                modified = True
        if modified:
            # Convert the modified AST back to source code
            updated_code = astor.to_source(tree)
            # Write the updated code back to the file
            with open(filepath, 'w') as file:
                file.write(updated_code)
            return updated_code
        return None
    except Exception as e:
        print(f"Error processing file {filepath}: {str(e)}")
        return None

def main():
    """
    Main function to run the docstring generator on a specified file.
    """
    import sys
    if len(sys.argv) != 2:
        print("Usage: python script.py <path_to_python_file>")
        sys.exit(1)
    filepath = sys.argv[1]
    result = process_file(filepath)
    if result:
        print("Successfully updated docstrings in the file.")
    else:
        print("No updates were necessary or an error occurred.")

if __name__ == "__main__":
    main()
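For repository-wide runs I find it convenient to wrap process_file in a small directory walker. A minimal sketch, assuming the script above is importable as a module named docgen and that you want to skip virtual environments; the module name and the skip list are assumptions of mine.

from pathlib import Path

from docgen import process_file  # hypothetical module name for the script above

SKIP_DIRS = {".venv", "venv", "__pycache__", ".git"}  # assumed exclusions

def document_directory(root: str) -> int:
    """Run the docstring generator over every .py file under root; return files updated."""
    updated = 0
    for path in Path(root).rglob("*.py"):
        if any(part in SKIP_DIRS for part in path.parts):
            continue
        if process_file(str(path)) is not None:
            updated += 1
    return updated

if __name__ == "__main__":
    print(f"Updated {document_directory('.')} file(s)")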
3. Automated Customer Support
We’ve implemented Ollama in our support workflow to handle initial customer inquiries:
from fastapi import FastAPI
import ollama
import json

app = FastAPI()

class SupportBot:
    def __init__(self):
        self.context = []
        # ollama.generate is synchronous; AsyncClient provides awaitable calls
        self.client = ollama.AsyncClient()

    async def get_response(self, query: str) -> str:
        response = await self.client.generate(
            model='mistral',
            prompt=self.format_prompt(query)
        )
        return response['response']

    def format_prompt(self, query: str) -> str:
        return f"""
        Context: You are a customer support agent.
        Previous messages: {json.dumps(self.context)}
        Customer query: {query}
        """

@app.post("/support")
async def handle_support_query(query: dict):
    bot = SupportBot()
    response = await bot.get_response(query["message"])
    return {"response": response}
Production Deployment Considerations
Infrastructure Requirements
# Example Docker configuration
FROM ollama/ollama:latest

# Set up monitoring, plus curl for the health check
RUN apt-get update && apt-get install -y prometheus-node-exporter curl \
    && rm -rf /var/lib/apt/lists/*

# Configure server binding, model storage, and GPU visibility
ENV OLLAMA_HOST=0.0.0.0
ENV OLLAMA_MODELS=/models
ENV CUDA_VISIBLE_DEVICES=0,1

# Health check (Ollama has no /api/health endpoint; /api/version works well)
HEALTHCHECK --interval=30s --timeout=10s --start-period=5s --retries=3 \
    CMD curl -f http://localhost:11434/api/version || exit 1
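Beyond the container-level health check, I also like an application-level readiness probe that confirms the models the service depends on are actually pulled. A minimal sketch, assuming the deployment needs the llama2, codellama, and mistral models used in the examples above; it relies only on Ollama's standard /api/version and /api/tags endpoints.

import requests

OLLAMA_URL = "http://localhost:11434"
REQUIRED_MODELS = {"llama2", "codellama", "mistral"}  # models used in the examples above

def ollama_ready() -> bool:
    """Return True if the Ollama server is up and all required models are present."""
    try:
        requests.get(f"{OLLAMA_URL}/api/version", timeout=5).raise_for_status()
        tags = requests.get(f"{OLLAMA_URL}/api/tags", timeout=5)
        tags.raise_for_status()
        available = {m["name"].split(":")[0] for m in tags.json().get("models", [])}
        return REQUIRED_MODELS.issubset(available)
    except requests.RequestException:
        return False

if __name__ == "__main__":
    raise SystemExit(0 if ollama_ready() else 1)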
Monitoring Setup
from prometheus_client import Counter, Histogram
import ollama

request_count = Counter('ollama_requests_total', 'Total requests to Ollama')
response_time = Histogram('ollama_response_time_seconds', 'Response time in seconds')

def monitored_generate(prompt: str, model: str = 'llama2'):
    request_count.inc()
    # Time the generate call and record it in the histogram
    with response_time.time():
        response = ollama.generate(model=model, prompt=prompt)
    return response
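Defining the metrics isn't enough on its own: they have to be exposed somewhere Prometheus can scrape them. A minimal sketch using prometheus_client's built-in HTTP exporter; port 8001 is an arbitrary choice of mine.

import time
from prometheus_client import start_http_server

if __name__ == "__main__":
    # Serve /metrics on port 8001 (arbitrary port choice) for Prometheus to scrape
    start_http_server(8001)
    # Reuses monitored_generate from the block above so something shows up in the metrics
    print(monitored_generate("Say hello in one sentence.")['response'])
    # Keep the process alive so the /metrics endpoint stays up
    while True:
        time.sleep(60)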
Performance Optimization Tips
1. Model Quantization
# Example Modelfile for a quantized model
# Quantization is selected through the model tag rather than a PARAMETER
FROM llama2:7b-chat-q4_0
PARAMETER num_ctx 4096
PARAMETER temperature 0.7
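When I don't want to bake tuning into a Modelfile, the same knobs can be passed per request through the options field of the generate call; a small sketch, with the specific values chosen only for illustration.

import ollama

# Per-request options override the model's defaults; the values here are illustrative
response = ollama.generate(
    model='llama2',
    prompt='Summarize the benefits of local LLM deployment in two sentences.',
    options={
        'num_ctx': 2048,      # smaller context window saves memory
        'temperature': 0.2,   # keep outputs more deterministic for production tasks
        'num_predict': 128,   # cap generation length
    },
)
print(response['response'])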
2. Batch Processing
import asyncio
from typing import List
import ollama

async def process_batch(prompts: List[str]):
    client = ollama.AsyncClient()  # awaitable client so requests run concurrently
    tasks = [client.generate(model='llama2', prompt=p) for p in prompts]
    return await asyncio.gather(*tasks)
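Firing an unbounded gather at a single GPU can queue up more work than the server handles gracefully, so I usually cap concurrency. A sketch using an asyncio.Semaphore; the limit of 4 is an assumption to tune for your hardware.

import asyncio
from typing import List
import ollama

async def process_batch_bounded(prompts: List[str], limit: int = 4):
    """Run prompts concurrently, but never more than `limit` at a time."""
    client = ollama.AsyncClient()
    semaphore = asyncio.Semaphore(limit)  # limit is an assumption; tune per GPU

    async def run_one(prompt: str):
        async with semaphore:
            return await client.generate(model='llama2', prompt=prompt)

    return await asyncio.gather(*(run_one(p) for p in prompts))

# Usage:
# results = asyncio.run(process_batch_bounded(["Prompt A", "Prompt B", "Prompt C"]))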
Conclusion
Ollama has proven to be a robust solution for production deployments, especially when you need local model execution with reasonable performance. While it may not replace cloud-based solutions for all use cases, it fills an important niche in the AI deployment ecosystem.
Remember to:
- Monitor system resources carefully
- Implement proper error handling
- Set up automated model updates (see the sketch after this list)
- Configure appropriate scaling policies
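On that last point about model updates, here is a minimal sketch of a scheduled refresh using the ollama Python client's pull call; the model list and the daily interval are assumptions to adapt to your deployment.

import time
import ollama

MODELS_TO_REFRESH = ["llama2", "codellama", "mistral"]  # assumed model list
REFRESH_INTERVAL_SECONDS = 24 * 60 * 60  # assumed daily refresh

def refresh_models():
    """Pull the latest version of each model; unchanged layers are not re-downloaded."""
    for model in MODELS_TO_REFRESH:
        try:
            ollama.pull(model)
            print(f"Refreshed {model}")
        except Exception as exc:
            print(f"Failed to refresh {model}: {exc}")

if __name__ == "__main__":
    while True:
        refresh_models()
        time.sleep(REFRESH_INTERVAL_SECONDS)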
For those interested in exploring Ollama further, the ollama GitHub repository includes additional examples and deployment documentation.
Note: The code examples provided are simplified for clarity. Production implementations should include proper error handling, logging, and security measures.