What is Ollama?
Ollama is a lightweight, extensible framework for building and running large language models locally. Run LLaMA, Mistral, CodeLlama, and other models on your machine without cloud dependencies.
Quick Installation
macOS
# Download and install
curl -fsSL https://ollama.com/install.sh | sh
# Alternative: Homebrew
brew install ollama
Linux
curl -fsSL https://ollama.com/install.sh | sh
Windows
# Download installer from ollama.com
# Or use winget
winget install Ollama.Ollama
Docker Installation
# Run Ollama in Docker
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
# Pull and run a model
docker exec -it ollama ollama run llama2
Starting Ollama Service
# Start Ollama service
ollama serve
# Run in background (Linux/macOS)
nohup ollama serve > ollama.log 2>&1 &
Basic Model Operations
Pull Models
# Pull LLaMA 2 (7B)
ollama pull llama2
# Pull specific model versions
ollama pull llama2:13b
ollama pull llama2:70b
# Pull other popular models
ollama pull mistral
ollama pull codellama
ollama pull vicuna
ollama pull neural-chat
List Available Models
# List installed models
ollama list
# Show model information
ollama show llama2
Remove Models
# Remove specific model
ollama rm llama2:13b
# Remove all versions of a model
ollama rm llama2
Running Models
Interactive Chat
# Start interactive session
ollama run llama2
# Adjust sampling inside the session (run has no --temperature flag)
/set parameter temperature 0.7
# Exit interactive mode
/bye
Single Prompt
# One-time prompt
ollama run llama2 "Explain quantum computing"
# Sampling options (temperature, etc.) are not command-line flags;
# set them via /set parameter, a Modelfile, or the API "options" field
ollama run llama2 "Write Python code for sorting"
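Since sampling parameters cannot be passed as run flags, the REST API's "options" field is the programmatic way to set them per request. A minimal sketch of building such a payload (the option names temperature, num_ctx, and num_predict follow the Ollama API; build_generate_request is a hypothetical helper, not part of any client library):

```python
import json

def build_generate_request(model, prompt, **options):
    """Build an /api/generate payload; extra kwargs become Ollama options."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": options,  # e.g. temperature, num_ctx, num_predict
    }

payload = build_generate_request("llama2", "Write Python code for sorting",
                                 temperature=0.1, num_ctx=2048)
print(json.dumps(payload, indent=2))
# POST this body to http://localhost:11434/api/generate
```

The same "options" object works on /api/chat requests as well.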
API Usage
REST API Examples
Basic Chat Completion
# Simple chat request
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "user",
        "content": "Why is the sky blue?"
      }
    ]
  }'
Streaming Response
# Stream response
curl http://localhost:11434/api/chat \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "messages": [
      {
        "role": "user",
        "content": "Write a Python function"
      }
    ],
    "stream": true
  }'
Generate Text
# Generate completion
curl http://localhost:11434/api/generate \
  -H "Content-Type: application/json" \
  -d '{
    "model": "llama2",
    "prompt": "The future of AI is",
    "stream": false
  }'
Python Integration
Basic Python Client
import requests

def chat_with_ollama(model, message):
    url = "http://localhost:11434/api/chat"
    data = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False
    }
    response = requests.post(url, json=data)
    return response.json()["message"]["content"]

# Usage
result = chat_with_ollama("llama2", "Explain machine learning")
print(result)
Streaming Python Client
import requests
import json

def stream_chat(model, message):
    url = "http://localhost:11434/api/chat"
    data = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": True
    }
    # Each line of the streamed response is a standalone JSON chunk
    with requests.post(url, json=data, stream=True) as response:
        for line in response.iter_lines():
            if line:
                chunk = json.loads(line)
                if "message" in chunk:
                    yield chunk["message"]["content"]

# Usage
for token in stream_chat("llama2", "Write a story"):
    print(token, end="", flush=True)
Async Python Client
import aiohttp
import asyncio

async def async_chat(model, message):
    url = "http://localhost:11434/api/chat"
    data = {
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False
    }
    async with aiohttp.ClientSession() as session:
        async with session.post(url, json=data) as response:
            result = await response.json()
            return result["message"]["content"]

# Usage
async def main():
    result = await async_chat("llama2", "Explain async programming")
    print(result)

asyncio.run(main())
JavaScript/Node.js Integration
Basic Node.js Client
const axios = require('axios');

async function chatWithOllama(model, message) {
  const response = await axios.post('http://localhost:11434/api/chat', {
    model: model,
    messages: [{ role: 'user', content: message }],
    stream: false
  });
  return response.data.message.content;
}

// Usage
chatWithOllama('llama2', 'Explain JavaScript promises')
  .then(result => console.log(result))
  .catch(err => console.error(err));
Browser Integration
<!DOCTYPE html>
<html>
<head>
  <title>Ollama Chat</title>
</head>
<body>
  <div id="chat"></div>
  <input type="text" id="input" placeholder="Ask something...">
  <button onclick="sendMessage()">Send</button>
  <script>
    async function sendMessage() {
      const input = document.getElementById('input');
      const message = input.value;
      try {
        const response = await fetch('http://localhost:11434/api/chat', {
          method: 'POST',
          headers: { 'Content-Type': 'application/json' },
          body: JSON.stringify({
            model: 'llama2',
            messages: [{ role: 'user', content: message }],
            stream: false
          })
        });
        const data = await response.json();
        document.getElementById('chat').innerHTML +=
          `<p><strong>You:</strong> ${message}</p>
           <p><strong>AI:</strong> ${data.message.content}</p>`;
        input.value = '';
      } catch (error) {
        console.error('Error:', error);
      }
    }
  </script>
</body>
</html>
Custom Model Configuration
Create Modelfile
# Modelfile for custom configuration
FROM llama2
# Set custom parameters
PARAMETER temperature 0.8
PARAMETER top_p 0.9
PARAMETER top_k 40
# Set custom system prompt
SYSTEM """
You are a helpful coding assistant. Always provide code examples.
"""
# Set custom template
TEMPLATE """{{ if .System }}<|system|>
{{ .System }}<|end|>
{{ end }}{{ if .Prompt }}<|user|>
{{ .Prompt }}<|end|>
{{ end }}<|assistant|>
"""
Build Custom Model
# Create custom model
ollama create my-assistant -f ./Modelfile
# Run custom model
ollama run my-assistant
Advanced Configuration
Environment Variables
# Set custom host and port
export OLLAMA_HOST=0.0.0.0:11434
# Set model storage location
export OLLAMA_MODELS=/custom/path/models
# Enable debug logging
export OLLAMA_DEBUG=1
# Set maximum number of models loaded at once
export OLLAMA_MAX_LOADED_MODELS=3
# Set concurrent requests handled per model
export OLLAMA_NUM_PARALLEL=4
GPU Configuration
# Check whether loaded models are running on GPU or CPU
ollama ps
# Force CPU usage by hiding all GPUs from the server
CUDA_VISIBLE_DEVICES="" ollama serve
# Use specific GPU
CUDA_VISIBLE_DEVICES=0 ollama serve
Memory Management
# Context window and batch size are model parameters, not run flags;
# set them inside a session...
ollama run llama2
/set parameter num_ctx 4096
/set parameter num_batch 512
# ...or persist them in a Modelfile:
# PARAMETER num_ctx 4096
Popular Models and Use Cases
Code Generation
# CodeLlama for programming
ollama pull codellama
# Use for code generation
ollama run codellama "Write a Python web scraper"
# Code completion
ollama run codellama:code "def fibonacci(n):"
Specialized Models
# For chat/conversation
ollama pull vicuna
# For reasoning tasks
ollama pull neural-chat
# For multilingual tasks
ollama pull orca-mini
# For SQL generation
ollama pull sqlcoder
Troubleshooting
Common Issues
Port Already in Use
# Kill existing Ollama process
pkill ollama
# Or find and kill specific PID
lsof -i :11434
kill -9 <PID>
# Start with different port
OLLAMA_HOST=0.0.0.0:11435 ollama serve
Out of Memory
# Check system resources
free -h # Linux
vm_stat # macOS
# Use smaller model
ollama pull llama2:7b # Instead of 13b or 70b
# Reduce the context window (num_ctx) inside a session
ollama run llama2
/set parameter num_ctx 2048
Model Download Issues
# Interrupted pulls resume automatically; just re-run the pull
ollama pull llama2
# Corrupted partial downloads can be removed from the blob store
# (files under ~/.ollama/models/blobs with a "-partial" suffix)
# --insecure is for registries without valid TLS, not for resuming
ollama pull llama2 --insecure
Performance Optimization
System Tuning
# Increase file descriptors (Linux)
ulimit -n 65536
# Control how long models stay loaded (shorter frees memory sooner)
export OLLAMA_KEEP_ALIVE=5m
# Enable flash attention to reduce memory use on supported GPUs
export OLLAMA_FLASH_ATTENTION=1
Monitoring
# List installed models
curl http://localhost:11434/api/tags
# Health check
curl http://localhost:11434/api/version
# Show loaded models and memory usage
curl http://localhost:11434/api/ps
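A small script can turn the /api/ps JSON into a readable summary. The field names (models, name, size_vram) are assumed from the Ollama API response shape; fetching is left to the caller so the formatter works offline:

```python
def summarize_ps(ps_response):
    """Format an /api/ps response: one line per loaded model with VRAM use."""
    lines = []
    for m in ps_response.get("models", []):
        vram_gb = m.get("size_vram", 0) / 1e9
        lines.append(f"{m['name']}: {vram_gb:.1f} GB VRAM")
    return "\n".join(lines)

# Example shape of an /api/ps response
sample = {"models": [{"name": "llama2:latest", "size_vram": 5500000000}]}
print(summarize_ps(sample))
# In practice:
#   summarize_ps(requests.get("http://localhost:11434/api/ps").json())
```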
Docker Compose Setup
version: '3.8'
services:
  ollama:
    image: ollama/ollama
    container_name: ollama
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    environment:
      - OLLAMA_HOST=0.0.0.0:11434
    restart: unless-stopped
    # GPU access requires the NVIDIA Container Toolkit on the host
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]

volumes:
  ollama_data:
Integration Examples
FastAPI Integration
from fastapi import FastAPI
import requests

app = FastAPI()

@app.post("/chat")
def chat_endpoint(message: str, model: str = "llama2"):
    # Plain def: FastAPI runs it in a threadpool, so the blocking
    # requests call does not stall the event loop
    response = requests.post("http://localhost:11434/api/chat", json={
        "model": model,
        "messages": [{"role": "user", "content": message}],
        "stream": False
    })
    return {"response": response.json()["message"]["content"]}
# Run with: uvicorn main:app --reload
Express.js Integration
const express = require('express');
const axios = require('axios');

const app = express();
app.use(express.json());

app.post('/chat', async (req, res) => {
  try {
    const { message, model = 'llama2' } = req.body;
    const response = await axios.post('http://localhost:11434/api/chat', {
      model,
      messages: [{ role: 'user', content: message }],
      stream: false
    });
    res.json({ response: response.data.message.content });
  } catch (error) {
    res.status(500).json({ error: error.message });
  }
});

app.listen(3000, () => console.log('Server running on port 3000'));
CLI Automation Scripts
Batch Processing
#!/bin/bash
# batch_process.sh
MODEL="llama2"
INPUT_FILE="prompts.txt"
OUTPUT_DIR="outputs"

mkdir -p "$OUTPUT_DIR"

while IFS= read -r prompt; do
    echo "Processing: $prompt"
    output_file="$OUTPUT_DIR/$(echo "$prompt" | sed 's/[^a-zA-Z0-9]/_/g').txt"
    ollama run "$MODEL" "$prompt" > "$output_file"
done < "$INPUT_FILE"
Model Management Script
#!/bin/bash
# manage_models.sh
case "$1" in
    install)
        ollama pull llama2
        ollama pull codellama
        ollama pull mistral
        ;;
    cleanup)
        # Removes every installed model (skips the header line)
        ollama list | tail -n +2 | awk '{print $1}' | xargs -I {} ollama rm {}
        ;;
    status)
        ollama list
        ;;
    *)
        echo "Usage: $0 {install|cleanup|status}"
        exit 1
        ;;
esac
This guide covers installation, everyday usage, API integration, troubleshooting, and advanced configuration for Ollama, with practical code examples developers can use immediately.