Running large language models locally has become increasingly accessible thanks to tools like Ollama. This comprehensive guide will walk you through setting up and using Ollama with Python, enabling you to harness the power of AI models directly on your machine.
What is Ollama?
Ollama is an open-source platform that makes it easy to run large language models locally. It provides a simple API and command-line interface for downloading, running, and managing various AI models including Llama 2, Code Llama, Mistral, and many others. With Ollama, you can run these models without relying on external APIs or cloud services.
Prerequisites
Before we begin, ensure you have:
- Python 3.7 or higher installed
- At least 8GB of RAM (16GB+ recommended for larger models)
- Sufficient disk space for model downloads (models range from 1GB to 70GB+)
Step 1: Install Ollama
On macOS and Linux
Download and install Ollama using the official installer:
curl -fsSL https://ollama.ai/install.sh | sh
On Windows
Download the Windows installer from the official Ollama website and run it.
Verify Installation
After installation, verify Ollama is working:
ollama --version
Step 2: Download Your First Model
Ollama supports numerous models. Let’s start with Llama 2, a popular choice:
ollama pull llama2
The bare llama2 tag already resolves to the 7B-parameter variant, the smallest and fastest of the family; you can pin the size explicitly with a tag (larger llama2:13b and llama2:70b variants are also available):
ollama pull llama2:7b
Other popular models include:
- mistral: Fast and efficient model
- codellama: Specialized for code generation
- phi: Microsoft's compact model
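If you prefer to manage models from Python rather than the shell, the ollama package (installed in Step 3 below) also exposes pull and list helpers. A minimal sketch; exact response fields can vary slightly between package versions:

import ollama

# Download a model programmatically (equivalent to `ollama pull mistral` on the CLI).
# Requires a running Ollama server and the ollama Python package from Step 3.
ollama.pull('mistral')

# Show what is installed locally.
for model in ollama.list()['models']:
    print(model['name'])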
Step 3: Set Up Python Environment
Create a virtual environment to keep your project dependencies isolated:
python -m venv ollama-env
source ollama-env/bin/activate # On Windows: ollama-env\Scripts\activate
Install the required Python packages:
pip install ollama requests
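The requests package is optional for the examples in this guide, but it is handy if you want to call Ollama's REST API directly; the server listens on http://localhost:11434 by default. A quick non-streaming sketch:

import requests

# Call the local Ollama server's REST API directly.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Say hello in one sentence.",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])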
Step 4: Basic Python Integration
Simple Chat Interface
Create your first Python script to interact with Ollama:
import ollama

def chat_with_ollama(model_name="llama2"):
    """Simple chat function using Ollama"""
    print(f"Chatting with {model_name}. Type 'quit' to exit.")

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == 'quit':
            break

        try:
            response = ollama.chat(
                model=model_name,
                messages=[{
                    'role': 'user',
                    'content': user_input
                }]
            )
            print(f"AI: {response['message']['content']}")
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    chat_with_ollama()
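The module-level ollama.chat call assumes a local server on the default port. If your Ollama instance runs on another machine or port, the package also provides a Client class that accepts a host URL; a brief sketch (the address below is only an example):

import ollama

# Talk to an Ollama server that is not on the default localhost:11434.
# Replace the example address with your own server's URL.
client = ollama.Client(host="http://192.168.1.50:11434")

response = client.chat(
    model="llama2",
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])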
Streaming Responses
For real-time response streaming, use the streaming API:
import ollama

def stream_chat(model_name="llama2", prompt="Tell me about Python programming"):
    """Stream responses from Ollama"""
    print("Streaming response:")
    for chunk in ollama.chat(
        model=model_name,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    ):
        print(chunk['message']['content'], end='', flush=True)
    print()  # New line after streaming

stream_chat()
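If you also want the complete text once streaming finishes, for logging or further processing, you can accumulate the chunks as they arrive; a small variation on the function above:

import ollama

def stream_and_collect(model_name="llama2", prompt="Tell me about Python programming"):
    """Stream the response to the terminal and return the full text."""
    full_text = ""
    for chunk in ollama.chat(
        model=model_name,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    ):
        piece = chunk['message']['content']
        print(piece, end='', flush=True)
        full_text += piece
    print()
    return full_text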
Step 5: Advanced Usage
Managing Conversation Context
Maintain conversation history for more natural interactions:
import ollama

class OllamaChat:
    def __init__(self, model_name="llama2"):
        self.model_name = model_name
        self.conversation_history = []

    def send_message(self, message):
        """Send a message and maintain conversation context"""
        self.conversation_history.append({
            'role': 'user',
            'content': message
        })

        response = ollama.chat(
            model=self.model_name,
            messages=self.conversation_history
        )

        assistant_message = response['message']['content']
        self.conversation_history.append({
            'role': 'assistant',
            'content': assistant_message
        })
        return assistant_message

    def clear_history(self):
        """Clear conversation history"""
        self.conversation_history = []

# Usage example
chat = OllamaChat()
response1 = chat.send_message("What's the capital of France?")
print(f"AI: {response1}")

response2 = chat.send_message("What's the population of that city?")
print(f"AI: {response2}")
Custom Model Parameters
Fine-tune model behavior with custom parameters:
import ollama

def generate_with_params(prompt, model="llama2"):
    """Generate text with custom parameters"""
    response = ollama.generate(
        model=model,
        prompt=prompt,
        options={
            'temperature': 0.7,     # Sampling temperature; higher values give more varied output
            'top_p': 0.9,           # Nucleus sampling
            'top_k': 40,            # Top-k sampling
            'repeat_penalty': 1.1,  # Penalty for repetition
            'num_ctx': 2048,        # Context window size
        }
    )
    return response['response']

result = generate_with_params("Write a creative story about a robot learning to paint:")
print(result)
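The chat endpoint accepts the same options dictionary, so you can tune sampling in multi-turn conversations as well; for example:

import ollama

# Pass sampling options to a chat call just like with generate.
response = ollama.chat(
    model="llama2",
    messages=[{'role': 'user', 'content': 'Suggest three names for a chess-playing robot.'}],
    options={
        'temperature': 0.9,  # more adventurous sampling for a creative task
        'top_p': 0.95,
    }
)
print(response['message']['content'])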
Working with Different Models
Switch between models based on your needs:
import ollama

class ModelManager:
    def __init__(self):
        self.available_models = self.get_available_models()

    def get_available_models(self):
        """Get list of locally available models"""
        try:
            models = ollama.list()
            return [model['name'] for model in models['models']]
        except Exception as e:
            print(f"Error getting models: {e}")
            return []

    def has_model(self, name):
        """Check for a model; local names usually carry a tag such as 'codellama:latest'"""
        return any(m.startswith(name) for m in self.available_models)

    def generate_code(self, prompt):
        """Use CodeLlama for code generation"""
        if not self.has_model('codellama'):
            return "CodeLlama not available. Please run: ollama pull codellama"
        response = ollama.generate(
            model='codellama',
            prompt=f"Generate Python code for: {prompt}"
        )
        return response['response']

    def generate_text(self, prompt):
        """Use a general model for text generation"""
        model = 'llama2' if self.has_model('llama2') else self.available_models[0]
        response = ollama.generate(model=model, prompt=prompt)
        return response['response']

# Usage
manager = ModelManager()
code = manager.generate_code("a function to calculate fibonacci numbers")
print("Generated code:")
print(code)
Step 6: Building a Complete Application
Here’s a more comprehensive example that combines multiple features:
import ollama
import json
import time
from datetime import datetime

class SmartAssistant:
    def __init__(self, model="llama2"):
        self.model = model
        self.conversation_log = []
        self.start_time = datetime.now()

    def log_interaction(self, user_input, ai_response, response_time):
        """Log conversations for analysis"""
        self.conversation_log.append({
            'timestamp': datetime.now().isoformat(),
            'user_input': user_input,
            'ai_response': ai_response,
            'response_time': response_time,
            'model_used': self.model
        })

    def process_query(self, query):
        """Process user query with timing"""
        start_time = time.time()
        try:
            response = ollama.generate(
                model=self.model,
                prompt=query,
                options={
                    'temperature': 0.8,
                    'top_p': 0.9,
                    'top_k': 40,
                }
            )
            ai_response = response['response']
            response_time = time.time() - start_time
            self.log_interaction(query, ai_response, response_time)
            return {
                'response': ai_response,
                'time_taken': round(response_time, 2),
                'model': self.model
            }
        except Exception as e:
            return {'error': str(e)}

    def save_conversation_log(self, filename="conversation_log.json"):
        """Save conversation history to file"""
        with open(filename, 'w') as f:
            json.dump(self.conversation_log, f, indent=2)
        print(f"Conversation log saved to {filename}")

    def get_stats(self):
        """Get conversation statistics"""
        if not self.conversation_log:
            return "No conversations yet."
        total_interactions = len(self.conversation_log)
        avg_response_time = sum(log['response_time'] for log in self.conversation_log) / total_interactions
        return f"""
Session Statistics:
- Total interactions: {total_interactions}
- Average response time: {avg_response_time:.2f} seconds
- Session duration: {datetime.now() - self.start_time}
- Model used: {self.model}
"""

# Usage example
assistant = SmartAssistant()
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    elif user_input.lower() == 'stats':
        print(assistant.get_stats())
        continue
    elif user_input.lower() == 'save':
        assistant.save_conversation_log()
        continue

    result = assistant.process_query(user_input)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"AI ({result['time_taken']}s): {result['response']}")

print(assistant.get_stats())
assistant.save_conversation_log()
Best Practices
Performance Optimization
- Model Selection: Choose the right model size for your use case. Smaller models (7B parameters) are faster but less capable than larger ones (70B parameters).
- Context Management: Keep the conversation history to a reasonable length so prompts stay within the model's context window and responses stay fast:
def trim_conversation_history(messages, max_length=10):
    """Keep only recent messages to maintain performance"""
    if len(messages) > max_length:
        # Keep system messages (if any) and the most recent messages
        system_messages = [msg for msg in messages if msg.get('role') == 'system']
        recent_messages = messages[-(max_length - len(system_messages)):]
        return system_messages + recent_messages
    return messages
- Batch Processing: When you have many prompts, loop over them in one script so the model stays loaded in memory between requests instead of starting a fresh process for each:
import ollama

def batch_generate(prompts, model="llama2"):
    """Process a list of prompts sequentially, reusing the already-loaded model"""
    results = []
    for prompt in prompts:
        response = ollama.generate(model=model, prompt=prompt)
        results.append(response['response'])
    return results
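A quick usage example for the helper above (the prompts are only illustrative):

prompts = [
    "Summarize the benefits of running models locally in one sentence.",
    "List three use cases for a local coding assistant.",
]
for prompt, answer in zip(prompts, batch_generate(prompts)):
    print(f"Prompt: {prompt}\nAnswer: {answer}\n")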
Error Handling
Always implement robust error handling:
import ollama
import time

def safe_generate(prompt, model="llama2", max_retries=3):
    """Generate with retry logic"""
    for attempt in range(max_retries):
        try:
            response = ollama.generate(model=model, prompt=prompt)
            return response['response']
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
Troubleshooting
Common Issues
Model Not Found Error
# Download the model first
ollama pull model-name
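You can also handle this from Python: the ollama package raises a ResponseError when a request fails, which lets a script pull the missing model and retry. A sketch, assuming the ResponseError exception exported by recent versions of the package:

import ollama

def generate_with_autopull(prompt, model="llama2"):
    """Generate text, pulling the model first if it is not available locally."""
    try:
        return ollama.generate(model=model, prompt=prompt)['response']
    except ollama.ResponseError as e:
        # A 404 from the server usually means the model has not been downloaded yet.
        print(f"Request failed ({e}); pulling '{model}' and retrying...")
        ollama.pull(model)
        return ollama.generate(model=model, prompt=prompt)['response']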
Connection Errors
# Check if the Ollama service is running
import subprocess

try:
    subprocess.run(['ollama', 'list'], check=True, capture_output=True)
    print("Ollama is running")
except subprocess.CalledProcessError:
    print("Ollama service may not be running. Try: ollama serve")
Memory Issues
- Use smaller models for limited RAM
- Close other applications
- Consider using quantized models
Conclusion
Ollama provides a powerful and accessible way to run large language models locally with Python. This guide covered the basics of installation, setup, and usage, along with advanced features like conversation management and custom parameters.
With local models, you gain privacy, control, and the ability to experiment without API limits. Start with smaller models to get familiar with the workflow, then gradually explore larger, more capable models as your needs grow.
The combination of Ollama’s simplicity and Python’s flexibility opens up endless possibilities for building AI-powered applications that run entirely on your machine.