Running large language models locally has become increasingly accessible thanks to tools like Ollama. This comprehensive guide will walk you through setting up and using Ollama with Python, enabling you to harness the power of AI models directly on your machine.
What is Ollama?
Ollama is an open-source platform that makes it easy to run large language models locally. It provides a simple API and command-line interface for downloading, running, and managing various AI models including Llama 2, Code Llama, Mistral, and many others. With Ollama, you can run these models without relying on external APIs or cloud services.
Prerequisites
Before we begin, ensure you have:
- Python 3.7 or higher installed
- At least 8GB of RAM (16GB+ recommended for larger models)
- Sufficient disk space for model downloads (models range from 1GB to 70GB+)
Step 1: Install Ollama
On macOS and Linux
Download and install Ollama using the official installer:
curl -fsSL https://ollama.ai/install.sh | sh
On Windows
Download the Windows installer from the official Ollama website and run it.
Verify Installation
After installation, verify Ollama is working:
ollama --version
Step 2: Download Your First Model
Ollama supports numerous models. Let’s start with Llama 2, a popular choice:
ollama pull llama2
The bare llama2 tag already resolves to the 7B-parameter variant, the smallest and fastest of the family; you can pin the size explicitly with a tag (larger llama2:13b and llama2:70b variants are also available):
ollama pull llama2:7b
Other popular models include:
- mistral: Fast and efficient model
- codellama: Specialized for code generation
- phi: Microsoft's compact model
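If you prefer to manage models from Python rather than the shell, the ollama package (installed in Step 3 below) also exposes pull and list helpers. A minimal sketch; exact response fields can vary slightly between package versions:

import ollama

# Download a model programmatically (equivalent to `ollama pull mistral` on the CLI).
# Requires a running Ollama server and the ollama Python package from Step 3.
ollama.pull('mistral')

# Show what is installed locally.
for model in ollama.list()['models']:
    print(model['name'])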
Step 3: Set Up Python Environment
Create a virtual environment to keep your project dependencies isolated:
python -m venv ollama-env
source ollama-env/bin/activate # On Windows: ollama-env\Scripts\activate
Install the required Python packages:
pip install ollama requests
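The requests package is optional for the examples in this guide, but it is handy if you want to call Ollama's REST API directly; the server listens on http://localhost:11434 by default. A quick non-streaming sketch:

import requests

# Call the local Ollama server's REST API directly.
resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama2",
        "prompt": "Say hello in one sentence.",
        "stream": False,  # return one JSON object instead of a stream of chunks
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["response"])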
Step 4: Basic Python Integration
Simple Chat Interface
Create your first Python script to interact with Ollama:
import ollama

def chat_with_ollama(model_name="llama2"):
    """Simple chat function using Ollama"""
    print(f"Chatting with {model_name}. Type 'quit' to exit.")

    while True:
        user_input = input("\nYou: ")
        if user_input.lower() == 'quit':
            break

        try:
            response = ollama.chat(
                model=model_name,
                messages=[{
                    'role': 'user',
                    'content': user_input
                }]
            )
            print(f"AI: {response['message']['content']}")
        except Exception as e:
            print(f"Error: {e}")

if __name__ == "__main__":
    chat_with_ollama()
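The module-level ollama.chat call assumes a local server on the default port. If your Ollama instance runs on another machine or port, the package also provides a Client class that accepts a host URL; a brief sketch (the address below is only an example):

import ollama

# Talk to an Ollama server that is not on the default localhost:11434.
# Replace the example address with your own server's URL.
client = ollama.Client(host="http://192.168.1.50:11434")

response = client.chat(
    model="llama2",
    messages=[{'role': 'user', 'content': 'Hello!'}]
)
print(response['message']['content'])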
Streaming Responses
For real-time response streaming, use the streaming API:
import ollama

def stream_chat(model_name="llama2", prompt="Tell me about Python programming"):
    """Stream responses from Ollama"""
    print("Streaming response:")
    for chunk in ollama.chat(
        model=model_name,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    ):
        print(chunk['message']['content'], end='', flush=True)
    print()  # New line after streaming

stream_chat()
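If you also want the complete text once streaming finishes, for logging or further processing, you can accumulate the chunks as they arrive; a small variation on the function above:

import ollama

def stream_and_collect(model_name="llama2", prompt="Tell me about Python programming"):
    """Stream the response to the terminal and return the full text."""
    full_text = ""
    for chunk in ollama.chat(
        model=model_name,
        messages=[{'role': 'user', 'content': prompt}],
        stream=True
    ):
        piece = chunk['message']['content']
        print(piece, end='', flush=True)
        full_text += piece
    print()
    return full_text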
Step 5: Advanced Usage
Managing Conversation Context
Maintain conversation history for more natural interactions:
import ollama

class OllamaChat:
    def __init__(self, model_name="llama2"):
        self.model_name = model_name
        self.conversation_history = []

    def send_message(self, message):
        """Send a message and maintain conversation context"""
        self.conversation_history.append({
            'role': 'user',
            'content': message
        })

        response = ollama.chat(
            model=self.model_name,
            messages=self.conversation_history
        )

        assistant_message = response['message']['content']
        self.conversation_history.append({
            'role': 'assistant',
            'content': assistant_message
        })
        return assistant_message

    def clear_history(self):
        """Clear conversation history"""
        self.conversation_history = []

# Usage example
chat = OllamaChat()
response1 = chat.send_message("What's the capital of France?")
print(f"AI: {response1}")

response2 = chat.send_message("What's the population of that city?")
print(f"AI: {response2}")
Custom Model Parameters
Fine-tune model behavior with custom parameters:
import ollama

def generate_with_params(prompt, model="llama2"):
    """Generate text with custom parameters"""
    response = ollama.generate(
        model=model,
        prompt=prompt,
        options={
            'temperature': 0.7,     # Sampling temperature; higher values give more varied output
            'top_p': 0.9,           # Nucleus sampling
            'top_k': 40,            # Top-k sampling
            'repeat_penalty': 1.1,  # Penalty for repetition
            'num_ctx': 2048,        # Context window size
        }
    )
    return response['response']

result = generate_with_params("Write a creative story about a robot learning to paint:")
print(result)
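The chat endpoint accepts the same options dictionary, so you can tune sampling in multi-turn conversations as well; for example:

import ollama

# Pass sampling options to a chat call just like with generate.
response = ollama.chat(
    model="llama2",
    messages=[{'role': 'user', 'content': 'Suggest three names for a chess-playing robot.'}],
    options={
        'temperature': 0.9,  # more adventurous sampling for a creative task
        'top_p': 0.95,
    }
)
print(response['message']['content'])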
Working with Different Models
Switch between models based on your needs:
import ollama

class ModelManager:
    def __init__(self):
        self.available_models = self.get_available_models()

    def get_available_models(self):
        """Get list of locally available models"""
        try:
            models = ollama.list()
            return [model['name'] for model in models['models']]
        except Exception as e:
            print(f"Error getting models: {e}")
            return []

    def has_model(self, name):
        """Check for a model; local names usually carry a tag such as 'codellama:latest'"""
        return any(m.startswith(name) for m in self.available_models)

    def generate_code(self, prompt):
        """Use CodeLlama for code generation"""
        if not self.has_model('codellama'):
            return "CodeLlama not available. Please run: ollama pull codellama"
        response = ollama.generate(
            model='codellama',
            prompt=f"Generate Python code for: {prompt}"
        )
        return response['response']

    def generate_text(self, prompt):
        """Use a general model for text generation"""
        model = 'llama2' if self.has_model('llama2') else self.available_models[0]
        response = ollama.generate(model=model, prompt=prompt)
        return response['response']

# Usage
manager = ModelManager()
code = manager.generate_code("a function to calculate fibonacci numbers")
print("Generated code:")
print(code)
Step 6: Building a Complete Application
Here’s a more comprehensive example that combines multiple features:
import ollama
import json
import time
from datetime import datetime

class SmartAssistant:
    def __init__(self, model="llama2"):
        self.model = model
        self.conversation_log = []
        self.start_time = datetime.now()

    def log_interaction(self, user_input, ai_response, response_time):
        """Log conversations for analysis"""
        self.conversation_log.append({
            'timestamp': datetime.now().isoformat(),
            'user_input': user_input,
            'ai_response': ai_response,
            'response_time': response_time,
            'model_used': self.model
        })

    def process_query(self, query):
        """Process user query with timing"""
        start_time = time.time()
        try:
            response = ollama.generate(
                model=self.model,
                prompt=query,
                options={
                    'temperature': 0.8,
                    'top_p': 0.9,
                    'top_k': 40,
                }
            )
            ai_response = response['response']
            response_time = time.time() - start_time
            self.log_interaction(query, ai_response, response_time)
            return {
                'response': ai_response,
                'time_taken': round(response_time, 2),
                'model': self.model
            }
        except Exception as e:
            return {'error': str(e)}

    def save_conversation_log(self, filename="conversation_log.json"):
        """Save conversation history to file"""
        with open(filename, 'w') as f:
            json.dump(self.conversation_log, f, indent=2)
        print(f"Conversation log saved to {filename}")

    def get_stats(self):
        """Get conversation statistics"""
        if not self.conversation_log:
            return "No conversations yet."
        total_interactions = len(self.conversation_log)
        avg_response_time = sum(log['response_time'] for log in self.conversation_log) / total_interactions
        return f"""
Session Statistics:
- Total interactions: {total_interactions}
- Average response time: {avg_response_time:.2f} seconds
- Session duration: {datetime.now() - self.start_time}
- Model used: {self.model}
"""

# Usage example
assistant = SmartAssistant()
while True:
    user_input = input("\nYou: ")
    if user_input.lower() in ['quit', 'exit']:
        break
    elif user_input.lower() == 'stats':
        print(assistant.get_stats())
        continue
    elif user_input.lower() == 'save':
        assistant.save_conversation_log()
        continue

    result = assistant.process_query(user_input)
    if 'error' in result:
        print(f"Error: {result['error']}")
    else:
        print(f"AI ({result['time_taken']}s): {result['response']}")

print(assistant.get_stats())
assistant.save_conversation_log()
Best Practices
Performance Optimization
- Model Selection: Choose the right model size for your use case. Smaller models (7B parameters) are faster but less capable than larger ones (70B parameters).
- Context Management: Keep the conversation history to a reasonable length so prompts stay within the model's context window and responses stay fast:
def trim_conversation_history(messages, max_length=10):
    """Keep only recent messages to maintain performance"""
    if len(messages) > max_length:
        # Keep system messages (if any) and the most recent messages
        system_messages = [msg for msg in messages if msg.get('role') == 'system']
        recent_messages = messages[-(max_length - len(system_messages)):]
        return system_messages + recent_messages
    return messages
- Batch Processing: When you have many prompts, loop over them in one script so the model stays loaded in memory between requests instead of starting a fresh process for each:
import ollama

def batch_generate(prompts, model="llama2"):
    """Process a list of prompts sequentially, reusing the already-loaded model"""
    results = []
    for prompt in prompts:
        response = ollama.generate(model=model, prompt=prompt)
        results.append(response['response'])
    return results
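A quick usage example for the helper above (the prompts are only illustrative):

prompts = [
    "Summarize the benefits of running models locally in one sentence.",
    "List three use cases for a local coding assistant.",
]
for prompt, answer in zip(prompts, batch_generate(prompts)):
    print(f"Prompt: {prompt}\nAnswer: {answer}\n")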
Error Handling
Always implement robust error handling:
import ollama
import time

def safe_generate(prompt, model="llama2", max_retries=3):
    """Generate with retry logic"""
    for attempt in range(max_retries):
        try:
            response = ollama.generate(model=model, prompt=prompt)
            return response['response']
        except Exception as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise
Troubleshooting
Common Issues
Model Not Found Error
# Download the model first
ollama pull model-name
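You can also handle this from Python: the ollama package raises a ResponseError when a request fails, which lets a script pull the missing model and retry. A sketch, assuming the ResponseError exception exported by recent versions of the package:

import ollama

def generate_with_autopull(prompt, model="llama2"):
    """Generate text, pulling the model first if it is not available locally."""
    try:
        return ollama.generate(model=model, prompt=prompt)['response']
    except ollama.ResponseError as e:
        # A 404 from the server usually means the model has not been downloaded yet.
        print(f"Request failed ({e}); pulling '{model}' and retrying...")
        ollama.pull(model)
        return ollama.generate(model=model, prompt=prompt)['response']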
Connection Errors
# Check if the Ollama service is running
import subprocess

try:
    subprocess.run(['ollama', 'list'], check=True, capture_output=True)
    print("Ollama is running")
except subprocess.CalledProcessError:
    print("Ollama service may not be running. Try: ollama serve")
Memory Issues
- Use smaller models for limited RAM
- Close other applications
- Consider using quantized models
Conclusion
Ollama provides a powerful and accessible way to run large language models locally with Python. This guide covered the basics of installation, setup, and usage, along with advanced features like conversation management and custom parameters.
With local models, you gain privacy, control, and the ability to experiment without API limits. Start with smaller models to get familiar with the workflow, then gradually explore larger, more capable models as your needs grow.
The combination of Ollama’s simplicity and Python’s flexibility opens up endless possibilities for building AI-powered applications that run entirely on your machine.