In this technical deep dive, I’ll walk through creating a complete Retrieval-Augmented Generation (RAG) agent using DeepSeek-R1 and Ollama. This approach combines the powerful reasoning capabilities of DeepSeek-R1 with the local deployment flexibility of Ollama to create an efficient, customizable knowledge retrieval system.
Introduction to DeepSeek-R1 and Ollama
DeepSeek-R1 is a large language model (LLM) from DeepSeek AI that is optimized for complex reasoning. It demonstrates exceptional performance on logic, mathematics, and coding benchmarks while maintaining strong general capabilities.
Ollama is an open-source framework for running LLMs locally. It packages model weights, parameters, and prompts into self-contained bundles defined by a Modelfile, making it easy to run models on personal hardware without cloud dependencies.
Architecture Overview
Our RAG agent will follow this high-level architecture:
- Document Processing Pipeline: Convert documents to vector embeddings
- Vector Database: Store and retrieve embeddings efficiently
- Query Processing: Transform user queries into effective search vectors
- Context Retrieval: Find relevant document chunks from the vector database
- Generation: Use DeepSeek-R1 to produce accurate responses based on retrieved context
Let’s implement each component with working code.
Setting Up the Environment
First, let’s install the necessary dependencies:
# Install required packages
pip install langchain langchain_community pymupdf sentence-transformers \
    chromadb pydantic fastapi uvicorn requests
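The Python packages alone are not enough: Ollama itself must be installed and running (see ollama.com for install instructions). A minimal sanity check, assuming Ollama is listening on its default port 11434, could look like this:

# check_ollama.py -- confirm the local Ollama server is reachable before indexing anything
import requests

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default endpoint listing locally pulled models

try:
    response = requests.get(OLLAMA_TAGS_URL, timeout=5)
    response.raise_for_status()
    models = [m["name"] for m in response.json().get("models", [])]
    print("Ollama is running. Local models:", models or "none pulled yet")
except requests.RequestException as exc:
    print(f"Could not reach Ollama at {OLLAMA_TAGS_URL}: {exc}")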
1. Document Processing Pipeline
We’ll start by creating a document loader and chunking system:
import os
from typing import List, Dict, Any

from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document


class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    def load_and_split(self, file_path: str) -> List[Dict[str, Any]]:
        """Load a document and split it into chunks."""
        if file_path.endswith('.pdf'):
            loader = PyMuPDFLoader(file_path)
            documents = loader.load()
        else:
            # Handle plain-text files by wrapping the content in a LangChain Document
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
            documents = [Document(page_content=text, metadata={"source": file_path})]

        # Split documents into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Format chunks for storage
        processed_chunks = []
        for i, chunk in enumerate(chunks):
            processed_chunks.append({
                "id": f"{os.path.basename(file_path)}_chunk_{i}",
                "text": chunk.page_content,
                "metadata": {
                    "source": chunk.metadata.get("source", file_path),
                    "page": chunk.metadata.get("page", 0)
                }
            })
        return processed_chunks
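As a quick sanity check of the processor in isolation, here is a small usage sketch; the file name sample.pdf is just a placeholder for a document you actually have:

# Hypothetical example -- replace sample.pdf with a real document path
processor = DocumentProcessor(chunk_size=1000, chunk_overlap=200)
chunks = processor.load_and_split("sample.pdf")
print(f"Produced {len(chunks)} chunks")
print(chunks[0]["id"])        # e.g. "sample.pdf_chunk_0"
print(chunks[0]["metadata"])  # {"source": "...", "page": ...}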
2. Vector Database with ChromaDB
Next, we’ll set up our vector database using ChromaDB:
import chromadb
from sentence_transformers import SentenceTransformer


class VectorStore:
    def __init__(self, collection_name="rag_documents", embedding_model="all-MiniLM-L6-v2"):
        # Initialize a persistent ChromaDB client (stores data under ./chroma_db)
        self.client = chromadb.PersistentClient(path="./chroma_db")

        # Create or get the collection, using cosine distance for similarity
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

        # Initialize the sentence-transformers embedding model
        self.embedder = SentenceTransformer(embedding_model)

    def add_documents(self, documents: List[Dict[str, Any]]) -> None:
        """Add documents to the vector store."""
        ids = [doc["id"] for doc in documents]
        texts = [doc["text"] for doc in documents]
        metadatas = [doc["metadata"] for doc in documents]

        # Generate embeddings
        embeddings = self.embedder.encode(texts)

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=metadatas
        )

    def search(self, query: str, top_k=5) -> List[Dict[str, Any]]:
        """Search for relevant documents."""
        query_embedding = self.embedder.encode(query).tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # Format results (ChromaDB returns distances, so lower scores mean closer matches)
        documents = []
        for i in range(len(results["ids"][0])):
            documents.append({
                "id": results["ids"][0][i],
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "score": results["distances"][0][i] if "distances" in results else None
            })
        return documents
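Here is a short usage sketch that indexes the chunks from the previous step and runs a test query; because the collection is configured for cosine distance, lower scores mean closer matches:

# Index the chunks produced above and run a quick test query
store = VectorStore()
store.add_documents(chunks)
hits = store.search("What is the main topic of the document?", top_k=3)
for hit in hits:
    # Cosine distance: smaller score = more similar
    print(round(hit["score"], 3), hit["metadata"]["source"], hit["text"][:80])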
3. Setting Up Ollama with DeepSeek-R1
Now let’s set up Ollama to serve the DeepSeek-R1 model:
# Create a Modelfile for DeepSeek-R1 in Ollama
modelfile_content = """
FROM deepseek-r1:7b

# Sampling parameters
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER top_k 40

# System prompt to enforce RAG behavior
SYSTEM You are an AI assistant powered by DeepSeek-R1. You will be provided with retrieved context from a knowledge base. Always use this context to answer questions. If the context doesn't contain the answer, say "I don't have enough information to answer that question." Always cite the source of your information.
"""

# Write the Modelfile
with open("Modelfile", "w") as f:
    f.write(modelfile_content)

# Then create the model from a shell:
#   ollama create deepseek-r1-rag -f Modelfile
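After building the model, it is worth a quick smoke test against Ollama's /api/generate endpoint before wiring it into the agent. This sketch assumes the deepseek-r1-rag model created above:

import requests

# Ask the freshly created model a trivial question to confirm it loads and responds
payload = {
    "model": "deepseek-r1-rag",
    "prompt": "In one sentence, what are you?",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
print(resp.json().get("response", ""))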
4. Creating the RAG Agent
Now, let’s build the RAG agent that integrates all components:
import requests
from pydantic import BaseModel


class RAGAgent:
    def __init__(
        self,
        vector_store: VectorStore,
        ollama_url="http://localhost:11434/api/generate",
        model_name="deepseek-r1-rag",
        max_tokens=1024
    ):
        self.vector_store = vector_store
        self.ollama_url = ollama_url
        self.model_name = model_name
        self.max_tokens = max_tokens

    def _generate_prompt(self, query: str, contexts: List[Dict[str, Any]]) -> str:
        """Create a prompt for the LLM with retrieved contexts."""
        formatted_contexts = ""
        for i, ctx in enumerate(contexts):
            formatted_contexts += f"\nDocument {i+1} (Source: {ctx['metadata']['source']}, Page: {ctx['metadata'].get('page', 'N/A')}):\n{ctx['text']}\n"

        prompt = f"""I need you to answer the following question based on the retrieved information:

Question: {query}

Retrieved Context:
{formatted_contexts}

Answer the question using only the information provided in the retrieved context. If the context doesn't contain the answer, say "I don't have enough information to answer that question."
"""
        return prompt

    def _call_ollama(self, prompt: str) -> str:
        """Call the Ollama API to generate a response."""
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": self.max_tokens,
                "temperature": 0.1,
            }
        }
        response = requests.post(self.ollama_url, json=payload)
        result = response.json()
        return result.get("response", "")

    def answer_question(self, query: str, top_k=5) -> Dict[str, Any]:
        """Process a question and return an answer with citations."""
        # Retrieve relevant documents
        retrieved_contexts = self.vector_store.search(query, top_k=top_k)

        # Generate prompt with contexts
        prompt = self._generate_prompt(query, retrieved_contexts)

        # Get answer from LLM
        answer = self._call_ollama(prompt)

        # Return structured response
        return {
            "query": query,
            "answer": answer,
            "sources": [
                {
                    "source": ctx["metadata"]["source"],
                    "page": ctx["metadata"].get("page", "N/A")
                }
                for ctx in retrieved_contexts
            ]
        }
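A minimal usage sketch for the agent might look like the following; the question is an illustrative placeholder, and the VectorStore instance picks up whatever has already been indexed in the persistent ./chroma_db directory:

# Illustrative question -- the persistent ChromaDB collection supplies the indexed chunks
agent = RAGAgent(vector_store=VectorStore())
result = agent.answer_question("What warranty period does the manual specify?", top_k=3)
print(result["answer"])
for src in result["sources"]:
    print(f"- {src['source']} (page {src['page']})")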
5. Building a Simple API
Let’s wrap our RAG agent with a FastAPI service:
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import uvicorn
import shutil
import uuid

app = FastAPI(title="DeepSeek-R1 RAG API")

# Initialize components
doc_processor = DocumentProcessor()
vector_store = VectorStore()
rag_agent = RAGAgent(vector_store)


class QueryRequest(BaseModel):
    query: str
    top_k: int = 5


@app.post("/upload")
async def upload_document(file: UploadFile):
    """Upload and process a document."""
    # Save the file temporarily (keep the original name so the extension check still works)
    file_path = f"./temp_{uuid.uuid4()}_{file.filename}"
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    # Process the document
    try:
        chunks = doc_processor.load_and_split(file_path)
        vector_store.add_documents(chunks)
        return {"message": f"Document processed successfully with {len(chunks)} chunks"}
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Failed to process document: {str(e)}"}
        )
    finally:
        # Clean up the temporary file
        os.remove(file_path)


@app.post("/query")
async def query(request: QueryRequest):
    """Answer a query using the RAG system."""
    try:
        result = rag_agent.answer_question(request.query, top_k=request.top_k)
        return result
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Failed to process query: {str(e)}"}
        )


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
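Once the server is running, the endpoints can be exercised from any HTTP client. A small sketch using requests (the file name and question are placeholders) could look like this:

# client_example.py -- placeholder file name and question
import requests

BASE_URL = "http://localhost:8000"

# Upload and index a document
with open("sample.pdf", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/upload",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )
print(upload.json())

# Query the indexed content
answer = requests.post(
    f"{BASE_URL}/query",
    json={"query": "Summarize the key points of the document.", "top_k": 3},
)
print(answer.json())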
6. Improving with Reranking
To improve retrieval accuracy, let’s add a reranking step using cross-encoders:
from sentence_transformers import CrossEncoder


class EnhancedRAGAgent(RAGAgent):
    def __init__(
        self,
        vector_store: VectorStore,
        reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        **kwargs
    ):
        super().__init__(vector_store, **kwargs)
        self.reranker = CrossEncoder(reranker_model)

    def answer_question(self, query: str, top_k=5, rerank_top_n=10) -> Dict[str, Any]:
        """Process a question with reranking for improved accuracy."""
        # Retrieve more documents than needed so the reranker has candidates to choose from
        retrieved_contexts = self.vector_store.search(query, top_k=rerank_top_n)

        if retrieved_contexts:
            # Prepare (query, passage) pairs for the cross-encoder
            pairs = [(query, ctx["text"]) for ctx in retrieved_contexts]

            # Score with the cross-encoder
            scores = self.reranker.predict(pairs)

            # Sort contexts by reranker score, highest first
            scored_contexts = list(zip(retrieved_contexts, scores))
            scored_contexts.sort(key=lambda x: x[1], reverse=True)

            # Keep only the top_k contexts after reranking
            retrieved_contexts = [ctx for ctx, score in scored_contexts[:top_k]]

        # Generate prompt with contexts
        prompt = self._generate_prompt(query, retrieved_contexts)

        # Get answer from LLM
        answer = self._call_ollama(prompt)

        # Return structured response
        return {
            "query": query,
            "answer": answer,
            "sources": [
                {
                    "source": ctx["metadata"]["source"],
                    "page": ctx["metadata"].get("page", "N/A")
                }
                for ctx in retrieved_contexts
            ]
        }
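Usage mirrors the base agent; this sketch (with an illustrative question) retrieves ten candidates, reranks them with the cross-encoder, and keeps the best five:

# Retrieve 10 candidates, rerank with the cross-encoder, keep the best 5
enhanced_agent = EnhancedRAGAgent(vector_store=VectorStore())
result = enhanced_agent.answer_question(
    "Which configuration options affect retrieval quality?",  # illustrative question
    top_k=5,
    rerank_top_n=10,
)
print(result["answer"])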
7. Running the Complete System
Let’s put everything together in a simple script to demonstrate usage:
# main.py
# Assumes the classes above have been saved as document_processor.py, vector_store.py, and rag_agent.py
import os

from document_processor import DocumentProcessor
from vector_store import VectorStore
from rag_agent import EnhancedRAGAgent


def main():
    print("Initializing DeepSeek-R1 RAG System...")

    # Initialize components
    doc_processor = DocumentProcessor()
    vector_store = VectorStore()
    rag_agent = EnhancedRAGAgent(vector_store)

    # Process documents in a directory
    doc_dir = "./documents"
    if not os.path.exists(doc_dir):
        os.makedirs(doc_dir)
        print(f"Created document directory at {doc_dir}")
        print("Please add documents to this directory and restart the script.")
        return

    # Process all documents
    for filename in os.listdir(doc_dir):
        file_path = os.path.join(doc_dir, filename)
        if os.path.isfile(file_path):
            print(f"Processing {filename}...")
            chunks = doc_processor.load_and_split(file_path)
            vector_store.add_documents(chunks)
            print(f"Added {len(chunks)} chunks to the vector store")

    # Interactive query loop
    print("\nDeepSeek-R1 RAG System ready! Type 'quit' to exit.")
    while True:
        query = input("\nEnter your question: ")
        if query.lower() == 'quit':
            break

        result = rag_agent.answer_question(query)
        print("\nAnswer:", result["answer"])
        print("\nSources:")
        for source in result["sources"]:
            print(f"- {source['source']}, Page: {source['page']}")


if __name__ == "__main__":
    main()
Performance Considerations
When running DeepSeek-R1 on Ollama, consider these performance optimizations:
- Hardware Requirements: The distilled 7B DeepSeek-R1 model works best with at least 16GB of RAM and a GPU with 8GB+ VRAM.
- Quantization: Use a quantized build for better performance on consumer hardware. The default Ollama tag is already 4-bit quantized (Q4_K_M), and other quantizations are listed on the model's Ollama page:
ollama pull deepseek-r1:7b
- Chunk Size Tuning: Experiment with different chunk sizes (500-1500 tokens) based on your document type; a configuration sketch follows this list.
- Embedding Model: The default all-MiniLM-L6-v2 model offers a good balance of performance and accuracy, but for specialized domains, consider using a domain-specific embedding model (also shown in the sketch below).
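As a small configuration sketch tying these knobs together (the values and collection name here are illustrative starting points, not tuned recommendations):

# Illustrative configuration -- tune chunk size/overlap and the embedding model for your corpus
doc_processor = DocumentProcessor(chunk_size=800, chunk_overlap=150)
vector_store = VectorStore(
    collection_name="domain_docs",           # hypothetical collection name
    embedding_model="all-MiniLM-L6-v2",      # swap in a domain-specific model here if needed
)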
Advanced Features
To enhance the system further, consider implementing these advanced features:
- Conversational Memory: Add session-based history to maintain context across multiple queries.
- Document Metadata Filtering: Add filtering capabilities to search within specific documents or document types.
- Query Reformulation: Use DeepSeek-R1 to reformulate complex queries into more effective search queries.
- Hybrid Search: Combine dense vector retrieval with sparse retrieval (BM25) for better results; a rough sketch follows this list.
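As a rough sketch of the hybrid-search idea only: this assumes the extra rank_bm25 package (pip install rank-bm25), uses naive whitespace tokenization, and blends scores with a simple weighted sum rather than a principled fusion method such as reciprocal rank fusion:

from rank_bm25 import BM25Okapi

def hybrid_search(vector_store, corpus_chunks, query, top_k=5, alpha=0.5):
    """Blend dense (cosine) and sparse (BM25) scores; corpus_chunks is the chunk list from DocumentProcessor."""
    # Sparse side: BM25 over naively tokenized chunk text
    tokenized = [chunk["text"].lower().split() for chunk in corpus_chunks]
    bm25 = BM25Okapi(tokenized)
    sparse_scores = bm25.get_scores(query.lower().split())
    max_bm25 = float(max(sparse_scores)) or 1.0

    # Dense side: convert ChromaDB cosine distances into similarities keyed by chunk id
    dense_hits = {
        hit["id"]: 1.0 - hit["score"]
        for hit in vector_store.search(query, top_k=len(corpus_chunks))
    }

    # Weighted blend of the two signals, highest score first
    blended = []
    for chunk, sparse in zip(corpus_chunks, sparse_scores):
        score = alpha * dense_hits.get(chunk["id"], 0.0) + (1 - alpha) * (sparse / max_bm25)
        blended.append((score, chunk))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in blended[:top_k]]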
Conclusion
This technical deep dive has demonstrated how to build a complete RAG agent using DeepSeek-R1 and Ollama. The implementation provides a powerful, locally deployed solution for knowledge retrieval and generation with a high degree of accuracy and control.
The system is modular and can be easily extended or modified to suit specific use cases, from customer support to research assistance. By pairing DeepSeek-R1's strong reasoning capabilities with effective retrieval techniques, we can build knowledge systems that deliver accurate, contextual responses while substantially reducing hallucinations.