In this technical deep dive, I’ll walk through creating a complete Retrieval-Augmented Generation (RAG) agent using DeepSeek-R1 and Ollama. This approach combines the powerful reasoning capabilities of DeepSeek-R1 with the local deployment flexibility of Ollama to create an efficient, customizable knowledge retrieval system.
Introduction to DeepSeek-R1 and Ollama
DeepSeek-R1 is a large language model (LLM) from DeepSeek AI that is optimized for complex reasoning. It demonstrates exceptional performance on logic, mathematics, and coding benchmarks while maintaining strong general capabilities.
Ollama is an open-source framework for running LLMs locally. It packages model weights, parameters, and prompts into self-contained bundles defined by a Modelfile, making it easy to run models on personal hardware without cloud dependencies.
Architecture Overview
Our RAG agent will follow this high-level architecture:
- Document Processing Pipeline: Convert documents to vector embeddings
- Vector Database: Store and retrieve embeddings efficiently
- Query Processing: Transform user queries into effective search vectors
- Context Retrieval: Find relevant document chunks from the vector database
- Generation: Use DeepSeek-R1 to produce accurate responses based on retrieved context
Let’s implement each component with working code.
Setting Up the Environment
First, let’s install the necessary dependencies:
# Install required packages
pip install langchain langchain_community pymupdf sentence-transformers \
    chromadb pydantic fastapi uvicorn requests
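The Python packages alone are not enough: Ollama itself must be installed and running (see ollama.com for install instructions). A minimal sanity check, assuming Ollama is listening on its default port 11434, could look like this:

# check_ollama.py -- confirm the local Ollama server is reachable before indexing anything
import requests

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default endpoint listing locally pulled models

try:
    response = requests.get(OLLAMA_TAGS_URL, timeout=5)
    response.raise_for_status()
    models = [m["name"] for m in response.json().get("models", [])]
    print("Ollama is running. Local models:", models or "none pulled yet")
except requests.RequestException as exc:
    print(f"Could not reach Ollama at {OLLAMA_TAGS_URL}: {exc}")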
1. Document Processing Pipeline
We’ll start by creating a document loader and chunking system:
import os
from typing import List, Dict, Any

from langchain_community.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema import Document


class DocumentProcessor:
    def __init__(self, chunk_size=1000, chunk_overlap=200):
        self.text_splitter = RecursiveCharacterTextSplitter(
            chunk_size=chunk_size,
            chunk_overlap=chunk_overlap
        )

    def load_and_split(self, file_path: str) -> List[Dict[str, Any]]:
        """Load a document and split it into chunks."""
        if file_path.endswith('.pdf'):
            loader = PyMuPDFLoader(file_path)
            documents = loader.load()
        else:
            # Handle plain-text files by wrapping the content in a LangChain Document
            with open(file_path, 'r', encoding='utf-8') as f:
                text = f.read()
            documents = [Document(page_content=text, metadata={"source": file_path})]

        # Split documents into chunks
        chunks = self.text_splitter.split_documents(documents)

        # Format chunks for storage
        processed_chunks = []
        for i, chunk in enumerate(chunks):
            processed_chunks.append({
                "id": f"{os.path.basename(file_path)}_chunk_{i}",
                "text": chunk.page_content,
                "metadata": {
                    "source": chunk.metadata.get("source", file_path),
                    "page": chunk.metadata.get("page", 0)
                }
            })
        return processed_chunks
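As a quick sanity check of the processor in isolation, here is a small usage sketch; the file name sample.pdf is just a placeholder for a document you actually have:

# Hypothetical example -- replace sample.pdf with a real document path
processor = DocumentProcessor(chunk_size=1000, chunk_overlap=200)
chunks = processor.load_and_split("sample.pdf")
print(f"Produced {len(chunks)} chunks")
print(chunks[0]["id"])        # e.g. "sample.pdf_chunk_0"
print(chunks[0]["metadata"])  # {"source": "...", "page": ...}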
2. Vector Database with ChromaDB
Next, we’ll set up our vector database using ChromaDB:
import chromadb
from sentence_transformers import SentenceTransformer


class VectorStore:
    def __init__(self, collection_name="rag_documents", embedding_model="all-MiniLM-L6-v2"):
        # Initialize a persistent ChromaDB client (stores data under ./chroma_db)
        self.client = chromadb.PersistentClient(path="./chroma_db")

        # Create or get the collection, using cosine distance for similarity
        self.collection = self.client.get_or_create_collection(
            name=collection_name,
            metadata={"hnsw:space": "cosine"}
        )

        # Initialize the sentence-transformers embedding model
        self.embedder = SentenceTransformer(embedding_model)

    def add_documents(self, documents: List[Dict[str, Any]]) -> None:
        """Add documents to the vector store."""
        ids = [doc["id"] for doc in documents]
        texts = [doc["text"] for doc in documents]
        metadatas = [doc["metadata"] for doc in documents]

        # Generate embeddings
        embeddings = self.embedder.encode(texts)

        # Add to collection
        self.collection.add(
            ids=ids,
            embeddings=embeddings.tolist(),
            documents=texts,
            metadatas=metadatas
        )

    def search(self, query: str, top_k=5) -> List[Dict[str, Any]]:
        """Search for relevant documents."""
        query_embedding = self.embedder.encode(query).tolist()
        results = self.collection.query(
            query_embeddings=[query_embedding],
            n_results=top_k
        )

        # Format results (ChromaDB returns distances, so lower scores mean closer matches)
        documents = []
        for i in range(len(results["ids"][0])):
            documents.append({
                "id": results["ids"][0][i],
                "text": results["documents"][0][i],
                "metadata": results["metadatas"][0][i],
                "score": results["distances"][0][i] if "distances" in results else None
            })
        return documents
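Here is a short usage sketch that indexes the chunks from the previous step and runs a test query; because the collection is configured for cosine distance, lower scores mean closer matches:

# Index the chunks produced above and run a quick test query
store = VectorStore()
store.add_documents(chunks)
hits = store.search("What is the main topic of the document?", top_k=3)
for hit in hits:
    # Cosine distance: smaller score = more similar
    print(round(hit["score"], 3), hit["metadata"]["source"], hit["text"][:80])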
3. Setting Up Ollama with DeepSeek-R1
Now let’s set up Ollama to serve the DeepSeek-R1 model:
# Create a Modelfile for DeepSeek-R1 in Ollama
modelfile_content = """
FROM deepseek-r1:7b

# Sampling parameters
PARAMETER temperature 0.1
PARAMETER top_p 0.9
PARAMETER top_k 40

# System prompt to enforce RAG behavior
SYSTEM You are an AI assistant powered by DeepSeek-R1. You will be provided with retrieved context from a knowledge base. Always use this context to answer questions. If the context doesn't contain the answer, say "I don't have enough information to answer that question." Always cite the source of your information.
"""

# Write the Modelfile
with open("Modelfile", "w") as f:
    f.write(modelfile_content)

# Then create the model from a shell:
#   ollama create deepseek-r1-rag -f Modelfile
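After building the model, it is worth a quick smoke test against Ollama's /api/generate endpoint before wiring it into the agent. This sketch assumes the deepseek-r1-rag model created above:

import requests

# Ask the freshly created model a trivial question to confirm it loads and responds
payload = {
    "model": "deepseek-r1-rag",
    "prompt": "In one sentence, what are you?",
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=300)
print(resp.json().get("response", ""))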
4. Creating the RAG Agent
Now, let’s build the RAG agent that integrates all components:
import requests
from pydantic import BaseModel


class RAGAgent:
    def __init__(
        self,
        vector_store: VectorStore,
        ollama_url="http://localhost:11434/api/generate",
        model_name="deepseek-r1-rag",
        max_tokens=1024
    ):
        self.vector_store = vector_store
        self.ollama_url = ollama_url
        self.model_name = model_name
        self.max_tokens = max_tokens

    def _generate_prompt(self, query: str, contexts: List[Dict[str, Any]]) -> str:
        """Create a prompt for the LLM with retrieved contexts."""
        formatted_contexts = ""
        for i, ctx in enumerate(contexts):
            formatted_contexts += f"\nDocument {i+1} (Source: {ctx['metadata']['source']}, Page: {ctx['metadata'].get('page', 'N/A')}):\n{ctx['text']}\n"

        prompt = f"""I need you to answer the following question based on the retrieved information:

Question: {query}

Retrieved Context:
{formatted_contexts}

Answer the question using only the information provided in the retrieved context. If the context doesn't contain the answer, say "I don't have enough information to answer that question."
"""
        return prompt

    def _call_ollama(self, prompt: str) -> str:
        """Call the Ollama API to generate a response."""
        payload = {
            "model": self.model_name,
            "prompt": prompt,
            "stream": False,
            "options": {
                "num_predict": self.max_tokens,
                "temperature": 0.1,
            }
        }
        response = requests.post(self.ollama_url, json=payload)
        result = response.json()
        return result.get("response", "")

    def answer_question(self, query: str, top_k=5) -> Dict[str, Any]:
        """Process a question and return an answer with citations."""
        # Retrieve relevant documents
        retrieved_contexts = self.vector_store.search(query, top_k=top_k)

        # Generate prompt with contexts
        prompt = self._generate_prompt(query, retrieved_contexts)

        # Get answer from LLM
        answer = self._call_ollama(prompt)

        # Return structured response
        return {
            "query": query,
            "answer": answer,
            "sources": [
                {
                    "source": ctx["metadata"]["source"],
                    "page": ctx["metadata"].get("page", "N/A")
                }
                for ctx in retrieved_contexts
            ]
        }
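A minimal usage sketch for the agent might look like the following; the question is an illustrative placeholder, and the VectorStore instance picks up whatever has already been indexed in the persistent ./chroma_db directory:

# Illustrative question -- the persistent ChromaDB collection supplies the indexed chunks
agent = RAGAgent(vector_store=VectorStore())
result = agent.answer_question("What warranty period does the manual specify?", top_k=3)
print(result["answer"])
for src in result["sources"]:
    print(f"- {src['source']} (page {src['page']})")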
5. Building a Simple API
Let’s wrap our RAG agent with a FastAPI service:
from fastapi import FastAPI, UploadFile
from fastapi.responses import JSONResponse
import uvicorn
import shutil
import uuid

app = FastAPI(title="DeepSeek-R1 RAG API")

# Initialize components
doc_processor = DocumentProcessor()
vector_store = VectorStore()
rag_agent = RAGAgent(vector_store)


class QueryRequest(BaseModel):
    query: str
    top_k: int = 5


@app.post("/upload")
async def upload_document(file: UploadFile):
    """Upload and process a document."""
    # Save the file temporarily (keep the original name so the extension check still works)
    file_path = f"./temp_{uuid.uuid4()}_{file.filename}"
    with open(file_path, "wb") as buffer:
        shutil.copyfileobj(file.file, buffer)

    # Process the document
    try:
        chunks = doc_processor.load_and_split(file_path)
        vector_store.add_documents(chunks)
        return {"message": f"Document processed successfully with {len(chunks)} chunks"}
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Failed to process document: {str(e)}"}
        )
    finally:
        # Clean up the temporary file
        os.remove(file_path)


@app.post("/query")
async def query(request: QueryRequest):
    """Answer a query using the RAG system."""
    try:
        result = rag_agent.answer_question(request.query, top_k=request.top_k)
        return result
    except Exception as e:
        return JSONResponse(
            status_code=500,
            content={"error": f"Failed to process query: {str(e)}"}
        )


if __name__ == "__main__":
    uvicorn.run(app, host="0.0.0.0", port=8000)
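Once the server is running, the endpoints can be exercised from any HTTP client. A small sketch using requests (the file name and question are placeholders) could look like this:

# client_example.py -- placeholder file name and question
import requests

BASE_URL = "http://localhost:8000"

# Upload and index a document
with open("sample.pdf", "rb") as f:
    upload = requests.post(
        f"{BASE_URL}/upload",
        files={"file": ("sample.pdf", f, "application/pdf")},
    )
print(upload.json())

# Query the indexed content
answer = requests.post(
    f"{BASE_URL}/query",
    json={"query": "Summarize the key points of the document.", "top_k": 3},
)
print(answer.json())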
6. Improving with Reranking
To improve retrieval accuracy, let’s add a reranking step using cross-encoders:
from sentence_transformers import CrossEncoder


class EnhancedRAGAgent(RAGAgent):
    def __init__(
        self,
        vector_store: VectorStore,
        reranker_model="cross-encoder/ms-marco-MiniLM-L-6-v2",
        **kwargs
    ):
        super().__init__(vector_store, **kwargs)
        self.reranker = CrossEncoder(reranker_model)

    def answer_question(self, query: str, top_k=5, rerank_top_n=10) -> Dict[str, Any]:
        """Process a question with reranking for improved accuracy."""
        # Retrieve more documents than needed so the reranker has candidates to choose from
        retrieved_contexts = self.vector_store.search(query, top_k=rerank_top_n)

        if retrieved_contexts:
            # Prepare (query, passage) pairs for the cross-encoder
            pairs = [(query, ctx["text"]) for ctx in retrieved_contexts]

            # Score with the cross-encoder
            scores = self.reranker.predict(pairs)

            # Sort contexts by reranker score, highest first
            scored_contexts = list(zip(retrieved_contexts, scores))
            scored_contexts.sort(key=lambda x: x[1], reverse=True)

            # Keep only the top_k contexts after reranking
            retrieved_contexts = [ctx for ctx, score in scored_contexts[:top_k]]

        # Generate prompt with contexts
        prompt = self._generate_prompt(query, retrieved_contexts)

        # Get answer from LLM
        answer = self._call_ollama(prompt)

        # Return structured response
        return {
            "query": query,
            "answer": answer,
            "sources": [
                {
                    "source": ctx["metadata"]["source"],
                    "page": ctx["metadata"].get("page", "N/A")
                }
                for ctx in retrieved_contexts
            ]
        }
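Usage mirrors the base agent; this sketch (with an illustrative question) retrieves ten candidates, reranks them with the cross-encoder, and keeps the best five:

# Retrieve 10 candidates, rerank with the cross-encoder, keep the best 5
enhanced_agent = EnhancedRAGAgent(vector_store=VectorStore())
result = enhanced_agent.answer_question(
    "Which configuration options affect retrieval quality?",  # illustrative question
    top_k=5,
    rerank_top_n=10,
)
print(result["answer"])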
7. Running the Complete System
Let’s put everything together in a simple script to demonstrate usage:
# main.py
# Assumes the classes above have been saved as document_processor.py, vector_store.py, and rag_agent.py
import os

from document_processor import DocumentProcessor
from vector_store import VectorStore
from rag_agent import EnhancedRAGAgent


def main():
    print("Initializing DeepSeek-R1 RAG System...")

    # Initialize components
    doc_processor = DocumentProcessor()
    vector_store = VectorStore()
    rag_agent = EnhancedRAGAgent(vector_store)

    # Process documents in a directory
    doc_dir = "./documents"
    if not os.path.exists(doc_dir):
        os.makedirs(doc_dir)
        print(f"Created document directory at {doc_dir}")
        print("Please add documents to this directory and restart the script.")
        return

    # Process all documents
    for filename in os.listdir(doc_dir):
        file_path = os.path.join(doc_dir, filename)
        if os.path.isfile(file_path):
            print(f"Processing {filename}...")
            chunks = doc_processor.load_and_split(file_path)
            vector_store.add_documents(chunks)
            print(f"Added {len(chunks)} chunks to the vector store")

    # Interactive query loop
    print("\nDeepSeek-R1 RAG System ready! Type 'quit' to exit.")
    while True:
        query = input("\nEnter your question: ")
        if query.lower() == 'quit':
            break

        result = rag_agent.answer_question(query)
        print("\nAnswer:", result["answer"])
        print("\nSources:")
        for source in result["sources"]:
            print(f"- {source['source']}, Page: {source['page']}")


if __name__ == "__main__":
    main()
Performance Considerations
When running DeepSeek-R1 on Ollama, consider these performance optimizations:
- Hardware Requirements: The distilled 7B DeepSeek-R1 model works best with at least 16GB of RAM and a GPU with 8GB+ VRAM.
- Quantization: Use a quantized build for better performance on consumer hardware. The default Ollama tag is already 4-bit quantized (Q4_K_M), and other quantizations are listed on the model's Ollama page:
ollama pull deepseek-r1:7b
- Chunk Size Tuning: Experiment with different chunk sizes (500-1500 tokens) based on your document type; a configuration sketch follows this list.
- Embedding Model: The default all-MiniLM-L6-v2 model offers a good balance of performance and accuracy, but for specialized domains, consider using a domain-specific embedding model (also shown in the sketch below).
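As a small configuration sketch tying these knobs together (the values and collection name here are illustrative starting points, not tuned recommendations):

# Illustrative configuration -- tune chunk size/overlap and the embedding model for your corpus
doc_processor = DocumentProcessor(chunk_size=800, chunk_overlap=150)
vector_store = VectorStore(
    collection_name="domain_docs",           # hypothetical collection name
    embedding_model="all-MiniLM-L6-v2",      # swap in a domain-specific model here if needed
)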
Advanced Features
To enhance the system further, consider implementing these advanced features:
- Conversational Memory: Add session-based history to maintain context across multiple queries.
- Document Metadata Filtering: Add filtering capabilities to search within specific documents or document types.
- Query Reformulation: Use DeepSeek-R1 to reformulate complex queries into more effective search queries.
- Hybrid Search: Combine dense vector retrieval with sparse retrieval (BM25) for better results; a rough sketch follows this list.
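As a rough sketch of the hybrid-search idea only: this assumes the extra rank_bm25 package (pip install rank-bm25), uses naive whitespace tokenization, and blends scores with a simple weighted sum rather than a principled fusion method such as reciprocal rank fusion:

from rank_bm25 import BM25Okapi

def hybrid_search(vector_store, corpus_chunks, query, top_k=5, alpha=0.5):
    """Blend dense (cosine) and sparse (BM25) scores; corpus_chunks is the chunk list from DocumentProcessor."""
    # Sparse side: BM25 over naively tokenized chunk text
    tokenized = [chunk["text"].lower().split() for chunk in corpus_chunks]
    bm25 = BM25Okapi(tokenized)
    sparse_scores = bm25.get_scores(query.lower().split())
    max_bm25 = float(max(sparse_scores)) or 1.0

    # Dense side: convert ChromaDB cosine distances into similarities keyed by chunk id
    dense_hits = {
        hit["id"]: 1.0 - hit["score"]
        for hit in vector_store.search(query, top_k=len(corpus_chunks))
    }

    # Weighted blend of the two signals, highest score first
    blended = []
    for chunk, sparse in zip(corpus_chunks, sparse_scores):
        score = alpha * dense_hits.get(chunk["id"], 0.0) + (1 - alpha) * (sparse / max_bm25)
        blended.append((score, chunk))
    blended.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in blended[:top_k]]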
Conclusion
This technical deep dive has demonstrated how to build a complete RAG agent using DeepSeek-R1 and Ollama. The implementation provides a powerful, locally deployed solution for knowledge retrieval and generation with a high degree of accuracy and control.
The system is modular and can be easily extended or modified to suit specific use cases, from customer support to research assistance. By pairing DeepSeek-R1's strong reasoning capabilities with effective retrieval techniques, we can build knowledge systems that deliver accurate, contextual responses while substantially reducing hallucinations.