Introduction
DeepSeek is a family of advanced open-source large language models (LLMs), including code-focused variants, that has gained significant popularity in the developer community. When paired with Ollama, an easy-to-use framework for running and managing LLMs locally, and deployed on Azure Kubernetes Service (AKS), it gives you a powerful, scalable, and cost-effective environment for AI applications.
This blog post walks through the process of deploying DeepSeek models on AKS using Ollama, providing you with a production-ready setup for your AI workloads.
Why DeepSeek on AKS with Ollama?
DeepSeek models offer excellent performance for code completion and generation tasks. By deploying these models on AKS with Ollama, you can:
- Achieve enterprise-grade scalability with Kubernetes orchestration
- Maintain data privacy by keeping your AI processing within your own infrastructure
- Reduce costs compared to commercial API-based solutions
- Customize the deployment to meet your specific requirements
Prerequisites
Before getting started, ensure you have:
- An Azure subscription
- Azure CLI installed and configured
- kubectl installed
- Basic knowledge of Kubernetes concepts
- Docker 🐋 installed (for building custom images if needed)
- A PC 🖥️ or compute instance
- A reliable internet connection 🛜
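Before you start, it can be worth quickly confirming the tooling is in place and that you are logged in to the correct subscription:
az --version
az account show
kubectl version --client
docker --version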
Setting Up Your AKS Cluster
First, let’s create an AKS cluster optimized for running LLMs:
# Create a resource group
az group create --name deepseek-rg --location eastus
# Create AKS cluster with GPU nodes
az aks create \
--resource-group deepseek-rg \
--name deepseek-cluster \
--node-count 1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3 \
--node-vm-size Standard_NC6s_v3 \
--generate-ssh-keys
# Connect to the cluster
az aks get-credentials --resource-group deepseek-rg --name deepseek-cluster
Note: The Standard_NC6s_v3 VM size includes an NVIDIA Tesla V100 GPU. You may choose a different VM size based on your performance requirements and budget.
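If you are unsure which GPU VM sizes are available to you, you can list the NC-series sizes in your region and check your GPU quota first (the region and family names below are just examples):
az vm list-sizes --location eastus -o table | grep Standard_NC
az vm list-usage --location eastus -o table | grep -i "NCSv3"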
Installing NVIDIA Device Plugin
To enable GPU support in your AKS cluster, you need to install the NVIDIA device plugin:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Verify that the plugin is running correctly. The upstream manifest deploys the DaemonSet into the kube-system namespace:
kubectl get pods -n kube-system | grep nvidia-device-plugin
You should see a NVIDIA device plugin pod in a Running state on each GPU node.
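You can also confirm that the GPU is advertised as an allocatable resource on the node:
kubectl describe nodes | grep -i "nvidia.com/gpu"
An entry such as nvidia.com/gpu: 1 under Capacity and Allocatable means the scheduler can place GPU workloads on that node.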
Deploying Ollama on AKS
Now let’s deploy Ollama on our AKS cluster. We’ll create a Kubernetes Deployment, Service, and PersistentVolumeClaim for Ollama.
Create a file named ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: LoadBalancer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: managed-premium
Apply this configuration to your cluster:
kubectl apply -f ollama-deployment.yaml
Wait for the external IP to be assigned:
kubectl get service ollama-service --watch
Once the external IP is available, make note of it as we’ll use it to interact with Ollama.
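As a quick smoke test, you can hit the base endpoint (replace <EXTERNAL-IP> with the address you just noted); Ollama responds with a short "Ollama is running" message:
curl http://<EXTERNAL-IP>:11434/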
Pulling and Running DeepSeek Models
Now that Ollama is deployed, let’s pull and run a DeepSeek model. You can do this using the Ollama CLI or API.
First, let’s create a pod that can communicate with our Ollama service:
kubectl run ollama-client --image=ubuntu --rm -it -- /bin/bash
Inside the pod, install curl and use it to interact with Ollama:
apt-get update && apt-get install -y curl
Pull the DeepSeek-Coder model:
curl -X POST http://ollama-service:11434/api/pull -d '{"name": "deepseek-coder:33b-instruct-q5_K_M"}'
This will start downloading the model. Depending on your internet connection and the size of the model, this might take some time.
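You can confirm when the download has completed by listing the models available to Ollama; the pulled model should appear in the models array once the pull finishes:
curl http://ollama-service:11434/api/tags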
Once the model is downloaded, you can run inferences using:
curl -X POST http://ollama-service:11434/api/generate -d '{
"model": "deepseek-coder:33b-instruct-q5_K_M",
"prompt": "Write a Python function to calculate the Fibonacci sequence",
"stream": false
}'
Creating a Web UI for DeepSeek
To provide a user-friendly interface for your DeepSeek deployment, you can set up a web UI. One option is to use Ollama WebUI.
Create a file named webui-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-webui
  template:
    metadata:
      labels:
        app: ollama-webui
    spec:
      containers:
      - name: ollama-webui
        image: ghcr.io/ollama-webui/ollama-webui:main
        ports:
        - containerPort: 3000
        env:
        - name: OLLAMA_API_BASE_URL
          value: "http://ollama-service:11434/api"
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-webui-service
spec:
  selector:
    app: ollama-webui
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
Apply this configuration:
kubectl apply -f webui-deployment.yaml
Get the external IP for the WebUI:
kubectl get service ollama-webui-service
Now you can access the WebUI using the external IP in your browser.
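If you prefer not to expose the UI through a public LoadBalancer while you are still experimenting, a kubectl port-forward works just as well:
kubectl port-forward service/ollama-webui-service 8080:80
Then browse to http://localhost:8080.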
Performance Optimization
To optimize the performance of DeepSeek on AKS, consider the following:
1. GPU Selection
Different DeepSeek models have different GPU memory requirements:
- DeepSeek-Coder 6.7B: Minimum 8GB VRAM
- DeepSeek-Coder 33B: Minimum 24GB VRAM
Choose your VM size accordingly.
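For example, to run the 33B model comfortably, you could add a second GPU node pool with a larger VM size. The pool name and VM size below are only examples; availability and quota vary by region and subscription:
az aks nodepool add \
  --resource-group deepseek-rg \
  --cluster-name deepseek-cluster \
  --name gpularge \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4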
2. Horizontal Pod Autoscaling
Implement Horizontal Pod Autoscaling to handle varying loads:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
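Save this as ollama-hpa.yaml (the filename is just a suggestion) and apply it. CPU-based scaling relies on the metrics server, which AKS deploys by default:
kubectl apply -f ollama-hpa.yaml
kubectl get hpa ollama-hpa
Keep in mind that each Ollama replica requests a full GPU, so scaling out also requires the cluster autoscaler to add GPU nodes.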
3. Model Quantization
Ollama supports various quantization levels for DeepSeek models. Lower quantization levels (e.g., q4_0) require less GPU memory but may sacrifice some quality, while higher levels (e.g., q5_K_M) provide better quality but require more resources.
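For instance, to try a smaller, more aggressively quantized variant, you can pull a different tag. The tag below is an example; check the Ollama model library for the exact tags currently published:
curl -X POST http://ollama-service:11434/api/pull -d '{"name": "deepseek-coder:6.7b-instruct-q4_0"}'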
Production Considerations
For a production deployment, consider implementing:
- Authentication and Authorization: Use Azure AD integration with AKS to secure access
- Network Security: Implement network policies to restrict pod-to-pod communication
- Monitoring: Set up Azure Monitor for container insights
- Logging: Configure centralized logging with Azure Log Analytics
- Backup: Regularly backup your persistent volumes containing model data
Here’s a sample network policy to restrict access to your Ollama service:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
spec:
  podSelector:
    matchLabels:
      app: ollama
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ollama-webui
    ports:
    - protocol: TCP
      port: 11434
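Apply it like any other manifest (the filename is just a suggestion). Note that network policies are only enforced if the cluster was created with a network policy engine (for example, --network-policy azure or calico); on a cluster without one, the policy is accepted but has no effect:
kubectl apply -f ollama-network-policy.yaml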
Example Use Case: Code Completion API
Let’s create a simple API service that uses DeepSeek for code completion:
from fastapi import FastAPI, HTTPException
import httpx
import os

app = FastAPI()

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama-service:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "deepseek-coder:33b-instruct-q5_K_M")

@app.post("/complete-code")
async def complete_code(code_snippet: str, max_tokens: int = 500):
    try:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={
                    "model": MODEL_NAME,
                    "prompt": f"Complete the following code:\n\n{code_snippet}",
                    # Ollama expects generation parameters under "options";
                    # num_predict caps the number of generated tokens
                    "options": {"num_predict": max_tokens},
                    "stream": False
                },
                timeout=60.0
            )

            if response.status_code != 200:
                raise HTTPException(status_code=500, detail="Failed to generate code")

            result = response.json()
            return {"completion": result["response"]}
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
Save this as main.py and create a Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
With requirements.txt:
fastapi
uvicorn
httpx
Build and deploy this API to your AKS cluster to create a complete code completion service.
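A minimal sketch of one way to do that, assuming an Azure Container Registry named deepseekacr (the registry and deployment names here are hypothetical):
# Create a container registry and let the cluster pull from it
az acr create --resource-group deepseek-rg --name deepseekacr --sku Basic
az aks update --resource-group deepseek-rg --name deepseek-cluster --attach-acr deepseekacr
# Build the image in ACR, deploy it, and expose it behind a LoadBalancer
az acr build --registry deepseekacr --image code-completion-api:v1 .
kubectl create deployment code-completion-api --image=deepseekacr.azurecr.io/code-completion-api:v1
kubectl set env deployment/code-completion-api OLLAMA_URL=http://ollama-service:11434
kubectl expose deployment code-completion-api --port=80 --target-port=8000 --type=LoadBalancer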
Conclusion
Deploying DeepSeek on AKS with Ollama provides a powerful, scalable, and cost-effective solution for running AI workloads. This approach gives you full control over your infrastructure while benefiting from the flexibility and scalability of Kubernetes.
By following this guide, you can set up a production-ready environment for DeepSeek models that can be integrated into your development workflows, providing high-quality code completion and generation capabilities.
For more information on optimizing your AKS cluster for LLMs, check out the Azure High Performance Computing documentation.
Additional Resources
- DeepSeek GitHub Repository
- Ollama Documentation
- Azure Kubernetes Service Documentation
- NVIDIA GPU Operator for Kubernetes
Have you deployed DeepSeek or other LLMs on AKS? The Collabnix team would love to hear how you went about it, so share your experience in the comments below!