Introduction
DeepSeek is a family of advanced open-source large language models (LLMs), including code-focused variants, that has gained significant popularity in the developer community. When paired with Ollama, an easy-to-use framework for running and managing LLMs locally, and deployed on Azure Kubernetes Service (AKS), it gives you a powerful, scalable, and cost-effective environment for AI applications.
This blog post walks through the process of deploying DeepSeek models on AKS using Ollama, providing you with a production-ready setup for your AI workloads.
Why DeepSeek on AKS with Ollama?
DeepSeek models offer excellent performance for code completion and generation tasks. By deploying these models on AKS with Ollama, you can:
- Achieve enterprise-grade scalability with Kubernetes orchestration
- Maintain data privacy by keeping your AI processing within your own infrastructure
- Reduce costs compared to commercial API-based solutions
- Customize the deployment to meet your specific requirements
Prerequisites
Before getting started, ensure you have:
- An Azure subscription
- Azure CLI installed and configured
- kubectl installed
- Basic knowledge of Kubernetes concepts
- Docker 🐋 installed (for building custom images if needed)
- A PC 🖥️ or compute instance
- A reliable internet connection 🛜
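Before you start, it can be worth quickly confirming the tooling is in place and that you are logged in to the correct subscription:
az --version
az account show
kubectl version --client
docker --version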
Setting Up Your AKS Cluster
First, let’s create an AKS cluster optimized for running LLMs:
# Create a resource group
az group create --name deepseek-rg --location eastus
# Create AKS cluster with GPU nodes
az aks create \
--resource-group deepseek-rg \
--name deepseek-cluster \
--node-count 1 \
--enable-cluster-autoscaler \
--min-count 1 \
--max-count 3 \
--node-vm-size Standard_NC6s_v3 \
--generate-ssh-keys
# Connect to the cluster
az aks get-credentials --resource-group deepseek-rg --name deepseek-cluster
Note: The Standard_NC6s_v3 VM size includes an NVIDIA Tesla V100 GPU. You may choose a different VM size based on your performance requirements and budget.
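If you are unsure which GPU VM sizes are available to you, you can list the NC-series sizes in your region and check your GPU quota first (the region and family names below are just examples):
az vm list-sizes --location eastus -o table | grep Standard_NC
az vm list-usage --location eastus -o table | grep -i "NCSv3"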
Installing NVIDIA Device Plugin
To enable GPU support in your AKS cluster, you need to install the NVIDIA device plugin:
kubectl apply -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml
Verify that the plugin is running correctly. The upstream manifest deploys the DaemonSet into the kube-system namespace:
kubectl get pods -n kube-system | grep nvidia-device-plugin
You should see a NVIDIA device plugin pod in a Running state on each GPU node.
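You can also confirm that the GPU is advertised as an allocatable resource on the node:
kubectl describe nodes | grep -i "nvidia.com/gpu"
An entry such as nvidia.com/gpu: 1 under Capacity and Allocatable means the scheduler can place GPU workloads on that node.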
Deploying Ollama on AKS
Now let’s deploy Ollama on our AKS cluster. We’ll create a Kubernetes Deployment, Service, and PersistentVolumeClaim for Ollama.
Create a file named ollama-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama
  labels:
    app: ollama
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama
  template:
    metadata:
      labels:
        app: ollama
    spec:
      containers:
      - name: ollama
        image: ollama/ollama:latest
        ports:
        - containerPort: 11434
        resources:
          limits:
            nvidia.com/gpu: 1
        volumeMounts:
        - name: ollama-data
          mountPath: /root/.ollama
      volumes:
      - name: ollama-data
        persistentVolumeClaim:
          claimName: ollama-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-service
spec:
  selector:
    app: ollama
  ports:
  - port: 11434
    targetPort: 11434
  type: LoadBalancer
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: ollama-pvc
spec:
  accessModes:
  - ReadWriteOnce
  resources:
    requests:
      storage: 50Gi
  storageClassName: managed-premium
Apply this configuration to your cluster:
kubectl apply -f ollama-deployment.yaml
Wait for the external IP to be assigned:
kubectl get service ollama-service --watch
Once the external IP is available, make note of it as we’ll use it to interact with Ollama.
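As a quick smoke test, you can hit the base endpoint (replace <EXTERNAL-IP> with the address you just noted); Ollama responds with a short "Ollama is running" message:
curl http://<EXTERNAL-IP>:11434/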
Pulling and Running DeepSeek Models
Now that Ollama is deployed, let’s pull and run a DeepSeek model. You can do this using the Ollama CLI or API.
First, let’s create a pod that can communicate with our Ollama service:
kubectl run ollama-client --image=ubuntu --rm -it -- /bin/bash
Inside the pod, install curl and use it to interact with Ollama:
apt-get update && apt-get install -y curl
Pull the DeepSeek-Coder model:
curl -X POST http://ollama-service:11434/api/pull -d '{"name": "deepseek-coder:33b-instruct-q5_K_M"}'
This will start downloading the model. Depending on your internet connection and the size of the model, this might take some time.
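You can confirm when the download has completed by listing the models available to Ollama; the pulled model should appear in the models array once the pull finishes:
curl http://ollama-service:11434/api/tags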
Once the model is downloaded, you can run inferences using:
curl -X POST http://ollama-service:11434/api/generate -d '{
"model": "deepseek-coder:33b-instruct-q5_K_M",
"prompt": "Write a Python function to calculate the Fibonacci sequence",
"stream": false
}'
Creating a Web UI for DeepSeek
To provide a user-friendly interface for your DeepSeek deployment, you can set up a web UI. One option is to use Ollama WebUI.
Create a file named webui-deployment.yaml:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ollama-webui
spec:
  replicas: 1
  selector:
    matchLabels:
      app: ollama-webui
  template:
    metadata:
      labels:
        app: ollama-webui
    spec:
      containers:
      - name: ollama-webui
        image: ghcr.io/ollama-webui/ollama-webui:main
        ports:
        - containerPort: 3000
        env:
        - name: OLLAMA_API_BASE_URL
          value: "http://ollama-service:11434/api"
---
apiVersion: v1
kind: Service
metadata:
  name: ollama-webui-service
spec:
  selector:
    app: ollama-webui
  ports:
  - port: 80
    targetPort: 3000
  type: LoadBalancer
Apply this configuration:
kubectl apply -f webui-deployment.yaml
Get the external IP for the WebUI:
kubectl get service ollama-webui-service
Now you can access the WebUI using the external IP in your browser.
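If you prefer not to expose the UI through a public LoadBalancer while you are still experimenting, a kubectl port-forward works just as well:
kubectl port-forward service/ollama-webui-service 8080:80
Then browse to http://localhost:8080.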
Performance Optimization
To optimize the performance of DeepSeek on AKS, consider the following:
1. GPU Selection
Different DeepSeek models have different GPU memory requirements:
- DeepSeek-Coder 6.7B: Minimum 8GB VRAM
- DeepSeek-Coder 33B: Minimum 24GB VRAM
Choose your VM size accordingly.
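For example, to run the 33B model comfortably, you could add a second GPU node pool with a larger VM size. The pool name and VM size below are only examples; availability and quota vary by region and subscription:
az aks nodepool add \
  --resource-group deepseek-rg \
  --cluster-name deepseek-cluster \
  --name gpularge \
  --node-count 1 \
  --node-vm-size Standard_NC24ads_A100_v4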
2. Horizontal Pod Autoscaling
Implement Horizontal Pod Autoscaling to handle varying loads:
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ollama-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ollama
  minReplicas: 1
  maxReplicas: 3
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 80
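Save this as ollama-hpa.yaml (the filename is just a suggestion) and apply it. CPU-based scaling relies on the metrics server, which AKS deploys by default:
kubectl apply -f ollama-hpa.yaml
kubectl get hpa ollama-hpa
Keep in mind that each Ollama replica requests a full GPU, so scaling out also requires the cluster autoscaler to add GPU nodes.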
3. Model Quantization
Ollama supports various quantization levels for DeepSeek models. Lower quantization levels (e.g., q4_0) require less GPU memory but may sacrifice some quality, while higher levels (e.g., q5_K_M) provide better quality but require more resources.
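For instance, to try a smaller, more aggressively quantized variant, you can pull a different tag. The tag below is an example; check the Ollama model library for the exact tags currently published:
curl -X POST http://ollama-service:11434/api/pull -d '{"name": "deepseek-coder:6.7b-instruct-q4_0"}'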
Production Considerations
For a production deployment, consider implementing:
- Authentication and Authorization: Use Azure AD integration with AKS to secure access
- Network Security: Implement network policies to restrict pod-to-pod communication
- Monitoring: Set up Azure Monitor for container insights
- Logging: Configure centralized logging with Azure Log Analytics
- Backup: Regularly backup your persistent volumes containing model data
Here’s a sample network policy to restrict access to your Ollama service:
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ollama-network-policy
spec:
  podSelector:
    matchLabels:
      app: ollama
  ingress:
  - from:
    - podSelector:
        matchLabels:
          app: ollama-webui
    ports:
    - protocol: TCP
      port: 11434
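Apply it like any other manifest (the filename is just a suggestion). Note that network policies are only enforced if the cluster was created with a network policy engine (for example, --network-policy azure or calico); on a cluster without one, the policy is accepted but has no effect:
kubectl apply -f ollama-network-policy.yaml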
Example Use Case: Code Completion API
Let’s create a simple API service that uses DeepSeek for code completion:
from fastapi import FastAPI, HTTPException
import httpx
import os

app = FastAPI()

OLLAMA_URL = os.getenv("OLLAMA_URL", "http://ollama-service:11434")
MODEL_NAME = os.getenv("MODEL_NAME", "deepseek-coder:33b-instruct-q5_K_M")

@app.post("/complete-code")
async def complete_code(code_snippet: str, max_tokens: int = 500):
    try:
        async with httpx.AsyncClient() as client:
            response = await client.post(
                f"{OLLAMA_URL}/api/generate",
                json={
                    "model": MODEL_NAME,
                    "prompt": f"Complete the following code:\n\n{code_snippet}",
                    # Ollama expects generation parameters under "options";
                    # num_predict caps the number of generated tokens
                    "options": {"num_predict": max_tokens},
                    "stream": False
                },
                timeout=60.0
            )

            if response.status_code != 200:
                raise HTTPException(status_code=500, detail="Failed to generate code")

            result = response.json()
            return {"completion": result["response"]}
    except HTTPException:
        raise
    except Exception as e:
        raise HTTPException(status_code=500, detail=f"Error: {str(e)}")
Save this as main.py and create a Dockerfile:
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY main.py .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8000"]
With requirements.txt:
fastapi
uvicorn
httpx
Build and deploy this API to your AKS cluster to create a complete code completion service.
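A minimal sketch of one way to do that, assuming an Azure Container Registry named deepseekacr (the registry and deployment names here are hypothetical):
# Create a container registry and let the cluster pull from it
az acr create --resource-group deepseek-rg --name deepseekacr --sku Basic
az aks update --resource-group deepseek-rg --name deepseek-cluster --attach-acr deepseekacr
# Build the image in ACR, deploy it, and expose it behind a LoadBalancer
az acr build --registry deepseekacr --image code-completion-api:v1 .
kubectl create deployment code-completion-api --image=deepseekacr.azurecr.io/code-completion-api:v1
kubectl set env deployment/code-completion-api OLLAMA_URL=http://ollama-service:11434
kubectl expose deployment code-completion-api --port=80 --target-port=8000 --type=LoadBalancer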
Conclusion
Deploying DeepSeek on AKS with Ollama provides a powerful, scalable, and cost-effective solution for running AI workloads. This approach gives you full control over your infrastructure while benefiting from the flexibility and scalability of Kubernetes.
By following this guide, you can set up a production-ready environment for DeepSeek models that can be integrated into your development workflows, providing high-quality code completion and generation capabilities.
For more information on optimizing your AKS cluster for LLMs, check out the Azure High Performance Computing documentation.
Additional Resources
- DeepSeek GitHub Repository
- Ollama Documentation
- Azure Kubernetes Service Documentation
- NVIDIA GPU Operator for Kubernetes
Have you deployed DeepSeek or other LLMs on AKS? The Collabnix team would love to hear how you went about it, so share your experience in the comments below!