Complete Guide to DeepSeek R1 Setup in 2025
DeepSeek-R1 Architecture Overview
DeepSeek-R1 is an open-source reasoning model built on a Mixture of Experts (MoE) architecture with 671 billion total parameters, of which roughly 37 billion are activated per token during inference, keeping compute cost well below that of a dense model of the same size.
Technical Specifications
Model Architecture: Transformer-based MoE with reinforcement learning optimization
Parameter Distribution: 671B total, 37B active per forward pass
Context Window: Up to 128K tokens (model-dependent)
Quantization Support: GGUF format with 4-bit to 16-bit precision
Memory Efficiency: Dynamic KV-cache with intelligent quantization
Model Variants & Resource Requirements
Only the 671B tag is the full MoE DeepSeek-R1; the smaller `deepseek-r1` tags on Ollama are dense models distilled from Qwen and Llama checkpoints. The figures below are approximate, assume the default 4-bit quantized tags, and vary with hardware; a sizing sketch follows the table.
| Model Variant | Parameters | VRAM (GPU) | RAM (CPU) | Disk Space | Inference Speed |
|---|---|---|---|---|---|
| deepseek-r1:1.5b | 1.5B | 2GB | 4GB | 1.2GB | 45 tokens/sec |
| deepseek-r1:7b | 7B | 8GB | 16GB | 4.8GB | 35 tokens/sec |
| deepseek-r1:14b | 14B | 16GB | 32GB | 9.2GB | 28 tokens/sec |
| deepseek-r1:32b | 32B | 32GB | 64GB | 20GB | 22 tokens/sec |
| deepseek-r1:70b | 70B | 80GB | 128GB | 45GB | 18 tokens/sec |
| deepseek-r1:671b | 671B | 320GB+ | 512GB+ | 400GB | 12 tokens/sec |
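The throughput figures above vary widely with hardware, quantization, and context length, so treat them as orientation rather than benchmarks. A rough way to sanity-check the VRAM and disk columns is to multiply parameter count by bytes per weight for the chosen quantization; the sketch below does exactly that (illustrative arithmetic only, ignoring KV-cache and runtime overhead):
# Rough weight-size estimate: parameters x bytes-per-weight.
# Ignores KV-cache, activations, and runtime overhead, so real memory
# use will be somewhat higher than these numbers.
BYTES_PER_WEIGHT = {
    "q4_0 (4-bit)": 0.5,
    "q8_0 (8-bit)": 1.0,
    "f16 (16-bit)": 2.0,
}

def estimate_gib(params_billions: float, bytes_per_weight: float) -> float:
    """Return the approximate weight size in GiB."""
    return params_billions * 1e9 * bytes_per_weight / (1024 ** 3)

if __name__ == "__main__":
    for params in (1.5, 7, 14, 32, 70):
        sizes = ", ".join(
            f"{name}: {estimate_gib(params, b):.1f} GiB"
            for name, b in BYTES_PER_WEIGHT.items()
        )
        print(f"{params:>5}B -> {sizes}")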
System Requirements & Hardware Optimization {#system-requirements}
Minimum Hardware Configuration
CPU Requirements:
- Architecture: x86_64 with AVX2 support
- Cores: 8+ cores recommended for optimal performance
- Base Clock: 3.0GHz+ for real-time inference
- Cache: 16MB+ L3 cache for efficient model loading
Memory Specifications:
- DDR4: 3200MHz+ with dual-channel configuration
- Capacity: See model-specific requirements above
- ECC: Recommended for production deployments
- Swap: Configure 50% of RAM as swap space
Storage Optimization:
- NVMe SSD: Required for model storage and caching
- IOPS: 50K+ for concurrent model serving
- Sequential Read: 3GB/s+ for rapid model loading
- Free Space: 2x model size for optimal performance (a quick self-check script follows this list)
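Before installing anything, it is worth checking the host against the figures above. A minimal self-check sketch for Linux, assuming `psutil` is available (`pip install psutil`); the thresholds simply mirror the recommendations in this section and can be adjusted:
# Minimal Linux hardware self-check against the minimums listed above.
# Assumes psutil is installed: pip install psutil
import os
import shutil
import psutil

def has_avx2() -> bool:
    """Check /proc/cpuinfo for the AVX2 flag (Linux only)."""
    try:
        with open("/proc/cpuinfo") as f:
            return "avx2" in f.read()
    except OSError:
        return False

cores = os.cpu_count() or 0
ram_gib = psutil.virtual_memory().total / (1024 ** 3)
free_disk_gib = shutil.disk_usage("/").free / (1024 ** 3)

print(f"AVX2 support : {'yes' if has_avx2() else 'NO'}")
print(f"CPU cores    : {cores} (8+ recommended)")
print(f"RAM          : {ram_gib:.1f} GiB (see the model table above)")
print(f"Free disk    : {free_disk_gib:.1f} GiB (allow ~2x the model size)")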
GPU Configuration Matrix
NVIDIA GPU Compatibility (CUDA 11.8+ / 12.x):
# Check GPU specifications
nvidia-smi --query-gpu=name,memory.total,driver_version --format=csv
| GPU Series | VRAM | Recommended Models | Performance Tier |
|---|---|---|---|
| RTX 4090 | 24GB | Up to 14B | Excellent |
| RTX 4080 | 16GB | Up to 14B | Very Good |
| RTX 3090 | 24GB | Up to 14B | Good |
| Tesla V100 | 32GB | Up to 32B | Enterprise |
| A100 80GB | 80GB | Up to 70B | Data Center |
| H100 | 80GB | Up to 70B | Cutting Edge |
AMD GPU Support (ROCm 5.4+):
# Verify ROCm installation
rocm-smi --showproductname
Pre-Installation Environment Setup
Operating System Optimization
Linux Configuration (Ubuntu 22.04 LTS; for RHEL/CentOS Stream 9, substitute the equivalent dnf packages):
# Update system packages
sudo apt update && sudo apt upgrade -y
# Install essential dependencies
sudo apt install -y curl wget git build-essential \
software-properties-common apt-transport-https \
ca-certificates gnupg lsb-release
# Configure system limits for high-performance AI workloads
echo "* soft nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* hard nofile 65536" | sudo tee -a /etc/security/limits.conf
echo "* soft memlock unlimited" | sudo tee -a /etc/security/limits.conf
echo "* hard memlock unlimited" | sudo tee -a /etc/security/limits.conf
# Optimize kernel parameters
echo "vm.swappiness=10" | sudo tee -a /etc/sysctl.conf
echo "vm.max_map_count=262144" | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
CUDA Toolkit Installation:
# Download and install CUDA 12.3
wget https://developer.download.nvidia.com/compute/cuda/12.3.0/local_installers/cuda_12.3.0_545.23.06_linux.run
sudo sh cuda_12.3.0_545.23.06_linux.run
# Add CUDA to PATH
echo 'export PATH=/usr/local/cuda-12.3/bin:$PATH' >> ~/.bashrc
echo 'export LD_LIBRARY_PATH=/usr/local/cuda-12.3/lib64:$LD_LIBRARY_PATH' >> ~/.bashrc
source ~/.bashrc
# Verify CUDA installation
nvcc --version
nvidia-smi
Docker Configuration (for containerized deployment):
# Install Docker with NVIDIA container runtime
sudo apt install -y docker.io
sudo systemctl enable docker
sudo usermod -aG docker $USER
# Install NVIDIA Container Toolkit
distribution=$(. /etc/os-release;echo $ID$VERSION_ID)
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add -
curl -s -L https://nvidia.github.io/nvidia-docker/$distribution/nvidia-docker.list | \
sudo tee /etc/apt/sources.list.d/nvidia-docker.list
sudo apt update && sudo apt install -y nvidia-container-toolkit
# Configure the Docker runtime to use the NVIDIA toolkit, then restart Docker
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
Ollama Installation & Configuration
Native Installation Methods
Linux Installation:
# Method 1: Official installer (recommended)
curl -fsSL https://ollama.com/install.sh | sh
# Method 2: Manual installation (recent releases ship a tarball rather than a bare binary)
curl -LO https://github.com/ollama/ollama/releases/latest/download/ollama-linux-amd64.tgz
sudo tar -C /usr/local -xzf ollama-linux-amd64.tgz   # installs /usr/local/bin/ollama
# Create ollama user and service
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo mkdir -p /usr/share/ollama/.ollama
sudo chown -R ollama:ollama /usr/share/ollama
Systemd Service Configuration:
# Create systemd service file
sudo tee /etc/systemd/system/ollama.service > /dev/null <<EOF
[Unit]
Description=Ollama Service
After=network-online.target
[Service]
ExecStart=/usr/local/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3
Environment="OLLAMA_HOST=0.0.0.0"
Environment="OLLAMA_MODELS=/usr/share/ollama/.ollama/models"
[Install]
WantedBy=default.target
EOF
# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama
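With the service enabled, a quick API round-trip confirms that Ollama is listening before any models are pulled. A small sketch using the `requests` package against the standard `/api/version` and `/api/tags` endpoints:
# Quick check that the Ollama API is reachable after the service starts.
# Requires: pip install requests
import requests

BASE_URL = "http://localhost:11434"

try:
    version = requests.get(f"{BASE_URL}/api/version", timeout=5).json()
    tags = requests.get(f"{BASE_URL}/api/tags", timeout=5).json()
    print(f"Ollama version  : {version.get('version')}")
    print(f"Installed models: {[m['name'] for m in tags.get('models', [])]}")
except requests.RequestException as exc:
    print(f"Ollama API not reachable at {BASE_URL}: {exc}")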
Environment Variables Configuration:
Ollama reads its settings from environment variables on the `ollama serve` process; it does not load a standalone configuration file. With systemd, apply them through a drop-in override:
# Create a drop-in override for the service
sudo systemctl edit ollama
# Add the following to the override (adjust values to your hardware):
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_ORIGINS=*"
Environment="OLLAMA_MODELS=/usr/share/ollama/.ollama/models"
Environment="OLLAMA_KEEP_ALIVE=5m"
Environment="OLLAMA_MAX_LOADED_MODELS=3"
Environment="OLLAMA_MAX_QUEUE=512"
Environment="OLLAMA_NUM_PARALLEL=4"
# Variables such as OLLAMA_MAX_VRAM and OLLAMA_LLM_LIBRARY exist in some releases
# but their accepted values vary by version; verify against your Ollama release
# before setting them.
# Reload and restart to apply the new environment
sudo systemctl daemon-reload
sudo systemctl restart ollama
Docker Deployment
Docker Compose Configuration:
# docker-compose.yml
version: '3.8'
services:
ollama:
image: ollama/ollama:latest
container_name: ollama-deepseek
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
- ./models:/models
environment:
- OLLAMA_KEEP_ALIVE=24h
- OLLAMA_HOST=0.0.0.0
- OLLAMA_MAX_LOADED_MODELS=2
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: all
capabilities: [gpu]
restart: unless-stopped
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/health"]
interval: 30s
timeout: 10s
retries: 3
volumes:
ollama_data:
driver: local
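Once the container is up, models can be pulled through the HTTP API instead of `docker exec`. A sketch against the standard `/api/pull` endpoint, which streams newline-delimited JSON status objects; the model tag is just an example:
# Pull a DeepSeek-R1 variant through the containerized Ollama API.
# /api/pull streams newline-delimited JSON status objects.
# Requires: pip install requests
import json
import requests

BASE_URL = "http://localhost:11434"

def pull_model(name: str) -> None:
    with requests.post(
        f"{BASE_URL}/api/pull", json={"model": name}, stream=True, timeout=None
    ) as resp:
        resp.raise_for_status()
        for line in resp.iter_lines():
            if not line:
                continue
            status = json.loads(line)
            # Each status line may carry "status", "completed", and "total" fields
            print(status.get("status", ""), end="\r")
    print(f"\nFinished pulling {name}")

if __name__ == "__main__":
    pull_model("deepseek-r1:7b")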
DeepSeek-R1 Model Deployment
Model Selection Strategy
Production Model Selection Matrix:
#!/bin/bash
# Automated model selection based on available GPU VRAM, falling back to system RAM
TOTAL_RAM=$(free -g | awk '/^Mem:/{print $2}')
if command -v nvidia-smi &> /dev/null; then
  GPU_VRAM=$(nvidia-smi --query-gpu=memory.total --format=csv,noheader,nounits | head -1)
else
  GPU_VRAM=0
fi
if [ "$GPU_VRAM" -gt 70000 ]; then
  MODEL="deepseek-r1:70b"
elif [ "$GPU_VRAM" -gt 30000 ]; then
  MODEL="deepseek-r1:32b"
elif [ "$GPU_VRAM" -gt 14000 ]; then
  MODEL="deepseek-r1:14b"
elif [ "$GPU_VRAM" -gt 6000 ]; then
  MODEL="deepseek-r1:7b"
elif [ "$TOTAL_RAM" -ge 16 ]; then
  # No usable GPU: pick a CPU-friendly size based on system RAM
  MODEL="deepseek-r1:7b"
else
  MODEL="deepseek-r1:1.5b"
fi
echo "Recommended model: $MODEL"
ollama pull "$MODEL"
Model Download & Verification
Progressive Model Download:
# Download the model (progress is shown by default)
ollama pull deepseek-r1:14b
# Verify the model is installed
ollama list | grep deepseek-r1
# Inspect model details, parameters, and template
ollama show deepseek-r1:14b
Model Information Extraction:
# Extract model metadata
curl http://localhost:11434/api/show -d '{
"name": "deepseek-r1:14b"
}' | jq .
# List model parameters
ollama show deepseek-r1:14b --parameters
Custom Model Configuration
Advanced Modelfile Creation:
# Modelfile for production DeepSeek-R1
FROM deepseek-r1:14b
# System prompt optimization
SYSTEM """
You are DeepSeek-R1, an advanced reasoning AI assistant. You excel at:
- Complex mathematical problem solving
- Logical reasoning and analysis
- Code generation and debugging
- Scientific research assistance
Always think step-by-step and show your reasoning process.
"""
# Performance parameters
PARAMETER temperature 0.7
PARAMETER top_p 0.9
PARAMETER top_k 40
PARAMETER repeat_penalty 1.1
PARAMETER num_ctx 32768
PARAMETER num_batch 512
PARAMETER num_gqa 8
PARAMETER num_gpu 1
PARAMETER num_thread 8
# Note: memory options such as use_mlock, use_mmap, and numa are request-time
# API options rather than documented Modelfile PARAMETER keys; recent Ollama
# versions may reject them here, so set them per request if needed.
Custom Model Creation:
# Create optimized model
ollama create deepseek-r1-optimized -f ./Modelfile
# Test custom model
ollama run deepseek-r1-optimized "Explain quantum entanglement"
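DeepSeek-R1 variants emit their chain of thought between `<think>` and `</think>` tags before the final answer. When consuming output programmatically it is often useful to separate the two; a minimal sketch, assuming the `ollama` Python package (`pip install ollama`) and the custom model created above:
# Separate DeepSeek-R1's <think>...</think> reasoning from its final answer.
# Requires: pip install ollama
import re
import ollama

def ask(prompt: str, model: str = "deepseek-r1-optimized") -> dict:
    response = ollama.generate(model=model, prompt=prompt)
    text = response["response"]
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    answer = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()
    return {"reasoning": reasoning, "answer": answer}

if __name__ == "__main__":
    result = ask("Explain quantum entanglement in two sentences.")
    print("Reasoning:\n", result["reasoning"][:500])
    print("\nAnswer:\n", result["answer"])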
GPU Acceleration & CUDA Optimization
CUDA Memory Management
GPU Memory Profiling:
# Monitor GPU utilization during inference
#!/bin/bash
while true; do
nvidia-smi --query-gpu=timestamp,name,pci.bus_id,driver_version,pstate,\
temperature.gpu,utilization.gpu,utilization.memory,memory.total,memory.free,\
memory.used --format=csv,noheader,nounits
sleep 5
done
Memory Optimization Techniques:
# Pin inference to a specific GPU
export CUDA_VISIBLE_DEVICES=0
# Note: CUDA_MEMORY_FRACTION is not a standard CUDA variable, and Ollama manages
# VRAM allocation itself; check your release's `ollama serve --help` before relying
# on variables such as OLLAMA_MAX_VRAM.
# Keep CUDA's asynchronous launches and JIT cache at their defaults
export CUDA_LAUNCH_BLOCKING=0
export CUDA_CACHE_DISABLE=0
Multi-GPU Configuration
Load Balancing Setup:
# Expose multiple GPUs to Ollama
export CUDA_VISIBLE_DEVICES=0,1,2,3
# Optionally spread a model across all visible GPUs instead of packing one GPU first
export OLLAMA_SCHED_SPREAD=1
# ollama serve takes no GPU flags; it reads the environment variables above
ollama serve
Multi-GPU Inference via the Python API:
# Ollama splits large models across visible GPUs automatically; it does not
# expose tensor-parallel options such as gpu_split or tensor_parallel_size.
# The request-level num_gpu option controls how many layers are offloaded to GPU.
import asyncio
import ollama

async def multi_gpu_inference():
    client = ollama.AsyncClient()
    response = await client.generate(
        model='deepseek-r1:32b',
        prompt='Solve this complex mathematical proof...',
        options={
            'num_gpu': 999,  # offload as many layers as fit across the visible GPUs
        }
    )
    return response

# Run inference
result = asyncio.run(multi_gpu_inference())
print(result['response'])
Memory Management & Performance Tuning
Advanced Memory Configuration
System Memory Optimization:
# Configure huge pages for better memory performance
echo 'vm.nr_hugepages = 2048' | sudo tee -a /etc/sysctl.conf
echo 'vm.hugetlb_shm_group = 1000' | sudo tee -a /etc/sysctl.conf
sudo sysctl -p
# Enable transparent huge pages
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/enabled
echo always | sudo tee /sys/kernel/mm/transparent_hugepage/defrag
KV-Cache Optimization:
# Flash attention must be enabled before quantized KV-cache types take effect
export OLLAMA_FLASH_ATTENTION=1
# Supported cache types are f16 (default), q8_0, and q4_0; quantized caches
# trade a little precision for a much smaller memory footprint
export OLLAMA_KV_CACHE_TYPE="q8_0"
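The benefit of a quantized KV cache is easiest to see with quick arithmetic: cache size grows linearly with layer count, KV heads, head dimension, bytes per element, and context length. The sketch below compares f16 against q8_0 and q4_0 using illustrative architecture numbers (placeholders, not the published DeepSeek-R1 configuration):
# Back-of-the-envelope KV-cache sizing:
#   2 (K and V) x layers x kv_heads x head_dim x bytes_per_element x context_length
# The architecture numbers below are illustrative placeholders only.
LAYERS = 48
KV_HEADS = 8
HEAD_DIM = 128
CONTEXT = 32768

def kv_cache_gib(bytes_per_element: float) -> float:
    total = 2 * LAYERS * KV_HEADS * HEAD_DIM * bytes_per_element * CONTEXT
    return total / (1024 ** 3)

for name, bpe in (("f16", 2.0), ("q8_0", 1.0), ("q4_0", 0.5)):
    print(f"{name}: ~{kv_cache_gib(bpe):.1f} GiB at {CONTEXT} tokens of context")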
Performance Benchmarking Scripts
Comprehensive Benchmark Suite:
#!/usr/bin/env python3
import time
import requests
import psutil
import pynvml as nvml  # pip install nvidia-ml-py
class DeepSeekBenchmark:
def __init__(self, model_name="deepseek-r1:14b"):
self.model_name = model_name
self.base_url = "http://localhost:11434"
nvml.nvmlInit()
def benchmark_inference(self, prompts, iterations=10):
"""Benchmark inference performance"""
results = []
for prompt in prompts:
times = []
tokens_per_second = []
for i in range(iterations):
start_time = time.time()
response = requests.post(
f"{self.base_url}/api/generate",
json={
"model": self.model_name,
"prompt": prompt,
"stream": False,
"options": {
"num_predict": 500,
"temperature": 0.7
}
}
)
end_time = time.time()
inference_time = end_time - start_time
if response.status_code == 200:
data = response.json()
                    # Prefer the server-reported token count over a rough whitespace split
                    tokens = data.get('eval_count') or len(data.get('response', '').split())
                    tps = tokens / inference_time if inference_time > 0 else 0
times.append(inference_time)
tokens_per_second.append(tps)
results.append({
'prompt': prompt[:50] + "...",
'avg_time': sum(times) / len(times),
'avg_tokens_per_sec': sum(tokens_per_second) / len(tokens_per_second),
'min_time': min(times),
'max_time': max(times)
})
return results
def system_metrics(self):
"""Collect system performance metrics"""
# CPU metrics
cpu_percent = psutil.cpu_percent(interval=1)
memory = psutil.virtual_memory()
# GPU metrics
handle = nvml.nvmlDeviceGetHandleByIndex(0)
gpu_memory = nvml.nvmlDeviceGetMemoryInfo(handle)
gpu_utilization = nvml.nvmlDeviceGetUtilizationRates(handle)
return {
'cpu_percent': cpu_percent,
'memory_percent': memory.percent,
'memory_used_gb': memory.used / (1024**3),
'gpu_memory_percent': (gpu_memory.used / gpu_memory.total) * 100,
'gpu_utilization': gpu_utilization.gpu
}
# Usage example
if __name__ == "__main__":
benchmark = DeepSeekBenchmark()
test_prompts = [
"Explain quantum computing in detail",
"Write a Python function to implement merge sort",
"Analyze the economic impact of artificial intelligence",
"Solve this mathematical equation: 2x^2 + 5x - 3 = 0"
]
print("Running DeepSeek-R1 Performance Benchmark...")
results = benchmark.benchmark_inference(test_prompts)
for result in results:
print(f"Prompt: {result['prompt']}")
print(f"Average time: {result['avg_time']:.2f}s")
print(f"Tokens/sec: {result['avg_tokens_per_sec']:.2f}")
print("-" * 50)
print("\nSystem Metrics:")
metrics = benchmark.system_metrics()
for key, value in metrics.items():
print(f"{key}: {value}")
Advanced Configuration & Custom Modelfiles
Production-Grade Modelfile
Enterprise Modelfile Template:
# Enterprise DeepSeek-R1 Configuration
FROM deepseek-r1:32b
# System prompt with role definition
SYSTEM """
You are DeepSeek-R1 Enterprise, an advanced AI reasoning assistant optimized for:
CAPABILITIES:
- Complex problem solving and logical reasoning
- Mathematical computation and proof verification
- Code generation, review, and optimization
- Scientific research and data analysis
- Technical documentation and explanation
OPERATIONAL GUIDELINES:
- Always show step-by-step reasoning
- Cite sources when making factual claims
- Acknowledge uncertainty when appropriate
- Request clarification for ambiguous queries
- Maintain professional, helpful communication
SECURITY CONSTRAINTS:
- Do not process or generate harmful content
- Protect confidential information
- Follow data privacy regulations
- Verify inputs for potential security risks
"""
# Optimized parameters for enterprise use. Modelfile comments live on their own
# lines: temperature is kept low for consistency, top_p/top_k balance precision
# and diversity, repeat_penalty curbs repetition, and num_ctx extends the
# context window.
PARAMETER temperature 0.3
PARAMETER top_p 0.85
PARAMETER top_k 50
PARAMETER repeat_penalty 1.15
PARAMETER num_ctx 32768
PARAMETER num_batch 128
PARAMETER num_gqa 8
# Note: num_gqa is deprecated in recent releases, and options such as
# rope_frequency_base, rope_frequency_scale, use_mmap, use_mlock, numa, f16_kv,
# logits_all, vocab_only, and embedding_only are llama.cpp/API runtime options
# rather than documented Modelfile PARAMETER keys. If `ollama create` reports an
# invalid parameter, set them per request instead or omit them.
API Integration Examples
Python SDK Integration:
import ollama
import asyncio
from typing import List, Dict, Optional
import logging
class DeepSeekR1Client:
    def __init__(self, host: str = "localhost", port: int = 11434):
        # Synchronous client for streaming, async client for awaitable calls
        self.client = ollama.Client(host=f"http://{host}:{port}")
        self.async_client = ollama.AsyncClient(host=f"http://{host}:{port}")
        self.model = "deepseek-r1-optimized"
async def reasoning_inference(
self,
prompt: str,
context: Optional[str] = None,
temperature: float = 0.3,
max_tokens: int = 2048
) -> Dict:
"""Perform reasoning-focused inference"""
full_prompt = f"""
Context: {context if context else 'None provided'}
Task: {prompt}
Please provide a detailed, step-by-step analysis with clear reasoning.
"""
try:
            response = await self.async_client.generate(
model=self.model,
prompt=full_prompt,
options={
'temperature': temperature,
'num_predict': max_tokens,
'stop': ['</reasoning>', '###', '---']
}
)
return {
'success': True,
'response': response['response'],
'model': response['model'],
'created_at': response['created_at'],
'done': response['done'],
'total_duration': response.get('total_duration', 0),
'load_duration': response.get('load_duration', 0),
'prompt_eval_count': response.get('prompt_eval_count', 0),
'eval_count': response.get('eval_count', 0)
}
except Exception as e:
logging.error(f"Inference error: {str(e)}")
return {'success': False, 'error': str(e)}
def batch_inference(self, prompts: List[str]) -> List[Dict]:
"""Process multiple prompts efficiently"""
results = []
for prompt in prompts:
try:
result = asyncio.run(self.reasoning_inference(prompt))
results.append(result)
except Exception as e:
results.append({'success': False, 'error': str(e)})
return results
def stream_inference(self, prompt: str):
"""Stream responses for real-time applications"""
try:
stream = self.client.generate(
model=self.model,
prompt=prompt,
stream=True,
options={'temperature': 0.3}
)
for chunk in stream:
if chunk.get('response'):
yield chunk['response']
except Exception as e:
yield f"Error: {str(e)}"
# Usage example
client = DeepSeekR1Client()
# Single inference
result = asyncio.run(client.reasoning_inference(
"Analyze the time complexity of quicksort algorithm"
))
# Batch processing
prompts = [
"Explain machine learning fundamentals",
"Compare database indexing strategies",
"Optimize Python code performance"
]
batch_results = client.batch_inference(prompts)
# Streaming inference
for token in client.stream_inference("Explain quantum mechanics"):
print(token, end='', flush=True)
Production Deployment Strategies
High-Availability Setup
Load Balancer Configuration (nginx):
# /etc/nginx/sites-available/ollama-deepseek
upstream ollama_backend {
least_conn;
server 10.0.1.10:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.11:11434 max_fails=3 fail_timeout=30s;
server 10.0.1.12:11434 max_fails=3 fail_timeout=30s;
}
# Limit zones must be declared at the http level, outside the server block;
# sites-available files are included within the http context, so they can sit here.
limit_req_zone $binary_remote_addr zone=api:10m rate=10r/s;
limit_conn_zone $binary_remote_addr zone=addr:10m;
server {
listen 80;
server_name deepseek-api.company.com;
# Rate limiting
limit_req zone=api burst=20 nodelay;
# Connection limits
limit_conn addr 10;
location /api/ {
proxy_pass http://ollama_backend;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
# Timeout settings for long-running inference
proxy_connect_timeout 60s;
proxy_send_timeout 300s;
proxy_read_timeout 300s;
# Disable response buffering so streamed tokens reach clients immediately
proxy_buffering off;
proxy_buffer_size 4k;
proxy_buffers 8 4k;
}
# Health check endpoint
location /health {
access_log off;
return 200 "healthy\n";
add_header Content-Type text/plain;
}
}
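The long proxy timeouts above only help if clients are equally patient. A minimal client-side sketch, assuming the hypothetical `deepseek-api.company.com` endpoint from the nginx example, with a generous read timeout and simple retries for transient failures:
# Client-side counterpart to the nginx timeouts above: generous read timeout
# plus a simple retry loop. The endpoint is the hypothetical host from the
# nginx example.
# Requires: pip install requests
import time
import requests

API_URL = "http://deepseek-api.company.com/api/generate"

def generate(prompt: str, model: str = "deepseek-r1:14b", retries: int = 3) -> str:
    payload = {"model": model, "prompt": prompt, "stream": False}
    for attempt in range(1, retries + 1):
        try:
            # connect timeout 10s, read timeout 300s to match proxy_read_timeout
            resp = requests.post(API_URL, json=payload, timeout=(10, 300))
            resp.raise_for_status()
            return resp.json()["response"]
        except requests.RequestException as exc:
            if attempt == retries:
                raise
            print(f"Attempt {attempt} failed ({exc}); retrying...")
            time.sleep(2 * attempt)
    return ""

if __name__ == "__main__":
    print(generate("Summarize the CAP theorem in one paragraph."))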
Monitoring & Observability
Prometheus Metrics Configuration:
# prometheus.yml
global:
scrape_interval: 15s
scrape_configs:
# Ollama does not expose a native Prometheus /metrics endpoint; scrape the
# custom exporter below (started on port 8000) instead.
- job_name: 'deepseek-custom-metrics'
static_configs:
- targets: ['localhost:8000']
scrape_interval: 30s
- job_name: 'node-exporter'
static_configs:
- targets: ['localhost:9100']
- job_name: 'nvidia-dcgm'
static_configs:
- targets: ['localhost:9400']
Custom Metrics Collection:
from prometheus_client import Counter, Histogram, Gauge, start_http_server
import time
# Metrics definitions
inference_requests = Counter('deepseek_inference_requests_total',
'Total inference requests', ['model', 'status'])
inference_duration = Histogram('deepseek_inference_duration_seconds',
'Inference duration in seconds', ['model'])
active_sessions = Gauge('deepseek_active_sessions',
'Number of active inference sessions')
gpu_memory_usage = Gauge('deepseek_gpu_memory_usage_bytes',
'GPU memory usage in bytes')
class MetricsCollector:
def collect_inference_metrics(self, model, duration, status):
inference_requests.labels(model=model, status=status).inc()
inference_duration.labels(model=model).observe(duration)
def update_system_metrics(self):
# Update GPU memory usage
try:
            import pynvml as nvml  # pip install nvidia-ml-py
nvml.nvmlInit()
handle = nvml.nvmlDeviceGetHandleByIndex(0)
info = nvml.nvmlDeviceGetMemoryInfo(handle)
gpu_memory_usage.set(info.used)
except Exception as e:
print(f"GPU metrics collection failed: {e}")
# Start metrics server
if __name__ == "__main__":
start_http_server(8000)
collector = MetricsCollector()
while True:
collector.update_system_metrics()
time.sleep(30)
Troubleshooting & Common Issues
Performance Issues
Problem: Slow inference speeds
# Diagnosis commands
nvidia-smi # Check GPU utilization
htop # Monitor CPU and memory
iotop # Check disk I/O
# Solutions
export OLLAMA_NUM_PARALLEL=1 # Reduce parallel requests
export OLLAMA_MAX_LOADED_MODELS=1 # Limit loaded models
export CUDA_VISIBLE_DEVICES=0 # Use specific GPU
Problem: Out of memory errors
# Check available memory
free -h
nvidia-smi
# Switch to a smaller variant or a more heavily quantized tag
# (see ollama.com/library/deepseek-r1/tags for exact tag names, e.g. *-q4_K_M)
ollama pull deepseek-r1:7b
# Configure swap if needed
sudo fallocate -l 32G /swapfile
sudo chmod 600 /swapfile
sudo mkswap /swapfile
sudo swapon /swapfile
Model Loading Issues
Problem: Model fails to load
# Check model integrity
ollama list
ollama show deepseek-r1:14b
# Re-download if corrupted
ollama rm deepseek-r1:14b
ollama pull deepseek-r1:14b
# Check disk space
df -h /usr/share/ollama
Problem: CUDA errors
# Verify CUDA installation
nvcc --version
nvidia-smi
# Check CUDA is visible to frameworks (requires PyTorch to be installed)
python3 -c "import torch; print(torch.cuda.is_available())"
# Reinstall CUDA drivers if needed
sudo apt purge nvidia-*
sudo apt install nvidia-driver-535 nvidia-cuda-toolkit
Network & API Issues
Problem: Connection timeouts
# Check Ollama service status
sudo systemctl status ollama
# Keep models loaded longer and allow more time for large models to load
export OLLAMA_KEEP_ALIVE=10m
export OLLAMA_LOAD_TIMEOUT=10m
# Test API connectivity
curl -X POST http://localhost:11434/api/generate \
-H "Content-Type: application/json" \
-d '{"model": "deepseek-r1:14b", "prompt": "Hello", "stream": false}'
Diagnostic Scripts
Comprehensive System Check:
#!/bin/bash
# DeepSeek-R1 System Diagnostic Script
echo "=== DeepSeek-R1 System Diagnostic ==="
echo "Date: $(date)"
echo
echo "=== System Information ==="
uname -a
lscpu | grep "Model name"
free -h
df -h
echo -e "\n=== GPU Information ==="
if command -v nvidia-smi &> /dev/null; then
nvidia-smi --query-gpu=name,driver_version,memory.total --format=csv
else
echo "NVIDIA GPU not detected"
fi
echo -e "\n=== CUDA Information ==="
if command -v nvcc &> /dev/null; then
nvcc --version
else
echo "CUDA not installed"
fi
echo -e "\n=== Ollama Status ==="
if systemctl is-active --quiet ollama; then
echo "Ollama service: RUNNING"
ollama list
else
echo "Ollama service: NOT RUNNING"
fi
echo -e "\n=== Network Connectivity ==="
if curl -s http://localhost:11434/api/tags >/dev/null; then
echo "Ollama API: ACCESSIBLE"
else
echo "Ollama API: NOT ACCESSIBLE"
fi
echo -e "\n=== Model Information ==="
if ollama list | grep -q deepseek-r1; then
echo "DeepSeek-R1 models installed:"
ollama list | grep deepseek-r1
else
echo "No DeepSeek-R1 models found"
fi
echo -e "\n=== Resource Usage ==="
echo "CPU Usage: $(top -bn1 | grep "Cpu(s)" | awk '{print $2}' | awk -F'%' '{print $1}')"
echo "Memory Usage: $(free | grep Mem | awk '{printf("%.1f%%"), $3/$2 * 100.0}')"
if command -v nvidia-smi &> /dev/null; then
echo "GPU Memory: $(nvidia-smi --query-gpu=memory.used,memory.total --format=csv,noheader,nounits | awk -F', ' '{printf("%.1f%%"), $1/$2 * 100.0}')"
fi
echo -e "\n=== Recommendations ==="
TOTAL_RAM=$(free -g | awk '/^Mem:/{print $2}')
if [ "$TOTAL_RAM" -lt 16 ]; then
echo "⚠️ Consider upgrading RAM for better performance"
fi
if ! command -v nvidia-smi &> /dev/null; then
echo "⚠️ GPU acceleration not available"
fi
if ! systemctl is-active --quiet ollama; then
echo "🔧 Start Ollama service: sudo systemctl start ollama"
fi
echo -e "\n=== Diagnostic Complete ==="
Conclusion
This comprehensive guide provides enterprise-grade deployment strategies for DeepSeek-R1 with Ollama in 2025. Following these technical specifications and optimization techniques will ensure optimal performance, reliability, and scalability for your local AI infrastructure.
Key Takeaways:
- Hardware Requirements: Match model size to available resources
- GPU Optimization: Leverage CUDA acceleration for maximum performance
- Memory Management: Configure swap and KV-cache for stability
- Production Deployment: Implement monitoring, load balancing, and high availability
- Performance Tuning: Use quantization and parameter optimization
Next Steps:
- Assess your hardware capabilities
- Follow the installation procedures systematically
- Implement monitoring and benchmarking
- Scale based on performance requirements
- Establish backup and disaster recovery procedures
For enterprise deployments, consider consulting with AI infrastructure specialists to optimize your specific use case and ensure compliance with organizational requirements.