Setting up a robust local development environment for AI applications is crucial for rapid prototyping, testing, and debugging before deploying to production. This comprehensive guide walks you through building a containerized, GPU-enabled local development stack that mirrors production environments while maintaining developer productivity.
Why Local AI Development Environments Matter
Modern AI applications require complex dependencies including specific Python versions, CUDA drivers, machine learning frameworks, and data processing libraries. A well-configured local environment enables:
- Faster iteration cycles: Test model changes without cloud deployment delays
- Cost optimization: Reduce cloud GPU costs during development
- Offline development: Work without constant internet connectivity
- Environment consistency: Eliminate “works on my machine” issues
- Privacy and security: Keep sensitive data and models local during development
Architecture Overview
Our local AI development environment leverages containerization with Docker and orchestration with Kubernetes (via Kind or Minikube), providing:
- GPU-accelerated compute with NVIDIA Container Toolkit
- Jupyter notebooks for interactive development
- MLflow for experiment tracking
- MinIO for S3-compatible object storage
- PostgreSQL for metadata storage
Prerequisites and System Requirements
Before starting, ensure your system meets these requirements:
- Docker Desktop 20.10+ or Docker Engine with Docker Compose
- NVIDIA GPU with CUDA support (optional but recommended)
- 16GB+ RAM (32GB recommended for large models)
- 50GB+ available disk space
- kubectl and Kind/Minikube installed
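The hardware and tooling requirements above can be sanity-checked before installing anything else. Here is a minimal, stdlib-only sketch; the tool names and the 50GB threshold mirror the list above, and the thresholds are meant to be adjusted:

```python
import shutil

def check_prerequisites(path="/", min_disk_gb=50):
    """Report whether the basic tooling and disk space from the list are available."""
    return {
        # Tools expected on the PATH, per the prerequisites list
        "docker": shutil.which("docker") is not None,
        "kubectl": shutil.which("kubectl") is not None,
        "kind or minikube": shutil.which("kind") is not None
            or shutil.which("minikube") is not None,
        # Present only on GPU hosts with drivers installed (optional)
        "nvidia-smi (optional)": shutil.which("nvidia-smi") is not None,
        # Free disk space on the given mount point
        "disk": shutil.disk_usage(path).free >= min_disk_gb * 1024**3,
    }

if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{'OK     ' if ok else 'MISSING'}  {name}")
```

Running it prints one line per check, which is easier to act on than discovering a missing tool halfway through the setup.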
Installing NVIDIA Container Toolkit
For GPU support, install the NVIDIA Container Toolkit:
# Add the NVIDIA Container Toolkit package repository
# (the older nvidia-docker repo and apt-key workflow are deprecated)
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | \
  sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg
curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
  sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
  sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
# Install nvidia-container-toolkit
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Configure the Docker runtime, then restart the daemon
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
# Verify installation
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
Building the Base AI Development Container
Create a custom Docker image with common AI/ML frameworks:
FROM nvidia/cuda:11.8.0-cudnn8-devel-ubuntu22.04
# Set environment variables
ENV DEBIAN_FRONTEND=noninteractive
ENV PYTHON_VERSION=3.10
ENV PYTHONUNBUFFERED=1
# Install system dependencies
RUN apt-get update && apt-get install -y \
    python${PYTHON_VERSION} \
    python3-pip \
    git \
    curl \
    vim \
    build-essential \
    && rm -rf /var/lib/apt/lists/*
# Upgrade pip and install core packages
RUN pip3 install --no-cache-dir --upgrade pip setuptools wheel
# Install AI/ML frameworks
RUN pip3 install --no-cache-dir \
    torch==2.1.0 \
    torchvision==0.16.0 \
    tensorflow==2.15.0 \
    transformers==4.35.0 \
    scikit-learn==1.3.2 \
    pandas==2.1.3 \
    numpy==1.26.2 \
    matplotlib==3.8.2 \
    seaborn==0.13.0 \
    jupyterlab==4.0.9 \
    mlflow==2.9.0 \
    optuna==3.4.0
# Set working directory
WORKDIR /workspace
# Expose Jupyter port
EXPOSE 8888
# Default command
CMD ["jupyter", "lab", "--ip=0.0.0.0", "--allow-root", "--no-browser"]
Build the image:
docker build -t ai-dev-base:latest -f Dockerfile.ai .
Docker Compose Configuration
Create a comprehensive docker-compose.yml for the complete development stack:
version: '3.8'

services:
  jupyter:
    image: ai-dev-base:latest
    container_name: ai-jupyter
    ports:
      - "8888:8888"
    volumes:
      - ./notebooks:/workspace/notebooks
      - ./data:/workspace/data
      - ./models:/workspace/models
    environment:
      - JUPYTER_ENABLE_LAB=yes
      - MLFLOW_TRACKING_URI=http://mlflow:5000
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    networks:
      - ai-network

  mlflow:
    image: ghcr.io/mlflow/mlflow:v2.9.0
    container_name: ai-mlflow
    ports:
      - "5000:5000"
    environment:
      - MLFLOW_BACKEND_STORE_URI=postgresql://mlflow:mlflow@postgres:5432/mlflow
      - MLFLOW_DEFAULT_ARTIFACT_ROOT=s3://mlflow/artifacts
      - AWS_ACCESS_KEY_ID=minioadmin
      - AWS_SECRET_ACCESS_KEY=minioadmin
      - MLFLOW_S3_ENDPOINT_URL=http://minio:9000
    command: >
      mlflow server
      --backend-store-uri postgresql://mlflow:mlflow@postgres:5432/mlflow
      --default-artifact-root s3://mlflow/artifacts
      --host 0.0.0.0
      --port 5000
    depends_on:
      - postgres
      - minio
    networks:
      - ai-network

  postgres:
    image: postgres:15-alpine
    container_name: ai-postgres
    environment:
      - POSTGRES_USER=mlflow
      - POSTGRES_PASSWORD=mlflow
      - POSTGRES_DB=mlflow
    volumes:
      - postgres-data:/var/lib/postgresql/data
    networks:
      - ai-network

  minio:
    image: minio/minio:latest
    container_name: ai-minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      - MINIO_ROOT_USER=minioadmin
      - MINIO_ROOT_PASSWORD=minioadmin
    command: server /data --console-address ":9001"
    volumes:
      - minio-data:/data
    networks:
      - ai-network

  minio-setup:
    image: minio/mc:latest
    container_name: ai-minio-setup
    depends_on:
      - minio
    entrypoint: >
      /bin/sh -c "
      sleep 10;
      /usr/bin/mc alias set myminio http://minio:9000 minioadmin minioadmin;
      /usr/bin/mc mb myminio/mlflow;
      /usr/bin/mc anonymous set download myminio/mlflow;
      exit 0;
      "
    networks:
      - ai-network

volumes:
  postgres-data:
  minio-data:

networks:
  ai-network:
    driver: bridge
One caveat: the stock MLflow image may not bundle the psycopg2 and boto3 packages that the Postgres backend and S3 artifact store require; if the mlflow service crashes on startup, extend the image with pip install psycopg2-binary boto3. Start the development environment:
# Create necessary directories
mkdir -p notebooks data models
# Start all services
docker-compose up -d
# Check service status
docker-compose ps
# View logs
docker-compose logs -f jupyter
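Even after docker-compose reports the containers as running, it can take a little while before Jupyter, MLflow, and MinIO actually accept connections. A small stdlib helper that polls the published ports (the port numbers come from the compose file above; host and timeouts are adjustable) keeps bring-up scripts from racing the containers:

```python
import socket
import time

def wait_for_port(host, port, timeout=60.0, interval=1.0):
    """Poll a TCP port until something accepts a connection or the timeout expires."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            with socket.create_connection((host, port), timeout=interval):
                return True  # the service is accepting connections
        except OSError:
            time.sleep(interval)
    return False

# Example usage against the ports published by the compose file:
#   wait_for_port("127.0.0.1", 8888, timeout=120)   # Jupyter
#   wait_for_port("127.0.0.1", 5000, timeout=120)   # MLflow
#   wait_for_port("127.0.0.1", 9000, timeout=120)   # MinIO API
```

Returning a boolean (rather than raising) lets callers decide whether a slow service is fatal or just worth a warning.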
Kubernetes-Based Local Development
For teams preferring Kubernetes, deploy the AI stack on Kind:
# Create Kind cluster with port mappings for Jupyter and MLflow
# (note: Kind does not natively pass through GPUs; GPU workloads in Kind
# require additional host-side configuration beyond this config)
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
name: ai-dev
nodes:
- role: control-plane
  extraPortMappings:
  - containerPort: 30888
    hostPort: 8888
    protocol: TCP
  - containerPort: 30500
    hostPort: 5000
    protocol: TCP
EOF
Deploying Jupyter on Kubernetes
apiVersion: v1
kind: Namespace
metadata:
  name: ai-dev
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: jupyter-pvc
  namespace: ai-dev
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: jupyter
  namespace: ai-dev
spec:
  replicas: 1
  selector:
    matchLabels:
      app: jupyter
  template:
    metadata:
      labels:
        app: jupyter
    spec:
      containers:
        - name: jupyter
          image: ai-dev-base:latest
          imagePullPolicy: IfNotPresent
          ports:
            - containerPort: 8888
          env:
            - name: JUPYTER_ENABLE_LAB
              value: "yes"
            - name: MLFLOW_TRACKING_URI
              value: "http://mlflow-service:5000"
          volumeMounts:
            - name: workspace
              mountPath: /workspace
          resources:
            limits:
              nvidia.com/gpu: 1
            requests:
              memory: "8Gi"
              cpu: "2"
      volumes:
        - name: workspace
          persistentVolumeClaim:
            claimName: jupyter-pvc
---
apiVersion: v1
kind: Service
metadata:
  name: jupyter-service
  namespace: ai-dev
spec:
  type: NodePort
  selector:
    app: jupyter
  ports:
    - port: 8888
      targetPort: 8888
      nodePort: 30888
Apply the configuration:
# Load custom image into Kind
kind load docker-image ai-dev-base:latest --name ai-dev
# Apply Kubernetes manifests
kubectl apply -f jupyter-deployment.yaml
# Check deployment status
kubectl get pods -n ai-dev -w
# Get Jupyter token
kubectl logs -n ai-dev deployment/jupyter | grep token
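The token in those logs can also be extracted programmatically when scripting environment bring-up. A small sketch; the `?token=<hex>` URL format is what current Jupyter releases print, but treat that as an assumption and check your own logs:

```python
import re

def extract_jupyter_token(log_text):
    """Return the first token=... value found in Jupyter startup logs, or None."""
    match = re.search(r"[?&]token=([0-9a-f]+)", log_text)
    return match.group(1) if match else None

# Feed it real output, e.g. from:
#   kubectl logs -n ai-dev deployment/jupyter
sample = "    http://127.0.0.1:8888/lab?token=4f90a1b2c3d4e5f60718293a4b5c6d7e"
print(extract_jupyter_token(sample))
```

Returning None instead of raising makes it easy to retry while the pod is still starting up.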
Sample AI Development Workflow
Here’s a practical example of training a model with experiment tracking:
import mlflow
import mlflow.pytorch
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

# Set MLflow tracking URI
mlflow.set_tracking_uri("http://mlflow:5000")
mlflow.set_experiment("image-classification")

# Define a simple neural network
class SimpleNet(nn.Module):
    def __init__(self, input_size, hidden_size, num_classes):
        super(SimpleNet, self).__init__()
        self.fc1 = nn.Linear(input_size, hidden_size)
        self.relu = nn.ReLU()
        self.fc2 = nn.Linear(hidden_size, num_classes)

    def forward(self, x):
        out = self.fc1(x)
        out = self.relu(out)
        out = self.fc2(out)
        return out

# Training function with MLflow tracking
def train_model(learning_rate=0.001, batch_size=32, epochs=10):
    with mlflow.start_run():
        # Log parameters
        mlflow.log_param("learning_rate", learning_rate)
        mlflow.log_param("batch_size", batch_size)
        mlflow.log_param("epochs", epochs)

        # Create dummy data
        X_train = torch.randn(1000, 784)
        y_train = torch.randint(0, 10, (1000,))
        train_dataset = TensorDataset(X_train, y_train)
        train_loader = DataLoader(train_dataset, batch_size=batch_size)

        # Initialize model
        device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        model = SimpleNet(784, 128, 10).to(device)
        criterion = nn.CrossEntropyLoss()
        optimizer = optim.Adam(model.parameters(), lr=learning_rate)

        # Training loop
        for epoch in range(epochs):
            total_loss = 0
            for batch_x, batch_y in train_loader:
                batch_x, batch_y = batch_x.to(device), batch_y.to(device)
                optimizer.zero_grad()
                outputs = model(batch_x)
                loss = criterion(outputs, batch_y)
                loss.backward()
                optimizer.step()
                total_loss += loss.item()
            avg_loss = total_loss / len(train_loader)
            mlflow.log_metric("loss", avg_loss, step=epoch)
            print(f"Epoch [{epoch+1}/{epochs}], Loss: {avg_loss:.4f}")

        # Save model
        mlflow.pytorch.log_model(model, "model")
        return model

# Run training
if __name__ == "__main__":
    model = train_model(learning_rate=0.001, batch_size=64, epochs=20)
Troubleshooting Common Issues
GPU Not Detected in Container
If NVIDIA GPU isn’t accessible inside containers:
# Verify Docker can access GPU
docker run --rm --gpus all nvidia/cuda:11.8.0-base-ubuntu22.04 nvidia-smi
# Check NVIDIA Container Toolkit installation
nvidia-ctk --version
# Restart Docker daemon
sudo systemctl restart docker
# Verify Docker runtime configuration
cat /etc/docker/daemon.json
Ensure daemon.json includes:
{
  "runtimes": {
    "nvidia": {
      "path": "nvidia-container-runtime",
      "runtimeArgs": []
    }
  },
  "default-runtime": "nvidia"
}
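Whether daemon.json actually contains that runtime entry can be verified with a few lines of Python. The path and expected keys below mirror the snippet above:

```python
import json

def nvidia_runtime_configured(daemon_json_path="/etc/docker/daemon.json"):
    """Return True if daemon.json registers the nvidia runtime."""
    try:
        with open(daemon_json_path) as f:
            config = json.load(f)
    except (OSError, json.JSONDecodeError):
        # Missing or malformed file means the runtime is not configured
        return False
    return "nvidia" in config.get("runtimes", {})

# Example: print(nvidia_runtime_configured())
```

This is handy in a setup script: run it after `nvidia-ctk runtime configure` and fail fast with a clear message instead of debugging a cryptic container start error later.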
Out of Memory Errors
When encountering OOM errors during model training:
# Reduce batch size
batch_size = 16 # Instead of 64
# Enable gradient checkpointing for large models
model.gradient_checkpointing_enable()
# Use mixed precision training
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()
optimizer.zero_grad()
with autocast():
    outputs = model(inputs)
    loss = criterion(outputs, targets)
scaler.scale(loss).backward()
scaler.step(optimizer)
scaler.update()
Slow Data Loading
Optimize data loading performance:
# Increase number of workers
train_loader = DataLoader(
    train_dataset,
    batch_size=32,
    num_workers=4,            # Adjust based on CPU cores
    pin_memory=True,          # Faster GPU transfer
    persistent_workers=True,  # Keep workers alive between epochs
)
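Since the right num_workers value depends on the machine, it is often derived from the CPU count rather than hardcoded. One common heuristic, offered here as a starting point rather than a rule, is to leave a couple of cores free for the main training process:

```python
import os

def suggest_num_workers(reserved=2, cap=8):
    """Suggest a DataLoader num_workers value from the available CPU count."""
    cpus = os.cpu_count() or 1
    # Leave `reserved` cores for the training loop itself, and cap the
    # result to avoid oversubscribing shared machines.
    return max(0, min(cpus - reserved, cap))

# e.g. DataLoader(dataset, batch_size=32, num_workers=suggest_num_workers())
print(suggest_num_workers())
```

Benchmark a few values around the suggestion; data loading throughput also depends on storage speed and per-sample transform cost.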
Best Practices for Local AI Development
- Version control everything: Track Dockerfiles, docker-compose files, and Kubernetes manifests in Git
- Use environment variables: Never hardcode credentials or configuration values
- Implement data versioning: Use DVC or MLflow for dataset versioning
- Monitor resource usage: Use docker stats or kubectl top to track resource consumption
- Regular cleanup: Remove unused containers, images, and volumes to free disk space
- Separate development and production configs: Use different compose files for different environments
- Enable hot reloading: Mount code as volumes for faster iteration
- Document dependencies: Maintain clear requirements.txt or poetry.lock files
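The "use environment variables" point above can be as simple as a small settings helper with overridable defaults. In this sketch the variable names echo the docker-compose stack (adapt them to your own services), and secrets deliberately have no default:

```python
import os

def load_settings(env=os.environ):
    """Read service endpoints and credentials from the environment."""
    return {
        # Defaults match the local compose stack; override them in production
        "mlflow_uri": env.get("MLFLOW_TRACKING_URI", "http://localhost:5000"),
        "s3_endpoint": env.get("MLFLOW_S3_ENDPOINT_URL", "http://localhost:9000"),
        # No default for secrets: fail loudly instead of shipping a hardcoded key
        "aws_access_key_id": env["AWS_ACCESS_KEY_ID"],
        "aws_secret_access_key": env["AWS_SECRET_ACCESS_KEY"],
    }
```

A KeyError on a missing secret at startup is far cheaper than a credential silently baked into an image.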
Performance Optimization Tips
# Enable Docker BuildKit for faster builds
export DOCKER_BUILDKIT=1
# Use multi-stage builds to reduce image size
# Cache pip packages
RUN --mount=type=cache,target=/root/.cache/pip \
    pip install -r requirements.txt
# Clean up apt cache
RUN apt-get clean && rm -rf /var/lib/apt/lists/*
# Use .dockerignore to exclude unnecessary files
printf "__pycache__\n*.pyc\n.git\n.vscode\n" > .dockerignore
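To preview what a .dockerignore would exclude before building, the patterns can be tested against file paths with fnmatch. This is a rough approximation only: Docker's matching has extra rules (such as `**` and `!` exceptions), so treat it as a sketch, not a reimplementation:

```python
import fnmatch

def is_ignored(path, patterns):
    """Roughly check whether a path matches any .dockerignore-style pattern."""
    parts = path.split("/")
    for pattern in patterns:
        # Match against the full path or any single path component
        if fnmatch.fnmatch(path, pattern) or any(
            fnmatch.fnmatch(part, pattern) for part in parts
        ):
            return True
    return False

patterns = ["__pycache__", "*.pyc", ".git", ".vscode"]
for p in ["src/app.py", "src/__pycache__/app.cpython-310.pyc", "model.pyc"]:
    print(p, "->", "ignored" if is_ignored(p, patterns) else "kept")
```

Walking the project tree with this filter gives a quick estimate of how much build-context upload the ignore file saves.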
Conclusion
A well-configured local development environment for AI applications dramatically improves productivity and reduces iteration time. By leveraging containerization with Docker and orchestration with Kubernetes, you create a consistent, reproducible environment that mirrors production while maintaining flexibility for experimentation.
This setup provides GPU acceleration, experiment tracking, and all necessary infrastructure components locally, enabling rapid prototyping without cloud costs. As your projects scale, the containerized approach ensures smooth transitions from local development to cloud deployment.
Start with the Docker Compose setup for simplicity, then migrate to Kubernetes as your team and infrastructure requirements grow. Remember to regularly update your base images and dependencies to leverage the latest improvements in AI frameworks and tools.