AI is rapidly transforming how we build software—but testing it? That’s still catching up.
If you’re building GenAI apps, you’ve probably asked:
“How do I test LLM responses in CI without relying on expensive APIs like OpenAI or SageMaker?”
In this post, I’ll show you how to run large language models locally in GitHub Actions using Ollama, powered entirely by open source tools like Docker, GitHub self-hosted runners, and NVIDIA GPU support. No cloud APIs, no billing surprises—just fast, reproducible testing with real models.
🧠 What is Ollama?
Ollama makes running open LLMs as easy as running containers. With a single command, you can pull and serve models like `llama2`, `mistral`, `codellama`, `gemma`, and more, with no need to configure Python environments or CUDA libraries yourself.
```bash
ollama pull llama2
ollama run llama2
```
Ollama includes a REST API and CLI for inference, and recently added support for structured outputs and embeddings.
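For example, assuming the Ollama server is running locally on its default port (11434), a JSON-mode generation and an embeddings request look roughly like this (the model names are just whatever you have pulled locally):

```bash
# Ask the generate endpoint for structured (JSON) output
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "prompt": "Return a JSON object with fields \"tool\" and \"purpose\" describing Ollama.",
  "format": "json",
  "stream": false
}'

# Request an embedding vector for a piece of text
curl -s http://localhost:11434/api/embeddings -d '{
  "model": "llama2",
  "prompt": "Self-hosted CI runners with GPU support"
}'
```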
🛠️ Running Ollama in CI with GPU Support
To run LLMs in your CI pipeline using open tools, you’ll need:
- 🖥️ A self-hosted GitHub Actions runner (Linux preferred)
- 🚀 NVIDIA GPU (consumer or datacenter-grade)
- 🔧 NVIDIA Container Toolkit for GPU passthrough
- 🐋 Docker (optional, if you want container isolation)
- 🧪 Ollama installed locally or inside a Docker container
✅ Setting Up a GitHub Self-Hosted Runner with GPU Support
- Install the NVIDIA drivers and verify the GPU is visible:

```bash
nvidia-smi
```

- Install the NVIDIA Container Toolkit and point Docker at the NVIDIA runtime:

```bash
sudo apt install -y nvidia-container-toolkit
sudo nvidia-ctk runtime configure --runtime=docker
sudo systemctl restart docker
```

- Register your GitHub self-hosted runner by following the official guide.
- Install Ollama locally (or run it in a Docker container with GPU access; see the sketch after this list):

```bash
curl -fsSL https://ollama.com/install.sh | sh
```
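If you'd rather keep things containerized, a common pattern is to run Ollama in Docker with GPU access instead of installing it on the host. A rough sketch (the CUDA image tag is an assumption; pick one that matches your installed driver):

```bash
# Optional sanity check: confirm GPU passthrough works inside containers
docker run --rm --gpus all nvidia/cuda:12.3.1-base-ubuntu22.04 nvidia-smi

# Run Ollama in a container, keeping pulled models in a named volume
docker run -d --gpus all \
  -v ollama:/root/.ollama \
  -p 11434:11434 \
  --name ollama \
  ollama/ollama
```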
⚙️ GitHub Actions Workflow Example
Here’s a sample `.github/workflows/ollama-e2e.yml`:
```yaml
name: Ollama CI Test

on:
  workflow_dispatch:

jobs:
  ollama-test:
    runs-on: [self-hosted, gpu]
    steps:
      - name: Checkout repo
        uses: actions/checkout@v3

      - name: Install Ollama
        run: curl -fsSL https://ollama.com/install.sh | sudo sh

      - name: Start Ollama Server
        run: |
          ollama serve &
          sleep 5
          curl -s http://localhost:11434

      - name: Pull Llama2 model
        run: ollama pull llama2

      - name: Run inference via CLI
        run: ollama run llama2 "How do MicroVMs compare to Docker in CI/CD workflows?"

      - name: Run inference via REST API
        run: |
          curl -s http://localhost:11434/api/generate -d '{
            "model": "llama2",
            "stream": false,
            "prompt": "List risks of running privileged containers in CI pipelines"
          }' | jq
```
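As written, the inference steps only print output. To make the job behave like a real test, assert on the response and let a non-zero exit code fail the build. A minimal sketch using `jq` (the check and prompt are just examples):

```bash
# Fail the step if the model returns an empty response
curl -s http://localhost:11434/api/generate -d '{
  "model": "llama2",
  "stream": false,
  "prompt": "List risks of running privileged containers in CI pipelines"
}' | jq -e '.response | length > 0'
```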
🔄 Why This Matters
This approach gives you:
- ✅ Local control over the models you test
- 🔓 Vendor-free testing (no lock-in to cloud LLM providers)
- 💡 Customizable models with Ollama’s Modelfile (see the sketch after this list)
- 🧪 Reliable and repeatable GenAI test coverage
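If you haven’t used a Modelfile before, here is a rough sketch of what that customization can look like (the model name `ci-llama2`, the parameters, and the system prompt are all illustrative):

```bash
# Create a small Modelfile that pins behavior for CI runs
cat > Modelfile <<'EOF'
FROM llama2
PARAMETER temperature 0
PARAMETER seed 42
SYSTEM "You are a concise assistant used in automated tests. Answer in plain text."
EOF

# Build and run the customized model
ollama create ci-llama2 -f Modelfile
ollama run ci-llama2 "Summarize the benefits of self-hosted CI runners."
```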
🔚 Wrapping Up
Ollama brings a Docker-like UX to LLMs. Combined with open-source CI workflows and GPU-powered self-hosted runners, you can now run full LLM tests—locally or in CI—without relying on paid APIs.
This opens the door to reproducible, structured, and cost-effective GenAI development workflows, from dev to test to production.
Want to extend this setup with LangChain, RAG, or multi-agent orchestration tools? Let me know—I’d be happy to walk you through more advanced examples.