Abraham Dahunsi Web Developer 🌐 | Technical Writer ✍️| DevOps Enthusiast👨‍💻 | Python🐍 |

How to Deploy Llama 3 on Kubernetes


Llama 3 is an advanced language model that leverages state-of-the-art neural network architectures to generate human-like text, perform complex reasoning, and understand context with high accuracy. Deploying Llama 3 on Kubernetes offers a scalable and resilient solution for integrating this powerful AI model into production environments, allowing for seamless management of resources and ensuring high availability.

This article explains how to deploy Llama 3 on a Kubernetes cluster, taking advantage of Kubernetes’ orchestration capabilities to efficiently manage the model’s compute resources and ensure optimal performance across multiple nodes.

Prerequisites

Before you begin:

  • Deploy a Kubernetes cluster with at least 3 nodes.
  • Install and configure the kubectl CLI on your local machine.
  • Basic understanding of Kubernetes.
  • Familiarity with machine learning concepts.

Preparing the Kubernetes Cluster

Before deploying Llama 3 on Kubernetes, ensure that your Kubernetes cluster is properly set up and configured. Follow the steps below to set up the cluster, configure a namespace for the deployment, and prepare persistent storage for Llama 3’s data.

Setting Up the Cluster

In this guide, we will use Google Kubernetes Engine (GKE) for setting up the Kubernetes cluster. If you don’t already have a Kubernetes cluster, follow these steps to create one in your cloud environment:

  1. Install the Google Cloud SDK and authenticate with your Google Cloud account.
gcloud auth login
  2. Create a new Kubernetes cluster.
gcloud container clusters create llama3-cluster --zone us-central1-a --num-nodes=3
  3. Get the credentials for your cluster.
gcloud container clusters get-credentials llama3-cluster --zone us-central1-a
  4. Verify that your cluster is up and running.
kubectl get nodes

Expected Output:

NAME                                      STATUS   ROLES    AGE     VERSION
gke-llama3-cluster-default-pool-1a2b3c4d  Ready    <none>   5m      v1.22.8-gke.100
gke-llama3-cluster-default-pool-2b3c4d5e  Ready    <none>   5m      v1.22.8-gke.100
gke-llama3-cluster-default-pool-3c4d5e6f  Ready    <none>   5m      v1.22.8-gke.100

Once your cluster is up and running, you’re ready to proceed with configuring the necessary resources for the Llama 3 deployment.

Configuring Namespace

Namespaces in Kubernetes provide a way to divide cluster resources between multiple users or teams. For this deployment, it’s a good idea to create a dedicated namespace to isolate Llama 3 and its associated resources.

  1. Create a namespace named llama3.
kubectl create namespace llama3
  2. Confirm the creation of the namespace.
kubectl get namespaces

Expected Output:

NAME              STATUS   AGE
default           Active   10m
kube-node-lease   Active   10m
kube-public       Active   10m
kube-system       Active   10m
llama3            Active   5s
  3. Set the Namespace as Default.
kubectl config set-context --current --namespace=llama3

This avoids having to specify the namespace with every command.

This namespace will now serve as an isolated environment for your Llama 3 deployment.

Persistent Storage Setup

Llama 3, like most stateful applications, requires persistent storage to maintain its data across pod restarts. Kubernetes provides Persistent Volumes (PVs) and Persistent Volume Claims (PVCs) to manage storage resources.

  1. Create a Persistent Volume (PV) YAML file pv.yaml.
nano pv.yaml 
  2. Add the following content to the file.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: llama3-pv
spec:
  capacity:
    storage: 20Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: "/mnt/data"
  3. Apply the PV configuration.
kubectl apply -f pv.yaml
  4. Create a Persistent Volume Claim (PVC) YAML file pvc.yaml.
nano pvc.yaml 
  5. Add the following content to the file.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama3-pvc
spec:
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 20Gi
  6. Apply the PVC configuration.
kubectl apply -f pvc.yaml
  7. Verify the PV and PVC.
kubectl get pv

Expected Output:

NAME        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM     STORAGECLASS   REASON   AGE
llama3-pv   20Gi       RWO            Retain           Bound    llama3/llama3-pvc                1m

Then check the PVC:

kubectl get pvc

Expected Output:

NAME         STATUS   VOLUME      CAPACITY   ACCESS MODES   STORAGECLASS   AGE
llama3-pvc   Bound    llama3-pv   20Gi       RWO                           1m
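
Note that a hostPath volume lives on the disk of a single node, so the PV above is mainly useful for testing. On a managed cluster such as GKE, a more common approach is to omit the PV entirely and let the default StorageClass provision a disk dynamically. The PVC below is a minimal sketch of that approach; the standard-rwo StorageClass name is a GKE default and may differ in your cluster:

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: llama3-pvc
  namespace: llama3
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: standard-rwo
  resources:
    requests:
      storage: 20Gi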

Containerizing Llama 3

To deploy Llama 3 on Kubernetes, you’ll need to containerize the application by creating a Docker image. Follow the steps below to build the Docker image for Llama 3 and push it to a container registry so that it can be easily accessed by your Kubernetes cluster.

Building the Docker Image

The first step in containerizing Llama 3 is to create a Dockerfile, which contains the instructions to build the Docker image. The Dockerfile defines the base image, installs necessary dependencies, and configures the Llama 3 application.

  1. Create a Dockerfile with the following content.
# Use an official Python runtime as a parent image
FROM python:3.11-slim

# Set the working directory in the container
WORKDIR /app

# Copy the current directory contents into the container at /app
COPY . /app

# Install any needed packages specified in requirements.txt
RUN pip install --no-cache-dir -r requirements.txt

# Make port 8080 available to the world outside this container
EXPOSE 8080

# Define environment variable
ENV LLAMA_ENV=production

# Run Llama 3 when the container launches
CMD ["python", "llama3.py"]

In the Dockerfile above:

  • FROM python:3.11-slim: Specifies the base image, a lightweight version of Python 3.11.
  • WORKDIR /app: Sets the working directory inside the container.
  • COPY . /app: Copies the contents of your current directory (project files) into the container at /app.
  • RUN pip install --no-cache-dir -r requirements.txt: Installs the necessary Python packages listed in requirements.txt.
  • EXPOSE 8080: Exposes port 8080, where Llama 3 will run.
  • CMD ["python", "llama3.py"]: Specifies the command to run Llama 3 when the container starts.
  2. Build the Docker Image.
docker build -t llama3:latest .

In the command above:

  • The -t llama3:latest option tags the image with the name llama3 and the tag latest.
  • The . at the end of the command specifies the build context, which is the current directory.
  3. Verify the Docker Image.
docker images

Expected Output:

REPOSITORY          TAG       IMAGE ID       CREATED        SIZE
llama3              latest    4d5e6f7a8b9c   1 minute ago   500MB
ubuntu              20.04     93fd78260bd1   2 weeks ago    72.9MB
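
The Dockerfile above assumes that the build context contains a requirements.txt and an llama3.py entry point, neither of which is defined by this guide. The snippets below are a minimal sketch of what they might look like, assuming the model is loaded with the Hugging Face transformers library and served behind a small Flask API; adapt them to however you actually load and serve Llama 3.

requirements.txt (hypothetical):

flask
transformers
torch
accelerate

llama3.py (hypothetical sketch):

# llama3.py -- minimal inference server sketch; adjust to your own serving setup.
import os

from flask import Flask, jsonify, request
from transformers import pipeline

app = Flask(__name__)

# The model name is an assumption; access to Meta Llama 3 weights requires
# accepting the license on Hugging Face and authenticating the container.
MODEL_NAME = os.getenv("LLAMA_MODEL", "meta-llama/Meta-Llama-3-8B-Instruct")
generator = pipeline("text-generation", model=MODEL_NAME)

@app.route("/generate", methods=["POST"])
def generate():
    # Expect a JSON body like {"prompt": "..."} and return the completion.
    prompt = request.get_json(force=True).get("prompt", "")
    output = generator(prompt, max_new_tokens=256)
    return jsonify({"completion": output[0]["generated_text"]})

if __name__ == "__main__":
    # Port 8080 matches the EXPOSE instruction in the Dockerfile.
    app.run(host="0.0.0.0", port=8080)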

Pushing to a Container Registry

After building the Docker image, the next step is to push it to a container registry. This allows your Kubernetes cluster to pull the image when deploying Llama 3. Follow the steps below to push the image to Docker Hub or Google Container Registry (GCR).

Docker Hub

  1. Log in to your Docker Hub account.
docker login
  2. Tag the image with your Docker Hub username and repository name.
docker tag llama3:latest yourdockerhubusername/llama3:latest
  3. Push the image to Docker Hub.
docker push yourdockerhubusername/llama3:latest
  4. Verify the image is in your Docker Hub repository by visiting your Docker Hub profile.

Google Container Registry (GCR)

  1. Configure Docker to authenticate with GCR.
gcloud auth configure-docker
  2. Tag the image with your GCR hostname, project ID, and image name.
docker tag llama3:latest gcr.io/your-project-id/llama3:latest
  3. Push the image to GCR.
docker push gcr.io/your-project-id/llama3:latest
  4. Verify that the image is in GCR by navigating to the Container Registry section of your Google Cloud Console.

With the Docker image pushed to a container registry, it is now accessible to your Kubernetes cluster, and you are ready to proceed with deploying Llama 3 on Kubernetes.
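
One detail to keep in mind: if your Docker Hub repository or GCR project is private, the cluster needs credentials to pull the image. A common pattern, sketched below with placeholder values, is to create a docker-registry secret in the llama3 namespace:

kubectl create secret docker-registry llama3-regcred \
  --docker-server=https://index.docker.io/v1/ \
  --docker-username=yourdockerhubusername \
  --docker-password=your-password-or-token \
  --namespace=llama3

and then reference it from the Deployment's pod spec with:

      imagePullSecrets:
      - name: llama3-regcred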

Deploying Llama 3 on Kubernetes

With the Docker image of Llama 3 now available in your container registry, the next step is to deploy it on your Kubernetes cluster. This section will cover creating the necessary Kubernetes manifests for the deployment, service, and configuration, applying these manifests using kubectl, and setting up auto scaling to ensure Llama 3 can handle varying loads.

Creating Deployment Manifests

  1. Create a llama3-deployment.yaml file.
nano llama3-deployment.yaml
  2. Enter the following content into the file.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: llama3-deployment
  namespace: llama3
spec:
  replicas: 3
  selector:
    matchLabels:
      app: llama3
  template:
    metadata:
      labels:
        app: llama3
    spec:
      containers:
      - name: llama3-container
        image: yourdockerhubusername/llama3:latest
        ports:
        - containerPort: 8080
        env:
        - name: LLAMA_ENV
          valueFrom:
            configMapKeyRef:
              name: llama3-config
              key: LLAMA_ENV
        resources:
          requests:
            memory: "512Mi"
            cpu: "500m"
          limits:
            memory: "1024Mi"
            cpu: "1000m"

In the YAML file above:

  • replicas: 3: Specifies that three replicas of the Llama 3 pod should be running.
  • image: Specifies the Docker image to use for the Llama 3 container.
  • env: Configures environment variables, using values from a ConfigMap.
  3. Create a llama3-service.yaml file with the following content.
apiVersion: v1
kind: Service
metadata:
  name: llama3-service
  namespace: llama3
spec:
  type: LoadBalancer
  selector:
    app: llama3
  ports:
  - protocol: TCP
    port: 80
    targetPort: 8080

In the YAML file above:

  • type: LoadBalancer: Exposes the service to external traffic through a cloud provider’s load balancer.
  • port: 80: Maps the external port 80 to the internal container port 8080.
  4. Create a ConfigMap manifest llama3-configmap.yaml with the following content.
apiVersion: v1
kind: ConfigMap
metadata:
  name: llama3-config
  namespace: llama3
data:
  LLAMA_ENV: "production"

In the YAML file above:

  • The LLAMA_ENV variable is set to “production,” which can be used by Llama 3 to determine its operating environment.
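
Note that the Deployment above does not reference the llama3-pvc created earlier. If Llama 3 stores model weights or other data that should survive pod restarts, you can attach the claim to the container. The sketch below shows where the two stanzas go: volumeMounts under the container, volumes at the pod spec level. The mount path /app/data is an assumption and should match wherever your application actually reads and writes data:

        volumeMounts:
        - name: llama3-data
          mountPath: /app/data
      volumes:
      - name: llama3-data
        persistentVolumeClaim:
          claimName: llama3-pvc

Because the claim uses ReadWriteOnce, only pods scheduled on the same node can share it; with three replicas you may need a ReadWriteMany-capable storage class or per-pod volumes.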

Applying Manifests

Once you have created the Deployment, Service, and ConfigMap manifests, you need to apply them to your Kubernetes cluster to create the necessary resources.

  1. Apply the ConfigMap Manifest.
kubectl apply -f llama3-configmap.yaml
  2. Check and verify that the ConfigMap has been created.
kubectl get configmaps -n llama3

Expected Output:

NAME            DATA   AGE
llama3-config   1      1m
  3. Apply the Deployment Manifest to create the Llama 3 pods.
kubectl apply -f llama3-deployment.yaml
  4. Check and verify the status of the deployment.
kubectl get deployments -n llama3

Expected Output:

NAME                READY   UP-TO-DATE   AVAILABLE   AGE
llama3-deployment   3/3     3            3           2m
  5. Apply the Service Manifest to expose Llama 3.
kubectl apply -f llama3-service.yaml
  6. Check and verify that the service is running and note the external IP address assigned.
kubectl get services -n llama3

Expected Output:

NAME            TYPE           CLUSTER-IP    EXTERNAL-IP      PORT(S)          AGE
llama3-service  LoadBalancer   10.96.0.123   35.188.54.67     80:30123/TCP     3m

Your Llama 3 deployment should now be running and accessible via the external IP provided by the LoadBalancer service.
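
As a quick smoke test, you can send a request to the external IP. The exact path and payload depend on how llama3.py exposes its API; the /generate endpoint below is the hypothetical one from the sketch earlier in this guide, and the IP should be replaced with the EXTERNAL-IP from your own kubectl get services output:

curl -X POST http://35.188.54.67/generate \
  -H "Content-Type: application/json" \
  -d '{"prompt": "Hello, Llama 3!"}'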

Configuring Auto Scaling

To ensure that Llama 3 can handle varying levels of traffic, you can configure the Horizontal Pod Autoscaler (HPA). The HPA automatically scales the number of pod replicas based on observed CPU utilization or other selected metrics.

  1. Create an HPA for Llama 3.
kubectl autoscale deployment llama3-deployment --cpu-percent=50 --min=3 --max=10 -n llama3

In the command above:

  • --cpu-percent=50: The target average CPU utilization for the pods is 50%.
  • --min=3: The minimum number of pod replicas is set to 3.
  • --max=10: The maximum number of pod replicas is set to 10.
  2. Monitor the HPA.
kubectl get hpa -n llama3
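
If you prefer to keep the autoscaler under version control with the rest of your manifests, the imperative command above is equivalent to a declarative HorizontalPodAutoscaler resource. The sketch below uses the autoscaling/v2 API:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: llama3-hpa
  namespace: llama3
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: llama3-deployment
  minReplicas: 3
  maxReplicas: 10
  metrics:
  - type: Resource
    resource:
      name: cpu
      target:
        type: Utilization
        averageUtilization: 50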

With the HPA configured, Llama 3 will automatically scale its resources to meet demand, ensuring optimal performance and resource utilization.

Conclusion

You have successfully deployed Llama 3 on a Kubernetes cluster, leveraging Kubernetes’ orchestration capabilities to ensure scalable and resilient operation of the model. This setup allows for efficient resource management and high availability across multiple nodes. For more information and advanced configuration options, visit the official Llama 3 and Kubernetes documentation.

Have Queries? Join https://launchpass.com/collabnix
