Tanvir Kour is a passionate technical blogger and open source enthusiast. She is a graduate in Computer Science and Engineering and has 4 years of experience in providing IT solutions. She is well-versed in Linux, Docker, and Cloud-Native applications. You can connect with her on Twitter: https://x.com/tanvirkour

Llama 3.3 70B and Ollama


Meta has introduced Llama 3.3, a 70-billion parameter large language model that provides performance comparable to the much larger Llama 3.1 405B but with drastically reduced computational demands. This makes high-quality AI accessible to developers who may lack enterprise-level hardware.

This guide explores Llama 3.3, delving into how it works, its use cases, and practical steps to deploy it with Docker.

What Is Llama 3.3?

Llama 3.3 is the latest innovation from Meta AI, designed to provide high-performance natural language processing with reduced hardware requirements.

Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B, and to Llama 3.2 90B when the latter is used for text-only applications. For some applications, Llama 3.3 70B even approaches the performance of Llama 3.1 405B.

Note: Llama 3.3 70B is provided only as an instruction-tuned model; a pre-trained version is not available. Llama 3.3 uses the same prompt format as Llama 3.1. Prompts written for Llama 3.1 work unchanged with Llama 3.3.
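Because the template is unchanged, a raw Llama 3.1-style prompt carries over verbatim. The sketch below builds that chat template by hand (special-token layout as documented for Llama 3.1); note that runtimes such as Ollama apply this template automatically, so you only need this for low-level integrations:

```python
def format_llama3_prompt(system: str, user: str) -> str:
    """Assemble a raw single-turn prompt in the Llama 3.1/3.3 chat template."""
    return (
        "<|begin_of_text|>"
        "<|start_header_id|>system<|end_header_id|>\n\n"
        f"{system}<|eot_id|>"
        "<|start_header_id|>user<|end_header_id|>\n\n"
        f"{user}<|eot_id|>"
        # The trailing assistant header cues the model to generate its reply.
        "<|start_header_id|>assistant<|end_header_id|>\n\n"
    )

prompt = format_llama3_prompt("You are a helpful assistant.", "What is Llama 3.3?")
print(prompt)
```

Any prompt string assembled this way for Llama 3.1 can be sent to Llama 3.3 without modification.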

Key Features

  • 70 billion parameters optimized for efficiency.
  • Multilingual support for eight core languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
  • Enhanced performance through Grouped-Query Attention (GQA) and quantization techniques.
  • Optimized for developer-grade hardware, enabling local deployments on GPUs.

Example Use Case: A customer support chatbot running Llama 3.3 can respond in multiple languages while operating efficiently on a single GPU, making it accessible to startups and small teams.
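As a sketch of how such a chatbot might call a locally served model, the snippet below posts a Spanish-language query to Ollama's /api/chat endpoint (the endpoint and response shape follow Ollama's REST API; the model name, system prompt, and query are illustrative):

```python
import json
import urllib.request
import urllib.error

OLLAMA_URL = "http://localhost:11434/api/chat"  # Ollama's default local endpoint

def build_chat_request(user_message: str) -> bytes:
    """Serialize a single-turn chat request for Ollama's /api/chat endpoint."""
    payload = {
        "model": "llama3.3",
        "messages": [
            {"role": "system",
             "content": "You are a support agent. Reply in the customer's language."},
            {"role": "user", "content": user_message},
        ],
        "stream": False,  # return one complete response instead of streaming
    }
    return json.dumps(payload).encode("utf-8")

body = build_chat_request("¿Cómo restablezco mi contraseña?")  # Spanish query

try:
    req = urllib.request.Request(OLLAMA_URL, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=120) as resp:
        print(json.loads(resp.read())["message"]["content"])
except urllib.error.URLError:
    print("Ollama is not reachable locally; start it with `ollama serve` first.")
```

Because the system prompt instructs the model to mirror the customer's language, the same code path serves English, Hindi, or Spanish queries without per-language branching.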


How Does Llama 3.3 Work?

Model Architecture

Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.

| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.3 (text only) | A new mix of publicly available online data. | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |

Llama 3.3 is built on a transformer architecture with 70 billion parameters, allowing it to:

  • Process large text inputs (up to a 128k-token context window).
  • Generate contextually relevant responses.

Efficiency Highlights:

  • Grouped-Query Attention (GQA): Reduces memory and computational demands during inference.
  • Quantization: Supports 8-bit and 4-bit precision for lower memory usage.
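A quick back-of-envelope calculation shows why quantization matters at this scale. The figures below are weights-only estimates (the KV cache and activations add more on top), but they explain how a 70B model fits on developer-grade hardware:

```python
PARAMS = 70e9  # Llama 3.3 parameter count

# Approximate memory needed just to hold the weights at each precision.
for name, bytes_per_param in [("fp16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = PARAMS * bytes_per_param / 2**30
    print(f"{name}: ~{gib:.0f} GiB")
# fp16: ~130 GiB, int8: ~65 GiB, int4: ~33 GiB
```

At full fp16 precision the weights alone exceed any single consumer GPU, while 4-bit quantization brings them near workstation range; this is consistent with Ollama's roughly 42 GB default download, which is a 4-bit build plus format overhead.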

Training and Fine-Tuning

  • Supervised Fine-Tuning (SFT): Trains on high-quality examples.
  • Reinforcement Learning with Human Feedback (RLHF): Ensures alignment with human preferences.

Llama 3.3 Benchmarks

Performance Overview:

  • Multilingual benchmarks: 91.1 on MGSM (0-shot).
  • Instruction following: 92.1 on IFEval.
  • Coding tasks: 88.4 on HumanEval (0-shot).

Training Data

Overview: Llama 3.3 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.

Data Freshness: The pretraining data has a cutoff of December 2023.

Benchmarks – English Text

This section reports results for Llama 3.3 relative to earlier Llama models.

Instruction tuned models

| Category | Benchmark | # Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama 3.3 70B Instruct | Llama 3.1 405B Instruct |
|---|---|---|---|---|---|---|---|
| General | MMLU (CoT) | 0 | macro_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |
| | MMLU Pro (CoT) | 5 | macro_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |
| Steerability | IFEval | | | 80.4 | 87.5 | 92.1 | 88.6 |
| Reasoning | GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |
| Code | HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |
| | MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |
| Math | MATH (CoT) | 0 | sympy_intersection_score | 51.9 | 68.0 | 77.0 | 73.8 |
| Tool Use | BFCL v2 | 0 | overall_ast_summary/macro_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |
| Multilingual | MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |

Llama 3.3 Use Cases

1. Multilingual Chatbots

Build multilingual chatbots for global customer support.

Example: Use Llama 3.3 to create a chatbot that handles queries in English, Hindi, and Spanish.

2. Coding Support

Automate repetitive coding tasks with high accuracy.

Example: Generate Python scripts using Llama 3.3 directly from text-based prompts.
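As an illustrative sketch of this workflow (the prompt text is made up; the endpoint follows Ollama's REST API), you could request a script from a locally served Llama 3.3 like this:

```python
import json
import urllib.request
import urllib.error

PROMPT = (
    "Write a Python function that reads a CSV file and returns the "
    "average of a named numeric column. Return only the code."
)

# Build the request body for Ollama's /api/generate endpoint.
payload = json.dumps({
    "model": "llama3.3",
    "prompt": PROMPT,
    "stream": False,
}).encode("utf-8")

try:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        print(json.loads(resp.read())["response"])
except urllib.error.URLError:
    print("Start Ollama first (`ollama serve`), then re-run this script.")
```

Always review and test generated code before using it; treating the model's output as a draft rather than a final artifact keeps the 88.4 HumanEval score in perspective.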

3. Synthetic Data Generation

Generate domain-specific datasets for NLP projects.

Example: Create synthetic datasets for training sentiment analysis models.
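A sketch of the prompt-construction side of this workflow (labels and wording are illustrative; you would send each prompt to the model, then validate each returned line with json.loads before adding it to a training set):

```python
LABELS = ["positive", "negative", "neutral"]

def synthesis_prompt(label: str, n: int = 5) -> str:
    """Prompt asking the model for n labeled sentiment examples as JSON lines."""
    return (
        f"Generate {n} short product reviews with {label} sentiment. "
        'Output one JSON object per line: {"text": ..., "label": ...}'
    )

# One generation prompt per sentiment class.
prompts = [synthesis_prompt(label) for label in LABELS]
for p in prompts:
    print(p)
```

Asking for machine-parseable output (JSON lines) up front makes it easy to filter malformed generations automatically instead of cleaning free text by hand.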


Where to Download Llama 3.3

You can download the model weights from Meta's official Llama site or from Hugging Face; both require accepting the Llama 3.3 community license. If you just want to run the model locally, Ollama (below) handles the download for you.

Run Llama 3.3 Using Ollama

Once Ollama is installed (installation instructions are available at ollama.com), pull Llama 3.3 from the terminal:

ollama pull llama3.3
pulling manifest
pulling 4824460d29f2... 100% ▕████████████████████████████████████████████████████████████████████████▏  42 GB
pulling 948af2743fc7... 100% ▕████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling bc371a43ce90... 100% ▕████████████████████████████████████████████████████████████████████████▏ 7.6 KB
pulling 53a87df39647... 100% ▕████████████████████████████████████████████████████████████████████████▏ 5.6 KB
pulling 56bb8bd477a5... 100% ▕████████████████████████████████████████████████████████████████████████▏   96 B
pulling c7091aa45e9b... 100% ▕████████████████████████████████████████████████████████████████████████▏  562 B
verifying sha256 digest
writing manifest
success

This command downloads the Llama 3.3 model (the default build is 4-bit quantized, roughly 42 GB). Once the download completes, start an interactive session with ollama run llama3.3.
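Besides the interactive CLI, the downloaded model can also be queried programmatically through Ollama's local REST API, which streams one JSON object per line until the final chunk sets "done". A minimal streaming sketch (the prompt text is illustrative):

```python
import json
import urllib.request
import urllib.error

# Request body for Ollama's /api/generate endpoint with streaming enabled.
payload = json.dumps({
    "model": "llama3.3",
    "prompt": "Explain grouped-query attention in two sentences.",
    "stream": True,
}).encode("utf-8")

try:
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload, headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req, timeout=300) as resp:
        for line in resp:  # one JSON object per line while streaming
            chunk = json.loads(line)
            print(chunk.get("response", ""), end="", flush=True)
            if chunk.get("done"):
                break
except urllib.error.URLError:
    print("Could not reach Ollama on localhost:11434; is it running?")
```

Streaming lets a chatbot UI show tokens as they arrive instead of waiting for the full completion, which matters for a 70B model generating long answers.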


Running Ollama Using Docker

If you prefer to use Docker, Ollama provides a Docker image that can run models with GPU acceleration. Here’s how to set it up:

For CPU Only:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
5270b739081b: Download complete
1961b46d5d4a: Download complete
b2f54319b7fe: Download complete
Digest: sha256:722ce8caba5f8b8bd2ee654b2e29466415be3071a704e3f4db1702b83c885f76
Status: Downloaded newer image for ollama/ollama:latest
980ef901ff155aef44206bcd098f3989b771e3409abd97d2baa8bdcdfc9375c9




For NVIDIA GPU:

  1. Install the NVIDIA Container Toolkit: follow the instructions on the NVIDIA Container Toolkit GitHub page to install the toolkit.
  2. Run Ollama inside a Docker container:

docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama

  3. Run a model: after the container is up, run Llama 3.3 inside it:

docker exec -it ollama ollama run llama3.3



FAQs

1. How does Llama 3.3 compare to GPT-4? Llama 3.3 provides competitive performance with lower computational and financial costs, making it ideal for smaller teams.

2. Can I fine-tune Llama 3.3? Yes, fine-tuning is supported, enabling you to tailor it for specific applications like domain-specific chatbots.

3. Does it support cloud deployment? Yes, Llama 3.3 can be deployed on cloud platforms or locally using Docker.


By following this guide, you can effectively leverage Llama 3.3 for your AI-driven projects, from building chatbots to generating synthetic datasets, while keeping costs and hardware requirements manageable.

