Meta has introduced Llama 3.3, a 70-billion parameter large language model that provides performance comparable to the much larger Llama 3.1 405B but with drastically reduced computational demands. This makes high-quality AI accessible to developers who may lack enterprise-level hardware.
This guide explores Llama 3.3, delving into how it works, its use cases, and practical steps to deploy it with Docker.
What Is Llama 3.3?
Llama 3.3 is the latest innovation from Meta AI, designed to provide high-performance natural language processing with reduced hardware requirements.
Llama 3.3 is a text-only 70B instruction-tuned model that provides enhanced performance relative to Llama 3.1 70B, and to Llama 3.2 90B when used for text-only applications. Moreover, for some applications, Llama 3.3 70B approaches the performance of Llama 3.1 405B.
Note: Llama 3.3 70B is provided only as an instruction-tuned model; a pre-trained version is not available. Llama 3.3 uses the same prompt format as Llama 3.1. Prompts written for Llama 3.1 work unchanged with Llama 3.3.
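For reference, the shared Llama 3.1/3.3 chat format wraps each turn's role in special header tokens and closes each message with <|eot_id|>. A minimal single-turn prompt is sketched below for illustration; in practice, tools like Ollama and Hugging Face transformers apply this template for you:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

What is Llama 3.3?<|eot_id|><|start_header_id|>assistant<|end_header_id|>
```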
Key Features
- 70 billion parameters optimized for efficiency.
- Multilingual support for eight core languages: English, German, French, Italian, Portuguese, Hindi, Spanish, and Thai.
- Enhanced performance through Grouped-Query Attention (GQA) and quantization techniques.
- Optimized for developer-grade hardware, enabling local deployments on GPUs.
Example Use Case: A customer support chatbot running Llama 3.3 can respond in multiple languages while operating efficiently on a single GPU, making it accessible to startups and small teams.
How Does Llama 3.3 Work?
Model Architecture
Llama 3.3 is an auto-regressive language model that uses an optimized transformer architecture. The tuned versions use supervised fine-tuning (SFT) and reinforcement learning with human feedback (RLHF) to align with human preferences for helpfulness and safety.
| | Training Data | Params | Input modalities | Output modalities | Context length | GQA | Token count | Knowledge cutoff |
|---|---|---|---|---|---|---|---|---|
| Llama 3.3 (text only) | A new mix of publicly available online data. | 70B | Multilingual Text | Multilingual Text and code | 128k | Yes | 15T+ | December 2023 |
Llama 3.3 is built on a transformer architecture with 70 billion parameters, allowing it to:
- Process large text inputs.
- Generate contextually relevant responses.

Efficiency Highlights:
- Grouped-Query Attention (GQA): Shares key/value heads across groups of query heads, reducing the memory and compute cost of attention during inference.
- Quantization: Supports 8-bit and 4-bit precision for lower memory usage, as shown in the sketch below.
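To make the quantization point concrete, here is a minimal sketch of loading the model in 4-bit precision with Hugging Face transformers and bitsandbytes. This is an illustration, not an official recipe: it assumes you have been granted access to the gated meta-llama/Llama-3.3-70B-Instruct repository, and even at 4-bit the 70B model still needs roughly 40 GB of GPU memory.

```python
# Minimal sketch: loading Llama 3.3 70B in 4-bit precision.
# Assumes `transformers`, `accelerate`, and `bitsandbytes` are installed.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.3-70B-Instruct"  # gated repo; request access first

# NF4 4-bit quantization cuts weight memory roughly 4x versus fp16,
# but a 70B model still requires ~40 GB of GPU memory plus overhead.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=bnb_config,
    device_map="auto",  # spread layers across available GPUs
)
```

NF4 with bfloat16 compute is a common default; 8-bit loading works the same way with load_in_8bit=True.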
Training and Fine-Tuning
- Supervised Fine-Tuning (SFT): Trains on high-quality examples.
- Reinforcement Learning with Human Feedback (RLHF): Ensures alignment with human preferences.
Llama 3.3 Benchmarks
Performance Overview:
- Multilingual benchmarks: 91.1 on MGSM (0-shot).
- Instruction following: 92.1 on IFEval.
- Coding tasks: 88.4 on HumanEval (0-shot).
Training Data
Overview: Llama 3.3 was pretrained on ~15 trillion tokens of data from publicly available sources. The fine-tuning data includes publicly available instruction datasets, as well as over 25M synthetically generated examples.
Data Freshness: The pretraining data has a cutoff of December 2023.
Benchmarks – English Text
This section reports results for Llama 3.3 relative to Meta's previous models.
Instruction tuned models
| Category | Benchmark | # Shots | Metric | Llama 3.1 8B Instruct | Llama 3.1 70B Instruct | Llama 3.3 70B Instruct | Llama 3.1 405B Instruct |
|---|---|---|---|---|---|---|---|
| General | MMLU (CoT) | 0 | macro_avg/acc | 73.0 | 86.0 | 86.0 | 88.6 |
| General | MMLU Pro (CoT) | 5 | macro_avg/acc | 48.3 | 66.4 | 68.9 | 73.3 |
| Steerability | IFEval | | | 80.4 | 87.5 | 92.1 | 88.6 |
| Reasoning | GPQA Diamond (CoT) | 0 | acc | 31.8 | 48.0 | 50.5 | 49.0 |
| Code | HumanEval | 0 | pass@1 | 72.6 | 80.5 | 88.4 | 89.0 |
| Code | MBPP EvalPlus (base) | 0 | pass@1 | 72.8 | 86.0 | 87.6 | 88.6 |
| Math | MATH (CoT) | 0 | sympy_intersection_score | 51.9 | 68.0 | 77.0 | 73.8 |
| Tool Use | BFCL v2 | 0 | overall_ast_summary/macro_avg/valid | 65.4 | 77.5 | 77.3 | 81.1 |
| Multilingual | MGSM | 0 | em | 68.9 | 86.9 | 91.1 | 91.6 |
Llama 3.3 Use Cases
1. Multilingual Chatbots
Build multilingual chatbots for global customer support.
Example: Use Llama 3.3 to create a chatbot that handles queries in English, Hindi, and Spanish.
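As a sketch of what this could look like, the snippet below uses the ollama Python client (pip install ollama) against a locally pulled llama3.3 model, as set up later in this guide. The system prompt and helper function are illustrative assumptions, not part of the Llama 3.3 release.

```python
# Illustrative multilingual support bot using the ollama Python client.
# Requires a running Ollama server with the llama3.3 model pulled.
import ollama

def answer(query: str) -> str:
    response = ollama.chat(
        model="llama3.3",
        messages=[
            {"role": "system",
             "content": "You are a support agent. Reply in the customer's language."},
            {"role": "user", "content": query},
        ],
    )
    return response["message"]["content"]

print(answer("¿Cómo restablezco mi contraseña?"))      # Spanish query
print(answer("मैं अपना पासवर्ड कैसे रीसेट करूँ?"))        # Hindi query
```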
2. Coding Support
Automate repetitive coding tasks with high accuracy.
Example: Generate Python scripts using Llama 3.3 directly from text-based prompts.
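A minimal sketch of the same idea with the ollama client; the prompt is just an arbitrary example:

```python
# Illustrative sketch: generating a Python script from a text prompt.
import ollama

prompt = "Write a Python script that removes duplicate rows from a CSV file."
result = ollama.generate(model="llama3.3", prompt=prompt)
print(result["response"])  # the generated script, ready to review and run
```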
3. Synthetic Data Generation
Generate domain-specific datasets for NLP projects.
Example: Create synthetic datasets for training sentiment analysis models.
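One way to sketch this with the ollama client is to ask for JSON in the prompt and enable Ollama's JSON mode; the schema below is an illustrative assumption for this example, not part of the Llama 3.3 release.

```python
# Illustrative sketch: generating labeled sentiment examples as JSON.
import json
import ollama

prompt = (
    "Generate 5 short product reviews as a JSON array of objects "
    'with keys "text" and "sentiment" (positive, negative, or neutral).'
)
# format="json" asks Ollama to constrain the output to valid JSON.
result = ollama.generate(model="llama3.3", prompt=prompt, format="json")

examples = json.loads(result["response"])
for ex in examples:
    print(ex["sentiment"], "-", ex["text"])
```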
Where Can You Download Llama 3.3?
You can download the model weights from Meta's official Llama site (llama.com) or from the meta-llama organization on Hugging Face (meta-llama/Llama-3.3-70B-Instruct); alternatively, pull a quantized build through Ollama as described below.
Run Llama 3.3 Using Ollama
Once Ollama is installed (installers are available at ollama.com), you can pull Llama 3.3 directly from the terminal:
ollama pull llama3.3
pulling manifest
pulling 4824460d29f2... 100% ▕████████████████████████████████████████████████████████████████████████▏ 42 GB
pulling 948af2743fc7... 100% ▕████████████████████████████████████████████████████████████████████████▏ 1.5 KB
pulling bc371a43ce90... 100% ▕████████████████████████████████████████████████████████████████████████▏ 7.6 KB
pulling 53a87df39647... 100% ▕████████████████████████████████████████████████████████████████████████▏ 5.6 KB
pulling 56bb8bd477a5... 100% ▕████████████████████████████████████████████████████████████████████████▏ 96 B
pulling c7091aa45e9b... 100% ▕████████████████████████████████████████████████████████████████████████▏ 562 B
verifying sha256 digest
writing manifest
success
This command downloads the Llama 3.3 model (about 42 GB for the default quantized build). To chat with it interactively, run ollama run llama3.3.
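Once pulled, the model is also reachable over Ollama's local REST API (default port 11434). A minimal Python sketch using requests:

```python
# Minimal sketch: calling the local Ollama REST API.
import requests

resp = requests.post(
    "http://localhost:11434/api/generate",
    json={
        "model": "llama3.3",
        "prompt": "Explain Grouped-Query Attention in one sentence.",
        "stream": False,  # return one JSON object instead of a stream
    },
)
print(resp.json()["response"])
```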
Running Ollama Using Docker
If you prefer to use Docker, Ollama provides a Docker image that can run models with GPU acceleration. Here’s how to set it up:
For CPU Only:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
Unable to find image 'ollama/ollama:latest' locally
latest: Pulling from ollama/ollama
5270b739081b: Download complete
1961b46d5d4a: Download complete
b2f54319b7fe: Download complete
Digest: sha256:722ce8caba5f8b8bd2ee654b2e29466415be3071a704e3f4db1702b83c885f76
Status: Downloaded newer image for ollama/ollama:latest
980ef901ff155aef44206bcd098f3989b771e3409abd97d2baa8bdcdfc9375c9
For NVIDIA GPU:
- Install the NVIDIA Container Toolkit: Follow the instructions on the NVIDIA Container Toolkit GitHub page to install the toolkit.
- Run Ollama Inside a Docker Container:
docker run -d --gpus=all -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
- Run a Model: After setting up the container, you can run Llama 3.3 inside it:
docker exec -it ollama ollama run llama3.3
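Because the container maps port 11434 to the host, the REST API sketch shown earlier works unchanged against the Dockerized instance.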
FAQs
1. How does Llama 3.3 compare to GPT-4? Llama 3.3 provides competitive performance with lower computational and financial costs, making it ideal for smaller teams.
2. Can I fine-tune Llama 3.3? Yes, fine-tuning is supported, enabling you to tailor it for specific applications like domain-specific chatbots.
3. Does it support cloud deployment? Yes, Llama 3.3 can be deployed on cloud platforms or locally using Docker.
By following this guide, you can effectively leverage Llama 3.3 for your AI-driven projects, from building chatbots to generating synthetic datasets, while keeping costs and hardware requirements manageable.