Meta's Llama 4 Models Land on Ollama!
Introduction
Exciting news for the AI community! Meta's latest generation of powerful open-weight large language models, Llama 4, has arrived and is now accessible through Ollama. This means you can run these cutting-edge multimodal models directly on your local hardware. This post will guide you through the Llama 4 models available on Ollama and show you how to get started.
Llama 4
Llama 4 marks a significant advancement in open AI models, incorporating several key innovations:
- Mixture-of-Experts (MoE) Architecture: Llama 4 uses an MoE design. While the total parameter count can be massive (up to 400B for Maverick!), only a fraction of those parameters (the most relevant "experts," roughly 17B worth) is activated for any given token. This makes inference significantly more efficient, and potentially faster, than a traditional "dense" model of similar scale; a toy sketch of the routing idea follows this list.
- Native Multimodality: Unlike models where vision capabilities might be bolted on, Llama 4 is designed from the ground up to process and understand both text and image inputs seamlessly.
- Expanded Multilingual Capabilities: Trained on a vast dataset encompassing numerous languages, Llama 4 officially supports 12 languages at launch: Arabic, English, French, German, Hindi, Indonesian, Italian, Portuguese, Spanish, Tagalog, Thai, and Vietnamese.
- Very Large Context Windows: These models boast enormous context windows. Llama 4 Scout, for instance, supports up to 10 million tokens, allowing for incredibly long conversations, analysis of large documents, or reasoning over extensive codebases. (Note: Initial implementations on platforms like Ollama might support a portion of this maximum context).
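To make the MoE point concrete, here is a minimal, self-contained sketch of top-k expert routing in Python with NumPy. It is a toy illustration of the general technique only, not Llama 4's actual routing code; the dimensions, random weights, and `top_k=2` choice are arbitrary assumptions for demonstration.

```python
import numpy as np

def moe_forward(x, experts, gate_w, top_k=2):
    """Toy mixture-of-experts layer: each token is routed to only top_k experts."""
    scores = gate_w @ x                                         # one routing score per expert
    top = np.argsort(scores)[-top_k:]                           # indices of the top_k experts
    weights = np.exp(scores[top]) / np.exp(scores[top]).sum()   # softmax over the selected experts only
    # Only the selected experts do any work; the other experts' parameters stay idle.
    return sum(w * (W @ x + b) for w, (W, b) in zip(weights, (experts[i] for i in top)))

# Example: 16 experts (Scout's expert count), but only 2 run per token.
d, num_experts = 8, 16
rng = np.random.default_rng(0)
experts = [(rng.standard_normal((d, d)), rng.standard_normal(d)) for _ in range(num_experts)]
gate_w = rng.standard_normal((num_experts, d))
print(moe_forward(rng.standard_normal(d), experts, gate_w).shape)   # -> (8,)
```

The takeaway is simply that each token touches only `top_k` of the 16 expert weight matrices, which is why the active parameter count (~17B) is so much smaller than the total.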
Available Llama 4 Models on Ollama
Ollama currently provides access to the two primary instruction-tuned Llama 4 models released by Meta:
Llama 4 Scout (`llama4:scout`)
- Parameters: 109 billion total parameters | ~17 billion active parameters (16 experts).
- Size: Approximately 67GB (this can vary with quantization).
- Strengths: Designed as a highly capable yet efficient model. It aims to provide strong performance across general text and image understanding tasks and can often be run on a single high-end GPU (like an NVIDIA H100 or equivalent).
- Context: Up to 10 million tokens.
- Primary Uses: Ideal for assistant-style chat, visual question answering, summarizing multiple documents, and general knowledge queries.
Llama 4 Maverick (`llama4:maverick`)
- Parameters: 400 billion total parameters | ~17 billion active parameters (128 experts).
- Size: Approximately 243GB (significantly larger, requiring substantial resources).
- Strengths: With a much larger pool of experts, Maverick is potentially stronger in complex reasoning, coding, and specialized knowledge tasks compared to Scout.
- Context: Up to a reported 1 million tokens.
- Primary Uses: Suited for advanced reasoning, sophisticated code generation and analysis, and challenging visual interpretation tasks.
(Resource Note: Running these models, especially Maverick, requires significant RAM and, for optimal performance, powerful GPU(s) with ample VRAM.)
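For a rough sense of where these sizes come from, weight-only memory is approximately total parameters × bits per weight ÷ 8. The back-of-the-envelope sketch below uses that formula alone; actual Ollama downloads (the 67GB and 243GB figures above) come out larger because of mixed-precision layers, tokenizer and metadata files, and runtime overhead such as the KV cache.

```python
def approx_weight_gb(total_params_billion, bits_per_weight):
    """Weight-only size estimate in GB: params * bits / 8 (ignores all overhead)."""
    return total_params_billion * bits_per_weight / 8

for name, params_b in [("llama4:scout", 109), ("llama4:maverick", 400)]:
    for bits in (4, 8, 16):
        print(f"{name:16s} @ {bits:2d}-bit quantization ~ {approx_weight_gb(params_b, bits):6.1f} GB")
```

This also hints at why the quantized community builds mentioned later can be noticeably smaller than the official tags.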
How to Run Llama 4 with Ollama
Getting Llama 4 running locally with Ollama is simple:
Install or Update Ollama: Make sure you have the latest version of Ollama. If you don't have it installed, download it from the Ollama website.
Run from Terminal: Open your terminal or command prompt. Use the `ollama run` command followed by the model tag. Ollama handles the download and setup automatically.
- To run Llama 4 Scout:
```bash
ollama run llama4:scout
```
- To run Llama 4 Maverick:
```bash
ollama run llama4:maverick
```
(Reminder: Ensure your system meets the high resource requirements for Maverick before running this command.)
Start Interacting: Once the `>>>` prompt appears, the model is loaded and you can type your text prompts directly!
- Using Images: For multimodal input (sending images along with text), you'll typically need to interact via the Ollama REST API or use a GUI client that supports multimodal requests; the basic `ollama run` command is text-only. Check the Ollama GitHub repository for API documentation and examples, or see the sketches below.
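Beyond the interactive prompt, the local Ollama server also exposes a REST API (by default at http://localhost:11434), which is handy for scripting. Here is a minimal text-only sketch in Python against the `/api/generate` endpoint, assuming the third-party `requests` package is installed; the prompt and the `num_ctx` value are arbitrary examples, and `num_ctx` should be sized to what your hardware can actually hold (see the context-window note above).

```python
import requests  # third-party; pip install requests

# Text-only request to the local Ollama server (default port 11434).
payload = {
    "model": "llama4:scout",
    "prompt": "Summarize the advantages of a mixture-of-experts architecture in two sentences.",
    "stream": False,                    # return a single JSON object instead of a token stream
    "options": {"num_ctx": 32768},      # example context size; raise or lower to fit your hardware
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```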
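Building on the same endpoint, multimodal input is sent by base64-encoding one or more images into the request's `images` field. Again, this is a hedged sketch rather than a definitive recipe: `diagram.png` is a placeholder filename, and behaviour may vary across Ollama versions.

```python
import base64
import requests  # third-party; pip install requests

# "diagram.png" is a placeholder; substitute any local PNG or JPEG file.
with open("diagram.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

payload = {
    "model": "llama4:scout",
    "prompt": "What does this image show?",
    "images": [image_b64],              # base64-encoded images travel alongside the text prompt
    "stream": False,
}
resp = requests.post("http://localhost:11434/api/generate", json=payload, timeout=600)
resp.raise_for_status()
print(resp.json()["response"])
```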
Exploring Further
Beyond the official Meta releases, the Ollama community often provides quantized versions (e.g., `q4_K_M`, `q5_K_M`, `q6_K`) of popular models. These can offer reduced file sizes and lower RAM/VRAM requirements, making powerful models accessible on less powerful hardware, albeit potentially with a small trade-off in accuracy. You can search for these community versions directly in the Ollama model library. For example, searching for `llama4` might reveal quantized versions like `ingu627/llama4-scout-q4`.
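If you'd rather fetch such a community build from a script than with `ollama run`, the REST API's `/api/pull` endpoint streams download progress as JSON lines. A minimal sketch, using the `ingu627/llama4-scout-q4` tag above purely as an example (availability may vary, and older Ollama releases expect a `name` field instead of `model`):

```python
import json
import requests  # third-party; pip install requests

# Example community tag from the text above; availability and naming may vary.
resp = requests.post(
    "http://localhost:11434/api/pull",
    json={"model": "ingu627/llama4-scout-q4"},
    stream=True,
)
for line in resp.iter_lines():
    if line:
        print(json.loads(line).get("status", ""))  # e.g. "pulling manifest", progress updates, "success"
```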
Conclusion
Llama 4's availability on Ollama puts state-of-the-art, open, multimodal AI power within reach of developers and enthusiasts. The efficiency of the MoE architecture combined with native multimodal understanding and vast context windows opens up exciting possibilities. Whether you choose the agile Scout or the powerhouse Maverick, Ollama provides an easy gateway to explore this next generation of AI. Give them a try today!