Meta released Llama 4, the newest member of its Llama family of large language models, on April 5, 2025, marking a significant advancement in the field of AI, particularly in native multimodality and model architecture. At the center of this release are the Llama 4 Scout and Llama 4 Maverick models, as well as a preview of the giant "teacher" model that underpins the technology, Llama 4 Behemoth. This move not only demonstrates Meta's commitment to catching up and trying to outperform the industry's top models, but also continues its strategy of promoting an open source AI ecosystem.
Overview of the Llama 4 series of core models
Meta is launching two models from the Llama 4 series for developers and applications:
- Llama 4 Scout: The model has 17 billion active parameters and 16 experts, for a total of 109 billion parameters. Meta describes it as the best multimodal model in its class, surpassing all previous Llama models. More remarkable still, it supports a context window of up to 10 million tokens, far exceeding Llama 3's 128K. Meta claims that the model outperforms Google's Gemma 3 and Gemini 2.0 Flash-Lite as well as Mistral AI's Mistral 3.1, and that with Int4 quantization it can be deployed on a single NVIDIA H100 GPU, showing a focus on efficiency.
- Llama 4 Maverick: It also has 17 billion active parameters, but is equipped with 128 experts and 400 billion total parameters. Positioned as the top multimodal model in its class, it aims to challenge OpenAI's GPT-4o and Google's Gemini 2.0 Flash. According to data published by Meta, Maverick performs well on several benchmarks and comes close in reasoning and coding to DeepSeek v3, which has roughly twice as many active parameters. An experimental chat version received an ELO score of 1417 on the LMArena leaderboard, showing its potential for dialog. The model runs on a single NVIDIA H100 DGX host.
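The single-GPU and single-host claims can be sanity-checked with rough arithmetic. The sketch below estimates the memory needed for the weights alone. It assumes the 80 GB H100 variant and an 8-GPU DGX host, and it assumes FP8 weights for Maverick, since the precision is not stated above. KV cache, activations, and runtime overhead are ignored, so this is only a lower bound.

```python
# Back-of-the-envelope weight memory; not a deployment guide.
H100_MEMORY_GB = 80          # assumption: 80 GB H100 variant
DGX_H100_GPUS = 8            # assumption: 8 GPUs per DGX H100 host


def weight_memory_gb(total_params: float, bits_per_param: int) -> float:
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return total_params * bits_per_param / 8 / 1e9


def active_fraction(active_params: float, total_params: float) -> float:
    """Share of parameters that are active for a single token."""
    return active_params / total_params


scout_int4_gb = weight_memory_gb(109e9, 4)      # Scout, Int4
maverick_fp8_gb = weight_memory_gb(400e9, 8)    # Maverick, FP8 (assumed)

print(f"Scout Int4 weights:   {scout_int4_gb:.1f} GB of {H100_MEMORY_GB} GB")
print(f"Maverick FP8 weights: {maverick_fp8_gb:.1f} GB of "
      f"{H100_MEMORY_GB * DGX_H100_GPUS} GB")
print(f"Scout active share:    {active_fraction(17e9, 109e9):.1%}")
print(f"Maverick active share: {active_fraction(17e9, 400e9):.1%}")
```

Under these assumptions Scout's quantized weights fit on one GPU with room left for the KV cache, while Maverick needs the pooled memory of a multi-GPU host.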
Additionally, Meta previewed the Llama 4 Behemoth model. It is a giant with 288 billion active parameters, 16 experts, and nearly 2 trillion total parameters. It is still in training and not yet released, but it serves as a "teacher" model for Scout and Maverick, boosting their performance through knowledge distillation. Behemoth excels in math, multilingual, and image benchmarks, and Meta says it outperforms GPT-4.5, Anthropic's Claude Sonnet 3.7, and Google's Gemini 2.0 Pro on some STEM benchmarks such as MATH-500 and GPQA Diamond.
Key technological innovations: MoE, multimodality and long contexts
The Llama 4 series introduces a number of key technological innovations that underpin its improved performance.
Mixture of Experts (MoE) Architecture
Llama 4 is Meta's first family of models to use the Mixture of Experts (MoE) architecture, which has gained attention in recent years because it lets a model activate only a small fraction of its parameters (the selected "experts") for each token while keeping a large total parameter count. For a given training compute budget, this design delivers higher quality than a traditional dense model and makes both training and inference more compute-efficient.
As an example, Llama 4 Maverick has 400 billion total parameters distributed among 128 routed experts and one shared expert. During inference, each token is sent to the shared expert and to one of the routed experts. The model alternates dense and MoE layers to balance performance and efficiency. This architecture allows Maverick to be deployed on a single H100 DGX host, or to run with distributed inference for maximum efficiency, reducing the cost and latency of serving the model.
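The routing described above (every token goes to the shared expert plus exactly one routed expert) can be sketched in a few lines. This is a toy illustration, not Meta's implementation: the experts are single random linear maps, the layer uses 8 routed experts instead of 128, and the sigmoid gate on the winning router score is an assumption.

```python
import math
import random

DIM = 4          # toy hidden size (assumption)
NUM_ROUTED = 8   # toy value; Maverick is described as having 128


def random_matrix(rows, cols, rng):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def moe_layer(token, shared_expert, routed_experts, router):
    """Send one token to the shared expert and to its top-1 routed expert."""
    logits = matvec(router, token)                      # one score per routed expert
    best = max(range(len(logits)), key=logits.__getitem__)
    gate = 1.0 / (1.0 + math.exp(-logits[best]))        # sigmoid gate (assumption)
    shared_out = matvec(shared_expert, token)
    routed_out = matvec(routed_experts[best], token)    # only this expert runs
    output = [s + gate * r for s, r in zip(shared_out, routed_out)]
    return output, best


rng = random.Random(0)
shared = random_matrix(DIM, DIM, rng)
experts = [random_matrix(DIM, DIM, rng) for _ in range(NUM_ROUTED)]
router = random_matrix(NUM_ROUTED, DIM, rng)

token = [0.2, -0.1, 0.7, 0.3]
output, chosen = moe_layer(token, shared, experts, router)
print(f"token routed to expert {chosen}; output has {len(output)} dimensions")
```

Only two expert matrices are multiplied per token regardless of how many routed experts exist, which is why the active parameter count stays far below the total.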
Native Multimodality and Early Fusion
Unlike previous approaches that trained a language model first and then adapted it for vision, Llama 4 is designed as a natively multimodal model. It employs an early fusion strategy that integrates text and vision tokens in a single model backbone. This means that the model can be jointly pre-trained on large amounts of unlabeled text, image, and video data. Llama 4 also improves its vision encoder, which is based on MetaCLIP but trained in conjunction with a frozen Llama model to better adapt it to the large language model. The model was pre-trained with up to 48 input images and performs well with up to 8 input images in post-training tests, so it can take multiple images together with text prompts for visual reasoning and understanding.
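As a rough illustration of early fusion, the sketch below maps text tokens and image patches into the same embedding space and concatenates them into one sequence for a shared backbone. Everything here is hypothetical: the three-word vocabulary, the 3-value "patches", the random projection, and the choice to place image tokens first.

```python
import random

DIM = 4          # shared embedding size (assumption)
PATCH_SIZE = 3   # each toy "patch" is a flat list of 3 pixel values

rng = random.Random(0)
# Hypothetical lookup table for text tokens.
text_embedding = {
    word: [rng.uniform(-1, 1) for _ in range(DIM)]
    for word in ["describe", "this", "image"]
}
# Hypothetical linear projection from patch pixels into the same space.
patch_projection = [
    [rng.uniform(-1, 1) for _ in range(PATCH_SIZE)] for _ in range(DIM)
]


def embed_text(tokens):
    return [text_embedding[token] for token in tokens]


def embed_image(patches):
    return [
        [sum(w * p for w, p in zip(row, patch)) for row in patch_projection]
        for patch in patches
    ]


def early_fusion(tokens, patches):
    """Build one mixed sequence of (modality, vector) pairs for a single backbone."""
    sequence = [("image", vec) for vec in embed_image(patches)]
    sequence += [("text", vec) for vec in embed_text(tokens)]
    return sequence


fused = early_fusion(
    ["describe", "this", "image"],
    [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]],
)
print([modality for modality, _ in fused])
```

The point of the design is that attention layers see both modalities from the first layer on, rather than receiving vision features through an adapter added after language pre-training.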
Breakthrough 10 Million Token Context Window
Another distinguishing feature of Llama 4 Scout is its industry-leading context window of up to 10 million tokens. Such a huge context capacity opens up new possibilities for processing very long documents, analyzing large-scale datasets or having conversations that require long-term memory, for example:
- Summarizing and answering questions about multiple lengthy reports or books.
- Analyzing a complete codebase to understand dependencies or find errors.
- Providing deep personalization based on weeks or even months of user activity.
- Extracting and analyzing information in fields such as law or medicine, where large amounts of text need to be processed.
Meta credits Scout's long-context capabilities to its iRoPE architecture. Scout is both pre-trained and post-trained with a context length of 256K, and the architecture combines interleaved attention layers that use no position embeddings with inference-time temperature scaling of attention to improve length generalization. The "i" stands for interleaved, and "RoPE" refers to Rotary Position Embeddings; the "i" also hints at the long-term goal of supporting "infinite" context length.
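To make the two ingredients concrete, here is a minimal sketch of standard rotary position embeddings together with a hypothetical interleaving pattern. The RoPE rotation follows the usual formulation; the choice that every fourth layer skips position embeddings (`nope_every=4`) is an assumption for illustration, and the temperature scaling step is not shown because its formula is not given above.

```python
import math


def rope(vec, position, base=10000.0):
    """Rotate consecutive pairs of `vec` by angles that grow with `position`."""
    dim = len(vec)
    rotated = []
    for i in range(0, dim, 2):
        theta = position * base ** (-i / dim)
        cos_t, sin_t = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        rotated += [x * cos_t - y * sin_t, x * sin_t + y * cos_t]
    return rotated


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def uses_rope(layer_index, nope_every=4):
    """Hypothetical pattern: every `nope_every`-th layer has no position embedding."""
    return (layer_index + 1) % nope_every != 0


query = [1.0, 0.0, 0.5, -0.5]
key = [0.3, 0.7, -0.2, 0.1]

# Same relative offset (2 positions) at very different absolute positions.
score_near = dot(rope(query, 5), rope(key, 3))
score_far = dot(rope(query, 1005), rope(key, 1003))
print(f"near: {score_near:.6f}  far: {score_far:.6f}")
print([uses_rope(i) for i in range(8)])
```

The two scores agree because RoPE attention depends only on the relative offset between query and key, which is one reason rotary embeddings are a common starting point for length extrapolation.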
Training methods and performance
The development of Llama 4 involved many improvements to the training process.
Pre-training phase:
- Data and scale: The training data totals more than 30 trillion tokens, more than twice as much as for Llama 3, and includes diverse text, image, and video datasets. It covers 200 languages, more than 100 of which have over 1 billion tokens each.
- Efficiency and precision: Training uses FP8 precision to increase compute efficiency (390 TFLOPs/GPU when training Behemoth on 32K GPUs), which Meta says does not sacrifice quality.
- Hyperparameter optimization: Meta developed a new technique called MetaP for reliably setting key hyperparameters (e.g., per-layer learning rates and initialization scales) and found that the chosen values transfer well across batch sizes, model widths, depths, and numbers of training tokens.
- Mid-training: After pre-training, Meta continued training with specialized recipes and datasets to extend long-context capabilities and improve core performance.
Post-training phase:
- New pipeline: Meta adopted a new sequence of lightweight supervised fine-tuning (SFT) -> online reinforcement learning (RL) -> lightweight direct preference optimization (DPO). Meta found that conventional SFT and DPO can over-constrain the model, limiting exploration during the RL stage and hurting reasoning, coding, and math ability.
- Data filtering: Using Llama models as a judge, Meta removed more than 50% of the SFT data labeled as "easy" and ran the lightweight SFT on the remaining, harder set.
- Online reinforcement learning: Significant performance gains came from carefully selecting harder prompts during the multimodal online RL stage. Meta used a continuous online RL strategy that alternates between training the model and using it to filter the prompts, retaining only those of medium-to-hard difficulty, which balances compute against accuracy.
- Lightweight DPO: Finally, a lightweight DPO stage handles corner cases related to response quality, balancing the model's intelligence against its conversational abilities.
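For readers unfamiliar with the final stage, the standard DPO objective for one preference pair can be written directly from sequence log-probabilities. This is the generic published formulation, not Meta's "lightweight" variant, whose details are not described here; the `beta` value and the example log-probabilities are assumptions.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard DPO loss for one (chosen, rejected) response pair.

    Inputs are summed log-probabilities of each response under the policy
    being trained and under a frozen reference model.
    """
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logit = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(logit)) computed in a numerically stable way.
    return max(-logit, 0.0) + math.log1p(math.exp(-abs(logit)))


# Policy identical to the reference: the loss is log(2).
baseline = dpo_loss(-12.0, -15.0, -12.0, -15.0)
# Policy has shifted probability toward the chosen response: lower loss.
improved = dpo_loss(-10.0, -18.0, -12.0, -15.0)
print(f"baseline={baseline:.4f} improved={improved:.4f}")
```

Because the loss only rewards widening the gap relative to the reference model, a short DPO pass can adjust response preferences without retraining the capabilities learned during RL.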
Performance Comparison:
Meta published benchmark results showing:
- Llama 4 Maverick: Outperforms GPT-4o and Gemini 2.0 Flash on coding, reasoning, multilingual, long-context, and image benchmarks. Comparable in coding and reasoning to DeepSeek v3.1, which has a larger parameter count.
- Llama 4 Scout: Best-in-class performance, outperforming Gemma 3, Gemini 2.0 Flash-Lite, and Mistral 3.1. It also outperforms all previous generations of Llama models, including in image grounding.
It is important to note that benchmark results are typically obtained under specific conditions and performance in real-world applications may vary. However, these data at least indicate that Llama 4 has the potential to compete with industry-leading models on a number of dimensions.
Behemoth: The "Teacher" of 2 Trillion Parameters and the Training Challenge
Llama 4 Behemoth exists not only to demonstrate Meta's ability to train very large models, but more importantly to act as a "teacher" in the development of Llama 4. Through codistillation, Behemoth transferred its strengths in math, multilingual tasks, and image understanding to the smaller Maverick model. Meta developed a new distillation loss function that dynamically adjusts the weights of the soft targets (probability distributions from the teacher model) and the hard targets (the ground-truth labels) during training. Performing this codistillation during pre-training amortizes the significant compute cost of calculating distillation targets for the student model.
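The idea of weighting soft and hard targets can be sketched as follows. Meta's actual loss and its weighting rule are not published in this article, so the cross-entropy form below is a common textbook choice and the linear `soft_weight_schedule` is purely hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(student_logits, teacher_logits, label, soft_weight):
    """Blend a soft-target term (teacher distribution) with a hard-target term (label)."""
    student = softmax(student_logits)
    teacher = softmax(teacher_logits)
    soft = -sum(t * math.log(s) for t, s in zip(teacher, student))
    hard = -math.log(student[label])
    return soft_weight * soft + (1.0 - soft_weight) * hard


def soft_weight_schedule(step, total_steps, start=0.9, end=0.1):
    """Hypothetical schedule: lean on the teacher early, on the labels later."""
    return start + (end - start) * step / total_steps


student_logits = [1.0, 0.5, -0.5]
teacher_logits = [2.0, 0.1, -1.0]
for step in (0, 50, 100):
    weight = soft_weight_schedule(step, 100)
    loss = distillation_loss(student_logits, teacher_logits, label=0, soft_weight=weight)
    print(f"step={step:3d} soft_weight={weight:.2f} loss={loss:.4f}")
```

Any rule that changes `soft_weight` over training fits the description of a dynamic weighting; which signal drives the adjustment in Llama 4 is not stated.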
Training a model with 2 trillion parameters is a huge engineering challenge in itself. Meta shared some lessons from the post-training phase:
- Data pruning: To maximize performance, the SFT data had to be pruned more aggressively than for the smaller models, by as much as 95%, to focus on high-quality, difficult data.
- Reinforcement learning: Large-scale RL after the lightweight SFT proved particularly important for improving the model's reasoning and coding capabilities. The RL recipe selects hard prompts through pass@k analysis and builds a training curriculum of increasing difficulty. Dynamically filtering out prompts with zero advantage and mixing prompts from different capability areas into each training batch were critical for improving math, reasoning, and coding performance.
- Instruction following: Sampling from a diverse set of system instructions was essential to ensure that the model keeps its instruction-following ability on reasoning and coding tasks.
- Infrastructure: For the MoE architecture at this scale, Meta optimized its parallelization design and developed a fully asynchronous online RL training framework. The framework allows different models to be flexibly allocated to different GPUs, balancing resources based on computational speed, and reportedly improves training efficiency by about 10x over the previous generation.
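The prompt selection steps in the RL bullet above can be sketched with the widely used unbiased pass@k estimator. This is an illustration under assumptions, not Meta's pipeline: rewards are taken to be binary, "zero advantage" is interpreted as all sampled responses receiving the same reward (as with a group-mean baseline), and the `max_pass_at_k` threshold and prompt names are hypothetical.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n sampled responses, c of them correct."""
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def has_zero_advantage(rewards):
    """If every sample gets the same reward, a mean baseline leaves no signal."""
    return len(set(rewards)) <= 1


def build_curriculum(samples, k, max_pass_at_k):
    """Drop uninformative or too-easy prompts, then order from easier to harder."""
    kept = []
    for prompt, rewards in samples.items():
        if has_zero_advantage(rewards):
            continue
        score = pass_at_k(len(rewards), sum(rewards), k)
        if score <= max_pass_at_k:
            kept.append((score, prompt))
    kept.sort(key=lambda item: -item[0])   # higher pass@k (easier) first
    return [prompt for _, prompt in kept]


# Hypothetical 0/1 rewards for four sampled responses per prompt.
samples = {
    "always_solved": [1, 1, 1, 1],
    "never_solved": [0, 0, 0, 0],
    "medium": [1, 1, 0, 0],
    "hard": [1, 0, 0, 0],
}
curriculum = build_curriculum(samples, k=2, max_pass_at_k=0.9)
print(curriculum)
```

Prompts that the model always or never solves contribute no gradient under this interpretation, so removing them spends the RL compute on prompts where sampled responses actually differ.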
Safety and Bias Mitigation
Like all developers of large models, Meta emphasizes its focus on safety and responsible AI in the development of Llama 4. Its approach covers the full lifecycle of model development:
- Pre- and post-training mitigations: Measures such as data filtering are used in the pre-training phase. In the post-training phase, a variety of techniques are applied, including adding the right amount of safety data at each stage so that the model conforms to usage policies.
- System-level tools: Meta open-sources a range of safety tools for developers to integrate:
  - Llama Guard: An input/output safety detection model based on the MLCommons hazard taxonomy.
  - Prompt Guard: A classifier model for detecting malicious prompts (e.g., jailbreak attacks) and prompt injection attacks.
  - CyberSecEval: An evaluation suite for assessing and mitigating the cybersecurity risks of generative AI.
  Meta emphasizes the customizability of these tools, allowing developers to tailor safety policies to their needs.
- Evaluation and red teaming: Meta systematically tests its models with automated and manual adversarial probing. It also developed a new approach called GOAT (Generative Offensive Agent Testing), which improves test coverage and efficiency by simulating multi-turn interactions from medium-skilled attackers, allowing human red team experts to focus on more novel risk areas.
Addressing bias
Meta recognizes the problem of bias that is prevalent in large language models, particularly the historical tendency to favor certain positions on controversial political and social topics, which stems in part from the inherent bias of Internet training data. The goal of Llama 4 is to remove model bias so that it can understand and articulate different sides of controversial issues without bias or judgment.
Meta claims that Llama 4 has made significant progress in this area, outperforming Llama 3 and performing comparably to Grok:
- On controversial political and social topics, the share of prompts the model refuses to answer dropped from 7% in Llama 3.3 to less than 2%.
- Refusals are more balanced: on a set of debated questions, unequal (one-sided) refusals occur for less than 1% of prompts.
- On a set of controversial political or social topics, Llama 4 responds with a strong political lean at a rate comparable to Grok, about half the rate of Llama 3.3.
Meta says it will continue to work to reduce the rate of bias in its models.
Openness, Usability and Ecosystems
Llama 4 Scout and Llama 4 Maverick are now available for download on llama.com and Hugging Face, and will be coming to all major cloud, data, and edge computing partners. Users can also try Meta AI, powered by Llama 4, in WhatsApp, Messenger, Instagram Direct, and on the Meta AI website.
Meta has re-emphasized its commitment to an open ecosystem, believing that openness drives innovation. The release of Llama 4, in particular its MoE architecture, native multimodal capabilities, and ultra-long context windows, certainly provides powerful new tools for AI developers and researchers. However, the actual performance of these advanced models, their ease of use, the cost of fine-tuning, and the potential risks they pose while driving "personalized experiences" remain to be tested and observed. Whether the Llama 4 series will truly usher in a new era of AI led by Meta is something the market will ultimately answer. Meta has also announced that it will share more information about its vision at LlamaCon on April 29th.