
Gemma 3 Technical Report Chinese version

Gemma 3 Key Message Summary

I. Key specifications

Item                      Details
Model sizes               Four versions with 1 billion to 27 billion parameters: 1B, 4B, 12B, 27B
Architecture              Decoder-only Transformer, inherited from Gemma 2, with several enhancements
Multimodality             Accepts text and image input; a custom SigLIP vision encoder encodes each image into 256 soft tokens
Context length            32K tokens for the 1B model, 128K tokens for the other models
Attention mechanism       5:1 interleaving of local and global attention layers; local layers have a span of 1024 tokens
Training method           Pre-training with knowledge distillation; instruction fine-tuning with an improved post-training recipe
Pre-training data         2T tokens (1B), 4T (4B), 12T (12B), 14T (27B)
Multilingual support      Same SentencePiece tokenizer as Gemini 2.0, with a 262k-entry vocabulary; supports many languages
Quantized versions        Released in several standard formats: per-channel int4, per-block int4, and switched fp8
Training infrastructure   TPUv4, TPUv5e, and TPUv5p with ZeRO-3 optimizer-state sharding and a data-replication strategy
Carbon footprint          1497.13 tCO2e for pre-training; Google data centers are carbon neutral

II. Advantages

  1. Excellent multimodal understanding
    • Gemma 3 integrates a vision encoder, can handle both text and image inputs, and excels at multimodal tasks.
      • Performs well on visual question-answering benchmarks such as DocVQA, InfoVQA, and TextVQA, especially on tasks that involve document understanding and reading text in images.
      • Compared to PaliGemma 2, Gemma 3 performs better on document understanding tasks such as DocVQA and InfoVQA.
      • The use of Pan & Scan (P&S) technology allows Gemma 3 to process images at near-native resolution, further enhancing its performance in visual language tasks.
  2. Powerful long context processing
    • Context length support for 128K tokens (32K for the 1B model) is advantageous when working with long text tasks.
      • Performs well in long context benchmarks such as RULER and MRCR.
      • The local/global attention layer interleaving pattern and the short span design of the local layer effectively control the memory consumption of the KV cache while maintaining the long context processing capability.
  3. Multilingualism enhancement
    • Uses a tokenizer that is better balanced for non-English languages and increases the amount of multilingual training data.
      • Performs well on multilingual benchmarks such as MGSM, Global-MMLU-Lite, and WMT24++.
      • Performs strongly on cross-lingual question answering and Indic-language generation tasks such as XQuAD and IndicGenBench.
  4. Increased math, reasoning and coding skills
    • The post-training method is optimized for mathematical, reasoning, and coding skills.
      • Excellent performance in math and code benchmarks such as MATH, GSM8K, and HumanEval.
      • Performs well in multitasking language comprehension benchmarks such as MMLU, MMLU-Pro, and AGIEval.
  5. High modeling efficiency
    • The 1B and 4B models have smaller parameter sizes and are suitable for deployment on resource-constrained devices.
    • Thanks to average pooling in the vision stack, the Gemma 3 4B and 12B models are roughly 10x cheaper to transfer than PaliGemma 2 9B and 27B at the same vision-encoder input resolution.

III. Disadvantages

  1. Limited knowledge of chemical, biological, radiological, and nuclear (CBRN)
    • Gemma 3 performed poorly in the CBRN knowledge assessment due to the lack of CBRN-related content in the pre-training data.
    • This means that in application scenarios involving CBRN domains, Gemma 3 may not provide accurate and reliable information.
  2. Possible performance bottlenecks when processing high-resolution images
    • While P&S techniques can alleviate the problems associated with non-square aspect ratios and high resolution images, frequent image cropping and scaling may affect inference speed.
    • Gemma 3 may face performance challenges in applications that require processing ultra-high-resolution images or real-time vision tasks.
  3. Understanding of particular areas may be inadequate
    • While Gemma 3 performs well in a number of areas, its understanding may be limited in certain highly specialized areas (e.g., law, medicine, etc.).
    • This requires further fine-tuning or domain adaptation based on specific application scenarios.
  4. There are certain memory and privacy risks
    • Large language models can produce near-copies of text seen during training, and Gemma 3 is no exception.
    • Although Gemma 3's memorization rate is lower than that of previous models, sensitive information should still be handled with care to avoid compromising user privacy.
  5. Ability to reason about complex cause and effect relationships needs to be improved
    • Gemma 3 may not perform as well as expected in tasks involving complex causal reasoning.
    • This requires further research and model improvement to improve its performance in causal reasoning tasks.

 

Gemma 3 Versions and Scores

The Gemma 3 series consists of four sizes, each released both as a pre-trained base version (pt suffix, for pre-trained) and as an instruction fine-tuned version (it suffix, for instruction-tuned), for a total of eight open-sourced models.

The largest model, Gemma 3-27B IT, takes about 54.8 GB at fp16 precision and roughly 27 GB after INT8 quantization, which fits on two RTX 4090s; after INT4 quantization it needs about 14 GB of VRAM and runs comfortably on a single RTX 4090.
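As a rough sanity check on those figures, weight memory scales with bits per parameter. The sketch below is a back-of-the-envelope estimate only: weights alone, assuming exactly 27B parameters, ignoring the KV cache and runtime overhead.

# Rough weight-only memory estimate for a ~27B-parameter model at different
# precisions. Illustrative arithmetic, not official requirements.
PARAMS = 27e9  # assumed parameter count

def weight_memory_gb(params: float, bits_per_weight: float) -> float:
    """Approximate weight storage in gigabytes."""
    return params * bits_per_weight / 8 / 1e9

for name, bits in [("fp16/bf16", 16), ("int8", 8), ("int4", 4)]:
    print(f"{name:>9}: ~{weight_memory_gb(PARAMS, bits):.1f} GB")
# fp16/bf16: ~54.0 GB   int8: ~27.0 GB   int4: ~13.5 GB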


The model has also been received very well: as of March 8, 2025 it scored 1338 Elo in the blind Chatbot Arena, ranking 9th in the world, just behind o1-2024-12-17 and ahead of models such as Qwen 2.5-Max and DeepSeek V3.

According to Google, Gemma 3 is a significant upgrade: the Gemma 3-4B model approaches the level of Gemma 2-27B, while Gemma 3-27B approaches Gemini 1.5-Pro.


 

Try Gemma 3

https://aistudio.google.com/prompts/new_chat?model=gemma-3-27b-it

https://ollama.com/library/gemma3

 

Summary

Gemma 3 is a powerful multimodal large language model that excels in multimodal understanding, long-context processing, multilingual capabilities, mathematics, reasoning, and code. However, it still has room for improvement in CBRN knowledge, high-resolution image processing, domain-specific understanding, memorization and privacy risks, and complex causal reasoning.


 

Full technical report: https://storage.googleapis.com/deepmind-media/gemma/Gemma3Report.pdf

We introduce Gemma 3, a multimodal addition to the Gemma family of lightweight open models, ranging in scale from 1 billion to 27 billion parameters. This version introduces vision understanding, wider language coverage, and longer context lengths of at least 128K tokens. We also modify the model architecture to reduce the KV-cache memory that grows sharply with context length, by increasing the ratio of local to global attention layers and keeping the span of local attention short. The Gemma 3 models are trained with knowledge distillation and outperform Gemma 2 in both the pre-trained and instruction fine-tuned versions. In particular, our novel post-training recipe significantly improves math, chat, instruction following, and multilingual abilities, making Gemma 3-4B-IT competitive with Gemma 2-27B-IT and Gemma 3-27B-IT comparable to Gemini-1.5-Pro across benchmarks. We release all models to the community.

 

1. Introduction

We present the latest version of the Gemma open-source language model (Gemma Team, 2024a), which was co-designed with the Gemini family of frontier models (Gemini Team, 2023). This new version is comparable in size to Gemma 2 (Gemma Team, 2024b) and adds a 1 billion parameter model. These models are intended to run on standard consumer-grade hardware such as cell phones, laptops, and high-end GPUs. This release brings several new features to the Gemma family, namely multimodality, long contexts, and multilingualism, while maintaining or exceeding the performance of previous releases.

In terms of multimodality, most Gemma 3 models are compatible with a customized SigLIP vision encoder (Zhai et al., 2023). The language model treats images as sequences of soft tokens encoded by SigLIP. We reduce the inference cost of image processing by condensing the vision embeddings into 256 vectors of fixed size. The encoder works at a fixed resolution, and we take inspiration from LLaVA (Liu et al., 2024) to enable flexible resolutions through a Pan & Scan (P&S) method.

The second main architectural improvement is increasing the context size to 128K tokens without harming performance. One challenge with long contexts is the memory explosion of the KV cache during inference. To reduce this issue, we interleave multiple local layers between each global layer and give the local layers a span of only 1024 tokens. Therefore, only the global layers attend to the long context, and we have 1 global layer for every 5 local layers.

The pre-training optimization recipe is similar to that of Gemma 2, with some modifications to the architecture design. We use the same tokenizer as Gemini 2.0, and we also revisit our data mixture to improve the multilingual capabilities of the models while introducing image understanding. All Gemma 3 models are trained with knowledge distillation (Hinton et al., 2015).

In post-training, we focus on improving math, reasoning, and chat abilities, as well as integrating the new Gemma 3 capabilities, long context, and image inputs. We use a novel post-training approach that brings gains across all capabilities: math, coding, chat, instruction following, and multilingual. The resulting Gemma 3 instruction fine-tuned models are both powerful and versatile, outperforming their predecessors by a wide margin on benchmarks.

In the following sections, we give a brief overview of our models, including the architecture and the pre- and post-training recipes. We also provide detailed evaluations across a variety of quantitative and qualitative benchmarks. We discuss our approach to safe and responsible deployment and outline the broader implications of Gemma 3, its limitations, and its advantages.


Here is a breakdown of the cost and tip calculation based on the receipt:

1. Identify the cost of the sliced meat dish: the receipt lists "1x Zurcher Geschnetzeltes + Rosti" at CHF 36.50. This is sliced meat served with fried potatoes (Rosti).
2. Calculate an 18% tip: tip = CHF 36.50 * 0.18 = CHF 6.57.
3. Calculate the total: total = CHF 36.50 + CHF 6.57 = CHF 43.07.

 

2. Model architecture

The Gemma 3 models follow the same general decoder-only Transformer architecture as previous generations (Vaswani et al., 2017), and most architecture elements are similar to the first two Gemma versions. We use Grouped-Query Attention (GQA) (Ainslie et al., 2023) together with post-norm and pre-norm RMSNorm (Zhang and Sennrich, 2019). Inspired by Dehghani et al. (2023), Wortsman et al. (2023), and the Chameleon team (2024), we replace the soft-capping of Gemma 2 with QK-norm. In this section, we highlight some key differences from previous generations.

5:1 local/global layer interleaving. We alternate between local sliding-window self-attention (Beltagy et al., 2020) and global self-attention (Luong et al., 2015), with a pattern of 1 global layer for every 5 local layers, starting with a local layer as the first layer of the model.
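To make the interleaving concrete, the sketch below lays out such a pattern; the total depth, window size, and variable names are illustrative, not the actual Gemma 3 configuration.

# Sketch of a 5:1 local/global attention layout: every 6th layer is global,
# the rest use sliding-window (local) attention over 1024 tokens.
NUM_LAYERS = 30      # hypothetical depth
PATTERN = 6          # 5 local layers followed by 1 global layer
LOCAL_WINDOW = 1024  # span of the local sliding-window attention

layer_types = ["global" if (i + 1) % PATTERN == 0 else "local"
               for i in range(NUM_LAYERS)]
# ['local', 'local', 'local', 'local', 'local', 'global', 'local', ...]
print(layer_types.count("local"), "local /", layer_types.count("global"), "global")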

Model   Vision encoder   Embedding parameters   Non-embedding parameters
1B      0                302M                   698M
4B      417M             675M                   3,209M
12B     417M             1,012M                 10,759M
27B     417M             1,416M                 25,600M

Long context. Gemma 3 models support a context length of 128K tokens, with the exception of the 1B model, which has a 32K context. We increase the RoPE base frequency of the global self-attention layers from 10k to 1M, and keep the frequency of the local layers at 10k. We follow a process similar to the positional interpolation of Chen et al. (2023) to extend the span of the global self-attention layers.

2.1 Vision modality

Vision encoder. We use a 400M variant of the SigLIP encoder (Zhai et al., 2023), a Vision Transformer (Dosovitskiy, 2020) trained with a variant of the CLIP loss (Radford et al., 2021). The Gemma vision encoder takes square images resized to 896x896 as input and is fine-tuned on data from visual assistant tasks. For simplicity, we share the vision encoder across the 4B, 12B, and 27B models and keep it frozen during training.

Pan & Scan (P&S). The Gemma vision encoder operates at a fixed resolution of 896x896. With non-square aspect ratios and high-resolution images, this can cause image distortion, making text unreadable or small objects disappear. We address this with an adaptive windowing algorithm applied at inference time. The algorithm segments the image into non-overlapping crops of equal size that cover the whole image and resizes each crop to 896x896 pixels before passing it to the encoder. This windowing is applied only when necessary, and the maximum number of crops is controlled. It is an inference-only optimization that can be disabled for faster inference.
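A hedged sketch of such a windowing step is given below. The grid-selection rule (ceiling of each dimension over 896, capped by max_crops) and the function name are illustrative assumptions; the report only states that the crops are equal-sized, non-overlapping, cover the image, and are limited in number.

# Illustrative pan-and-scan style cropping: split a non-square image into a
# small grid of equal, non-overlapping crops and resize each crop to the
# encoder's fixed 896x896 input.
import math
from PIL import Image

ENCODER_RES = 896

def pan_and_scan(img: Image.Image, max_crops: int = 4) -> list[Image.Image]:
    w, h = img.size
    cols = min(max(1, math.ceil(w / ENCODER_RES)), max_crops)   # assumed rule
    rows = min(max(1, math.ceil(h / ENCODER_RES)), max_crops)
    crops = []
    for r in range(rows):
        for c in range(cols):
            box = (c * w // cols, r * h // rows,
                   (c + 1) * w // cols, (r + 1) * h // rows)
            crops.append(img.crop(box).resize((ENCODER_RES, ENCODER_RES)))
    return crops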

Model   Type     # Chips   Data shards   Seq. shards   Replicas
1B      TPUv5e   512       16            16            2
4B      TPUv5e   2048      16            16            8
12B     TPUv4    6144      16            16            24
27B     TPUv5p   6144      24            8             32

2.2 Pre-training

We followed a pre-training recipe similar to Gemma 2 with knowledge distillation.

Training data. We pre-train our models on a slightly larger token budget than Gemma 2: 14T tokens for Gemma 3 27B, 12T for the 12B version, 4T for the 4B, and 2T for the 1B. The increase in tokens accounts for the mix of images and text used during pre-training. We also increase the amount of multilingual data to improve language coverage, adding both monolingual and parallel data, and we use a strategy inspired by Chung et al. (2023) to handle imbalances in language representation.

Tokenizer. We use the same tokenizer as Gemini 2.0: a SentencePiece tokenizer with split digits, preserved whitespace, and byte-level encodings (Kudo and Richardson, 2018). The resulting vocabulary has 262k entries. This tokenizer is more balanced for non-English languages.

Filtering. We use filtering techniques to reduce the risk of generating unwanted or unsafe utterances and to remove certain personal information and other sensitive data. We decontaminate evaluation sets from the pre-training data mixture and reduce the risk of recitation by minimizing the proliferation of sensitive outputs. We also apply a quality-reweighting step inspired by Sachdeva et al. (2024) to reduce the occurrence of low-quality data.

Knowledge distillation. We sample 256 logits per token, weighted by teacher probabilities. The student learns the teacher's distribution over these samples via a cross-entropy loss. The teacher's target distribution is set to zero probability for non-sampled logits and renormalized.
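A minimal sketch of this sampled-vocabulary distillation loss is shown below, assuming PyTorch tensors and illustrative shapes; the actual sampling and weighting details in Gemma 3 may differ.

# Distillation over a sampled subset of the vocabulary: keep 256 teacher
# logits per position, zero out the rest of the teacher distribution,
# renormalize, and train the student with cross-entropy on that subset.
import torch

K = 256  # sampled logits per token

def sampled_distill_loss(teacher_logits, student_logits):
    # teacher_logits, student_logits: [batch, seq, vocab]
    teacher_probs = teacher_logits.softmax(dim=-1)
    flat = teacher_probs.flatten(0, 1)                   # [B*S, V]
    idx = torch.multinomial(flat, K, replacement=False)  # sample ids per position
    idx = idx.view(*teacher_probs.shape[:-1], K)         # [B, S, K]

    t = teacher_probs.gather(-1, idx)                    # teacher mass on sampled ids
    t = t / t.sum(dim=-1, keepdim=True)                  # renormalize (others set to 0)

    s_logp = student_logits.log_softmax(dim=-1).gather(-1, idx)
    return -(t * s_logp).sum(dim=-1).mean()              # cross-entropy on the subset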

Memory footprint (GB), raw vs. quantized checkpoints:

Model      bf16   Int4 (per channel)   Int4 (per block)   fp8
1B          2.0    0.5                  0.7                1.0
1B  +KV     2.9    1.4                  1.6                1.9
4B          8.0    2.6                  2.9                4.4
4B  +KV    12.7    7.3                  7.6                9.1
12B        24.0    6.6                  7.1               12.4
12B +KV    38.9   21.5                 22.0               27.3
27B        54.0   14.1                 15.3               27.4
27B +KV    72.7   32.8                 34.0               46.1

2.3 Quantization-aware training

In addition to the raw checkpoints, we provide quantized versions of the models in different standard formats. These versions are obtained by fine-tuning each model for a small number of steps (typically 5,000) with quantization-aware training (QAT) (Jacob et al., 2018). We use the probabilities of the non-quantized checkpoints as targets and adapt the data to match the pre-training and post-training distributions. Based on the most popular open-source quantized inference engines (e.g., llama.cpp), we focus on three weight representations: per-channel int4, per-block int4, and switched fp8. In Table 3, we report the memory footprint of the raw and quantized models for each weight representation, with and without a KV cache for a sequence length of 32k tokens.
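To illustrate the difference between the two int4 formats named above, the sketch below quantizes a weight matrix either with one scale per output channel or with one scale per block of 32 weights. The block size, symmetric rounding, and NumPy layout are illustrative assumptions, not the exact scheme used for the released checkpoints.

# Per-channel vs. per-block symmetric int4 weight quantization (illustrative).
import numpy as np

def quantize_int4(w: np.ndarray, block: int | None = None):
    """Quantize a [out, in] weight matrix to int4 values with float scales."""
    if block is None:                       # per-channel: one scale per row
        groups = w.reshape(w.shape[0], 1, -1)
    else:                                   # per-block: one scale per `block` weights
        groups = w.reshape(w.shape[0], -1, block)
    scale = np.abs(groups).max(axis=-1, keepdims=True) / 7.0   # map to [-7, 7]
    q = np.clip(np.round(groups / scale), -8, 7).astype(np.int8)
    return q.reshape(w.shape), scale

w = np.random.randn(4, 64).astype(np.float32)
q_ch, s_ch = quantize_int4(w)              # per-channel scales: shape (4, 1, 1)
q_blk, s_blk = quantize_int4(w, block=32)  # per-block scales:   shape (4, 2, 1)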

2.4 Computing infrastructure

We train our models with TPUv4, TPUv5e, and TPUv5p, as shown in Table 2. Each model configuration is optimized to minimize the training step time. For the vision encoder, we pre-compute the embeddings of each image and train directly on the embeddings, adding no cost to the training of the language models.

Optimizer state is sharded using an implementation of ZeRO-3 (Ren et al., 2021). For multi-pod training, we perform data replica reduction over the data center network.

Table 4 | Formatting for Gemma IT models.

Context        Formatting
User turn      <start_of_turn>user
Model turn     <start_of_turn>model
End of turn    <end_of_turn>

Example conversation:
User: Who are you?
Model: My name is Gemma!
User: What is 2+2?
Model: 2+2=4.

Model input:
[BOS]<start_of_turn>user
Who are you?<end_of_turn>
<start_of_turn>model
My name is Gemma!<end_of_turn>
<start_of_turn>user
What is 2+2?<end_of_turn>
<start_of_turn>model

Model output:
2+2=4.<end_of_turn>

2.5 Carbon footprint

The carbon footprint of the pre-trained Gemma 3 model is 1497.13 tons of CO2e. This is estimated based on the hourly energy consumption reported by our TPU data centers and expanded to account for the additional energy consumed to create and maintain the data centers. Google data centers are carbon neutral, achieved through a combination of energy efficiency, renewable energy purchases and carbon offsets. This carbon neutrality applies to our training and the machines that run them.

 

3. Instruction fine-tuning

The pre-trained models are turned into instruction fine-tuned models with an improved post-training approach (see Table 6).

Techniques

Our post-training approach relies on an improved version of knowledge distillation (Agarwal et al., 2024; Anil et al., 2018; Hinton et al., 2015) from a large IT teacher, together with an RL fine-tuning phase based on improved versions of BOND (Sessa et al., 2024), WARM (Ram et al., 2024b), and WARP (Ram et al., 2024a).

Reinforcement learning objectives

We use a variety of reward functions to improve helpfulness, math, coding, reasoning, instruction following, and multilingual abilities, while minimizing model harmfulness. This includes learning from weight-averaged reward models trained with human feedback data (Ram et al., 2024b), code-execution feedback (Gehring et al., 2024), and ground-truth rewards for solving math problems (DeepSeek-AI, 2025; Lambert et al., 2024).

Data filtering

We carefully optimize the data used in post-training to maximize model performance. We filter out examples that show certain personal information, unsafe or toxic model outputs, mistaken self-identification data, and duplicated examples. Including subsets of data that encourage better in-context attribution, hedging, and refusals to minimize hallucinations also improves performance on factuality metrics, without degrading model performance on other metrics.

[BOS] token

For both PT and IT models, the text starts with a [BOS] token, which must be added explicitly, since the literal text "[BOS]" does not map to the [BOS] token. For instance, Flax has an option, add_bos = True, to add this token automatically during tokenization. Table 4 shows an example of the formatting for an IT model.

PT and IT formatting

All models share the same tokenizer, with some control tokens dedicated to IT formatting. A key difference is that PT models output an <eos> token at the end of generation, while IT models output an <end_of_turn> token, as shown for IT in Table 4. Fine-tuning either model type therefore also requires adding the corresponding end token.
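The snippet below sketches how a dialogue is assembled into the IT prompt format described above; it assumes the standard Gemma control tokens <start_of_turn> and <end_of_turn> and that [BOS] is prepended by the tokenizer (e.g. add_bos = True in Flax) rather than written as literal text.

# Build an IT-style prompt from a short conversation (illustrative helper).
def format_turn(role: str, text: str) -> str:
    return f"<start_of_turn>{role}\n{text}<end_of_turn>\n"

history = [("user", "Who are you?"),
           ("model", "My name is Gemma!"),
           ("user", "What is 2+2?")]

prompt = "".join(format_turn(role, text) for role, text in history)
prompt += "<start_of_turn>model\n"   # generation starts here
# An IT model ends its reply with <end_of_turn>; a PT model ends with <eos>.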

Rank  Model                                  Elo   95% CI   Open   Type    # params / # activated
1     Grok-3-Preview-02-24                   1412  +8/-10
1     GPT-4.5-Preview                        1411  +11/-11
3     Gemini-2.0-Flash-Thinking-Exp-01-21    1384  +6/-5
3     Gemini-2.0-Pro-Exp-02-05               1380  +5/-6
3     ChatGPT-4o-latest (2025-01-29)         1377  +5/-4
6     DeepSeek-R1                            1363  +8/-6    Yes    MoE     671B/37B
6     Gemini-2.0-Flash-001                   1357  +6/-5
8     o1-2024-12-17                          1352  +4/-6
9     Gemma-3-27B-IT                         1338  +8/-9    Yes    Dense   27B
9     Qwen2.5-Max                            1336  +7/-5
9     o1-preview                             1335  +4/-3
9     o3-mini-high                           1329  +8/-6
13    DeepSeek-V3                            1318  +8/-6    Yes    MoE     671B/37B
14    GLM-4-Plus-0111                        1311  +8/-8
14    Qwen-Plus-0125                         1310  +7/-5
14    Claude 3.7 Sonnet                      1309  +9/-11
14    Gemini-2.0-Flash-Lite                  1308  +5/-5
18    Step-2-16K-Exp                         1305  +7/-6
18    o3-mini                                1304  +5/-4
18    o1-mini                                1304  +4/-3
18    Gemini-1.5-Pro-002                     1302  +3/-3
28    Meta-Llama-3.1-405B-Instruct-bf16      1269  +4/-3    Yes    Dense   405B
38    Llama-3.3-70B-Instruct                 1257  +5/-3    Yes    Dense   70B
39    Qwen2.5-72B-Instruct                   1257  +3/-3    Yes    Dense   72B
59    Gemma-2-27B-it                         1220  +3/-2    Yes    Dense   27B

Table 5 | Evaluation of the Gemma 3 27B IT model in Chatbot Arena (Chiang et al., 2024). All models are evaluated blindly, side by side, by human raters. Each model is assigned a score based on the Elo rating system. The Gemma-3-27B-IT numbers are preliminary results received on March 8, 2025.

 

4. Final model evaluation

In this section, we evaluate the IT model through a series of automated benchmarking and human assessments as well as static benchmarking (e.g., MMLU).

4.1 LMSYS Chatbot Arena

In this section, we report the performance of our IT 27B model in LMSys Chatbot Arena (Chiang et al., 2024), in blind side-by-side evaluations against other state-of-the-art models. Elo scores are reported in Table 5. Gemma 3 27B IT (1338) is among the top 10 models, scoring above other non-thinking open models such as DeepSeek-V3 (1318), LLaMA 3 405B (1257), and Qwen2.5-70B (1257), which are much larger. Finally, the Elo of Gemma 3 is significantly higher than that of Gemma 2 (1220). Note that Elo scores do not take visual abilities into account, which none of the above models have.

4.2 Standardized benchmarking

In Table 6, we show the performance of our final model in different benchmarks compared to our previous model iterations and Gemini 1.5. We do not directly compare with external models that often report their own evaluation setups, as running them in our setup does not guarantee a fair comparison. We encourage readers to follow third-party static leaderboards to more fairly compare different models. We include additional evaluations of our models on other benchmarks in the appendix.

                 Gemini 1.5      Gemini 2.0      Gemma 2              Gemma 3
                 Flash   Pro     Flash   Pro     2B     9B    27B     1B     4B     12B    27B
MMLU-Pro         67.3    75.8    77.6    79.1    15.6   46.8  56.9    14.7   43.6   60.6   67.5
LiveCodeBench    30.7    34.2    34.5    36.0    1.2    10.8  20.4    1.9    12.6   24.6   29.7
Bird-SQL (dev)   45.6    54.4    58.7    59.3    12.2   33.8  46.7    6.4    36.3   47.9   54.4
GPQA Diamond     51.0    59.1    60.1    64.7    24.7   28.8  34.3    19.2   30.8   40.9   42.4
SimpleQA         8.6     24.9    29.9    44.3    2.8    5.3   9.2     2.2    4.0    6.3    10.0
FACTS Grounding  82.9    80.0    84.6    82.8    43.8   62.0  62.4    36.4   70.1   75.8   74.9
MATH             77.9    86.5    90.9    91.8    27.2   49.4  55.6    48.0   75.6   83.8   89.0
HiddenMath       47.2    52.0    63.5    65.2    1.8    10.4  14.8    15.8   43.0   54.5   60.3
MMMU (val)       62.3    65.9    71.7    72.7    -      -     -       -      48.8   59.6   64.9
Table 6 | Performance comparison of instruction fine-tuned (IT) models against Gemini 1.5, Gemini 2.0, and Gemma 2 on zero-shot benchmarks.

 

5. Ablation studies

In this section, we focus on the impact of our architecture changes, as well as the new vision abilities of some of the models.

5.1 Pre-training ability probing

We use several standard benchmarks as probes during pre-training to ensure our models capture general abilities. In Figure 2, we compare the quality of pre-trained models from Gemma 2 and 3 across these general abilities, namely science, code, factuality, multilinguality, reasoning, and vision. Performance details for the different public benchmarks used in these plots are summarized in the appendix. Overall, we see that the new version improves in most categories, despite the addition of vision. We pay particular attention to multilinguality in this version, and this directly impacts the quality of our models. However, despite the use of decontamination techniques, there is always a risk of contamination of these probes (Mirzadeh et al., 2024), which makes it harder to draw definitive conclusions.

5.2 Local:global attention layers

We measured the impact of changes in the local and global self-attention layers on performance and memory consumption during inference.

Local:global ratio. In Figure 3, we compare different ratios of local to global attention layers. The Gemma 2 models use 1:1, and Gemma 3 uses 5:1. We observe that changing this ratio has minimal impact on perplexity.

Sliding window size. In Figure 4, we compare the perplexity for different sliding-window sizes of the local attention layers, across different global:local ratio configurations. The sliding window can be reduced significantly without affecting perplexity.

Impact on KV cache memory. In Figure 5, we show the balance between model memory and KV-cache memory for a 32k-token context. We consider a 2B model with different local-to-global ratios and sliding-window sizes (sw), and compare it to a global-only configuration, the standard used in Gemma 1 and Llama. The global-only configuration leads to a memory overhead of 60%, which is greatly reduced with a 1:3 ratio and a sliding window of 1024 (sw=1024). In Figure 6, we compute the memory used by the KV cache as a function of context length for our 2B architecture (L:G=5:1, sw=1024) versus a global-only 2B model.
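The back-of-the-envelope sketch below illustrates why the interleaving helps: global layers cache keys and values for the full context, while local layers cache at most their sliding window. Layer count, KV head count, head dimension, and dtype size are hypothetical; only the ratio/window logic follows the text.

# Rough KV-cache size estimate for global-only vs. interleaved local/global layers.
def kv_cache_bytes(context, num_layers, local_per_global, window,
                   kv_heads=8, head_dim=256, bytes_per_elem=2):
    per_token = 2 * kv_heads * head_dim * bytes_per_elem          # K and V
    num_global = num_layers // (local_per_global + 1)
    num_local = num_layers - num_global
    return (num_global * context + num_local * min(window, context)) * per_token

ctx = 32_768
global_only = kv_cache_bytes(ctx, 30, 0, ctx)     # every layer attends globally
interleaved = kv_cache_bytes(ctx, 30, 5, 1024)    # 5:1 pattern, sw=1024
print(f"global only: {global_only / 2**30:.1f} GiB, "
      f"5:1 sw=1024: {interleaved / 2**30:.1f} GiB")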

5.3 Enabling Long Contexts

Rather than training with 128k-token sequences from scratch, we pre-train our models with 32k sequences and then scale the 4B, 12B, and 27B models up to 128k tokens at the end of pre-training while rescaling RoPE (Chen et al., 2023). We find a scaling factor of 8 to work well in practice. Note that, compared to Gemma 2, we also increase the RoPE base frequency of the global self-attention layers from 10k to 1M, while keeping the frequency of the local self-attention layers at 10k. In Figure 7, we show the impact on perplexity for different context lengths. Our models generalize to 128k tokens, but perplexity degrades rapidly as we continue to scale beyond that.
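A simplified illustration of the two RoPE changes mentioned above is sketched below: the global layers use a raised base frequency of 1M, and positions are stretched by a factor of 8 in the positional-interpolation style when the context is extended from 32k to 128k. The head dimension and the exact form of the frequency schedule are illustrative, not Gemma 3 internals.

# Rotary-embedding angles under different base frequencies and a position-
# interpolation style rescaling (illustrative).
import numpy as np

def rope_angles(positions, head_dim=128, base=10_000.0, scale=1.0):
    inv_freq = 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))
    return np.outer(positions / scale, inv_freq)      # [num_positions, head_dim // 2]

pos = np.arange(131_072)                                       # 128k positions
local_angles = rope_angles(pos, base=10_000.0)                 # local layers: base 10k, unscaled
global_angles = rope_angles(pos, base=1_000_000.0, scale=8.0)  # global layers: base 1M, factor-8 rescaling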

5.4 Small versus large teachers

A common finding is that, to train a small model, it is preferable to distill from a smaller teacher.


We suspect this is because such studies are often carried out in settings where the regularization effect of using a worse teacher outweighs the benefit of using a better teacher. We train students with two teachers of different sizes, one large and one small, for different training durations. In Figure 8, we observe that for short training durations the smaller teacher is better, but the trend reverses for longer training.

5.5 Vision encoder

Impact of image resolution. We use a vision encoder based on SigLIP (Zhai et al., 2023). The vision encoder is frozen, and only the language model is trained. Each image in the multimodal data is represented by 256 image tokens from the corresponding vision encoder. Higher-resolution encoders therefore use average pooling to reduce their output to 256 tokens. For instance, the 896-resolution encoder applies 4x4 average pooling to its output. As shown in Table 7, higher-resolution encoders perform better than lower-resolution ones.
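The sketch below shows how such a reduction to 256 image tokens could look, assuming a patch size of 14 (so 896/14 = 64 patches per side) and an embedding width of 1152; both values are assumptions for illustration rather than confirmed Gemma 3 details.

# Reduce a grid of vision-encoder patch embeddings to 256 soft tokens with
# 4x4 average pooling (illustrative shapes).
import torch
import torch.nn.functional as F

patches_per_side = 896 // 14                      # 64 patches per side (assumed patch size 14)
emb = torch.randn(1, 1152, patches_per_side, patches_per_side)  # [B, C, 64, 64]

pooled = F.avg_pool2d(emb, kernel_size=4)         # [B, C, 16, 16]
image_tokens = pooled.flatten(2).transpose(1, 2)  # [B, 256, C] -> 256 soft tokens
print(image_tokens.shape)                         # torch.Size([1, 256, 1152])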


Pan & Scan (P&S). P&S makes it possible to process images close to their native aspect ratio and resolution. In Table 8, we compare our 27B IT model with and without P&S. As expected, the ability to treat images at close-to-native resolution greatly helps tasks that require some form of reading text in an image, which is particularly important for visual language models.

 

6. Memorization and privacy

Large language models may produce near-copies of some of the text used in their training (Biderman et al., 2023; Carlini et al., 2021, 2022; Ippolito et al., 2022; Nasr et al., 2023). Several previous reports have released audits that quantify this risk by measuring the memorization rate (Anil et al., 2023; Chowdhery et al., 2022; Gemini Team, 2023, 2024; Gemma Team, 2024a,b; LLaMa Team, 2024). This "memorization rate" is defined as the ratio of model generations that match the training data to all model generations, using the following setup. We follow the methodology described in Gemma Team (2024b) to measure it. Specifically, we subsample a large, uniformly distributed portion of the training data and test for discoverable extraction of this content using a prefix of length 50 and a suffix of length 50 (Nasr et al., 2023). Text is classified as "exactly memorized" if all tokens in the continuation match the source suffix, or as "approximately memorized" if they match within an edit distance of 10%.
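A hedged sketch of that discoverable-extraction check is given below. The model.generate call and tokenization are placeholders for whatever inference stack is used, and the 10% edit-distance threshold is applied at the token level as described; this is illustrative, not the exact evaluation harness.

# Classify whether a 50-token continuation reproduces the true training suffix.
import editdistance  # pip install editdistance

PREFIX_LEN, SUFFIX_LEN = 50, 50

def classify_memorization(model, sample_tokens):
    prefix = sample_tokens[:PREFIX_LEN]
    true_suffix = sample_tokens[PREFIX_LEN:PREFIX_LEN + SUFFIX_LEN]
    generated = model.generate(prefix, max_new_tokens=SUFFIX_LEN)  # hypothetical API
    if list(generated) == list(true_suffix):
        return "exact"
    if editdistance.eval(generated, true_suffix) <= 0.10 * SUFFIX_LEN:
        return "approximate"
    return "not memorized"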

Figure 9 compares the memorization rates of Gemma and Gemini models, arranged in reverse chronological order with the newest Gemma 3 models on the left. We find that Gemma 3 models memorize long-form text at a much lower rate than prior models (note the log y-axis). We observe only a marginal difference in memorization rates between the 4B, 12B, and 27B models, with 1B memorizing less than these larger models. Further, we find that a larger proportion of text is characterized as approximately memorized, with a relative increase of roughly 24x over exact memorization.

We also study the rate at which the generations may contain personal information. To identify potentially personal information, we use the Google Cloud Sensitive Data Protection (SDP) service. SDP uses broad detection rules to identify text that may contain personal information. SDP is designed to have high recall and does not take into account the context in which the information may appear, which leads to many false positives. Thus, we are likely overestimating the true amount of potentially personal information contained in the outputs. SDP also provides broad severity levels: low, medium, and high. We classify text as personal if SDP flags it as personal information at any severity level. We observed no personal information in the outputs of any Gemma 3 model. This indicates that the rate of personal data in outputs classified as memorized is below our detection thresholds.


 

7. Responsibility, safety, security

Responsibility, safety, and security are of paramount importance in the development of Gemma models. To reduce risks to Gemma 3 users, we have continued to integrate enhanced internal safety processes that span the development workflow, in line with recent Google AI models (Gemini Team, 2024). This focuses on safety mitigations during training, as well as robust and transparent model evaluations of the new image-to-text capabilities.

7.1 Governance and assessment

Our approach to assessing the benefits and risks of Gemma mirrors the one outlined in Gemma 1 (Gemma Team, 2024a), taking into account the change in supported modalities. We continue to believe that openness in AI can spread the benefits of these technologies across society, but this must be weighed against the risk of harm that can be caused at both the individual and institutional level (Weidinger et al., 2021). Since the first Gemma release, we have seen these models drive a number of socially beneficial applications, such as our own ShieldGemma 2, a 4B image-safety classifier built on Gemma 3, which provides an off-the-shelf solution for image safety, outputting safety labels for the categories of dangerous content, sexual exploitation, and violence.

Releasing Gemma 3 models requires special attention to changes in model capabilities and close monitoring of the evolving risks of existing multimodal LLMs (Lin et al., 2024), as well as an understanding of how models are used in the wild. While we have not yet received any reports of malicious use of Gemma, we remain committed to investigating any such reports and working with the academic and developer communities, as well as conducting our own monitoring to flag such cases.

Despite the improved capabilities, we believe that this release will have a negligible impact on the overall risk landscape given the more robust open models available.

7.2 Security policy and mitigation measures during training

One of the key pillars of Gemma's approach to safety is alignment with Google's safety policies, in line with the Gemini models (Gemini Team, 2023). These policies are designed to help prevent our models from generating harmful content, namely:

- Child sexual abuse and exploitation
- Disclosure of personally identifiable information that could lead to harm (e.g., social security number)
- Hate speech and harassment
- Dangerous or malicious content (including promotion of self-harm or instruction in harmful activities)
- Pornography
- Medical advice contrary to scientific or medical consensus

We perform extensive security filtering of our pre-training data to minimize the likelihood that pre-training and fine-tuning checkpoints produce harmful content. For fine-tuning models, we also use SFT and RLHF to steer the models away from undesirable behaviors.

7.3 Safeguards Assessment

We also subject our IT models to a series of baseline assurance evaluations to understand the potential harms our models could cause. As we champion open models, we recognize that the irreversibility of releasing weights demands rigorous risk assessment. Our internal safety processes are designed accordingly, and for previous Gemma models we have also carried out evaluations of capabilities relevant to extreme risks (Phuong et al., 2024; Shevlane et al., 2023). As we continue to develop and share open models, we follow the heuristic that thoroughly evaluating a more capable model often provides sufficient assurance for less capable ones. As such, we prioritized a streamlined evaluation set for Gemma 3, reserving in-depth dangerous-capability assessments for cases where a specific model may present a potentially elevated risk (as described below for the CBRN evaluations). We balance development speed with targeted safety testing, ensuring our assessments are focused and efficient, while upholding the commitments laid out in our Frontier Safety Framework.

Baseline assessment

Baseline assurance evaluations capture the model's violation rate for safety policies, using a large number of synthetic adversarial user queries, with human raters labeling whether the answers violate those policies. Overall, Gemma 3 shows a very low violation rate across these safety policies.

Chemical, biological, radiological and nuclear (CBRN) knowledge

Because of the stronger performance on STEM-related tasks, we assessed knowledge relevant to biological, radiological, and nuclear risks using closed-ended, knowledge-based multiple-choice questions on an internal dataset. For the evaluation of chemical knowledge, we used a closed-ended, knowledge-based assessment of chemical hazards developed by Macknight et al. Our evaluation shows that the Gemma 3 models' knowledge in these domains is low.

7.4 Our Responsible Open Modeling Approach

Designing safe, secure, and responsible applications requires a system-level approach that strives to mitigate the risks associated with each particular use case and environment. We will continue to employ assessments and security mitigations that are proportional to the potential risks of our models, and will only share these with the community when we are confident that the benefits clearly outweigh the foreseeable risks.

 

8. Discussion and conclusions

In this work, we presented Gemma 3, the newest member of the Gemma family of open language models for text, images, and code. In this version, we focused on adding image understanding and long context while improving multilinguality and STEM-related abilities. Our model sizes and architectures are designed to be compatible with standard hardware, and most of our architecture improvements are tailored to this hardware while preserving performance.
