---
language:
  - en
  - zh
  - ja
  - ko
  - de
  - fr
  - es
  - pt
  - it
  - ar
  - hi
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: mlx
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31B-it
tags:
- mlx
- gemma4
- ple-safe
- quantized
- apple-silicon
- vision
---

# gemma-4-31b-it-MLX-bf16

**PLE-safe** MLX bf16 weights for Google Gemma 4 31B (31B dense) on Apple Silicon.

- 📦 Source & convert scripts: [GitHub: FakeRocket543/mlx-gemma4](https://github.com/FakeRocket543/mlx-gemma4)
- 📊 Size: **62.5 GB**

> ⚠️ **Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output** because they quantize the PLE (Per-Layer Embedding) layers. The 4-bit and 8-bit repos listed under Other Precisions are quantized PLE-safe; this repo holds the unquantized bf16 weights. See [Why PLE-Safe?](#why-ple-safe) below.

## Other Precisions

- [4bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-4bit)
- [8bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-8bit)
- **bf16** ← you are here

## All Gemma 4 MLX Models

| Model | Params | Precision | Size | Audio |
|---|---|---|---|---|
| [gemma-4-e2b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-4bit) | 2.3B | 4bit | 7.1 GB | ✅ |
| [gemma-4-e2b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-8bit) | 2.3B | 8bit | 8.5 GB | ✅ |
| [gemma-4-e2b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-bf16) | 2.3B | bf16 | 9.6 GB | ✅ |
| [gemma-4-e4b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-4bit) | 4.5B | 4bit | 10.3 GB | ✅ |
| [gemma-4-e4b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-8bit) | 4.5B | 8bit | 12.3 GB | ✅ |
| [gemma-4-e4b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-bf16) | 4.5B | bf16 | 16.0 GB | ✅ |
| [gemma-4-26b-a4b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit) | 26B MoE | 4bit | 16.4 GB | - |
| [gemma-4-26b-a4b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-8bit) | 26B MoE | 8bit | 28.6 GB | - |
| [gemma-4-26b-a4b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-bf16) | 26B MoE | bf16 | 51.6 GB | - |
| [gemma-4-31b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-4bit) | 31B dense | 4bit | 20.4 GB | - |
| [gemma-4-31b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-8bit) | 31B dense | 8bit | 35.1 GB | - |
| [gemma-4-31b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-bf16) | 31B dense | bf16 | 62.5 GB | - |

## Precision

Full bf16 weights, no quantization applied.

## Why PLE-Safe?

Gemma 4 uses a novel **PLE (Per-Layer Embeddings)** architecture with `ScaledLinear` layers that multiply their outputs by a learned scalar. Standard quantization introduces rounding error in these layers, and the scalar amplifies it, producing garbage output such as `ionoxffionoxff...`.
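
The amplification effect can be illustrated with a toy calculation. This is a minimal sketch, not the MLX quantizer: it assumes a simple uniform 4-bit quantizer over one weight vector, and the scalar values `1.0`, `8.0`, and `64.0` are arbitrary stand-ins for the learned scalar, whose real value is not stated here.

```python
import random


def quantize_dequantize(weights, bits=4):
    """Snap each weight to one of 2**bits evenly spaced levels (toy quantizer)."""
    lo, hi = min(weights), max(weights)
    levels = (1 << bits) - 1
    step = (hi - lo) / levels
    return [lo + round((w - lo) / step) * step for w in weights]


def dot(a, b):
    """Plain dot product, standing in for one output unit of a linear layer."""
    return sum(x * y for x, y in zip(a, b))


random.seed(0)
weights = [random.uniform(-1.0, 1.0) for _ in range(64)]
inputs = [random.uniform(-1.0, 1.0) for _ in range(64)]

exact = dot(weights, inputs)
approx = dot(quantize_dequantize(weights), inputs)
base_error = abs(exact - approx)

# A ScaledLinear-style layer multiplies the output by a scalar, so the
# absolute rounding error is multiplied by the same factor.
for scalar in (1.0, 8.0, 64.0):  # hypothetical scalar values
    scaled_error = abs(scalar * exact - scalar * approx)
    print(f"scalar={scalar:5.1f}  absolute error={scaled_error:.6f}")
```

The absolute error grows linearly with the scalar, which is one plausible reading of why these layers are kept in bf16 rather than quantized.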

**Our fix:** In the 4-bit and 8-bit variants, only the large decoder `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. Everything else stays bf16:

| Quantized (4-bit/8-bit variants) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
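
The split in the table can be expressed as a predicate that decides, per layer, whether to quantize. The sketch below is an illustration under stated assumptions, not the repo's actual convert script: the path fragments, class names, module paths, and stand-in classes are all hypothetical and chosen only to mirror the table. It is written in plain Python so it runs without MLX.

```python
# Assumed name fragments and class names, taken from the table above rather
# than from the real conversion code.
SKIP_PATH_FRAGMENTS = ("embed_tokens", "per_layer_", "vision", "audio")
SKIP_CLASS_NAMES = ("ScaledLinear", "ScaledEmbedding")


def ple_safe_predicate(path, module):
    """Return True only for layers that should be quantized."""
    # Layers without a to_quantized method (norms, scalars) cannot be quantized.
    if not hasattr(module, "to_quantized"):
        return False
    # Keep the PLE pathway and scaled embeddings in bf16.
    if type(module).__name__ in SKIP_CLASS_NAMES:
        return False
    # Keep embeddings, per-layer embeddings, and vision/audio towers in bf16.
    return not any(fragment in path for fragment in SKIP_PATH_FRAGMENTS)


# Stand-in classes so the sketch runs without MLX installed.
class Linear:
    def to_quantized(self, group_size=64, bits=4):
        return self


class ScaledLinear(Linear):
    pass


class RMSNorm:
    pass


# Hypothetical module paths, for illustration only.
examples = [
    ("language_model.model.layers.0.self_attn.q_proj", Linear()),
    ("language_model.model.layers.0.mlp.down_proj", Linear()),
    ("language_model.model.embed_tokens", Linear()),
    ("language_model.model.per_layer_model_projection", ScaledLinear()),
    ("vision_tower.encoder.layers.0.mlp.fc1", Linear()),
    ("language_model.model.layers.0.input_layernorm", RMSNorm()),
]
decisions = {path: ple_safe_predicate(path, module) for path, module in examples}
for path, quantize in decisions.items():
    print(f"{'quantize' if quantize else 'keep bf16':9}  {path}")
```

With MLX installed, a function of this shape could be passed as the `class_predicate` argument of `mlx.nn.quantize`, which calls it with each module's path and the module itself. The convert script in the GitHub repo remains the authoritative version.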

## Usage

**Prerequisite:** Apply the `ScaledLinear` fix to mlx-vlm (required until the fix is merged upstream):

```bash
pip install mlx-vlm

# Apply fix
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cp mlx-gemma4/mlx_vlm_patches/models/gemma4/language.py \
   $(python -c "import mlx_vlm; print(mlx_vlm.__path__[0])")/models/gemma4/
```

**Important:** You must manually apply the chat template. `mlx_vlm.generate()` does not do this automatically for Gemma 4.

### Vision

```python
from mlx_vlm import load, generate

# Load the weights and the processor (tokenizer plus image preprocessing).
model, processor = load("FakeRockert543/gemma-4-31b-it-MLX-bf16")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
# Apply the chat template manually; generate() does not do it for Gemma 4.
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# Pass the image paths separately from the templated prompt.
out = generate(model, processor, prompt, ["photo.jpg"],
    max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```

### Text

```python
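# Reuses model, processor, and tokenizer loaded in the Vision example above.
messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
# temperature=0.0 selects greedy decoding for a deterministic answer.
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)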
```

## Bugs Fixed in mlx-vlm

| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module` not `nn.Linear` | `nn.quantize()` can't find these layers | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | Garbage output on 4-bit/8-bit | PLE-safe `class_predicate` skipping PLE/vision/audio |
| 3 | `processor.save_pretrained()` strips `feature_extractor` | Audio silently dropped | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | 26B-A4B: 49 GB instead of 16 GB | Check `hasattr(module, 'to_quantized')` |

Fixed source files are included in the [GitHub repo](https://github.com/FakeRocket543/mlx-gemma4/tree/main/mlx_vlm_patches).
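
The fix for bug 3 amounts to copying one file from the source checkpoint into the converted output. A minimal sketch follows, assuming both are local directories; the helper name `restore_processor_config` and the demo file contents are hypothetical and do not come from the repo.

```python
import shutil
import tempfile
from pathlib import Path


def restore_processor_config(source_dir, output_dir, filename="processor_config.json"):
    """Copy the original processor config over the one written during conversion."""
    source = Path(source_dir) / filename
    if not source.is_file():
        raise FileNotFoundError(f"{source} not found in the source checkpoint")
    target = Path(output_dir) / filename
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(source, target)
    return target


# Demo with throwaway directories and stand-in file contents.
with tempfile.TemporaryDirectory() as tmp:
    source_dir = Path(tmp) / "source"
    source_dir.mkdir()
    (source_dir / "processor_config.json").write_text('{"feature_extractor": {}}')

    copied = restore_processor_config(source_dir, Path(tmp) / "converted")
    copied_name = copied.name
    restored_text = copied.read_text()

print(copied_name, restored_text)
```

Running this step after `processor.save_pretrained()` would leave the converted directory with the same `processor_config.json` as the source, including the `feature_extractor` entry that audio input depends on.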

## Convert From Source

```bash
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cd mlx-gemma4
python convert_gemma4.py 31B bf16
```

## Validation

All 12 variants were validated on 10 images and 3 chat prompts. Full results: [GitHub](https://github.com/FakeRocket543/mlx-gemma4).

## License

Model weights: [Google Gemma License](https://ai.google.dev/gemma/terms). Scripts: MIT.