---
language:
- en
- zh
- ja
- ko
- de
- fr
- es
- pt
- it
- ar
- hi
license: apache-2.0
license_link: https://ai.google.dev/gemma/docs/gemma_4_license
library_name: mlx
pipeline_tag: image-text-to-text
base_model: google/gemma-4-31B-it
tags:
- mlx
- gemma4
- ple-safe
- quantized
- apple-silicon
- vision
---
# gemma-4-31b-it-MLX-bf16
**PLE-safe** MLX bf16 weights for Google Gemma 4 31B (31B dense) on Apple Silicon.
- Source & convert scripts: [GitHub: FakeRocket543/mlx-gemma4](https://github.com/FakeRocket543/mlx-gemma4)
- Size: **62.5 GB**
> ⚠️ **Existing MLX quantized Gemma 4 models (mlx-community, unsloth) produce garbage output** because they quantize the PLE (Per-Layer Embedding) layers. The 4-bit and 8-bit variants listed below provide working quantized weights; this bf16 variant is unquantized. See [Why](#why-ple-safe) below.
## Other Precisions
- [4bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-4bit)
- [8bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-8bit)
- **bf16** (you are here)
## All Gemma 4 MLX Models
| Model | Params | Precision | Size | Audio |
|---|---|---|---|---|
| [gemma-4-e2b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-4bit) | 2.3B | 4bit | 7.1 GB | ✅ |
| [gemma-4-e2b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-8bit) | 2.3B | 8bit | 8.5 GB | ✅ |
| [gemma-4-e2b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-e2b-it-MLX-bf16) | 2.3B | bf16 | 9.6 GB | ✅ |
| [gemma-4-e4b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-4bit) | 4.5B | 4bit | 10.3 GB | ✅ |
| [gemma-4-e4b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-8bit) | 4.5B | 8bit | 12.3 GB | ✅ |
| [gemma-4-e4b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-e4b-it-MLX-bf16) | 4.5B | bf16 | 16.0 GB | ✅ |
| [gemma-4-26b-a4b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-4bit) | 26B MoE | 4bit | 16.4 GB | ❌ |
| [gemma-4-26b-a4b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-8bit) | 26B MoE | 8bit | 28.6 GB | ❌ |
| [gemma-4-26b-a4b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-26b-a4b-it-MLX-bf16) | 26B MoE | bf16 | 51.6 GB | ❌ |
| [gemma-4-31b-it-MLX-4bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-4bit) | 31B dense | 4bit | 20.4 GB | ❌ |
| [gemma-4-31b-it-MLX-8bit](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-8bit) | 31B dense | 8bit | 35.1 GB | ❌ |
| [gemma-4-31b-it-MLX-bf16](https://huggingface.co/FakeRockert543/gemma-4-31b-it-MLX-bf16) | 31B dense | bf16 | 62.5 GB | ❌ |
## Precision
Full bf16 weights, no quantization applied.
## Why PLE-Safe?
Gemma 4 uses a novel **PLE (Per-Layer Embeddings)** architecture with `ScaledLinear` layers that multiply their outputs by a learned scalar. Standard quantization introduces rounding error in these layers, and the scalar amplifies it, producing garbage output such as `ionoxffionoxff...`.
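The amplification effect can be illustrated with a toy calculation. The sketch below is not Gemma 4 code: the weights, inputs, and the scalar value are made up, and `quantize_dequantize` is a hypothetical uniform rounding helper rather than MLX's quantizer. It only shows that the absolute rounding error of a linear layer grows in proportion to the scalar applied to its output.

```python
import math

def quantize_dequantize(values, bits):
    """Round each value to the nearest of 2**bits evenly spaced levels."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (2 ** bits - 1)
    return [lo + round((v - lo) / step) * step for v in values]

def scaled_linear(weights, inputs, scalar):
    """Dot product followed by multiplication with an output scalar."""
    return scalar * sum(w * x for w, x in zip(weights, inputs))

# Made-up numbers, for illustration only.
weights = [0.12, -0.07, 0.33, 0.05, -0.21, 0.18]
inputs = [1.0, 0.5, -1.0, 2.0, 0.25, -0.5]
scalar = 30.0  # hypothetical learned output scale

rounded = quantize_dequantize(weights, bits=4)

# Error before and after the output scalar is applied.
unscaled_error = abs(scaled_linear(rounded, inputs, 1.0)
                     - scaled_linear(weights, inputs, 1.0))
scaled_error = abs(scaled_linear(rounded, inputs, scalar)
                   - scaled_linear(weights, inputs, scalar))

print(f"error without scalar: {unscaled_error:.4f}")
print(f"error with scalar:    {scaled_error:.4f}")
```

The relative error is unchanged by the scalar; what grows is the absolute error that later layers receive.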
**Our fix:** In the 4-bit and 8-bit variants, only the large decoder `nn.Linear` and `SwitchLinear` (MoE expert) layers are quantized. Everything else stays bf16:
| Quantized (4-bit/8-bit variants) | Kept in bf16 |
|---|---|
| Attention projections (q/k/v/o_proj) | ScaledEmbedding (embed_tokens) |
| MLP layers (gate/up/down_proj) | ScaledLinear (PLE pathway) |
| MoE expert layers (SwitchLinear) | Per-layer embeddings (per_layer_*) |
| | Vision encoder |
| | All norms and scalars |
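The split in the table can be expressed as a predicate that decides, layer by layer, whether to quantize. The following is a minimal sketch, not the repo's actual convert script: the path fragments, the class-name list, and the example paths are assumptions derived from the table above, and the stand-in classes exist only so the sketch runs without MLX installed.

```python
# Assumed markers for layers that stay in bf16 (taken from the table above,
# not copied from the repo's convert script).
KEEP_BF16_PATH_FRAGMENTS = ("embed_tokens", "per_layer", "vision", "audio")
KEEP_BF16_CLASS_NAMES = ("ScaledLinear", "ScaledEmbedding")

def ple_safe_predicate(path: str, module: object) -> bool:
    """Return True only for layers that should be quantized."""
    # Norms and scalars have no to_quantized() method, so they are skipped.
    if not hasattr(module, "to_quantized"):
        return False
    # PLE-related classes stay in bf16 even though they are quantizable.
    if type(module).__name__ in KEEP_BF16_CLASS_NAMES:
        return False
    # Anything under an embedding, PLE, vision, or audio path stays in bf16.
    return not any(fragment in path for fragment in KEEP_BF16_PATH_FRAGMENTS)

# Stand-in classes (hypothetical) so the sketch runs without MLX.
class Linear:
    def to_quantized(self, group_size=64, bits=4):
        return self

class ScaledLinear(Linear):
    pass

class RMSNorm:
    pass

# Hypothetical module paths, for illustration only.
decisions = {
    "attention": ple_safe_predicate("language_model.layers.0.self_attn.q_proj", Linear()),
    "ple": ple_safe_predicate("language_model.per_layer_projection", ScaledLinear()),
    "vision": ple_safe_predicate("vision_tower.layers.0.mlp.fc1", Linear()),
    "norm": ple_safe_predicate("language_model.layers.0.input_layernorm", RMSNorm()),
}
print(decisions)
```

With MLX, a predicate of this shape can be passed to `nn.quantize` through its `class_predicate` argument, for example `nn.quantize(model, group_size=64, bits=4, class_predicate=ple_safe_predicate)`; the `group_size` and `bits` values here are illustrative.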
## Usage
**Prerequisite:** Apply the ScaledLinear fix to mlx-vlm (required until the fix is merged upstream):
```bash
pip install mlx-vlm
# Overwrite the installed gemma4/language.py with the patched copy
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cp mlx-gemma4/mlx_vlm_patches/models/gemma4/language.py \
  "$(python -c 'import mlx_vlm; print(mlx_vlm.__path__[0])')/models/gemma4/"
```
**Important:** You must manually apply the chat template. `mlx_vlm.generate()` does not do this automatically for Gemma 4.
### Vision
```python
from mlx_vlm import load, generate

# Load the weights and the processor (tokenizer plus image preprocessing)
model, processor = load("FakeRockert543/gemma-4-31b-it-MLX-bf16")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": [
    {"type": "image", "url": "photo.jpg"},
    {"type": "text", "text": "Describe this image in detail."},
]}]
# Apply the chat template manually; generate() does not do it for Gemma 4
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)

out = generate(model, processor, prompt, ["photo.jpg"],
               max_tokens=200, repetition_penalty=1.2, temperature=0.7)
print(out.text)
```
### Text
```python
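from mlx_vlm import load, generate
model, processor = load("FakeRockert543/gemma-4-31b-it-MLX-bf16")
tokenizer = processor.tokenizer

messages = [{"role": "user", "content": "What is the capital of France?"}]
prompt = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
out = generate(model, processor, prompt, max_tokens=100, temperature=0.0)
print(out.text)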
```
## Bugs Fixed in mlx-vlm
| # | Bug | Impact | Fix |
|---|---|---|---|
| 1 | `ScaledLinear` inherits `nn.Module` not `nn.Linear` | `nn.quantize()` can't find these layers | Change to `ScaledLinear(nn.Linear)` |
| 2 | Standard quantization quantizes PLE layers | Garbage output on 4-bit/8-bit | PLE-safe `class_predicate` skipping PLE/vision/audio |
| 3 | `processor.save_pretrained()` strips `feature_extractor` | Audio silently dropped | Copy `processor_config.json` from source |
| 4 | `SwitchLinear` (MoE) not quantized | 26B-A4B: 49 GB instead of 16 GB | Check `hasattr(module, 'to_quantized')` |
Fixed source files are included in the [GitHub repo](https://github.com/FakeRocket543/mlx-gemma4/tree/main/mlx_vlm_patches).
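The fix for bug 3 (copying `processor_config.json` from the source checkpoint) can be sketched as a small helper. This is an illustration under assumptions, not the repo's script: `restore_processor_config` is a hypothetical name, the directory layout is invented, and the JSON written in the demo is a placeholder rather than a real processor config.

```python
import shutil
import tempfile
from pathlib import Path

def restore_processor_config(source_dir, converted_dir,
                             filename="processor_config.json"):
    """Copy the original processor config over the converted one."""
    src = Path(source_dir) / filename
    if not src.is_file():
        raise FileNotFoundError(f"{src} not found in the source checkpoint")
    dst = Path(converted_dir) / filename
    shutil.copyfile(src, dst)
    return dst

# Demo with temporary directories and placeholder file contents.
with tempfile.TemporaryDirectory() as tmp:
    source = Path(tmp) / "source"
    converted = Path(tmp) / "converted"
    source.mkdir()
    converted.mkdir()
    (source / "processor_config.json").write_text('{"feature_extractor": {}}')
    # Simulate a save that dropped the feature_extractor entry.
    (converted / "processor_config.json").write_text("{}")

    restored = restore_processor_config(source, converted)
    restored_text = restored.read_text()
    print(restored_text)
```

Running a step like this after `processor.save_pretrained()` keeps the `feature_extractor` entry that the save would otherwise drop.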
## Convert From Source
```bash
git clone https://github.com/FakeRocket543/mlx-gemma4.git
cd mlx-gemma4
python convert_gemma4.py 31B bf16
```
## Validation
All 12 variants validated on 10 images + 3 chat prompts. Full results: [GitHub](https://github.com/FakeRocket543/mlx-gemma4).
## License
Model weights: [Google Gemma License](https://ai.google.dev/gemma/terms). Scripts: MIT.