How to use from
Unsloth Studio
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required
# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
Quick Links

Granite 4.1-30B โ€” Cerebellum GGUF

Ablation-guided mixed-precision quantization of ibm-granite/granite-4.1-30b. 30B parameters, dense architecture with GQA, 64 layers.

What is Cerebellum?

Instead of uniform quantization, we measure which weight groups survive aggressive compression and which don't. Groups that tolerate Q2_K get demoted; groups that don't stay at Q3_K_M or higher. The result: smaller files with less quality loss than uniform quants of the same size.

Files

File Size Description
Granite-4.1-30B-Cerebellum-v2.gguf 13 GB Optimal mix โ€” 3 groups demoted (attn_k, attn_q, attn_output), 4 kept at Q3_K_M
Granite-4.1-30B-Cerebellum-v1.gguf 12 GB Aggressive โ€” 5 groups demoted (all attn + ffn_gate)

Benchmarks

Evaluated using our standardized benchmark suite (ARC-Challenge, HellaSwag, MMLU, HumanEval) with temperature=0, no thinking mode.

Cerebellum v2 (13 GB) โ€” Recommended

Benchmark Score Questions
ARC-Challenge 91.6% 1,172
HellaSwag 88.9% 10,042
MMLU 73.5% 11,643
HumanEval 82.3% 164

Size vs Quality

Model Size BPW PPL (wiki)
Q3_K_M (baseline) 14 GB 3.94 8.3736
Cerebellum v2 13 GB 3.76 8.4912
Cerebellum v1 12 GB 3.50 9.1405

v2 saves 1 GB (7%) over Q3_K_M with only +1.4% perplexity increase โ€” and the 3 demoted groups actually improved perplexity individually during ablation.

Methodology

  1. Group ablation: Demote each of 7 weight groups (attn_k, attn_q, attn_v, attn_output, ffn_gate, ffn_up, ffn_down) to Q2_K individually. Measure PPL impact.
  2. Identify improvers: Three groups (attn_k, attn_q, attn_output) showed lower PPL when demoted โ€” the Q3_K_M precision was actually hurting these layers.
  3. Build optimal mix: v2 demotes only the 3 groups that improve; v1 additionally demotes attn_v and ffn_gate.

Ablation Results

Group PPL when demoted Delta vs baseline
attn_k 8.1639 -0.2097 (improved!)
attn_q 8.3144 -0.0592 (improved!)
attn_output 8.3539 -0.0197 (improved!)
attn_v 8.4713 +0.0977
ffn_gate 8.5967 +0.2231
ffn_up 9.0587 +0.6851
ffn_down 9.0190 +0.6454

v2 Override Map

Demoted (Q2_K): attn_k, attn_q, attn_output (all 64 layers)
Sacred (kept at Q3_K_M): attn_v, ffn_gate, ffn_up, ffn_down

Usage

Works with any llama.cpp-compatible tool:

# llama.cpp
./llama-server --model Granite-4.1-30B-Cerebellum-v2.gguf -ngl 99 --ctx-size 4096

# Ollama (create Modelfile pointing to the GGUF)
# LM Studio (drag and drop)
# koboldcpp, text-generation-webui, etc.

Hardware Requirements

  • v2 (13 GB): Fits in 16 GB VRAM with room for context. RTX 4060 Ti 16GB, RTX 3090, etc.
  • v1 (12 GB): Fits in 16 GB VRAM with generous context, or tight in 12 GB.

Credits

Quantized with Cerebellum โ€” ablation-guided mixed-precision quantization by deucebucket.
Base model by IBM Granite.

Downloads last month
80
GGUF
Model size
29B params
Architecture
granite
Hardware compatibility
Log In to add your hardware

We're not able to determine the quantization variants.

Inference Providers NEW
This model isn't deployed by any Inference Provider. ๐Ÿ™‹ Ask for provider support

Model tree for deucebucket/Granite-4.1-30B-Cerebellum-GGUF

Quantized
(28)
this model