Instructions to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with libraries, inference providers, notebooks, and local apps. Follow these links to get started.
- Libraries
- llama-cpp-python
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with llama-cpp-python:
# !pip install llama-cpp-python from llama_cpp import Llama llm = Llama.from_pretrained( repo_id="deucebucket/Granite-4.1-30B-Cerebellum-GGUF", filename="Granite-4.1-30B-Cerebellum-v1.gguf", )
llm.create_chat_completion( messages = "No input example has been defined for this model task." )
- Notebooks
- Google Colab
- Kaggle
- Local Apps Settings
- llama.cpp
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with llama.cpp:
Install from brew
brew install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF # Run inference directly in the terminal: llama-cli -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Install from WinGet (Windows)
winget install llama.cpp # Start a local OpenAI-compatible server with a web UI: llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF # Run inference directly in the terminal: llama-cli -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Use pre-built binary
# Download pre-built binary from: # https://github.com/ggerganov/llama.cpp/releases # Start a local OpenAI-compatible server with a web UI: ./llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF # Run inference directly in the terminal: ./llama-cli -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Build from source code
git clone https://github.com/ggerganov/llama.cpp.git cd llama.cpp cmake -B build cmake --build build -j --target llama-server llama-cli # Start a local OpenAI-compatible server with a web UI: ./build/bin/llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF # Run inference directly in the terminal: ./build/bin/llama-cli -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Use Docker
docker model run hf.co/deucebucket/Granite-4.1-30B-Cerebellum-GGUF
- LM Studio
- Jan
- Ollama
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Ollama:
ollama run hf.co/deucebucket/Granite-4.1-30B-Cerebellum-GGUF
- Unsloth Studio
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Unsloth Studio:
Install Unsloth Studio (macOS, Linux, WSL)
curl -fsSL https://unsloth.ai/install.sh | sh # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex # Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
Using HuggingFace Spaces for Unsloth
# No setup required # Open https://huggingface.co/spaces/unsloth/studio in your browser # Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting
- Pi
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Pi:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Configure the model in Pi
# Install Pi: npm install -g @mariozechner/pi-coding-agent # Add to ~/.pi/agent/models.json: { "providers": { "llama-cpp": { "baseUrl": "http://localhost:8080/v1", "api": "openai-completions", "apiKey": "none", "models": [ { "id": "deucebucket/Granite-4.1-30B-Cerebellum-GGUF" } ] } } }Run Pi
# Start Pi in your project directory: pi
- Hermes Agent new
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Hermes Agent:
Start the llama.cpp server
# Install llama.cpp: brew install llama.cpp # Start a local OpenAI-compatible server: llama-server -hf deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Configure Hermes
# Install Hermes: curl -fsSL https://hermes-agent.nousresearch.com/install.sh | bash hermes setup # Point Hermes at the local server: hermes config set model.provider custom hermes config set model.base_url http://127.0.0.1:8080/v1 hermes config set model.default deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Run Hermes
hermes
- Docker Model Runner
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Docker Model Runner:
docker model run hf.co/deucebucket/Granite-4.1-30B-Cerebellum-GGUF
- Lemonade
How to use deucebucket/Granite-4.1-30B-Cerebellum-GGUF with Lemonade:
Pull the model
# Download Lemonade from https://lemonade-server.ai/ lemonade pull deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Run and chat with the model
lemonade run user.Granite-4.1-30B-Cerebellum-GGUF-{{QUANT_TAG}}List all available models
lemonade list
Install Unsloth Studio (Windows)
irm https://unsloth.ai/install.ps1 | iex
# Run unsloth studio
unsloth studio -H 0.0.0.0 -p 8888
# Then open http://localhost:8888 in your browser
# Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chattingUsing HuggingFace Spaces for Unsloth
# No setup required# Open https://huggingface.co/spaces/unsloth/studio in your browser
# Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chattingGranite 4.1-30B โ Cerebellum GGUF
Ablation-guided mixed-precision quantization of ibm-granite/granite-4.1-30b. 30B parameters, dense architecture with GQA, 64 layers.
What is Cerebellum?
Instead of uniform quantization, we measure which weight groups survive aggressive compression and which don't. Groups that tolerate Q2_K get demoted; groups that don't stay at Q3_K_M or higher. The result: smaller files with less quality loss than uniform quants of the same size.
Files
| File | Size | Description |
|---|---|---|
Granite-4.1-30B-Cerebellum-v2.gguf |
13 GB | Optimal mix โ 3 groups demoted (attn_k, attn_q, attn_output), 4 kept at Q3_K_M |
Granite-4.1-30B-Cerebellum-v1.gguf |
12 GB | Aggressive โ 5 groups demoted (all attn + ffn_gate) |
Benchmarks
Evaluated using our standardized benchmark suite (ARC-Challenge, HellaSwag, MMLU, HumanEval) with temperature=0, no thinking mode.
Cerebellum v2 (13 GB) โ Recommended
| Benchmark | Score | Questions |
|---|---|---|
| ARC-Challenge | 91.6% | 1,172 |
| HellaSwag | 88.9% | 10,042 |
| MMLU | 73.5% | 11,643 |
| HumanEval | 82.3% | 164 |
Size vs Quality
| Model | Size | BPW | PPL (wiki) |
|---|---|---|---|
| Q3_K_M (baseline) | 14 GB | 3.94 | 8.3736 |
| Cerebellum v2 | 13 GB | 3.76 | 8.4912 |
| Cerebellum v1 | 12 GB | 3.50 | 9.1405 |
v2 saves 1 GB (7%) over Q3_K_M with only +1.4% perplexity increase โ and the 3 demoted groups actually improved perplexity individually during ablation.
Methodology
- Group ablation: Demote each of 7 weight groups (attn_k, attn_q, attn_v, attn_output, ffn_gate, ffn_up, ffn_down) to Q2_K individually. Measure PPL impact.
- Identify improvers: Three groups (attn_k, attn_q, attn_output) showed lower PPL when demoted โ the Q3_K_M precision was actually hurting these layers.
- Build optimal mix: v2 demotes only the 3 groups that improve; v1 additionally demotes attn_v and ffn_gate.
Ablation Results
| Group | PPL when demoted | Delta vs baseline |
|---|---|---|
| attn_k | 8.1639 | -0.2097 (improved!) |
| attn_q | 8.3144 | -0.0592 (improved!) |
| attn_output | 8.3539 | -0.0197 (improved!) |
| attn_v | 8.4713 | +0.0977 |
| ffn_gate | 8.5967 | +0.2231 |
| ffn_up | 9.0587 | +0.6851 |
| ffn_down | 9.0190 | +0.6454 |
v2 Override Map
Demoted (Q2_K): attn_k, attn_q, attn_output (all 64 layers)
Sacred (kept at Q3_K_M): attn_v, ffn_gate, ffn_up, ffn_down
Usage
Works with any llama.cpp-compatible tool:
# llama.cpp
./llama-server --model Granite-4.1-30B-Cerebellum-v2.gguf -ngl 99 --ctx-size 4096
# Ollama (create Modelfile pointing to the GGUF)
# LM Studio (drag and drop)
# koboldcpp, text-generation-webui, etc.
Hardware Requirements
- v2 (13 GB): Fits in 16 GB VRAM with room for context. RTX 4060 Ti 16GB, RTX 3090, etc.
- v1 (12 GB): Fits in 16 GB VRAM with generous context, or tight in 12 GB.
Credits
Quantized with Cerebellum โ ablation-guided mixed-precision quantization by deucebucket.
Base model by IBM Granite.
- Downloads last month
- 80
We're not able to determine the quantization variants.
Model tree for deucebucket/Granite-4.1-30B-Cerebellum-GGUF
Base model
ibm-granite/granite-4.1-30b
Install Unsloth Studio (macOS, Linux, WSL)
# Run unsloth studio unsloth studio -H 0.0.0.0 -p 8888 # Then open http://localhost:8888 in your browser # Search for deucebucket/Granite-4.1-30B-Cerebellum-GGUF to start chatting