---
license: apache-2.0
tags:
- uncensored
- gemma4
- moe
- gguf
- vision
- multimodal
- agentic
- coding
language:
- en
pipeline_tag: image-text-to-text
base_model: google/gemma-4-26B-A4B-it
---

# Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced

> **[Join the Discord](https://discord.gg/SZ5vacTXYf)** for updates, roadmaps, projects, or just to chat.

Gemma4-26B-A4B uncensored by HauhauCS. **0/465 Refusals***

**Release Candidate after over 1 month of nonstop work on this one.**

> **HuggingFace's "Hardware Compatibility" widget doesn't recognize K_P quants** — it may show fewer files than actually exist. Click **"View +X variants"** or go to **Files and versions** to see all available downloads.

## About

**GenRM Defeated!**

No changes to datasets or capabilities. Fully functional, 100% of what the original authors intended — just without the refusals.

These are meant to be the best lossless uncensored models out there.

## Balanced — Release Candidate

This legitimately took me over 1 month of non-stop work. Targeting 0 refusals in standard use, and that's what I'm seeing in testing (automated and manual) — a handful of edge-case prompts still deflect on first try but **follow through on a re-ask**. If you hit one Balanced won't get past, the Aggressive variant is coming once I figure out how to maintain lossless/near-lossless quality for it.

- **Balanced**: will reason through edgy requests, occasionally attach a short safety framing, then deliver the full answer. Output is complete, nothing held back, but it can talk itself into it first. **Recommended default — 99%+ of users will be happy here.** Best for **creative writing, RP, emotional intelligence**. Normally I'd also say "agentic coding/tool use" however in my in-depth testing, **Qwen3.6 has been net superior on such tasks**. Do be mindful of the few deflection categories I mentioned already.
- **Aggressive** *(separate release, WIP)*: strips the self-reasoning preamble and gives direct answers to any DEEPLY censored topics.

Balanced also has meaningfully more stable sampling across re-runs, which matters for long-context sessions — no sporadic topic drift deep into a session.

## Downloads

| File | Quant | BPW | Size |
|------|-------|-----|------|
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q8_K_P.gguf | Q8_K_P | 8.64 | 27 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q6_K_P.gguf | Q6_K_P | 7.21 | 23 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_P.gguf | Q5_K_P | 6.12 | 19 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q5_K_M.gguf | Q5_K_M | 6.06 | 19 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf | Q4_K_P | 5.36 | 17 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_M.gguf | Q4_K_M | 5.32 | 17 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ4_XS.gguf | IQ4_XS | 4.41 | 14 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q3_K_P.gguf | Q3_K_P | 4.25 | 13 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q3_K_M.gguf | Q3_K_M | 4.21 | 13 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ3_M.gguf | IQ3_M | 3.93 | 12 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q2_K_P.gguf | Q2_K_P | 3.39 | 11 GB |
| Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-IQ2_M.gguf | IQ2_M | 3.29 | 10 GB |
| mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf | mmproj (f16) | — | 1.2 GB |

BPW is slightly higher than nominal across the board because Gemma4 has a lot of per-layer norm/scale tensors kept at F32 (multiple post-ffw norms per layer).

All quants were generated with an importance matrix (imatrix) for optimal quality preservation on uncensored weights.

## What are K_P quants?

K_P ("Perfect") quants are HauhauCS custom quantizations that use **model-specific** analysis to selectively preserve quality where it matters most.
Each model gets its own optimized quantization profile — the top 25% most-important tensors (per imatrix calibration) are promoted to a higher quant type. A K_P quant effectively bumps quality up by 1-2 quant levels at only ~5-15% larger file size than the base quant.

Fully compatible with llama.cpp, LM Studio, and any GGUF-compatible runtime — no special builds needed.

**Note:** K_P quants may show as "?" in LM Studio's quant column. This is a display issue only — the model loads and runs fine.

## Why this model for agentic work

26B total params with only ~4B active per forward pass (top-8 of 128 experts). You get the reasoning footprint of a 26B with the throughput and per-token compute cost of a ~4B — which matters when you're chaining 10+ tool calls per task. Sliding-window attention (1024 tokens) plus periodic full attention keeps long contexts cheap without losing global coherence.

Balanced is calibrated for this. It removes refusals on security/ops/research-adjacent topics that block legitimate coding work, without bending the sampling geometry that keeps long chains coherent.

Recommended quant for most coding work: **Q4_K_P** (17 GB, fits in 24 GB VRAM with room for context) or **Q8_K_P** (27 GB) if you have more VRAM and want maximum quality with minimal offloading.

Do note: the main use case for Gemma4 is creative writing, roleplaying, and emotional intelligence.

## Specs

- 25.2B total / 3.8B active params (128 routed experts, top-8 + 1 shared expert)
- 30 layers, hybrid attention: 5× sliding-window (1024 tokens) → 1× full global, repeating. Uses Proportional RoPE (p-RoPE).
- Hidden dim 2816, FFN dim 2112, MoE expert FFN 704, vocab 262144
- Head dim 256 (SWA) / 512 (full), 16 attention heads, 8 KV heads (2 for full layers)
- 256K native context
- Natively multimodal (text + vision) — ships with mmproj. Variable visual token budgets: 70 / 140 / 280 / 560 / 1120 per image.
- Based on [google/gemma-4-26B-A4B-it](https://huggingface.co/google/gemma-4-26B-A4B-it)

## Recommended Settings

From the official Gemma authors:

**Inference parameters:**
- `temperature=1.0, top_p=0.95, top_k=64`

**Important:**
- Use `--jinja` with llama.cpp for proper chat template handling
- Vision support requires the `mmproj` file alongside the main GGUF. **Place images before text** in your prompt for best vision performance.
- Keep at least 32K context for serious agentic work; the model can take much more (256K native) if you need it
- Sliding window is baked into the architecture — no special flag needed

## Turning Thinking On/Off

Gemma4 has thinking mode controlled via `enable_thinking` in the chat template. It's the same pattern as Qwen3.6 — set `false` for faster, shorter replies and `true` (default) when you want chain-of-thought.

### LM Studio

1. Load the model
2. Right-side settings panel → **Model Settings** → **Prompt Template** (or **Chat Template Options**)
3. Set `enable_thinking` to `false` (or `true`) in the template kwargs

### llama.cpp

**llama-server — set as default for all requests:**
```bash
llama-server -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99 \
  --chat-template-kwargs '{"enable_thinking": false}'
```

**Per-request via the OpenAI-compatible API:**
```json
{
  "model": "gemma4-26b-a4b",
  "messages": [{"role": "user", "content": "..."}],
  "chat_template_kwargs": {"enable_thinking": false}
}
```

## Usage

Works with llama.cpp, LM Studio, Jan, koboldcpp, and other GGUF-compatible runtimes.
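
Once a server from the launch commands in this section is running, the per-request `enable_thinking` toggle can be sent from the shell. This is a minimal sketch, not an official client: `build_request` is a hypothetical helper name, the address assumes llama-server's default `http://localhost:8080`, and the prompt is assumed to contain no characters that need JSON escaping.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a chat-completions request body.
#   $1: "true" or "false" (value for enable_thinking)
#   $2: user prompt (assumed free of quotes/backslashes, since no JSON escaping is done)
build_request() {
  local thinking="$1" prompt="$2"
  printf '{"model":"gemma4-26b-a4b","messages":[{"role":"user","content":"%s"}],"chat_template_kwargs":{"enable_thinking":%s}}\n' \
    "$prompt" "$thinking"
}

# Print the body for a request with thinking disabled.
build_request false "Hello"

# To send it, pipe the body to curl (assumed default host/port; adjust to your launch flags):
# build_request false "Hello" | curl -s http://localhost:8080/v1/chat/completions \
#   -H 'Content-Type: application/json' -d @-
```

Printing the body separately from sending it makes it easy to inspect what the server receives before switching thinking on or off for a given call.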

**llama-server:**
```bash
llama-server -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99
```

**llama-cli:**
```bash
llama-cli -m Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-Q4_K_P.gguf \
  --mmproj mmproj-Gemma4-26B-A4B-Uncensored-HauhauCS-Balanced-f16.gguf \
  --jinja -c 32768 -ngl 99
```

## Other Models

- [HauhauCS on HuggingFace](https://huggingface.co/HauhauCS/models)

---

\* _Tested with both automated and manual refusal benchmarks; no refusals have been found in standard use. A small number of edge-case prompts deflect on the first ask but comply on a re-ask or with strategic framing. If you hit one that's actually obstructive to your use case, [join the Discord](https://discord.gg/SZ5vacTXYf) and flag it so I can work on it in a future revision._