Instructions to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with libraries, inference providers, notebooks, and local apps. Follow these links to get started.

Libraries

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with Transformers:

# Use a pipeline as a high-level helper
from transformers import pipeline

pipe = pipeline("image-text-to-text", model="AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
pipe(text=messages)

# Load model directly
from transformers import AutoProcessor, AutoModelForImageTextToText

processor = AutoProcessor.from_pretrained("AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
model = AutoModelForImageTextToText.from_pretrained("AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16")
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/p-blog/candy.JPG"},
            {"type": "text", "text": "What animal is on the candy?"}
        ]
    },
]
inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=40)
print(processor.decode(outputs[0][inputs["input_ids"].shape[-1]:]))

Local Apps

vLLM

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with vLLM:

Install from pip and serve model

# Install vLLM from pip:
pip install vllm
# Start the vLLM server:
vllm serve "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:8000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker

docker model run hf.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16

SGLang

How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with SGLang:

Install from pip and serve model

# Install SGLang from pip:
pip install sglang
# Start the SGLang server:
python3 -m sglang.launch_server \
    --model-path "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16" \
    --host 0.0.0.0 \
    --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Use Docker images

docker run --gpus all \
    --shm-size 32g \
    -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    --ipc=host \
    lmsysorg/sglang:latest \
    python3 -m sglang.launch_server \
        --model-path "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16" \
        --host 0.0.0.0 \
        --port 30000
# Call the server using curl (OpenAI-compatible API):
curl -X POST "http://localhost:30000/v1/chat/completions" \
	-H "Content-Type: application/json" \
	--data '{
		"model": "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16",
		"messages": [
			{
				"role": "user",
				"content": "What is the capital of France?"
			}
		]
	}'

Docker Model Runner
How to use AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 with Docker Model Runner:
```
docker model run hf.co/AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16
```
Browse Quantizations to use this model in llama.cpp, Ollama, LM Studio, or any compatible app.

Gemma-4-12B-it AEON Abliterated — K=4 Multi-Direction Biprojection (BF16)

✅ Verified working in native vLLM (2026-06-05)

Smoke-tested end-to-end via vLLM's native Gemma4UnifiedForConditionalGeneration loader in ghcr.io/aeon-7/aeon-vllm-ultimate:v0.22.2-pr44389-aeon-spark-gemma4unified on DGX Spark GB10. The loader for Google's encoder-free Gemma-4-12B was added upstream in gemma4_unified.py (after the initial PR #38826 merge). Our previous "blocked on architecture mismatch" notes are obsolete.

Single-stream by category (greedy, no spec decode, max_tokens=250)

Category	TTFT median	TPOT median	tok/s mean	tok/s median
summary	292 ms	101 ms	9.87	9.86
prose	293 ms	113 ms	8.86	8.85
dialogue	303 ms	123 ms	8.14	8.14
code	316 ms	142 ms	7.05	7.02
reasoning	337 ms	152 ms	6.80	6.61
math	304 ms	184 ms	5.53	5.45
OVERALL (24 prompts)	303 ms	126 ms	7.71	7.92

Concurrent aggregate throughput (FP8 KV cache, max-num-seqs=16)

Concurrency	Aggregate tok/s	Steady TTFT	Per-stream tok/s
1 (transformers baseline)	7.0	n/a	7.0
4 (BF16 KV, max-seqs=4)	39.3	308 ms	9.8
8 (FP8 KV, max-seqs=16)	69.8	333 ms	8.7
16 (FP8 KV, max-seqs=16)	144.4 ⚡	408 ms	9.0

20× speedup at concurrency=16 vs the transformers single-stream baseline. KV cache holds 538k tokens at 8k context (65× max concurrency theoretical, 16× shown above). Refusal removal confirmed (previously-refused chemistry-lab-safety prompt produces a full 7-section answer, no refusal preamble).
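
The headline figures in this paragraph can be re-derived from the tables above. The following is a small sketch of that arithmetic; it uses only numbers reported on this card, and the variable names are illustrative rather than part of any benchmark harness.

```python
# Numbers copied from the benchmark tables on this card.
baseline_tok_s = 7.0                               # transformers single-stream baseline
aggregate_tok_s = {4: 39.3, 8: 69.8, 16: 144.4}    # concurrency -> aggregate tok/s
kv_cache_tokens = 538_000                          # "538k tokens" (rounded figure)
max_model_len = 8192                               # 8k context

# Speedup of aggregate throughput over the single-stream baseline.
speedup = {c: t / baseline_tok_s for c, t in aggregate_tok_s.items()}

# Per-stream throughput is aggregate throughput divided by concurrency.
per_stream = {c: t / c for c, t in aggregate_tok_s.items()}

# Number of full-length (8k) sequences that fit in the KV cache at once.
max_full_sequences = kv_cache_tokens // max_model_len


def tpot_to_tok_s(tpot_ms: float) -> float:
    """TPOT is milliseconds per output token, so decode rate is 1000 / TPOT."""
    return 1000.0 / tpot_ms


print(f"speedup at 16 streams: {speedup[16]:.1f}x")
print(f"full-length sequences in KV cache: {max_full_sequences}")
print(f"decode rate implied by 101 ms TPOT: {tpot_to_tok_s(101):.2f} tok/s")
```

The implied decode rate from TPOT is slightly higher than the measured tok/s in the table, which is expected if the measured figure also includes time to first token.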

Suggested serve command (DGX Spark GB10)
docker run -d --name aeon-gemma12b --gpus all --ipc=host --shm-size=16g --net=host \
  -v /path/to/Gemma-4-12B-AEON-K4:/model:ro \
  --entrypoint vllm \
  ghcr.io/aeon-7/aeon-vllm-ultimate:latest \
  serve /model \
    --served-model-name aeon-gemma12b \
    --dtype bfloat16 \
    --kv-cache-dtype fp8_e4m3 \
    --max-model-len 8192 \
    --max-num-seqs 16 \
    --gpu-memory-utilization 0.85 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code
Add processor_config.json to the model dir before serving (vLLM's multimodal init requires it):
huggingface-cli download google/gemma-4-12B-it processor_config.json --local-dir .

Available formats / quant grid

Variant	Repo	Precision	Size	Pick when
FP8	`…-K4-FP8`	FP8 E4M3	13 GB	Quality matters — near-lossless, matches BF16
Mixed NVFP4+FP8	`…-K4-NVFP4-FP8`	NVFP4 MLP + FP8 attn	9.3 GB	Smallest + fastest — MLP-only quality, 20% less size, 34% faster
NVFP4 MLP-only	`…-K4-NVFP4`	NVFP4 MLP + BF16 attn	11.7 GB	Superseded by Mixed NVFP4+FP8 (above)
BF16 (this)	`…-K4-BF16`	bfloat16	24 GB	Fine-tuning, non-Blackwell hardware

Vision and audio encoders (embed_vision*, embed_audio*) and lm_head are kept at full BF16 in the SVDQuant variant — only language-decoder linears are quantized.

Recommended runtime

For Gemma-4 production today, use the previous AEON-7 image ghcr.io/aeon-7/aeon-gemma-4-26b-a4b-dflash:latest (vLLM 0.20.1, known-good for the multimodal Gemma4UnifiedForConditionalGeneration path) or load with transformers directly.

⚠️ The newer AEON vLLM Ultimate container (ghcr.io/aeon-7/aeon-vllm-ultimate:latest) bundles vLLM 0.22.1 + PR #44389 (NVFP4 KV cache, 3× capacity) + DFlash + TurboQuant for our Qwen3.6 family — but as of 2026-06-04 it has upstream PR #44389 bugs on Gemma-4: the multimodal fallback hits a shape mismatch in Gemma4UnifiedForConditionalGeneration, and the modelopt path doesn't yet recognize NVFP4_SVD. A patched tag will be published when upstream merges fixes. Track the container's Known issues.

Capability Comparison vs Base

Full-length eval (2026-06-06): balanced MMLU across all 57 subjects (285 Q), full 164-problem HumanEval, IFEval-50 — all via the vLLM serving path. The abliteration is capability-neutral: this model is within ~1pp of Google's official base on every axis.

Model	MMLU (285)	HumanEval-syn	HumanEval-fun	IFEval
`google/gemma-4-12B-it` (base)	81.4%	99.4%	82.9%	90%
This model (K4-BF16)	80.4%	99.4%	83.5%	90%
K4-FP8 quant (imperceptible)	80.4%	99.4%	85.4%	90%
K4-NVFP4 MLP-only quant	76.8%	96.3%	76.2%	90%

Refusal removal does not cost capability. The older small-N table below is kept for historical context.

Historical (small-N, superseded by the full-length eval above)

Metric	google/gemma-4-12B-it	This model (K=4 BF16)	Δ
wikitext PPL drift	—	-4.22%	better (more confident)
alpaca PPL drift	—	+0.50%	imperceptible
MMLU (60 questions)	60.0%	55.0%	-5.0pp (within the ±12.6pp 95% CI)
HumanEval syntactic (15)	100.0%	100.0%	0pp
HumanEval functional (15)	20.0%	26.7% 🎉	+6.7pp
IFEval (10 verifiable)	90.0%	90.0%	0pp

Full-N MMLU (2000 questions) and HumanEval (164 problems) evaluations are in progress and will update this card when available.

Comparison vs TrevorJS K=1 biprojection (intermediate variant we tested)

Metric	TrevorJS K=1	This K=4 (canonical)
wikitext PPL drift	-10.47%	-4.22%
alpaca PPL drift	+0.23%	+0.50%
Generative quality	working, plain	richer detail, character-naming, technical vocabulary
Edit footprint	48 matrices	48 matrices (same)

K=4 has less than half the wikitext drift of K=1, equal weight-surgery footprint, and qualitatively richer generative output. K=4 was selected as the canonical publish candidate.

Methodology

K-direction norm-preserving biprojection

For each target weight matrix W ∈ ℝ^(out × in):

Q ∈ ℝ^(K × out)   — orthonormal basis of refusal directions
W_norms[i] = ||W[i, :]||           # per-row norms (preserved)
W_dirs[i, :] = W[i, :] / W_norms[i] # unit-row directions

For pass = 1..2:
    refusal_component = Q @ W_dirs   ∈ ℝ^(K × in)
    proj              = Q.T @ refusal_component  ∈ ℝ^(out × in)
    W_dirs            = W_dirs - scale * proj
    W_dirs[i, :]      = W_dirs[i, :] / ||W_dirs[i, :]||    # renormalize

W_new[i, :] = W_norms[i] * W_dirs[i, :]

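With scale=1.0, each pass projects the columns of W_dirs (vectors in the output space) onto the orthogonal complement of the K-dimensional refusal subspace spanned by the rows of Q, then renormalizes every row to unit length. Multiplying by W_norms at the end restores the original row magnitudes, which is what makes the edit norm-preserving.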
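
The pseudocode above translates almost line for line into NumPy. The sketch below is an illustration under stated assumptions, not the code used to produce this model: the function name `biproject`, the `eps` guard against zero-norm rows, and the tiny demo matrices are all hypothetical.

```python
import numpy as np


def biproject(W, Q, scale=1.0, passes=2, eps=1e-12):
    """Norm-preserving K-direction biprojection of one weight matrix.

    W: (out, in) weight matrix.
    Q: (K, out) matrix whose rows are orthonormal refusal directions.
    eps is an added safeguard (assumption) against division by zero.
    """
    W = np.asarray(W, dtype=np.float64)
    Q = np.asarray(Q, dtype=np.float64)
    row_norms = np.linalg.norm(W, axis=1, keepdims=True)    # (out, 1), preserved
    dirs = W / np.maximum(row_norms, eps)                    # unit-norm rows
    for _ in range(passes):
        refusal_component = Q @ dirs                         # (K, in)
        dirs = dirs - scale * (Q.T @ refusal_component)      # subtract refusal subspace
        dirs = dirs / np.maximum(
            np.linalg.norm(dirs, axis=1, keepdims=True), eps
        )                                                    # renormalize each row
    return row_norms * dirs                                  # restore row magnitudes


# Tiny hand-checkable example: out=3, in=2, K=1.
W_demo = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
Q_demo = np.array([[1.0, 1.0, 0.0]]) / np.sqrt(2.0)
W_demo_new = biproject(W_demo, Q_demo)

print(np.linalg.norm(W_demo, axis=1))      # row norms before
print(np.linalg.norm(W_demo_new, axis=1))  # row norms after (unchanged)
print(Q_demo @ W_demo_new)                 # residual refusal component
```

In this example the residual refusal component is zero because the two affected rows have equal norms. In general the row renormalization re-introduces a small component along Q, which is presumably why the recipe runs two passes rather than one.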

Basis construction

Extract per-layer refusal directions from base via heretic's compute_refusal_directions with winsorization at 99.5th percentile.
Rank layers by SNR composite score (SNR × (1 − cos_sim) × purity).
Take the top-K=4 layers: L24, L37, L39, L26 (qualities 0.0079, 0.0079, 0.0075, 0.0074).
Orthogonalize each direction against the harmless mean (double-pass Gram-Schmidt).
QR-orthonormalize the stacked directions → Q ∈ ℝ^(K × hidden).
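
The ranking and orthonormalization steps above can be sketched in NumPy as follows. This is a hedged illustration only: direction extraction and winsorization (heretic's `compute_refusal_directions`) are out of scope, the function names `rank_layers` and `build_basis` are hypothetical, and using a single shared harmless-mean vector for all K directions is a simplifying assumption.

```python
import numpy as np


def rank_layers(snr, cos_sim, purity, k):
    """Return indices of the top-k layers by SNR x (1 - cos_sim) x purity."""
    score = np.asarray(snr) * (1.0 - np.asarray(cos_sim)) * np.asarray(purity)
    return np.argsort(-score)[:k], score


def build_basis(directions, harmless_mean):
    """Build Q (K, hidden) with orthonormal rows from K raw refusal directions.

    directions:    (K, hidden) stacked per-layer refusal directions.
    harmless_mean: (hidden,) mean harmless activation (assumed shared here).
    """
    D = np.asarray(directions, dtype=np.float64)
    h = np.asarray(harmless_mean, dtype=np.float64)
    h = h / np.linalg.norm(h)
    for _ in range(2):                   # double-pass Gram-Schmidt against h
        D = D - np.outer(D @ h, h)
    q, _ = np.linalg.qr(D.T)             # q: (hidden, K), orthonormal columns
    return q.T                           # rows form the orthonormal basis


# Synthetic stand-in data; real inputs would come from the extraction step.
rng = np.random.default_rng(0)
raw_dirs = rng.normal(size=(4, 16))
harmless = rng.normal(size=16)
Q_basis = build_basis(raw_dirs, harmless)

top_idx, scores = rank_layers(
    snr=[1.0, 2.0, 3.0, 4.0],
    cos_sim=[0.0, 0.0, 0.5, 0.5],
    purity=[1.0, 1.0, 1.0, 0.25],
    k=2,
)
```

The resulting `Q_basis` has shape (K, hidden), which matches the (K × out) shape the biprojection expects for matrices that write to the residual stream.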

Edit footprint

top-pct=50: 24 of 48 decoder layers selected by SNR composite (L21-L42, L44, L45 — the mid-late range where refusal direction concentrates).
Per layer: self_attn.o_proj.weight and mlp.down_proj.weight both edited (the two matrices that write to the residual stream).
Total: 48 weight matrices edited, identical footprint to TrevorJS's K=1 biprojection.
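
The footprint count can be checked by enumerating the edited tensors. In this sketch the layer indices and the two matrix suffixes come from the list above, while the `LAYER_PREFIX` string is a hypothetical placeholder; the real parameter prefix in the checkpoint may differ.

```python
# Layers L21-L42 inclusive, plus L44 and L45 (from the footprint description).
edited_layers = list(range(21, 43)) + [44, 45]

# The two matrices per layer that write to the residual stream.
edited_suffixes = ("self_attn.o_proj.weight", "mlp.down_proj.weight")

# Hypothetical prefix, for illustration only; check the checkpoint's own key names.
LAYER_PREFIX = "model.language_model.layers"

edited_params = [
    f"{LAYER_PREFIX}.{layer}.{suffix}"
    for layer in edited_layers
    for suffix in edited_suffixes
]

print(len(edited_layers), "layers,", len(edited_params), "matrices")
```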

Why K=4 (not K=1, K=2, K=8, scale=2)

K=1 (TrevorJS baseline): works, but -10.47% wikitext PPL drift vs base; produces plainer outputs.
K=2/K=3: smaller subspace, may miss refusal-direction variance across layers.
K=4 (this model): best drift, best quality.
K=8: essentially identical to K=4 in smoke output (top 5-8 SNR directions are near-coplanar with the K=4 basis — no new information).
scale=2.0: breaks the abliteration — over-correction reflects past the orthogonal projection and damages the comply-pathway, leaving the safety-tuning's refusal pathway intact. The model produces clean refusals at scale=2.

We are at a sharp minimum at scale=1.0, K=4 — modest perturbation in either direction degrades the result.
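
The geometric part of the scale=2.0 observation can be illustrated on a single vector. This sketch ignores the row renormalization step and uses made-up values; `remove_component` is a hypothetical helper. It shows that scale=1.0 zeroes the component along the refusal direction, whereas scale=2.0 flips its sign and leaves its magnitude unchanged, so the edit becomes a reflection rather than a projection.

```python
import numpy as np


def remove_component(v, Q, scale):
    """Subtract scale times the projection of v onto the row space of Q."""
    return v - scale * (Q.T @ (Q @ v))


Q_axis = np.array([[1.0, 0.0, 0.0]])   # one unit-norm "refusal" direction
v_demo = np.array([3.0, 4.0, 0.0])     # component 3.0 along that direction

projected = remove_component(v_demo, Q_axis, scale=1.0)
reflected = remove_component(v_demo, Q_axis, scale=2.0)

print(Q_axis @ v_demo)      # component before the edit
print(Q_axis @ projected)   # component after scale=1.0
print(Q_axis @ reflected)   # component after scale=2.0
```

Algebraically, the component along Q after the edit is (1 - scale) times the original, so any scale other than 1.0 leaves some of it in place.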

Technical details

Property	Value
Base Model	google/gemma-4-12B-it
Architecture	`Gemma4UnifiedForConditionalGeneration` (multimodal-capable text path)
Decoder Layers	48
Hidden Size	3840
Attention	GQA 16 heads / 8 KV, hybrid sliding (1024) + full
Vocabulary	262,144 tokens
Embeddings	Tied (lm_head shares with embed_tokens)
Total Params	12B (unchanged)
Precision	BF16
Format	Single 23.9 GB safetensors shard

Loading

from transformers import AutoTokenizer, Gemma4UnifiedForConditionalGeneration
import torch

REPO = "AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16"
model = Gemma4UnifiedForConditionalGeneration.from_pretrained(
    REPO, torch_dtype=torch.bfloat16, device_map="cuda:0"
)
tok = AutoTokenizer.from_pretrained(REPO)

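# Build the chat prompt; return_dict=True gives a dict of tensors rather than a bare tensor.
enc = tok.apply_chat_template(
    [{"role": "user", "content": "Your prompt here"}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
)
# Move every tensor in the encoding to the model's device.
enc = {k: v.to("cuda:0") for k, v in enc.items() if hasattr(v, "to")}
# Greedy decoding (do_sample=False) with an explicit pad token.
out = model.generate(**enc, max_new_tokens=512, do_sample=False, pad_token_id=tok.eos_token_id)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][enc["input_ids"].shape[1]:], skip_special_tokens=True))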

Note: Gemma-4 is multimodal so apply_chat_template returns a BatchEncoding dict, not a tensor — unpack with **enc (or grab enc["input_ids"]).

vLLM

vllm serve AEON-7/Gemma-4-12B-it-AEON-Abliterated-K4-BF16 \
  --served-model-name aeon-12b \
  --dtype bfloat16 \
  --max-model-len 65536 \
  --max-num-seqs 8 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --enable-chunked-prefill \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser gemma4
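
Once the server is up, it can be queried through the OpenAI-compatible chat endpoint. The sketch below uses only the Python standard library. It assumes the server listens on vLLM's default port 8000 on localhost and that the model is served under the name `aeon-12b` from the command above; `build_chat_request` and `chat` are hypothetical helper names.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"   # assumption: default vLLM host and port
MODEL_NAME = "aeon-12b"                 # matches --served-model-name above


def build_chat_request(prompt, max_tokens=250, temperature=0.0):
    """Build the JSON body for a single-turn chat completion."""
    return {
        "model": MODEL_NAME,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def chat(prompt, **kwargs):
    """POST the request to /chat/completions and return the reply text."""
    payload = json.dumps(build_chat_request(prompt, **kwargs)).encode("utf-8")
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server running, `chat("What is the capital of France?")` returns the model's reply as a string. Setting `temperature=0.0` mirrors the greedy decoding used for the benchmarks on this card.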

Behavior

The model produces responses on benign prompts that are essentially indistinguishable from the base model in structure, voice, and knowledge. Open-ended Alpaca-style instructions, multi-step reasoning, code generation, and instruction-following all measure on par with or marginally above the base model in our held-out evaluation.

On prompts that the base model would decline, this model produces full responses — typically prefixed with a brief disclaimer or caveat paragraph (a stylistic artifact of the instruct training that persists across all single-pass biprojection variants we tested at scale=1.0). The model is functionally compliant with the user's request after the preamble.

Caveats

Disclaimer/caveat preamble persists across all single-pass abliteration variants we tested. To eliminate it would require either token-level intervention (whack-a-mole — the model picks synonyms not on the suppression list) or longer training-style methods (DPO/SFT).
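MMLU showed a -5pp drop on N=60, which is within the 95% CI of ±12.6pp and so is likely sampling noise. The 285-question eval under Capability Comparison puts the gap at 1.0pp (81.4% vs 80.4%). A full MMLU eval is running and this card will be updated with its results.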
Heretic's reported KL divergence reads nan on this model due to a Gemma-4-specific F.kl_div edge case with -inf logprobs. Refusal count + benign-text PPL drift were used as the eval signals instead.
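
The ±12.6pp figure quoted for the N=60 MMLU run is consistent with a normal-approximation binomial interval. A minimal sketch, assuming the Wald interval at z=1.96 (the card does not state which interval was used) and a hypothetical helper name:

```python
import math


def ci_halfwidth_pp(accuracy, n, z=1.96):
    """Half-width of a normal-approximation binomial CI, in percentage points."""
    return 100.0 * z * math.sqrt(accuracy * (1.0 - accuracy) / n)


small_n = ci_halfwidth_pp(0.55, 60)      # this model on the 60-question run
full_eval = ci_halfwidth_pp(0.804, 285)  # this model on the 285-question run

print(f"N=60:  +/-{small_n:.1f}pp")
print(f"N=285: +/-{full_eval:.1f}pp")
```

The interval narrows with the square root of N, so the 285-question run is considerably tighter than the 60-question one but still wider than the 1.0pp gap it reports.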

Related Models

TrevorJS K=1 baseline: TrevorJS/gemma-4-31B-it-uncensored
AEON-7 Gemma 4 family: Gemma-4-31B DECKARD HERETIC, Gemma-4-26B-A4B-it-Uncensored

Acknowledgements

TrevorJS — biprojection recipe and abliteration scaffolding (we forked and extended)
p-e-w/heretic — underlying abliteration framework
AEON-7 — K-direction extension, capability eval harness, this fork

License

Inherits the Gemma license from the base model. By using this model you agree to Google's Gemma license terms.

Arbitration Clause

By accessing, downloading, using, running inference on, fine-tuning, merging, quantizing, distributing, integrating, or otherwise interacting with this model, you acknowledge and agree to the following:

Sole Responsibility. You, the user, are solely and exclusively responsible for (a) every prompt you or your downstream system issue to this model, (b) every response this model produces in reply, (c) every downstream action taken by you, your systems, your agents, or your users in reliance on those responses, and (d) any harm — direct, indirect, consequential, foreseeable, or otherwise — that results from any of the above.
No Warranty. This model is provided strictly "AS IS", without warranty of any kind, express or implied, including but not limited to warranties of merchantability, fitness for a particular purpose, non-infringement, safety, alignment, factual accuracy, or legal compliance in any jurisdiction. No contributor, author, publisher, or hosting platform assumes liability of any kind for outputs or downstream use.
Legal Compliance. You are responsible for ensuring that your use of this model complies with all applicable laws, regulations, terms of service, industry codes of conduct, professional ethical standards, and organizational policies in every jurisdiction in which you operate or in which your outputs may be received. The unaligned nature of this model does not grant you any legal authorization you did not already have.
Operational Safety Layer. An uncensored model is not a toy. You are expected to implement appropriate downstream safety layers proportionate to your deployment context, including but not limited to: input validation, output filtering, content moderation, audit logging, rate limiting, access controls, and human-in-the-loop review for high-risk workflows. A production deployment of this model without such layers is unsafe by construction and is not a supported use case.
Heightened Duty of Care. The absence of internal refusal behavior means the duty of care that would ordinarily rest partly with the model rests entirely with you. You are expected to exercise greater, not lesser, caution, forethought, and ethical discipline when operating this model than you would when operating an aligned base model. If you are uncertain whether your contemplated use is ethical, legal, or wise, the correct action is to not make the request.
No Endorsement of Outputs. The authors, contributors, and publishers of this model do not endorse, adopt, or take responsibility for any specific output this model produces. Outputs are a stochastic function of the prompt, the weights, and the sampler state — not a statement of position by any human.
Arbitration. Any dispute, claim, or controversy arising out of or relating to the use of this model, its outputs, or this clause shall be resolved through binding individual arbitration under the rules of a mutually agreed arbitration body (or, absent agreement, the American Arbitration Association's Consumer Arbitration Rules), waiving any right to a jury trial, class action, representative action, or consolidated proceeding. Venue shall be the jurisdiction of the disputing party bringing the claim. Costs and attorneys' fees shall be allocated per the applicable arbitration rules. This clause does not expand, and where legally prohibited does not establish, any liability in the other direction; it limits how the user may proceed when alleging harm tied to their own use of this model.
Indemnification. You agree to indemnify, defend, and hold harmless the authors, contributors, and publishers of this model from and against any claims, damages, losses, liabilities, costs, and expenses (including reasonable attorneys' fees) arising from or related to your use of the model or your breach of this clause.
Severability. If any provision of this clause is held unenforceable in a given jurisdiction, the remaining provisions remain in full force in that jurisdiction, and the unenforceable provision is replaced by the closest enforceable equivalent consistent with the original intent.
Acceptance. Your use of this model constitutes your acceptance of this clause in full. If you do not accept, do not use the model.

This model is a tool with no opinions of its own. You supply the opinions. You supply the judgement. You supply the ethics. The outputs carry your fingerprints, not the model's.