NanoGPT-X

What This Is

NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2. It combines architectural techniques from DeepSeek, Meta, Google, Microsoft, Mistral, and Stanford in a single file that trains on a T4 GPU in under 5 hours.

Architecture

Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
                    |
                    v
              Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
                    |
                    v
              Attn  = MLA (default) | DiffAttn (optional)
              MLP   = SwiGLU
              alpha = DeepNorm scaling = sqrt(2N) = 2.83 for N = 4 blocks
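
The residual rule in the diagram can be sketched in plain Python on lists of floats. This is a minimal illustration, not the model's code: `toy_attn` and `toy_mlp` are hypothetical stand-ins for MLA and SwiGLU, the norm has no learned gain, and sharing one normalized input between both branches is an assumption read off the diagram. Only the parallel residual form, the 4 blocks, and alpha = sqrt(2N) come from the text above.

```python
import math

N_BLOCKS = 4                      # "Block x 4" in the diagram
ALPHA = math.sqrt(2 * N_BLOCKS)   # sqrt(8), about 2.83


def rms_norm(x, eps=1e-6):
    """Scale x to unit root-mean-square (no mean-centering, no learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def block(x, attn, mlp, alpha=ALPHA):
    """Compute x + alpha * attn(norm(x)) + alpha * mlp(norm(x))."""
    h = rms_norm(x)               # assumption: both branches share one norm
    a = attn(h)
    m = mlp(h)
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]


# Hypothetical stand-in sub-layers, for illustration only.
def toy_attn(h):
    return [0.1 * v for v in h]


def toy_mlp(h):
    return [0.05 * v for v in h]


x = [1.0, -2.0, 0.5, 3.0]
for _ in range(N_BLOCKS):
    x = block(x, toy_attn, toy_mlp)
out = rms_norm(x)                 # final RMSNorm before the LM head
```

Because both branches read the same `Norm(x)`, they could in principle be evaluated in parallel; whether the actual implementation does so is not stated in this card.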

Components

Component	Source	What It Does
MLA	DeepSeek-V3	KV cache compression to 32-dim latent (8x smaller)
MTP	DeepSeek-V3	Predicts t+2 and t+3 alongside t+1, giving more training signal per token
DiffAttn	Microsoft 2024	Signal-minus-noise attention filtering
SWA	Mistral	Local attention window of 128 tokens
RoPE+NTK	Meta / CodeLLaMA	Relative position with length extrapolation
DeepNorm	Microsoft	Residual scaling for deep network stability
RMSNorm	LLaMA / PaLM	Fast normalization without mean-centering
QK-Norm	Gemma 3	Pre-attention query/key normalization
SwiGLU	PaLM / LLaMA	Gated FFN activation (8/3 ratio)
Z-loss	PaLM / Chinchilla	Logit regularization preventing softmax drift
Lion	Google Brain 2023	Sign-momentum optimizer
WSD	DeepSeek / MiniMax	Warmup-Stable-Decay LR schedule
Flash Attention 2	Stanford	Fused attention kernel with memory linear in sequence length (the N x N score matrix is never stored)
torch.compile	PyTorch 2.0+	Graph compilation with operator fusion
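
A few rows of the table reduce to short formulas, sketched below in plain Python. These are illustrative only and may not match the training script: the rounding knob `multiple_of`, the z-loss coefficient, the warmup and decay fractions, the linear decay shape, and the exact window boundary convention are all assumptions. The 8/3 ratio, the 128-token window, and the Warmup-Stable-Decay shape come from the table.

```python
import math


def swiglu_hidden_dim(d_model, multiple_of=1):
    """Hidden width of a SwiGLU FFN using the 8/3 ratio.

    multiple_of is a hypothetical rounding knob (round up to a multiple).
    """
    hidden = (8 * d_model) // 3
    return multiple_of * math.ceil(hidden / multiple_of)


def z_loss(logits, coeff=1e-4):
    """Auxiliary z-loss: coeff * log(Z)^2 with Z = sum(exp(logits)).

    coeff is an assumed default, not taken from this model card.
    """
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coeff * log_z ** 2


def in_window(query_pos, key_pos, window=128):
    """Causal sliding window: a query sees itself and the window-1 tokens before it."""
    return 0 <= query_pos - key_pos < window


def wsd_lr(step, total_steps, peak_lr, warmup_frac=0.05, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, flat plateau, linear decay (assumed shapes)."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_steps = int(decay_frac * total_steps)
    decay_start = total_steps - decay_steps
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = min(1.0, (step - decay_start) / max(1, decay_steps))
    return peak_lr + (min_lr - peak_lr) * progress


# Example: sample the schedule at a few points of a hypothetical 1000-step run.
schedule = [wsd_lr(s, 1000, 1e-3) for s in (0, 49, 500, 900, 1000)]
```

The z-loss term penalizes the log-partition function for drifting away from zero, which is the "softmax drift" the table refers to; it is added to the cross-entropy loss rather than replacing it.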
