NanoGPT-X
What This Is
NanoGPT-X is a 15.6M-parameter decoder-only transformer trained on WikiText-2. It combines architectural and training techniques published by DeepSeek, Meta, Google, Microsoft, Mistral, and Stanford in a single file that trains on a T4 GPU in under 5 hours.
Architecture
Input -> [Embed] -> [Block x 4] -> [RMSNorm] -> [LM Head] -> Output
|
v
Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x))
|
v
Attn = MLA (default) | DiffAttn (optional)
MLP = SwiGLU
alpha = DeepNorm scaling = sqrt(2N) = 2.83 for N = 4 blocks
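
The block formula and the alpha value above can be checked with a small, dependency-free sketch. It reads the diagram literally, so both sublayers receive the same normalized input. The function names (`deepnorm_alpha`, `rms_norm`, `block`) are hypothetical and not taken from the NanoGPT-X source; `attn` and `mlp` are stand-in callables, and the norm has no learned gain.

```python
import math

def deepnorm_alpha(num_blocks: int) -> float:
    """Residual scale as stated in the diagram: alpha = sqrt(2N)."""
    return math.sqrt(2 * num_blocks)

def rms_norm(x, eps=1e-6):
    """RMSNorm without a learned gain: divide by the root-mean-square."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def block(x, attn, mlp, alpha):
    """Block = x + alpha * Attn(Norm(x)) + alpha * MLP(Norm(x)), read literally."""
    h = rms_norm(x)   # assumption: one shared pre-norm feeds both branches
    a = attn(h)       # stand-in for MLA or DiffAttn
    m = mlp(h)        # stand-in for SwiGLU
    return [xi + alpha * ai + alpha * mi for xi, ai, mi in zip(x, a, m)]

alpha = deepnorm_alpha(4)   # N = 4 blocks
print(round(alpha, 2))      # 2.83
```

Passing identity or zero functions for `attn` and `mlp` is a quick way to confirm that the residual path carries `x` through unchanged when the branches contribute nothing.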
Components
| Component | Source | What It Does |
|---|---|---|
| MLA | DeepSeek-V3 | KV cache compression to 32-dim latent (8x smaller) |
| MTP | DeepSeek-V3 | Predicts tokens t+2 and t+3 alongside t+1 to densify the training signal |
| DiffAttn | Microsoft 2024 | Signal-minus-noise attention filtering |
| SWA | Mistral | Local attention window of 128 tokens |
| RoPE+NTK | Meta / CodeLLaMA | Relative position with length extrapolation |
| DeepNorm | Microsoft | Residual scaling for deep network stability |
| RMSNorm | LLaMA / PaLM | Fast normalization without mean-centering |
| QK-Norm | Gemma 2 | Pre-attention query/key normalization |
| SwiGLU | PaLM / LLaMA | Gated FFN activation (8/3 ratio) |
| Z-loss | PaLM / Chinchilla | Logit regularization preventing softmax drift |
| Lion | Google Brain 2023 | Sign-momentum optimizer |
| WSD | DeepSeek / MiniMax | Warmup-Stable-Decay LR schedule |
| Flash Attention 2 | Stanford | Fused attention kernel; memory grows linearly with sequence length because the full attention matrix is never stored |
| torch.compile | PyTorch 2.0+ | Graph compilation with operator fusion |
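
Three of the simpler rows in the table (SWA, Z-loss, WSD) can be illustrated in plain Python. This is a minimal sketch under stated assumptions, not the NanoGPT-X implementation: the function names are hypothetical, the window is taken to include the current token, the Z-loss coefficient of 1e-4 is an assumed default, and the warmup and decay fractions are illustrative values that do not come from the README.

```python
import math

def sliding_window_mask(seq_len: int, window: int = 128):
    """Causal mask: position i may attend to j iff i - window < j <= i."""
    return [[(j <= i) and (i - j < window) for j in range(seq_len)]
            for i in range(seq_len)]

def z_loss(logits, coef: float = 1e-4) -> float:
    """Auxiliary loss coef * (log sum_j exp(logit_j))^2 for one position."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return coef * log_z ** 2

def wsd_lr(step, total_steps, peak_lr,
           warmup_frac=0.05, decay_frac=0.2, min_lr=0.0):
    """Warmup-Stable-Decay: linear warmup, constant plateau, linear decay."""
    warmup_steps = int(warmup_frac * total_steps)
    decay_start = total_steps - int(decay_frac * total_steps)
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    if step < decay_start:
        return peak_lr
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return peak_lr + (min_lr - peak_lr) * progress

# With a window of 2, token 3 sees only tokens 2 and 3.
print(sliding_window_mask(5, window=2)[3])
# The schedule holds the peak rate through the stable phase.
print(wsd_lr(50, 100, peak_lr=1.0))
```

The Z-loss term pulls the log-partition function toward zero, which is the "softmax drift" the table refers to; in training it would be added to the cross-entropy loss with a small coefficient.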