title: ML Workbench
emoji: π§
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit
ML Workbench
Three tools in one Space:
| Tab | What it does |
|---|---|
| Token Tax Workbench | Benchmark tokenizer families across languages, inspect deployable model mappings, run scenario tradeoff analyses, and audit formulas/sources |
| Model Comparison | Compare any two free models via OpenRouter side by side: reasoning trace, token counts, per-model inference parameters |
| Tokenizer Inspector | Paste text, see colour-coded token splits, token IDs, fragmentation ratios, OOV flags, and language efficiency scores |
Token Tax Workbench
The workbench is designed as a four-step analysis flow:
- Benchmark: Compare tokenizer families on a strict verified multilingual corpus
- Catalog: Inspect deployable models, current pricing, context windows, and tokenizer mappings
- Scenario Lab: Turn tokenizer differences into cost, scale, and context tradeoff views under your assumptions
- Audit: Inspect formulas, data dictionary, sources, provenance, and exclusions
What to do first
- Start in Benchmark to see which tokenizer families inflate tokens for your target languages
- Move to Catalog to see which real models sit on top of those families
- Use Scenario Lab to test your traffic assumptions
- Use Audit when you need to verify where a number came from
Data policy
- Strict verified data is shown by default
- Proxy tokenizer mappings stay hidden until explicitly enabled
- Latency and throughput are only shown when surfaced metadata exists
- The app separates measured benchmark evidence from scenario-derived estimates
Why it matters
The same semantic content can use very different token counts across languages and tokenizer families. That changes:
- API cost
- effective context window
- scaling behavior under traffic
This workbench helps you inspect those tradeoffs directly instead of assuming one model behaves equally across all languages.
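As a rough illustration of how token inflation feeds into the quantities listed above, the sketch below works through one made-up case. The function names, the price per million tokens, the context window, and the token counts are all hypothetical; they are not values or formulas taken from the workbench.

```python
def relative_token_cost(tokens: int, baseline_tokens: int) -> float:
    """Token inflation relative to a baseline rendering (for example English)."""
    if baseline_tokens <= 0:
        raise ValueError("baseline_tokens must be positive")
    return tokens / baseline_tokens


def request_cost_usd(tokens: int, usd_per_million_tokens: float) -> float:
    """Cost of a request that is billed per token."""
    return tokens * usd_per_million_tokens / 1_000_000


def effective_context(context_window: int, inflation: float) -> int:
    """Context window expressed in baseline-equivalent tokens."""
    return int(context_window / inflation)


# Hypothetical case: the same content takes 100 tokens in English and
# 250 tokens in another language under the same tokenizer.
inflation = relative_token_cost(250, 100)
cost_en = request_cost_usd(100, usd_per_million_tokens=2.0)
cost_other = request_cost_usd(250, usd_per_million_tokens=2.0)
window = effective_context(8_000, inflation)
```

Under these assumptions the inflated language costs 2.5 times as much per request and fits only 3,200 baseline-equivalent tokens into an 8,000-token window.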
Model Comparison
Compare how reasoning and standard models respond to the same question. Pick two models from the dropdown, adjust temperature / max-tokens if you want, and click Compare.
The left panel shows the first model's reasoning trace (if it has one), its final answer, and its token counts. The right panel shows the second model's response.
Models available
All models are free-tier via OpenRouter; no credit card is required.
| Label | Model ID |
|---|---|
| Step 3.5 Flash (Reasoning) | stepfun/step-3.5-flash |
| Llama-3.1-8B | meta-llama/llama-3.1-8b-instruct |
| Gemma-3-27B | google/gemma-3-27b-it:free |
| Mistral-7B | mistralai/mistral-7b-instruct:free |
Preset questions
Chosen to expose the "overthinking" failure mode of reasoning models:
| Question | Why interesting |
|---|---|
| How many r's in "strawberry"? | Tokenization trap: letter-counting spiral |
| Bat and ball cost $1.10... | CRT problem: intuitive wrong answer is $0.10, correct is $0.05 |
| Is 9677 a prime number? | Forces real arithmetic vs pattern recall |
| Monty Hall problem | Famous model-confuser: counterintuitive correct answer |
| Fold paper 42 times | Exponential growth: tests step-by-step reasoning vs approximation |
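For reference, several of these presets have answers that can be checked mechanically. The sketch below is not part of the app. The helper names are hypothetical, and the $1.00 price difference assumes the standard wording of the bat-and-ball problem, which the table abbreviates.

```python
from fractions import Fraction


def count_letter(word: str, letter: str) -> int:
    """Case-insensitive count of a letter in a word."""
    return word.lower().count(letter.lower())


def ball_price(total: Fraction, difference: Fraction) -> Fraction:
    """Solve ball + (ball + difference) = total for the ball price."""
    return (total - difference) / 2


def is_prime(n: int) -> bool:
    """Trial division up to sqrt(n); adequate for small integers."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


r_count = count_letter("strawberry", "r")
# Assumes the usual statement: the bat costs $1.00 more than the ball.
ball = ball_price(Fraction("1.10"), Fraction("1.00"))
prime_9677 = is_prime(9677)
# Each fold doubles the number of layers.
layers_after_42_folds = 2 ** 42
```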
Tokenizer Inspector
Paste any text and see exactly how a tokenizer splits it.
Single mode
- Choose a tokenizer (GPT-2, Llama-3, or Mistral)
- Set the OOV threshold (the number of tokens per word that counts as suspicious; default 3)
- Click Tokenize
Output: colour-coded token spans (red = OOV-flagged words), token count, fragmentation ratio, detected language.
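This README does not spell out how the fragmentation ratio or the OOV flag are computed. One plausible reading, sketched below, is tokens per whitespace-separated word, with a word flagged when it splits into at least the threshold number of tokens. The `inspect_text` function and the toy `chunk_tokenizer` are hypothetical stand-ins, not the app's implementation or a real tokenizer.

```python
from typing import Callable, List


def chunk_tokenizer(word: str, size: int = 4) -> List[str]:
    """Toy stand-in for a tokenizer: fixed-size character chunks."""
    return [word[i:i + size] for i in range(0, len(word), size)]


def inspect_text(
    text: str,
    tokenize: Callable[[str], List[str]] = chunk_tokenizer,
    oov_threshold: int = 3,
) -> dict:
    """Return token count, tokens-per-word ratio, and flagged words."""
    words = text.split()
    counts = [(word, len(tokenize(word))) for word in words]
    total = sum(n for _, n in counts)
    return {
        "token_count": total,
        # Assumed definition: average number of tokens per word.
        "fragmentation_ratio": total / len(words) if words else 0.0,
        # Assumed rule: flag words that split into >= threshold tokens.
        "flagged": [word for word, n in counts if n >= oov_threshold],
    }


report = inspect_text("the tokenization trap")
```

With the toy tokenizer, "tokenization" splits into three chunks and is the only flagged word.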
Compare mode
- Same input text, two tokenizers side by side
- Shows token count per tokenizer and the ratio between them
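The comparison itself amounts to two counts and a quotient. A minimal sketch follows, assuming the ratio is the first tokenizer's count divided by the second's (the README does not say which direction is used). The two toy tokenizers are hypothetical stand-ins for the real ones.

```python
from typing import Callable, List

Tokenizer = Callable[[str], List[str]]


def whitespace_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per whitespace-separated word."""
    return text.split()


def char_tokenizer(text: str) -> List[str]:
    """Toy tokenizer: one token per character."""
    return list(text)


def compare_token_counts(text: str, tok_a: Tokenizer, tok_b: Tokenizer) -> dict:
    """Token count for each tokenizer and the ratio a / b."""
    count_a = len(tok_a(text))
    count_b = len(tok_b(text))
    ratio = count_a / count_b if count_b else float("inf")
    return {"count_a": count_a, "count_b": count_b, "ratio": ratio}


result = compare_token_counts("hello world", whitespace_tokenizer, char_tokenizer)
```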
Tokenizers
| Label | Model |
|---|---|
| gpt2 | gpt2 |
| llama-3 | NousResearch/Meta-Llama-3-8B |
| mistral | mistralai/Mistral-7B-v0.1 |
Language efficiency score
When the input is not English, the app translates it via OpenRouter and compares token counts:
score = english_token_count / input_token_count
- Score > 1.0: the source language is more compact than English for this tokenizer.
- Score < 1.0: the source language uses more tokens than English.
- Score = 1.0: the input is English (no translation needed).
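The formula above is simple enough to express directly. The sketch below computes the score from two token counts and maps it to the three cases. The function names and the example counts are hypothetical, and the translation step is left out.

```python
def language_efficiency(english_token_count: int, input_token_count: int) -> float:
    """score = english_token_count / input_token_count"""
    if input_token_count <= 0:
        raise ValueError("input_token_count must be positive")
    return english_token_count / input_token_count


def describe(score: float, tol: float = 1e-9) -> str:
    """Map a score to the interpretation given in the text."""
    if abs(score - 1.0) <= tol:
        return "same token count as English"
    if score > 1.0:
        return "more compact than English"
    return "uses more tokens than English"


# Hypothetical counts: the English translation needs 40 tokens,
# the original input needs 100 tokens with the same tokenizer.
score = language_efficiency(40, 100)
label = describe(score)
```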
Run locally
Requires uv.
# Install dependencies (creates .venv)
make install
# Run the app
make run
Open the local Gradio URL printed in your terminal. Set your OpenRouter API key in the UI.
To run with a pre-set server key (skips the key input field):
OPENROUTER_API_KEY=sk-or-... make run
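How the app chooses between a server key and a key typed into the UI is not shown in this README. One plausible approach, sketched below under that assumption, is to prefer the environment variable and fall back to the UI value. `resolve_api_key` is a hypothetical helper, not a function from this repo.

```python
import os
from typing import Mapping, Optional


def resolve_api_key(
    ui_key: Optional[str],
    env: Mapping[str, str] = os.environ,
) -> Optional[str]:
    """Prefer a server-side key; otherwise use the key from the UI."""
    server_key = env.get("OPENROUTER_API_KEY", "").strip()
    if server_key:
        return server_key
    ui_key = (ui_key or "").strip()
    # None signals that the caller should ask the user for a key.
    return ui_key or None
```

Passing the environment as a parameter keeps the helper easy to test without touching the real process environment.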
Run tests
make test
Runs the workbench test suite with coverage.
Deploy to HF Mirror Spaces
First-time setup
# Authenticate with HF Mirror
hf login
Create a new Space at huggingface.co/new-space:
- SDK: Docker
- Visibility: Public or Private
Set your OpenRouter API key as a Space secret
In your Space settings, under Secrets, add:
Name: OPENROUTER_API_KEY
Value: sk-or-...
This lets the app run without users needing their own key.
Deploy
HF_SPACE=your-username/your-space-name make deploy
This pushes the current repo root directly to your Space.
Update after changes
Same command: make deploy uploads the current repo state on every run.
Notes
- API keys are never stored; they are used only for the duration of each request.
- The Tokenizer Inspector tab downloads tokenizer configs on first use, which takes a few seconds. Subsequent calls use a local cache.
- This Space runs as a Docker Space and starts via bootstrap.py.
- The image is intentionally minimal: Python, requirements.txt, and the runtime modules only.