wb / README.md
nad707's picture
feat: flatten repo and rebootstrap hf workspace
bf96836
metadata
title: ML Workbench
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit

ML Workbench

Three tools in one Space:

Tab What it does
Token Tax Workbench Benchmark tokenizer families across languages, inspect deployable model mappings, run scenario tradeoff analyses, and audit formulas/sources
Model Comparison Compare any two free models via OpenRouter side-by-side β€” reasoning trace, token counts, per-model inference parameters
Tokenizer Inspector Paste text, see colour-coded token splits, token IDs, fragmentation ratios, OOV flags, and language efficiency scores

Token Tax Workbench

The workbench is designed as a four-step analysis flow:

  1. Benchmark: Compare tokenizer families on a strict verified multilingual corpus
  2. Catalog: Inspect deployable models, current pricing, context windows, and tokenizer mappings
  3. Scenario Lab: Turn tokenizer differences into cost, scale, and context tradeoff views under your assumptions
  4. Audit: Inspect formulas, data dictionary, sources, provenance, and exclusions

What to do first

  • Start in Benchmark to see which tokenizer families inflate tokens for your target languages
  • Move to Catalog to see which real models sit on top of those families
  • Use Scenario Lab to test your traffic assumptions
  • Use Audit when you need to verify where a number came from

Data policy

  • Strict verified data is shown by default
  • Proxy tokenizer mappings stay hidden until explicitly enabled
  • Latency and throughput are only shown when surfaced metadata exists
  • The app separates measured benchmark evidence from scenario-derived estimates

Why it matters

The same semantic content can use very different token counts across languages and tokenizer families. That changes:

  • API cost
  • effective context window
  • scaling behavior under traffic

This workbench helps you inspect those tradeoffs directly instead of assuming one model behaves equally across all languages.


Model Comparison

Compare how reasoning and standard models respond to the same question. Pick two models from the dropdown, adjust temperature / max-tokens if you want, and click Compare β†’.

The left panel shows the model's reasoning trace (if it has one) plus its final answer and token counts. The right panel shows the second model's response.

Models available

All free-tier via OpenRouter β€” no credit card required.

Label Model ID
Step 3.5 Flash (Reasoning) stepfun/step-3.5-flash
Llama-3.1-8B meta-llama/llama-3.1-8b-instruct
Gemma-3-27B google/gemma-3-27b-it:free
Mistral-7B mistralai/mistral-7b-instruct:free

Preset questions

Chosen to expose the "overthinking" failure mode of reasoning models:

Question Why interesting
How many r's in "strawberry"? Tokenization trap β€” letter-counting spiral
Bat and ball cost $1.10... CRT problem β€” intuitive wrong answer is $0.10, correct is $0.05
Is 9677 a prime number? Forces real arithmetic vs pattern recall
Monty Hall problem Famous model-confuser β€” counterintuitive correct answer
Fold paper 42 times Exponential growth β€” tests step-by-step reasoning vs approximation

Tokenizer Inspector

Paste any text and see exactly how a tokenizer splits it.

Single mode

  • Choose a tokenizer (GPT-2, Llama-3, or Mistral)
  • Set the OOV threshold (tokens per word that counts as suspicious β€” default 3)
  • Click Tokenize

Output: colour-coded token spans (red = OOV-flagged words), token count, fragmentation ratio, detected language.

Compare mode

  • Same input text, two tokenizers side by side
  • Shows token count per tokenizer and the ratio between them

Tokenizers

Label Model
gpt2 gpt2
llama-3 NousResearch/Meta-Llama-3-8B
mistral mistralai/Mistral-7B-v0.1

Language efficiency score

When the input is not English, the app translates it via OpenRouter and compares token counts:

score = english_token_count / input_token_count

Score > 1.0: the source language is more compact than English for this tokenizer. Score < 1.0: the source language uses more tokens. Score = 1.0: English input (no translation needed).


Run locally

Requires uv.

# Install dependencies (creates .venv)
make install

# Run the app
make run

Open the local Gradio URL printed in your terminal. Set your OpenRouter API key in the UI.

To run with a pre-set server key (skips the key input field):

OPENROUTER_API_KEY=sk-or-... make run

Run tests

make test

Runs the workbench test suite with coverage.


Deploy to HF Mirror Spaces

First-time setup

# Authenticate with HF Mirror
hf login

Create a new Space at huggingface.co/new-space:

  • SDK: Docker
  • Visibility: Public or Private

Set your OpenRouter API key as a Space secret

In your Space settings β†’ Secrets β†’ add:

Name:  OPENROUTER_API_KEY
Value: sk-or-...

This lets the app run without users needing their own key.

Deploy

HF_SPACE=your-username/your-space-name make deploy

This pushes the current repo root directly to your Space.

Update after changes

Same command β€” make deploy uploads the current repo state on every run.


Notes

  • API keys are never stored; used only for the duration of each request.
  • The tokenizer tab downloads model tokenizer configs on first use (~seconds). Subsequent calls use a local cache.
  • This Space runs as a Docker Space and starts via bootstrap.py.
  • The image is intentionally minimal: Python, requirements.txt, and the runtime modules only.