Spaces:

nad707
/

wb

Sleeping

App Files Files Community

wb / README.md

nad707

feat: flatten repo and rebootstrap hf workspace

bf96836 2 months ago

preview code

raw

history blame contribute delete

5.93 kB

metadata

title: ML Workbench
emoji: 🧠
colorFrom: blue
colorTo: indigo
sdk: docker
app_port: 7860
pinned: false
license: mit

ML Workbench

Three tools in one Space:

Tab	What it does
Token Tax Workbench	Benchmark tokenizer families across languages, inspect deployable model mappings, run scenario tradeoff analyses, and audit formulas/sources
Model Comparison	Compare any two free models via OpenRouter side-by-side — reasoning trace, token counts, per-model inference parameters
Tokenizer Inspector	Paste text, see colour-coded token splits, token IDs, fragmentation ratios, OOV flags, and language efficiency scores

Token Tax Workbench

The workbench is designed as a four-step analysis flow:

Benchmark: Compare tokenizer families on a strict verified multilingual corpus
Catalog: Inspect deployable models, current pricing, context windows, and tokenizer mappings
Scenario Lab: Turn tokenizer differences into cost, scale, and context tradeoff views under your assumptions
Audit: Inspect formulas, data dictionary, sources, provenance, and exclusions

What to do first

Start in Benchmark to see which tokenizer families inflate tokens for your target languages
Move to Catalog to see which real models sit on top of those families
Use Scenario Lab to test your traffic assumptions
Use Audit when you need to verify where a number came from

Data policy

Strict verified data is shown by default
Proxy tokenizer mappings stay hidden until explicitly enabled
Latency and throughput are only shown when surfaced metadata exists
The app separates measured benchmark evidence from scenario-derived estimates

Why it matters

The same semantic content can use very different token counts across languages and tokenizer families. That changes:

API cost
effective context window
scaling behavior under traffic

This workbench helps you inspect those tradeoffs directly instead of assuming one model behaves equally across all languages.

Model Comparison

Compare how reasoning and standard models respond to the same question. Pick two models from the dropdown, adjust temperature / max-tokens if you want, and click Compare →.

The left panel shows the model's reasoning trace (if it has one) plus its final answer and token counts. The right panel shows the second model's response.

Models available

All free-tier via OpenRouter — no credit card required.

Label	Model ID
Step 3.5 Flash (Reasoning)	`stepfun/step-3.5-flash`
Llama-3.1-8B	`meta-llama/llama-3.1-8b-instruct`
Gemma-3-27B	`google/gemma-3-27b-it:free`
Mistral-7B	`mistralai/mistral-7b-instruct:free`

Preset questions

Chosen to expose the "overthinking" failure mode of reasoning models:

Question	Why interesting
How many r's in "strawberry"?	Tokenization trap — letter-counting spiral
Bat and ball cost $1.10...	CRT problem — intuitive wrong answer is $0.10, correct is $0.05
Is 9677 a prime number?	Forces real arithmetic vs pattern recall
Monty Hall problem	Famous model-confuser — counterintuitive correct answer
Fold paper 42 times	Exponential growth — tests step-by-step reasoning vs approximation

Tokenizer Inspector

Paste any text and see exactly how a tokenizer splits it.

Single mode

Choose a tokenizer (GPT-2, Llama-3, or Mistral)
Set the OOV threshold (tokens per word that counts as suspicious — default 3)
Click Tokenize

Output: colour-coded token spans (red = OOV-flagged words), token count, fragmentation ratio, detected language.

Compare mode

Same input text, two tokenizers side by side
Shows token count per tokenizer and the ratio between them

Tokenizers

Label	Model
gpt2	`gpt2`
llama-3	`NousResearch/Meta-Llama-3-8B`
mistral	`mistralai/Mistral-7B-v0.1`

Language efficiency score

When the input is not English, the app translates it via OpenRouter and compares token counts:

score = english_token_count / input_token_count

Score > 1.0: the source language is more compact than English for this tokenizer. Score < 1.0: the source language uses more tokens. Score = 1.0: English input (no translation needed).

Run locally

Requires uv.

# Install dependencies (creates .venv)
make install

# Run the app
make run

Open the local Gradio URL printed in your terminal. Set your OpenRouter API key in the UI.

To run with a pre-set server key (skips the key input field):

OPENROUTER_API_KEY=sk-or-... make run

Run tests

make test

Runs the workbench test suite with coverage.

Deploy to HF Mirror Spaces

First-time setup

# Authenticate with HF Mirror
hf login

Create a new Space at huggingface.co/new-space:

SDK: Docker
Visibility: Public or Private

Set your OpenRouter API key as a Space secret

In your Space settings → Secrets → add:

Name:  OPENROUTER_API_KEY
Value: sk-or-...

This lets the app run without users needing their own key.

Deploy

HF_SPACE=your-username/your-space-name make deploy

This pushes the current repo root directly to your Space.

Update after changes

Same command — make deploy uploads the current repo state on every run.

Notes

API keys are never stored; used only for the duration of each request.
The tokenizer tab downloads model tokenizer configs on first use (~seconds). Subsequent calls use a local cache.
This Space runs as a Docker Space and starts via bootstrap.py.
The image is intentionally minimal: Python, requirements.txt, and the runtime modules only.