---
title: Tokenizer Playground
emoji: 🔀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

πŸ”€ Tokenizer Playground

An interactive web application for experimenting with Hugging Face tokenizers, built for NLP researchers and developers who need to quickly test and compare tokenization strategies.

## Features

πŸ”€ Tokenize Tab

  • Convert any text into tokens using popular models
  • View tokens, token IDs, and detailed token information
  • See tokenization statistics (tokens per character, vocabulary size, etc.)
  • Support for adding/removing special tokens
  • Custom model support via Hugging Face model IDs
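Under the hood, this tab boils down to a few `transformers` calls. A minimal sketch (the model choice here is illustrative; any Hub ID works the same way):

```python
from transformers import AutoTokenizer

# gpt2 is just a small, commonly available example model.
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subwords."
ids = tok.encode(text, add_special_tokens=False)
tokens = tok.convert_ids_to_tokens(ids)

print(tokens)  # the subword strings
print(ids)     # their integer IDs
print(f"{len(ids)} tokens / {len(text)} chars "
      f"= {len(ids) / len(text):.2f} tokens per char; vocab size {tok.vocab_size}")
```

Toggling `add_special_tokens` controls whether model-specific markers (e.g. BOS/EOS) are inserted around the text.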

πŸ”„ Detokenize Tab

  • Convert token IDs back to text
  • Support for various input formats (list, comma-separated, space-separated)
  • Option to skip special tokens
  • Verification of round-trip tokenization
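The decode direction, including a small parser for the accepted input formats, can be sketched as follows (`parse_ids` is a hypothetical helper, not the app's actual function):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

def parse_ids(raw: str) -> list[int]:
    # Accept "[1, 2, 3]", "1,2,3", or "1 2 3" style input.
    return [int(p) for p in raw.strip().strip("[]").replace(",", " ").split()]

ids = tok.encode("Hello world", add_special_tokens=False)
raw = ", ".join(map(str, ids))  # pretend the user pasted this string
decoded = tok.decode(parse_ids(raw), skip_special_tokens=True)

# Round-trip verification: re-encoding the decoded text yields the same IDs.
assert decoded == "Hello world"
assert tok.encode(decoded, add_special_tokens=False) == ids
```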

πŸ“Š Compare Tab

  • Compare tokenization across multiple models simultaneously
  • See token count differences and efficiency metrics
  • Identify which tokenizer is most efficient for your use case
  • Sort results by token count
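A comparison like this reduces to encoding the same text with each tokenizer and sorting by count. A sketch with two small tokenizers (the app compares many more):

```python
from transformers import AutoTokenizer

text = "Tokenization efficiency varies across models."
# Illustrative picks; any set of Hub model IDs works.
model_ids = ["openai-community/gpt2", "google-bert/bert-base-uncased"]

counts = {}
for model_id in model_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    counts[model_id] = len(tok.encode(text, add_special_tokens=False))

# Sort ascending: fewer tokens means cheaper inference for this text.
for model_id, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{model_id}: {n} tokens")
```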

πŸ“– Vocabulary Tab

  • Explore tokenizer vocabulary details
  • View special tokens and their configurations
  • See vocabulary size and tokenizer type
  • Browse first 100 tokens in the vocabulary
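The same vocabulary details are available programmatically; a short sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
print("vocab size:", tok.vocab_size)
print("tokenizer class:", type(tok).__name__)
print("special tokens:", tok.special_tokens_map)

# First 10 entries by token ID (the app shows the first 100).
vocab = tok.get_vocab()
for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1])[:10]:
    print(token_id, repr(token))
```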

## Supported Models

### Pre-configured Models

- **Qwen Series:** Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series:** Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models:** GPT-2, GPT-NeoX
- **Google Models:** Gemma, T5, BERT
- **Mistral Models:** Mistral 7B, Mixtral 8x7B
- **Other Models:** DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models

You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:

- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
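The Space's actual caching code isn't shown here, but `functools.lru_cache` conveys the idea in a few lines (a minimal sketch, not the app's implementation):

```python
from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    # First call downloads/loads the tokenizer; repeat calls are instant.
    return AutoTokenizer.from_pretrained(model_id)

tok_a = get_tokenizer("openai-community/gpt2")
tok_b = get_tokenizer("openai-community/gpt2")
assert tok_a is tok_b  # the second call returned the cached object
```

Caching matters here because loading a tokenizer can involve a network fetch and file parsing, which would otherwise happen on every button click.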

## Quick Start

1. Select a tokenizer from the dropdown or enter a custom model ID
2. Enter your text in the input field
3. Click the action button (Tokenize, Decode, Compare, or Analyze)
4. View the results in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (such as `[CLS]`, `[SEP]`, `<s>`, and `</s>`) are model-specific
- Subword tokenization lets a model handle out-of-vocabulary words
- Token efficiency directly affects inference cost and API usage

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at http://localhost:7860.

## License

This project is licensed under the MIT License.