---
title: Tokenizer Playground
emoji: 🤗
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---
# 🤗 Tokenizer Playground

An interactive web application for experimenting with various Hugging Face tokenizers. Perfect for NLP researchers and developers who need to quickly test and compare different tokenization strategies.
## Features

### 🤗 Tokenize Tab
- Convert any text into tokens using popular models
- View tokens, token IDs, and detailed token information
- See tokenization statistics (tokens per character, vocabulary size, etc.)
- Support for adding/removing special tokens
- Custom model support via Hugging Face model IDs
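The tab's core operation can be sketched with the Transformers API (the model ID and sample text below are arbitrary choices for illustration, not the app's defaults):

```python
from transformers import AutoTokenizer

# Load any Hub tokenizer by model ID (gpt2 chosen as a small example)
tokenizer = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subword units."
ids = tokenizer.encode(text, add_special_tokens=False)
tokens = tokenizer.convert_ids_to_tokens(ids)

print(tokens)  # subword strings
print(ids)     # corresponding token IDs
print(f"{len(ids) / len(text):.3f} tokens per character")
print(f"vocab size: {tokenizer.vocab_size}")
```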
### Detokenize Tab
- Convert token IDs back to text
- Support for various input formats (list, comma-separated, space-separated)
- Option to skip special tokens
- Verification of round-trip tokenization
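Accepting the three ID formats can be sketched in plain Python (`parse_token_ids` is a hypothetical helper for illustration, not the app's actual function):

```python
def parse_token_ids(raw: str) -> list[int]:
    """Accept '[1, 2, 3]', '1,2,3', or '1 2 3' and return a list of ints."""
    cleaned = raw.strip().strip("[]")          # drop optional list brackets
    parts = cleaned.replace(",", " ").split()  # unify comma/space separators
    return [int(p) for p in parts]

print(parse_token_ids("[15496, 11, 995]"))  # -> [15496, 11, 995]
print(parse_token_ids("15496, 11, 995"))    # -> [15496, 11, 995]
print(parse_token_ids("15496 11 995"))      # -> [15496, 11, 995]
```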
### Compare Tab
- Compare tokenization across multiple models simultaneously
- See token count differences and efficiency metrics
- Identify which tokenizer is most efficient for your use case
- Sort results by token count
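The comparison step reduces to counting tokens per model and sorting by count; a minimal sketch (the model names and counts below are illustrative, not real measurements):

```python
def rank_by_efficiency(results: dict[str, int]) -> list[tuple[str, int]]:
    """Sort (model, token_count) pairs ascending: fewer tokens = more efficient."""
    return sorted(results.items(), key=lambda kv: kv[1])

# Hypothetical token counts for the same input text
counts = {"model-a": 42, "model-b": 37, "model-c": 55}
for model, n in rank_by_efficiency(counts):
    print(f"{model}: {n} tokens")
```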
### Vocabulary Tab
- Explore tokenizer vocabulary details
- View special tokens and their configurations
- See vocabulary size and tokenizer type
- Browse the first 100 tokens in the vocabulary
## Supported Models

### Pre-configured Models
- Qwen Series: Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- Llama Series: Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- GPT Models: GPT-2, GPT-NeoX
- Google Models: Gemma, T5, BERT
- Mistral Models: Mistral 7B, Mixtral 8x7B
- Other Models: DeepSeek, Phi, Yi, BLOOM, OPT, StableLM
### Custom Models
You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:
- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`
## Technical Details
- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
## Quick Start

1. Select a tokenizer from the dropdown or enter a custom model ID
2. Enter your text in the input field
3. Click the action button (Tokenize, Decode, Compare, or Analyze)
4. View the results in the output fields
## Tips
- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (like `[CLS]`, `[SEP]`, `<s>`, `</s>`) are model-specific
- Subword tokenization allows handling of out-of-vocabulary words
- Token efficiency directly impacts model inference costs and API usage
## Local Development
To run this application locally:
```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```
The application will be available at `http://localhost:7860`.
## License
This project is licensed under the MIT License.