---
title: Tokenizer Playground
emoji: 🔀
colorFrom: blue
colorTo: purple
sdk: gradio
sdk_version: 4.44.1
app_file: app.py
pinned: true
license: mit
models:
  - Qwen/Qwen3-0.6B
  - Qwen/Qwen2.5-7B
  - meta-llama/Llama-3.1-8B
  - openai-community/gpt2
  - mistralai/Mistral-7B-v0.1
  - google/gemma-7b
tags:
  - tokenizer
  - nlp
  - text-processing
  - research-tool
short_description: Interactive tokenizer tool for NLP researchers
---

πŸ”€ Tokenizer Playground

An interactive web application for experimenting with Hugging Face tokenizers, built for NLP researchers and developers who need to quickly test and compare tokenization strategies.

## Features

πŸ”€ Tokenize Tab

  • Convert any text into tokens using popular models
  • View tokens, token IDs, and detailed token information
  • See tokenization statistics (tokens per character, vocabulary size, etc.)
  • Support for adding/removing special tokens
  • Custom model support via Hugging Face model IDs
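Under the hood, this tab boils down to a few `transformers` calls. A minimal sketch (the model choice here is illustrative; any Hub ID works the same way):

```python
from transformers import AutoTokenizer

# gpt2 is just a small, commonly available example model.
tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

text = "Tokenizers split text into subwords."
ids = tok.encode(text, add_special_tokens=False)
tokens = tok.convert_ids_to_tokens(ids)

print(tokens)  # the subword strings
print(ids)     # their integer IDs
print(f"{len(ids)} tokens / {len(text)} chars "
      f"= {len(ids) / len(text):.2f} tokens per char; vocab size {tok.vocab_size}")
```

Toggling `add_special_tokens` controls whether model-specific markers (e.g. BOS/EOS) are inserted around the text.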

πŸ”„ Detokenize Tab

  • Convert token IDs back to text
  • Support for various input formats (list, comma-separated, space-separated)
  • Option to skip special tokens
  • Verification of round-trip tokenization
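The decode direction, including a small parser for the accepted input formats, can be sketched as follows (`parse_ids` is a hypothetical helper, not the app's actual function):

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")

def parse_ids(raw: str) -> list[int]:
    # Accept "[1, 2, 3]", "1,2,3", or "1 2 3" style input.
    return [int(p) for p in raw.strip().strip("[]").replace(",", " ").split()]

ids = tok.encode("Hello world", add_special_tokens=False)
raw = ", ".join(map(str, ids))  # pretend the user pasted this string
decoded = tok.decode(parse_ids(raw), skip_special_tokens=True)

# Round-trip verification: re-encoding the decoded text yields the same IDs.
assert decoded == "Hello world"
assert tok.encode(decoded, add_special_tokens=False) == ids
```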

πŸ“Š Compare Tab

  • Compare tokenization across multiple models simultaneously
  • See token count differences and efficiency metrics
  • Identify which tokenizer is most efficient for your use case
  • Sort results by token count
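A comparison like this reduces to encoding the same text with each tokenizer and sorting by count. A sketch with two small tokenizers (the app compares many more):

```python
from transformers import AutoTokenizer

text = "Tokenization efficiency varies across models."
# Illustrative picks; any set of Hub model IDs works.
model_ids = ["openai-community/gpt2", "google-bert/bert-base-uncased"]

counts = {}
for model_id in model_ids:
    tok = AutoTokenizer.from_pretrained(model_id)
    counts[model_id] = len(tok.encode(text, add_special_tokens=False))

# Sort ascending: fewer tokens means cheaper inference for this text.
for model_id, n in sorted(counts.items(), key=lambda kv: kv[1]):
    print(f"{model_id}: {n} tokens")
```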

πŸ“– Vocabulary Tab

  • Explore tokenizer vocabulary details
  • View special tokens and their configurations
  • See vocabulary size and tokenizer type
  • Browse first 100 tokens in the vocabulary
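The same vocabulary details are available programmatically; a short sketch:

```python
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("openai-community/gpt2")
print("vocab size:", tok.vocab_size)
print("tokenizer class:", type(tok).__name__)
print("special tokens:", tok.special_tokens_map)

# First 10 entries by token ID (the app shows the first 100).
vocab = tok.get_vocab()
for token, token_id in sorted(vocab.items(), key=lambda kv: kv[1])[:10]:
    print(token_id, repr(token))
```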

## Supported Models

### Pre-configured Models

- **Qwen Series:** Qwen 3, Qwen 2.5, Qwen 2, Qwen 1 (multiple sizes)
- **Llama Series:** Llama 3.2, Llama 3.1, Llama 2 (multiple sizes)
- **GPT Models:** GPT-2, GPT-NeoX
- **Google Models:** Gemma, T5, BERT
- **Mistral Models:** Mistral 7B, Mixtral 8x7B
- **Other Models:** DeepSeek, Phi, Yi, BLOOM, OPT, StableLM

### Custom Models

You can use any tokenizer available on the Hugging Face Hub by entering its model ID in the "Custom Model ID" field. Examples:

- `facebook/bart-base`
- `EleutherAI/gpt-j-6b`
- `bigscience/bloom`
- `stabilityai/stablelm-2-1_6b`

## Technical Details

- Built with Gradio for an intuitive web interface
- Uses Hugging Face Transformers for tokenizer support
- Supports both fast (Rust-based) and slow (Python-based) tokenizers
- Caches loaded tokenizers for improved performance
- Handles special tokens and custom vocabularies
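The Space's actual caching code isn't shown here, but `functools.lru_cache` conveys the idea in a few lines (a minimal sketch, not the app's implementation):

```python
from functools import lru_cache

from transformers import AutoTokenizer

@lru_cache(maxsize=32)
def get_tokenizer(model_id: str):
    # First call downloads/loads the tokenizer; repeat calls are instant.
    return AutoTokenizer.from_pretrained(model_id)

tok_a = get_tokenizer("openai-community/gpt2")
tok_b = get_tokenizer("openai-community/gpt2")
assert tok_a is tok_b  # the second call returned the cached object
```

Caching matters here because loading a tokenizer can involve a network fetch and file parsing, which would otherwise happen on every button click.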

## Quick Start

1. Select a tokenizer from the dropdown or enter a custom model ID
2. Enter your text in the input field
3. Click the action button (Tokenize, Decode, Compare, or Analyze)
4. View the results in the output fields

## Tips

- Different tokenizers can produce significantly different token counts for the same text
- Special tokens (such as `[CLS]`, `[SEP]`, `<s>`, and `</s>`) are model-specific
- Subword tokenization lets a model handle out-of-vocabulary words
- Token efficiency directly affects inference cost and API usage

## Local Development

To run this application locally:

```bash
# Clone the repository
git clone <your-repo-url>
cd tokenizer-playground

# Install dependencies
pip install -r requirements.txt

# Run the application
python app.py
```

The application will be available at http://localhost:7860.

## License

This project is licensed under the MIT License.