Llama-3.2-11B-Vision-PokemonCard-OCR-LoRA
📋 Description
This is a Vision-Language Model (VLM) fine-tuned from Llama-3.2-11B-Vision-Instruct to perform Optical Character Recognition (OCR) on Pokémon trading cards. The model is trained to extract card names and card numbers in JSON format.
🎯 Purpose
This model specializes in:
- Recognizing Pokémon card names (English/Japanese)
- Extracting card numbers
- Outputting results in standardized JSON format
🔧 Technical Specifications
Base Model
- Original Model: unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit
- Quantization: 4-bit (BitsAndBytes)
- Framework: Unsloth + Transformers
Fine-tuning Parameters
- LoRA r: 16
- LoRA alpha: 16
- LoRA dropout: 0.05
- Target modules: Vision layers + Language layers + Attention + MLP
- Learning rate: 2e-4
- Batch size: 2 (per device)
- Gradient accumulation: 8 steps
- Total training steps: 100
- Optimizer: AdamW 8-bit
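As a rough illustration, the parameters listed above could be collected as keyword arguments like this. This is a sketch, not the author's training script: the mapping of "Vision layers + Language layers + Attention + MLP" onto the `finetune_*` flags of Unsloth's `FastVisionModel.get_peft_model` is an assumption, as is `bias="none"`, and the names `LORA_KWARGS`, `TRAINING_KWARGS`, and `attach_lora` are hypothetical.

```python
# Values come from the parameter list above unless marked as an assumption.
LORA_KWARGS = dict(
    finetune_vision_layers=True,      # "Vision layers" (assumed mapping)
    finetune_language_layers=True,    # "Language layers" (assumed mapping)
    finetune_attention_modules=True,  # "Attention" (assumed mapping)
    finetune_mlp_modules=True,        # "MLP" (assumed mapping)
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",  # assumption: not stated in this card
)

TRAINING_KWARGS = dict(
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    max_steps=100,
    learning_rate=2e-4,
    optim="adamw_8bit",
)


def attach_lora(model):
    """Wrap an already loaded base model with LoRA adapters (needs a CUDA GPU)."""
    # Imported lazily so the configuration above can be inspected without a GPU.
    from unsloth import FastVisionModel

    return FastVisionModel.get_peft_model(model, **LORA_KWARGS)
```

`TRAINING_KWARGS` uses the field names of Hugging Face `TrainingArguments`, so it could plausibly be unpacked into a TRL `SFTConfig`; the remaining arguments the actual run used are not documented here.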
Dataset
- Total samples: 4,033 Pokémon cards
- Train set: 3,629 samples (90%)
- Test set: 404 samples (10%)
- Format: Card images + Ground truth JSON
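To put the step count in context, the batch settings and dataset sizes above imply how much of the training set was seen. The arithmetic below is a small worked example; it assumes a single GPU, which this card does not state.

```python
# Values taken from the parameter and dataset lists above.
per_device_batch_size = 2
gradient_accumulation_steps = 8
total_steps = 100
train_samples = 3629
test_samples = 404
num_gpus = 1  # assumption: the card does not state the GPU count

effective_batch_size = per_device_batch_size * gradient_accumulation_steps * num_gpus
samples_seen = effective_batch_size * total_steps
epochs_covered = samples_seen / train_samples
train_fraction = train_samples / (train_samples + test_samples)

print(effective_batch_size)      # 16
print(samples_seen)              # 1600
print(round(epochs_covered, 2))  # 0.44
print(round(train_fraction, 2))  # 0.9
```

Under that assumption, 100 steps cover a little under half of one pass over the training set.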
📊 Training Results
| Metric | Value |
|---|---|
| Training Loss (step 50) | 0.0127 |
| Validation Loss (step 50) | 0.0125 |
| Training Loss (step 100) | 0.0267 |
| Validation Loss (step 100) | 0.0105 |
| Training Time | ~1h 43m |
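The table reports loss only. A field-level exact-match score on the test set would measure OCR quality more directly. The following is a minimal sketch, assuming predictions and references are already parsed into the JSON layout shown under Example Output; `flatten_card_details` and `field_accuracy` are hypothetical helpers, and the second record in the demo is made up for illustration.

```python
def flatten_card_details(record):
    """Merge the single-key dicts under gt_parse.card_details into one dict."""
    merged = {}
    for item in record.get("gt_parse", {}).get("card_details", []):
        merged.update(item)
    return merged


def field_accuracy(predictions, references, fields=("name_en", "card_number")):
    """Return the fraction of exact matches per field over paired records."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    hits = {field: 0 for field in fields}
    for pred, ref in zip(predictions, references):
        pred_flat = flatten_card_details(pred)
        ref_flat = flatten_card_details(ref)
        for field in fields:
            # A field missing from the reference is counted as a miss.
            if field in ref_flat and pred_flat.get(field) == ref_flat[field]:
                hits[field] += 1
    total = len(references)
    return {field: (hits[field] / total if total else 0.0) for field in fields}


def make_record(name, number):
    """Build one record in the card's JSON layout (demo helper)."""
    return {"gt_parse": {"card_details": [{"name_en": name}, {"card_number": number}]}}


references = [make_record("Sableye", "61/113"), make_record("Card B", "1/100")]
predictions = [make_record("Sableye", "61/113"), make_record("Card B", "7/100")]
print(field_accuracy(predictions, references))
# {'name_en': 1.0, 'card_number': 0.5}
```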
🚀 Installation
pip install unsloth datasets transformers accelerate trl bitsandbytes pillow tqdm
pip install pyarrow==19.0.0
💻 Usage
1. Load Model
from unsloth import FastVisionModel
import torch
# Load base model
model_id = "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit"
model, tokenizer = FastVisionModel.from_pretrained(
    model_id,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)
# Load LoRA adapter
lora_model_path = "lora_pokemon_model"
model.load_adapter(lora_model_path, adapter_name="default")
FastVisionModel.for_inference(model)
2. Inference
from PIL import Image
from transformers import TextStreamer
# Path to card image
image_path = "path/to/pokemon_card.jpg"
# The processor expects a PIL image, not a file path
image = Image.open(image_path).convert("RGB")
# Instruction prompt
instruction = "You are an OCR expert specialized in Pokémon cards. Extract the card name and card number in JSON format."
# Prepare input
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]
input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda" if torch.cuda.is_available() else "cpu")
# Generate
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.9,
    use_cache=True,
)
# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)
3. Example Output
Input: Pokémon card image
Output:
{
  "gt_parse": {
    "card_details": [
      {"name_en": "Sableye"},
      {"card_number": "61/113"}
    ]
  }
}
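Because the decoded text may contain the prompt or other tokens around the JSON, downstream code usually needs to locate and parse the object first. The following is a minimal sketch using only the standard library; `extract_card_fields` is a hypothetical helper, and the sample string is an assumed shape of the decoded output.

```python
import json


def extract_card_fields(generated_text):
    """Find the first decodable JSON object in the text and flatten card_details."""
    decoder = json.JSONDecoder()
    start = generated_text.find("{")
    while start != -1:
        try:
            payload, _ = decoder.raw_decode(generated_text, start)
        except json.JSONDecodeError:
            # Not valid JSON from this brace; try the next one.
            start = generated_text.find("{", start + 1)
            continue
        fields = {}
        for item in payload.get("gt_parse", {}).get("card_details", []):
            fields.update(item)
        return fields
    return None  # no JSON object found


sample = (
    'assistant\n\n{"gt_parse": {"card_details": '
    '[{"name_en": "Sableye"}, {"card_number": "61/113"}]}}'
)
print(extract_card_fields(sample))
# {'name_en': 'Sableye', 'card_number': '61/113'}
```

Returning `None` when no object is found lets the caller distinguish a malformed generation from a valid but empty result.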
📁 Dataset Structure
datasets/
├── images/ # Card images (.jpg)
└── ground_truth/ # JSON ground truth files
    ├── card1.json
    ├── card2.json
    └── ...
Ground Truth Format:
{
  "gt_parse": {
    "card_details": [
      {"name_en": "Card Name"},
      {"card_number": "XX/YYY"}
    ]
  }
}
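One way to turn an image and its ground-truth file into a training sample is a two-turn conversation, with the JSON serialized as the assistant's reply. The sketch below assumes the chat-style layout used in Unsloth's vision fine-tuning examples and reuses the inference instruction as the training prompt; neither is confirmed by this card, and `to_conversation` is a hypothetical helper.

```python
import json

# Assumption: training used the same instruction as the inference prompt.
INSTRUCTION = (
    "You are an OCR expert specialized in Pokémon cards. "
    "Extract the card name and card number in JSON format."
)


def to_conversation(image, ground_truth):
    """Pair one card image with its ground-truth JSON as a user/assistant exchange."""
    # ensure_ascii=False keeps non-ASCII card names readable in the target text.
    answer = json.dumps(ground_truth, ensure_ascii=False)
    return {
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "text", "text": INSTRUCTION},
                    {"type": "image", "image": image},
                ],
            },
            {
                "role": "assistant",
                "content": [{"type": "text", "text": answer}],
            },
        ]
    }


ground_truth = {
    "gt_parse": {
        "card_details": [{"name_en": "Card Name"}, {"card_number": "XX/YYY"}]
    }
}
# A real run would pass a PIL image; a placeholder string is used here.
sample = to_conversation("<image placeholder>", ground_truth)
print(sample["messages"][1]["content"][0]["text"])
```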
⚙️ System Requirements
- GPU: NVIDIA GPU with at least 15GB VRAM (T4/V100/A100 recommended)
- RAM: Minimum 16GB
- CUDA: 11.8+ or 12.x
- Python: 3.8+
📝 Training from Scratch
To retrain the model:
- Prepare dataset according to the structure above
- Run the notebook fineture_pokemon_card.ipynb
- Model will be saved to lora_pokemon_model/
🔍 Limitations
- The model extracts only card names (English or Japanese) and card numbers
- Accuracy depends on input image quality
- Does not support extraction of other information (HP, Type, Rarity, etc.)
🤝 Contributing
Contributions to improve the model are welcome! Please create an issue or pull request.
📄 License
This model inherits the license from Llama 3.2 (Meta) and Unsloth.
🙏 Credits
- Base Model: Meta (Llama 3.2)
- Optimization: Unsloth
- Dataset: Pokemon TCG Card Database
Note: This model is intended for research and learning purposes. For production use, further evaluation of accuracy and robustness is required.