Llama-3.2-11B-Vision-PokemonCard-OCR-LoRA

📋 Description

This is a Vision-Language Model (VLM) fine-tuned from Llama-3.2-11B-Vision-Instruct to perform Optical Character Recognition (OCR) on Pokémon trading cards. The model is trained to extract card names and card numbers in JSON format.

🎯 Purpose

This model specializes in:

  • Recognizing Pokémon card names (English/Japanese)
  • Extracting card numbers
  • Outputting results in standardized JSON format

🔧 Technical Specifications

Base Model

  • Original Model: unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit
  • Quantization: 4-bit (BitsAndBytes)
  • Framework: Unsloth + Transformers

Fine-tuning Parameters

- LoRA r: 16
- LoRA alpha: 16
- LoRA dropout: 0.05
- Target modules: Vision layers + Language layers + Attention + MLP
- Learning rate: 2e-4
- Batch size: 2 (per device)
- Gradient accumulation: 8 steps
- Total training steps: 100
- Optimizer: AdamW 8-bit
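
These values map roughly onto Unsloth's FastVisionModel API as shown below. This is a minimal sketch rather than the original training code; the keyword names follow Unsloth's public vision fine-tuning examples and may differ slightly between versions.

from unsloth import FastVisionModel

# Base model in 4-bit, as listed above
model, tokenizer = FastVisionModel.from_pretrained(
    "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit",
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Attach LoRA adapters to vision, language, attention and MLP modules
model = FastVisionModel.get_peft_model(
    model,
    finetune_vision_layers=True,
    finetune_language_layers=True,
    finetune_attention_modules=True,
    finetune_mlp_modules=True,
    r=16,
    lora_alpha=16,
    lora_dropout=0.05,
    bias="none",
)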

Dataset

  • Total samples: 4,033 Pokémon cards
  • Train set: 3,629 samples (90%)
  • Test set: 404 samples (10%)
  • Format: Card images + Ground truth JSON

📊 Training Results

Metric                        Value
Training Loss (step 50)       0.0127
Validation Loss (step 50)     0.0125
Training Loss (step 100)      0.0267
Validation Loss (step 100)    0.0105
Training Time                 ~1h 43m

🚀 Installation

pip install unsloth datasets transformers accelerate trl bitsandbytes pillow tqdm
pip install pyarrow==19.0.0

💻 Usage

1. Load Model

from unsloth import FastVisionModel
import torch

# Load base model
model_id = "unsloth/Llama-3.2-11B-Vision-Instruct-bnb-4bit"
model, tokenizer = FastVisionModel.from_pretrained(
    model_id,
    load_in_4bit=True,
    use_gradient_checkpointing="unsloth",
)

# Load LoRA adapter
lora_model_path = "lora_pokemon_model"
model.load_adapter(lora_model_path, adapter_name="default")
FastVisionModel.for_inference(model)

2. Inference

from transformers import TextStreamer
from PIL import Image

# Load the card image (the processor expects a PIL image, not a file path)
image_path = "path/to/pokemon_card.jpg"
image = Image.open(image_path).convert("RGB")

# Instruction prompt
instruction = "You are an OCR expert specialized in Pokémon cards. Extract the card name and card number in JSON format."

# Prepare input
messages = [
    {"role": "user", "content": [
        {"type": "image"},
        {"type": "text", "text": instruction}
    ]}
]

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True)
inputs = tokenizer(
    image,  # PIL image loaded above
    input_text,
    add_special_tokens=False,
    return_tensors="pt",
).to("cuda" if torch.cuda.is_available() else "cpu")

# Generate
text_streamer = TextStreamer(tokenizer, skip_prompt=True)
outputs = model.generate(
    **inputs,
    streamer=text_streamer,
    max_new_tokens=128,
    temperature=0.7,
    top_p=0.9,
    use_cache=True,
)

# Decode result
result = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(result)

3. Example Output

Input: Pokémon card image

Output:

{
  "gt_parse": {
    "card_details": [
      {"name_en": "Sableye"},
      {"card_number": "61/113"}
    ]
  }
}
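
Because the decoded text can contain the prompt or extra tokens around the JSON, it is often convenient to extract and flatten the result. The helper below is a hypothetical sketch (not part of the model card's original code) that pulls the first JSON object out of the decoded output:

import json
import re

def parse_card_json(decoded_text: str) -> dict:
    """Hypothetical helper: extract the first JSON object from the decoded output."""
    match = re.search(r"\{.*\}", decoded_text, re.DOTALL)
    if match is None:
        raise ValueError("No JSON object found in model output")
    parsed = json.loads(match.group(0))
    details = parsed["gt_parse"]["card_details"]
    # Flatten the list of single-key dicts into one record
    return {k: v for d in details for k, v in d.items()}

# Example: parse_card_json(result) -> {'name_en': 'Sableye', 'card_number': '61/113'}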

📁 Dataset Structure

datasets/
├── images/              # Card images (.jpg)
└── ground_truth/        # JSON ground truth files
    ├── card1.json
    ├── card2.json
    └── ...

Ground Truth Format:

{
  "gt_parse": {
    "card_details": [
      {"name_en": "Card Name"},
      {"card_number": "XX/YYY"}
    ]
  }
}
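
The conversion from this structure to training samples is not included in this card. The sketch below shows one plausible mapping of an (image, ground truth) pair into the chat format used in the Usage section; function and variable names are illustrative assumptions.

import json
from pathlib import Path
from PIL import Image

INSTRUCTION = (
    "You are an OCR expert specialized in Pokémon cards. "
    "Extract the card name and card number in JSON format."
)

def to_conversation(image_path: Path, gt_path: Path) -> dict:
    """Hypothetical converter: one (image, ground truth) pair -> chat-format sample."""
    ground_truth = json.loads(gt_path.read_text(encoding="utf-8"))
    return {
        "messages": [
            {"role": "user", "content": [
                {"type": "image", "image": Image.open(image_path).convert("RGB")},
                {"type": "text", "text": INSTRUCTION},
            ]},
            {"role": "assistant", "content": [
                {"type": "text", "text": json.dumps(ground_truth, ensure_ascii=False)},
            ]},
        ]
    }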

⚙️ System Requirements

  • GPU: NVIDIA GPU with at least 15GB VRAM (T4/V100/A100 recommended)
  • RAM: Minimum 16GB
  • CUDA: 11.8+ or 12.x
  • Python: 3.8+
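
A quick sanity check (not part of the original workflow) to confirm the GPU meets the VRAM requirement before loading the model:

import torch

if not torch.cuda.is_available():
    raise SystemExit("A CUDA-capable GPU is required")
props = torch.cuda.get_device_properties(0)
print(f"GPU: {props.name}, VRAM: {props.total_memory / 1024**3:.1f} GB")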

📝 Training from Scratch

To retrain the model:

  1. Prepare dataset according to the structure above
  2. Run the notebook fineture_pokemon_card.ipynb
  3. Model will be saved to lora_pokemon_model/
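
For reference, a minimal training sketch combining the hyperparameters above with Unsloth's vision trainer setup is shown below. It assumes model and tokenizer were prepared with get_peft_model as in the earlier sketch, and that train_dataset is a list of chat-format samples (see Dataset Structure); keyword names follow recent Unsloth/TRL versions and may need adjusting for the exact notebook environment.

from unsloth import FastVisionModel, is_bf16_supported
from unsloth.trainer import UnslothVisionDataCollator
from trl import SFTTrainer, SFTConfig

FastVisionModel.for_training(model)  # switch the LoRA model into training mode

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    data_collator=UnslothVisionDataCollator(model, tokenizer),
    train_dataset=train_dataset,  # chat-format samples built from images + ground truth
    args=SFTConfig(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=8,
        max_steps=100,
        learning_rate=2e-4,
        optim="adamw_8bit",
        fp16=not is_bf16_supported(),
        bf16=is_bf16_supported(),
        output_dir="outputs",
        # Required for vision fine-tuning with TRL
        remove_unused_columns=False,
        dataset_text_field="",
        dataset_kwargs={"skip_prepare_dataset": True},
    ),
)
trainer.train()

# Save only the LoRA adapter and processor
model.save_pretrained("lora_pokemon_model")
tokenizer.save_pretrained("lora_pokemon_model")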

🔍 Limitations

  • The model recognizes only English and Japanese card names and card numbers
  • Accuracy depends on input image quality
  • Does not support extraction of other information (HP, Type, Rarity, etc.)

🤝 Contributing

Contributions to improve the model are welcome! Please create an issue or pull request.

📄 License

This model inherits the license from Llama 3.2 (Meta) and Unsloth.

🙏 Credits

  • Base Model: Meta (Llama 3.2)
  • Optimization: Unsloth
  • Dataset: Pokemon TCG Card Database

Note: This model is optimized for research and learning purposes. For production use, further evaluation of accuracy and robustness is required.
