LLaVA 7B — Multimodal Continual Pretraining (CPT) with LoRA Adapters
Model type: Vision-Language Causal Model
Base model: llava-hf/llava-1.5-7b-hf
License: Llama 2 Community License (inherits from base)
Framework: Axolotl + DeepSpeed ZeRO-1
Overview
llava-7b-cpt is a continually pretrained multimodal version of LLaVA 1.5 7B, extending its visual and textual reasoning capabilities through domain-specific continual pretraining (CPT).
The process follows a two-stage adaptation flow:
Textual CPT (Stage 1):
- Base: llava-hf/llava-1.5-7b-hf
- Objective: text-only continual pretraining on scientific, governmental, news, and encyclopedic corpora.
Multimodal CPT (Stage 2, this release):
- Base: the Stage 1 text-CPT model
- Objective: multimodal (image–text) continual pretraining using image-caption dialogue data.
This pipeline enhances LLaVA’s factual grounding and image-conditioned understanding of technical and energy-domain visual content.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl 0.6 with DeepSpeed ZeRO-1 and bfloat16 precision.
Training Setup
| Component | Specification |
|---|---|
| Objective | Multimodal continual pretraining (image–text dialogue) |
| Adapter type | LoRA |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total) |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1) |
| Runtime | ≈ 24 hours |
| Checkpoints | Saved every epoch |
| Vision tower | Frozen |
| Text backbone | LoRA-updated only |
| Loss watchdog | Disabled for multimodal phase |
Dataset
The multimodal CPT stage was trained on image–caption chat-style pairs stored in an Axolotl-compatible JSONL file (mm_captions_chat.jsonl) of LLaVA-style message lists.
| File | Description |
|---|---|
| mm_captions_chat.jsonl | Image–text dialogues for visual captioning and VQA adaptation |
| images/ | Folder of referenced image files used by the dataset entries |
Each entry contains alternating user (image + text prompt) and assistant (caption/answer) messages in a chat structure compatible with the llava chat template.
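For reference, the sketch below shows what a single dataset line might look like, assuming the common LLaVA/Axolotl chat layout. The field names (`messages`, `images`, `role`, `content`) and the image path are illustrative assumptions, not taken from the released files:

```python
import json

# Minimal sketch of one mm_captions_chat.jsonl entry (assumed layout, hypothetical image path).
entry = {
    "images": ["images/turbine_001.jpg"],
    "messages": [
        {
            "role": "user",
            "content": "<image>\nDescribe the equipment shown in this photo.",
        },
        {
            "role": "assistant",
            "content": "The image shows a wind turbine nacelle during maintenance.",
        },
    ],
}

# Each dataset line is one JSON object (JSON Lines format).
with open("mm_captions_chat.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```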
Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 6000 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup ratio | 0.1 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for stability) |
| Image size | 512 |
| Resize algorithm | bilinear |
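Axolotl builds the adapter from its YAML config, but as a rough illustration of what the LoRA rows above correspond to, the sketch below sets up an equivalent `peft.LoraConfig` on the base model. The regex form of `target_modules`, used here to keep the frozen CLIP vision tower (which also contains q_proj/k_proj/v_proj layers) out of the adapter, is an assumption about module naming rather than the actual training code:

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative only: Axolotl performs the equivalent setup internally.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)

# LoRA hyperparameters from the table above; the regex restricts the adapter
# to the text backbone so the CLIP vision tower stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```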
Model Flow
- Base: llava-hf/llava-1.5-7b-hf
- Stage 1 — Textual Continual Pretraining (CPT) → llava-7b-text-cpt
- Stage 2 — Multimodal Continual Pretraining (CPT) → ubitech-edg/llava-7b-cpt (this release)
Tokenizer & Processor
| Component | Value |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Special tokens | `<pad>` (ID 32001) |
| Chat template | llava |
Usage
To load and run llava-7b-cpt locally for image–text generation:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt"

# LLaVA checkpoints load through LlavaForConditionalGeneration rather than AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LLaVA 1.5 prompt format: the <image> placeholder marks where the image features are inserted.
image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe this image in two sentences.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```
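Since the card ships a llava chat template, the prompt can also be built by the processor instead of by hand. This sketch reuses the `model`, `processor`, and `image` objects from the snippet above and assumes a transformers version recent enough to expose `apply_chat_template` on processors:

```python
# Build the prompt via the processor's chat template instead of writing the
# USER/ASSISTANT string manually (reuses model, processor, image from above).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```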