LLaVA 7B — Multimodal Continual Pretraining (CPT) with LoRA Adapters

Model type: Vision-Language Causal Model
Base model: llava-hf/llava-1.5-7b-hf
License: Llama 2 Community License (inherits from base)
Framework: Axolotl + DeepSpeed ZeRO-1


Overview

llava-7b-cpt is a continually pretrained multimodal version of LLaVA 1.5 7B, extending its visual and textual reasoning capabilities through domain-specific continual pretraining (CPT).
The process follows a two-stage adaptation flow:

  1. Textual CPT (Stage 1):
    • Base: llava-hf/llava-1.5-7b-hf
    • Objective: text-only continual pretraining on scientific, governmental, news, and encyclopedic corpora.
  2. Multimodal CPT (Stage 2, this release):
    • Base: the Stage 1 text-CPT model
    • Objective: multimodal (image–text) continual pretraining on image–caption dialogue data.

This pipeline enhances LLaVA’s factual grounding and image-conditioned understanding of technical and energy-domain visual content.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl 0.6 with DeepSpeed ZeRO-1 and bfloat16 precision.


Training Setup

| Component     | Specification                                           |
|---------------|----------------------------------------------------------|
| Objective     | Multimodal continual pretraining (image–text dialogue)   |
| Adapter type  | LoRA                                                      |
| Precision     | bfloat16                                                  |
| Hardware      | 8 nodes × 2 NVIDIA A100 64 GB GPUs                        |
| Framework     | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1)    |
| Runtime       | ≈ 24 hours                                                |
| Checkpoints   | Saved every epoch                                         |
| Vision tower  | Frozen                                                    |
| Text backbone | LoRA-updated only                                         |
| Loss watchdog | Disabled for the multimodal phase                         |
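
For reference, the freeze pattern above (LoRA on the text backbone, vision tower untouched) can be reproduced with Hugging Face PEFT as in the minimal sketch below. This is an illustrative assumption about equivalent tooling, not the authors' Axolotl internals.

import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

base = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)

# The regex restricts LoRA to the language-model projections; vision_tower modules are skipped.
lora_cfg = LoraConfig(
    r=16, lora_alpha=32, lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
)
model = get_peft_model(base, lora_cfg)
model.print_trainable_parameters()  # only the injected LoRA matrices require gradients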

Dataset

The multimodal CPT stage was trained on image–caption chat-style pairs, provided in an Axolotl-compatible JSONL file (mm_captions_chat.jsonl) of LLaVA-style message lists.

| File                   | Description                                                   |
|------------------------|---------------------------------------------------------------|
| mm_captions_chat.jsonl | Image–text dialogues for visual captioning and VQA adaptation |
| images/                | Folder of referenced image files used by the dataset entries  |

Each entry contains alternating user (image + text prompt) and assistant (caption/answer) messages in a chat structure compatible with the llava chat template.
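
The exact schema is not reproduced here, but one line of mm_captions_chat.jsonl might look roughly like the following; field names and the image path are illustrative assumptions in the Axolotl/LLaVA messages style, not a verbatim record.

{"messages": [
  {"role": "user",
   "content": [{"type": "image", "path": "images/example_0001.jpg"},
               {"type": "text", "text": "Describe this diagram."}]},
  {"role": "assistant",
   "content": [{"type": "text", "text": "<caption or answer text>"}]}
]}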


Hyperparameters

| Parameter              | Value                                                          |
|------------------------|----------------------------------------------------------------|
| Sequence length        | 2048                                                           |
| Micro batch size       | 1                                                              |
| Gradient accumulation  | 4                                                              |
| Epochs                 | 1                                                              |
| Max steps              | 6000                                                           |
| Learning rate          | 0.00015                                                        |
| LR scheduler           | cosine                                                         |
| Optimizer              | AdamW (8-bit)                                                  |
| Warmup ratio           | 0.1                                                            |
| Weight decay           | 0.0                                                            |
| LoRA rank (r)          | 16                                                             |
| LoRA alpha             | 32                                                             |
| LoRA dropout           | 0.05                                                           |
| LoRA target modules    | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj  |
| Gradient checkpointing | ✅                                                             |
| Flash attention        | ❌ (disabled for stability)                                    |
| Image size             | 512                                                            |
| Resize algorithm       | bilinear                                                       |
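
For orientation, these settings map onto an Axolotl LoRA configuration roughly as sketched below. This is not the authors' config file: option names follow Axolotl's documented keys, the Stage 1 base checkpoint is a placeholder, and the image-option key names are assumptions.

base_model: <stage-1 text-CPT checkpoint>   # placeholder; Stage 2 starts from the Stage 1 model
adapter: lora
lora_r: 16
lora_alpha: 32
lora_dropout: 0.05
lora_target_modules: [q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj]

sequence_len: 2048
micro_batch_size: 1
gradient_accumulation_steps: 4
num_epochs: 1
max_steps: 6000
learning_rate: 0.00015
lr_scheduler: cosine
optimizer: adamw_bnb_8bit
warmup_ratio: 0.1
weight_decay: 0.0

bf16: true
gradient_checkpointing: true
flash_attention: false
deepspeed: deepspeed_configs/zero1.json

chat_template: llava
image_size: 512                   # assumed key names for the image options above
image_resize_algorithm: bilinear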

Model Flow

Base: llava-hf/llava-1.5-7b-hf

Stage 1 — Textual Continual Pretraining (CPT) → llava-7b-text-cpt

Stage 2 — Multimodal Continual Pretraining (CPT) → ubitech-edg/llava-7b-cpt


Tokenizer & Processor

| Component      | Value            |
|----------------|------------------|
| Tokenizer type | AutoTokenizer    |
| Processor type | AutoProcessor    |
| Special tokens | <pad> = ID 32001 |
| Chat template  | llava            |
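
Because the processor ships the llava chat template, prompts can also be rendered with apply_chat_template instead of writing the USER/ASSISTANT string by hand (a sketch following the standard llava-hf message convention, as in the Usage example below):

from transformers import AutoProcessor

processor = AutoProcessor.from_pretrained("ubitech-edg/llava-7b-cpt")

messages = [
    {"role": "user",
     "content": [{"type": "image"},
                 {"type": "text", "text": "Describe this image in two sentences."}]},
]

# Builds the USER/ASSISTANT prompt string with the <image> placeholder expected by the model.
prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
print(prompt)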

Usage

To load and run llava-7b-cpt locally for image–text generation:

from transformers import LlavaForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model_id = "ubitech-edg/llava-7b-cpt"

processor = AutoProcessor.from_pretrained(model_id)

# LLaVA checkpoints load through the vision-language class, not AutoModelForCausalLM.
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe this image in two sentences.\nASSISTANT:"

# Cast to bfloat16 so pixel_values match the model's precision.
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device, torch.bfloat16)

with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)

print(processor.decode(output[0], skip_special_tokens=True))
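
If the checkpoint is consumed as LoRA adapter weights on top of the base model rather than as merged weights, the equivalent peft-based loading path is sketched below; merge_and_unload() optionally folds the LoRA deltas into the base weights for standalone deployment.

import torch
from transformers import LlavaForConditionalGeneration
from peft import PeftModel

base = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, "ubitech-edg/llava-7b-cpt")

# Optional: merge the LoRA deltas into the base weights and drop the adapter wrappers.
model = model.merge_and_unload()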