LLaVA 7B — Multimodal Continual Pretraining (CPT) with LoRA Adapters
Model type: Vision-Language Causal Model
Base model: llava-hf/llava-1.5-7b-hf
License: Llama 2 Community License (inherits from base)
Framework: Axolotl + DeepSpeed ZeRO-1
Overview
llava-7b-cpt is a continually pretrained multimodal version of LLaVA 1.5 7B, extending its visual and textual reasoning capabilities through domain-specific continual pretraining (CPT).
The process follows a two-stage adaptation flow:
Textual CPT (Stage 1):
- Base: llava-hf/llava-1.5-7b-hf
- Objective: text-only continual pretraining on scientific, governmental, news, and encyclopedic corpora.
Multimodal CPT (Stage 2, this release):
- Base: the Stage 1 text-CPT model
- Objective: multimodal (image–text) continual pretraining using image-caption dialogue data.
This pipeline enhances LLaVA’s factual grounding and image-conditioned understanding of technical and energy-domain visual content.
Training was performed on the Leonardo EuroHPC supercomputer using Axolotl 0.6 with DeepSpeed ZeRO-1 and bfloat16 precision.
Training Setup
| Component | Specification |
|---|---|
| Objective | Multimodal continual pretraining (image–text dialogue) |
| Adapter type | LoRA |
| Precision | bfloat16 |
| Hardware | 8 nodes × 2 NVIDIA A100 64 GB GPUs (16 GPUs total) |
| Framework | Axolotl + DeepSpeed ZeRO-1 (PyTorch 2.5.1 + CUDA 12.1) |
| Runtime | ≈ 24 hours |
| Checkpoints | Saved every epoch |
| Vision tower | Frozen |
| Text backbone | LoRA-updated only |
| Loss watchdog | Disabled for multimodal phase |
Dataset
The multimodal CPT stage was trained on image–caption chat-style pairs stored in an Axolotl-compatible JSONL file (mm_captions_chat.jsonl) of LLaVA-style message lists.
| File | Description |
|---|---|
| mm_captions_chat.jsonl | Image–text dialogues for visual captioning and VQA adaptation |
| images/ | Folder of referenced image files used by the dataset entries |
Each entry contains alternating user (image + text prompt) and assistant (caption/answer) messages in a chat structure compatible with the llava chat template.
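For reference, the sketch below shows what a single dataset line might look like, assuming the common LLaVA/Axolotl chat layout. The field names (`messages`, `images`, `role`, `content`) and the image path are illustrative assumptions, not taken from the released files:

```python
import json

# Minimal sketch of one mm_captions_chat.jsonl entry (assumed layout, hypothetical image path).
entry = {
    "images": ["images/turbine_001.jpg"],
    "messages": [
        {
            "role": "user",
            "content": "<image>\nDescribe the equipment shown in this photo.",
        },
        {
            "role": "assistant",
            "content": "The image shows a wind turbine nacelle during maintenance.",
        },
    ],
}

# Each dataset line is one JSON object (JSON Lines format).
with open("mm_captions_chat.jsonl", "a", encoding="utf-8") as f:
    f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```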
Hyperparameters
| Parameter | Value |
|---|---|
| Sequence length | 2048 |
| Micro batch size | 1 |
| Gradient accumulation | 4 |
| Epochs | 1 |
| Max steps | 6000 |
| Learning rate | 0.00015 |
| LR scheduler | cosine |
| Optimizer | AdamW (8-bit) |
| Warmup ratio | 0.1 |
| Weight decay | 0.0 |
| LoRA rank (r) | 16 |
| LoRA alpha | 32 |
| LoRA dropout | 0.05 |
| LoRA target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj |
| Gradient checkpointing | ✅ |
| Flash attention | ❌ (disabled for stability) |
| Image size | 512 |
| Resize algorithm | bilinear |
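Axolotl builds the adapter from its YAML config, but as a rough illustration of what the LoRA rows above correspond to, the sketch below sets up an equivalent `peft.LoraConfig` on the base model. The regex form of `target_modules`, used here to keep the frozen CLIP vision tower (which also contains q_proj/k_proj/v_proj layers) out of the adapter, is an assumption about module naming rather than the actual training code:

```python
import torch
from transformers import LlavaForConditionalGeneration
from peft import LoraConfig, get_peft_model

# Illustrative only: Axolotl performs the equivalent setup internally.
model = LlavaForConditionalGeneration.from_pretrained(
    "llava-hf/llava-1.5-7b-hf", torch_dtype=torch.bfloat16
)

# LoRA hyperparameters from the table above; the regex restricts the adapter
# to the text backbone so the CLIP vision tower stays frozen.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=r".*language_model.*\.(q_proj|k_proj|v_proj|o_proj|gate_proj|up_proj|down_proj)",
    task_type="CAUSAL_LM",
)

model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapter weights are trainable
```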
Model Flow
- Base: llava-hf/llava-1.5-7b-hf
- Stage 1 — Textual Continual Pretraining (CPT) → llava-7b-text-cpt
- Stage 2 — Multimodal Continual Pretraining (CPT) → ubitech-edg/llava-7b-cpt (this release)
Tokenizer & Processor
| Component | Value |
|---|---|
| Tokenizer type | AutoTokenizer |
| Processor type | AutoProcessor |
| Special tokens | `<pad>` (ID 32001) |
| Chat template | llava |
Usage
To load and run llava-7b-cpt locally for image–text generation:
```python
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "ubitech-edg/llava-7b-cpt"

# LLaVA checkpoints load through LlavaForConditionalGeneration rather than AutoModelForCausalLM.
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# LLaVA 1.5 prompt format: the <image> placeholder marks where the image features are inserted.
image = Image.open("example.jpg").convert("RGB")
prompt = "USER: <image>\nDescribe this image in two sentences.\nASSISTANT:"

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```
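Since the card ships a llava chat template, the prompt can also be built by the processor instead of by hand. This sketch reuses the `model`, `processor`, and `image` objects from the snippet above and assumes a transformers version recent enough to expose `apply_chat_template` on processors:

```python
# Build the prompt via the processor's chat template instead of writing the
# USER/ASSISTANT string manually (reuses model, processor, image from above).
conversation = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in two sentences."},
        ],
    }
]
prompt = processor.apply_chat_template(conversation, add_generation_prompt=True)

inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=128)
print(processor.decode(output[0], skip_special_tokens=True))
```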