
proxima-ocr-d.markdown-post3.0.l

proxima-ocr-d.markdown-post3.0.l is an experimental multimodal document-AI model fine-tuned on top of Qwen3-VL-8B-Instruct and optimized for high-precision OCR and structured document reconstruction. The model converts documents into Markdown, HTML-Markdown, and hybrid enriched documentation formats, embedding inline programming-language snippets and reconstructing complex layouts such as tables, forms, and mathematical content.

Key Enhancements

  • Dynamic Markdown Reconstruction: Converts complex documents to structured Markdown or HTML-Markdown while preserving layout hierarchy, formatting consistency, semantic ordering, and section alignment.

  • Inline Code and Language Embedding: Adapts Python, JavaScript, LaTeX, and shell syntax directly into reconstructed documents for technical and research documentation.

  • High-Fidelity OCR and Visual Parsing: Accurately recognizes text across structured and unstructured scanned documents, including multi-page layout reasoning.

  • Complex Layout Interpretation: Interprets tables, grids, equations, graphs, multi-column layouts, and forms without structural distortion.

  • Document Retrieval and Semantic Linking: Efficient multi-page chunking with cross-reference recognition and content traceability.

  • Multimodal Long Reasoning: Supports advanced document question answering and reasoning across long input streams such as slides and manuscripts.


👉 This is a stage-progression model and may currently contain artifacts.


Example Preview

[1] Markdown HTML — input image with Markdown previews for pages 1 and 2.

[2] JSON Nodes — input image with node previews for pages 1 and 2.

[3] YAML Nodes — input image with node previews for pages 1 and 2.
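
The output format is selected through the text prompt rather than a dedicated API. Below is a minimal sketch of prompt variants that could target the three preview formats above; the exact wording is an assumption, since the model card only documents "Convert to Markdown." as an example instruction.

# Hypothetical prompt strings for the three preview formats shown above.
# The exact phrasing is an assumption, not a documented prompt schema.
FORMAT_PROMPTS = {
    "markdown": "Convert to Markdown.",
    "html_markdown": "Convert to HTML-Markdown, preserving tables and forms.",
    "json_nodes": "Extract the document as structured JSON nodes.",
    "yaml_nodes": "Extract the document as structured YAML nodes.",
}

def build_messages(image: str, output_format: str) -> list:
    # Build a single-turn chat message: one document image plus the
    # instruction selecting the target output format.
    return [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": FORMAT_PROMPTS[output_format]},
            ],
        }
    ]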

Quick Start with Transformers

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the fine-tuned checkpoint and its matching processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "prithivMLmods/proxima-ocr-d.markdown-post3.0.l", torch_dtype="auto", device_map="auto"
)

processor = AutoProcessor.from_pretrained("prithivMLmods/proxima-ocr-d.markdown-post3.0.l")

# A single user turn: the document image plus the conversion instruction
messages = [
    {
        "role": "user",
        "content": [
            {
                "type": "image",
                "image": "https://qianwen-res.oss-cn-beijing.aliyuncs.com/Qwen-VL/assets/demo.jpeg",
            },
            {"type": "text", "text": "Convert to Markdown."},
        ],
    }
]

# Render the chat template and collect the vision inputs
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
)
inputs = inputs.to("cuda")

# Generate, then strip the prompt tokens before decoding
generated_ids = model.generate(**inputs, max_new_tokens=2048)
generated_ids_trimmed = [
    out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
]
output_text = processor.batch_decode(
    generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
)
print(output_text)
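
For multi-page documents, a straightforward approach is to run the same pipeline once per page image and join the per-page outputs. Below is a minimal sketch reusing the model and processor loaded above; the page paths and the page separator are illustrative assumptions, not part of the documented workflow.

# Convert each page image separately, then join the per-page Markdown.
# "page1.png" / "page2.png" are placeholder paths for illustration.
page_paths = ["page1.png", "page2.png"]
pages_md = []

for path in page_paths:
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": path},
                {"type": "text", "text": "Convert to Markdown."},
            ],
        }
    ]
    text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text], images=image_inputs, videos=video_inputs,
        padding=True, return_tensors="pt",
    ).to("cuda")
    generated_ids = model.generate(**inputs, max_new_tokens=2048)
    trimmed = [out[len(inp):] for inp, out in zip(inputs.input_ids, generated_ids)]
    pages_md.append(processor.batch_decode(trimmed, skip_special_tokens=True)[0])

# Horizontal rules as page breaks are a formatting assumption, not model output.
document_md = "\n\n---\n\n".join(pages_md)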

Intended Use

  • OCR to Markdown or HTML-Markdown conversion
  • Complex document reconstruction and formatting regeneration
  • Multi-page document reasoning and retrieval
  • Table extraction and structured output transformation
  • Mathematical OCR and LaTeX conversion
  • Form extraction and structured entity generation
  • Knowledge base indexing and large document QA (see the chunking sketch after this list)
  • Documentation regeneration for enterprise automation
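
For the indexing and QA use cases, the reconstructed Markdown can be split into heading-scoped chunks before embedding or storage. Below is a minimal sketch of such a chunker; heading-based splitting is an assumption on my part, not a scheme the model prescribes.

import re

def chunk_markdown(markdown: str) -> list[dict]:
    # Split reconstructed Markdown into one chunk per level-1/level-2
    # heading, keeping the heading text as the chunk title.
    chunks, title, lines = [], "preamble", []
    for line in markdown.splitlines():
        match = re.match(r"^(#{1,2})\s+(.*)", line)
        if match:
            if lines:
                chunks.append({"title": title, "text": "\n".join(lines).strip()})
            title, lines = match.group(2), []
        else:
            lines.append(line)
    if lines:
        chunks.append({"title": title, "text": "\n".join(lines).strip()})
    return chunks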

Limitations

  • Accuracy may drop on extremely damaged or poorly scanned images (see the preprocessing sketch after this list)
  • Significant GPU VRAM is required for long sequences and multi-page documents
  • Language accuracy varies for low-resource scripts
  • Complex objects such as mixed-orientation blocks may require secondary post-processing
  • May occasionally produce formatting misalignment in highly irregular layouts
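
When scan quality is the bottleneck, light image preprocessing before inference can recover some accuracy. Below is a minimal sketch using Pillow; the specific steps (grayscale, contrast normalization, upscaling) are generic OCR heuristics, not steps documented for this model.

from PIL import Image, ImageOps

def preprocess_scan(path: str, min_side: int = 1024) -> Image.Image:
    # Generic cleanup heuristics for low-quality scans. These are
    # assumptions about what helps, not documented model requirements.
    img = Image.open(path).convert("L")   # grayscale to drop color noise
    img = ImageOps.autocontrast(img)      # stretch the dynamic range
    if min(img.size) < min_side:          # upscale very small scans
        scale = min_side / min(img.size)
        img = img.resize(
            (round(img.width * scale), round(img.height * scale)),
            Image.LANCZOS,
        )
    return img.convert("RGB")             # the processor expects RGB input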

Training Details

Parameter        Value
Dataset Size     approx. 544K entries (modular combination of open-source data and synthetic document data from Gemini 3 Pro)
Architecture     Qwen3VLForConditionalGeneration
Training Time    approx. 17,040 seconds (4 h 44 m)
Precision        bfloat16
Hardware         4x H100 SXM (320 GB VRAM)
System Memory    752 GB RAM
CPU              80 vCPU
