GroundNext-7B-V0

  ๐ŸŒ Website   |   ๐Ÿ“‘ Paper   |   ๐Ÿค— Dataset   |   ๐Ÿค– Model  

Highlights

GroundNext-7B-V0 is a state-of-the-art vision-language model for GUI element grounding, developed as part of the GroundCUA project. This model features:

  • Superior grounding accuracy achieving 52.9% on ScreenSpot-Pro, 67.7% on OSWorld-G, and 60.3% on UI-Vision benchmarks
  • Exceptional cross-platform generalization with 81.1% accuracy on MMBench-GUI and 90.4% on ScreenSpot-v2 despite desktop-only training
  • Data-efficient training achieving state-of-the-art results with only 700K training examples vs 9M+ in prior work
  • Strong agentic capabilities: paired with OpenAI o3 for reasoning, the GroundNext family reaches a 50.6% overall success rate on OSWorld
  • Native tool-calling support with built-in computer use action space for mouse, keyboard, and screen interactions

Model Overview

GroundNext-7B-V0 has the following characteristics:

  • Type: Vision-Language Model for GUI Grounding
  • Base Model: Qwen2.5-VL-7B-Instruct
  • Training Approach: Two-stage (Supervised Fine-tuning + Reinforcement Learning with RLOO)
  • Number of Parameters: 7.0B
  • Training Data: 700K human-annotated desktop demonstrations from GroundCUA dataset
  • Context Length: 262,144 tokens (inherited from base model)
  • Specialization: Desktop GUI element grounding with cross-platform generalization

For more details about the training methodology, dataset, and comprehensive benchmarks, please refer to our paper, GitHub repository, and project website.

Performance

Desktop Grounding Benchmarks

| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| ScreenSpot-Pro | 29.7 | 38.1 | 52.9 |
| OSWorld-G | 42.7 | 57.1 | 67.7 |
| UI-Vision | 16.5 | 25.5 | 60.3 |
| Avg (Desktop) | 29.6 | 40.2 | 60.3 |

Cross-Platform Generalization (Desktop, Mobile & Web)

| Benchmark | Qwen2.5-VL-7B | UI-TARS-72B | GroundNext-7B-V0 |
|---|---|---|---|
| MMBench-GUI | 33.9 | 74.3 | 81.1 |
| ScreenSpot-v2 | 88.8 | 90.3 | 90.4 |
| Avg (Mobile/Web) | 61.4 | 82.3 | 85.8 |

Agentic Performance on OSWorld

When combined with OpenAI o3 for reasoning, the GroundNext family demonstrates strong end-to-end computer use capabilities:

| Model | OS | Office | Daily | Pro | Workflow | Overall |
|---|---|---|---|---|---|---|
| OpenAI o3 | 62.5 | 14.5 | 21.4 | 38.8 | 16.5 | 23.0 |
| CUA | 23.9 | 34.6 | 55.1 | 18.3 | 18.3 | 31.4 |
| OpenCUA-72B | 58.3 | 47.0 | 53.8 | 73.5 | 20.4 | 46.1 |
| UI-TARS-1.5-7B | 33.3 | 29.9 | 37.9 | 53.1 | 9.1 | 29.6 |
| JEDI-7B w/ o3 | 50.0 | 46.1 | 61.9 | 75.5 | 35.3 | 51.0 |
| GroundNext-3B w/ o3 | 62.5 | 47.0 | 55.0 | 73.5 | 36.5 | 50.6 |

Note: GroundNext-7B-V0 results with o3 integration forthcoming.

Quickstart

The code of GroundNext-7B-V0 is compatible with the latest Hugging Face transformers library and follows the Qwen2.5-VL implementation.

Note that the Qwen2.5-VL model classes used below were added in transformers 4.49.0. With transformers<4.49.0 you will encounter compatibility issues, so we recommend transformers>=4.49.0.

Installation

pip install "transformers>=4.49.0" torch torchvision accelerate  # quote the pin so the shell does not treat ">" as a redirect
pip install qwen-vl-utils  # For image processing utilities

Basic Inference

The following code snippet demonstrates how to use GroundNext-7B-V0 for GUI element grounding:

import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoTokenizer, AutoProcessor
from PIL import Image
import groundcua  # helper utilities from the GroundCUA GitHub repository
import io
from urllib.request import urlopen

model_name = "ServiceNow/GroundNext-7B-V0"

# Load model and processor
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    attn_implementation="flash_attention_2",  # requires the flash-attn package; use "sdpa" if it is not installed
    device_map="auto",
    trust_remote_code=True
).eval()

processor = AutoProcessor.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)

# Configure generation
model.generation_config.temperature = groundcua.DEFAULT_TEMPERATURE
model.generation_config.do_sample = False  # greedy decoding: temperature is ignored when sampling is off
model.generation_config.use_cache = True

# Load and prepare image
url = "https://huggingface.co/datasets/ServiceNow/GroundCUA/resolve/main/images/7-Zip/001f0079a489909eb94e47c2374b7bf36ab1842e314592ce30a34d18a54eb1df.png"
image = Image.open(io.BytesIO(urlopen(url).read()))
image, (width, height) = groundcua.prepare_image(image)

# Create messages and generate
instruction = "Click on the 'File' button"
messages = groundcua.create_messages(instruction, image, width, height)

input_text = tokenizer.apply_chat_template(messages, add_generation_prompt=True, tokenize=False)
inputs = processor(text=[input_text], images=[image], videos=None, padding=True, return_tensors="pt").to(model.device)

generated_ids = model.generate(**inputs, max_new_tokens=groundcua.DEFAULT_MAX_NEW_TOKENS)
generated_ids_trimmed = [out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)]

response = processor.batch_decode(generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False)[0]
print(response)
# Expected output: <tool_call>{"name": "computer_use", "arguments": {"action": "left_click", "coordinate": [x, y]}}</tool_call>

Deployment with vLLM

For production deployment, you can use vLLM to create OpenAI-compatible API endpoints:

vllm serve ServiceNow/GroundNext-7B-V0 --max-model-len 8192

Note: Adjust --max-model-len based on your hardware capabilities. For typical GUI grounding tasks, 8192 tokens is sufficient.
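
Once the server is up, you can query it through the OpenAI-compatible API. The following is a minimal client sketch: it assumes the server above is running on localhost:8000 and sends a local screenshot as a base64 data URL. Prompt construction (system prompt, screen dimensions) should still follow the chat template shipped with the model, so treat this as an illustration rather than the exact client used in our evaluations:

import base64
from openai import OpenAI

# vLLM exposes an OpenAI-compatible endpoint; no real API key is needed by default
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode a local screenshot as a data URL (plain http(s) image URLs also work)
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="ServiceNow/GroundNext-7B-V0",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": "Click on the 'File' button"},
        ],
    }],
    temperature=0.0,  # deterministic grounding
    max_tokens=128,   # tool calls are short
)
print(response.choices[0].message.content)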

Best Practices

To achieve optimal grounding performance, we recommend:

  1. Image Preprocessing:

    • Use high-resolution screenshots (minimum 800x600)
    • Ensure UI elements are clearly visible
    • Maintain original aspect ratios when resizing
  2. Prompt Engineering:

    • Be specific about the target element (e.g., "Click on the blue 'Submit' button in the top-right corner" or "Click on the following element: Save")
    • Include element attributes when available (color, position, text)
  3. Generation Parameters:

    • Use temperature=0.0 for deterministic grounding
    • Set max_new_tokens=128 (sufficient for tool calls)
    • Enable use_cache=True for faster inference
  4. System Prompt:

    • Always include the system prompt with actual screen dimensions
    • Replace {width} and {height} with true screenshot dimensions
    • Maintain the tool signature format for proper JSON parsing
  5. Post-processing (see the sketch after this list):

    • Parse <tool_call> tags to extract JSON
    • Validate coordinates are within screen bounds
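
Putting items 4 and 5 together, here is a minimal post-processing sketch. The tool-call format matches the expected output shown under Basic Inference; the regex, the error handling, and the parse_tool_call helper name are our own illustrative choices:

import json
import re

def parse_tool_call(response: str, width: int, height: int):
    """Extract the computer_use tool call and validate its coordinates."""
    match = re.search(r"<tool_call>(.*?)</tool_call>", response, re.DOTALL)
    if match is None:
        raise ValueError("no <tool_call> block in model output")
    call = json.loads(match.group(1))
    x, y = call["arguments"]["coordinate"]
    # Reject predictions that fall outside the screenshot bounds
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError(f"coordinate ({x}, {y}) outside {width}x{height} screen")
    return call["arguments"]["action"], (x, y)

# Reusing `response`, `width`, and `height` from the Basic Inference snippet:
action, (x, y) = parse_tool_call(response, width, height)
print(action, x, y)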

Training

GroundNext-7B-V0 was trained using a two-stage approach:

  1. Supervised Fine-tuning (SFT): Trained on 700K human-annotated desktop demonstrations from the GroundCUA dataset
  2. Reinforcement Learning (RLOO): Further optimized using reward-based learning with custom GUI grounding rewards (a simplified reward sketch follows)
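
As a rough illustration of the second stage, a grounding reward can score a sampled prediction by whether the predicted click lands inside the annotated bounding box of the target element. The sketch below is our own simplification (a binary click-in-box reward); the actual reward design is described in the paper and repository:

def grounding_reward(pred_xy, target_bbox):
    """Binary reward: 1.0 if the predicted click (x, y) lands inside the
    ground-truth element box (x1, y1, x2, y2), else 0.0.
    A simplified illustration, not the exact reward used in training."""
    x, y = pred_xy
    x1, y1, x2, y2 = target_bbox
    return 1.0 if (x1 <= x <= x2 and y1 <= y <= y2) else 0.0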

For detailed training instructions, dataset preparation, and reproduction steps, please visit our GitHub repository.

Limitations and Future Work

  • Desktop-focused: Primarily trained on desktop environments (though shows strong cross-platform generalization)
  • Action space: Currently supports mouse click action only
  • Languages: Optimized for English UI elements
  • Resolution: Performance may vary with extremely high or low resolution images

Citation

If you use GroundNext-7B-V0 in your research, please cite:

@misc{feizi2025groundingcomputeruseagents,
      title={Grounding Computer Use Agents on Human Demonstrations}, 
      author={Aarash Feizi and Shravan Nayak and Xiangru Jian and Kevin Qinghong Lin and Kaixin Li and Rabiul Awal and Xing Han Lù and Johan Obando-Ceron and Juan A. Rodriguez and Nicolas Chapados and David Vazquez and Adriana Romero-Soriano and Reihaneh Rabbany and Perouz Taslakian and Christopher Pal and Spandana Gella and Sai Rajeswar},
      year={2025},
      eprint={2511.07332},
      archivePrefix={arXiv},
      primaryClass={cs.LG},
      url={https://arxiv.org/abs/2511.07332}, 
}

License

This model is released under the Apache 2.0 License, following the base Qwen2.5-VL-7B-Instruct model. See the LICENSE for details.

Acknowledgements

We thank:

  • The Qwen team for the excellent Qwen2.5-VL foundation models
  • The open-source community for tools and frameworks that made this work possible
  • Human annotators who contributed to the GroundCUA dataset