Last updated: January 2025
This document reviews the state-of-the-art in UI element grounding—the task of locating specific UI elements in screenshots given natural language descriptions or click coordinates. Key findings:
- Best accuracy on hard benchmarks is ~62% (UI-TARS 1.5 on ScreenSpot-Pro)
- Progressive cropping improves accuracy by ~154% relative, from 18.9% to 48.1% on ScreenSpot-Pro (ScreenSeekeR technique)
- OmniParser + GPT-4o achieves 39.6% on ScreenSpot-Pro
- Small targets (icons, tiny buttons) remain the hardest challenge
Paper: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
ScreenSpot is the first realistic GUI grounding benchmark, spanning multiple platforms.
| Metric | Value |
|---|---|
| Screenshots | 600+ |
| Instructions | 1,200+ |
| Platforms | iOS, Android, macOS, Windows, Web |
| Element types | Text, widgets, icons |
| Avg target size | 2.01% of screen |
Key insight: Created by the SeeClick team to evaluate GUI grounding capabilities. Includes both text-based elements and visual widgets/icons.
Paper: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
A significantly harder benchmark focusing on professional applications with tiny UI elements.
| Metric | Value |
|---|---|
| Screenshots | 1,581 |
| Applications | 23 across 5 industries |
| Platforms | Windows, macOS, Linux |
| Avg target size | 0.07% of screen (29x smaller than ScreenSpot) |
| Text/Icon split | 62.6% / 37.4% |
Applications covered:
- Development: VSCode, PyCharm, Android Studio, VMware
- Creative: Photoshop, Premiere, Illustrator, Blender, DaVinci Resolve
- CAD/Engineering: AutoCAD, SolidWorks, Inventor, Vivado, Quartus
- Scientific: MATLAB, Stata, EViews
- Office: Word, Excel, PowerPoint
Key insight: Professional software has much smaller targets than consumer apps. Models that work on ScreenSpot often fail catastrophically on ScreenSpot-Pro.
| Benchmark | Focus | Notes |
|---|---|---|
| MiniWoB | Web automation | 125 web-based tasks |
| AITW (Android In The Wild) | Mobile automation | Real Android interactions |
| Mind2Web | Web navigation | Cross-website generalization |
| OSWorld | Desktop OS tasks | Full computer control |
| AndroidWorld | Mobile OS tasks | Android device control |
Paper: OmniParser for Pure Vision Based GUI Agent
Code: github.com/microsoft/OmniParser
A screen parsing tool that uses Set-of-Mark (SoM) prompting to overlay bounding boxes on UI screenshots.
Architecture:
- Icon detection: Fine-tuned YOLO model
- Icon captioning: Fine-tuned Florence-2 model
- Text detection: PaddleOCR / EasyOCR
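A minimal sketch of how these three stages could compose, assuming a hypothetical fine-tuned YOLO checkpoint (`icon_weights.pt`) and using EasyOCR for the text stage; the Florence-2 captioning stage is stubbed for brevity, so this is illustrative rather than the official pipeline:

```python
# Illustrative OmniParser-style parse (not the official pipeline). Assumes a
# hypothetical fine-tuned YOLO checkpoint "icon_weights.pt"; the Florence-2
# captioning stage is left as a stub for brevity.
import easyocr
from ultralytics import YOLO

icon_detector = YOLO("icon_weights.pt")  # hypothetical fine-tuned weights
ocr_reader = easyocr.Reader(["en"])

def parse_screenshot(image_path: str) -> list[dict]:
    elements = []
    # Stage 1: icon/widget detection with the fine-tuned YOLO model.
    for result in icon_detector(image_path):
        for box in result.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            # Stage 2 (Florence-2 captioning) would fill in "caption" here.
            elements.append({"type": "icon", "bbox": (x1, y1, x2, y2),
                             "caption": None})
    # Stage 3: text detection; EasyOCR yields (quad, text, confidence) tuples.
    for quad, text, conf in ocr_reader.readtext(image_path):
        xs, ys = [p[0] for p in quad], [p[1] for p in quad]
        elements.append({"type": "text",
                         "bbox": (min(xs), min(ys), max(xs), max(ys)),
                         "caption": text})
    return elements
```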
Performance:
| Benchmark | OmniParser + GPT-4o | GPT-4o alone | Improvement |
|---|---|---|---|
| ScreenSpot-Pro | 39.6% | 0.8% | +4,850% |
| SeeAssign | 93.8% | 70.5% | +33% |
V2 Improvements:
- 60% faster inference (0.6s/frame on A100)
- Better small element detection
- Cleaner training data
Limitations:
- Still struggles with very small icons
- 39.6% accuracy leaves significant room for improvement
- Requires external LLM for action selection
Paper: UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Model: ByteDance-Seed/UI-TARS-1.5-7B
A native GUI agent that directly perceives screenshots and outputs actions.
Key innovations:
- Enhanced Perception: Large-scale GUI screenshot training
- Unified Action Modeling: Standardized actions across platforms
- System-2 Reasoning: Deliberate multi-step decision making
Architecture:
- Base: Qwen2-VL (7B and 72B variants)
- Training: ~50 billion tokens of GUI data
- Sizes: 2B, 7B, 72B
Performance (UI-TARS 1.5):
| Benchmark | UI-TARS 1.5 | Claude 3.7 | GPT-4o |
|---|---|---|---|
| ScreenSpot-Pro | 61.6% | 27.7% | ~1% |
| OSWorld (100 steps) | 42.5% | 28.0% | - |
| AndroidWorld | 64.2% | - | 34.5% |
Key insight: UI-TARS 1.5 is currently SOTA on most GUI benchmarks and is fully open source (Apache 2.0).
Paper: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Code: github.com/njucckevin/SeeClick
A visual GUI agent focused on grounding pre-training.
Key contribution: Demonstrated that GUI grounding pre-training significantly improves downstream task performance.
Performance:
- MiniWoB: 73.6% (vs. 65.5% for WebGUM, 67.0% for Pix2Act)
- AITW ClickAcc: 66.4%
- Mind2Web: Nearly doubled Qwen-VL baseline
Paper: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Specialized for mobile UI understanding with "any resolution" support.
Key features:
- Handles elongated aspect ratios (mobile screens)
- Divides screen into sub-images for detail magnification
- Supports bounding boxes, scribbles, and point inputs
Performance:
- Icon recognition: 95% accuracy
- Widget classification: 90% accuracy
- Widget/icon grounding: 92-93% accuracy
Limitation: Focused on mobile; may not generalize to desktop professional apps.
Paper: CogAgent: A Visual Language Model for GUI Agents
Early visual language model for GUI agents supporting both PC and Android.
Key insight: CogAgent encodes UI structures and action semantics into a shared embedding space.
Source: ScreenSpot-Pro paper
Progressive cropping, as implemented in ScreenSeekeR, is the most effective technique reported for improving grounding accuracy on small targets.
How it works:
- Use GPT-4o to predict likely UI regions ("menu bar", "properties panel")
- Recursively crop to those regions
- Run grounding model on simplified sub-images
- Use Gaussian scoring to vote on candidate locations
- Apply non-maximum suppression and refine
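A simplified, single-pass sketch of these steps, where `llm_predict_regions` and `ground_in_crop` are hypothetical stand-ins for the GPT-4o region planner and the grounding model; the paper's actual recursive search and scoring details differ:

```python
# Simplified, single-pass sketch of ScreenSeekeR-style progressive cropping.
# llm_predict_regions() and ground_in_crop() are hypothetical stand-ins for
# the GPT-4o region planner and the grounding model.
import math

def gaussian_vote(candidates, sigma=50.0):
    # Step 4: score each candidate by the Gaussian-weighted support it gets
    # from all candidates, so crops that agree reinforce each other.
    for c in candidates:
        c["score"] = sum(
            math.exp(-math.dist((c["x"], c["y"]), (o["x"], o["y"])) ** 2
                     / (2 * sigma ** 2))
            for o in candidates
        )
    return candidates

def nms(candidates, min_dist=30.0):
    # Step 5: keep the best-scoring candidates, suppressing near-duplicates.
    kept = []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(math.dist((c["x"], c["y"]), (k["x"], k["y"])) >= min_dist
               for k in kept):
            kept.append(c)
    return kept

def progressive_ground(image, instruction):
    candidates = []
    # Steps 1-2: ask the LLM for likely regions, then crop to each one.
    for (x0, y0, x1, y1) in llm_predict_regions(image, instruction):
        crop = image.crop((x0, y0, x1, y1))
        # Step 3: ground inside the simplified sub-image and map the hit
        # back to full-image coordinates.
        for hit in ground_in_crop(crop, instruction):
            candidates.append({"x": hit["x"] + x0, "y": hit["y"] + y0})
    return nms(gaussian_vote(candidates))
```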
Results:
| Method | ScreenSpot-Pro Accuracy | Improvement |
|---|---|---|
| OS-Atlas-7B (baseline) | 18.9% | - |
| Iterative Narrowing | 31.9% | +69% |
| ReGround | 40.2% | +113% |
| ScreenSeekeR | 48.1% | +154% |
Key insight: Strategic, LLM-guided cropping massively outperforms single-pass detection. This is the technique most relevant to our "robust detection" approach.
Source: Set-of-Mark Prompting
Instead of asking models to predict coordinates, overlay numbered bounding boxes and ask for the box ID.
Used by: OmniParser, many GUI agents
Benefit: Reduces coordinate prediction errors; leverages model's ability to match descriptions to labeled regions.
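A minimal sketch of the overlay step with Pillow (box detection itself is assumed done upstream, e.g. by OmniParser):

```python
# Minimal Set-of-Mark overlay with Pillow: number each detected box, then ask
# the model for a box ID instead of raw coordinates.
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, boxes: list[tuple]) -> Image.Image:
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")
    return marked

# The grounding prompt then becomes a selection task, e.g.:
# "Which numbered box is the 'Save' button? Answer with the number only."
```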
Source: Ferret-UI, various papers
High-resolution screens require special handling:
- Split into sub-images based on aspect ratio
- Process at multiple scales
- Merge detections with NMS
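A sketch of the tile-and-merge pattern; `detect` is a hypothetical stand-in for any detector or grounding model, and tile size/overlap are arbitrary example values:

```python
# Sketch: split a high-resolution screenshot into overlapping tiles, run a
# detector per tile, and merge hits back in full-image coordinates. detect()
# is a hypothetical stand-in returning (x1, y1, x2, y2, score) boxes.
from PIL import Image

def tiled_detect(image: Image.Image, tile: int = 1024, overlap: int = 128):
    detections = []
    step = tile - overlap
    for top in range(0, image.height, step):
        for left in range(0, image.width, step):
            window = (left, top,
                      min(left + tile, image.width),
                      min(top + tile, image.height))
            for (x1, y1, x2, y2, score) in detect(image.crop(window)):
                # Offset tile-local boxes back to full-image coordinates.
                detections.append((x1 + left, y1 + top,
                                   x2 + left, y2 + top, score))
    return detections  # run NMS on these to drop duplicates from overlaps
```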
Source: YOLO data augmentation
Common augmentations for UI detection:
- Rotation (±15°)
- Saturation/exposure changes (0.5x-2x)
- Hue shifts
- Random cropping
Note: These improve training robustness but are different from test-time augmentation strategies.
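One way to express this recipe, here with Albumentations so YOLO-format boxes stay aligned with the augmented image; parameter values mirror the list above, and the crop size is an arbitrary example:

```python
# Box-aware training augmentation sketch with Albumentations; bbox_params
# keeps YOLO-format labels consistent with the transformed image.
import albumentations as A

train_augment = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                     # rotation within +/-15 deg
        A.ColorJitter(brightness=(0.5, 2.0),           # exposure 0.5x-2x
                      saturation=(0.5, 2.0),           # saturation 0.5x-2x
                      contrast=0.0, hue=0.05, p=0.5),  # mild hue shifts
        A.RandomCrop(height=640, width=640, p=0.5),    # random cropping
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```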
Our current approach, for each click location (see the sketch below):
1. Run OmniParser on original image
2. If no element found at click point:
- Try crop around click (200px, 300px, etc.)
- Try brightness adjustments
- Try grayscale
- Try contrast changes
3. Return first successful detection
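A sketch of this fallback loop, where `run_omniparser` and `element_at` are hypothetical wrappers for the parser call and the hit test at the click point:

```python
# Sketch of the click-location fallback loop; run_omniparser() and
# element_at() are hypothetical wrappers, not a real OmniParser API.
from PIL import Image, ImageEnhance, ImageOps

def detect_at_click(image: Image.Image, click_xy: tuple[int, int]):
    x, y = click_xy

    def variants():
        yield image, click_xy                                   # original frame
        for size in (200, 300, 400):                            # crops around click
            box = (max(0, x - size), max(0, y - size),
                   min(image.width, x + size), min(image.height, y + size))
            yield image.crop(box), (x - box[0], y - box[1])
        for factor in (0.7, 1.3):                               # brightness tweaks
            yield ImageEnhance.Brightness(image).enhance(factor), click_xy
        yield ImageOps.grayscale(image).convert("RGB"), click_xy  # grayscale
        for factor in (0.7, 1.3):                               # contrast tweaks
            yield ImageEnhance.Contrast(image).enhance(factor), click_xy

    for img, xy in variants():
        element = element_at(run_omniparser(img), xy)
        if element is not None:
            return element  # first successful detection wins
    return None
```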
ScreenSeekeR's approach, for each target instruction:
1. Use GPT-4o to predict likely UI regions
2. Hierarchically decompose: "menu bar" → "File menu" → "Save button"
3. Crop to predicted regions
4. Run grounding model on cropped region
5. Use Gaussian voting across candidates
6. Refine with NMS
| Aspect | Our Approach | ScreenSeekeR |
|---|---|---|
| Cropping strategy | Fixed sizes, centered on click | LLM-predicted regions |
| Transform selection | Sequential trial | Hierarchical reasoning |
| Theoretical basis | Ad-hoc | GUI hierarchy knowledge |
| Improvement | Unknown | +154% validated |
1. Replace random transforms with LLM-guided cropping
   - Use GPT-4o/Claude to predict likely regions
   - Leverage UI hierarchy (toolbar, sidebar, main content)
2. Consider switching base model
   - UI-TARS 1.5 reaches 61.6% on ScreenSpot-Pro vs OmniParser's 39.6%
   - Open source, similar resource requirements
3. Evaluate on standard benchmarks
   - ScreenSpot for general evaluation
   - ScreenSpot-Pro for professional apps
| Model/Method | ScreenSpot / other | ScreenSpot-Pro | OSWorld | Notes |
|---|---|---|---|---|
| UI-TARS 1.5-7B | - | 61.6% | 42.5% | Current SOTA, open source |
| ScreenSeekeR + OS-Atlas | - | 48.1% | - | Progressive cropping |
| OmniParser + GPT-4o | - | 39.6% | - | Our current approach |
| OS-Atlas-7B | - | 18.9% | - | Without cropping |
| Claude 3.7 | - | 27.7% | 28.0% | |
| GPT-4o | - | 0.8% | - | Without SoM |
| SeeClick | 73.6% (MiniWoB) | - | - | GUI grounding pioneer |
| Ferret-UI | 95% (icons) | - | - | Mobile-focused |
1. How much does progressive cropping help UI-TARS?
   - UI-TARS already achieves 61.6% without ScreenSeekeR
   - Could the combination push past 70%?
2. What's the ceiling for small icon detection?
   - 0.07% of screen area is roughly 38x38 pixels on a 1080p screen
   - May require specialized icon detection models
3. How do these methods perform on our specific use case?
   - Click-to-element mapping vs. instruction grounding
   - May have different characteristics
4. Cost/latency tradeoffs?
   - ScreenSeekeR requires multiple GPT-4o calls
   - UI-TARS is single-pass but requires a GPU
- Cheng et al. "SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents." ACL 2024. arXiv:2401.10935
- Li et al. "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use." 2025. arXiv:2504.07981
- Lu et al. "OmniParser for Pure Vision Based GUI Agent." 2024. arXiv:2408.00203
- ByteDance. "UI-TARS: Pioneering Automated GUI Interaction with Native Agents." 2025. arXiv:2501.12326
- You et al. "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs." ECCV 2024. arXiv:2404.05719
- Hong et al. "CogAgent: A Visual Language Model for GUI Agents." 2023. arXiv:2312.08914
- Yang et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." 2023. arXiv:2310.11441