Last updated: January 2025
This document reviews the state-of-the-art in UI element grounding—the task of locating specific UI elements in screenshots given natural language descriptions or click coordinates. Key findings:
- Best accuracy on hard benchmarks is ~62% (UI-TARS 1.5 on ScreenSpot-Pro)
- Progressive cropping improves accuracy by ~154% relative, from 18.9% to 48.1% on ScreenSpot-Pro (ScreenSeekeR technique)
- OmniParser + GPT-4o achieves 39.6% on ScreenSpot-Pro
- Small targets (icons, tiny buttons) remain the hardest challenge
Paper: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
ScreenSpot is the first realistic GUI grounding benchmark, spanning multiple platforms.
| Metric | Value |
|---|---|
| Screenshots | 600+ |
| Instructions | 1,200+ |
| Platforms | iOS, Android, macOS, Windows, Web |
| Element types | Text, widgets, icons |
| Avg target size | 2.01% of screen |
Key insight: Created by the SeeClick team to evaluate GUI grounding capabilities. Includes both text-based elements and visual widgets/icons.
Paper: ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use
A significantly harder benchmark focusing on professional applications with tiny UI elements.
| Metric | Value |
|---|---|
| Screenshots | 1,581 |
| Applications | 23 across 5 industries |
| Platforms | Windows, macOS, Linux |
| Avg target size | 0.07% of screen (29x smaller than ScreenSpot) |
| Text/Icon split | 62.6% / 37.4% |
Applications covered:
- Development: VSCode, PyCharm, Android Studio, VMware
- Creative: Photoshop, Premiere, Illustrator, Blender, DaVinci Resolve
- CAD/Engineering: AutoCAD, SolidWorks, Inventor, Vivado, Quartus
- Scientific: MATLAB, Stata, EViews
- Office: Word, Excel, PowerPoint
Key insight: Professional software has much smaller targets than consumer apps. Models that work on ScreenSpot often fail catastrophically on ScreenSpot-Pro.
| Benchmark | Focus | Notes |
|---|---|---|
| MiniWoB | Web automation | 125 web-based tasks |
| AITW (Android In The Wild) | Mobile automation | Real Android interactions |
| Mind2Web | Web navigation | Cross-website generalization |
| OSWorld | Desktop OS tasks | Full computer control |
| AndroidWorld | Mobile OS tasks | Android device control |
Paper: OmniParser for Pure Vision Based GUI Agent
Code: github.com/microsoft/OmniParser
A screen parsing tool that uses Set-of-Mark (SoM) prompting to overlay bounding boxes on UI screenshots.
Architecture:
- Icon detection: Fine-tuned YOLO model
- Icon captioning: Fine-tuned Florence-2 model
- Text detection: PaddleOCR / EasyOCR
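A minimal sketch of how these three stages could compose, assuming a hypothetical fine-tuned YOLO checkpoint (`icon_weights.pt`) and using EasyOCR for the text stage; the Florence-2 captioning stage is stubbed for brevity, so this is illustrative rather than the official pipeline:

```python
# Illustrative OmniParser-style parse (not the official pipeline). Assumes a
# hypothetical fine-tuned YOLO checkpoint "icon_weights.pt"; the Florence-2
# captioning stage is left as a stub for brevity.
import easyocr
from ultralytics import YOLO

icon_detector = YOLO("icon_weights.pt")  # hypothetical fine-tuned weights
ocr_reader = easyocr.Reader(["en"])

def parse_screenshot(image_path: str) -> list[dict]:
    elements = []
    # Stage 1: icon/widget detection with the fine-tuned YOLO model.
    for result in icon_detector(image_path):
        for box in result.boxes:
            x1, y1, x2, y2 = box.xyxy[0].tolist()
            # Stage 2 (Florence-2 captioning) would fill in "caption" here.
            elements.append({"type": "icon", "bbox": (x1, y1, x2, y2),
                             "caption": None})
    # Stage 3: text detection; EasyOCR yields (quad, text, confidence) tuples.
    for quad, text, conf in ocr_reader.readtext(image_path):
        xs, ys = [p[0] for p in quad], [p[1] for p in quad]
        elements.append({"type": "text",
                         "bbox": (min(xs), min(ys), max(xs), max(ys)),
                         "caption": text})
    return elements
```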
Performance:
| Benchmark | OmniParser + GPT-4o | GPT-4o alone | Improvement |
|---|---|---|---|
| ScreenSpot-Pro | 39.6% | 0.8% | +4,850% |
| SeeAssign | 93.8% | 70.5% | +33% |
V2 Improvements:
- 60% faster inference (0.6s/frame on A100)
- Better small element detection
- Cleaner training data
Limitations:
- Still struggles with very small icons
- 39.6% accuracy leaves significant room for improvement
- Requires external LLM for action selection
Paper: UI-TARS: Pioneering Automated GUI Interaction with Native Agents
Model: ByteDance-Seed/UI-TARS-1.5-7B
A native GUI agent that directly perceives screenshots and outputs actions.
Key innovations:
- Enhanced Perception: Large-scale GUI screenshot training
- Unified Action Modeling: Standardized actions across platforms
- System-2 Reasoning: Deliberate multi-step decision making
Architecture:
- Base: Qwen2-VL (7B and 72B variants)
- Training: ~50 billion tokens of GUI data
- Sizes: 2B, 7B, 72B
Performance (UI-TARS 1.5):
| Benchmark | UI-TARS 1.5 | Claude 3.7 | GPT-4o |
|---|---|---|---|
| ScreenSpot-Pro | 61.6% | 27.7% | ~1% |
| OSWorld (100 steps) | 42.5% | 28.0% | - |
| AndroidWorld | 64.2% | - | 34.5% |
Key insight: UI-TARS 1.5 is currently SOTA on most GUI benchmarks and is fully open source (Apache 2.0).
Paper: SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents
Code: github.com/njucckevin/SeeClick
A visual GUI agent focused on grounding pre-training.
Key contribution: Demonstrated that GUI grounding pre-training significantly improves downstream task performance.
Performance:
- MiniWoB: 73.6% (vs. 65.5% for WebGUM, 67.0% for Pix2Act)
- AITW ClickAcc: 66.4%
- Mind2Web: Nearly doubled Qwen-VL baseline
Paper: Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs
Specialized for mobile UI understanding with "any resolution" support.
Key features:
- Handles elongated aspect ratios (mobile screens)
- Divides screen into sub-images for detail magnification
- Supports bounding boxes, scribbles, and point inputs
Performance:
- Icon recognition: 95% accuracy
- Widget classification: 90% accuracy
- Widget/icon grounding: 92-93% accuracy
Limitation: Focused on mobile; may not generalize to desktop professional apps.
Paper: CogAgent: A Visual Language Model for GUI Agents
Early visual language model for GUI agents supporting both PC and Android.
Key insight: CogAgent encodes UI structures and action semantics into a shared embedding space.
Source: ScreenSpot-Pro paper
Progressive cropping, as implemented in ScreenSeekeR, is the most effective technique reported for improving grounding accuracy on small targets.
How it works:
- Use GPT-4o to predict likely UI regions ("menu bar", "properties panel")
- Recursively crop to those regions
- Run grounding model on simplified sub-images
- Use Gaussian scoring to vote on candidate locations
- Apply non-maximum suppression and refine
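A simplified, single-pass sketch of these steps, where `llm_predict_regions` and `ground_in_crop` are hypothetical stand-ins for the GPT-4o region planner and the grounding model; the paper's actual recursive search and scoring details differ:

```python
# Simplified, single-pass sketch of ScreenSeekeR-style progressive cropping.
# llm_predict_regions() and ground_in_crop() are hypothetical stand-ins for
# the GPT-4o region planner and the grounding model.
import math

def gaussian_vote(candidates, sigma=50.0):
    # Step 4: score each candidate by the Gaussian-weighted support it gets
    # from all candidates, so crops that agree reinforce each other.
    for c in candidates:
        c["score"] = sum(
            math.exp(-math.dist((c["x"], c["y"]), (o["x"], o["y"])) ** 2
                     / (2 * sigma ** 2))
            for o in candidates
        )
    return candidates

def nms(candidates, min_dist=30.0):
    # Step 5: keep the best-scoring candidates, suppressing near-duplicates.
    kept = []
    for c in sorted(candidates, key=lambda c: c["score"], reverse=True):
        if all(math.dist((c["x"], c["y"]), (k["x"], k["y"])) >= min_dist
               for k in kept):
            kept.append(c)
    return kept

def progressive_ground(image, instruction):
    candidates = []
    # Steps 1-2: ask the LLM for likely regions, then crop to each one.
    for (x0, y0, x1, y1) in llm_predict_regions(image, instruction):
        crop = image.crop((x0, y0, x1, y1))
        # Step 3: ground inside the simplified sub-image and map the hit
        # back to full-image coordinates.
        for hit in ground_in_crop(crop, instruction):
            candidates.append({"x": hit["x"] + x0, "y": hit["y"] + y0})
    return nms(gaussian_vote(candidates))
```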
Results:
| Method | ScreenSpot-Pro Accuracy | Improvement |
|---|---|---|
| OS-Atlas-7B (baseline) | 18.9% | - |
| Iterative Narrowing | 31.9% | +69% |
| ReGround | 40.2% | +113% |
| ScreenSeekeR | 48.1% | +154% |
Key insight: Strategic, LLM-guided cropping massively outperforms single-pass detection. This is the technique most relevant to our "robust detection" approach.
Source: Set-of-Mark Prompting
Instead of asking models to predict coordinates, overlay numbered bounding boxes and ask for the box ID.
Used by: OmniParser, many GUI agents
Benefit: Reduces coordinate prediction errors; leverages model's ability to match descriptions to labeled regions.
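A minimal sketch of the overlay step with Pillow (box detection itself is assumed done upstream, e.g. by OmniParser):

```python
# Minimal Set-of-Mark overlay with Pillow: number each detected box, then ask
# the model for a box ID instead of raw coordinates.
from PIL import Image, ImageDraw

def overlay_marks(image: Image.Image, boxes: list[tuple]) -> Image.Image:
    marked = image.copy()
    draw = ImageDraw.Draw(marked)
    for i, (x1, y1, x2, y2) in enumerate(boxes):
        draw.rectangle((x1, y1, x2, y2), outline="red", width=2)
        draw.text((x1 + 2, y1 + 2), str(i), fill="red")
    return marked

# The grounding prompt then becomes a selection task, e.g.:
# "Which numbered box is the 'Save' button? Answer with the number only."
```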
Source: Ferret-UI, various papers
High-resolution screens require special handling:
- Split into sub-images based on aspect ratio
- Process at multiple scales
- Merge detections with NMS
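A sketch of the tile-and-merge pattern; `detect` is a hypothetical stand-in for any detector or grounding model, and tile size/overlap are arbitrary example values:

```python
# Sketch: split a high-resolution screenshot into overlapping tiles, run a
# detector per tile, and merge hits back in full-image coordinates. detect()
# is a hypothetical stand-in returning (x1, y1, x2, y2, score) boxes.
from PIL import Image

def tiled_detect(image: Image.Image, tile: int = 1024, overlap: int = 128):
    detections = []
    step = tile - overlap
    for top in range(0, image.height, step):
        for left in range(0, image.width, step):
            window = (left, top,
                      min(left + tile, image.width),
                      min(top + tile, image.height))
            for (x1, y1, x2, y2, score) in detect(image.crop(window)):
                # Offset tile-local boxes back to full-image coordinates.
                detections.append((x1 + left, y1 + top,
                                   x2 + left, y2 + top, score))
    return detections  # run NMS on these to drop duplicates from overlaps
```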
Source: YOLO data augmentation
Common augmentations for UI detection:
- Rotation (±15°)
- Saturation/exposure changes (0.5x-2x)
- Hue shifts
- Random cropping
Note: These improve training robustness but are different from test-time augmentation strategies.
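One way to express this recipe, here with Albumentations so YOLO-format boxes stay aligned with the augmented image; parameter values mirror the list above, and the crop size is an arbitrary example:

```python
# Box-aware training augmentation sketch with Albumentations; bbox_params
# keeps YOLO-format labels consistent with the transformed image.
import albumentations as A

train_augment = A.Compose(
    [
        A.Rotate(limit=15, p=0.5),                     # rotation within +/-15 deg
        A.ColorJitter(brightness=(0.5, 2.0),           # exposure 0.5x-2x
                      saturation=(0.5, 2.0),           # saturation 0.5x-2x
                      contrast=0.0, hue=0.05, p=0.5),  # mild hue shifts
        A.RandomCrop(height=640, width=640, p=0.5),    # random cropping
    ],
    bbox_params=A.BboxParams(format="yolo", label_fields=["class_labels"]),
)
```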
Our current approach, for each click location (see the sketch below):
1. Run OmniParser on original image
2. If no element found at click point:
- Try crop around click (200px, 300px, etc.)
- Try brightness adjustments
- Try grayscale
- Try contrast changes
3. Return first successful detection
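A sketch of this fallback loop, where `run_omniparser` and `element_at` are hypothetical wrappers for the parser call and the hit test at the click point:

```python
# Sketch of the click-location fallback loop; run_omniparser() and
# element_at() are hypothetical wrappers, not a real OmniParser API.
from PIL import Image, ImageEnhance, ImageOps

def detect_at_click(image: Image.Image, click_xy: tuple[int, int]):
    x, y = click_xy

    def variants():
        yield image, click_xy                                   # original frame
        for size in (200, 300, 400):                            # crops around click
            box = (max(0, x - size), max(0, y - size),
                   min(image.width, x + size), min(image.height, y + size))
            yield image.crop(box), (x - box[0], y - box[1])
        for factor in (0.7, 1.3):                               # brightness tweaks
            yield ImageEnhance.Brightness(image).enhance(factor), click_xy
        yield ImageOps.grayscale(image).convert("RGB"), click_xy  # grayscale
        for factor in (0.7, 1.3):                               # contrast tweaks
            yield ImageEnhance.Contrast(image).enhance(factor), click_xy

    for img, xy in variants():
        element = element_at(run_omniparser(img), xy)
        if element is not None:
            return element  # first successful detection wins
    return None
```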
ScreenSeekeR's approach, for each target instruction:
1. Use GPT-4o to predict likely UI regions
2. Hierarchically decompose: "menu bar" → "File menu" → "Save button"
3. Crop to predicted regions
4. Run grounding model on cropped region
5. Use Gaussian voting across candidates
6. Refine with NMS
| Aspect | Our Approach | ScreenSeekeR |
|---|---|---|
| Cropping strategy | Fixed sizes, centered on click | LLM-predicted regions |
| Transform selection | Sequential trial | Hierarchical reasoning |
| Theoretical basis | Ad-hoc | GUI hierarchy knowledge |
| Improvement | Unknown | +154% validated |
1. Replace random transforms with LLM-guided cropping
   - Use GPT-4o/Claude to predict likely regions
   - Leverage UI hierarchy (toolbar, sidebar, main content)
2. Consider switching base model
   - UI-TARS 1.5 reaches 61.6% on ScreenSpot-Pro vs OmniParser's 39.6%
   - Open source, similar resource requirements
3. Evaluate on standard benchmarks
   - ScreenSpot for general evaluation
   - ScreenSpot-Pro for professional apps
| Model/Method | ScreenSpot / other | ScreenSpot-Pro | OSWorld | Notes |
|---|---|---|---|---|
| UI-TARS 1.5-7B | - | 61.6% | 42.5% | Current SOTA, open source |
| ScreenSeekeR + OS-Atlas | - | 48.1% | - | Progressive cropping |
| OmniParser + GPT-4o | - | 39.6% | - | Our current approach |
| OS-Atlas-7B | - | 18.9% | - | Without cropping |
| Claude 3.7 | - | 27.7% | 28.0% | |
| GPT-4o | - | 0.8% | - | Without SoM |
| SeeClick | 73.6% (MiniWoB) | - | - | GUI grounding pioneer |
| Ferret-UI | 95% (icons) | - | - | Mobile-focused |
1. How much does progressive cropping help UI-TARS?
   - UI-TARS already achieves 61.6% without ScreenSeekeR
   - Could the combination push past 70%?
2. What's the ceiling for small icon detection?
   - 0.07% of screen area is roughly 38x38 pixels on a 1080p screen
   - May require specialized icon detection models
3. How do these methods perform on our specific use case?
   - Click-to-element mapping vs. instruction grounding
   - May have different characteristics
4. Cost/latency tradeoffs?
   - ScreenSeekeR requires multiple GPT-4o calls
   - UI-TARS is single-pass but requires a GPU
- Cheng et al. "SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents." ACL 2024. arXiv:2401.10935
- Li et al. "ScreenSpot-Pro: GUI Grounding for Professional High-Resolution Computer Use." 2025. arXiv:2504.07981
- Lu et al. "OmniParser for Pure Vision Based GUI Agent." 2024. arXiv:2408.00203
- ByteDance. "UI-TARS: Pioneering Automated GUI Interaction with Native Agents." 2025. arXiv:2501.12326
- You et al. "Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs." ECCV 2024. arXiv:2404.05719
- Hong et al. "CogAgent: A Visual Language Model for GUI Agents." 2023. arXiv:2312.08914
- Yang et al. "Set-of-Mark Prompting Unleashes Extraordinary Visual Grounding in GPT-4V." 2023. arXiv:2310.11441