GUI Grounding Pre-training Data for SeeClick

This document describes the acquisition of the pre-training data used by SeeClick.

Tip: In the GUI grounding data, the position of the target element is stored under the bbox key as [left, top, right, bottom]. Each value is a decimal in the range [0, 1], giving the coordinate as a fraction of the image width (left, right) or height (top, bottom).
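As a minimal sketch of how the normalized bbox format can be mapped back to pixel coordinates, the helper below scales each value by the image size. The function name `bbox_to_pixels` and the rounding to the nearest pixel are assumptions made for illustration; they are not part of the SeeClick code.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(
    bbox: Sequence[float], width: int, height: int
) -> Tuple[int, int, int, int]:
    """Convert a normalized [left, top, right, bottom] bbox to pixel coordinates.

    Hypothetical helper: left/right are scaled by the image width and
    top/bottom by the image height, then rounded to the nearest pixel.
    """
    left, top, right, bottom = bbox
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Example: a box covering the lower-middle part of a 400x200 image.
print(bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=400, height=200))
```

In practice the width and height would come from the screenshot itself, for example from an image library after opening the file named in img_filename.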

Mobile data

The images for mobile data are part of the RICO dataset [1], which can be downloaded from here. Alternatively, we provide a packaged zip file.

Widget Captioning

The Widget Captioning data were collected by [2]. The part used for SeeClick training can be downloaded here.

Each sample contains:

img_filename: the interface screenshot file
instruction: human instruction
bbox: the bounding box of the target element corresponding to instruction
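The three fields above can be checked and used as in the following sketch. It assumes each sample is available as a Python dict (for example after parsing an annotation file); the on-disk format is not stated here, and the helper names `is_valid_sample` and `bbox_center` as well as the example values are hypothetical.

```python
REQUIRED_KEYS = ("img_filename", "instruction", "bbox")


def is_valid_sample(sample: dict) -> bool:
    """Return True if the sample has the documented keys and a well-formed bbox."""
    if any(key not in sample for key in REQUIRED_KEYS):
        return False
    bbox = sample["bbox"]
    if len(bbox) != 4:
        return False
    left, top, right, bottom = bbox
    # All values must be ratios in [0, 1], with left <= right and top <= bottom.
    in_range = all(0.0 <= value <= 1.0 for value in bbox)
    return in_range and left <= right and top <= bottom


def bbox_center(sample: dict) -> tuple:
    """Return the normalized (x, y) center of the sample's bounding box."""
    left, top, right, bottom = sample["bbox"]
    return ((left + right) / 2, (top + bottom) / 2)


# Made-up sample for illustration only.
sample = {
    "img_filename": "example.jpg",
    "instruction": "open the settings page",
    "bbox": [0.25, 0.5, 0.75, 1.0],
}

if is_valid_sample(sample):
    print(sample["instruction"], "->", bbox_center(sample))
```

The center of the box is one common choice when a single point per element is needed; whether SeeClick uses the center or the full box for a given task is not specified in this document.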

RICOSCA

RICOSCA is a dataset that was automatically labeled from the Android view hierarchy (VH) in [3]. The part used for SeeClick training can be downloaded here.

Each sample contains:

img_filename: the interface screenshot file
instruction: automatically labeled instruction
bbox: the bounding box of the target element corresponding to instruction

Screen Summarization

The Screen Summarization data were collected by [4]. The part used for SeeClick training can be downloaded here.

Each sample contains:

img_filename: the interface screenshot file
captions: a list of captions for the screenshot
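Because each screenshot carries a list of captions, one simple way to use the data is to expand every sample into one (image, caption) pair per caption. The sketch below assumes the samples are already loaded as Python dicts; the function name `expand_captions` and the example values are hypothetical.

```python
from typing import Iterable, List, Tuple


def expand_captions(samples: Iterable[dict]) -> List[Tuple[str, str]]:
    """Flatten samples into (img_filename, caption) pairs, one per caption."""
    return [
        (sample["img_filename"], caption)
        for sample in samples
        for caption in sample["captions"]
    ]


# Made-up samples for illustration only.
samples = [
    {"img_filename": "1.jpg", "captions": ["login screen", "sign-in page of an app"]},
    {"img_filename": "2.jpg", "captions": ["settings menu"]},
]

for img_filename, caption in expand_captions(samples):
    print(img_filename, caption)
```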

Web data

The web data used to train SeeClick was crawled from websites listed by Common Crawl and contains more than 270k webpage screenshots and over 3 million webpage elements. The crawled screenshots are available here (270k webpage screenshots, 130 GB); for convenience, we also provide a subset of 10,000 images. The element annotations and text are available here.

Each sample contains:

img_filename: the interface screenshot file
url: the URL of the webpage
elements: the target elements in the webpage
- instruction: automatically crawled text/instruction for the element
- bbox: the bounding box of the target element
- data_type: "text" or "hover", the two types of elements collected by SeeClick
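The nested structure above (one screenshot with many elements) can be traversed as in the following sketch, which yields one record per element and optionally filters by data_type. It assumes the annotations are loaded as a list of Python dicts; the function name `iter_elements` and the example values are hypothetical.

```python
from typing import Iterable, Iterator, Optional, Tuple


def iter_elements(
    samples: Iterable[dict], data_type: Optional[str] = None
) -> Iterator[Tuple[str, str, list]]:
    """Yield (img_filename, instruction, bbox) for each element.

    If data_type is given ("text" or "hover"), only matching elements are yielded.
    """
    for sample in samples:
        for element in sample["elements"]:
            if data_type is None or element["data_type"] == data_type:
                yield sample["img_filename"], element["instruction"], element["bbox"]


# Made-up sample for illustration only.
samples = [
    {
        "img_filename": "page_0.png",
        "url": "https://example.com",
        "elements": [
            {"instruction": "Sign in", "bbox": [0.8, 0.02, 0.95, 0.06], "data_type": "text"},
            {"instruction": "Open menu", "bbox": [0.02, 0.02, 0.06, 0.06], "data_type": "hover"},
        ],
    }
]

for record in iter_elements(samples, data_type="hover"):
    print(record)
```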

General data

We use LLaVA-Instruct-150K as general data for training SeeClick.

[1] Rico: A mobile app dataset for building data-driven design applications

[2] Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

[3] Mapping Natural Language Instructions to Mobile UI Action Sequences

[4] Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning
