GUI Grounding Pre-training Data for SeeClick

This document describes the acquisition of the pre-training data used by SeeClick.


Tip: In the GUI grounding data, the position of the target element is stored under the bbox key as [left, top, right, bottom]. Each value is a decimal in [0, 1] giving the ratio of that coordinate to the image width (left, right) or height (top, bottom).
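As a minimal sketch of the bbox convention described above, the snippet below converts a normalized box to pixel coordinates and computes its normalized center. The function names `bbox_to_pixels` and `bbox_center` are hypothetical helpers written for illustration, not part of the SeeClick code base, and the image size in the usage lines is made up.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized [left, top, right, bottom] box to pixel coordinates.

    Horizontal values are scaled by the image width, vertical values by the height.
    """
    if len(bbox) != 4 or not all(0.0 <= v <= 1.0 for v in bbox):
        raise ValueError(f"expected four values in [0, 1], got {bbox!r}")
    left, top, right, bottom = bbox
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


def bbox_center(bbox: Sequence[float]) -> Tuple[float, float]:
    """Return the normalized (x, y) center of a [left, top, right, bottom] box."""
    left, top, right, bottom = bbox
    return ((left + right) / 2, (top + bottom) / 2)


# Illustrative values only: a 1080x1920 screenshot and an arbitrary box.
print(bbox_to_pixels([0.25, 0.5, 0.75, 1.0], width=1080, height=1920))
print(bbox_center([0.25, 0.5, 0.75, 1.0]))
```

Keeping boxes normalized means the same annotation stays valid if the screenshot is resized; pixel coordinates are only needed when drawing or cropping.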


Mobile data

The images for mobile data are part of the RICO dataset [1], which can be downloaded from here. Alternatively, we provide a packaged zip file.

Widget Captioning

Widget Captioning data were collected by [2]. The part used for SeeClick training can be downloaded here.

Each sample contains:

  • img_filename: the interface screenshot file
  • instruction: human instruction
  • bbox: the bounding box of the target element corresponding to instruction
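The three fields listed above can be checked with a short sanity test before training. This is only a sketch: it assumes the annotations are a JSON list of objects with exactly these keys (the storage format is not stated here), and the sample values and the helper name `is_valid_grounding_sample` are hypothetical.

```python
import json

REQUIRED_KEYS = ("img_filename", "instruction", "bbox")


def is_valid_grounding_sample(sample: dict) -> bool:
    """Return True if the sample has the documented keys and a well-formed bbox."""
    if any(key not in sample for key in REQUIRED_KEYS):
        return False
    bbox = sample["bbox"]
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return False
    if not all(isinstance(v, (int, float)) and 0.0 <= v <= 1.0 for v in bbox):
        return False
    left, top, right, bottom = bbox
    # A usable box has left <= right and top <= bottom.
    return left <= right and top <= bottom


# Made-up annotation content for illustration; the second box is inverted on purpose.
raw = """[
  {"img_filename": "12345.jpg", "instruction": "open settings", "bbox": [0.1, 0.2, 0.3, 0.25]},
  {"img_filename": "67890.jpg", "instruction": "go back", "bbox": [0.9, 0.2, 0.3, 0.25]}
]"""
samples = json.loads(raw)
valid = [s for s in samples if is_valid_grounding_sample(s)]
print(f"{len(valid)} of {len(samples)} samples passed")
```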

RICOSCA

RICOSCA is a dataset automatically labeled from the Android view hierarchy (VH) in [3]. The part used for SeeClick training can be downloaded here.

Each sample contains:

  • img_filename: the interface screenshot file
  • instruction: automatically labeled instruction
  • bbox: the bounding box of the target element corresponding to instruction

Screen Summarization

Screen Summarization data were collected by [4]. The part used for SeeClick training can be downloaded here.

Each sample contains:

  • img_filename: the interface screenshot file
  • captions: a list of captions for the screenshot
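Because each screenshot carries a list of captions, one sample can be expanded into several (image, caption) pairs. The sketch below shows one way to do that; the helper name `caption_pairs` and the sample values are hypothetical, and whether SeeClick uses every caption or samples one per image is not stated here.

```python
from typing import Dict, Iterable, Iterator, Tuple


def caption_pairs(samples: Iterable[Dict]) -> Iterator[Tuple[str, str]]:
    """Yield one (img_filename, caption) pair per caption in each sample."""
    for sample in samples:
        for caption in sample["captions"]:
            yield sample["img_filename"], caption


# Made-up sample for illustration.
summaries = [
    {
        "img_filename": "12345.jpg",
        "captions": ["login page of a shopping app", "screen asking for email and password"],
    }
]
for filename, caption in caption_pairs(summaries):
    print(filename, "->", caption)
```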

Web data

The web data used to train SeeClick was crawled from websites provided by Common Crawl and contains more than 270k webpage screenshots and over 3 million webpage elements. The crawled screenshots are available here (270k webpage screenshots, 130 GB); for convenience, we also provide a subset of 10,000 images. The element annotations and text are available here.

Each sample contains:

  • img_filename: the interface screenshot file
  • url: the url of the webpage
  • elements: the target elements in the webpage
    • instruction: automatically crawled text/instruction for the element
    • bbox: the bounding box of the target element
    • data_type: "text" or "hover", the two types of elements collected by SeeClick
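Unlike the mobile data, a web sample nests several elements under one screenshot. The sketch below flattens such a sample into one record per element, optionally keeping a single data_type. It assumes each page is a dict with the keys listed above; the helper name `flatten_elements`, the URL, and all values are made up for illustration.

```python
from typing import Dict, List, Optional


def flatten_elements(page: Dict, data_type: Optional[str] = None) -> List[Dict]:
    """Turn one web sample into a list of per-element records.

    If data_type is given ("text" or "hover"), only matching elements are kept.
    """
    rows = []
    for element in page["elements"]:
        if data_type is not None and element["data_type"] != data_type:
            continue
        rows.append(
            {
                "img_filename": page["img_filename"],
                "url": page["url"],
                "instruction": element["instruction"],
                "bbox": element["bbox"],
                "data_type": element["data_type"],
            }
        )
    return rows


# Made-up web sample for illustration.
page = {
    "img_filename": "page_0001.png",
    "url": "https://example.com",
    "elements": [
        {"instruction": "Contact us", "bbox": [0.8, 0.02, 0.95, 0.06], "data_type": "text"},
        {"instruction": "Open menu", "bbox": [0.01, 0.02, 0.05, 0.06], "data_type": "hover"},
    ],
}
print(len(flatten_elements(page)))
print(flatten_elements(page, data_type="hover")[0]["instruction"])
```

Flattening gives the web data the same per-element shape (img_filename, instruction, bbox) as the mobile grounding samples, which can simplify mixing the two sources.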

General data

We use LLaVA-Instruct-150K as general data for training SeeClick.


[1] Rico: A mobile app dataset for building data-driven design applications

[2] Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements

[3] Mapping Natural Language Instructions to Mobile UI Action Sequences

[4] Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning