This document describes the acquisition of the pre-training data used by SeeClick.
Tips: In GUI grounding data, the position of the target element is recorded in the bbox key, represented as [left, top, right, bottom]. Each value is a decimal in [0, 1] giving the ratio of the corresponding position to the width or height of the image.
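To make the ratio convention concrete, here is a minimal sketch that maps a normalized bbox back to pixel coordinates. The function name `bbox_to_pixels` and the 1080x1920 screenshot size are illustrative assumptions, not part of the SeeClick code.

```python
from typing import Sequence, Tuple


def bbox_to_pixels(bbox: Sequence[float], width: int, height: int) -> Tuple[int, int, int, int]:
    """Convert a normalized [left, top, right, bottom] box to pixel coordinates.

    Hypothetical helper: left/right are scaled by the image width,
    top/bottom by the image height, then rounded to the nearest pixel.
    """
    if len(bbox) != 4:
        raise ValueError("bbox must have exactly four values")
    if any(not 0.0 <= value <= 1.0 for value in bbox):
        raise ValueError("bbox values must lie in [0, 1]")
    left, top, right, bottom = bbox
    return (
        round(left * width),
        round(top * height),
        round(right * width),
        round(bottom * height),
    )


# Example with an assumed 1080x1920 screenshot.
print(bbox_to_pixels([0.25, 0.5, 0.75, 1.0], 1080, 1920))  # (270, 960, 810, 1920)
```

Storing ratios rather than pixels keeps the annotation valid when a screenshot is resized before training.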
The images for mobile data are part of the RICO dataset [1], which can be downloaded from here. Alternatively, we provide a packaged zip file.
Widget Captioning data are collected by [2]. The part used for SeeClick training can be downloaded here.
Each sample contains:
img_filename: the interface screenshot file
instruction: human instruction
bbox: the bounding box of the target element corresponding to instruction
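As a quick sanity check on the three keys above, the following sketch validates one grounding sample. It assumes the annotations are stored as JSON objects; the record shown is invented for illustration, and `check_grounding_sample` is a hypothetical helper, not a SeeClick function.

```python
import json

REQUIRED_KEYS = ("img_filename", "instruction", "bbox")


def check_grounding_sample(sample: dict) -> bool:
    """Return True if the sample has the expected keys and a well-formed bbox."""
    if any(key not in sample for key in REQUIRED_KEYS):
        return False
    bbox = sample["bbox"]
    if not isinstance(bbox, (list, tuple)) or len(bbox) != 4:
        return False
    left, top, right, bottom = bbox
    # Ratios must stay in [0, 1], with left <= right and top <= bottom.
    return 0.0 <= left <= right <= 1.0 and 0.0 <= top <= bottom <= 1.0


# Hypothetical record; the file name and instruction are made up.
raw = '{"img_filename": "example.jpg", "instruction": "open settings", "bbox": [0.1, 0.2, 0.3, 0.4]}'
sample = json.loads(raw)
print(check_grounding_sample(sample))  # True
```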
RICOSCA is a dataset automatically labeled from Android view hierarchies (VH) in [3]. The part used for SeeClick training can be downloaded here.
Each sample contains:
img_filename: the interface screenshot file
instruction: automatically labeled instruction
bbox: the bounding box of the target element corresponding to instruction
Screen Summarization data are collected by [4]. The part used for SeeClick training can be downloaded here.
Each sample contains:
img_filename: the interface screenshot file
captions: a list of captions for the screenshot
The web data used to train SeeClick was crawled from websites provided by Common Crawl and contains more than 270k webpage screenshots and over 3 million webpage elements. The crawled screenshots are available here (270k webpage screenshots, 130G); for convenience, we also provide a subset of 10,000 images. The element annotations and text are available here.
Each sample contains:
img_filename: the interface screenshot file
url: the url of the webpage
elements: the target elements in the webpage
  instruction: automatically crawled text/instruction for the element
  bbox: the bounding box of the target element
  data_type: "text"/"hover", the two types of element collected by SeeClick
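The sketch below shows one way to flatten web samples into per-element training triples, optionally keeping only one data_type. It assumes that elements is a list of objects, each carrying its own instruction, bbox, and data_type; that nesting is inferred from the field descriptions rather than confirmed. The function name `iter_web_elements` and the sample record are hypothetical.

```python
from typing import Dict, Iterable, Iterator, List, Optional, Tuple


def iter_web_elements(
    samples: Iterable[Dict],
    data_type: Optional[str] = None,
) -> Iterator[Tuple[str, str, List[float]]]:
    """Yield (img_filename, instruction, bbox) for every element.

    If data_type is given ("text" or "hover"), only matching elements are kept.
    Assumes each sample holds a list of element dicts under "elements".
    """
    for sample in samples:
        for element in sample["elements"]:
            if data_type is None or element["data_type"] == data_type:
                yield sample["img_filename"], element["instruction"], element["bbox"]


# Hypothetical sample; the URL, file name, and texts are made up.
samples = [
    {
        "img_filename": "page_0.png",
        "url": "https://example.com",
        "elements": [
            {"instruction": "Sign in", "bbox": [0.8, 0.02, 0.95, 0.06], "data_type": "text"},
            {"instruction": "Open menu", "bbox": [0.02, 0.02, 0.06, 0.06], "data_type": "hover"},
        ],
    }
]

for row in iter_web_elements(samples, data_type="hover"):
    print(row)
```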
We use LLaVA-Instruct-150K as general data for training SeeClick.
[1] Rico: A mobile app dataset for building data-driven design applications
[2] Widget Captioning: Generating Natural Language Description for Mobile User Interface Elements
[3] Mapping Natural Language Instructions to Mobile UI Action Sequences
[4] Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning